How can I modify my web crawler to retrieve the SRC and HREF of the icon and the body?

Nov 11, 2019, 09:15

Hello everyone.

Sorry to bother you. I have a web scraper (a web robot) that currently downloads only text: the title (<title>), the description (<meta name="description">), and the URL.


function get_details($url) {
 
	// The array that we pass to stream_context_create() to modify our User Agent.
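	// Note: the HTTP context option is 'header' (singular); an unknown key such as 'headers' is ignored.
	$options = array('http'=>array('method'=>"GET", 'header'=>"User-Agent: chegBot/0.1\r\n"));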
	// Create the stream context.
	$context = stream_context_create($options);
	// Create a new instance of PHP's DOMDocument class.
	$doc = new DOMDocument();
	// Use file_get_contents() to download the page, pass the output of file_get_contents()
	// to PHP's DOMDocument class.
	@$doc->loadHTML(@file_get_contents($url, false, $context));
 
	// Create an array of all of the title tags.
	$title = $doc->getElementsByTagName("title");
	// There should only be one <title> on each page, so our array should have only 1 element.
	$title = $title->item(0)->nodeValue;
	// Give $description and $keywords no value initially. We do this to prevent errors.
	$description = "";
	$keywords = "";
	// Create an array of all of the pages <meta> tags. There will probably be lots of these.
	$metas = $doc->getElementsByTagName("meta");
	// Loop through all of the <meta> tags we find.
	for ($i = 0; $i < $metas->length; $i++) {
		$meta = $metas->item($i);
		// Get the description and the keywords.
		if (strtolower($meta->getAttribute("name")) == "description")
			$description = $meta->getAttribute("content");
		if (strtolower($meta->getAttribute("name")) == "keywords")
			$keywords = $meta->getAttribute("content");
 
	}
	// Return our JSON string containing the title, description, keywords and URL.
	return '{ "Title": "'.str_replace("\n", "", $title).'", "Description": "'.str_replace("\n", "", $description).'", "Keywords": "'.str_replace("\n", "", $keywords).'", "URL": "'.$url.'"},';
 
}

1 - I would like help modifying it so that it also retrieves, from the document's <body> tag, every file available there, whatever its extension (.docx, .pdf, .jpeg, .png, .svg, .mp3, .mp4, etc.; in short, any text, video, or image file found in <body>). All of these files should end up in a PHP variable: $file.

2 - Please also help me retrieve, in a variable $icon, all the href values of the icons

<link rel="icon" type="image/png" href="favicon.png" />

found in link tags whose rel attribute has the value icon.

So, to be clearer: first, I want to retrieve every link held in the href and src attributes inside the page's <body> tag, and second, the href of each <link> whose rel attribute has the value icon.

Any help would be greatly appreciated.

Thanks in advance.

Nov 11, 2019, 16:58

If you want to run that many fine-grained searches on a DOMDocument, you may as well create a DOMXPath for it.

<pre><?php

function get_details($url) {
 
  // The array that we pass to stream_context_create() to modify our User Agent.
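  // Note: the HTTP context option is 'header' (singular); an unknown key such as 'headers' is ignored.
  $options = array('http'=>array('method'=>"GET", 'header'=>"User-Agent: chegBot/0.1\r\n"));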
  // Create the stream context.
  $context = stream_context_create($options);
  // Create a new instance of PHP's DOMDocument class.
  $doc = new DOMDocument();
  // Use file_get_contents() to download the page, pass the output of file_get_contents()
  // to PHP's DOMDocument class.
  @$doc->loadHTML(@file_get_contents($url, false, $context));

  $xpath = new DOMXPath($doc);
  $title = ($i = $xpath->query('//title'))->length ? $i[0]->nodeValue : '';
  $description = ($i = $xpath->query('//meta[@name="description"]/@content'))->length ? $i[0]->nodeValue : '';
  $keywords = ($i = $xpath->query('//meta[@name="keywords"]/@content'))->length ? $i[0]->nodeValue : '';
  $icon = ($i = $xpath->query('//link[contains(@rel,"icon")]/@href'))->length ? $i[0]->nodeValue : '';
  $liens = [];
  foreach($xpath->query('//body//*[@href]/@href|//body//*[@src]/@src') as $lien) $liens[] = $lien->nodeValue;

  var_dump($icon, $liens);

  return '{ "Title": "'.str_replace("\n", "", $title).'", "Description": "'.str_replace("\n", "", $description).'", "Keywords": "'.str_replace("\n", "", $keywords).'", "URL": "'.$url.'"},';
}

var_dump(get_details('https://www.php.net/manual/fr/class.domxpath.php'));

I wrote the function in a condensed style, which you are not obliged to do. Here is an explanation of the query used for $liens:

//body => search for body at any depth (// instead of /)
//* => any tag, at any depth, as long as it is a descendant of body
[@href] => that has an href attribute
/@href => and we want that href, read through ->nodeValue (a property, not a method); alternatively, select the element itself and call ->getAttribute('href')
| => or else
//body => search for body
//* => any tag, at any depth, as long as it is a descendant of body
[@src] => that has a src attribute
/@src => and we want that src, read through ->nodeValue; alternatively, select the element itself and call ->getAttribute('src')
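To make the breakdown above concrete, here is a minimal sketch that runs the same union query against a small inline page instead of a downloaded one. The HTML sample and its file names are made up for illustration; only the XPath expression comes from the function above.

```php
<?php
// Hypothetical sample page, used only for illustration.
$html = <<<'HTML'
<html>
<head><link rel="stylesheet" href="style.css"></head>
<body>
  <a href="report.pdf">Report</a>
  <div><img src="photo.png" alt=""></div>
  <p>No link here</p>
</body>
</html>
HTML;

$doc = new DOMDocument();
// @ hides the warnings libxml emits on imperfect HTML.
@$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Variant 1: select the attribute nodes and read ->nodeValue.
$liens = [];
foreach ($xpath->query('//body//*[@href]/@href|//body//*[@src]/@src') as $attr) {
    $liens[] = $attr->nodeValue;
}

// Variant 2: select the elements and call ->getAttribute().
$srcs = [];
foreach ($xpath->query('//body//*[@src]') as $element) {
    $srcs[] = $element->getAttribute('src');
}

print_r($liens);
print_r($srcs);
```

The stylesheet in <head> is not collected because both branches of the query start from //body.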

Finally, I used contains() for the icon because it can also be written as:
<link rel="shortcut icon" href="https://www.php.net/favicon.ico">
as is the case on the php.net site; otherwise
$xpath->query('//link[@rel="icon"]/@href') would have been enough.
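One thing to keep in mind: contains(@rel,"icon") is a substring test, so it also accepts values such as "apple-touch-icon". The sketch below compares it with a token-based test (the usual XPath 1.0 concat/normalize-space idiom) and collects every matching href rather than only the first. The sample HTML and the variable names $loose and $strict are assumptions for illustration.

```php
<?php
// Hypothetical <head>, used only for illustration.
$html = <<<'HTML'
<html><head>
<link rel="shortcut icon" href="/favicon.ico">
<link rel="apple-touch-icon" href="/apple.png">
<link rel="stylesheet" href="/style.css">
</head><body></body></html>
HTML;

$doc = new DOMDocument();
@$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Substring match: any rel value containing the letters "icon".
$loose = [];
foreach ($xpath->query('//link[contains(@rel,"icon")]/@href') as $attr) {
    $loose[] = $attr->nodeValue;
}

// Token match: "icon" must be a whole space-separated word of rel.
$strict = [];
$query = '//link[contains(concat(" ", normalize-space(@rel), " "), " icon ")]/@href';
foreach ($xpath->query($query) as $attr) {
    $strict[] = $attr->nodeValue;
}

print_r($loose);
print_r($strict);
```

Which of the two you want depends on whether touch icons should count as icons for your crawler. Both tests are case-sensitive, so a page that writes rel="ICON" would be missed by either one.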

It is also possible to write your own XPath functions in PHP:
https://www.php.net/manual/fr/domxpath. ... ctions.php
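As a rough sketch of that idea, the example below registers a PHP callback with DOMXPath::registerPhpFunctions() and uses it inside the query to keep only links that point to a file with a known extension, collected in $file. The helper has_file_extension(), its extension list, and the sample HTML are all hypothetical; adapt them to your needs.

```php
<?php
// Hypothetical helper: true when the URL path ends with an allowed extension.
function has_file_extension($value) {
    $allowed = ['docx', 'pdf', 'jpeg', 'jpg', 'png', 'svg', 'mp3', 'mp4'];
    // Keep only the path part, so a query string such as ?v=2 is ignored.
    $path = (string) parse_url($value, PHP_URL_PATH);
    $ext = strtolower(pathinfo($path, PATHINFO_EXTENSION));
    return in_array($ext, $allowed, true);
}

// Hypothetical sample page, used only for illustration.
$html = <<<'HTML'
<html><body>
<a href="docs/report.pdf?v=2">Report</a>
<a href="/contact">Contact</a>
<img src="photo.PNG" alt="">
</body></html>
HTML;

$doc = new DOMDocument();
@$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// The php: prefix must be bound to this namespace before PHP callbacks can be used.
$xpath->registerNamespace('php', 'http://php.net/xpath');
$xpath->registerPhpFunctions('has_file_extension');

// functionString passes the attribute's string value to the callback.
$query = '//body//@href[php:functionString("has_file_extension", .)]'
       . '|//body//@src[php:functionString("has_file_extension", .)]';

$file = [];
foreach ($xpath->query($query) as $attr) {
    $file[] = $attr->nodeValue;
}

print_r($file);
```

The same filtering could be done in plain PHP after the query; putting it in the XPath expression simply keeps the selection logic in one place.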
Good luck.

Nov 11, 2019, 23:12

Thank you very much for the answer.
