PHPFrance help forum

Hello.

I want to use the https://github.com/marcushat/RollingCurlX library to modify my basic web scraper, which uses file_get_contents by default, so that it downloads 10,000 URLs at a time and gets through all 10,000 URLs in one day. To do this, I had to mix the two approaches by replacing

Line 19:

$options = array('http'=>array('method'=>"GET", 'headers'=>"User-Agent: chegSpider/0.1\n"))

WITH LINE 14:

$RollingCurlX->setHeaders(array('http'=>array('method'=>"GET", 'headers'=>"User-Agent: chegSpider/0.1\n")));

Line 25:

@$doc->loadHTML(@file_get_contents($url, false, $context));

WITH LINE 26:

@$doc->loadHTML(@$RollingCurlX->execute());

IN THE MODIFIED CODE BELOW:

<?php

require_once 'rollingcurlx.class.php';

function get_details($url) {
	
    $post_data = null;
    //$user_data = null;
    $options = array(CURLOPT_SSL_VERIFYPEER => FALSE, CURLOPT_SSL_VERIFYHOST => FALSE);
	// $headers = array('http'=>array('method'=>"GET", 'headers'=>"User-Agent: chegSpider/0.1\n"));
    $RollingCurlX = new RollingCurlX(10000);
    $RollingCurlX->setOptions($options);
    $RollingCurlX->setTimeout(86400000); // 86,400,000 milliseconds => 1 day
    $RollingCurlX->setHeaders(array('http'=>array('method'=>"GET", 'headers'=>"User-Agent: chegSpider/0.1\n")));
	
	$RollingCurlX->addRequest($url, $post_data);
/*
	// The array that we pass to stream_context_create() to modify our User Agent.
	$options = array('http'=>array('method'=>"GET", 'headers'=>"User-Agent: chegSpider/0.1\n"));
	// Create the stream context.
	$context = stream_context_create($options);
*/
	// Create a new instance of PHP's DOMDocument class.
	$doc = new DOMDocument();
	// @$doc->loadHTML(@file_get_contents($url, false, $context));
	@$doc->loadHTML(@$RollingCurlX->execute());
	$pageDownloadedHtml = @$doc->saveHTML();
	// fread($pageDownloadedHtml);

	// Get the lang attribute of the <html> tag.
	// getElementsByTagName() returns a DOMNodeList, so take the first element.
	$langPage = $doc->getElementsByTagName("html");
	$lang = $langPage->item(0)->getAttribute("lang");

	// Get the list of <title> tags.
	$title = $doc->getElementsByTagName("title");
	// There should only be one <title> on each page, so the list should have only 1 element.
	$title = $title->item(0)->nodeValue;
	// Give $description and $keywords no value initially. We do this to prevent errors.
	$description = "";
	$keywords = "";
	// Get the list of all of the page's <meta> tags. There will probably be lots of these.
	$metas = $doc->getElementsByTagName("meta");
	// Loop through all of the <meta> tags we find.
	for ($i = 0; $i < $metas->length; $i++) {
		$meta = $metas->item($i);
		// Get the keywords.
		if (strtolower($meta->getAttribute("name")) == "keywords")
			$keywords = $meta->getAttribute("content");
	}

}
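One possible reason the code above never sees any HTML: as far as I can tell from the RollingCurlX README, execute() does not return the page body. Each response seems to be handed to a callback registered through addRequest(), and setHeaders() seems to expect a flat list of cURL-style header strings rather than a stream_context_create() array. The sketch below is written under those assumptions; the callback signature is the one I recall from the README and should be checked against the installed version. PageStore, its collect() method and the placeholder URLs are hypothetical names, and the library is only loaded if rollingcurlx.class.php sits next to the script.

```php
<?php
// Hypothetical holder for downloaded pages, keyed by URL.
final class PageStore {
    public static $pages = array();

    // Callback in the shape the RollingCurlX README appears to describe (assumption):
    // ($response, $url, $request_info, $user_data, $time).
    public static function collect($response, $url, $request_info, $user_data, $time) {
        // Treat an empty body as a failed download and skip it.
        if (is_string($response) && $response !== '') {
            self::$pages[$url] = $response;
        }
    }
}

// Placeholder URLs; the real list would come from the scraper's queue.
$urls = array('https://example.com/', 'https://example.org/');

// Only run the download part when the library file is actually available.
if (is_file(__DIR__ . '/rollingcurlx.class.php')) {
    require_once __DIR__ . '/rollingcurlx.class.php';

    $rcx = new RollingCurlX(50);   // 50 simultaneous connections, an arbitrary starting point
    $rcx->setTimeout(30000);       // per-request timeout in milliseconds
    $rcx->setHeaders(array('User-Agent: chegSpider/0.1')); // plain header strings

    foreach ($urls as $url) {
        // null post data means a GET request; the body arrives in PageStore::collect().
        $rcx->addRequest($url, null, array('PageStore', 'collect'));
    }

    $rcx->execute(); // blocks until every queued request has finished
}
```

With this shape, DOMDocument parsing would happen on the strings stored in PageStore::$pages (or directly inside the callback) instead of on the return value of execute().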

Please help me improve my code, which I know is currently very messy, so that it can:

1 - Download all 10,000 URLs at once, and within one day, using the https://github.com/marcushat/RollingCurlX library.

2 - Work together with DOMDocument to retrieve the language, the title and the keywords of each downloaded URL, which is what I tried to do in lines 30 to 49.
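For the second point, the extraction step can be kept separate from the download step so that it works on any HTML string, whatever fetched it. The following is a minimal sketch under that assumption; extract_page_details() is a hypothetical helper name, and the sample markup exists only to exercise it.

```php
<?php
// Hypothetical helper: pull the lang attribute, the title and the keywords
// out of an HTML string. Missing values are returned as empty strings.
function extract_page_details($html) {
    $details = array('lang' => '', 'title' => '', 'keywords' => '');
    if (!is_string($html) || trim($html) === '') {
        return $details; // nothing to parse (e.g. a failed download)
    }

    // Collect parser warnings internally instead of silencing them with @.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // getElementsByTagName() returns a DOMNodeList, so take item(0) first.
    $htmlNode = $doc->getElementsByTagName('html')->item(0);
    if ($htmlNode !== null) {
        $details['lang'] = $htmlNode->getAttribute('lang');
    }

    $titleNode = $doc->getElementsByTagName('title')->item(0);
    if ($titleNode !== null) {
        $details['title'] = trim($titleNode->nodeValue);
    }

    // The name attribute is compared case-insensitively ("Keywords", "KEYWORDS", ...).
    foreach ($doc->getElementsByTagName('meta') as $meta) {
        if (strtolower($meta->getAttribute('name')) === 'keywords') {
            $details['keywords'] = $meta->getAttribute('content');
        }
    }

    return $details;
}

// Sample markup used only for illustration.
$sample = '<!DOCTYPE html><html lang="fr"><head><title> Accueil </title>'
        . '<meta name="Keywords" content="php, curl, dom"></head><body></body></html>';
$details = extract_page_details($sample);
print_r($details);
```

Returning an array with fixed keys means a page without a title or keywords produces empty strings rather than a fatal error on a null node.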

Thanks in advance.
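On the "10,000 at once" requirement, a rough back-of-the-envelope check suggests that far fewer simultaneous connections may be enough for a one-day budget. The helper below is hypothetical, and the 5-second average per request is an assumed figure, not a measurement.

```php
<?php
// Hypothetical helper: smallest number of parallel connections needed to fetch
// $urlCount pages within $budgetSeconds, given an assumed average request time.
function required_concurrency($urlCount, $avgSecondsPerRequest, $budgetSeconds) {
    if ($urlCount <= 0) {
        return 0;
    }
    // Total sequential work in seconds, spread over the available time.
    $totalWork = $urlCount * $avgSecondsPerRequest;
    return max(1, (int) ceil($totalWork / $budgetSeconds));
}

$oneDay = 24 * 60 * 60; // 86,400 seconds, i.e. 86,400,000 milliseconds

echo required_concurrency(10000, 5.0, $oneDay), "\n"; // whole day as the budget: prints 1
echo required_concurrency(10000, 5.0, 600), "\n";     // ten-minute budget: prints 84
```

Under these assumed numbers even a sequential crawl fits in a day, so the concurrency limit can be chosen according to what the target servers and the local machine tolerate rather than set to the total number of URLs.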

Help me fix (review) and improve my code so that it combines the RollingCurlX library and DOMDocument