Scrape information from page (where info is the result of a javascript command) *$20

Discussion in 'PHP' started by x11joex11, Jan 5, 2008.

  1. #1
    Hey there. Scraping is usually really easy these days using regular expressions and cURL with PHP. However, I recently ran into a problem on a client's project: I couldn't grab some values because they are generated via JavaScript.

    I figured, okay, I'll try DOM, but because the DOM functions don't let you set request headers, I couldn't get the info from the following site; the DOM loader didn't know how to handle the page request correctly. I'll post how I got cURL to return proper results at the bottom of this post so you can experiment and save time.

    http://equestrian.en.alibaba.com/trustpass_profile.html

    That page is an example of a company profile. I've recorded all the other information fine, but take a look at the source code for that page and look for 'Selling Leads (171)' and 'Products (130)'. I'm trying to capture those numbers. You will notice that the numbers are generated by JavaScript, and the source shows the JavaScript instead of the numbers =(. If you can find a way to do it or point me in the right direction, I don't mind paying you for your help (by PayPal preferably).

    Best,
    - Joe

    Code to help you connect to their pages, as promised, is below. It works by making the site think you are Googlebot.

    function getResultFromURL($url)
    {
    	// Send browser-like headers with a Googlebot user agent so the site serves the full page
    	$curl = curl_init();
    
    	$header = array(); // initialize explicitly before appending
    	$header[] = "Accept: text/xml,application/xml,application/xhtml+xml,"
    		. "text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5";
    	$header[] = "Cache-Control: max-age=0";
    	$header[] = "Connection: keep-alive";
    	$header[] = "Keep-Alive: 300";
    	$header[] = "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7";
    	$header[] = "Accept-Language: en-us,en;q=0.5";
    	$header[] = "Pragma: "; // browsers keep this blank.
    	
    	curl_setopt($curl, CURLOPT_URL, $url);
    	curl_setopt($curl, CURLOPT_USERAGENT, 'Googlebot/2.1 (+http://www.google.com/bot.html)');
    	curl_setopt($curl, CURLOPT_HTTPHEADER, $header);
    	curl_setopt($curl, CURLOPT_REFERER, 'http://www.google.com');
    	curl_setopt($curl, CURLOPT_ENCODING, 'gzip,deflate');
    	curl_setopt($curl, CURLOPT_AUTOREFERER, true);
    	curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
    	curl_setopt($curl, CURLOPT_TIMEOUT, 10);
    	
    	$html = curl_exec($curl); // execute the curl command
    	curl_close($curl); // close the connection
    	
    	return $html; // and finally, return $html
    }
    PHP:
     
    x11joex11, Jan 5, 2008 IP
  2. Barti1987

    #2
    You don't need to do anything. The information is right on the page:

    
    var sellLeadsCount = ""+171;
    var productsCount = ""+28;
    
    Code (markup):
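    Since the counts sit in the page source as plain JavaScript assignments, a regular expression can pull them out. Below is a minimal sketch, assuming the assignments look exactly like the two lines quoted above. The helper name extractJsCount is hypothetical (it is not from this thread), and the sample string stands in for the HTML that getResultFromURL() would return.

```php
<?php
// Hypothetical helper (an assumption, not part of the original posts):
// finds a JavaScript assignment of the form  var name = ""+123;
// and returns the number as an integer, or null if it is not present.
function extractJsCount($html, $varName)
{
    $pattern = '/var\s+' . preg_quote($varName, '/') . '\s*=\s*""\s*\+\s*(\d+)\s*;/';
    if (preg_match($pattern, $html, $matches)) {
        return (int) $matches[1];
    }
    return null; // variable not found in the page source
}

// Sample input copied from the snippet above; a real call would pass
// the string returned by getResultFromURL() instead.
$html = 'var sellLeadsCount = ""+171;' . "\n" . 'var productsCount = ""+28;';

$sellLeads = extractJsCount($html, 'sellLeadsCount');
$products  = extractJsCount($html, 'productsCount');
```

    If the site changes how it writes these variables, the pattern would need adjusting; checking for null before using the result covers that case.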
    Peace,
     
    Barti1987, Jan 5, 2008 IP
  3. x11joex11

    #3
    Interesting, didn't see that =P, thanks.
     
    x11joex11, Jan 5, 2008 IP