Getting a vBulletin captcha image with cURL

Discussion in 'PHP' started by xenon2010, Apr 27, 2010.

  1. #1
    Hi,
    I need to download vBulletin captcha images to my hard drive using cURL and PHP. I'm collecting sample captcha images from several vBulletin boards for a research project. Here is what I have done with cURL so far:
    1- Download the register.php page.
    2- Parse the downloaded page to get the captcha image URL.
    3- Download that image.

    Steps 1 and 2 work correctly, but when I try to download the captcha image I don't get the captcha. I get either a very tiny blank GIF or a PNG with the word "vBulletin" on it. I really don't know what I'm doing wrong. If I output the HTML and send it to the browser, the image shows correctly, but that's not what I want: I want to download the image and save it to my hard drive.

    Here is the code I've been working on:

    // Fetch a page with cURL
    function get_content($url)
    {
        $theString = parse_url($url);
        $cookieName = $theString['host'];

        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, $url . "register.php");
        curl_setopt($ch, CURLOPT_REFERER, $url . "register.php");
        curl_setopt($ch, CURLOPT_HEADER, 0);
        curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; MSIE 7.0; Windows NT 6.0; en-US)');
        curl_setopt($ch, CURLOPT_COOKIEJAR, "cookies/cookie.txt");  // save cookies
        curl_setopt($ch, CURLOPT_COOKIEFILE, "cookies/cookie.txt"); // read saved cookies
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);

        $string = curl_exec($ch);
        //print_r(curl_getinfo($ch));
        curl_close($ch);
        return $string;
    }

    // vBulletin main page
    $url = 'http://blavbulletin.com/';

    // Get the page
    $results = get_content($url);

    if (preg_match_all('/<img[^>]*id\=\"imagereg\"[^>]*src\=\"([^\"]*)\"[^>]*>/s', $results, $captchaimages))
    {
        $captcha = $captchaimages[1][0];

        echo "<img src='$url" . "$captcha'>"; // when echoed, the picture shows correctly

        // Now get the picture
        $file = get_content("$url" . "$captcha");

        // Save the picture to disk
        file_put_contents("captcha.jpg", $file);
    }
    
    PHP:
    any help would be appreciated..
    regards,
     
    xenon2010, Apr 27, 2010 IP
  2. bytes

    #2
    Here is the strategy I'd suggest when working with cURL:
    1. Install the Tamper Data add-on for Firefox, or any proxy server: anything that lets you see the headers and the raw request/response.
    2. Launch Tamper Data and visit the page you're interested in with the browser.
    3. In your case, find the request line for the captcha image and select it.
    4. In the request area (bottom left in Tamper Data) you will see all the request headers.
    5. Copy each of them (right-click and select Copy) and paste it into your source code as:

    
    $headers = array(
        'User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; ru; rv:1.9.2.3) Gecko/20100401 MRA 5.6 (build 03278) Firefox/3.6.3 (.NET CLR 3.5.30729)',
        'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language: ru,en-us;q=0.7,en;q=0.3',
        'Accept-Charset: windows-1251,utf-8;q=0.7,*;q=0.7',
        'Content-Type: application/x-www-form-urlencoded; charset=UTF-8',
        'X-Requested-With: XMLHttpRequest',
        'Pragma: no-cache',
        'Cache-Control: no-cache'
    );
    
    PHP:
    Be sure to replace the "=" that Tamper Data copies (Accept-Charset=....) with ":" (Accept-Charset: ....).
    Each row should be one element of the $headers array.
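
    As a small illustration of the "=" to ":" fix, here is a sketch that converts rows copied in Name=Value form into the strings CURLOPT_HTTPHEADER expects. The function name tamper_lines_to_headers() is made up for this example, and it assumes every row really is in Name=Value form, so check it against what your clipboard actually contains.

```php
<?php
// Hypothetical helper: turn "Name=Value" rows into "Name: Value" header strings.
// Assumes every row is in Name=Value form (an assumption about the copied text).
function tamper_lines_to_headers(array $lines)
{
    $headers = array();
    foreach ($lines as $line) {
        $line = trim($line);
        if ($line === '') {
            continue; // skip blank rows
        }
        // Split on the first "=" only; values such as "q=0.7" contain "=" too.
        $parts = explode('=', $line, 2);
        if (count($parts) !== 2) {
            continue; // not a Name=Value row
        }
        $headers[] = trim($parts[0]) . ': ' . trim($parts[1]);
    }
    return $headers;
}

// Two rows in the copied form, taken from the header list above.
$copied = array(
    'Accept-Language=ru,en-us;q=0.7,en;q=0.3',
    'Accept-Charset=windows-1251,utf-8;q=0.7,*;q=0.7',
);
$headers = tamper_lines_to_headers($copied);
print_r($headers);
```

    Splitting on the first "=" only matters here, because header values such as q=0.7 contain "=" themselves.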

    Add $headers to your cURL request with:
    
    curl_setopt($ch,CURLOPT_HTTPHEADER,$headers);
    
    PHP:
    So you should end up with something like:

    
    $headers = array(
        'User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; ru; rv:1.9.2.3) Gecko/20100401 MRA 5.6 (build 03278) Firefox/3.6.3 (.NET CLR 3.5.30729)',
        'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language: ru,en-us;q=0.7,en;q=0.3',
        'Accept-Charset: windows-1251,utf-8;q=0.7,*;q=0.7',
        'Content-Type: application/x-www-form-urlencoded; charset=UTF-8',
        'X-Requested-With: XMLHttpRequest',
        'Pragma: no-cache',
        'Cache-Control: no-cache'
    );

    $ch = curl_init("http://.....");

    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);

    $res = curl_exec($ch);
    
    PHP:
    The idea is that, if you copy the URL and headers correctly, the server won't be able to tell your request apart from the same request made by the browser.

    As a side note, I'd advise using Zend_Dom_Query from Zend Framework rather than PCRE for this: it handles even invalid XHTML, and it is much handier to write $dom->query('.captcha p') than these complex preg_match() calls.
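
    For anyone without Zend Framework at hand, roughly the same idea can be sketched with PHP's bundled DOM extension (DOMDocument and DOMXPath). This is a substitute for Zend_Dom_Query, not its API. The function name find_captcha_src() and the sample markup are assumptions for illustration; only the imagereg id comes from the regex earlier in the thread.

```php
<?php
// Hypothetical helper: return the src of <img id="imagereg">, or null if absent.
function find_captcha_src($html)
{
    if (!is_string($html) || trim($html) === '') {
        return null;
    }
    // Real pages are rarely valid markup; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $loaded = $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    if (!$loaded) {
        return null;
    }

    $xpath = new DOMXPath($dom);
    $nodes = $xpath->query('//img[@id="imagereg"]/@src');
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    // The attribute value comes back with HTML entities already decoded.
    return $nodes->item(0)->nodeValue;
}

// Made-up sample markup standing in for a downloaded register.php page.
$sample = '<html><body><p><img id="imagereg" src="image.php?type=hv&amp;hash=abc123" alt=""></p></body></html>';
$src = find_captcha_src($sample);
echo $src, "\n";
```

    Because the value is read from a parsed attribute rather than cut out of the raw source, entities in the URL do not need separate handling.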
     
    bytes, Apr 27, 2010 IP
  3. xenon2010

    #3
    Actually, I'm using the Live HTTP Headers extension.
    When I load the page, I get these headers for the captcha image:

    $header = array(
        "Host: blabla.com",
        "Accept: image/png,image/*;q=0.8,*/*;q=0.5",
        "Accept-Language: en-us,en;q=0.5",
        "Accept-Encoding: gzip,deflate",
        "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7",
        "Keep-Alive: 115",
        "Connection: keep-alive",
        "Referer: http://blabla.com/register.php",
        "User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.3) Gecko/20100401 Firefox/3.6.3 (.NET CLR 3.5.30729)"
    );

    There is also a Cookie header, which I removed because I'm handling cookies through the cURL options.
    Anyway, I ran the script and all I get is a JPG image with the word "vBulletin" on it.
    Any ideas?
     
    xenon2010, Apr 27, 2010 IP
  4. bytes

    #4
    Do you mean you're using:
    
        curl_setopt($ch, CURLOPT_COOKIEJAR, "cookies/cookie.txt"); //saved cookies
        curl_setopt($ch, CURLOPT_COOKIEFILE, "cookies/cookie.txt"); //saved cookies
    
    PHP:
    What is in the file? Most likely the session ID is stored in the cookie you removed, and vBulletin could check for a session flag. I don't know whether it does, but it can, so you should try sending the cookie with your own session ID rather than one taken from a sample. Try removing these two options (CURLOPT_COOKIEJAR and CURLOPT_COOKIEFILE) and adding only the Cookie: header from Live HTTP Headers. If you still don't see the captcha after this, could you PM me the URL?
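
    To try the Cookie-only variant, cURL can send a fixed cookie string through CURLOPT_COOKIE instead of reading a jar file. A minimal sketch follows; build_cookie_string() and fetch_with_cookie() are hypothetical names, and the cookie names and values are placeholders to be replaced with whatever Live HTTP Headers shows for your own session.

```php
<?php
// Hypothetical helper: join name => value pairs into a Cookie header value.
function build_cookie_string(array $cookies)
{
    $pairs = array();
    foreach ($cookies as $name => $value) {
        $pairs[] = $name . '=' . $value;
    }
    return implode('; ', $pairs);
}

// Hypothetical helper: fetch $url sending only the given cookie string (no jar file).
function fetch_with_cookie($url, $cookieString)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    curl_setopt($ch, CURLOPT_COOKIE, $cookieString); // sent as a "Cookie:" request header
    return curl_exec($ch); // response body, or false on failure
}

// Placeholder values; copy the real ones from the browser request.
$cookieString = build_cookie_string(array(
    'session_cookie_name' => 'value-from-browser',
    'other_cookie'        => '123',
));
echo $cookieString, "\n";

// Not executed here, since it needs a real board URL:
// $image = fetch_with_cookie($captchaUrl, $cookieString);
```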
     
    bytes, Apr 27, 2010 IP
  5. xenon2010

    #5
    OK, let me show you my two functions:
    function get_content($url)
    {
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_REFERER, $url);
        curl_setopt($ch, CURLOPT_HEADER, 0);
        curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; MSIE 7.0; Windows NT 6.0; en-US)');
        curl_setopt($ch, CURLOPT_COOKIEJAR, "cookies/cookie.txt");  // save cookies
        curl_setopt($ch, CURLOPT_COOKIEFILE, "cookies/cookie.txt"); // read saved cookies
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);

        $string = curl_exec($ch);
        curl_close($ch);
        return $string;
    }
    PHP:
    This function is used to get the register page.
    As you can see, I added a CURLOPT_COOKIEJAR file to save the cookies. The cookie file contains:
    bb_lastvisit
    bb_lastactivity
    bb_sessionhash
    Those values are then used by the second call, made through my second function:

    function get_image($url, $site)
    {
        $header = array(
            "Host: $site",
            "User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.3) Gecko/20100401 Firefox/3.6.3 (.NET CLR 3.5.30729)",
            "Accept-Language: en-us,en;q=0.5",
            "Accept-Encoding: gzip,deflate",
            "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7",
            "Keep-Alive: 115",
            "Connection: keep-alive",
            "Referer: $site" . "register.php"
        );

        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_HTTPHEADER, $header);
        curl_setopt($ch, CURLOPT_HEADER, 0);
        curl_setopt($ch, CURLOPT_COOKIEFILE, "cookies/cookie.txt"); // read saved cookies
        curl_setopt($ch, CURLOPT_VERBOSE, 1);
        curl_setopt($ch, CURLOPT_STDERR, $sd); // output details ($sd must be an open file handle; it is not defined in this function)
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);

        $string = curl_exec($ch);
        curl_close($ch);
        return $string;
    }
    PHP:
    This function should get the image.
    It works, but the image I get is not the captcha image.
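
    Since the script saves whatever comes back as captcha.jpg, one quick sanity check is to look at the first bytes of the response and see what kind of file it really is. The blank GIF and the PNG mentioned earlier would show up immediately, as would an HTML error page. This is only a sketch: sniff_image_type() is a made-up name, and it recognises just three common signatures.

```php
<?php
// Hypothetical helper: guess the image type from the file signature ("magic bytes").
// Returns 'gif', 'png', 'jpg', or null when the data is none of those.
function sniff_image_type($bytes)
{
    if (!is_string($bytes)) {
        return null; // e.g. curl_exec() returned false
    }
    if (strncmp($bytes, "GIF87a", 6) === 0 || strncmp($bytes, "GIF89a", 6) === 0) {
        return 'gif';
    }
    if (strncmp($bytes, "\x89PNG\r\n\x1a\n", 8) === 0) {
        return 'png';
    }
    if (strncmp($bytes, "\xFF\xD8\xFF", 3) === 0) {
        return 'jpg';
    }
    return null;
}

// Example with in-memory data instead of a real download.
$fakeGif  = "GIF89a" . str_repeat("\0", 10);
$fakeHtml = "<html><body>Error</body></html>";

var_dump(sniff_image_type($fakeGif));
var_dump(sniff_image_type($fakeHtml));

// In the script it could be used roughly like this:
// $type = sniff_image_type($file);
// if ($type !== null) { file_put_contents("captcha." . $type, $file); }
```

    One related caveat: with a hand-written Accept-Encoding: gzip,deflate header, the server may send a compressed body, and cURL only decompresses it automatically when CURLOPT_ENCODING is set. A compressed response would fail the check above.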
     
    xenon2010, Apr 27, 2010 IP
  6. bytes

    #6
    OK, can you install the Charles proxy (shareware) and make another request from your browser, then add
    
    curl_setopt($ch, CURLOPT_PROXY,'127.0.0.1:8888');
    
    PHP:
    and make another request from your script? Then look in Charles and compare the two requests, both for the main content and for the image. Their headers should be exactly the same. If there is a difference, try fixing it with cURL (headers, cookies, whatever), or post the difference here.
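
    Once both requests are captured, the comparison can also be done in code. The sketch below diffs two header lists by name; index_headers() and diff_headers() are hypothetical helpers, and the two sample lists are invented for the example. On the script side, PHP's cURL extension can report what it actually sent if CURLINFO_HEADER_OUT is enabled before the request and read back with curl_getinfo() afterwards.

```php
<?php
// Hypothetical helper: index "Name: value" strings by lower-cased header name.
function index_headers(array $lines)
{
    $map = array();
    foreach ($lines as $line) {
        // Split on the first ":" only; values such as URLs contain ":" too.
        $parts = explode(':', $line, 2);
        if (count($parts) === 2) {
            $map[strtolower(trim($parts[0]))] = trim($parts[1]);
        }
    }
    return $map;
}

// Hypothetical helper: report headers that differ between browser and script.
function diff_headers(array $browser, array $script)
{
    $b = index_headers($browser);
    $s = index_headers($script);
    $report = array();
    foreach ($b as $name => $value) {
        if (!array_key_exists($name, $s)) {
            $report[$name] = 'missing in script';
        } elseif ($s[$name] !== $value) {
            $report[$name] = 'differs';
        }
    }
    foreach ($s as $name => $value) {
        if (!array_key_exists($name, $b)) {
            $report[$name] = 'only in script';
        }
    }
    return $report;
}

// Invented sample data.
$browser = array(
    'Accept: image/png,image/*;q=0.8,*/*;q=0.5',
    'Referer: http://example.com/register.php',
    'Cookie: a=1',
);
$script = array(
    'Accept: */*',
    'Referer: http://example.com/register.php',
);
$report = diff_headers($browser, $script);
print_r($report);
```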
     
    bytes, Apr 27, 2010 IP
  7. xenon2010

    #7
    Holy shit, I found the solution, and it was a really silly thing.
    For starters, all of the headers were correct, and my approach was correct from the beginning. The only thing I hadn't noticed was the amp; part inside the image URL: in the page source, each & of the URL is written as &amp;. So the regex was capturing a URL that still contained amp;:

    if (preg_match_all('/<img[^>]*id\=\"imagereg\"[^>]*src\=\"([^\"]*)\"[^>]*>/s', $data, $captchaimages))
    {
        $captcha = $captchaimages[1][0]; // this returns the image URL with amp; in it
        echo "<img src='$fixedUrl/$captcha'>";
        $captcha = str_replace("amp;", "", $captcha); // remove amp; from the string

        // voilà, now it works like a charm
        $file = get_image("$fixedUrl/$captcha");
        file_put_contents("captcha.jpg", "$file");
    }

    All I did was remove the amp; from the captcha URL using str_replace().
    Damn, I spent two days on this silly amp; without noticing that it was causing the problem.
    Anyway, thanks for the help :)
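
    A slightly more general way to do the same clean-up is html_entity_decode(), which turns &amp; back into & along with any other entities that might appear in the attribute. The sketch below assumes a URL shaped like the one described; the query string and hash value are made up, and captcha_url() is a hypothetical helper name.

```php
<?php
// Hypothetical helper: decode entities in a src attribute and join it to the base URL.
function captcha_url($baseUrl, $rawSrc)
{
    // "&amp;" in the HTML source stands for a plain "&" in the real URL.
    $src = html_entity_decode($rawSrc, ENT_QUOTES, 'UTF-8');
    // Avoid a doubled or missing slash between the two parts.
    return rtrim($baseUrl, '/') . '/' . ltrim($src, '/');
}

// Made-up example value, as a regex would capture it from the page source.
$rawSrc = 'image.php?type=hv&amp;hash=abc123';
$url = captcha_url('http://example.com/', $rawSrc);
echo $url, "\n";
```

    Unlike a plain str_replace() on "amp;", this only touches real entities, so a URL that happens to contain the text amp; somewhere else is left alone.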
     
    xenon2010, Apr 28, 2010 IP