Hi,

I need to download vBulletin captcha images to my hard drive using curl and PHP. I am collecting captcha samples from several vBulletin boards for a research project.

Here is what I have done with curl so far:

1. Download the register.php page.
2. Parse the downloaded page to get the captcha image URL.
3. Download that image.

Steps 1 and 2 work correctly, but when I try to download the captcha image I don't get the captcha. I get either a very tiny blank GIF or a PNG with the word "vbulletin" on it. I really don't know what I'm doing wrong. If I output the HTML and push it to the browser, the image shows correctly, but that's not what I want: I want to download the image and save it to my hard drive.

Here is the code I've been working on:

    //get contents with curl
    function get_content($url) {
        $theString = parse_url($url);
        $cookieName = $theString['host'];
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, $url."register.php");
        curl_setopt($ch, CURLOPT_REFERER, $url."register.php");
        curl_setopt($ch, CURLOPT_HEADER, 0);
        curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; MSIE 7.0; Windows NT 6.0; en-US)');
        curl_setopt($ch, CURLOPT_COOKIEJAR, "cookies/cookie.txt"); //saved cookies
        curl_setopt($ch, CURLOPT_COOKIEFILE, "cookies/cookie.txt"); //saved cookies
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        $string = curl_exec($ch);
        //print_r(curl_getinfo($ch));
        curl_close($ch);
        return $string;
    }

    //vbulletin main page
    $url = 'http://blavbulletin.com/';
    //get the page
    $results = get_content($url);
    if (preg_match_all('/<img[^>]*id\=\"imagereg\"[^>]*src\=\"([^\"]*)\"[^>]*>/s', $results , $captchaimages)) {
        $captcha = $captchaimages[1][0];
        echo "<img src='$url"."$captcha'>"; //when echoed the pic shows correctly
        //now get the pic
        $file = get_content("$url"."$captcha");
        //save the pic on HDD
        file_put_contents("captcha.jpg", $file);
    }

Any help would be appreciated.

Regards,
Here is the strategy I'd suggest when working with curl:

1. Install the Tamper Data add-on for Firefox, or any proxy server: anything that lets you see the headers and the raw request/response.
2. Launch Tamper Data and visit the page you're interested in with the browser.
3. In your case, find the request line for the captcha image and select it.
4. The request area (bottom left in Tamper Data) shows all the request headers.
5. Copy each of them (right-click, then Copy) and paste it into your source code as:

    $headers = array(
        'User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; ru; rv:1.9.2.3) Gecko/20100401 MRA 5.6 (build 03278) Firefox/3.6.3 (.NET CLR 3.5.30729)',
        'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language: ru,en-us;q=0.7,en;q=0.3',
        'Accept-Charset: windows-1251,utf-8;q=0.7,*;q=0.7',
        'Content-Type: application/x-www-form-urlencoded; charset=UTF-8',
        'X-Requested-With: XMLHttpRequest',
        'Pragma: no-cache',
        'Cache-Control: no-cache'
    );

Tamper Data copies headers in the form Accept-Charset=...., so be sure to change that to Accept-Charset: .... (replace = with :). Each row should be one element of the $headers array.
Add $headers to your curl request with:

    curl_setopt($ch,CURLOPT_HTTPHEADER,$headers);

so you should end up with something like:

    $headers = array(
        'User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; ru; rv:1.9.2.3) Gecko/20100401 MRA 5.6 (build 03278) Firefox/3.6.3 (.NET CLR 3.5.30729)',
        'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language: ru,en-us;q=0.7,en;q=0.3',
        'Accept-Charset: windows-1251,utf-8;q=0.7,*;q=0.7',
        'Content-Type: application/x-www-form-urlencoded; charset=UTF-8',
        'X-Requested-With: XMLHttpRequest',
        'Pragma: no-cache',
        'Cache-Control: no-cache'
    );
    $ch = curl_init("http://.....");
    curl_setopt($ch,CURLOPT_RETURNTRANSFER,1);
    curl_setopt($ch,CURLOPT_HTTPHEADER,$headers);
    $res = curl_exec($ch);

The idea is that if you copy the URL and headers correctly, the server won't be able to tell your request apart from the same request made by the browser.

As a side note: I'd advise you to use Zend_Dom_Query from Zend Framework rather than PCRE for this purpose. It copes even with invalid XHTML, and writing $dom->query('.captcha p') is much handier than these complex preg_match() calls.
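To make the DOM-based suggestion concrete, here is a minimal sketch. It uses PHP's built-in DOMDocument and DOMXPath rather than Zend_Dom_Query, a substitution made only so the example has no framework dependency. The function name find_captcha_src and the sample markup in the usage line are made up for illustration; the imagereg id is the one from the regex in the question.

```php
<?php
// Hypothetical helper: returns the src attribute of <img id="imagereg">,
// or null when the page has no such image.
function find_captcha_src(string $html): ?string
{
    if ($html === '') {
        return null; // loadHTML() rejects an empty string
    }

    // Real-world pages are rarely valid HTML; collect parser warnings
    // internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $loaded = $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if (!$loaded) {
        return null;
    }

    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query('//img[@id="imagereg"]');
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }

    $src = $nodes->item(0)->getAttribute('src');
    return $src !== '' ? $src : null;
}

// Usage with made-up sample markup:
// $src = find_captcha_src('<img id="imagereg" src="image.php?type=hvimage&amp;hash=abc123">');
```

One design point worth knowing: getAttribute() returns the attribute value with HTML entities already decoded, so the result can be appended to the board URL as is.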
Actually, I'm using the Live HTTP Headers extension. When I load the page, I get these headers for the captcha image:

    $header = array(
        "Host: blabla.com",
        "Accept: image/png,image/*;q=0.8,*/*;q=0.5",
        "Accept-Language: en-us,en;q=0.5",
        "Accept-Encoding: gzip,deflate",
        "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7",
        "Keep-Alive: 115",
        "Connection: keep-alive",
        "Referer: http://blabla.com/register.php",
        "User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.3) Gecko/20100401 Firefox/3.6.3 (.NET CLR 3.5.30729)"
    );

There is also a Cookie header, which I removed because I'm handling cookies through the curl options. Anyway, I ran the script and all I get is a JPG image with the word "vbulletin" on it. Any ideas?
Do you mean you're using:

    curl_setopt($ch, CURLOPT_COOKIEJAR, "cookies/cookie.txt"); //saved cookies
    curl_setopt($ch, CURLOPT_COOKIEFILE, "cookies/cookie.txt"); //saved cookies

What is in the file? Most likely the session ID is stored in the cookie you removed, and vBulletin could be checking for a session flag. I don't know whether it does, but it can, so you should try sending the cookie with your own session ID rather than one taken from samples. Try removing these two options (CURLOPT_COOKIEJAR and CURLOPT_COOKIEFILE) and adding only the Cookie: header from Live HTTP Headers. If you still don't see the captcha after this, could you PM me the URL?
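A minimal sketch of that suggestion, assuming the cookie names and values are copied by hand from the browser's Cookie: header. The helper names build_cookie_header and fetch_with_cookies are made up for illustration, and the values in the usage line are placeholders. CURLOPT_COOKIE takes the contents of the Cookie header as a single string, with pairs separated by a semicolon and a space.

```php
<?php
// Hypothetical helper: turns ['name' => 'value', ...] into the string
// that goes after "Cookie: " in the request.
function build_cookie_header(array $cookies): string
{
    $pairs = [];
    foreach ($cookies as $name => $value) {
        $pairs[] = $name . '=' . $value;
    }
    return implode('; ', $pairs);
}

// Hypothetical helper: fetches $url sending only the given cookies,
// without CURLOPT_COOKIEJAR / CURLOPT_COOKIEFILE.
function fetch_with_cookies(string $url, array $cookies)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_COOKIE, build_cookie_header($cookies));
    $body = curl_exec($ch); // string on success, false on failure
    curl_close($ch);
    return $body;
}

// Usage (placeholder values, not real session data):
// $image = fetch_with_cookies($captchaUrl, ['bb_sessionhash' => '...', 'bb_lastvisit' => '...']);
```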
Okay, let me show you my two functions:

    function get_content($url) {
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_REFERER, $url);
        curl_setopt($ch, CURLOPT_HEADER, 0);
        curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; MSIE 7.0; Windows NT 6.0; en-US)');
        curl_setopt($ch, CURLOPT_COOKIEJAR, "cookies/cookie.txt"); //save cookies
        curl_setopt($ch, CURLOPT_COOKIEFILE, "cookies/cookie.txt"); //saved cookies
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        $string = curl_exec($ch);
        curl_close($ch);
        return $string;
    }

This function fetches the register page. As you can see, I added a CURLOPT_COOKIEJAR file to save the cookies. The cookie file contains:

    bb_lastvisit
    bb_lastactivity
    bb_sessionhash

Those values are then used by the call to my second function:

    function get_image($url, $site) {
        $header = array(
            "Host: $site",
            "User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.3) Gecko/20100401 Firefox/3.6.3 (.NET CLR 3.5.30729)",
            "Accept-Language: en-us,en;q=0.5",
            "Accept-Encoding: gzip,deflate",
            "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7",
            "Keep-Alive: 115",
            "Connection: keep-alive",
            "Referer: $site"."register.php"
        );
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_HTTPHEADER, $header);
        curl_setopt($ch, CURLOPT_HEADER, 0);
        curl_setopt($ch, CURLOPT_COOKIEFILE, "cookies/cookie.txt"); //read saved cookies
        curl_setopt($ch, CURLOPT_VERBOSE, 1);
        curl_setopt($ch, CURLOPT_STDERR, $sd); //output details
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        $string = curl_exec($ch);
        curl_close($ch);
        return $string;
    }

This function should get the image. It runs, but the image I get is not the captcha image.
OK, can you install the Charles proxy (shareware) and make another request from your browser? Then add

    curl_setopt($ch, CURLOPT_PROXY,'127.0.0.1:8888');

and make another request from your script. Then look in Charles and compare the two requests, both for the main content and for the image. Their headers should be exactly the same. If there is a difference, try fixing it in curl (headers, cookies, whatever it takes), or post the difference here.
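The comparison itself can also be done in code once both sets of headers have been copied out of the proxy. Below is a minimal sketch; the function names parse_header_lines and diff_request_headers are made up for illustration, and each input is assumed to be a list of "Name: value" strings like the $header arrays above.

```php
<?php
// Hypothetical helper: turns ["Name: value", ...] into
// ['name' => 'value', ...] with lower-cased names, since header
// names are case-insensitive.
function parse_header_lines(array $lines): array
{
    $parsed = [];
    foreach ($lines as $line) {
        $parts = explode(':', $line, 2);
        if (count($parts) !== 2) {
            continue; // skip anything that is not "Name: value"
        }
        $parsed[strtolower(trim($parts[0]))] = trim($parts[1]);
    }
    return $parsed;
}

// Hypothetical helper: lists every header that is missing on one side
// or has a different value, keyed by lower-cased header name.
function diff_request_headers(array $browser, array $script): array
{
    $a = parse_header_lines($browser);
    $b = parse_header_lines($script);
    $diff = [];
    $names = array_unique(array_merge(array_keys($a), array_keys($b)));
    foreach ($names as $name) {
        $left = $a[$name] ?? null;
        $right = $b[$name] ?? null;
        if ($left !== $right) {
            $diff[$name] = ['browser' => $left, 'script' => $right];
        }
    }
    return $diff;
}

// Usage: print_r(diff_request_headers($browserHeaders, $scriptHeaders));
```

If a proxy is not available, curl can report what it actually sent: set CURLINFO_HEADER_OUT to true with curl_setopt() before the request, then read curl_getinfo($ch, CURLINFO_HEADER_OUT) after curl_exec().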
Holy shit, I found the solution, and it was a really silly thing.

For starters, all of the headers were correct, and my approach was right from the beginning. The only thing I missed was the amp; part inside the image URL. So with my regex:

    if (preg_match_all('/<img[^>]*id\=\"imagereg\"[^>]*src\=\"([^\"]*)\"[^>]*>/s', $data , $captchaimages)) {
        $captcha = $captchaimages[1][0]; //this returns image url with amp;
        echo "<img src='$fixedUrl/$captcha'>";
        $captcha = str_replace("amp;","",$captcha); //remove amp; from the string
        // voila, now it works like a charm
        $file = get_image("$fixedUrl/$captcha");
        file_put_contents("captcha.jpg", "$file");
    }

All I did was remove the amp; from the captcha URL using str_replace(). Damn, I spent two days on this silly amp; without noticing that it was causing the problem.

Anyway, thanks for the help.
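For anyone hitting the same thing: in HTML source, a literal & inside an attribute value is written as the entity &amp;. A browser decodes it before requesting the image, which is why the echoed img tag worked while the raw string passed to curl did not. A slightly more general alternative to str_replace() is html_entity_decode(), which decodes any entity in the attribute rather than just this one. The sketch below is a suggestion, not what the script above does; the function name captcha_url and the sample values in the usage line are made up for illustration.

```php
<?php
// Hypothetical helper: builds the absolute image URL from the board's
// base URL and the raw src attribute as it appears in the HTML source.
function captcha_url(string $baseUrl, string $srcAttribute): string
{
    // Decode entities such as &amp; the way a browser would.
    $decoded = html_entity_decode($srcAttribute, ENT_QUOTES | ENT_HTML401, 'UTF-8');

    // Join with exactly one slash between base and path.
    return rtrim($baseUrl, '/') . '/' . ltrim($decoded, '/');
}

// Usage with made-up sample values:
// $url = captcha_url('http://example.com/', 'image.php?type=hvimage&amp;hash=abc123');
// $file = get_image($url, ...);
```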