There is this function:

function disguise_curl($url)
{
    $curl = curl_init();

    // setup headers - used the same headers from Firefox version 2.0.0.6
    // below was split up because php.net said the line was too long. :/
    $header[0] = "Accept: text/xml,application/xml,application/xhtml+xml,";
    $header[0] .= "text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5";
    $header[] = "Cache-Control: max-age=0";
    $header[] = "Connection: keep-alive";
    $header[] = "Keep-Alive: 300";
    $header[] = "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7";
    $header[] = "Accept-Language: en-us,en;q=0.5";
    $header[] = "Pragma: "; // browsers keep this blank.

    curl_setopt($curl, CURLOPT_URL, $url);
    curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.3) Gecko/20100401 Firefox/3.6.3');
    curl_setopt($curl, CURLOPT_HTTPHEADER, $header);
    curl_setopt($curl, CURLOPT_REFERER, 'http://www.google.com');
    curl_setopt($curl, CURLOPT_ENCODING, 'gzip,deflate');
    curl_setopt($curl, CURLOPT_AUTOREFERER, true);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($curl, CURLOPT_TIMEOUT, 10);

    $html = curl_exec($curl); // execute the curl command

    if (!$html) {
        echo "cURL error number: " . curl_errno($curl);
        echo "cURL error: " . curl_error($curl);
        exit;
    }

    curl_close($curl); // close the connection

    return $html; // and finally, return $html
}
Code (markup):

...that several people seem to use to scrape content off a website (to state the obvious, you would do "echo disguise_curl($url)"). Is there any way to detect if someone is doing that to my site, and either block their access or show them a page with a specific message? I've experimented with some sites to see whether they manage to block access this way, and found that http://london.vivastreet.co.uk manages to do it. I haven't been able to figure out how, but maybe someone can. A second question: why would someone write a complicated function like that when get_file_contents($url) does the same thing? Is it to avoid suspicion? Thank you very much for your time.
It's file_get_contents. With that function you can only fetch a page. With cURL, however, you can set a referrer, accept and manage cookies (which means you can log into websites), send data via POST, and much more. So you can do a lot more with cURL. The only way to stop cURL access would be to build the site with loads of AJAX and JavaScript, but then your website is not really SEO friendly.
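To illustrate the difference described above, here is a minimal sketch of a cURL request that posts a login form and keeps the resulting cookies for later requests, something a bare file_get_contents() call can't do. The URL, form fields, and cookie-jar path are invented purely for the example:

<?php
// hypothetical login URL and form field names, for illustration only
$ch = curl_init('http://example.com/login');

curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query([
    'username' => 'demo',
    'password' => 'secret',
]));
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);           // return the body instead of printing it
curl_setopt($ch, CURLOPT_COOKIEJAR, '/tmp/cookies.txt');  // save cookies set by the response
curl_setopt($ch, CURLOPT_COOKIEFILE, '/tmp/cookies.txt'); // send saved cookies on later requests
curl_setopt($ch, CURLOPT_REFERER, 'http://example.com/'); // optional spoofed referrer

$response = curl_exec($ch);
curl_close($ch);
PHP: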
Actually, you can do pretty much the same basic things you mentioned above with file_get_contents as well, if you use stream contexts. But in the end cURL is faster and easier in many ways, plus it has a lot of extra features. Surely if you wanted to block access, you would just disable cURL? Otherwise you'll have a hell of a time trying to block specific cURL requests (assuming it's even possible!).
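For reference, a rough sketch of the stream-context approach mentioned above, sending browser-like headers with file_get_contents(). The header values and target URL are just example choices:

<?php
// build a context that sends browser-like headers, roughly mirroring the cURL version
$context = stream_context_create([
    'http' => [
        'method'     => 'GET',
        'header'     => "Accept-Language: en-us,en;q=0.5\r\n" .
                        "Referer: http://www.google.com\r\n",
        'user_agent' => 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.3) Gecko/20100401 Firefox/3.6.3',
        'timeout'    => 10,
    ],
]);

$html = file_get_contents('http://example.com/', false, $context);
PHP: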
Well, that's a bit difficult, isn't it? How do you know whether someone is using a browser or cURL? It's practically impossible, at least from my point of view.
Thanks, stephan and zalinski. I have a better idea now. I gather from your posts that it is not possible to block requests from cURL. As for the example site I gave above, I just realized that the specific URL I am not able to access with that function is http://london.vivastreet.co.uk/cars+london. Since that link wasn't accessible, I assumed the entire site wasn't accessible, and so posted the home page URL, which appears to be accessible through this function. Any idea why this is happening? Is the "+" in that URL doing something, or is there a way to block URLs from cURL?
I found the problem. I was passing the URL through urldecode() before sending it to the disguise_curl() function, so the "+" in the "cars+london" part was becoming a space, which produced the error page I was seeing. So you are right, cURL works for this page too.
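For anyone hitting the same thing, a quick illustration of the behaviour described above: urldecode() treats "+" as an encoded space (form-encoding rules), while rawurldecode() only decodes %XX sequences and leaves "+" alone.

<?php
echo urldecode('http://london.vivastreet.co.uk/cars+london') . "\n";
// http://london.vivastreet.co.uk/cars london  -> the broken URL described above

echo rawurldecode('http://london.vivastreet.co.uk/cars+london') . "\n";
// http://london.vivastreet.co.uk/cars+london  -> "+" is preserved
PHP: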
I wouldn't recommend blocking cURL at all; it might affect search engines too. The only differences between "real" visitors and cURL are the user-agent header (which can be spoofed), browsing/crawling speed (which can also be lowered in cURL to match that of a normal visitor), and JavaScript. Since JS runs only on the client side (and only when it's enabled), it can't be used to block cURL access, so the only workable approach is some server-side checking, but then you can't tell whether the user has JS enabled or not. It's a pretty hard one.
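As a rough illustration of the kind of server-side checking mentioned above, here is a sketch combining a naive user-agent blocklist with a crude per-session rate limit. The blocklist entries and thresholds are invented for the example, and as the posts above point out, every one of these signals can be spoofed:

<?php
// crude server-side heuristics: naive user-agent blocklist plus per-session rate limiting
session_start();

$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';

// block requests that openly identify as curl/wget/etc.
// a "disguised" user agent like the one in the first post sails straight through this
if ($ua === '' || preg_match('/\b(curl|wget|libwww|python-requests)\b/i', $ua)) {
    header('HTTP/1.1 403 Forbidden');
    exit('Automated access is not allowed.');
}

// very naive rate limit: more than 30 requests in 60 seconds gets a 429.
// note: a scraper that ignores cookies gets a fresh session on every request,
// so this only slows down clients that actually keep the session cookie.
$now = time();
$_SESSION['hits'] = array_filter(
    isset($_SESSION['hits']) ? $_SESSION['hits'] : [],
    function ($t) use ($now) { return $t > $now - 60; }
);
$_SESSION['hits'][] = $now;

if (count($_SESSION['hits']) > 30) {
    header('HTTP/1.1 429 Too Many Requests');
    exit('Slow down.');
}
PHP: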
Previous posters in this thread have entirely missed the point. This is a great routine. The responses suggesting cURL instead apparently did not even look at the posting: it uses cURL! What it does is replicate what Firefox sends, which no rational website will disallow. If you don't like that, change the headers to replicate IE. Either way, it uses cURL, is much faster, and puts a far lighter load on the client server than file_get_contents does. As the posting stands now, some websites will still return a 503, but if you update a couple of the header values to 2012-era ones, it works fine with every website we have tested. Great posting! It saved me hours of digging through Firefox to see how it communicates. Thank you!
You can always block the scraping site's IP if you know which site is scraping your data. If you block the IP, they cannot scrape your site unless they use a proxy server. That's what makes cURL amazing. I love it.
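A minimal sketch of that kind of IP block at the application level; the addresses below are documentation placeholders, and in practice this is usually done in the web server or firewall configuration rather than in PHP:

<?php
// placeholder addresses - replace with scraper IPs you have actually observed in your logs
$blocked_ips = ['203.0.113.10', '198.51.100.7'];

if (in_array($_SERVER['REMOTE_ADDR'], $blocked_ips, true)) {
    header('HTTP/1.1 403 Forbidden');
    exit('Access denied.');
}
PHP: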
As said above, you cannot stop people from scraping your site; you can only make it harder. The site above is probably checking for a referrer and/or a browser user agent. Try adding the following two options:

curl_setopt($curl, CURLOPT_REFERER, 'http://london.vivastreet.co.uk');
curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');
PHP:
Properly formed, there should be NO way to distinguish PHP or any other user agent from a 'legitimate' one like a browser. HTTP and HTML are open formats; as such, blocking ANY user agent isn't just bad practice, it's effectively impossible, since anything you do can easily be slapped aside in moments. It's a bit like the people who want to 'obfuscate' their code: the web isn't designed for it, and anything you try in order to pull it off is 100% grade A farm fresh manure. Anyone who tells you otherwise is packing you so full of sand you could change your name to Sahara. Really, if the data is sensitive enough that you want to block access from ANY user agent, don't put it online in the first place!