There is this function:

function disguise_curl($url)
{
    $curl = curl_init();

    // setup headers - used the same headers from Firefox version 2.0.0.6
    // below was split up because php.net said the line was too long. :/
    $header[0] = "Accept: text/xml,application/xml,application/xhtml+xml,";
    $header[0] .= "text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5";
    $header[] = "Cache-Control: max-age=0";
    $header[] = "Connection: keep-alive";
    $header[] = "Keep-Alive: 300";
    $header[] = "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7";
    $header[] = "Accept-Language: en-us,en;q=0.5";
    $header[] = "Pragma: "; // browsers keep this blank.

    curl_setopt($curl, CURLOPT_URL, $url);
    curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.3) Gecko/20100401 Firefox/3.6.3');
    curl_setopt($curl, CURLOPT_HTTPHEADER, $header);
    curl_setopt($curl, CURLOPT_REFERER, 'http://www.google.com');
    curl_setopt($curl, CURLOPT_ENCODING, 'gzip,deflate');
    curl_setopt($curl, CURLOPT_AUTOREFERER, true);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($curl, CURLOPT_TIMEOUT, 10);

    $html = curl_exec($curl); // execute the curl command

    if (!$html) {
        echo "cURL error number: " . curl_errno($curl);
        echo "cURL error: " . curl_error($curl);
        exit;
    }

    curl_close($curl); // close the connection

    return $html; // and finally, return $html
}
Code (markup):

...that several people seem to use to scrape content off a website (to state the obvious, you would do "echo disguise_curl($url)"). Is there any way to detect if someone is doing that to my site, and either block their access or show them a page with a specific message? I've experimented with some sites to see whether they manage to block access this way, and found that http://london.vivastreet.co.uk manages to do it. I haven't been able to figure out how, but maybe someone can. A second question: why would someone write a complicated function like that when get_file_contents($url) does the same thing? Is it to avoid suspicion? Thank you very much for your time.
It's file_get_contents. With that function you can only fetch a page. With cURL, however, you can set a referrer, accept and manage cookies (which means you can log into websites), send data via POST, and much more. So you can do a lot more with cURL. The only way to stop cURL access would be to build the site with loads of AJAX and JavaScript, but then your website is not really SEO friendly.
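To illustrate the difference described above, here is a minimal sketch of a cURL request that posts a login form and keeps the resulting cookies for later requests, something a bare file_get_contents() call can't do. The URL, form fields, and cookie-jar path are invented purely for the example:

<?php
// hypothetical login URL and form field names, for illustration only
$ch = curl_init('http://example.com/login');

curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query([
    'username' => 'demo',
    'password' => 'secret',
]));
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);           // return the body instead of printing it
curl_setopt($ch, CURLOPT_COOKIEJAR, '/tmp/cookies.txt');  // save cookies set by the response
curl_setopt($ch, CURLOPT_COOKIEFILE, '/tmp/cookies.txt'); // send saved cookies on later requests
curl_setopt($ch, CURLOPT_REFERER, 'http://example.com/'); // optional spoofed referrer

$response = curl_exec($ch);
curl_close($ch);
PHP: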
Actually, you can do pretty much the same basic things you mentioned above with file_get_contents as well, if you use stream contexts. But in the end cURL is faster and easier in many ways, plus it has a lot of extra features. Surely if you wanted to block access, you would just disable cURL? Otherwise you'll have a hell of a time trying to block specific cURL requests (assuming it's even possible!).
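For reference, a rough sketch of the stream-context approach mentioned above, sending browser-like headers with file_get_contents(). The header values and target URL are just example choices:

<?php
// build a context that sends browser-like headers, roughly mirroring the cURL version
$context = stream_context_create([
    'http' => [
        'method'     => 'GET',
        'header'     => "Accept-Language: en-us,en;q=0.5\r\n" .
                        "Referer: http://www.google.com\r\n",
        'user_agent' => 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.3) Gecko/20100401 Firefox/3.6.3',
        'timeout'    => 10,
    ],
]);

$html = file_get_contents('http://example.com/', false, $context);
PHP: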
Well, that's a bit difficult, isn't it? How do you know whether someone is using a browser or cURL? It's practically impossible, at least from my point of view.
Thanks, stephan and zalinski. I have a better idea now. I gather from your posts that it is not possible to block requests from cURL. As for the example site I gave above, I just realized that the specific URL I am not able to access with that function is http://london.vivastreet.co.uk/cars+london. Since that link wasn't accessible, I assumed the entire site wasn't accessible, and so posted the home page URL, which appears to be accessible through this function. Any idea why this is happening? Is the "+" in that URL doing something, or is there a way to block URLs from cURL?
I found the problem. I was passing the URL through urldecode() before sending it to the disguise_curl() function, so the "+" in the "cars+london" part was becoming a space, which produced the error page I was seeing. So you are right, cURL works for this page too.
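For anyone hitting the same thing, a quick illustration of the behaviour described above: urldecode() treats "+" as an encoded space (form-encoding rules), while rawurldecode() only decodes %XX sequences and leaves "+" alone.

<?php
echo urldecode('http://london.vivastreet.co.uk/cars+london') . "\n";
// http://london.vivastreet.co.uk/cars london  -> the broken URL described above

echo rawurldecode('http://london.vivastreet.co.uk/cars+london') . "\n";
// http://london.vivastreet.co.uk/cars+london  -> "+" is preserved
PHP: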
I wouldn't recommend blocking cURL at all; it might affect search engines too. The only differences between "real" visitors and cURL are the user-agent header (which can be spoofed), browsing/crawling speed (which can also be lowered in cURL to match that of a normal visitor), and JavaScript. Since JS runs only on the client side (and only when it's enabled), it can't be used to block cURL access, so the only workable approach is some server-side checking, but then you can't tell whether the user has JS enabled or not. It's a pretty hard one.
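As a rough illustration of the kind of server-side checking mentioned above, here is a sketch combining a naive user-agent blocklist with a crude per-session rate limit. The blocklist entries and thresholds are invented for the example, and as the posts above point out, every one of these signals can be spoofed:

<?php
// crude server-side heuristics: naive user-agent blocklist plus per-session rate limiting
session_start();

$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';

// block requests that openly identify as curl/wget/etc.
// a "disguised" user agent like the one in the first post sails straight through this
if ($ua === '' || preg_match('/\b(curl|wget|libwww|python-requests)\b/i', $ua)) {
    header('HTTP/1.1 403 Forbidden');
    exit('Automated access is not allowed.');
}

// very naive rate limit: more than 30 requests in 60 seconds gets a 429.
// note: a scraper that ignores cookies gets a fresh session on every request,
// so this only slows down clients that actually keep the session cookie.
$now = time();
$_SESSION['hits'] = array_filter(
    isset($_SESSION['hits']) ? $_SESSION['hits'] : [],
    function ($t) use ($now) { return $t > $now - 60; }
);
$_SESSION['hits'][] = $now;

if (count($_SESSION['hits']) > 30) {
    header('HTTP/1.1 429 Too Many Requests');
    exit('Slow down.');
}
PHP: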
Previous posters in this thread have entirely missed the point. This is a great routine. The responses suggesting cURL instead apparently did not even look at the posting: it uses cURL! What it does is replicate what Firefox sends, which no rational website will disallow. If you don't like that, change the headers to replicate IE. Either way, it uses cURL, is much faster, and puts a far lighter load on the client server than file_get_contents does. As the posting stands now, some websites will still return a 503, but if you update a couple of the header values to 2012-era ones, it works fine with every website we have tested. Great posting! It saved me hours of digging through Firefox to see how it communicates. Thank you!
You can always block the scraping site's IP if you know which site is scraping your data. If you block the IP, they cannot scrape your site unless they use a proxy server. That's what makes cURL amazing. I love it.
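A minimal sketch of that kind of IP block at the application level; the addresses below are documentation placeholders, and in practice this is usually done in the web server or firewall configuration rather than in PHP:

<?php
// placeholder addresses - replace with scraper IPs you have actually observed in your logs
$blocked_ips = ['203.0.113.10', '198.51.100.7'];

if (in_array($_SERVER['REMOTE_ADDR'], $blocked_ips, true)) {
    header('HTTP/1.1 403 Forbidden');
    exit('Access denied.');
}
PHP: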
As said above, you cannot stop people from scraping your site; you can only make it harder. The site above is probably checking for a referrer and/or a browser user agent. Try adding the following two options:

curl_setopt($curl, CURLOPT_REFERER, 'http://london.vivastreet.co.uk');
curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');
PHP:
Properly formed, there should be NO way to distinguish PHP or any other user agent from a 'legitimate' one like a browser. HTTP and HTML are open formats; as such, blocking ANY user agent isn't just bad practice, it's effectively impossible, since anything you do can easily be slapped aside in moments. It's a bit like the people who want to 'obfuscate' their code: the web isn't designed for it, and anything you try in order to pull it off is 100% grade A farm fresh manure. Anyone who tells you otherwise is packing you so full of sand you could change your name to Sahara. Really, if the data is sensitive enough that you want to block access from ANY user agent, don't put it online in the first place!