Blocking PHP cURL from scraping website content

Discussion in 'PHP' started by knkk, Jun 9, 2010.

  1. #1
    There is this function:

    
    function disguise_curl($url) 
    { 
    	$curl = curl_init(); 
    
    	// setup headers - used the same headers from Firefox version 2.0.0.6
    	// below was split up because php.net said the line was too long. :/
    	$header[0] = "Accept: text/xml,application/xml,application/xhtml+xml,"; 
    	$header[0] .= "text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5"; 
    	$header[] = "Cache-Control: max-age=0"; 
    	$header[] = "Connection: keep-alive"; 
    	$header[] = "Keep-Alive: 300"; 
    	$header[] = "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7"; 
    	$header[] = "Accept-Language: en-us,en;q=0.5"; 
    	$header[] = "Pragma: "; //browsers keep this blank. 
    
    	curl_setopt($curl, CURLOPT_URL, $url); 
    	curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.3) Gecko/20100401 Firefox/3.6.3'); 
    	curl_setopt($curl, CURLOPT_HTTPHEADER, $header); 
    	curl_setopt($curl, CURLOPT_REFERER, 'http://www.google.com'); 
    	curl_setopt($curl, CURLOPT_ENCODING, 'gzip,deflate'); 
    	curl_setopt($curl, CURLOPT_AUTOREFERER, true); 
    	curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1); 
    	curl_setopt($curl, CURLOPT_TIMEOUT, 10); 
    
    	$html = curl_exec($curl); //execute the curl command 
    	if (!$html) 
    	{
    		echo "cURL error number:" .curl_errno($ch);
    		echo "cURL error:" . curl_error($ch);
    		exit;
    	}
      
    	curl_close($curl); //close the connection 
    
    	return $html; //and finally, return $html 
    }
    
    Code (markup):
    ...that several people seem to use to scrape content off a website (to state the obvious, you would do "echo disguise_curl($url)").

    Is there any way to detect if someone is doing that to my site, and block access to them or show a page with a specific message?

    I've experimented with some sites to see whether any of them manage to block access this way, and found that http://london.vivastreet.co.uk does. I haven't been able to figure out how, but maybe someone can.

    A second query: Why would someone write a complicated function like that when get_file_contents($url) does the same? Is it to avoid suspicion?

    Thank you very much for your time.
     
    knkk, Jun 9, 2010 IP
  2. stephan2307

    stephan2307 Well-Known Member

    Messages:
    1,277
    Likes Received:
    33
    Best Answers:
    7
    Trophy Points:
    150
    #2
    It's file_get_contents.

    With that function you can only fetch a page. With cURL, however, you can set a referer, accept and manage cookies (which means you can log into websites), send data via POST, and much more. So you can do a lot more with cURL.
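
    For instance, here is a minimal sketch of a cURL login, the kind of thing file_get_contents alone can't easily do. This is not code from the thread; the URL, form fields and cookie path are made-up placeholders:

    
    function curl_login_sketch() 
    { 
    	$curl = curl_init(); 
    
    	// hypothetical login form: URL and fields are placeholders 
    	curl_setopt($curl, CURLOPT_URL, 'http://www.example.com/login'); 
    	curl_setopt($curl, CURLOPT_POST, true); 
    	curl_setopt($curl, CURLOPT_POSTFIELDS, 'username=me&password=secret'); 
    
    	// write cookies out and send them back, so the session persists 
    	curl_setopt($curl, CURLOPT_COOKIEJAR, '/tmp/cookies.txt'); 
    	curl_setopt($curl, CURLOPT_COOKIEFILE, '/tmp/cookies.txt'); 
    
    	curl_setopt($curl, CURLOPT_REFERER, 'http://www.example.com/'); 
    	curl_setopt($curl, CURLOPT_RETURNTRANSFER, true); 
    
    	$html = curl_exec($curl); 
    	curl_close($curl); 
    
    	return $html; 
    } 
    
    PHP: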

    The only way to stop cURL access would be by using loads of AJAX and JavaScript, but then your website is not really SEO friendly.
     
    stephan2307, Jun 9, 2010 IP
  3. szalinski

    szalinski Peon

    Messages:
    341
    Likes Received:
    5
    Best Answers:
    0
    Trophy Points:
    0
    #3
    Actually, you can do pretty much the same basic things you mentioned above with file_get_contents too, if you use stream contexts. But in the end cURL is faster and easier in many ways, plus it has a lot of extra stuff too :D
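
    As a rough illustration of the stream-context approach (a sketch only; the header values are examples, not anything from this thread):

    
    // build a context that sends browser-like headers with the request 
    $context = stream_context_create(array( 
    	'http' => array( 
    		'method'  => 'GET', 
    		'header'  => "User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) Firefox/3.6.3\r\n" . 
    		             "Referer: http://www.google.com\r\n" . 
    		             "Accept-Language: en-us,en;q=0.5\r\n", 
    		'timeout' => 10, 
    	), 
    )); 
    
    // file_get_contents() takes the context as its third argument 
    $html = file_get_contents('http://www.example.com/', false, $context); 
    
    PHP: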

    Surely if you wanted to block access, you could just disable cURL? Otherwise you'll have a hell of a time trying to block specific cURL requests (assuming it's even possible!).
     
    szalinski, Jun 9, 2010 IP
  4. stephan2307

    stephan2307 Well-Known Member

    Messages:
    1,277
    Likes Received:
    33
    Best Answers:
    7
    Trophy Points:
    150
    #4
    I think he wants to stop other people from using cURL to scrape his website, not to disable cURL itself.
     
    stephan2307, Jun 9, 2010 IP
  5. szalinski

    szalinski Peon

    Messages:
    341
    Likes Received:
    5
    Best Answers:
    0
    Trophy Points:
    0
    #5
    Well, that's a bit difficult, isn't it... how do you know whether someone is using a browser or cURL? It's practically impossible, at least from my point of view.
     
    szalinski, Jun 9, 2010 IP
  6. knkk

    knkk Peon

    Messages:
    43
    Likes Received:
    0
    Best Answers:
    0
    Trophy Points:
    0
    #6
    Thanks, stephan and szalinski. I have a better idea now. I gather from your posts that it is not possible to block requests from cURL.

    As for the example site I gave above, I just realized that the specific URL I am not able to access with that function is http://london.vivastreet.co.uk/cars+london. Since that link wasn't accessible, I assumed the entire site wasn't accessible, and so posted the home page URL, which appears to be accessible through this function. Any idea why this is happening? Is the "+" in that URL doing something, or is there a way to block URLs from cURL?
     
    knkk, Jun 9, 2010 IP
  7. knkk

    knkk Peon

    Messages:
    43
    Likes Received:
    0
    Best Answers:
    0
    Trophy Points:
    0
    #7
    I found the problem. I was passing the URL to the disguise_curl() function after running it through urldecode() first, so the "+" in the "cars+london" part was becoming a " " (space), resulting in the error page I was seeing. So you are right, cURL works for this page too.
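
    In case anyone else trips over the same thing, a quick demonstration of why the "+" disappears (an illustration, not code from the original post):

    
    $url = 'http://london.vivastreet.co.uk/cars+london'; 
    
    // urldecode() treats "+" as an encoded space (form-style decoding) 
    echo urldecode($url);    // http://london.vivastreet.co.uk/cars london 
    
    // rawurldecode() leaves a literal "+" alone 
    echo rawurldecode($url); // http://london.vivastreet.co.uk/cars+london 
    
    PHP: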
     
    knkk, Jun 9, 2010 IP
  8. Gray Fox

    Gray Fox Well-Known Member

    Messages:
    196
    Likes Received:
    8
    Best Answers:
    0
    Trophy Points:
    130
    #8
    I wouldn't recommend blocking cURL at all; it might affect search engines too. The only differences between "real" visitors and cURL are the User-Agent header (which can be spoofed), the browsing/crawling speed (which can also be lowered in cURL to match that of a normal visitor) and JavaScript. Since JS runs only on the client side (and only when it's enabled), it can't by itself be used to block cURL access, so the only good solution would be some server-side checking; but then you can't tell whether the user has JS enabled or not. It's a pretty hard one.
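
    To make the limits of server-side checking concrete, here is a minimal sketch of that kind of heuristic (not from this thread, and trivially defeated by any scraper that copies a browser's headers, exactly as noted above):

    
    // naive server-side heuristics: spoofable, shown only to illustrate the limits 
    function looks_like_curl() 
    { 
    	$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : ''; 
    
    	// default cURL/PHP agents, or no User-Agent header at all 
    	if ($ua === '' || stripos($ua, 'curl') !== false) { 
    		return true; 
    	} 
    
    	// real browsers virtually always send Accept-Language 
    	if (!isset($_SERVER['HTTP_ACCEPT_LANGUAGE'])) { 
    		return true; 
    	} 
    
    	return false; 
    } 
    
    if (looks_like_curl()) { 
    	header('HTTP/1.1 403 Forbidden'); 
    	exit('Automated access is not allowed.'); 
    } 
    
    PHP: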
     
    Gray Fox, Jun 10, 2010 IP
  9. knkk

    knkk Peon

    Messages:
    43
    Likes Received:
    0
    Best Answers:
    0
    Trophy Points:
    0
    #9
    Thanks, Gray Fox. That was a useful insight. I'm guessing it's tough to avoid visits through cURL...
     
    knkk, Jun 10, 2010 IP
  10. jackvance

    jackvance Peon

    Messages:
    1
    Likes Received:
    0
    Best Answers:
    0
    Trophy Points:
    0
    #10
    Previous posters in this topic have entirely missed the point. This is a great routine. The responses about using cURL instead apparently did not even look at the posting: it uses cURL! What it does is replicate what Firefox does, which no rational website will disallow. If you don't like that, change the headers to replicate IE. Either way, it uses cURL, is much faster, and puts a far lighter load on the client's server than file_get_contents does.

    As the posting stands now, some websites will still return a 503, but if you change a couple of values in the headers to bring them up to date for 2012, it works fine with all the websites we have tested.

    Great posting!!! Saved me hours of digging through Firefox to see how it communicates.

    Thank you!!!
     
    jackvance, Aug 24, 2012 IP
  11. sabato

    sabato Member

    Messages:
    407
    Likes Received:
    6
    Best Answers:
    1
    Trophy Points:
    43
    #11
    You can always block the site's IP if you know which site is scraping your data. If you block the IP, they cannot scrape your site unless they use a proxy server. That's what makes cURL amazing. :) I love it
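
    A bare-bones version of that idea (a sketch; the addresses are documentation placeholders, and anyone behind a proxy or a rotating IP pool walks right past it):

    
    // known scraper IPs: placeholder addresses for illustration 
    $blocked_ips = array('203.0.113.42', '198.51.100.7'); 
    
    if (in_array($_SERVER['REMOTE_ADDR'], $blocked_ips)) { 
    	header('HTTP/1.1 403 Forbidden'); 
    	exit; 
    } 
    
    PHP: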
     
    sabato, Sep 21, 2012 IP
  12. ThePHPMaster

    ThePHPMaster Well-Known Member

    Messages:
    737
    Likes Received:
    52
    Best Answers:
    33
    Trophy Points:
    150
    #12
    As said above, you cannot stop people from scraping your site; you can only make it harder. The site above is probably checking the Referer and/or the browser's user agent. Try adding the following two options:

    
    curl_setopt($curl, CURLOPT_REFERER, 'http://london.vivastreet.co.uk');
    curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');
    
    PHP:
     
    ThePHPMaster, Sep 21, 2012 IP
  13. deathshadow

    deathshadow Acclaimed Member

    Messages:
    9,732
    Likes Received:
    1,998
    Best Answers:
    253
    Trophy Points:
    515
    #13
    Properly formed, there should be NO way to distinguish PHP or any other UA from a 'legitimate' user agent like a browser. HTTP and HTML are open formats; as such, blocking ANY user agent isn't just bad practice, it's effectively impossible, as anything you do can easily be slapped aside in moments. It's a bit like the people who want to 'obfuscate' their code: the web isn't designed for it, and as such ANYTHING you try to do to pull it off is 100% grade A farm fresh manure. Anyone who tells you otherwise is packing you so full of sand you could change your name to Sahara.

    Really, if the data is sensitive enough that you want to block access by ANY user agent, don't put it online in the first place!
     
    deathshadow, Sep 22, 2012 IP