So, I have scripts that scrape various websites, and I'm wondering if it's possible to pretend I'm a browser so that I don't stand out as a script and get banned/blocked. Is there something I can do with headers maybe? Thanks
Chrome:

User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/525.19 (KHTML, like Gecko) Chrome/1.0.154.36 Safari/525.19
Referer: ****
Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
Accept-Encoding: gzip,deflate,bzip2,sdch
Accept-Language: ro-RO,ro,en-US,en
Accept-Charset: ISO-8859-2,*,utf-8
Host: ******
Connection: Keep-Alive

IE 7:

Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, application/xaml+xml, application/vnd.ms-xpsdocument, application/x-ms-xbap, application/x-ms-application, application/x-silverlight, application/x-silverlight-2-b2, application/x-shockwave-flash, */*
Accept-Language: en-US
Ua-Cpu: x86
Accept-Encoding: gzip, deflate
User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1) ; Embedded Web Browser from: .NET CLR 2.0.50727; .NET CLR 3.0.04506.648; .NET CLR 3.5.21022; .NET CLR 1.1.4322)
Host: ******
Connection: Keep-Alive

Firefox:

Host: ********
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; ro; rv:1.9.0.4) Gecko/2008102920 Firefox/3.0.4
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: ro-ro,ro;q=0.8,en-us;q=0.6,en-gb;q=0.4,en;q=0.2
Accept-Encoding: gzip,deflate
Accept-Charset: UTF-8,*
Keep-Alive: 300
Connection: keep-alive

I think you can use one of them.
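To use one of these sets from a script, the headers first have to be turned into "Name: value" lines. Here is a rough sketch in PHP using the Firefox set; the function name build_header_lines and the $firefox array are made up for illustration. Host is left out on the assumption that the HTTP client fills it in from the URL, and Accept-Encoding is left out so the response comes back uncompressed.

```php
<?php
// Hypothetical helper: converts an associative array of header names and
// values into the "Name: value" strings that PHP HTTP clients expect.
function build_header_lines(array $headers): array
{
    $lines = [];
    foreach ($headers as $name => $value) {
        $lines[] = $name . ': ' . $value;
    }
    return $lines;
}

// Firefox 3.0.4 header set copied from the post above
// (Host and Accept-Encoding deliberately omitted, see the note above).
$firefox = [
    'User-Agent'      => 'Mozilla/5.0 (Windows; U; Windows NT 5.1; ro; rv:1.9.0.4) Gecko/2008102920 Firefox/3.0.4',
    'Accept'          => 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language' => 'ro-ro,ro;q=0.8,en-us;q=0.6,en-gb;q=0.4,en;q=0.2',
    'Accept-Charset'  => 'UTF-8,*',
    'Keep-Alive'      => '300',
    'Connection'      => 'keep-alive',
];

// Print the lines, one per row, to check what would be sent.
echo implode("\n", build_header_lines($firefox)), "\n";
```

The resulting array can be handed to whichever HTTP client the script uses.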
With this you will only "be a browser" for very simple systems. For Google, for example, you need to set more headers, like the ones ExtremeData posted.
Minimal set of headers for a text-only browser called Lynx:

Accept: text/html, text/plain, text/css, text/sgml, */*;q=0.01
Accept-Language: en
User-Agent: Lynx/2.8.6rel.4 libwww-FM/2.14
I am using it to scrape Google SERPs at the moment, but I am just starting out, so thanks for that. It will save me headaches later, I am sure. Sav
Not only can you send browser-like headers with cURL, you can also set the Referer header to make the request look more convincing. See this: http://www.kavoir.com/2008/12/pretend-your-scraper-script-as-a-browser-when-scraping-in-php.html
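A rough sketch of what that could look like with PHP's cURL extension. The function names browser_curl_options and fetch_as_browser and the example.com URLs are placeholders I made up, and the header values are just the Firefox ones posted earlier in the thread. No guarantee this gets past any particular site's checks.

```php
<?php
// Hypothetical helper: builds the cURL options for a browser-like request.
function browser_curl_options(string $url, string $referer): array
{
    return [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,  // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,  // follow redirects like a browser would
        CURLOPT_USERAGENT      => 'Mozilla/5.0 (Windows; U; Windows NT 5.1; ro; rv:1.9.0.4) Gecko/2008102920 Firefox/3.0.4',
        CURLOPT_REFERER        => $referer,
        CURLOPT_ENCODING       => '',    // advertise every encoding cURL supports and decode the response
        CURLOPT_HTTPHEADER     => [
            'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
            'Accept-Language: ro-ro,ro;q=0.8,en-us;q=0.6,en-gb;q=0.4,en;q=0.2',
        ],
    ];
}

// Hypothetical helper: performs the request.
// Returns the response body as a string, or false on failure.
function fetch_as_browser(string $url, string $referer)
{
    $ch = curl_init();
    curl_setopt_array($ch, browser_curl_options($url, $referer));
    return curl_exec($ch);
}

// Usage (placeholder URLs, commented out so nothing is fetched by accident):
// $html = fetch_as_browser('http://www.example.com/page.html', 'http://www.example.com/');
```

Host is not set by hand here, since cURL derives it from the URL.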
With fopen, the simplest option looks like:

ini_set('user_agent', 'your browser of choice');
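If more than the User-Agent is needed, PHP's HTTP stream wrapper also accepts a stream context carrying extra headers, and fopen takes that context as its fourth argument. A minimal sketch, assuming allow_url_fopen is enabled; the function name browser_stream_context and the example.com URL are made up for illustration, and the header values are the Lynx ones posted earlier.

```php
<?php
// Hypothetical helper: builds a stream context that sends a custom
// User-Agent plus extra header lines with HTTP requests.
function browser_stream_context(string $userAgent, array $headerLines)
{
    return stream_context_create([
        'http' => [
            'method'     => 'GET',
            'user_agent' => $userAgent,
            // Header lines are joined with CRLF, as HTTP expects.
            'header'     => implode("\r\n", $headerLines),
        ],
    ]);
}

// Lynx header set copied from the earlier post.
$context = browser_stream_context('Lynx/2.8.6rel.4 libwww-FM/2.14', [
    'Accept: text/html, text/plain, text/css, text/sgml, */*;q=0.01',
    'Accept-Language: en',
]);

// Usage (placeholder URL, commented out so nothing is fetched by accident):
// $handle = fopen('http://www.example.com/', 'r', false, $context);
```

The same context should also work with file_get_contents, which takes it as its third argument.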
ExtremeData, your examples have: Host: ****** Is this valid or does ****** mean put the host in there?