How do I pretend to be a browser when scraping using PHP?

Discussion in 'PHP' started by rchampion, Dec 19, 2008.

  1. #1
    So, I have scripts that scrape various websites, and I'm wondering if it's possible to pretend I'm a browser so that I don't stand out as a script and get banned/blocked.

    Is there something I can do with headers maybe?

    Thanks
     
    rchampion, Dec 19, 2008 IP
  2. ExtremeData

    #2
    Chrome:
    User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/525.19 (KHTML, like Gecko) Chrome/1.0.154.36 Safari/525.19
    Referer: ****
    Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
    Accept-Encoding: gzip,deflate,bzip2,sdch
    Accept-Language: ro-RO,ro,en-US,en
    Accept-Charset: ISO-8859-2,*,utf-8
    Host: ******
    Connection: Keep-Alive

    IE 7:
    Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, application/xaml+xml, application/vnd.ms-xpsdocument, application/x-ms-xbap, application/x-ms-application, application/x-silverlight, application/x-silverlight-2-b2, application/x-shockwave-flash, */* 
    Accept-Language: en-US 
    Ua-Cpu: x86 
    Accept-Encoding: gzip, deflate 
    User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1) ; Embedded Web Browser from:  .NET CLR 2.0.50727; .NET CLR 3.0.04506.648; .NET CLR 3.5.21022; .NET CLR 1.1.4322) 
    Host: ******
    Connection: Keep-Alive 
    
    Firefox:
    Host:  ********
    User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; ro; rv:1.9.0.4) Gecko/2008102920 Firefox/3.0.4
    Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
    Accept-Language: ro-ro,ro;q=0.8,en-us;q=0.6,en-gb;q=0.4,en;q=0.2
    Accept-Encoding: gzip,deflate
    Accept-Charset: UTF-8,*
    Keep-Alive: 300
    Connection: keep-alive

    I think you can use any one of them.
     
    ExtremeData, Dec 19, 2008 IP
  3. atlantaazfinest

    #3
    Yeah, when using cURL you have to set the user agent.
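
    As a rough sketch of what that could look like, the snippet below builds a set of cURL options that send the Firefox headers posted earlier in the thread. The helper name `browser_curl_options` and the URL are made up for illustration, and it assumes the cURL extension is loaded. cURL fills in the Host header itself, so it is not set here.

```php
<?php
// Hypothetical helper: returns cURL options for a browser-like GET request.
// Header values are copied from the Firefox 3.0.4 example in this thread.
function browser_curl_options(string $url): array
{
    return [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true, // follow redirects
        CURLOPT_USERAGENT      => 'Mozilla/5.0 (Windows; U; Windows NT 5.1; ro; rv:1.9.0.4) Gecko/2008102920 Firefox/3.0.4',
        CURLOPT_ENCODING       => '',   // empty string: advertise every encoding cURL can decode
        CURLOPT_HTTPHEADER     => [
            'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
            'Accept-Language: ro-ro,ro;q=0.8,en-us;q=0.6,en-gb;q=0.4,en;q=0.2',
            'Accept-Charset: UTF-8,*',
        ],
    ];
}

$options = browser_curl_options('http://example.com/');

// Usage (makes a real network request, so it is left commented out):
// $ch = curl_init();
// curl_setopt_array($ch, $options);
// $html = curl_exec($ch);
// curl_close($ch);
```

    Setting `CURLOPT_ENCODING` rather than a hand-written Accept-Encoding header means cURL also decompresses the response for you.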
     
    atlantaazfinest, Dec 19, 2008 IP
  4. Savo

    #4
    This is what I am using at the moment:

    ini_set('user_agent', 'MSIE 4.0b2;');

    Sav
     
    Savo, Dec 19, 2008 IP
  5. onlywin

    #5
    That will only make you "look like a browser" to very simple systems. For Google, for example, you need to send more headers, like the ones ExtremeData posted.
     
    onlywin, Dec 19, 2008 IP
  6. joebert

    #6
    Here is a minimal set of headers, as sent by the text-only browser Lynx:

    Accept: text/html, text/plain, text/css, text/sgml, */*;q=0.01
    Accept-Language: en
    User-Agent: Lynx/2.8.6rel.4 libwww-FM/2.14
     
    joebert, Dec 20, 2008 IP
  7. Savo

    #7
    I am using it to scrape Google SERPs at the moment, but I am just starting out, so thanks for that. I am sure it will save me headaches later.

    Sav
     
    Savo, Dec 20, 2008 IP
  8. rchampion

    #8
    Thanks guys :)
     
    rchampion, Dec 28, 2008 IP
  9. yangyang

    #9
    yangyang, Dec 29, 2008 IP
  10. rchampion

    #10
    Any idea how to do it with fopen?
     
    rchampion, Jan 6, 2009 IP
  11. forkaya

    #11
    With fopen, it looks like you are stuck with the simple approach:
    
    ini_set('user_agent','your browser of choice');
    
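
    For what it's worth, fopen also accepts a stream context as its fourth argument, and the `http` context options include `user_agent` and `header`, so more than the user agent can be set. A minimal sketch follows, using the Lynx headers from earlier in the thread; the helper name `browser_stream_context` and the URL are invented for this example.

```php
<?php
// Hypothetical helper: builds an HTTP stream context carrying extra headers.
function browser_stream_context(string $userAgent, array $headers)
{
    return stream_context_create([
        'http' => [
            'method'     => 'GET',
            'user_agent' => $userAgent,
            // Header lines are joined with CRLF, as in a raw HTTP request.
            'header'     => implode("\r\n", $headers),
        ],
    ]);
}

$context = browser_stream_context(
    'Lynx/2.8.6rel.4 libwww-FM/2.14',
    [
        'Accept: text/html, text/plain, text/css, text/sgml, */*;q=0.01',
        'Accept-Language: en',
    ]
);

// Usage (makes a real network request, so it is left commented out):
// $handle = fopen('http://example.com/', 'r', false, $context);
// $html = stream_get_contents($handle);
// fclose($handle);
```

    The same context can be passed to file_get_contents as its third argument.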
     
    forkaya, Jan 6, 2009 IP
  12. rchampion

    #12
    I'm toying with fsockopen now. Looks like you can change headers with that and fwrite.
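
    A rough sketch of that idea: build the request text by hand, then write it to the socket. The function name `build_raw_request`, the host, and the header choices are assumptions for illustration only. Note that the Host header has to carry the real host name of the site being requested.

```php
<?php
// Hypothetical helper: builds the raw text of an HTTP/1.1 GET request.
function build_raw_request(string $host, string $path, array $headers): string
{
    $lines = ["GET $path HTTP/1.1", "Host: $host"];
    foreach ($headers as $name => $value) {
        $lines[] = "$name: $value";
    }
    // Ask the server to close the connection so reading stops at end of response.
    $lines[] = 'Connection: close';

    // Header lines end with CRLF; a blank line ends the header section.
    return implode("\r\n", $lines) . "\r\n\r\n";
}

$request = build_raw_request('example.com', '/', [
    'User-Agent'      => 'Lynx/2.8.6rel.4 libwww-FM/2.14',
    'Accept'          => 'text/html, text/plain, text/css, text/sgml, */*;q=0.01',
    'Accept-Language' => 'en',
]);

// Usage (makes a real network request, so it is left commented out):
// $fp = fsockopen('example.com', 80, $errno, $errstr, 10);
// if ($fp) {
//     fwrite($fp, $request);
//     $response = stream_get_contents($fp); // status line, headers, and body
//     fclose($fp);
// }
```

    With a raw socket you get the response exactly as sent, so the status line and headers still have to be split off, and a chunked or compressed body has to be decoded by your own code.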
     
    rchampion, Jan 6, 2009 IP
  13. rchampion

    #13
    ExtremeData, your examples have:

    Host: ******

    Is this valid or does ****** mean put the host in there?
     
    rchampion, Jan 6, 2009 IP
  14. ExtremeData

    #14
    I replaced the real host with ******. Put the host you are requesting there.
     
    ExtremeData, Jan 6, 2009 IP