Web Scraping Blocked

Discussion in 'PHP' started by PaIntR, Oct 21, 2013.

  1. #1
    Hi there,
    I'm currently working on a project where I need to extract content from a PHP-generated site. So it's a simple web scraping job. The site looks like this:
    [link removed]

    I have everything ready, the data scraping and so on, and it works if I download the page first and then run my PHP script on the local copy. The problem is that I cannot read it directly from the source, because their system somehow detects spiders and blocks them. So I can download the page, upload it to my server, and then work through it, but I cannot work through it directly. How can I bypass this?

    Help is much appreciated! Thanks
     
    Last edited: Oct 22, 2013
    PaIntR, Oct 21, 2013 IP
  2. PoPSiCLe

    PoPSiCLe Illustrious Member

    #2
    Since you provide no code and no explanation of what you're trying to do, we can't really help you.

    Besides, web scraping shouldn't be hard to do. Just talk to the admins of the page you're trying to scrape and ask them to let your spider through. You do have an agreement with them already, of course?
     
    PoPSiCLe, Oct 21, 2013 IP
  3. samyak

    samyak Active Member

    #3
    I tried to do file_get_contents() on that URL and got this message: "GO AWAY!!! Robots are not supposed to visit this page!" :)

    Maybe the site is checking the user agent before serving pages. Try cURL (or a stream context) with a known browser user agent, along the lines of the sketch below.
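
    (A minimal sketch of that idea, assuming the block is purely user-agent based; the URL and the user-agent string are placeholders, not anything from this thread:)

    <?php
    // Send a browser-like User-Agent via a stream context so that
    // file_get_contents() no longer announces itself as a PHP script.
    $context = stream_context_create([
        'http' => [
            'header' => "User-Agent: Mozilla/5.0 (Windows NT 6.1; rv:24.0) Gecko/20100101 Firefox/24.0\r\n",
        ],
    ]);

    // 'http://example.com/page.php' is a placeholder for the target URL.
    $html = file_get_contents('http://example.com/page.php', false, $context);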
     
    samyak, Oct 21, 2013 IP
  4. EricBruggema

    EricBruggema Well-Known Member

    #4
    And they're right! Stop using their data! Get it yourself! Or use cURL! :)

    See php.net/curl; it's way better than file_get_contents(), and if you use multiple cURL connections it goes much, much faster! :)
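
    (For the "multiple connections" part, here is a rough curl_multi sketch; the URLs are placeholders, and whether parallel fetching actually helps depends on the target site's rate limiting:)

    <?php
    // Fetch several pages in parallel with the curl_multi API.
    $urls = [
        'http://example.com/page1.php', // placeholder URLs
        'http://example.com/page2.php',
    ];

    $mh = curl_multi_init();
    $handles = [];
    foreach ($urls as $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_multi_add_handle($mh, $ch);
        $handles[] = $ch;
    }

    // Drive all transfers until every handle has finished.
    do {
        curl_multi_exec($mh, $running);
        curl_multi_select($mh);
    } while ($running > 0);

    // Collect the responses and clean up.
    $pages = [];
    foreach ($handles as $ch) {
        $pages[] = curl_multi_getcontent($ch);
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
    }
    curl_multi_close($mh);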
     
    EricBruggema, Oct 21, 2013 IP
  5. deathshadow

    deathshadow Acclaimed Member

    #5
    Given that they are intentionally blocking you, why exactly are you trying to STEAL their hard work?
     
    deathshadow, Oct 21, 2013 IP
  6. jaran

    jaran Greenhorn

    #6
    They're blocking the URL because they don't want someone stealing their bandwidth.
     
    jaran, Oct 22, 2013 IP
  7. #7
    The above information is correct. You need to use cURL for this. Set the user agent string to something like Chrome or Firefox. If you are trying to access images, you also need to set the referer to the main domain. Also set the followlocation flag, just in case.

    If this still doesn't work, open the debug console in Chrome (or another browser), inspect which files are downloaded during each request, and check whether any of them set cookies or sessions. If they do, you need to request those files as well and accept the cookies.

    That should do it in 99% of cases. A sketch of the cURL options is below.
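
    (A minimal sketch of those options, assuming plain PHP cURL; the target URL and the user-agent string are placeholders:)

    <?php
    $ch = curl_init('http://example.com/page.php'); // placeholder target

    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    // Present a browser-like user agent instead of PHP's default.
    curl_setopt($ch, CURLOPT_USERAGENT,
        'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36');
    // Some sites check the referer, especially for images.
    curl_setopt($ch, CURLOPT_REFERER, 'http://example.com/');
    // Follow any redirects, just in case.
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);

    $html = curl_exec($ch);
    curl_close($ch);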
     
    stephan2307, Oct 22, 2013 IP
  8. PaIntR

    PaIntR Greenhorn

    #8
    Thanks guys for the many answers!

    And no, I'm not stealing/copying anyone's hard work, don't worry about that.
     
    PaIntR, Oct 22, 2013 IP
  9. ThePHPMaster

    ThePHPMaster Well-Known Member

    #9
    There are three things that you need to do in cURL, which should bypass most "do not programmatically visit this site" restrictions:

    1) Ensure that you have a user agent set.
    2) Ensure that you have a referer set; usually I set it to something like google.com or the actual domain name.
    3) Ensure that you have follow-redirects (CURLOPT_FOLLOWLOCATION) on.

    In some unique cases I have encountered, sites use cookies (sessions) to limit access, in which case you will also need to store and send the cookies, as in the sketch below.
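
    (A sketch of the cookie case, using cURL's cookie jar; the file path and the target URL are placeholders:)

    <?php
    // Reuse one cookie file so a session cookie set on the first
    // request is sent back automatically on later requests.
    $cookieFile = '/tmp/scraper_cookies.txt'; // placeholder path

    $ch = curl_init('http://example.com/page.php'); // placeholder target
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_COOKIEJAR, $cookieFile);  // write cookies here
    curl_setopt($ch, CURLOPT_COOKIEFILE, $cookieFile); // and send them back

    $html = curl_exec($ch);
    curl_close($ch);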
     
    ThePHPMaster, Oct 22, 2013 IP
  10. Gangsta

    Gangsta Active Member

    #10
    1. Do not use file_get_contents() or a simple request; use cURL or sockets with proper headers.
    2. Use proxies, as in the sketch below.
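
    (Routing the request through a proxy is one extra cURL option; a sketch, where the proxy address and target URL are placeholders:)

    <?php
    $ch = curl_init('http://example.com/page.php'); // placeholder target
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    // Route the request through a proxy so repeated requests
    // don't all come from the same IP. '1.2.3.4:8080' is a placeholder.
    curl_setopt($ch, CURLOPT_PROXY, '1.2.3.4:8080');
    // For a proxy that requires authentication:
    // curl_setopt($ch, CURLOPT_PROXYUSERPWD, 'user:pass');

    $html = curl_exec($ch);
    curl_close($ch);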
     
    Gangsta, Oct 22, 2013 IP
  11. stephan2307

    stephan2307 Well-Known Member

    #11
    Thanks for rewriting what I said earlier
     
    stephan2307, Oct 23, 2013 IP
  12. ezprint2008

    ezprint2008 Well-Known Member

    #12

    The sign said "KEEP OUT!", yet I'm trying to see... how do I tunnel under that? Anyone? ...
    :D
     
    ezprint2008, Oct 30, 2013 IP
  13. itsyssolutions

    itsyssolutions Member

    #13
    Use proxies and rewrite the script using cURL. If you wish to outsource the job, let me know. We specialize in web scraping and provide our clients with data in CSV format.
     
    itsyssolutions, Nov 2, 2013 IP