Hi there, I'm currently working on a project where I need to extract content from a PHP-generated site, so it's a simple web scraping task. The site looks like this: [link removed] I have everything ready (the data scraping etc.), and it works if I download the page first and then run my PHP script against it. The problem is that I cannot read it directly from the source, because their system somehow detects spiders and blocks them. So I can download the page, upload it to my server, and then work through it, but I cannot work through it directly. How can I bypass this? Help is much appreciated! Thanks
Since you provide no code and no explanation of what you're trying to do, we can't really help you. Besides, web scraping shouldn't be hard to do: just talk to the admins of the site you're trying to scrape and ask them to let your spider through. You do already have an agreement with them, of course?
I tried to do file_get_contents() on that URL and got this message: "GO AWAY!!! Robots are not supposed to visit this page!" Maybe the site is checking the User-Agent header before serving the page. Try using cURL with a known user agent.
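For what it's worth, a minimal sketch of that approach; the URL and the Chrome-style user agent string are placeholders:

<?php
// Fetch the page with cURL and a browser-like User-Agent, since a plain
// file_get_contents() gets the "GO AWAY" response. URL is a placeholder.
$url = 'http://example.com/page.php';

$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);   // return the body instead of printing it
curl_setopt($ch, CURLOPT_USERAGENT,
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36');

$html = curl_exec($ch);
if ($html === false) {
    die('cURL error: ' . curl_error($ch));
}
curl_close($ch);

// $html now holds the raw page source, ready for the existing parsing code.
echo $html;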
And they're right! Stop using their data and get it yourself! Or use cURL (php.net/curl). It's much better than file_get_contents(), and if you use multiple cURL connections it will go much, much faster!
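If you do go the multiple-connections route, here is a rough sketch using curl_multi; the URLs are placeholders and the loop follows the pattern from the PHP manual:

<?php
// Fetch several pages in parallel with curl_multi. URLs are placeholders.
$urls = [
    'http://example.com/page.php?id=1',
    'http://example.com/page.php?id=2',
    'http://example.com/page.php?id=3',
];

$mh = curl_multi_init();
$handles = [];

foreach ($urls as $i => $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (compatible; placeholder)');
    curl_multi_add_handle($mh, $ch);
    $handles[$i] = $ch;
}

// Run all transfers until they finish.
do {
    $status = curl_multi_exec($mh, $running);
    if ($running) {
        curl_multi_select($mh);
    }
} while ($running && $status === CURLM_OK);

// Collect the results and clean up.
$results = [];
foreach ($handles as $i => $ch) {
    $results[$i] = curl_multi_getcontent($ch);
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
}
curl_multi_close($mh);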
The above information is correct. You need to use cURL for this. Set the user agent string to something like Chrome or Firefox. If you are trying to access images, you also need to set the referer to the main domain. Also set the followlocation flag, just in case. If that still doesn't work, open the debug console in Chrome or another browser, inspect which files are downloaded during each request, and check whether any of them set cookies or sessions. If they do, you need to download those files as well and accept the cookies. That should do it in 99% of cases.
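Something like this covers the user agent, referer, and followlocation part; the Firefox user agent string, domain, and image path are only placeholders:

<?php
// Request an image with a browser User-Agent, a Referer pointing at the
// main domain, and redirects followed. All URLs are placeholders.
$url = 'http://example.com/images/photo.jpg';

$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_USERAGENT,
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) Gecko/20100101 Firefox/121.0');
curl_setopt($ch, CURLOPT_REFERER, 'http://example.com/');   // referer = main domain
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);             // follow redirects, just in case

$data = curl_exec($ch);
curl_close($ch);

file_put_contents('photo.jpg', $data);   // save the image locally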
Thanks guys for the many answers! And no, I'm not stealing/copying anyone's hard work, don't worry about that.
There are three things you need to do with cURL, which should bypass most "do not programmatically visit this site" restrictions: 1) Ensure that you set a user agent. 2) Ensure that you set a referer; I usually set it to something like google.com or the actual domain name. 3) Ensure that follow-redirects is turned on. In some unique cases I have encountered, a site will use cookies (sessions) to limit access, in which case you will need to store and send the cookies as well, as in the sketch below.
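For that cookie/session case, this is the usual cookie-jar setup; the URL and cookie file path are placeholders:

<?php
// CURLOPT_COOKIEJAR writes any cookies the site sets to a file;
// CURLOPT_COOKIEFILE sends them back on subsequent requests.
$url       = 'http://example.com/page.php';
$cookieJar = __DIR__ . '/cookies.txt';   // placeholder path, must be writable

$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (compatible; placeholder)');
curl_setopt($ch, CURLOPT_REFERER, 'http://example.com/');
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookieJar);    // store cookies the server sets
curl_setopt($ch, CURLOPT_COOKIEFILE, $cookieJar);   // send stored cookies with the request

$html = curl_exec($ch);
curl_close($ch);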
1. Do not use file_get_contents() or a plain request; use cURL or sockets with proper headers. 2. Use proxies.
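For point 2, cURL can route requests through a proxy like this; the proxy address and credentials are placeholders:

<?php
// Route the request through a proxy with CURLOPT_PROXY. Address is a placeholder.
$url = 'http://example.com/page.php';

$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (compatible; placeholder)');
curl_setopt($ch, CURLOPT_PROXY, '127.0.0.1:8080');          // proxy host:port (placeholder)
// curl_setopt($ch, CURLOPT_PROXYUSERPWD, 'user:pass');     // only if the proxy requires auth

$html = curl_exec($ch);
curl_close($ch);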
Use proxies and rewrite the script using cURL. If you wish to outsource the job, let me know. We specialize in web scraping and provide our clients with data in CSV format.