Hi there, I'm currently working on a project where I need to extract content from a PHP-generated site, so it's a simple web scraping task. The site looks like this: [link removed] I have everything ready (the data scraping etc.), and it works if I download the page first and then run my PHP script against it. The problem is that I cannot read it directly from the source, because their system somehow detects spiders and blocks them. So I can download the page, upload it to my server, and then work through it, but I cannot work through it directly. How can I bypass this? Help is much appreciated! Thanks
Since you provide no code and no explanation of what you're trying to do, we can't really help you. Besides, web scraping shouldn't be hard to do: just talk to the admins of the site you're trying to scrape and ask them to let your spider through. You do already have an agreement with them, of course?
I tried to do file_get_contents() on that URL and got this message: "GO AWAY!!! Robots are not supposed to visit this page!" Maybe the site is checking the User-Agent header before serving the page. Try using cURL with a known user agent.
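For what it's worth, a minimal sketch of that approach; the URL and the Chrome-style user agent string are placeholders:

<?php
// Fetch the page with cURL and a browser-like User-Agent, since a plain
// file_get_contents() gets the "GO AWAY" response. URL is a placeholder.
$url = 'http://example.com/page.php';

$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);   // return the body instead of printing it
curl_setopt($ch, CURLOPT_USERAGENT,
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36');

$html = curl_exec($ch);
if ($html === false) {
    die('cURL error: ' . curl_error($ch));
}
curl_close($ch);

// $html now holds the raw page source, ready for the existing parsing code.
echo $html;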
And they're right! Stop using their data and get it yourself! Or use cURL (php.net/curl). It's much better than file_get_contents(), and if you use multiple cURL connections it will go much, much faster!
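If you do go the multiple-connections route, here is a rough sketch using curl_multi; the URLs are placeholders and the loop follows the pattern from the PHP manual:

<?php
// Fetch several pages in parallel with curl_multi. URLs are placeholders.
$urls = [
    'http://example.com/page.php?id=1',
    'http://example.com/page.php?id=2',
    'http://example.com/page.php?id=3',
];

$mh = curl_multi_init();
$handles = [];

foreach ($urls as $i => $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (compatible; placeholder)');
    curl_multi_add_handle($mh, $ch);
    $handles[$i] = $ch;
}

// Run all transfers until they finish.
do {
    $status = curl_multi_exec($mh, $running);
    if ($running) {
        curl_multi_select($mh);
    }
} while ($running && $status === CURLM_OK);

// Collect the results and clean up.
$results = [];
foreach ($handles as $i => $ch) {
    $results[$i] = curl_multi_getcontent($ch);
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
}
curl_multi_close($mh);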
The above information is correct. You need to use cURL for this. Set the user agent string to something like Chrome or Firefox. If you are trying to access images, you also need to set the referer to the main domain. Also set the followlocation flag, just in case. If that still doesn't work, open the debug console in Chrome or another browser, inspect which files are downloaded during each request, and check whether any of them set cookies or sessions. If they do, you need to download those files as well and accept the cookies. That should do it in 99% of cases.
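Something like this covers the user agent, referer, and followlocation part; the Firefox user agent string, domain, and image path are only placeholders:

<?php
// Request an image with a browser User-Agent, a Referer pointing at the
// main domain, and redirects followed. All URLs are placeholders.
$url = 'http://example.com/images/photo.jpg';

$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_USERAGENT,
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) Gecko/20100101 Firefox/121.0');
curl_setopt($ch, CURLOPT_REFERER, 'http://example.com/');   // referer = main domain
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);             // follow redirects, just in case

$data = curl_exec($ch);
curl_close($ch);

file_put_contents('photo.jpg', $data);   // save the image locally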
Thanks guys for the many answers! And no, I'm not stealing/copying anyone's hard work, don't worry about that.
There are three things you need to do with cURL, which should bypass most "do not programmatically visit this site" restrictions: 1) Ensure that you set a user agent. 2) Ensure that you set a referer; I usually set it to something like google.com or the actual domain name. 3) Ensure that follow-redirects is turned on. In some unique cases I have encountered, a site will use cookies (sessions) to limit access, in which case you will need to store and send the cookies as well, as in the sketch below.
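For that cookie/session case, this is the usual cookie-jar setup; the URL and cookie file path are placeholders:

<?php
// CURLOPT_COOKIEJAR writes any cookies the site sets to a file;
// CURLOPT_COOKIEFILE sends them back on subsequent requests.
$url       = 'http://example.com/page.php';
$cookieJar = __DIR__ . '/cookies.txt';   // placeholder path, must be writable

$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (compatible; placeholder)');
curl_setopt($ch, CURLOPT_REFERER, 'http://example.com/');
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookieJar);    // store cookies the server sets
curl_setopt($ch, CURLOPT_COOKIEFILE, $cookieJar);   // send stored cookies with the request

$html = curl_exec($ch);
curl_close($ch);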
1. Do not use file_get_contents() or a plain request; use cURL or sockets with proper headers. 2. Use proxies.
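For point 2, cURL can route requests through a proxy like this; the proxy address and credentials are placeholders:

<?php
// Route the request through a proxy with CURLOPT_PROXY. Address is a placeholder.
$url = 'http://example.com/page.php';

$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (compatible; placeholder)');
curl_setopt($ch, CURLOPT_PROXY, '127.0.0.1:8080');          // proxy host:port (placeholder)
// curl_setopt($ch, CURLOPT_PROXYUSERPWD, 'user:pass');     // only if the proxy requires auth

$html = curl_exec($ch);
curl_close($ch);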
Use proxies and rewrite the script using cURL. If you wish to outsource the job, let me know. We specialize in web scraping and provide our clients with data in CSV format.