I'm working on a small project, something like a search engine. Is there any way I can scrape results from Google / MSN / Yahoo without getting blocked? And a second question: which is the best library for this in PHP, cURL or something else? Sorry for my English. I hope someone can help.
There are APIs that you can use to access most search engines. If you just want to scrape, you might want to sleep a few seconds between requests to avoid a ban (I'm not sure how many seconds, but you will find out through trial and error). I would use cURL and randomize the request headers (User-Agent, Referer, session cookies, etc.).
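To make the "sleep and randomize headers" idea concrete, here is a rough sketch using PHP's cURL extension. The helper names (`buildRequestHeaders`, `randomDelayMicros`, `fetchPage`, `crawlPolitely`), the 3 to 8 second range, and the header values are all assumptions for illustration, not known-good settings; you would tune them by trial and error as described above.

```php
<?php
// Hypothetical helper: pick one random User-Agent and Referer per request.
function buildRequestHeaders(array $userAgents, array $referers): array
{
    return [
        'User-Agent: ' . $userAgents[array_rand($userAgents)],
        'Referer: ' . $referers[array_rand($referers)],
        'Accept-Language: en-US,en;q=0.8',
    ];
}

// Hypothetical helper: random pause length in microseconds, for usleep().
function randomDelayMicros(float $minSeconds, float $maxSeconds): int
{
    return random_int((int) ($minSeconds * 1000000), (int) ($maxSeconds * 1000000));
}

// Fetch one page with cURL; returns null if the transfer fails.
function fetchPage(string $url, array $headers): ?string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,   // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,   // follow redirects
        CURLOPT_HTTPHEADER     => $headers,
        CURLOPT_TIMEOUT        => 20,
    ]);
    $body = curl_exec($ch);
    return $body === false ? null : $body;
}

// Fetch a list of URLs, pausing a random interval after each request.
function crawlPolitely(array $urls, array $userAgents, array $referers): array
{
    $pages = [];
    foreach ($urls as $url) {
        $pages[$url] = fetchPage($url, buildRequestHeaders($userAgents, $referers));
        usleep(randomDelayMicros(3.0, 8.0)); // assumed range, adjust as needed
    }
    return $pages;
}
```

Calling `crawlPolitely($urls, $userAgents, $referers)` returns an array keyed by URL, with `null` for any request that failed.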
@ThePHPMaster Thanks, buddy. But these APIs have many limitations, like 5000 per day or 500 per term. I'm trying to get a complete list of a certain kind of site, and Google only lets you gather 600-700 results in total.
You need proxy support in your script or you will be disappointed. And since I've already done things like that, here's my advice to you: make your life easy and use Zend Framework, especially for the network part (fetching the pages) and the parsing part (using the DOM rather than regular expressions).
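If you want to see the two ideas (proxy for the network part, DOM for the parsing part) before pulling in a framework, here is a minimal sketch that uses plain cURL and PHP's built-in `DOMDocument` / `DOMXPath` instead of Zend Framework. The function names `fetchViaProxy` and `extractLinks` and the proxy address format are assumptions for illustration only.

```php
<?php
// Fetch a page through an HTTP proxy; $proxy is "host:port" (placeholder value).
function fetchViaProxy(string $url, string $proxy): ?string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_PROXY          => $proxy,
        CURLOPT_TIMEOUT        => 20,
    ]);
    $body = curl_exec($ch);
    return $body === false ? null : $body;
}

// Collect the href of every <a> element using the DOM, not regular expressions.
function extractLinks(string $html): array
{
    if (trim($html) === '') {
        return [];
    }

    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $links = [];
    foreach ($xpath->query('//a[@href]') as $anchor) {
        $links[] = $anchor->getAttribute('href');
    }
    return $links;
}
```

You would typically rotate through a list of proxies, passing a different one to `fetchViaProxy` on each request, and feed the returned HTML to `extractLinks`.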