I want to write a crawler in PHP that crawls Google search results for given keywords. The procedure will be something like this:

1. There will be a list of thousands of keywords in a file, in CSV or another format.
2. The crawler will query google.co.in for each keyword in the file.
3. The title, description and URL of the top 10 results will be collected and stored in a MySQL database.
4. The crawler will then move on to the next keyword after some delay, and the loop will continue until it reaches the daily limit of keywords to crawl. The next day it will start again.

I need some suggestions on:

1. How to crawl pages without using any add-ons. (I am going to run this from a free server, not my own machine, so I will only have PHP, MySQL and the general features.)
2. What kind of parsing I should use to extract the title, description and URLs from the HTML code.
3. What the delay and daily crawl limit should be. (I don't want to get banned by Google for automated queries.)

I would be really thankful for any kind of help. Links to relevant articles are most welcome.
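For the keyword file and the daily limit described above, here is a minimal sketch using only core PHP. It assumes one keyword per CSV row in the first column. The function names `load_keywords` and `next_batch` are made up for illustration, and the offset would have to be saved somewhere between runs (for example in a MySQL table), which is not shown.

```php
<?php
// Hypothetical helper: read one keyword per CSV row (first column only).
function load_keywords(string $csvPath): array
{
    $handle = fopen($csvPath, 'r');
    if ($handle === false) {
        throw new RuntimeException("Cannot open $csvPath");
    }
    $keywords = [];
    // Delimiter, enclosure and escape are passed explicitly.
    while (($row = fgetcsv($handle, 0, ',', '"', '\\')) !== false) {
        // A blank line comes back as [null], so cast before trimming.
        $keyword = trim((string) ($row[0] ?? ''));
        if ($keyword !== '') {
            $keywords[] = $keyword;
        }
    }
    fclose($handle);
    return $keywords;
}

// Hypothetical helper: the slice of keywords to process today.
// $offset is how many keywords earlier runs have already handled.
function next_batch(array $keywords, int $offset, int $dailyLimit): array
{
    return array_slice($keywords, $offset, $dailyLimit);
}
```

A daily cron job could call `next_batch($keywords, $offset, $dailyLimit)`, process the returned keywords one by one, and then store `$offset + count($batch)` for the next day.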
I hit enter too fast... Google gives you APIs and sample code showing how to incorporate their search results into your site. So does Yahoo: http://developer.yahoo.com/search/ Take a look at how to use PHP in combination with XML, and learn a bit about cURL and REST... it will take you a long way. As for a free web host that supports the PHP functions you will need: I don't know of one, but it wouldn't surprise me if you have a hard time finding one.
Initially I tried that, but I didn't find the Google Ajax Search API helpful for getting sponsored results. Because the Ajax search is based on JavaScript, I never get the HTML code for the sponsored results: when I view the source, there is no code for the search results. If there is any way to get the sponsored results from the Google Ajax search, let me know.
I don't think you can get Google sponsored results using the above method (Ajax). You can fetch the search result page using cURL or fopen and extract the sponsored results with a regular expression match. I am trying to do the same thing. If anyone else can suggest better alternatives, it would be helpful.
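To make the fetch-and-extract idea concrete, here is a rough sketch. Note that it swaps the regular expression for PHP's bundled DOM extension (`DOMDocument` plus `DOMXPath`), which tends to cope better with messy HTML. The class names `result`, `title` and `desc` are placeholders and not Google's real markup, so the XPath queries would need adjusting after inspecting an actual page. `fetch_page` and `extract_results` are illustrative names, and the sketch assumes the cURL and DOM extensions are enabled on the host.

```php
<?php
// Hypothetical helper: download a page with cURL; returns null on failure.
function fetch_page(string $url): ?string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_TIMEOUT        => 15,
    ]);
    $body = curl_exec($ch);
    return $body === false ? null : $body;
}

// Hypothetical helper: pull title, URL and description out of result blocks.
// The class names below are assumed placeholders, not real Google markup.
function extract_results(string $html, int $max = 10): array
{
    if (trim($html) === '') {
        return [];
    }
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // silence warnings on sloppy HTML
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $results = [];
    foreach ($xpath->query('//div[@class="result"]') as $block) {
        $link = $xpath->query('.//a[@class="title"]', $block)->item(0);
        $desc = $xpath->query('.//p[@class="desc"]', $block)->item(0);
        if ($link === null) {
            continue; // skip blocks without a title link
        }
        $results[] = [
            'title'       => trim($link->textContent),
            'url'         => $link->getAttribute('href'),
            'description' => $desc !== null ? trim($desc->textContent) : '',
        ];
        if (count($results) >= $max) {
            break;
        }
    }
    return $results;
}
```

Each returned row is a plain array with `title`, `url` and `description` keys, so it can be passed straight to a prepared INSERT statement for MySQL.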
Just a heads up: after a while you may hit a brick wall where you are temporarily banned for querying their pages too many times. If you're serious about doing this for thousands of keywords, consider finding a list of working proxies...
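One common way to lower the risk of such a ban is to space requests out and back off when a request is refused. Below is a small sketch of that idea. Nobody in this thread knows Google's actual thresholds, so the numbers (a base delay in seconds, a one-hour cap) are pure assumptions to tune, and `next_delay` is a made-up name.

```php
<?php
// Hypothetical helper: seconds to wait before the next query.
// The delay doubles for every consecutive failure (exponential backoff),
// is capped at $maxSeconds, and gets up to 50% random jitter added on top
// so that requests do not arrive at perfectly regular intervals.
function next_delay(int $baseSeconds, int $consecutiveFailures, int $maxSeconds = 3600): int
{
    $exponent = max(0, min($consecutiveFailures, 10)); // keep the power small
    $delay = min($baseSeconds * (2 ** $exponent), $maxSeconds);
    $jitter = random_int(0, intdiv($delay, 2));
    return $delay + $jitter;
}
```

In the crawl loop this would be used as `sleep(next_delay(30, $failures));`, resetting `$failures` to 0 after every successful request and incrementing it whenever a request is blocked or fails.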