Crawl Google search result pages

Discussion in 'PHP' started by rishirajsingh, Oct 15, 2007.

  1. #1
    I want to write a crawler in PHP that crawls Google search results for given keywords. The procedure would be something like this:


    1. There will be a list of thousands of keywords in a file, in CSV or another format.

    2. The crawler will query google.co.in for each keyword in the file.

    3. The title, description, and URL of the top 10 results will be collected and stored in a MySQL database.

    4. The crawler will then move on to the next keyword after some delay, and the loop will continue until it reaches the daily limit of keywords to crawl. The next day it will start again.
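
For what it's worth, steps 1 to 4 above could be sketched roughly like this. The function names `loadKeywords` and `runDailyBatch` are hypothetical, and the sketch assumes one keyword per CSV row in the first column. The actual fetching and storing is passed in as a callback, so only the loop, the delay, and the daily limit are shown.

```php
<?php
// Hypothetical helpers: the names and the CSV layout (keyword in the
// first column of each row) are assumptions, not from the thread.

/** Read the first column of every non-empty CSV row as a keyword. */
function loadKeywords(string $csvPath): array
{
    $keywords = [];
    $handle = fopen($csvPath, 'r');
    if ($handle === false) {
        return $keywords;
    }
    while (($row = fgetcsv($handle, 0, ',', '"', '\\')) !== false) {
        // fgetcsv() returns [null] for a blank line, so skip those.
        if (isset($row[0]) && trim($row[0]) !== '') {
            $keywords[] = trim($row[0]);
        }
    }
    fclose($handle);
    return $keywords;
}

/**
 * Crawl at most $dailyLimit keywords starting at $offset, sleeping
 * $delaySeconds between requests. Returns the offset to resume from
 * on the next run.
 */
function runDailyBatch(array $keywords, int $offset, int $dailyLimit, int $delaySeconds, callable $crawl): int
{
    $batch = array_slice($keywords, $offset, $dailyLimit);
    foreach ($batch as $i => $keyword) {
        $crawl($keyword);
        // No need to wait after the last keyword of the day.
        if ($i < count($batch) - 1 && $delaySeconds > 0) {
            sleep($delaySeconds);
        }
    }
    return $offset + count($batch);
}
```

The returned offset would need to be saved somewhere (a small file or a MySQL row, for example) so that the next day's run, started by cron or by hand, picks up where this one stopped. The delay and limit values are left as parameters because a safe value is not established anywhere in this thread.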





    I need some suggestions on:

    1. How to crawl pages without using any add-ons.
    (I am going to run this from a free server, not my own machine, so I will only have PHP, MySQL, and general features.)

    2. What kind of parsing I should use to extract the title, description, and URLs from the HTML code.

    3. What the delay and daily crawl limit should be. (I don't want to get banned by Google for automated queries. :rolleyes:)
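
On question 2, one option that needs nothing beyond PHP's bundled DOM extension is `DOMDocument` with `DOMXPath`. This is only a sketch: the function name `extractResults` is hypothetical, and the markup it looks for (`div class="result"`, `h3 > a`, `p class="snippet"`) is invented for illustration. A real results page uses different markup that changes over time, so the XPath expressions would have to be adapted.

```php
<?php
// Assumed markup, for illustration only:
// <div class="result"><h3><a href="URL">TITLE</a></h3><p class="snippet">DESCRIPTION</p></div>

/** Return up to $limit results as arrays with title, url, description. */
function extractResults(string $html, int $limit = 10): array
{
    if (trim($html) === '') {
        return [];
    }
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect parser warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $results = [];
    foreach ($xpath->query('//div[@class="result"]') as $node) {
        $link = $xpath->query('.//h3/a', $node)->item(0);
        $snippet = $xpath->query('.//p[@class="snippet"]', $node)->item(0);
        if ($link === null) {
            continue; // not a result block we understand
        }
        $results[] = [
            'title' => trim($link->textContent),
            'url' => $link->getAttribute('href'),
            'description' => $snippet !== null ? trim($snippet->textContent) : '',
        ];
        if (count($results) >= $limit) {
            break;
        }
    }
    return $results;
}
```

Each returned row maps directly onto a three-column MySQL insert (title, url, description).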

    I will be really thankful for any kind of help. Links to relevant articles are most welcome.
     
    rishirajsingh, Oct 15, 2007 IP
  2. rishirajsingh

    rishirajsingh Banned

    Messages:
    286
    Likes Received:
    3
    Best Answers:
    0
    Trophy Points:
    0
    #2
    If anyone has a crawler script (for any kind of crawl), please post it here or PM me.
     
    rishirajsingh, Oct 15, 2007 IP
  3. rishirajsingh

    #3
    Any suggestions, please? I hope I am in the right section of the DP forums.
     
    rishirajsingh, Oct 16, 2007 IP
  4. Edynas

    Edynas Peon

    Messages:
    796
    Likes Received:
    24
    Best Answers:
    0
    Trophy Points:
    0
    #4
    Edynas, Oct 16, 2007 IP
  5. Edynas

    #5
    I hit enter too fast...
    Google gives you APIs and sample code showing how to incorporate their search results into your site, as does Yahoo: http://developer.yahoo.com/search/
    Take a look at how to use PHP in combination with XML, and learn a bit about curl and REST... it will take you a long way.
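
As a rough illustration of the PHP + curl + XML combination: fetch an XML response from a REST endpoint with curl, then read it with SimpleXML. The function names `fetchXml` and `parseSearchXml`, the endpoint in the usage note, and the response layout (`<results><result><title/><url/><summary/></result></results>`) are all made up for this sketch and do not describe any real API's format.

```php
<?php
// Hypothetical response layout; check the documentation of the API
// you actually use and adjust the element names accordingly.

/** GET a URL with curl and return the body, or null on failure. */
function fetchXml(string $url): ?string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        CURLOPT_TIMEOUT => 15,
    ]);
    $body = curl_exec($ch);
    return is_string($body) ? $body : null;
}

/** Turn the assumed XML layout into an array of result rows. */
function parseSearchXml(string $xml): array
{
    if (trim($xml) === '') {
        return [];
    }
    $previous = libxml_use_internal_errors(true);
    $root = simplexml_load_string($xml);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    if ($root === false) {
        return []; // not well-formed XML
    }
    $results = [];
    foreach ($root->result as $item) {
        $results[] = [
            'title' => (string) $item->title,
            'url' => (string) $item->url,
            'description' => (string) $item->summary,
        ];
    }
    return $results;
}
```

Usage would look something like `parseSearchXml(fetchXml('http://api.example.com/search?' . http_build_query(['query' => 'php crawler'])) ?? '')`, with the placeholder URL replaced by a real endpoint.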

    As for a free web host that supports the PHP functions you will need... I don't know of one, and it wouldn't surprise me if you have a hard time finding one.
     
    Edynas, Oct 16, 2007 IP
  6. rishirajsingh

    #6
    Initially I tried that, but I didn't find the Google AJAX Search API helpful for getting sponsored results.
    Since AJAX Search is based on JavaScript, I never get the HTML code for the sponsored results.
    When I view the source, there is no code for the search results, so I didn't find any way to get them.
    If there is any way to get the sponsored result code from Google AJAX Search, let me know.
     
    rishirajsingh, Oct 16, 2007 IP
  7. indianseo

    indianseo Peon

    Messages:
    208
    Likes Received:
    11
    Best Answers:
    0
    Trophy Points:
    0
    #7
    I don't think you can get Google sponsored results using the above method (AJAX).

    You can fetch the search result page using curl or fopen and extract the sponsored results with a regular expression match.
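
A minimal sketch of that curl + regex approach, under some loud assumptions: `fetchPage` and `extractLinks` are hypothetical names, and the idea that the links of interest carry a particular CSS class is a guess, since the real markup of sponsored results is not described anywhere in this thread. The regex is deliberately simple and only handles double-quoted attributes.

```php
<?php
/** Fetch a page with curl; returns null on failure. */
function fetchPage(string $url): ?string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_TIMEOUT => 15,
        CURLOPT_USERAGENT => 'Mozilla/5.0 (compatible; ExampleCrawler/0.1)', // placeholder
    ]);
    $body = curl_exec($ch);
    return is_string($body) ? $body : null;
}

/**
 * Collect every <a> whose class attribute contains $cssClass.
 * Assumption: the wanted links can be told apart by a class name.
 */
function extractLinks(string $html, string $cssClass): array
{
    $links = [];
    // Capture the attribute string and the inner HTML of each anchor.
    preg_match_all('~<a\s([^>]*)>(.*?)</a>~is', $html, $matches, PREG_SET_ORDER);
    foreach ($matches as $m) {
        $attrs = $m[1];
        if (!preg_match('~\bclass="([^"]*)"~i', $attrs, $class)) {
            continue;
        }
        if (!in_array($cssClass, preg_split('~\s+~', trim($class[1])), true)) {
            continue;
        }
        if (!preg_match('~\bhref="([^"]*)"~i', $attrs, $href)) {
            continue;
        }
        $links[] = [
            'url' => html_entity_decode($href[1]),
            'title' => trim(strip_tags($m[2])),
        ];
    }
    return $links;
}
```

Regexes like this break as soon as the page layout changes, so expect to revisit the pattern regularly.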

    I am trying to do the same. If anyone can suggest better alternatives, it would be helpful.
     
    indianseo, Oct 30, 2007 IP
  8. rishirajsingh

    #8
    I did it with PHP curl and regex.
    Thanks for the help, folks.
     
    rishirajsingh, Nov 12, 2007 IP
  9. LazyD

    LazyD Peon

    Messages:
    425
    Likes Received:
    4
    Best Answers:
    0
    Trophy Points:
    0
    #9
    Just a heads up: after a while you may hit a brick wall and be temporarily banned for querying their pages too many times. If you're serious about doing this for thousands of keywords, consider finding a list of working proxies...
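
If you do go the proxy route, curl can send a request through a proxy via `CURLOPT_PROXY`. A rough sketch, assuming a simple round-robin over a list you maintain yourself; `pickProxy` and `fetchViaProxy` are hypothetical names, and the addresses in the usage note are documentation placeholders, not working proxies.

```php
<?php
/** Round-robin: request number 0 uses the first proxy, 1 the second, and so on. */
function pickProxy(array $proxies, int $requestNumber): ?string
{
    if ($proxies === []) {
        return null;
    }
    $proxies = array_values($proxies); // ignore any custom keys
    return $proxies[$requestNumber % count($proxies)];
}

/** Fetch $url through $proxy ("host:port"); returns null on failure. */
function fetchViaProxy(string $url, string $proxy): ?string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_PROXY => $proxy,
        CURLOPT_TIMEOUT => 20, // free proxies tend to be slow
    ]);
    $body = curl_exec($ch);
    return is_string($body) ? $body : null;
}
```

For example, `pickProxy(['192.0.2.10:8080', '192.0.2.11:3128'], $n)` alternates between the two entries as `$n` counts up. A proxy that returns null a few times in a row is probably dead and worth dropping from the list.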
     
    LazyD, Nov 12, 2007 IP
  10. indianseo

    #10
    That's cool.
    It would be great if you could share it, thank you.
     
    indianseo, Nov 13, 2007 IP