Use PHP to scrape sites from Google/MSN/Yahoo.

Discussion in 'PHP' started by EFSolution, Nov 10, 2010.

  1. #1
    I'm working on a small project, something like a search engine.
    Is there any way I can scrape sites from Google / MSN / Yahoo without getting blocked?

    And a second question: which is the best PHP library for this, cURL or something else?

    Sorry for my English. I hope someone will help :)
     
    EFSolution, Nov 10, 2010 IP
  2. ThePHPMaster

    #2
    There are APIs that you can use to access most search engines.

    If you just want to scrape, sleep a few seconds between requests to avoid a ban (I'm not sure how many seconds, but you will find a workable value by trial and error).

    I would use cURL and randomize the request headers (referrer, session cookies, etc.).
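
    A minimal sketch of the advice above, assuming the cURL extension is loaded and PHP 7 or later. The function names, the 5 to 15 second delay range, and the idea of passing in pools of User-Agent and referrer strings are illustrative assumptions, not values known to avoid a block; check each search engine's terms of service before scraping it.

```php
<?php
// Hypothetical helper: pick a random pause (in seconds) between requests.
// The default bounds are guesses; tune them by trial and error.
function randomDelaySeconds(int $min = 5, int $max = 15): int
{
    return random_int($min, $max);
}

// Hypothetical helper: build cURL options with a randomly chosen
// User-Agent and Referer taken from caller-supplied pools.
function buildCurlOptions(string $url, array $userAgents, array $referers): array
{
    return [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,   // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,   // follow HTTP redirects
        CURLOPT_TIMEOUT        => 20,     // give up after 20 seconds
        CURLOPT_USERAGENT      => $userAgents[array_rand($userAgents)],
        CURLOPT_REFERER        => $referers[array_rand($referers)],
    ];
}

// Fetch each URL in turn, sleeping a random interval between requests.
// Returns an array of url => body, with null for requests that failed.
function fetchPolitely(array $urls, array $userAgents, array $referers): array
{
    $pages = [];
    $first = true;
    foreach ($urls as $url) {
        if (!$first) {
            sleep(randomDelaySeconds());
        }
        $first = false;

        $ch = curl_init();
        curl_setopt_array($ch, buildCurlOptions($url, $userAgents, $referers));
        $body = curl_exec($ch);
        $pages[$url] = ($body === false) ? null : $body;
        unset($ch); // release the handle before the next iteration
    }
    return $pages;
}
```

    Keeping the option building separate from the network call makes it easy to inspect what will be sent before any request goes out.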
     
    ThePHPMaster, Nov 10, 2010 IP
  3. EFSolution

    #3
    @ThePHPMaster

    Thanks, buddy. But these APIs have many limitations,
    like 5000 per day or 500 per term.

    I'm trying to get a complete list of a certain kind of site,
    and Google only lets you retrieve 600-700 results in total. :(
     
    EFSolution, Nov 12, 2010 IP
  4. i.am.a.pro

    #4
    I can give you a link to a tutorial on the Google API.
     
    i.am.a.pro, Nov 12, 2010 IP
  5. jazzcho

    #5
    You need proxy support in your script or you will be disappointed.

    And since I've already done things like that, here's my advice: make your life easy and use Zend Framework, especially for the network part (fetching the pages) and the parsing part (using the DOM rather than regular expressions).
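
    A rough sketch of the two points above: routing requests through a proxy and parsing with the DOM instead of regular expressions. Note the substitution: it uses PHP's bundled cURL and DOM extensions rather than Zend Framework, so it runs without extra dependencies. The function names are hypothetical, and the proxy host and port are whatever proxy you have access to.

```php
<?php
// Hypothetical helper: cURL options that send a request through a proxy.
// Merge the result into the option array passed to curl_setopt_array().
function proxyCurlOptions(string $proxyHost, int $proxyPort): array
{
    return [
        CURLOPT_PROXY     => $proxyHost,
        CURLOPT_PROXYPORT => $proxyPort,
    ];
}

// Hypothetical helper: extract every <a href="..."> from an HTML string
// by walking the DOM with XPath rather than matching with a regex.
function extractLinks(string $html): array
{
    if (trim($html) === '') {
        return []; // loadHTML() rejects empty input on PHP 8
    }

    $doc = new DOMDocument();
    // Real-world pages are rarely valid HTML; collect parser warnings
    // internally instead of emitting them, then restore the old setting.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $links = [];
    foreach ($xpath->query('//a[@href]') as $anchor) {
        $links[] = [
            'href' => $anchor->getAttribute('href'),
            'text' => trim($anchor->textContent),
        ];
    }
    return $links;
}
```

    The XPath expression `//a[@href]` is only a starting point; a real result page needs a more specific query, and that query will break whenever the page markup changes.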
     
    jazzcho, Nov 14, 2010 IP