Crawling script

Discussion in 'PHP' started by RFlame, Aug 2, 2008.

  1. #1
    How complicated would it be to write one?
    Basically a content grabber from a specific site with a static layout but varying title/content. It would grab only certain pieces of information related to what I want, more or less extracting 2 strings from each page.
    Can anyone give me a ballpark on the difficulty/time required? Has anyone done something like this before?
     
    RFlame, Aug 2, 2008 IP
  2. Danltn (Well-Known Member)
    #2
    Very simple: just grab the page up to the bit you need and use a regular expression to get the string in question.
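
    A minimal sketch of that approach - the URL and the pattern are placeholders for whatever site and markup you're actually targeting:

    <?php
    // Fetch the remote page (placeholder URL).
    $html = file_get_contents('http://example.com/page.html');

    // Grab the text inside <title> as an example; swap in a pattern
    // that matches the site's actual markup around your target strings.
    if (preg_match('#<title>(.*?)</title>#is', $html, $m)) {
        echo trim($m[1]);
    }
    ?>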

    Dan
     
    Danltn, Aug 2, 2008 IP
  3. RFlame (Peon)
    #3
    I would like to extend it though, so that it crawls automatically, grabs the URLs linked within a page, can identify whether those URLs are valid/relevant, and stores everything for me to check on when I want.
     
    RFlame, Aug 2, 2008 IP
  4. Pos1tron (Peon)
    #4
    Time - depending on what you want and how familiar you are with the functions you need, perhaps an hour or a few.

    I used to create the PHP side of the rating tag images for an RTS game (Rise of Legends), and that worked in a similar way - fetch the user's official stats page and pull the rating out of it using preg_match().

    Making the actual thing isn't too difficult - use file() or file_get_contents() (I used the former), and serialize() the stats into a txt file with the user's ID and a timestamp. For the next hour it reads that text file and unserialize()s it to get the user's stats; after the hour is up it reads the site again. This stops you unfairly abusing the site's resources.

    So yeah: get the file, use preg_match() or something similar, serialize() the content, and save it in a file (or save the individual pieces of info in a database if you want) along with a timestamp. On the first page load after an hour has passed, you redo all of that; before then, you just unserialize() the text file or request it from the database.
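
    Roughly like this - the user ID, file name, pattern, and one-hour window are illustrative, not the actual rating-tag code:

    <?php
    $userId    = 42;                          // whichever user you're tracking
    $cacheFile = 'stats_' . $userId . '.txt';
    $maxAge    = 3600;                        // re-read the site after an hour
    $stats     = null;

    if (file_exists($cacheFile) && (time() - filemtime($cacheFile)) < $maxAge) {
        // Cache is fresh: unserialize the stats saved on the last fetch.
        $stats = unserialize(file_get_contents($cacheFile));
    } else {
        // Cache is stale or missing: fetch the stats page and extract the value.
        $html = file_get_contents('http://example.com/stats/' . $userId);
        if (preg_match('#<td class="score">(\d+)</td>#i', $html, $m)) {
            $stats = array('score' => $m[1], 'fetched' => time());
            // Serialize to disk so hits within the next hour skip the fetch.
            file_put_contents($cacheFile, serialize($stats));
        }
    }
    ?>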

    You'll probably want to find something you can use to easily manage the crawling - at a guess, check PEAR for a suitable package.
     
    Pos1tron, Aug 2, 2008 IP
  5. RFlame (Peon)
    #5
    Thanks Pos1tron, will use that.
    The bigger issue is that the content on the site is always being updated. I'd want to index all the relevant pages into a link list every time I crawl, to pick up new/updated content.
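
    A rough sketch of that link-gathering step - the same-host check stands in for whatever "relevant" means for the target site:

    <?php
    $base = 'http://example.com';             // placeholder target site
    $html = file_get_contents($base . '/');

    // Collect every href on the page and keep the ones on the target host.
    $linklist = array();
    if (preg_match_all('#href="([^"]+)"#i', $html, $m)) {
        foreach ($m[1] as $url) {
            if ($url[0] == '/') {
                $url = $base . $url;          // resolve root-relative links
            }
            if (strpos($url, $base) === 0 && !in_array($url, $linklist)) {
                $linklist[] = $url;
            }
        }
    }

    // Store the list to feed the next crawl / review later.
    file_put_contents('linklist.txt', serialize($linklist));
    ?>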
     
    RFlame, Aug 2, 2008 IP