Adding a Spider to a directory?

Discussion in 'Directories' started by Ernster, Aug 10, 2005.

  1. #1
    I was thinking about adding a spider/bot to my directory that would spider the internal pages of the websites I have accepted into my directory.

    I have no clue how to do this. Can anyone help me? Thanks.
     
    Ernster, Aug 10, 2005 IP
  2. Nitin M (White/Gray/Black Hat)

    #2
    Do you have some programming skills? There may be some tools that work totally out of the box without programming but the ones we've been evaluating all require a fair amount of coding expertise to get working well.

    You'll need to integrate 2 different components: 1) a spider to collect and index the pages, and 2) a search interface to allow the end user to submit queries against the collection.

    Almost always the spider will come with the search, but just pointing out that it will likely take some effort in both areas to pull off what you're attempting.
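
    To make that split concrete, here's a rough Python sketch of the two pieces working against a toy in-memory index. None of this is how the commercial tools below work; the function names and the single test URL are just placeholders:

    Code:
    import re
    import urllib.request

    index = {}  # toy inverted index: word -> set of URLs containing it

    def spider(url):
        """Component 1: fetch a page and index its words."""
        html = urllib.request.urlopen(url).read().decode("utf-8", "ignore")
        text = re.sub(r"<[^>]+>", " ", html)  # crude tag stripping
        for word in set(text.lower().split()):
            index.setdefault(word, set()).add(url)

    def search(query):
        """Component 2: return URLs matching every query word."""
        hits = [index.get(w, set()) for w in query.lower().split()]
        return set.intersection(*hits) if hits else set()

    spider("http://www.example.com/")
    print(search("example domain"))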

    We've been working with many of these types of tools and weren't impressed with any of the freebies out there. On the real-money side of the house, here are three we really like:

    dtsearch: $999 per domain I think (www.dtsearch.com)

    ISYS: $10k per server (www.isys-search.com)

    DocLinx: $10k per CPU (www.doclinx.com)

    Spiders and indexing have been my life for several months now so feel free to follow up with me as you get more into it. Good luck.
     
    Nitin M, Aug 10, 2005 IP
  3. dvduval (Notable Member)

    #3
    My directory script has this built in: it spiders the links and validates them.
    Are you trying to actually capture a cache of the pages, or just validate?
     
    dvduval, Aug 10, 2005 IP
  4. Ernster (Peon)

    #4
    I have no programming skills at all and would not be able to pay a programmer to do this for me. I don't even understand what I'd do if I bought a script or how it works; I just really would like to make this happen with my directory. I am using WSN Links for my directory. I'd like a FREE or very cheap solution to this.

    Any additional help would be appreciated (I will also ask on the WSN Links forum).
     
    Ernster, Aug 10, 2005 IP
  5. dvduval (Notable Member)

    #5
    Well, we have converted several WSN Links sites over to php link directory. Take a look and let me know if it has the features you need. You never answered my question, though: are you just trying to validate the links to make sure they work? If so, this is built into the admin area, and you can validate the whole site or individual categories.
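
    At its simplest, "validating" a link just means requesting it and checking the HTTP status. A minimal sketch (the function name, user-agent string, and test URL are made up, and a real checker would fall back to GET for servers that reject HEAD):

    Code:
    import urllib.error
    import urllib.request

    def link_is_alive(url, timeout=10):
        """Return True if the URL answers with a non-error HTTP status."""
        req = urllib.request.Request(url, method="HEAD",
                                     headers={"User-Agent": "LinkChecker/0.1"})
        try:
            with urllib.request.urlopen(req, timeout=timeout) as resp:
                return resp.status < 400
        except Exception:  # 4xx/5xx responses, timeouts, DNS failures
            return False

    print(link_is_alive("http://www.example.com/"))  # True if the site is up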
     
    dvduval, Aug 10, 2005 IP
  6. Ernster (Peon)

    #6
    Sorry, no, I don't think I am. I'll try to explain it better.

    When someone submits a site to my directory (just the main URL), I review it and accept it; no problems there...

    What I want to do is have a spider crawl every accepted site and then automatically submit the internal pages to my directory.
     
    Ernster, Aug 10, 2005 IP
  7. dvduval (Notable Member)

    #7
    Ok, I got you. That is no simple request. It would easily run into the hundreds of dollars, because there are so many considerations, including but not limited to:
    1) Session IDs in urls
    2) Depth of crawl
    3) domain name with and without www
    4) Bandwidth and care to only consume so much at a time
    5) Link Validation
    6) Respecting robots.txt
    7) Respecting the "nofollow" attribute
    Once you start running a crawler, it becomes a full-time job, and a crawl engineer could surely add more to my list. At this point, I am not aware of any "simple" spiders, except where you are pulling down just one page. If anyone knows of a "simple" spider that crawls a site, let me know.

    One final note: it might be possible to do something simple, like setting the limit to 10 pages and just grabbing the URL and page title if it exists, but even then there will be quality concerns.
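
    To show how a few of those considerations stack up, here's a bare-bones Python sketch of that 10-page idea: same-host crawl only (3), a hard page cap standing in for bandwidth care (4), dead links skipped (5), robots.txt checked (6), and nofollow respected (7). The bot name and page limit are made-up values, and regex HTML parsing is a shortcut no production crawler should rely on:

    Code:
    import re
    import urllib.request
    import urllib.robotparser
    from urllib.parse import urljoin, urlparse

    USER_AGENT = "DirectoryBot/0.1"  # hypothetical bot name
    PAGE_LIMIT = 10                  # cap how much we pull per site

    def crawl(start_url):
        """Fetch up to PAGE_LIMIT same-host pages and return {url: title}."""
        host = urlparse(start_url).netloc
        robots = urllib.robotparser.RobotFileParser(urljoin(start_url, "/robots.txt"))
        robots.read()
        seen, queue, titles = set(), [start_url], {}
        while queue and len(titles) < PAGE_LIMIT:
            url = queue.pop(0)
            if url in seen or not robots.can_fetch(USER_AGENT, url):
                continue  # already fetched, or robots.txt says no
            seen.add(url)
            try:
                req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
                html = urllib.request.urlopen(req).read().decode("utf-8", "ignore")
            except Exception:
                continue  # dead or unreadable link: skip it
            m = re.search(r"<title[^>]*>(.*?)</title>", html, re.I | re.S)
            titles[url] = m.group(1).strip() if m else url
            for a_tag, href in re.findall(
                    r"""(<a\s[^>]*href=["']([^"']+)["'][^>]*>)""", html, re.I):
                if "nofollow" in a_tag.lower():
                    continue  # respect rel="nofollow"
                link = urljoin(url, href)
                if urlparse(link).netloc == host:
                    queue.append(link)  # stay on the same host
        return titles

    for url, title in crawl("http://www.example.com/").items():
        print(url, "-", title)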
     
    dvduval, Aug 11, 2005 IP
  8. egdcltd (Peon)

    #8
    egdcltd, Aug 11, 2005 IP