how do you make your own crawler?

Discussion in 'All Other Search Engines' started by nhoss2, Sep 6, 2008.

  1. #1
    How do you make your own crawler? Is there any software that does it, or is it something else?
     
    nhoss2, Sep 6, 2008 IP
  2. einfoway

    einfoway Member

    Messages:
    83
    Likes Received:
    0
    Best Answers:
    0
    Trophy Points:
    26
    #2
    Mainly it's done via scripts, if I'm not wrong.
     
    einfoway, Sep 7, 2008 IP
  3. searchia

    searchia Well-Known Member

    Messages:
    1,179
    Likes Received:
    16
    Best Answers:
    0
    Trophy Points:
    115
    #3
    You can use Sphider: sphider.eu
     
    searchia, Sep 7, 2008 IP
  4. firmaterra

    firmaterra Peon

    Messages:
    756
    Likes Received:
    16
    Best Answers:
    0
    Trophy Points:
    0
    #4
    Depends what you want the crawler to do.

    If you want to index your own site or maybe some documents, take a look at Lucene or Nutch. Nutch can start crawling the web for you, so long as you have the hardware to support it :D

    It's not hard to write a crawler in any language. A crawler simply trawls the web / directory / whatever you've pointed it at and downloads the URLs. The hard part is processing the results and analyzing the pages you've crawled.
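    To show how small the basic loop really is, here's a rough Python sketch (the start URL and page budget are whatever you choose; the regex link extraction is naive, and a real crawler should use a proper HTML parser and honor robots.txt):

```python
import re
import urllib.parse
import urllib.request
from collections import deque

def extract_links(html, base_url):
    """Pull absolute http(s) links out of raw HTML (naive regex approach)."""
    links = []
    for href in re.findall(r'href=["\'](.*?)["\']', html):
        link = urllib.parse.urljoin(base_url, href)
        if link.startswith("http"):
            links.append(link)
    return links

def crawl(start_url, max_pages=10):
    """Breadth-first crawl: download a page, queue its links, repeat."""
    seen = {start_url}
    queue = deque([start_url])
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                html = resp.read().decode("utf-8", errors="replace")
        except Exception:
            continue  # skip pages that fail to download
        pages[url] = html
        for link in extract_links(html, url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

    That's the easy half; everything after `return pages` (parsing, deduplication, ranking) is where the real work starts.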

    If you go anywhere near search engine development you'll find most of your time is spent figuring out what to do with the data you've obtained, rather than obtaining that data :)
     
    firmaterra, Sep 7, 2008 IP
  5. seohelp

    seohelp Peon

    Messages:
    529
    Likes Received:
    8
    Best Answers:
    0
    Trophy Points:
    0
    #5

    Nice, I like your reply.
     
    seohelp, Sep 8, 2008 IP
  6. ASPMachine

    ASPMachine Peon

    Messages:
    723
    Likes Received:
    15
    Best Answers:
    0
    Trophy Points:
    0
    #6
    Making a crawler is not as easy as some people think. It's hard work and takes a lot of time to develop. When we crawl a website, we run into lots of problems with data refreshing and processing.

    You can index sites easily if you only index the meta tags. Look for the <meta tag and its Description attribute: after the opening quote ("), start reading and stop at ">. That gives you the meta description. Reading the title is also easy... just read the text between <title> and </title>.
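    That extraction step can be sketched with two small regex helpers (a hedged sketch; regexes are fine for well-formed pages, but attribute order and quoting vary between sites, so a real indexer should use an HTML parser):

```python
import re

def extract_title(html):
    """Read the text between <title> and </title>, as described above."""
    m = re.search(r"<title>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
    return m.group(1).strip() if m else None

def extract_meta_description(html):
    """Read the content attribute of <meta name="description" ...>."""
    m = re.search(
        r'<meta\s+name=["\']description["\']\s+content=["\'](.*?)["\']',
        html, re.IGNORECASE | re.DOTALL)
    return m.group(1).strip() if m else None
```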

    The above only covers meta indexing. But if a website has no meta description or keywords, how can you index it? For this reason, lots of companies build their engines as meta search engines. Here you need to index from the body as well.

    I've also seen lots of ready-made software companies guarantee indexing from <b>, <h>, <h1>, <i> etc., but some of them fail to keep their word. I've also found lots of source code indexed instead of the actual page contents. Even our crawler has indexed errors and some source code. Because of these kinds of faults, we have to keep redeveloping our crawler.

    A short history of our crawler:

    (1) We first crawled sites by meta description. But we found lots of pages have the same meta description, so we decided to index from the page body instead.
    (2) Next we focused on the page body, indexing from the <p> to the </p> tag. But those results weren't satisfactory, because inside <p> and </p> there can be lots of other markup like <table>, <font> etc., which was very difficult to remove from the indexed data. If you search through BoroLook.com, you'll still find some results where source code is visible.
    (3) Now, at the final stage, we're proud that our crawler is powerful enough to index any site, whether or not the page has a meta description, without making mistakes or capturing source code.
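    The tag-stripping problem from step (2) comes down to removing all markup before storing the text. A minimal sketch (regex-based; an HTML parser is more robust on malformed pages):

```python
import re

def clean_body_text(html):
    """Strip nested markup (<table>, <font>, <script>, etc.) from body
    text before indexing, so no source code ends up in the description."""
    # drop script/style blocks entirely, including their contents
    html = re.sub(r"<(script|style)\b.*?</\1>", " ", html,
                  flags=re.IGNORECASE | re.DOTALL)
    # remove every remaining tag, keeping only the text between them
    text = re.sub(r"<[^>]+>", " ", html)
    # collapse runs of whitespace left behind by the removed tags
    return re.sub(r"\s+", " ", text).strip()
```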

    Since last month, we index only from the page body, and only if we can't get any page contents do we fall back to the meta description. You can see this in the indexed results....... all the site descriptions are collected from the page body, selected somewhat randomly.

    So, web owners who want to make their own crawler often face these kinds of difficulties.

    If you're using a script that retrieves data from Yahoo, MSN or Google, like Dogpile does, then you don't need to build any database or crawler. But if you really want to make a search engine with your own crawler and database, you'll need to do the hard work.

    The other problem is the QUERY. How will you show results when a user types a keyword into your search box? Google, Yahoo and MSN rank results based on PR, backlinks, content freshness etc., and their query technology is also very good. So if you design your result-page query properly, you'll be on the road to success.
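    At its simplest, a result-page query is just scoring and sorting. A toy sketch (term-frequency only; real engines combine many signals like links and freshness, so this only shows the basic shape):

```python
def rank(pages, keyword):
    """pages: dict of url -> body text. Return matching urls sorted by
    how often the keyword appears (ties broken alphabetically)."""
    scores = {url: body.lower().count(keyword.lower())
              for url, body in pages.items()}
    hits = [(url, s) for url, s in scores.items() if s > 0]
    return [url for url, s in sorted(hits, key=lambda x: (-x[1], x[0]))]
```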

    Another thing is the DATABASE. Your result-serving speed will depend on the database and on the retrieval query you're using. A week ago our search was very slow, taking 3-4 seconds to open the results page. We've just been transferring the data to MS SQL Server. MS SQL Server is very good to use and the system works fast: it caches your data in RAM, so when you retrieve results, it delivers them quickly.

    From my own experience with the database (MS SQL), if you really want to speed things up, keep the database connection open. Don't call (rs.Close) before the next query on the same page.
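    The connection-reuse point generalizes beyond ASP/MS SQL. A sketch with Python and SQLite (an in-memory stand-in database; the table and data are made up): one connection is opened once and reused for every query instead of being opened and closed per request.

```python
import sqlite3

# Open one long-lived connection up front and reuse it for every search
# query (the equivalent of not closing the recordset between queries).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (url TEXT, body TEXT)")
conn.execute("INSERT INTO pages VALUES ('http://example.com', 'hello crawler world')")
conn.commit()

def search(keyword):
    # reuses the open connection; no connect/disconnect overhead per query
    cur = conn.execute("SELECT url FROM pages WHERE body LIKE ?",
                       ("%" + keyword + "%",))
    return [row[0] for row in cur.fetchall()]
```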

    From my knowledge, only a very few search engines (Yahoo, Google, MSN, AOL etc.) use their own databases; all the other sponsored search engines, premium Google Custom Search Engines, and scripts showing results from those engines are retrieving data from these four big search engines. So it doesn't matter which search engine (other than Google, Yahoo, MSN and AOL) you type your keyword into; the results in front of you will be the same. If a website uses a meta description, you'll get the same result at all those search engines.

    So my aim is to show results that are totally different from those search engines. If I take the meta description, the results will be the same, so we take the description from the page body instead. That way, I'm hoping our results won't match any other search engine.


    Anyway, I'm always happy to help anyone thinking about making their own search engine (database & crawler). I hope you'll succeed.

    Thanks & Regards

    Arun
     
    ASPMachine, Sep 8, 2008 IP
  7. firmaterra

    firmaterra Peon

    Messages:
    756
    Likes Received:
    16
    Best Answers:
    0
    Trophy Points:
    0
    #7
    Hi Arun,

    have you encountered any problems with MySQL and the volume of data you're processing? I've used flat-file databases, mainly targeted towards write-once, read-many operations. MySQL is more of a write-often database, so that's why I'm curious.

    I've also had to leave databases open until crawling is complete, to save the opening/closing overhead. When you say you transfer the data to RAM, how much data can you transfer?

    Our databases were getting so big that we had to introduce 'shards' that get written to the main database after crawling. We've just split those shards further into smaller shards to increase the speed.

    We find a lot of our time is spent backing up the data. This is done automatically over the network, so shard 'a' gets written to several different nodes, ensuring shard 'a' always exists somewhere in case a server ever goes down.
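    That replicated shard-to-node placement can be sketched as a simple hash scheme (a hedged sketch, not firmaterra's actual setup; the node names and replica count are made up):

```python
import hashlib

NODES = ["node-a", "node-b", "node-c"]  # hypothetical storage servers
REPLICAS = 2  # each shard is kept on this many distinct nodes

def nodes_for_shard(shard_id):
    """Deterministically pick REPLICAS distinct nodes for a shard,
    so the shard still exists somewhere if one server goes down."""
    h = int(hashlib.sha1(shard_id.encode()).hexdigest(), 16)
    start = h % len(NODES)
    return [NODES[(start + i) % len(NODES)] for i in range(REPLICAS)]
```

    Because placement is a pure function of the shard id, any machine can recompute where a shard lives without a central lookup table.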
     
    firmaterra, Sep 8, 2008 IP
  8. nhoss2

    nhoss2 Peon

    Messages:
    310
    Likes Received:
    4
    Best Answers:
    0
    Trophy Points:
    0
    #8
    wow, thanks for your reply, you make it sound easy...
     
    nhoss2, Sep 9, 2008 IP
  9. Swapnil

    Swapnil Peon

    Messages:
    115
    Likes Received:
    1
    Best Answers:
    0
    Trophy Points:
    0
    #9
    :eek: :eek: :eek:

    To develop a crawler you should have:
    1) A back-end server system
    2) C programming knowledge
    3) A group of people who manage the index
    4) An efficient INDEX ALGORITHM

    There are lots more factors needed for efficient indexing and crawler development.
     
    Swapnil, Sep 9, 2008 IP