How do you create a search engine?

Discussion in 'All Other Search Engines' started by TriKri, Jul 9, 2007.

  1. LifeIsRisky

    LifeIsRisky Peon

    #21
    Thanks for the info.
     
    LifeIsRisky, Jul 28, 2007 IP
  2. ajitjc

    ajitjc Banned

    #22
    I'm also looking for an answer to this same question.
     
    ajitjc, Jul 28, 2007 IP
  3. TriKri

    TriKri Peon

    #23
    Suppose I want to write my web crawler as a C++ program. The program should be able to fetch the entire HTML code of a page; how do I get it if all I have is the address of the web page? Can I make a request for a file over the internet?

    Next question: how can I make a search page work together with another C++ program that performs the search query?
     
    TriKri, Oct 14, 2007 IP
  4. aidantrent

    aidantrent Peon

    #24
    First, C++ is a terrible language for this sort of task. A web crawler is essentially a text parser, and C/C++ are some of the worst languages for text manipulation imaginable. A web crawler will always be bottlenecked by network speed, so using a compiled language is a stunning waste of resources. If you really want to do this, use Perl or some other string-friendly language that supports regular expressions.

    That said, you would use a network library to handle the HTTP connections. Which one you use depends on your platform and your preferences. Personally, I like libcurl when I write network code in C, but I'm not sure if they have C++ bindings. Search around, there are tons of options available. That will get you the HTML page; it's up to you to parse it and store it in your index.
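    To give you an idea, a fetch with libcurl's plain C API (which you can call from C++ directly) looks roughly like this. It's an untested sketch from memory, the URL is just a placeholder, and most error checking is stripped out:

    Code:
    #include <curl/curl.h>
    #include <string>
    #include <iostream>

    // libcurl hands us the response body in chunks; append them to a std::string.
    static size_t write_cb(char* data, size_t size, size_t nmemb, void* userp) {
        static_cast<std::string*>(userp)->append(data, size * nmemb);
        return size * nmemb;
    }

    int main() {
        curl_global_init(CURL_GLOBAL_DEFAULT);
        CURL* curl = curl_easy_init();
        std::string html;

        curl_easy_setopt(curl, CURLOPT_URL, "http://example.com/");  // placeholder URL
        curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, write_cb);
        curl_easy_setopt(curl, CURLOPT_WRITEDATA, &html);
        curl_easy_setopt(curl, CURLOPT_FOLLOWLOCATION, 1L);          // follow redirects

        if (curl_easy_perform(curl) == CURLE_OK)                     // blocking fetch
            std::cout << html;                                       // raw HTML, ready to parse

        curl_easy_cleanup(curl);
        curl_global_cleanup();
        return 0;
    }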

    As for communicating between a search daemon and your web interface, it will depend on your platform. I run a search engine on Linux and have a search daemon set up to listen on a dedicated port; the web interface then connects to that port to execute queries. That's probably the most common architecture for this sort of problem, but you could handle the IPC with any other method you're comfortable with.
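    The web-interface side of that setup is just an ordinary socket client. Here's a bare-bones sketch; the port number and the one-line query protocol are invented for the example, and error handling is minimal:

    Code:
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <sys/socket.h>
    #include <unistd.h>
    #include <iostream>
    #include <string>

    int main() {
        // Connect to a search daemon assumed to be listening on localhost:4711.
        int fd = socket(AF_INET, SOCK_STREAM, 0);

        sockaddr_in addr{};
        addr.sin_family = AF_INET;
        addr.sin_port   = htons(4711);
        inet_pton(AF_INET, "127.0.0.1", &addr.sin_addr);

        if (connect(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) != 0) {
            std::cerr << "daemon not reachable\n";
            return 1;
        }

        std::string query = "QUERY some search terms\n";   // send the query terms
        write(fd, query.data(), query.size());

        char buf[4096];
        ssize_t n;
        while ((n = read(fd, buf, sizeof(buf))) > 0)        // read results until the daemon closes
            std::cout.write(buf, n);

        close(fd);
        return 0;
    }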

    If you're interested in this sort of thing, W. Richard Stevens' classic UNIX Network Programming books are must-reads. Volume 1 deals with typical network transactions and will tell you everything you need to know to fetch a web page from a C program. Volume 2 is about Inter-Process Communication and explains how two applications (such as a search daemon and a web interface) can exchange information.
     
    aidantrent, Oct 14, 2007 IP
  5. aidantrent

    aidantrent Peon

    #25
    Again, here are the two books you will absolutely need to read in order to implement something like this from scratch:

    UNIX Network Programming, Volume 1: Networking APIs
    UNIX Network Programming, Volume 2: Interprocess Communication

    Obviously, this only handles the communication aspects of a search engine. You still need to know how to write an efficient full-text search/sort algorithm ;)

    If that sounds like too much, just search around for an off-the-shelf indexer and crawler. There are plenty to choose from.
     
    aidantrent, Oct 14, 2007 IP
  6. TriKri

    TriKri Peon

    #26
    Hm, I don't know any compilable programming languages other than C, C++ and Visual Basic. Is Perl anything like C++? I think I've heard that Google uses Python a lot? I don't know if that's true or if I'm remembering wrong.

    And what is a search daemon? Is it used to offload some of the work onto the client's computer? Otherwise I was thinking of using a PHP page on the server side that calls a compiled program, but maybe that's one step more than necessary; maybe you could execute the program directly instead of going through a PHP page first. Is a daemon used for communicating directly from a web page with a program that is already running?

    Thanks
     
    TriKri, Oct 14, 2007 IP
  7. firmaterra

    firmaterra Peon

    #27
    TriKri,

    Google does indeed use Python.

    If you're looking for their system as it was originally built, have a read of this page:
    http://infolab.stanford.edu/~backrub/google.html


    If you're looking to build a serious search engine capable of crawling the entire web (something like 150 billion individual web pages that you'll have to store somewhere), you need to forget about SQL. It would just be horribly slow, even on the most basic disk writes. You'll only be able to iterate over the data once while crawling and writing links, or it will take you forever and a day to build any kind of map of the internet. Have a look at http://www.oracle.com/database/berkeley-db/je/index.html as an example of the type of database you'll require, and be sure to read the licensing requirements! You'll be using embeddable databases like these because of the size of the data involved. In fact you'll have several such databases covering areas such as URL to document ID (the ID of the document you downloaded), a links database, and a URL information database. You then index those databases and invert them to create a searchable store that you can query quickly.
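    To make that "invert" step concrete, here's a toy in-memory version: each word maps to the set of document IDs containing it. The two documents are made up, and a real engine would keep this structure in one of those embedded databases rather than in RAM:

    Code:
    #include <iostream>
    #include <map>
    #include <set>
    #include <sstream>
    #include <string>

    int main() {
        // Pretend these came out of the crawler: document ID -> extracted text.
        std::map<int, std::string> docs = {
            {1, "how to build a search engine"},
            {2, "build a web crawler in c++"},
        };

        // Inverted index: each word maps to the set of documents containing it.
        std::map<std::string, std::set<int>> index;
        for (const auto& [id, text] : docs) {
            std::istringstream words(text);
            std::string w;
            while (words >> w)
                index[w].insert(id);            // posting list for this word
        }

        // Query: which documents contain "build"?  Prints doc 1 and doc 2.
        for (int id : index["build"])
            std::cout << "doc " << id << "\n";
        return 0;
    }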

    If all you want to do is test an algorithm you think is better, then I would recommend you play around with http://lucene.apache.org/nutch/
    This is an open-source search engine, and for testing an algorithm it would be much easier to alter the existing one than to build a search engine from the ground up. Nutch also includes Lucene, an open-source indexer that automatically creates hit lists of the words in all your documents (i.e. the HTML files you downloaded while crawling!).
     
    firmaterra, Oct 15, 2007 IP
  8. SE_Researcher

    SE_Researcher Peon

    #28

    C++ a terrible language for this sort of task? I'm not sure that's very accurate, given that Google was created with a mixture of C++ and Python.

    Compiled language = network speed bottleneck? I don't agree with that either, and I'm not sure how using a compiled language is a stunning waste of resources.

    If you want to build a search engine, you have to decide whether you are going to store web pages or only reference them.

    You can either:

    - download and parse the websites at the same time, storing them in memory, or
    - download the websites to disk and have a separate parser program parse and index them.

    What parsing does is take out all of the words of a particular web page and index them so that they can be queried by someone.
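    Roughly this kind of thing; a crude C++ sketch just to show what "taking out the words" means, since real HTML needs a proper parser:

    Code:
    #include <cctype>
    #include <iostream>
    #include <string>
    #include <vector>

    // Crude word extraction: skip anything inside <...>, lowercase the rest,
    // and split on non-letters. Only meant to show what the parsing step produces.
    std::vector<std::string> extract_words(const std::string& html) {
        std::vector<std::string> words;
        std::string current;
        bool in_tag = false;

        for (char c : html) {
            if (c == '<') { in_tag = true;  continue; }
            if (c == '>') { in_tag = false; continue; }
            if (in_tag)     continue;

            if (std::isalpha(static_cast<unsigned char>(c))) {
                current += static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
            } else if (!current.empty()) {
                words.push_back(current);
                current.clear();
            }
        }
        if (!current.empty()) words.push_back(current);
        return words;
    }

    int main() {
        for (const auto& w : extract_words("<html><body><h1>Hello, World</h1></body></html>"))
            std::cout << w << "\n";   // prints: hello, then world
    }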

    Parsing should be fast, indexing should be fast, and querying the indexed data should be fast.

    This becomes increasingly difficult with large data sets, so database design is important.

    The Google model uses many low-end computers to download and parse websites; the data is then stored across a number of distributed databases. This architecture lets Google crawl hundreds or thousands of pages per second easily.

    However, if you're starting with one system that has to do everything, it's advisable to keep the dataset small. Parsing eats up vital processing time, processing time that is needed when a user queries the database.
     
    SE_Researcher, Oct 17, 2007 IP
  9. bluegrass special

    bluegrass special Peon

    #29
    I don't think he is saying that compiled languages = network bottlenecks. I think what he is saying is that network speeds are such a bottleneck that the normal benefits of a compiled language are lost, because the compiled code will always be waiting on input rather than processing. I think that is a result of how the poster wrote his search engine rather than being generally true. As for C++ being a terrible language: given how I think this poster might have written his search engine, it would be. However, there is more than one way to skin a cat.
     
    bluegrass special, Oct 17, 2007 IP
  10. marco786

    marco786 Banned

    #30
    I have gone through all the replies to this post, but I am going to explain the technical side of a web crawler and a search engine.
    A web crawler can be written in any high-level programming language. You can use Java, C++ or Python to build one, though I would suggest Java for the flexibility and the range of libraries available for writing a crawler.
    A crawler first picks some starting site and records all of its outgoing links. It then follows those links using either a depth-first or a breadth-first strategy. In the depth-first approach it picks one outgoing link on a page, follows it to the next page, records all of that page's outgoing URLs, and repeats the same thing with one URL from the current page, up to some predefined depth limit.
    In the breadth-first approach, the crawler simply follows all the links on a page, then all the links on each of those pages, and so on. A crawler is also called a robot, or just a bot.
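    A breadth-first crawl loop looks roughly like this. The in-memory "web" here is made up so the example runs on its own; in a real crawler the outgoing links would come from downloading and parsing each page:

    Code:
    #include <cstddef>
    #include <iostream>
    #include <map>
    #include <queue>
    #include <set>
    #include <string>
    #include <vector>

    // A fake "web" for illustration: URL -> outgoing links found on that page.
    std::map<std::string, std::vector<std::string>> web = {
        {"a.example", {"b.example", "c.example"}},
        {"b.example", {"c.example", "d.example"}},
        {"c.example", {}},
        {"d.example", {"a.example"}},
    };

    int main() {
        // Breadth-first crawl: record every link on the current page before going
        // deeper, and stop after a fixed number of pages so the crawl terminates.
        std::queue<std::string> frontier;
        std::set<std::string> seen;
        const std::size_t limit = 100;

        frontier.push("a.example");
        seen.insert("a.example");

        while (!frontier.empty() && seen.size() < limit) {
            std::string url = frontier.front();
            frontier.pop();
            std::cout << "crawled " << url << "\n";

            for (const auto& link : web[url])       // outgoing links of this page
                if (seen.insert(link).second)       // enqueue only unseen URLs
                    frontier.push(link);
        }
    }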
    Once this is clear, I'll explain in my next post how the search engine algorithm works.
     
    marco786, Sep 27, 2008 IP
  11. Stellarchase

    Stellarchase Guest

    #31
    My friend built a search engine where the spider was a PHP file that kept executing after his cron job opened it. Once it finished, it would call exec (or something like that) to restart itself. I'm sure this was a waste of resources, having to go back and forth between PHP and C (the language PHP is written in, if I'm not mistaken?), but it spidered in a fairly timely manner, and when it was live the search engine was pretty accurate.
     
    Stellarchase, Sep 27, 2008 IP