Suppose I would like to make my web crawler a C++ program. The program should be able to get the entire HTML code of a page; how do I get it if I only have the address of the web page? Can I make a request for a file over the internet? And on to the next question: how can I make a search page work together with another C++ program that performs the search query?
First, C++ is a terrible language for this sort of task. A web crawler is essentially a text parser, and C/C++ are some of the worst languages for text manipulation imaginable. A web crawler will always be bottlenecked by network speed, so using a compiled language is a stunning waste of resources. If you really want to do this, use Perl or some other string-friendly language that supports regular expressions.

That said, you would use a network library to handle the HTTP connections. Which one you use depends on your platform and your preferences. Personally, I like libcurl when I write network code in C, but I'm not sure whether it has C++ bindings. Search around; there are tons of options available. That will get you the HTML page; it's up to you to parse it and store it in your index.

As for communicating between a search daemon and your web interface, that will also depend on your platform. I run a search engine on Linux and have a search daemon set up to listen on a special port; the web interface connects to that port to execute queries. That's probably the most common architecture for this sort of problem, but you could handle the IPC with any other method you feel comfortable with.

If you're interested in this sort of thing, W. Richard Stevens's classic UNIX Network Programming books are must-reads. Volume 1 deals with typical network transactions and will tell you everything you need to know to fetch a web page from a C program. Volume 2 is about Inter-Process Communication and explains how two applications (such as a search daemon and a web interface) can exchange information.
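For what it's worth, here is a minimal sketch of what fetching a page with libcurl looks like from C++ (the plain C API works fine from C++). The URL is just a placeholder, and error handling is kept to a minimum; link with -lcurl.

// Minimal sketch: fetch a page's HTML with libcurl's C API from C++.
#include <curl/curl.h>
#include <iostream>
#include <string>

// libcurl calls this as data arrives; we append it to a std::string.
static size_t write_body(char* ptr, size_t size, size_t nmemb, void* userdata) {
    std::string* out = static_cast<std::string*>(userdata);
    out->append(ptr, size * nmemb);
    return size * nmemb;
}

int main() {
    curl_global_init(CURL_GLOBAL_DEFAULT);
    CURL* curl = curl_easy_init();
    std::string html;

    curl_easy_setopt(curl, CURLOPT_URL, "http://example.com/");   // placeholder address
    curl_easy_setopt(curl, CURLOPT_FOLLOWLOCATION, 1L);           // follow redirects
    curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, write_body);
    curl_easy_setopt(curl, CURLOPT_WRITEDATA, &html);

    CURLcode res = curl_easy_perform(curl);
    if (res == CURLE_OK)
        std::cout << html << std::endl;   // the raw HTML, ready for your parser
    else
        std::cerr << "fetch failed: " << curl_easy_strerror(res) << std::endl;

    curl_easy_cleanup(curl);
    curl_global_cleanup();
    return 0;
}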
Again, here are the two books you will absolutely need to read in order to implement something like this from scratch: UNIX Network Programming, Volume 1 (Networking APIs) and Volume 2 (IPC). Obviously, these only handle the communication aspects of a search engine; you still need to know how to write an efficient full-text search/sort algorithm. If that sounds like too much, just search around for an off-the-shelf indexer and crawler. There are plenty to choose from.
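To make the "search daemon listening on a special port" idea concrete, here is a bare-bones sketch of a process that accepts a connection, reads a query, and writes back a reply. The port number (9000) and the run_query() stub are made-up placeholders, and error checking is omitted for brevity; this is POSIX sockets, so it assumes Linux or similar.

// Sketch of a search daemon: the web interface connects to this port to run queries.
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
#include <string>

// Stand-in for the real index lookup -- not a real search implementation.
static std::string run_query(const std::string& q) {
    return "results for: " + q + "\n";
}

int main() {
    int srv = socket(AF_INET, SOCK_STREAM, 0);

    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(9000);              // arbitrary "special port"

    bind(srv, reinterpret_cast<sockaddr*>(&addr), sizeof(addr));
    listen(srv, 16);

    for (;;) {
        int client = accept(srv, nullptr, nullptr);    // web interface connects here
        char buf[1024] = {0};
        ssize_t n = read(client, buf, sizeof(buf) - 1);
        if (n > 0) {
            std::string reply = run_query(std::string(buf, n));
            write(client, reply.c_str(), reply.size());
        }
        close(client);
    }
}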
Hm, I don't know any compilable programming language other than C, C++ and Visual Basic. Is Perl somewhat like C++? I think I've heard that Google uses Python a lot? I don't know if that's true or if I'm remembering wrong. And what is a search daemon? Is it used to offload some of the work to the client's computer? Otherwise I thought of using a PHP page on the server side, which then calls a compiled program, but maybe that's one step more than necessary; maybe you could execute the program directly instead of calling a PHP page first. Is a daemon used for communicating directly from a web page with a program that is already running? Thanks
Trikri, Google does indeed use Python. If you're looking for how their system was originally built, have a read of this page: http://infolab.stanford.edu/~backrub/google.html

If you're looking to build a serious search engine capable of crawling the entire web (something like 150 billion individual web pages you'll have to store somewhere), you need to forget about SQL. It would just be horribly slow, even on the most basic disk writes. You'll be able to iterate over the data once and once only when crawling and writing links, or it'll take you forever and a day to build any kind of map of the internet.

Try having a look at http://www.oracle.com/database/berkeley-db/je/index.html as an example of the type of database you'll require, and be sure to read the licensing requirements! You'll be using embeddable databases like these for the volume of data you require. In fact, you'll have several such databases covering areas such as URL to DocumentID (the ID of the document you downloaded), a links database, and a URL information database. You will index those databases and invert them to create a searchable database that you can query quickly.

If all you want to do is test an algorithm that you think is better, then I would recommend you play around with http://lucene.apache.org/nutch/. This is an open-source search engine, and for testing an algorithm it would be much easier to alter and edit the existing one than to build a search engine from the ground up. Nutch also includes Lucene, an open-source indexer that will automatically create hit lists of words across all your documents (i.e. the HTML files you downloaded when crawling!).
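To illustrate the kind of data model described above (URL to DocumentID, a links database, an inverted index), here is a simplified in-memory sketch. The std::unordered_map containers are stand-ins only; at real scale each of these maps would live in an embedded store such as Berkeley DB, as the post recommends.

// Simplified sketch of the crawler's databases; in practice these would be
// embedded on-disk databases, not in-memory maps.
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

using DocId = std::uint64_t;

struct CrawlStore {
    std::unordered_map<std::string, DocId> url_to_doc;          // URL -> DocumentID
    std::unordered_map<DocId, std::vector<DocId>> links;        // links database: doc -> outgoing docs
    std::unordered_map<std::string, std::vector<DocId>> index;  // inverted index: word -> docs containing it

    DocId next_id = 1;

    // Return the existing ID for a URL, or assign a fresh one.
    DocId doc_for_url(const std::string& url) {
        auto it = url_to_doc.find(url);
        if (it != url_to_doc.end()) return it->second;
        return url_to_doc[url] = next_id++;
    }
};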
C++ a terrible language for this sort of task? I'm not sure that's very accurate, given that Google was built with a mixture of C++ and Python. Compiled language = network speed bottleneck? I don't agree with that either, and I'm not sure how using a compiled language is a stunning waste of resources.

If you want to build a search engine, you have to decide whether you are going to store web pages or only reference them. You can either download and parse each site at the same time, keeping the pages in memory, or download the pages to disk and have a separate parser program to parse and index them.

What parsing does is pull out all of the words on a particular web page and index them so they can be queried. Parsing should be fast, indexing should be fast, and querying the indexed data should be fast. This becomes increasingly difficult with large data sets, so database design is important.

The Google model uses many low-end computers to download and parse websites; the data is then stored on a number of distributed databases. This architecture allows Google to crawl hundreds and thousands of pages per second easily. However, if you're starting with one system that has to do everything, it's advisable to keep the dataset small. Parsing can take up vital processing time, processing time that is needed when a user queries the database.
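As a rough illustration of the "pull out the words and index them" step, here is a toy tag-stripper and tokenizer that records which document each word appeared in. Real HTML needs a proper parser, so treat this as a sketch of the idea, not a production parser.

// Toy sketch of the parse-and-index step: strip tags, pull out words,
// record which document each word appeared in.
#include <cctype>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

using DocId = std::uint64_t;

// word -> list of documents containing it (the inverted index)
std::unordered_map<std::string, std::vector<DocId>> inverted_index;

void index_document(DocId id, const std::string& html) {
    std::string word;
    bool in_tag = false;
    for (char c : html) {
        if (c == '<') { in_tag = true;  continue; }
        if (c == '>') { in_tag = false; continue; }
        if (in_tag)     continue;                     // skip everything inside tags
        if (std::isalnum(static_cast<unsigned char>(c))) {
            word += static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
        } else if (!word.empty()) {
            inverted_index[word].push_back(id);       // record the hit
            word.clear();
        }
    }
    if (!word.empty()) inverted_index[word].push_back(id);
}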
I don't think he is saying that compiled languages = network bottlenecks. I think what he is saying is that network speed is such a bottleneck that the normal benefits of a compiled language are lost, because the compiled code will always be waiting on input rather than processing. I think this is a result of how that poster wrote his search engine, as opposed to being generally true. The same goes for C++ being a terrible language: given how I think that poster might have written his search engine, it would be true. However, there is more than one way to skin a cat.
I have gone through all the replies to the post, but I am going to explain the technical portion of a web crawler and the search engine. A web crawler can be written in any high-level programming language. You can use Java, C++ or Python to make a crawler, but I would suggest you go for Java, which has many more flexibility and functionality options available for writing one.

A crawler first picks a starting site and records all of its outgoing links. Then it applies either a depth-first or a breadth-first algorithm to follow those links. In the depth-first approach, it picks one outgoing link of a page, follows it to the next page, records all of that page's outgoing URLs, and then repeats the same thing with a URL from the current page, up to a certain predefined depth limit. In the breadth-first approach, the crawler follows all the links of a page before moving on to each linked page in turn and doing the same with the links found there. The crawler is also called a robot, or just a bot.

If and when everybody is clear on this, in my next post I'll explain how the search-engine algorithm works.
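To make the crawl order concrete: a queue of URLs gives you breadth-first, a stack gives you depth-first. The sketch below is a breadth-first frontier with a depth limit; fetch_page() and extract_links() are hypothetical stubs standing in for the HTTP and link-extraction code (e.g. the libcurl fetch shown earlier).

// Sketch of a breadth-first crawl frontier with a predefined depth limit.
#include <queue>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Trivial stubs so the sketch compiles; the real versions would do HTTP and HTML parsing.
std::string fetch_page(const std::string&) { return ""; }
std::vector<std::string> extract_links(const std::string&) { return {}; }

void crawl(const std::string& seed, int max_depth) {
    std::queue<std::pair<std::string, int>> frontier;  // (url, depth); a stack would give depth-first
    std::set<std::string> visited;

    frontier.push({seed, 0});
    visited.insert(seed);

    while (!frontier.empty()) {
        auto [url, depth] = frontier.front();
        frontier.pop();

        std::string html = fetch_page(url);             // download the page
        // ... parse and index the page here ...

        if (depth >= max_depth) continue;               // predefined limit
        for (const std::string& link : extract_links(html)) {
            if (visited.insert(link).second)            // only enqueue unseen URLs
                frontier.push({link, depth + 1});
        }
    }
}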
My friend built a search engine where the spider was a PHP file that would keep executing after his cron job launched it; once it finished, it would call exec or something like it to restart itself. I'm sure this was a waste of resources, having to go back and forth between PHP and C (the language PHP is written in, if I'm not mistaken?), but it spidered in a fairly timely manner, and when it was live the search engine was pretty accurate.