How does Nutch's database work? Does it save the data into an array in MySQL, like:
36633 http://forum.digitalpoint.com forum SEO Digital point tracker search ...
85093 http://digitalpoint.com forum SEO Digital point tracker search engine...
...and then have a new line under that with another number, URL, and list of words?
No, not even close. The crawled pages are kept in a 'segment', which is a file or two containing the downloaded pages. Then an index is built from the segment, and the index is the actual file accessed when someone runs a search. I believe the file format is proprietary (perhaps 'unique' is a better word), but don't quote me on that. At the very least it's not a MySQL table, nor is the data accessed through a database program like MySQL - the search program accesses the index directly. If you're looking for a small search engine at about the MySQL level, look at something like ASPseek. Nutch is very heavy-duty stuff IMO, but it also comes with the associated heavy-duty learning curve. It's not a case of install, change the config file and start crawling.
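Roughly, the flow is: crawler writes pages into the segment, then an indexer turns that into a searchable word index on disk. Something like this sketch, using the Lucene library for illustration (the field names and the on-disk path are my own made-up examples, and whether this matches what Nutch actually does under the hood I'll leave to the Nutch docs):

```java
// A minimal sketch of "build an index from downloaded pages" using a recent
// Lucene API. Field names and path are illustrative assumptions only.
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

import java.nio.file.Paths;

public class BuildIndex {
    public static void main(String[] args) throws Exception {
        // Plain files on disk -- no MySQL or any other database server involved.
        FSDirectory dir = FSDirectory.open(Paths.get("crawl/index"));
        try (IndexWriter writer =
                 new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
            // Pretend this page just came out of a crawl segment.
            Document doc = new Document();
            doc.add(new StringField("url", "http://forum.digitalpoint.com", Field.Store.YES));
            doc.add(new TextField("content", "forum SEO Digital point tracker search", Field.Store.NO));
            writer.addDocument(doc); // Lucene inverts this into a word -> page index
        }
    }
}
```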
Well, what is this proprietary way? I'm making a search engine and it doesn't store data like any search engine that you can download and build your own DB with. The search engine in theory can hold at minimum 5,000 times as much data as other web-language search engines I've seen, while still performing as fast or faster on the same machine. And the basic files that do the tasks (everything but the DB) are half the size of a common small search engine's. At first it doesn't seem like that could be true, but if you were to see the way it accesses the DB you would see why most search engines don't use this method and why it's so much better. It's like easy and slow, or hard and fast.
I dunno, flat files of some sort. Nutch is open source - download it and find out. It is based on Lucene. Sure you can create software that indexes and searches fast using proprietary data formats. Google does it, Yahoo does it, MSN does it, and so on. Nutch does it as well. The smaller stuff that won't handle industrial-strength applications uses databases. The big ones (including Nutch) all use their own approach. But they all boil down to downloading the pages, then creating an index of the words and terms on each page. When you search for a term, it uses the word index to find appropriate pages, builds a score, then sorts and displays the results. AFAIK, they *all* work that way. In terms of stuff like disk access and what's stored where, well, there are as many ways to do that as there are developers. The real test in most of these applications is when you throw 50 million pages into it, then try a search. That's what separates the big ones from the small ones. Nutch will handle that level seamlessly; most other available apps won't. (If I'm sounding like a Nutch fan, well, I think there are some things it could improve on. But based on what I've seen available to date, Nutch is far and away the best alternative for crawler-based search engines that need serious indexing.)
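The query side of that loop looks roughly like this - same hedges as the earlier sketch, a rough illustration against a recent Lucene API reusing the made-up 'url'/'content' fields from above, not Nutch's actual search code:

```java
// The query side: look a term up in the word index, score, sort, display.
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.FSDirectory;

import java.nio.file.Paths;

public class RunSearch {
    public static void main(String[] args) throws Exception {
        try (DirectoryReader reader =
                 DirectoryReader.open(FSDirectory.open(Paths.get("crawl/index")))) {
            IndexSearcher searcher = new IndexSearcher(reader);
            Query q = new QueryParser("content", new StandardAnalyzer()).parse("tracker");
            // Top 10 hits, already scored and sorted by the engine.
            for (ScoreDoc hit : searcher.search(q, 10).scoreDocs) {
                System.out.println(searcher.doc(hit.doc).get("url") + "  score=" + hit.score);
            }
        }
    }
}
```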
The web-language thing I'd be suspicious of. If you're talking PHP type of stuff, I'm not so sure that's the best pick for an SE. PHP is a great language for web development for a variety of reasons; speed of the final application isn't one of them. Personally I think this stuff should properly be compiled. Nutch uses Tomcat/Java, and I think if they went to a strictly compiled language it'd be a better choice.
Well, I bet I would like Nutch, but it's written in a language I don't know... I like to change things so they are my way... Also, by using Nutch I would be like everyone else; when I was looking at search engines and typed in a not-so-common keyword I use, I could tell what logic they were using... I ask about how the DB is set up because that decides what features you can have and how stuff works in the SE. The small-search-engine way is good for looking at one site, but it's slow; the way I'm using is bad for looking at just one site if it's in a herd, but it can access very little data to get the same information out. And don't you have to compile Java, or is it closed source but it reads open-source files... (or is every file open source and changeable?)
It's all open source. And I don't see how you can tell whether someone is using Nutch or not, since the serps would depend not just on the algo but also on the crawled data. Different crawls will give different serps even with the same algo. And there are a number of configuration changes that affect the scoring that are not obvious either - those will also completely change the serps between two search engines even if the data is identical. I'm not sure specifically whether Java is compiled or not, as my developer handles this aspect. I think it might be compiled at runtime, but again, I'm basically ignorant of Java. If you know Java, you can make all the changes you like, because Nutch is open source. We've made numerous changes to the scoring algorithm and the crawling algorithm on one site we use Nutch on. I think you'll find building a real search engine is a larger job than it first appears. If you're not sure, download Nutch and give it a trial run. Nothing like running your own to give you a good idea of what it takes to actually develop one.
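To give a taste of the kind of scoring change I mean (a rough sketch against a recent Lucene API, swapping the ranking function on a plain Lucene index like the one sketched earlier - this is illustrative, not Nutch's actual scoring-filter mechanism):

```java
// One "non-obvious configuration change that affects scoring": swap the
// similarity (ranking function) used at search time.
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.similarities.BM25Similarity;
import org.apache.lucene.store.FSDirectory;

import java.nio.file.Paths;

public class TunedSearch {
    public static void main(String[] args) throws Exception {
        try (DirectoryReader reader =
                 DirectoryReader.open(FSDirectory.open(Paths.get("crawl/index")))) {
            IndexSearcher searcher = new IndexSearcher(reader);
            // Same index, same queries -- but tweaking k1/b here can completely
            // reorder the serps even though the crawled data is identical.
            searcher.setSimilarity(new BM25Similarity(1.2f, 0.75f));
            // ...then run queries exactly as before; only the ordering changes.
        }
    }
}
```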