Checking my stats yesterday I found this: Agent: Mozilla/5.0 (compatible; DBLBot/1.0; +http://www.dontbuylists.com/). Anyone know who they are and what they are scanning for?
On their webpage, they say: "DontBuyLists is a company search engine and list creation tool. The DBLbot is crawling the web in search of company websites. Company websites are cached and are then searchable on our search engine. Because we structure the information found on websites using semantic technology, you can easily find companies, and create lists of companies for free." My suggestion: just ban them.
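If you want to ban them at the server level rather than relying on robots.txt, an Apache .htaccess rule along these lines is one way (a sketch assuming mod_rewrite is enabled; the "DBLBot" pattern is taken from the agent string above, adjust for your own setup):

    RewriteEngine On
    # Return 403 Forbidden to any request whose User-Agent contains "DBLBot"
    RewriteCond %{HTTP_USER_AGENT} DBLBot [NC]
    RewriteRule .* - [F,L]

Other servers have their own equivalents; the point is simply to match on the user agent and refuse the request.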
Hi jonathon, Hi cormack2009,

I am the CEO & Founder of aiHit, the company behind DBL. Very happy to answer your questions.

DBL is indeed a company search engine and list creation tool. We are one of the few search engines that are actually crawling the whole web (I think there are some 50 search engines doing this in the world). Yes, we scan many web pages in each domain. We are looking for companies that have a web presence and then try to figure out what companies do, what products, services, and solutions they offer, etc. We then structure this information (think semantic search). You can easily find companies in our search engine. If you go to our website http://www.dontbuylists.com/ and subscribe to our beta testing program by clicking on the green button, I will give you access to the search engine at the next release, so you can see for yourself what we are up to.

Re blocking DBL: we respect robots.txt. You can find our instructions on how to configure your robots.txt file so we no longer crawl your site here: http://www.dontbuylists.com/faq.htm

Hope the above is useful.

Kind regards, Jens
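For anyone who does want to opt out, a robots.txt entry usually takes a form like this (a sketch only; the exact User-agent token the crawler matches is whatever the FAQ above specifies, "DBLBot" here is just taken from the agent string in the first post):

    User-agent: DBLBot
    Disallow: /

This only helps, of course, if the crawler actually honours robots.txt, which Jens says DBL does.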
Hello jenslapinski, One question: if you find a site with, let's say, 49k pages, do you scan it all? The problem, on the webmaster side, is that this kind of spider takes a lot of bandwidth with no benefit for the webmaster. Personally, I'm not talking about your spider, but in general. In those cases robots.txt does not work, because we would need to know the name of each spider, and that is not possible. In my case, after a bad experience a week ago with a spider unknown to me that consumed 4.5 GB on my site, I developed my own code that doesn't let anybody (except Google) visit more than X pages in 10 minutes on any one of my sites.
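That code isn't shown here, but the general idea looks something like the following Python sketch (the limits, names, and whitelist are illustrative assumptions, not the poster's actual implementation): keep a sliding window of request timestamps per client IP and refuse anything over the limit unless the user agent is on a whitelist.

    import time
    from collections import defaultdict, deque

    MAX_REQUESTS = 60            # pages allowed per window (assumed value; the post only says "x")
    WINDOW = 600                 # 10 minutes, as in the post
    WHITELIST = ("Googlebot",)   # crawlers exempt from the limit, e.g. Google

    _hits = defaultdict(deque)   # client IP -> timestamps of its recent requests

    def allow_request(ip, user_agent):
        """Return True if the request should be served, False if it should be throttled."""
        if any(bot.lower() in user_agent.lower() for bot in WHITELIST):
            return True
        now = time.time()
        q = _hits[ip]
        # Drop timestamps that have fallen outside the 10-minute window.
        while q and now - q[0] > WINDOW:
            q.popleft()
        if len(q) >= MAX_REQUESTS:
            return False         # over the limit: respond with e.g. HTTP 429 or 403
        q.append(now)
        return True

    # Example: one IP hammering the site gets cut off after MAX_REQUESTS pages.
    if __name__ == "__main__":
        for _ in range(65):
            ok = allow_request("203.0.113.7", "SomeBot/1.0")
        print("last request allowed?", ok)   # False once the limit is hit

In a real setup this check would sit in front of the page handler, with the hit counts kept somewhere persistent rather than in process memory.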