Third Party Crawler

Discussion in 'Google Sitemaps' started by Bombaywala, Mar 28, 2007.

  1. #1
    I ran into an interesting issue. All this time I have been using GSiteCrawler to crawl one of my sites. Every week when I asked GSiteCrawler to crawl the site, it would come up with 28,000 records waiting to be crawled, whereas my site has only around 7,000 pages. What I discovered is that one of my JavaScript functions was referencing each URL string 4 times, which made the crawler think I have 28,000 pages instead of the actual 7,000 (7,000 x 4 = 28K). I then moved the JavaScript into its own file, and GSiteCrawler still did the same thing. As a result, the crawler would slow down ridiculously and at one point just hang. Would Google and other search engine crawlers behave similarly because of the JavaScript?
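    (A minimal sketch of why the count inflated, assuming the crawler scans the raw page source, script blocks included, for anything that looks like a URL; the page source and URLs below are made up for illustration. Deduplicating the discovered links collapses the repeats before they inflate the queue.)

        import re

        # Hypothetical page source: a script block that references the same
        # URL three more times after the anchor tag (illustrative only).
        page_source = '''
        <a href="http://example.com/page1.html">Page 1</a>
        <script>
          track("http://example.com/page1.html");
          prefetch("http://example.com/page1.html");
          log("http://example.com/page1.html");
        </script>
        '''

        # A naive crawler that scans the whole source finds the URL 4 times...
        found = re.findall(r'https?://[^"\s<>]+', page_source)
        print(len(found))     # 4 -- so 7000 pages look like 28000 queue entries

        # ...while deduplicating the crawl frontier collapses the repeats.
        frontier = set(found)
        print(len(frontier))  # 1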

    Eventually I blocked the crawler from crawling the .js file and voilà - all is good - GSiteCrawler now finishes crawling the 7K pages in about an hour.
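    (For what it's worth, a minimal sketch of the blocking idea using Python's standard urllib.robotparser; the Disallow rule and URLs are hypothetical, and GSiteCrawler's own exclusion filters may work differently.)

        import urllib.robotparser

        # Hypothetical robots.txt that keeps crawlers out of the scripts folder.
        rules = [
            "User-agent: *",
            "Disallow: /js/",
        ]

        rp = urllib.robotparser.RobotFileParser()
        rp.parse(rules)

        # A well-behaved crawler checks each candidate URL before queueing it.
        print(rp.can_fetch("*", "http://example.com/js/menu.js"))   # False: skipped
        print(rp.can_fetch("*", "http://example.com/page1.html"))   # True: crawled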
     
    Bombaywala, Mar 28, 2007 IP
  2. sadcox66

    #2
    Spiders typically ignore JavaScript, so it should not affect Google and other spiders. GSiteCrawler appears to use Internet Explorer browser functions, so it is susceptible to the same side effects as web browsers; but spiders do not (usually) run on MS Windows or use the IE inet library, so the answer is no...

    But there are other ways to trip a crawler up, such as using links that primitive spiders have a problem with. Check your logs to see if your bandwidth is being hogged by a spider and whether it keeps visiting the same pages. (hint: Inktomi ;))
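    (A rough sketch of that log check, tallying bytes and repeat visits per user agent from combined-format access-log lines; the log entries and agent names are invented for illustration.)

        from collections import Counter

        # Hypothetical access-log lines (Combined Log Format).
        log_lines = [
            '1.2.3.4 - - [28/Mar/2007] "GET /page1.html HTTP/1.0" 200 5120 "-" "Slurp"',
            '1.2.3.4 - - [28/Mar/2007] "GET /page1.html HTTP/1.0" 200 5120 "-" "Slurp"',
            '5.6.7.8 - - [28/Mar/2007] "GET /page2.html HTTP/1.0" 200 2048 "-" "Googlebot"',
        ]

        bytes_by_agent = Counter()
        visits = Counter()
        for line in log_lines:
            parts = line.split('"')
            request, agent = parts[1], parts[5]
            size = int(parts[2].split()[1])           # response bytes
            bytes_by_agent[agent] += size
            visits[(agent, request.split()[1])] += 1  # (agent, URL) pairs

        print(bytes_by_agent.most_common())             # who hogs bandwidth
        print([k for k, n in visits.items() if n > 1])  # same pages re-visited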
     
    sadcox66, Mar 28, 2007 IP
  3. websitetools

    #3
    You should perhaps consider posting your website URL.
    Maybe you have some very funky JavaScript :)

    I think most search engines and sitemap creator programs handle HTML, CSS, JavaScript, etc. links by doing the following (see the sketch after the list):

    * Normalizing URLs (e.g. translating relative addresses into absolute ones)
    * Handling the case where multiple pages redirect to the same URL
    * Making sure no page content is downloaded or analyzed twice
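    (A minimal sketch of the first and last points using Python's standard urllib.parse; the page URLs are hypothetical. Every variant spelling of a link resolves to one canonical key, so nothing is fetched twice.)

        from urllib.parse import urljoin, urlparse, urlunparse

        def normalize(base, href):
            # Resolve a relative link against its page, lowercase the
            # scheme and host, and drop the fragment.
            parts = urlparse(urljoin(base, href))
            return urlunparse((parts.scheme.lower(), parts.netloc.lower(),
                               parts.path, parts.params, parts.query, ""))

        seen = set()
        links = ["page1.html", "./page1.html", "/page1.html#top",
                 "HTTP://EXAMPLE.COM/page1.html"]
        for href in links:
            url = normalize("http://example.com/index.html", href)
            if url not in seen:       # never download or analyze a page twice
                seen.add(url)

        print(seen)  # {'http://example.com/page1.html'}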
     
    websitetools, Apr 1, 2007 IP