Excessive spidering of site


mudnik
Feb 18th 2005, 10:38 am
I have a big problem. The MSN and Yahoo spiderbots have been spidering a few pages of my site rather excessively. The pages are pretty useless:

http://www.internetmlm.net

/Members_List-index-letter-D-sortby-uname-authid-584ee0d621f48c9ab0b4a7d8241daaf5.html

/Members_List-index-letter-O-sortby-uname-authid-329dd717c93377ceb91190d411e82a0c.html

The spidering is so bad that my webhost even shut down my entire site for consuming too much CPU. I have since relocated my site to another webhost.

Top Process %CPU 17.0 [www.internetmlm.net] [/Members_List-index-letter-L-sortby-url-authid-557f90d9d3aa]
Top Process %CPU 14.0 [www.internetmlm.net] [/Members_List-index-letter-All-sortby-url-authid-c14de1784c]
Top Process %CPU 12.8 [www.internetmlm.net] [/Members_List-index-letter-X-sortby-url-authid-eda30773091f]

How do I stop them from accessing these pages?

Tried disallow members* but it didn't do the trick.

Web Gazelle
Feb 18th 2005, 11:03 am
Disallow the spiders from going to those pages in your robots.txt file.

honey
Feb 18th 2005, 11:08 am
disallowing via robots.txt should work.

mudnik
Feb 18th 2005, 7:29 pm
What's the exact text to use?
I tried disallow members* but it didn't work. Can I use wildcards?

Web Gazelle
Feb 18th 2005, 9:56 pm
You can also tell spiders not to index pages using robots meta tags, though they still have to fetch the page to read the tag.

daboss
Feb 18th 2005, 10:32 pm
try putting this inside the <head> section of your webpage...

<meta name="robots" content="noindex">

the above should tell search engine robots not to index that particular page...

Chrissicom
Feb 19th 2005, 8:18 am
in robots.txt

User-agent: *
Disallow: /members/

(or /members.htm etc., depending on the path you want to block)
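
A rule can be checked before relying on it. The sketch below uses Python's standard urllib.robotparser to test whether a robots.txt rule covers the URLs from the first post. The rule `Disallow: /Members_List` is an assumption, not something quoted in this thread; it is chosen because plain robots.txt rules match by case-sensitive path prefix, so no wildcard should be needed. The helper name `is_allowed` is hypothetical, and real spiders may interpret rules differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# Assumed rule (not from the thread): block every path starting with /Members_List.
ROBOTS_TXT = """\
User-agent: *
Disallow: /Members_List
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    base = "http://www.internetmlm.net"
    page = base + "/Members_List-index-letter-D-sortby-uname-authid-584ee0d621f48c9ab0b4a7d8241daaf5.html"
    print(is_allowed("msnbot", page))      # False: the prefix rule covers this page
    print(is_allowed("msnbot", base + "/"))  # True: the home page is not covered
```

With this parser, a rule written as `Disallow: members*` (no leading slash, lowercase) does not match the /Members_List pages, which may be why the earlier attempt had no effect.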

Sirxl
Feb 27th 2005, 9:37 pm
thanks for info, guys

nfzgrld
Feb 27th 2005, 10:28 pm
/Members_List-index-letter-D-sortby-uname-authid-584ee0d621f48c9ab0b4a7d8241daaf5.html

/Members_List-index-letter-O-sortby-uname-authid-329dd717c93377ceb91190d411e82a0c.html


Are those session ID strings there? If so, that could be your problem. Session IDs can sometimes make spiders get stuck, especially if the ID changes every time it hits, because each new ID looks like a brand new URL to the bot. Find a way to turn off the session IDs when the bots hit and this problem might just go away that fast.

mudnik
Mar 1st 2005, 4:18 am
How do I turn off session IDs?
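
How to do this depends on the software running the site, which the thread does not name. As a rough illustration of the idea only, here is a Python sketch that leaves links alone for normal visitors and drops the authid part when the visitor looks like a spider. It assumes the authid segment is a per-visit token (not confirmed above) and that the site would still serve the page without it. The names `CRAWLER_TOKENS`, `is_crawler`, and `link_for` are hypothetical.

```python
import re

# Assumed crawler tokens; real User-Agent strings vary, so extend this as needed.
CRAWLER_TOKENS = ("googlebot", "msnbot", "slurp")

# Matches "-authid-" followed by hex digits, as in the URLs quoted in this thread.
AUTHID_RE = re.compile(r"-authid-[0-9a-f]+", re.IGNORECASE)


def is_crawler(user_agent: str) -> bool:
    """Return True if the User-Agent contains one of the assumed crawler tokens."""
    ua = user_agent.lower()
    return any(token in ua for token in CRAWLER_TOKENS)


def link_for(path: str, user_agent: str) -> str:
    """Return path unchanged for normal visitors, without the authid for crawlers."""
    if is_crawler(user_agent):
        return AUTHID_RE.sub("", path)
    return path
```

For example, `link_for("/Members_List-index-letter-D-sortby-uname-authid-584ee0d621f48c9ab0b4a7d8241daaf5.html", "msnbot/1.0")` gives `/Members_List-index-letter-D-sortby-uname.html`, so the spider sees one stable URL per page instead of a new one on every visit.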

Cyclops
Mar 1st 2005, 6:06 am
Sorry if this is going a little off topic, but is the MSN bot the same as the Yahoo bot?
On my Sites Admin stats the Yahoo bot is constantly there under multiple IP addresses gobbling up heaps of bandwidth. I never see any reference to the MSN bot.

However, in my cPanel stats Yahoo doesn't show up at all but MSN does.

The Google bot has been showing up twice a day for the past two months.

Web Gazelle
Mar 1st 2005, 10:30 am
Yahoo and MSN are different bots.
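
When two stats packages disagree like this, counting spider hits straight from the raw access log can help. A minimal Python sketch, assuming the log is in the common "combined" format where the User-Agent is the last double-quoted field on each line; the `BOTS` mapping and the function name `count_bot_hits` are hypothetical, and the substrings are only a starting point.

```python
import re
from collections import Counter

# Assumed mapping from a User-Agent substring to a display name.
BOTS = {"googlebot": "Google", "msnbot": "MSN", "slurp": "Yahoo"}

# In combined log format the User-Agent is the last double-quoted field.
UA_RE = re.compile(r'"([^"]*)"\s*$')


def count_bot_hits(lines):
    """Count requests per known spider from an iterable of access log lines."""
    hits = Counter()
    for line in lines:
        match = UA_RE.search(line)
        if not match:
            continue  # line is not in the expected format
        ua = match.group(1).lower()
        for token, name in BOTS.items():
            if token in ua:
                hits[name] += 1
                break
    return hits
```

Usage would look like `count_bot_hits(open("access.log"))`, where the file name is a placeholder for wherever your host keeps the log.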