I have a few sites on shared hosting where each gets 1 GB of transfer per month. Those are super low-traffic sites and I thought that wouldn't be an issue, but take a look at this:

That's just for the first 5 days of this month, and last month I put up a robots.txt like this:

# Begin block Bad-Robots from robots.txt
User-agent: robot
Disallow: /
User-agent: bot
Disallow: /
User-agent: spider
Disallow: /
User-agent: crawl
Disallow: /
User-agent: spider
Disallow: /
User-agent: asterias
Disallow: /
User-agent: BackDoorBot/1.0
Disallow: /
User-agent: Black Hole
Disallow: /
User-agent: BlowFish/1.0
Disallow: /
User-agent: BotALot
Disallow: /
User-agent: BuiltBotTough
Disallow: /
User-agent: Bullseye/1.0
Disallow: /
User-agent: BunnySlippers
Disallow: /
User-agent: Cegbfeieh
Disallow: /
User-agent: CheeseBot
Disallow: /
User-agent: CherryPicker
Disallow: /
User-agent: CherryPickerElite/1.0
Disallow: /
User-agent: CherryPickerSE/1.0
Disallow: /
User-agent: CopyRightCheck
Disallow: /
User-agent: cosmos
Disallow: /
User-agent: Crescent
Disallow: /
User-agent: Crescent Internet ToolPak HTTP OLE Control v.1.0
Disallow: /
User-agent: DittoSpyder
Disallow: /
User-agent: EmailCollector
Disallow: /
User-agent: EmailSiphon
Disallow: /
User-agent: EmailWolf
Disallow: /
User-agent: EroCrawler
Disallow: /
User-agent: ExtractorPro
Disallow: /
User-agent: Foobot
Disallow: /
User-agent: Harvest/1.5
Disallow: /
User-agent: hloader
Disallow: /
User-agent: httplib
Disallow: /
User-agent: humanlinks
Disallow: /
User-agent: InfoNaviRobot
Disallow: /
User-agent: JennyBot
Disallow: /
User-agent: Kenjin Spider
Disallow: /
User-agent: Keyword Density/0.9
Disallow: /
User-agent: LexiBot
Disallow: /
User-agent: libWeb/clsHTTP
Disallow: /
User-agent: LinkextractorPro
Disallow: /
User-agent: LinkScan/8.1a Unix
Disallow: /
User-agent: LinkWalker
Disallow: /
User-agent: LNSpiderguy
Disallow: /
User-agent: lwp-trivial
Disallow: /
User-agent: lwp-trivial/1.34
Disallow: /
User-agent: Mata Hari
Disallow: /
User-agent: Microsoft URL Control - 5.01.4511
Disallow: /
User-agent: Microsoft URL Control - 6.00.8169
Disallow: /
User-agent: MIIxpc
Disallow: /
User-agent: MIIxpc/4.2
Disallow: /
User-agent: Mister PiX
Disallow: /
User-agent: moget
Disallow: /
User-agent: moget/2.1
Disallow: /
User-agent: NetAnts
Disallow: /
User-agent: NICErsPRO
Disallow: /
User-agent: Offline Explorer
Disallow: /
User-agent: Openfind
Disallow: /
User-agent: Openfind data gathere
Disallow: /
User-agent: ProPowerBot/2.14
Disallow: /
User-agent: ProWebWalker
Disallow: /
User-agent: QueryN Metasearch
Disallow: /
User-agent: RepoMonkey
Disallow: /
User-agent: RepoMonkey Bait & Tackle/v1.01
Disallow: /
User-agent: RMA
Disallow: /
User-agent: SiteSnagger
Disallow: /
User-agent: SpankBot
Disallow: /
User-agent: spanner
Disallow: /
User-agent: suzuran
Disallow: /
User-agent: Szukacz/1.4
Disallow: /
User-agent: Teleport
Disallow: /
User-agent: TeleportPro
Disallow: /
User-agent: Telesoft
Disallow: /
User-agent: The Intraformant
Disallow: /
User-agent: TheNomad
Disallow: /
User-agent: TightTwatBot
Disallow: /
User-agent: Titan
Disallow: /
User-agent: toCrawl/UrlDispatcher
Disallow: /
User-agent: True_Robot
Disallow: /
User-agent: True_Robot/1.0
Disallow: /
User-agent: turingos
Disallow: /
User-agent: URLy Warning
Disallow: /
User-agent: VCI
Disallow: /
User-agent: VCI WebViewer VCI WebViewer Win32
Disallow: /
User-agent: Web Image Collector
Disallow: /
User-agent: WebAuto
Disallow: /
User-agent: WebBandit
Disallow: /
User-agent: WebBandit/3.50
Disallow: /
User-agent: WebCopier
Disallow: /
User-agent: WebEnhancer
Disallow: /
User-agent: WebmasterWorldForumBot
Disallow: /
User-agent: WebSauger
Disallow: /
User-agent: Website Quester
Disallow: /
User-agent: Webster Pro
Disallow: /
User-agent: WebStripper
Disallow: /
User-agent: WebZip
Disallow: /
User-agent: WebZip/4.0
Disallow: /
User-agent: Wget
Disallow: /
User-agent: Wget/1.5.3
Disallow: /
User-agent: Wget/1.6
Disallow: /
User-agent: WWW-Collector-E
Disallow: /
User-agent: Xenu's
Disallow: /
User-agent: Xenu's Link Sleuth 1.1c
Disallow: /
User-agent: Zeus
Disallow: /
User-agent: Zeus 32297 Webster Pro V2.9 Win32
Disallow: /
# Begin Exclusion From Directories from robots.txt
Disallow: /cgi-bin/

It's basically blocking all known spider bots except Google, Bing and Yahoo, and as you can see, robot, bot, spider and crawl are not respecting that at all. So, a few questions:

1. Does anybody know who runs those bots and why they don't respect robots.txt?

2. What's up with Googlebot consuming 660 MB in 5 days? Aren't they supposed to NOT be aggressive like that? There was a video where Matt Cutts explains how they are extra careful not to crawl sites too fast or too aggressively, since that might cause problems for smaller hosts.

3. If I add the lines:

User-agent: *bot
Disallow: /

since this is the ID of one of the bots, will that also disallow "Googlebot", or is * in robots.txt a literal * rather than a catch-all symbol?

Answering any of the 3 questions will be appreciated.
Because they, as you said, are evil. They want to make money (or something else), so they just ignore it.

From a quick search: check your website for broken links, 404s, unnecessarily huge images, etc. If your site is fine, then Googlebot is just hungry and there is no way around it other than blocking it.

Can't answer that for certain, but why not just disallow ALL other bots and only allow the ones of your liking? Something like this (using the crawlers' actual user-agent tokens: Googlebot, Slurp for Yahoo, and Bingbot):

# Allowed robots
User-agent: Googlebot
Allow: /

User-agent: Slurp
Allow: /

User-agent: Bingbot
Allow: /

# All other robots
User-agent: *
Disallow: /

If you want to really block the bad bots and you can find out their IP addresses, you can block them via .htaccess. Here is a tutorial: http://www.thesitewizard.com/apache/block-bots-with-htaccess.shtml

Bots are not able to ignore .htaccess files.
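A minimal sketch of what that kind of IP-based blocking in .htaccess can look like, assuming Apache 2.2-style mod_authz_host directives; the addresses below are placeholders, not the real bots' IPs:

# Deny requests from specific bot IP addresses (placeholder addresses)
Order Allow,Deny
Allow from all
Deny from 192.0.2.10
Deny from 198.51.100.0/24

On Apache 2.4 the same effect is achieved with "Require not ip ..." inside a RequireAll block, so it's worth checking which version the host runs before copying rules from a tutorial.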
Hey, if this is on a dedicated server you can do this in iptables, which is even better as it will drop all their connection attempts. The way to do it is a STRING match, like this:

iptables -A INPUT -m string --algo bm --string "BADBOTUSERAGENT" -j DROP

That will work IF they are sending a correct user agent, but you can also block libwww-perl and just about anything else that can be used to make automated requests. However, bandwidth is cheaper these days, and the average webmaster usually won't mind bots crawling the site; I guess it depends what kind they are and what they are doing. I remember seeing posts like these back in 2004-05, but not much since then, as bandwidth has gotten a lot cheaper.
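A hedged variant of the same idea, assuming the client really sends an identifying User-Agent header; the "libwww-perl" string and port 80 are just example choices, not something from the original post:

# Drop inbound HTTP packets whose payload contains the string "libwww-perl"
iptables -A INPUT -p tcp --dport 80 -m string --algo bm --string "libwww-perl" -j DROP

Keep in mind the string match only inspects packet contents, so a bot that spoofs its user agent slips right past it; it mainly catches the lazy ones that identify themselves honestly.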