How to reduce bandwidth usage for a site?

Discussion in 'Apache' started by westhaven, May 29, 2005.

  1. compar

    #21
    That's very useful JD. I'm not sure we are utilizing it on our server, but I've asked my tech to talk to me about it. I can think of several sites that we host that would really benefit from this.
    compar, May 30, 2005
  2. J.D.

    #22
    You'd usually benefit from this if your visitors view more than one or two pages, or if you expect a lot of corporate users (who usually share a caching proxy). One thing to keep in mind: if you configure content to expire over an extended period (e.g. a day or more), you will need to lower this time far enough ahead of a deployment (at least one old expiry period), so that all caches pick up the new content sooner.
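
    A rough sketch of the expiry idea, assuming "this" refers to HTTP expiry headers (Cache-Control / Expires). The function name cache_headers and the one-day and one-hour lifetimes are made up for illustration; a real site would normally set these headers in the server configuration rather than in a script.

```python
from email.utils import formatdate

ONE_HOUR = 60 * 60
ONE_DAY = 24 * ONE_HOUR

def cache_headers(lifetime_seconds, now):
    """Build headers that let browsers and proxies cache a response.

    `now` is a Unix timestamp; passing it in keeps the example repeatable.
    """
    return {
        "Cache-Control": f"max-age={lifetime_seconds}",
        # HTTP dates are always expressed in GMT.
        "Expires": formatdate(now + lifetime_seconds, usegmt=True),
    }

# Normal operation: content may sit in caches for a day.
normal = cache_headers(ONE_DAY, now=0)

# Ahead of a deployment: advertise a shorter lifetime so caches re-fetch sooner.
pre_deploy = cache_headers(ONE_HOUR, now=0)

print(normal)
print(pre_deploy)
```

    The shorter lifetime only helps copies fetched after the change, which is why it has to be lowered well before the release.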

    J.D.
     
    J.D., May 30, 2005
  3. exam

    #23
    Except you also want your pages to load fast for your viewers...

    Another thing that helps is to take out any comments and whitespace from your HTML. (It's not much of a savings, but it does add up.)
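
    A minimal sketch of that kind of stripping, assuming plain static HTML. The function name strip_html is made up, and the regexes are deliberately naive: they would also squash whitespace inside <pre>, <textarea> or inline scripts, so treat this as an illustration rather than a safe minifier.

```python
import re

# HTML comments, but not IE conditional comments such as <!--[if IE]>.
COMMENT = re.compile(r"<!--(?!\[if).*?-->", re.DOTALL)
# Whitespace that sits between two tags.
BETWEEN_TAGS = re.compile(r">\s+<")
# Any remaining run of two or more whitespace characters.
RUNS = re.compile(r"\s{2,}")

def strip_html(html):
    """Remove comments and collapse redundant whitespace in an HTML string."""
    html = COMMENT.sub("", html)
    html = BETWEEN_TAGS.sub("> <", html)
    return RUNS.sub(" ", html).strip()

page = "<p>  a  </p>\n<!-- navigation starts here -->\n<p>b</p>"
smaller = strip_html(page)
print(len(page), "->", len(smaller))
print(smaller)
```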
     
    exam, May 30, 2005
  4. markkk

    #24
    Optimize your HTML pages.

    Another thing I did was put my HTML pages on my own paid hosting and upload all my images/SWF files to a free hosting site that has unlimited transfer.

    Cheers! :)

    markkk, May 31, 2005
  5. westhaven

    #25
    Yesterday I put this meta tag in my HTML file:
    <META NAME="revisit-after" CONTENT="7 days">
    but today when I looked at the logs, the MSN and Google robots were still coming. I don't know why...

    westhaven, May 31, 2005
  6. yfs1

    #26
    Westhaven - The major bots ignore that tag.

    yfs1, May 31, 2005
  7. nevetS

    #27
    If the bots are not major search engines and they take up a lot of bandwidth, send them away:
    
    RewriteEngine on
    RewriteBase /
    RewriteCond %{HTTP_USER_AGENT} ^psbot [OR]
    RewriteCond %{HTTP_USER_AGENT} ^EmailCollector [OR]
    RewriteCond %{HTTP_USER_AGENT} ^AWSM [OR]
    RewriteCond %{HTTP_USER_AGENT} ^sna [OR]
    RewriteCond %{HTTP_USER_AGENT} ^aipbot
    # The conditions above are ORed together; answer any match with 403 Forbidden
    RewriteRule .* - [F]
    
    Code (markup):
    There are a lot more bots out there, but those are the ones that were attacking one of my sites.
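
    Before putting a list like this live, it can help to check which user agents the prefixes would actually catch. The sketch below mirrors the five conditions in Python; the names BLOCKED_PREFIXES and is_blocked and the sample user agent strings are made up, and Python's re module only approximates what mod_rewrite does.

```python
import re

# Same prefixes as the RewriteCond lines; "^" anchors at the start of the string.
BLOCKED_PREFIXES = ["psbot", "EmailCollector", "AWSM", "sna", "aipbot"]
BLOCKED = re.compile("^(?:" + "|".join(map(re.escape, BLOCKED_PREFIXES)) + ")")

def is_blocked(user_agent):
    """True if the user agent starts with one of the listed prefixes."""
    return BLOCKED.search(user_agent) is not None

for ua in ["psbot/0.1", "snapbot", "PSBOT/0.1", "Mozilla/4.0 (compatible; MSIE 6.0)"]:
    print(ua, "->", "blocked" if is_blocked(ua) else "allowed")
```

    Two things stand out: a short prefix like ^sna catches anything that begins with those letters, and without the [NC] flag the match is case-sensitive, so an upper-case variant slips through.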
     
    nevetS, May 31, 2005
  8. Perrow

    #28
    Here's a link to someone who doesn't like the revisit-after meta tags. :D
     
    Perrow, May 31, 2005
  9. TechEvangelist

    #29
    The revisit-after Meta tag has been ignored by spiders since the 1990s.

    There are dozens of Meta tags and the majority are completely ignored by search engines. They are just worthless overhead.
     
    TechEvangelist, May 31, 2005
  10. compar

    #30
    That is certainly my feeling too.

    compar, May 31, 2005
  11. westhaven

    #31
    Where can I add this code?

    westhaven, May 31, 2005
  12. classifieds

    #32
    Those are from his .htaccess file.

    BTW, here's a copy of mine. It came from WMW, myself and contributions from other webmasters.

    -jay

    # This is my ban-the-nasty-bot/harvester list - jw
    #
    # mod_rewrite must be switched on before the conditions below take effect
    RewriteEngine On
    # Banning bots below
    # Address harvesters
    RewriteCond %{HTTP_USER_AGENT} ^(autoemailspider|ExtractorPro) [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} ^E?Mail.?(Collect|Harvest|Magnet|Reaper|Siphon|Sweeper|Wolf) [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} (DTS.?Agent|Email.?Extrac) [NC,OR]
    RewriteCond %{HTTP_REFERER} iaea\.org [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} ^PlantyNet_WebRobot [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} ^NutchCVS [NC,OR] 
    RewriteCond %{HTTP_USER_AGENT} ^gamekitbot [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} ^ichiro [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} ^avuk [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} ^Wget [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} ^bdfetch [NC,OR] 
    RewriteCond %{HTTP_USER_AGENT} ^AIBOT [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} ^aibot [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} ^Libby [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} ^ia_archiver [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} ^Jakarta [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} ^Java [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} ^LinkWalker [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} ^MJ12bot [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} ^mysearch [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} ^OmniExplor [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} ^PHP/4\.2\.2 [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} ^POE [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} ^SearchIndy [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} ^Xenu [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} ^rameda [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} ^Huron [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} ^LWP [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} ^spider [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} ^noxtrum [NC,OR]
    
    
    # Download managers
    RewriteCond %{HTTP_USER_AGENT} ^(Alligator|DA.?[0-9]|DC\-Sakura|Download.?(Demon|Express|Master|Wonder)|FileHound) [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} ^(Flash|Leech)Get [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} ^(Fresh|Lightning|Mass|Real|Smart|Speed|Star).?Download(er)? [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} ^(Gamespy|Go!Zilla|iGetter|JetCar|Net(Ants|Pumper)|SiteSnagger|Teleport.?Pro|WebReaper|NutchCVS?) [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} ^(My)?GetRight [NC,OR]
    # Image-grabbers
    RewriteCond %{HTTP_USER_AGENT} ^(AcoiRobot|FlickBot|webcollage) [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} ^(Express|Mister|Web).?(Web|Pix|Image).?(Pictures|Collector)? [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} ^Image.?(fetch|Stripper|Sucker) [NC,OR]
    # "Gray-hats"
    RewriteCond %{HTTP_USER_AGENT} ^(Atomz|BlackWidow|BlogBot|EasyDL|Marketwave|Sqworm|SurveyBot|Webclipping\.com) [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} (girafa\.com|gossamer\-threads\.com|grub\-client|Netcraft|Nutch) [NC,OR]
    # Site-grabbers
    RewriteCond %{HTTP_USER_AGENT} ^(eCatch|(Get|Super)Bot|Kapere|HTTrack|JOC|Offline|UtilMind|Xaldon) [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} ^Web.?(Auto|Cop|dup|Fetch|Filter|Gather|Go|Leach|Mine|Mirror|Pix|QL|RACE|Sauger) [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} ^Web.?(site.?(eXtractor|Quester)|Snake|ster|Strip|Suck|vac|walk|Whacker|ZIP) [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} WebCapture [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} ^DISCo\ Pump [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} ^EirGrabber [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} ^Net\ Vampire [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} ^NetZIP [NC,OR]
    # Tools
    RewriteCond %{HTTP_USER_AGENT} ^(curl|Dart.?Communications|Enfish|htdig|Java|larbin) [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} (FrontPage|Indy.?Library|RPT\-HTTPClient) [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} ^(libwww|lwp|PHP|Python|www\.thatrobotsite\.com|webbandit|Wget|Zeus) [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} ^(Microsoft|MFC).(Data|Internet|URL|WebDAV|Foundation).(Access|Explorer|Control|MiniRedir|Class) [NC,OR]
    # Unknown
    RewriteCond %{HTTP_USER_AGENT} ^(Crawl_Application|Lachesis|Nutscrape) [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} ^[CDEFPRS](Browse|Eval|Surf) [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} ^(Demo|Full.?Web|Lite|Production|Franklin|Missauga|Missigua).?(Bot|Locat) [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} (efp@gmx\.net|hhjhj@yahoo\.com|lerly\.net|mapfeatures\.net|metacarta\.com) [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} ^(Industry|Internet|IUFW|Lincoln|Missouri|Program).?(Program|Explore|Web|State|College|Shareware) [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} ^(Mac|Ram|Educate|WEP).?(Finder|Search) [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} ^(Moz+illa|MSIE).?[0-9]?.?[0-9]?[0-9]?$ [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} ^Mozilla/[0-9]\.[0-9][0-9]?.\(compatible[\)\ ] [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} ^NaverRobot [NC]
    RewriteRule .* - [F] 
    
    Code (markup):
     
    classifieds, Jun 1, 2005
  13. classifieds

    #33
    Here's a command/script that I use to quickly identify new bots / harvesters that don't mimic the IE / Mozilla / Firefox UAs. I can't remember where I originally found it but it has been helpful to me so I hope you get some use out of it too!

    awk -F '"' '{print $6}' /var/log/httpd/access_log | sort | uniq | grep -v Mozilla
    Code (markup):
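
    The same idea can be sketched in Python, with a hit count per user agent added. This assumes the Apache combined log format (user agent as the last quoted field); the function names and the sample log lines are invented for illustration.

```python
from collections import Counter

def user_agents(log_lines):
    """Count requests per user agent in combined-format access log lines."""
    counts = Counter()
    for line in log_lines:
        fields = line.split('"')
        if len(fields) >= 6:
            counts[fields[5]] += 1  # the field awk calls $6
    return counts

def non_browser_agents(log_lines):
    """Drop every agent whose string contains 'Mozilla', like grep -v Mozilla."""
    return {ua: n for ua, n in user_agents(log_lines).items() if "Mozilla" not in ua}

SAMPLE = [
    '192.0.2.1 - - [01/Jun/2005:10:00:00 +0000] "GET / HTTP/1.1" 200 5120 "-" "Mozilla/4.0 (compatible; MSIE 6.0)"',
    '192.0.2.2 - - [01/Jun/2005:10:00:01 +0000] "GET /a.html HTTP/1.0" 200 1000 "-" "psbot/0.1"',
    '192.0.2.2 - - [01/Jun/2005:10:00:02 +0000] "GET /b.html HTTP/1.0" 200 2000 "-" "psbot/0.1"',
]

print(non_browser_agents(SAMPLE))
```

    To run it against a real log, pass an open file object instead of SAMPLE; a file iterates line by line.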
     
    classifieds, Jun 1, 2005
  14. J.D.

    #34
    Keep in mind that this list will be evaluated on every hit. It's a very long list and it will certainly affect your server's performance. You might want to check if the saved bandwidth is really worth it. For example, measure how much bandwidth these robots use per week and also check how much slower your server works with these conditions in place (you'll need a stress tool for this).
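
    To put a number on the bandwidth side of that comparison, something like this rough sketch could total the bytes sent per user agent. It assumes the Apache combined log format; the regex, the function name bytes_by_agent and the sample lines are illustrative, and requests containing escaped quotes would need a more careful parser.

```python
import re
from collections import defaultdict

# host ident user [time] "request" status bytes "referer" "agent"
LINE = re.compile(r'^\S+ \S+ \S+ \[[^\]]+\] "[^"]*" \d{3} (\d+|-) "[^"]*" "([^"]*)"')

def bytes_by_agent(log_lines):
    """Sum the response sizes per user agent; '-' means no body was sent."""
    totals = defaultdict(int)
    for line in log_lines:
        match = LINE.match(line)
        if match:
            sent, agent = match.groups()
            totals[agent] += 0 if sent == "-" else int(sent)
    return dict(totals)

SAMPLE = [
    '192.0.2.1 - - [01/Jun/2005:10:00:00 +0000] "GET / HTTP/1.1" 200 5120 "-" "Mozilla/4.0 (compatible; MSIE 6.0)"',
    '192.0.2.2 - - [01/Jun/2005:10:00:01 +0000] "GET /a.html HTTP/1.0" 200 1000 "-" "psbot/0.1"',
    '192.0.2.2 - - [01/Jun/2005:10:00:02 +0000] "GET /b.html HTTP/1.0" 200 2000 "-" "psbot/0.1"',
    '192.0.2.2 - - [01/Jun/2005:10:00:03 +0000] "GET /c.html HTTP/1.0" 304 - "-" "psbot/0.1"',
]

for agent, total in sorted(bytes_by_agent(SAMPLE).items(), key=lambda kv: -kv[1]):
    print(total, agent)
```

    Feeding it a week of logs gives the per-robot totals to weigh against the measured slowdown.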

    J.D.
     
    J.D., Jun 1, 2005
  15. classifieds

    #35
    JD,

    You are correct. I've tested it, and in my case it adds about 1-3% to the processor utilization. I'm running on a dedicated machine, so it doesn't affect others, and to me it's worth the trade-off.

    I get deluged by email harvesters, scrapers and misbehaving bots and this was the easiest / quickest way to block them. But I have been seeing a new breed of harvesters that rotate their UAs on each pageview and each UA is a legitimate string AND the harvester software is smart enough now to avoid my spider traps.

    One solution I'm working on is to force user authentication via a graphically rendered character string if they view more than x pages. Hopefully this will catch the rest of them without hurting the user experience on the site.
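
    The "more than x pages" trigger could look roughly like the sketch below. It is only an illustration of the counting part: the class name ViewLimiter, the time window and the idea of keying on a client identifier are all assumptions, and the image challenge itself is not shown.

```python
from collections import defaultdict, deque

class ViewLimiter:
    """Flag clients that view more than max_views pages within a time window."""

    def __init__(self, max_views, window_seconds):
        self.max_views = max_views
        self.window = window_seconds
        self.hits = defaultdict(deque)  # client id -> timestamps of recent views

    def needs_challenge(self, client, now):
        """Record one page view and report whether the client is over the limit."""
        recent = self.hits[client]
        recent.append(now)
        # Forget views that have fallen out of the window.
        while recent and recent[0] <= now - self.window:
            recent.popleft()
        return len(recent) > self.max_views

limiter = ViewLimiter(max_views=3, window_seconds=60)
for t in (0, 1, 2, 3, 100):
    print(t, limiter.needs_challenge("192.0.2.7", now=t))
```

    Ordinary visitors who read a few pages a minute never see the challenge; only clients that exceed the limit inside the window do.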

    Also, if I see a lot of bad activity from a particular ISP/country, I use apf to block the entire CIDR range.
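
    For anyone unsure what a CIDR range covers, this small sketch checks addresses against a couple of ranges with Python's ipaddress module. The ranges are documentation-only example networks and the names are made up; it does not touch the firewall, it only shows which addresses a given range would include.

```python
import ipaddress

# Example ranges only (reserved documentation networks).
BLOCKED_RANGES = [
    ipaddress.ip_network("192.0.2.0/24"),     # 192.0.2.0 - 192.0.2.255
    ipaddress.ip_network("198.51.100.0/25"),  # 198.51.100.0 - 198.51.100.127
]

def is_blocked_ip(address):
    """True if the address falls inside any of the blocked ranges."""
    ip = ipaddress.ip_address(address)
    return any(ip in network for network in BLOCKED_RANGES)

for addr in ("192.0.2.77", "198.51.100.200", "203.0.113.5"):
    print(addr, is_blocked_ip(addr))
```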

    If you have suggestions or other tactics for performance or security, I would appreciate hearing them.

    Thanks,

    -jay
     
    classifieds, Jun 1, 2005
  16. J.D.

    #36
    It's not bad. I expected it to be higher - in the 5-10% range (regex can be pretty expensive).

    Legitimate robots are easy to control with robots.txt and don't need to be in the rewrite list. Some user agents (like the Java client) do not check for robots.txt and should be blocked at the rewrite level. Evil robots can only be blocked reliably at the IP level. I monitor traffic, and if I see unreasonably high activity from a certain range, I usually block the source for some time (sometimes the entire block, depending on the usage pattern).
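
    As a quick way to see what a well-behaved robot would conclude from a robots.txt file, the standard library's urllib.robotparser can evaluate rules offline. The rules, the robot name ExampleBot and the URLs below are invented for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: shut one robot out completely, keep others out of /private/.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /

User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def allowed(agent, url):
    """True if a robot that honours robots.txt may fetch the URL."""
    return parser.can_fetch(agent, url)

print(allowed("ExampleBot/1.0", "http://www.example.com/index.html"))
print(allowed("OtherBot", "http://www.example.com/index.html"))
print(allowed("OtherBot", "http://www.example.com/private/report.html"))
```

    This only models robots that choose to obey the file; anything that ignores it still has to be handled by the rewrite rules or at the IP level.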

    The idea of presenting graphics is interesting, but may drive some of your visitors away. I wouldn't use it.

    J.D.
     
    J.D., Jun 1, 2005
  17. optimizehost

    #37
    Saving bandwidth is largely a matter of compressing images, which are the biggest factor in bandwidth usage. We have a unique tool that can do the job.

    optimizehost, Aug 3, 2005