
Following Googlebot

Discussion in 'Google' started by marksteve_74, Apr 20, 2004.

  1. #1
    I have a website, and I wanted to know which pages Googlebot is visiting.
    From which page it goes to which page, etc.

    Can someone tell me if there are any tools available for this?

    Regards
    Mark
    Great http://www.buytattoo.com Picture Site
     
    marksteve_74, Apr 20, 2004 IP
  2. Will.Spencer

    #2
    Running Apache under FreeBSD, I can always use `grep` to do this:

    bash-2.05a$ grep Googlebot /www/logs/fortliberty.org-access_log
    64.68.82.55 - - [02/Jan/2004:03:14:16 -0700] "GET /robots.txt HTTP/1.0" 200 24 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
    64.68.82.55 - - [02/Jan/2004:03:14:16 -0700] "GET /employment.shtml HTTP/1.0" 200 6168 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
    64.68.82.54 - - [02/Jan/2004:03:19:01 -0700] "GET /militia-faq.shtml HTTP/1.0" 200 13644 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
    64.68.82.168 - - [02/Jan/2004:20:51:01 -0700] "GET /robots.txt HTTP/1.0" 200 24 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
    64.68.82.168 - - [02/Jan/2004:20:51:01 -0700] "GET /index.shtml HTTP/1.0" 200 6156 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
    64.68.82.18 - - [02/Jan/2004:21:23:13 -0700] "GET / HTTP/1.0" 200 6156 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
    64.68.82.167 - - [03/Jan/2004:18:21:00 -0700] "GET /robots.txt HTTP/1.0" 200 24 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
    64.68.82.167 - - [03/Jan/2004:18:21:00 -0700] "GET /index.shtml HTTP/1.0" 200 6158 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
    64.68.82.168 - - [03/Jan/2004:18:35:45 -0700] "GET /quotes/government.shtml HTTP/1.0" 200 30325 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
    ...

    Of course, I usually combine `grep` with `tail`, because the log files are BIG.
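
    A minimal sketch of that combination (same log path as above; adjust the line count to taste):

    bash-2.05a$ tail -n 5000 /www/logs/fortliberty.org-access_log | grep Googlebot

    This scans only the last 5,000 lines of the log instead of the whole file, so you see just the most recent Googlebot visits.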
     
    Will.Spencer, Apr 20, 2004 IP
  3. hans

    #3
    when you look at such output ...

    then you see:
    MANY different Googlebots come from different IPs,
    each one crawling from 1 up to approximately 12-20 pages,
    then they leave again.

    sometimes 10+ Googlebots show up within the same minute or hour.
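
    you can verify this yourself by counting Googlebot hits per crawler IP (a sketch using standard Unix tools; the log path is the one from the example above):

    bash-2.05a$ grep Googlebot /www/logs/fortliberty.org-access_log | awk '{print $1}' | sort | uniq -c | sort -rn

    each output line is a request count followed by a crawler IP, so you can see at a glance how many different bots visited and how many pages each one fetched.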

    BUT
    it appears to me that Googlebot NEVER simply follows a link from page to page. instead it fetches a PRE-defined number of clearly pre-defined pages according to its "working schedule", then "returns home" first - the collected data are processed, and a new work schedule is drawn up for the next visits.

    this approach is certainly far more precise, more scientific and far more efficient than just crawling and crawling no matter how many links there are ... and eventually LOOPING for many minutes, like other bots sometimes do, only to get lost, time out or be killed.

    files that are visited more frequently via Google search appear to be crawled more frequently, and others less often.

    hence the sequence of which files are indexed when depends (on my site at least) on other parameters.
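
    the raw log can show this too (another sketch with the same log path): count Googlebot requests per URL and compare the result against your most popular pages:

    bash-2.05a$ grep Googlebot /www/logs/fortliberty.org-access_log | awk '{print $7}' | sort | uniq -c | sort -rn

    the request path is the 7th whitespace-separated field in the combined log format, so this lists your pages ordered by how often Googlebot fetched them.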

    of course, on each visit to a known page, NEW links FOUND on a modified page are noted - BUT that new page is never visited immediately by THAT bot. it is picked up eventually, during the coming hours or days, by that or another Googlebot.

    this method guarantees a very high level of STEADY data flow and STEADY CPU usage, preventing idle or low CPU load as well as waiting queues. it may seem unpredictable to the webmaster at first glance, but it is surely the most efficient way for everyone involved.

    you may safely assume that Googlebot IS finding your pages in the very fastest possible way IF you follow THEIR published recommendations for webmasters:
    http://www.google.com/webmasters/guidelines.html

    "... Every page should be reachable from at least one static text link ..."

    at the very least, and best of all:
    - a sitemap
    - a topic overview
    - a news page on your site
    - cross-references from relevant OTHER pages

    and you can sleep and rest assured that Googlebot does its job ASAP.
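
    one quick way to check that recommendation is to list the static text links a text-mode client can actually see on a page (a sketch assuming lynx is installed; the URL is a placeholder, substitute your own sitemap or overview page):

    bash-2.05a$ lynx -dump -listonly http://www.example.com/sitemap.html

    every page of your site should appear in output like this from at least one such static page.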
     
    hans, Apr 21, 2004 IP