Google page count fluctuations

Discussion in 'Google' started by Will.Spencer, Dec 18, 2004.

  1. #1
    Around 28 August, Googlebot started crawling my phpBB newsgroup mirror at the Internet Search Engines FAQ.

    By 12 October, the page count as shown at Google was over 1,800.

    By around 5 December, the page count as shown by Google was below 1,000.

    Around 14 December, the page count spiked up over 2,000.

    Yesterday, the page count was 700 and today it is 742.

    Googlebot can spider all of the pages in my phpBB due to the small SessionID hack which was shown to me by Dodger.
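
    (For anyone who hasn't seen that hack: the gist, sketched below in Python rather than phpBB's actual PHP, is to stop appending the session ID to URLs when the user agent looks like a search bot, so every URL Googlebot sees is stable. The function name and bot list are just illustrative.)

    Code:

    import re

    # Simplified idea of the SessionID hack: bots get clean, session-free URLs.
    BOT_PATTERN = re.compile(r"googlebot|slurp|msnbot", re.I)  # assumed bot list

    def append_sid(url, sid, user_agent):
        """Append ?sid=... for normal visitors, but never for search bots."""
        if BOT_PATTERN.search(user_agent):
            return url
        sep = "&" if "?" in url else "?"
        return url + sep + "sid=" + sid

    print(append_sid("viewtopic.php?t=42", "abc123", "Googlebot/2.1"))
    # viewtopic.php?t=42
    print(append_sid("viewtopic.php?t=42", "abc123", "Mozilla/4.0 (Windows)"))
    # viewtopic.php?t=42&sid=abc123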

    I am not deleting any old threads or messages.

    phpBB currently shows 1,973 topics. Add another 189 pages for the FAQ itself and you get a rough idea of how many pages should be in the Google index.

    I can understand how pages which Googlebot has not yet visited would not be in the index -- but does anyone know why my pages would be falling out of the Google index?
     
    Will.Spencer, Dec 18, 2004 IP
  2. NewComputer

    NewComputer Well-Known Member

    #2
    Will,

    I wonder if this is a result of it being a new site (technically, with the forum) coupled with Google doing a massive algo update? I have seen some crazy (and strange) activity on a few sites that is indicative of something major set to happen....
     
    NewComputer, Dec 18, 2004 IP
  3. fryman

    fryman Kiss my rep

    #3
    Something major set to happen???? So, this last update was just the beginning???

    I feel really scared now :eek: :eek:
     
    fryman, Dec 18, 2004 IP
  4. minstrel

    minstrel Illustrious Member

    #4
    Are you sure they are falling out of the index? Ignoring the reported page counts, if you search for something on randomly selected pages of the site (or the URL itself), can you find the page?
     
    minstrel, Dec 18, 2004 IP
  5. T0PS3O

    T0PS3O Feel Good PLC

    #5
    Minstrel is right IMO. I don't believe anything that G reports nowadays. It would be silly if what they show you were accurate -- that data is too sensitive if they want to prevent manipulation.
     
    T0PS3O, Dec 19, 2004 IP
  6. Owlcroft

    Owlcroft Peon

    #6
    Consider this history:

    .15 October 2004...71,980
    .16 October 2004...73,050
    .17 October 2004...74,230
    .18 October 2004...76,840
    .19 October 2004...76,830
    .20 October 2004...78,070
    .21 October 2004...67,230
    .22 October 2004...66,790
    .23 October 2004...61,390
    .24 October 2004...66,470
    .25 October 2004...67,760
    .26 October 2004...67,900
    .27 October 2004...66,620
    .28 October 2004...65,490
    .29 October 2004...65,730
    .30 October 2004...65,730
    .31 October 2004...67,400
    .1 November 2004...67,500
    .2 November 2004...67,500
    .3 November 2004...67,910
    .4 November 2004...70,080
    .5 November 2004...69,650
    .6 November 2004...76,020
    .7 November 2004...73,630
    .8 November 2004...75,290
    .9 November 2004...90,390
    10 November 2004..119,200
    11 November 2004..150,600
    12 November 2004..155,500
    13 November 2004..160,000
    14 November 2004..161,100
    15 November 2004..161,100
    16 November 2004..243,300
    17 November 2004..248,800
    18 November 2004..248,800
    19 November 2004..242,000
    20 November 2004..236,800
    21 November 2004..232,800
    22 November 2004..232,800
    23 November 2004..226,400
    24 November 2004..216,300
    25 November 2004..209,000
    26 November 2004..209,000
    27 November 2004..149,260
    28 November 2004..100,070
    29 November 2004...71,630
    30 November 2004...65,390
    .1 December 2004...65,600
    .2 December 2004...59,460
    .3 December 2004...41,120
    .4 December 2004...49,363
    .4 December 2004...49,413
    .4 December 2004...49,413
    .5 December 2004...48,500
    .6 December 2004...47,263
    .7 December 2004...44,838
    .8 December 2004...44,838
    .9 December 2004...41,863
    10 December 2004...39,088
    11 December 2004...39,113
    12 December 2004...36,738
    13 December 2004...32,488
    14 December 2004...31,213
    15 December 2004...26,713
    16 December 2004...26,688
    17 December 2004...26,688
    18 December 2004...25,400
    19 December 2004...23,525
    19 December 2004...21,364
    19 December 2004...21,364
    19 December 2004...21,364

    Each count is an average of several datacenters, taken at essentially the same moment, and all were taken at essentially the same time of day (midnight Pacific).
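
    For what it's worth, the daily check amounts to something like this Python sketch (the datacenter hostnames and the regex for the results-count line are stand-ins, not the exact script I run):

    Code:

    import re
    import urllib.request
    from statistics import mean

    # Stand-in datacenter hostnames -- substitute whichever ones you watch.
    DATACENTERS = ["www.google.com", "www2.google.com", "www3.google.com"]

    def page_count(host, site):
        """Ask one datacenter for site:<site> and scrape the reported count."""
        url = "http://%s/search?q=site:%s" % (host, site)
        req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
        html = urllib.request.urlopen(req).read().decode("latin-1", "replace")
        # Assumes the results page still says "of about 12,345" somewhere.
        match = re.search(r"of about ([\d,]+)", html)
        return int(match.group(1).replace(",", "")) if match else 0

    counts = [page_count(dc, "omniknow.com") for dc in DATACENTERS]
    print("average page count: %d" % mean(counts))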

    And yes, spot checks have shown that pages really have gone away.

    In fact, my home page, which shows as a PR 5 both in the Firefox PR add-on and on the search services I consult (I don't use M$ products myself), is absent from direct searches on unique phrases from it and from the Google archive.

    My AdSense impressions fell like a rock but, curiously, have started to edge back up again over the last few days, plus the CPM has really jumped. (That last could be because the pages that remain were all visited at some time by the AdSense bot, and so have relevant ads to show -- just a guess, mind.)

    I have read here and elsewhere about G drastically tightening their "duplicate content" filter, which may be fine for many purposes, but has the ugly side effect of hitting things like index-directory pages very, very hard. I have added randomly selected portions of actual pages to the index pages to help differentiate their content (the site has over a million pages, so it needs three levels of indexing, with one level alone having 10,000 index pages).

    It's too bad Google can't find some bright 15-year-olds to explain to them how the world works.
     
    Owlcroft, Dec 19, 2004 IP
  7. minstrel

    minstrel Illustrious Member

    #7
    What are those counts based on? Google queries?

    As for caching your home page, the SEO Toys page is cached. For OmniKnow, why do you have this tag in your <head> section?

    <meta http-equiv="Expires" content="Mon, 20 Dec 2004 10:00:00 GMT" />
    Code (markup):
    To be honest, I'm not sure what effect that has with googlebot. But if it reads and respects it, aren't you saying, "Don't bother caching this page"? And if so, can you blame Google for doing what you asked it to do?
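
    If a bot did honor it, the decision would presumably boil down to something like this (just a sketch of the general caching rule, in Python -- I have no idea what googlebot actually does internally):

    Code:

    from datetime import datetime, timezone
    from email.utils import parsedate_to_datetime

    def is_stale(expires_header):
        """A cached copy is stale once its Expires date/time has passed."""
        try:
            expires = parsedate_to_datetime(expires_header)
        except (TypeError, ValueError):
            return True                      # unparseable header: treat as stale
        if expires.tzinfo is None:           # naive timestamp: assume GMT
            expires = expires.replace(tzinfo=timezone.utc)
        return datetime.now(timezone.utc) >= expires

    # An Expires value already in the past says the copy is immediately stale.
    print(is_stale("Mon, 20 Dec 2004 10:00:00 GMT"))   # True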

    Your SEO toys home page, which was cached today, does NOT contain that tag.
     
    minstrel, Dec 19, 2004 IP
  8. Owlcroft

    Owlcroft Peon

    #8
    What are those counts based on? Google queries?

    They are each the average of a site:omniknow.com request sent to what I think is a representative selection of G datacenters, one after another, around midnight daily; they are the page counts reported back. Now and again I spot-check manually to see whether the pages such an inquiry returns are cached or not, and they always match (G does return many pages, on a site: command, that are not cached by them, for which they have only a brief entry, and which cannot be found by a G search for unique content on them). I cannot do an extensive reverse check (see whether pages not returned are actually cached), as that would involve getting the complete list returned by a site: inquiry to see what is missing.


    As for caching your home page, the SEO Toys page is cached. For OmniKnow, why do you have this tag in your <head> section?

    Code:

    <meta http-equiv="Expires" content="Mon, 20 Dec 2004 10:00:00 GMT" />


    The index of available articles is remade daily, and is typically finished by about 10:00 GMT. The "Expires" header supposedly tells browser caches (and searchbots) the date/time after which they must re-fetch the actual page; prior to that date/time, it is supposedly OK to use one's current cached copy instead of fetching it again.


    To be honest, I'm not sure what effect that has with googlebot. But if it reads and respects it, aren't you saying, "Don't bother caching this page"? And if so, can you blame Google for doing what you asked it to do?

    I didn't have it there for a good while, but when G started dropping pages I thought I'd best make sure it knows that the page changes daily.


    Your SEO toys home page, which was cached today, does NOT contain that tag.

    Just so, though my understanding, which may be defective, is that it is good practice to always send an "Expires" header for every page, so that caches can be used as effectively as possible -- avoiding fetches that are unnecessary when the page is already cached and, conversely, avoiding cached pages being used after they have become stale.

    The many thousands of index pages, now also php-generated from a core datablock, also have expiry times set the same as the index page (since they all get updated at the same time); but the million or so php-generated individual-article pages all have expiry date/times about one minute past their moment of loading, because those articles are subject to change at any instant. (And so that G, as well as browsers, knows that they can change instantly.)
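
    In Python terms (rather than the PHP that actually builds the pages), the scheme I mean comes down to something like this -- the helper name is just for illustration:

    Code:

    from datetime import datetime, timedelta, timezone
    from email.utils import format_datetime

    def expires_header(page_type):
        """Index pages go stale at the next 10:00 GMT rebuild; article pages
        go stale about one minute after they are served."""
        now = datetime.now(timezone.utc)
        if page_type == "index":
            expires = now.replace(hour=10, minute=0, second=0, microsecond=0)
            if expires <= now:              # already past today's rebuild
                expires += timedelta(days=1)
        else:
            expires = now + timedelta(minutes=1)
        return format_datetime(expires, usegmt=True)

    print("Expires:", expires_header("index"))    # e.g. "Mon, 20 Dec 2004 10:00:00 GMT"
    print("Expires:", expires_header("article"))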

    My further understanding is that the only effect this should have on G is to suggest to it that the page be refreshed daily, or as often as practicable; it should not, so far as I can see, cause it to not archive the page. And, again, I only started doing this after seeing the disappearances (including of the index page itself).
     
    Owlcroft, Dec 20, 2004 IP
  9. minstrel

    minstrel Illustrious Member

    #9
    Well, as I said, I'm not entirely sure, but if I were designing a spider I think I'd say don't bother caching any page with a short expires tag. In fact, I don't think the expires tag has any useful purpose whatsoever except perhaps for an intranet.

    As far as I know, there is no way to tell bots that they MUST do anything, only what they must NOT do (as in noindex or robots.txt). It's like the revisit-after tag: In reality, that doesn't mean, "come back in a week", it means "don't come back for at least a week". As for browsers, they don't need that tag to refresh -- they'll generally do it when the page doesn't match the cached version, depending on browser options settings.

    Was the home page getting cached BEFORE you added this tag?

    Well, as you can see, I'm dubious about this -- I'll do some further research. In the meantime, my suggestion would be to take it out for a few weeks and see if the page gets cached.

    UPDATE:

    http://vancouver-webpages.com/META/metatags.detail.html
    http://www.dwfaq.com/Tutorials/Miscellaneous/more_metas.asp

     
    minstrel, Dec 20, 2004 IP
  10. Will.Spencer

    Will.Spencer NetBuilder

    #10
    I have been using the "Pages In URL" column from the DigitalPoint Search Engine Keyword and Backlink Tracker.

    Just these last few days (eerily, right after I posted), my Pages In URL figure has been swinging wildly -- from the mid-700s to over 2,000 and back again.

    Before, the changes were occurring many weeks apart. Now it seems to happen every day.

    Google's site command shows 5,370 pages.

    So... my next question is... which number is used to determine Coop Network weight? :)
     
    Will.Spencer, Dec 20, 2004 IP
  11. Clubtitan.org

    Clubtitan.org Peon

    #11
    Why, when I use the site: command on Google, do I get 254 pages for site:www.clubtitan.org but 989 pages for site:clubtitan.org? Just curious why there would be a difference.
    Thanks,
    M4ck
     
    Clubtitan.org, Dec 20, 2004 IP
  12. lowrider14044

    lowrider14044 Raider

    #12
    My Pages In URL count has dropped in DP's backlink tracker also -- about 30%. I believe this is the number used to calculate weight for the Coop Network. Using the site: command shows about 48% more pages than the backlink tracker. Unfortunately, neither is right; the real number of pages is about midway between the two. You wouldn't think it would be that hard to get those numbers right, since my site's page count doesn't fluctuate much.
     
    lowrider14044, Dec 21, 2004 IP