Blocked Pages and Validation

Discussion in 'Co-op Advertising Network' started by Owlcroft, Jan 16, 2005.

  1. #1
    There have recently been several threads, on various parts of this forum, about the robots.txt file. That file is especially important for getting proper validation on the co-op network, in that pages you are trying to exclude from bots with robots.txt will almost surely not have co-op ads on them (as they indeed needn't).

    The big "but" here is that if your robots.txt file happens to be defective--as those threads suggest many people's are--then you can get skunked when the co-op validator finds no co-op ads on a supposedly (but not actually) blocked page.

    I will not repeat here the correct robots.txt file format, which I have recently posted a couple of times on those other threads, but here's a useful tip that I suspect many don't know:

    You can force Google to do a more or less immediate (within 24 hours) update of its archive against your robots.txt file.
    All you have to do is go to--


    --and follow the extremely simple directions there.

    Google will quickly review its cached-file list from that site and drop any that should be blocked. (But that will not force addition of any pages: it's not a site crawl, just a cache review.)

    The robots.txt file seems so simple that many people are careless about constructing it, and don't realize that their version may not be doing much of what they think it is doing. It would be wise to check yours out--for the co-op and for general purposes--and be sure. And if you find it needs fixing, fix it, then use the tip above right away.
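
    For reference, the skeleton of a well-formed robots.txt is very small. This sketch assumes you want to block a /private/ and a /cgi-bin/ directory from all bots; substitute your own paths:

        # One path prefix per Disallow line; blank lines separate records
        User-agent: *
        Disallow: /private/
        Disallow: /cgi-bin/

    The commonest mistakes are putting several paths on one Disallow line, or omitting the User-agent line entirely--either can leave the file silently ignored.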
     
    Owlcroft, Jan 16, 2005 IP
  2. #2
    Exactly what I needed. Thanks!
     
    ResaleBroker, Jan 16, 2005 IP
  3. #3
    If I remove my /index.htm page from Google's cache, will Google still spider my main URL (without the "/index.htm")? I mistakenly had all my links to my homepage ending in /index.htm.
     
    kyle422, Jan 17, 2005 IP
  4. #4
    I'm not 100% sure I understand the question. But whether you mean that you used index.htm when you wanted index.html or index.shtml, or whether you mean you wanted just a root slash, as in--

    www.mysite.com/

    --as your link, just use a mod_rewrite rule in your .htaccess file to change the unwanted form to the wanted one.

    If you don't know how to use mod_rewrite (I'm assuming your host runs Apache server software), search this site or the web with mod_rewrite and .htaccess as the terms.
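
    As a minimal sketch--assuming your host runs Apache with mod_rewrite enabled, and using www.mysite.com purely as a stand-in for your own domain--a rule like this in .htaccess will 301-redirect requests for /index.htm to the bare slash:

        RewriteEngine On
        # Match only external requests for /index.htm (not the server's own
        # internal index-file lookup), so the redirect cannot loop
        RewriteCond %{THE_REQUEST} ^GET\ /index\.htm
        RewriteRule ^index\.htm$ http://www.mysite.com/ [R=301,L]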
     
    Owlcroft, Jan 17, 2005 IP
  5. #5
    I guess my question is confusing. My Menu page links were all pointing to "www.mysite.com/index.htm".

    www.mysite.com and www.mysite.com/index.htm both have PR and get cached daily (someone advised me that this is not good). If I submit www.mysite.com/index.htm for removal, will www.mysite.com still get cached?
     
    kyle422, Jan 17, 2005 IP
  6. #6
    I suspect you mean www.mysite.com/ for the first (note the trailing slash); Google only indexes, and assigns PR to, files, not directories or domains. Indeed, the bare slash is, technically, not a proper URL, because it nominally points only to a directory (your site's root directory).

    But it is a very common server-software convention that a link to a directory with no actual file specified will look for files named, more or less in this order: index.html, index.htm, index.shtml, and index.php (at least). (What the server does if it finds no such file in the subject directory depends on other settings--it might display the directory contents, or it might warn that the directory's contents cannot be displayed.)

    The important thing is to make sure that all your inbound links, and thus all your associated Google PR, point to the same URL for each of your pages. Google does not do what servers do--it treats each distinct URL as a unique page. So, to Google, these--

    www.mywonderfulsite.com/
    mywonderfulsite.com/
    www.mywonderfulsite.com/index.html
    mywonderfulsite.com/index.html

    --are four different pages, and backlinks to one will not accrue as PR credit for any of the others.

    Since you cannot really control how others will form their links to you (even if you suggest forms), it is up to you to take control of the situation. You do that by selecting one of the four front-page URL forms shown above as your preferred (or "canonical") form for the URL, then setting up 301 Moved Permanently redirects for the other three that point to the canonical form. You do that with appropriate statements in your .htaccess file, if your server is running Apache server software; if it is running Very-Tiny-and-Limp software, you have separate problems that I will not address.

    When Google (or any bot, or any browser) tries to follow a link to one of the non-canonical forms of the URL, the server will send back a 301 with the canonical form supplied, so the bot (or browser) knows that a link to the non-canonical form is forever after to be treated as being to the canonical form.

    The necessary mod_rewrite directives for your .htaccess file depend on what you want for your canonical form; you can search this forum, and the web in general, for information on mod_rewrite and .htaccess.
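
    As a minimal sketch--assuming Apache with mod_rewrite enabled, and assuming you choose www.mywonderfulsite.com/ as your canonical form--the .htaccess directives might look like this:

        RewriteEngine On
        # Fold the bare domain into the www form (301 Moved Permanently)
        RewriteCond %{HTTP_HOST} ^mywonderfulsite\.com$ [NC]
        RewriteRule ^(.*)$ http://www.mywonderfulsite.com/$1 [R=301,L]
        # Strip an explicitly requested index.html, again with a 301
        RewriteCond %{THE_REQUEST} ^GET\ /index\.html
        RewriteRule ^index\.html$ http://www.mywonderfulsite.com/ [R=301,L]

    Between them, those two rules fold all four of the URL forms listed above into the one canonical form.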
     
    Owlcroft, Jan 18, 2005 IP
  7. #7
    Thanks for the detailed answer. It is exactly what I wanted to know. :)
     
    kyle422, Jan 18, 2005 IP
  8. #8
    Thanks to Owlcroft for the great tip about using robots.txt and Google's Remove URLs tool to get rid of unwanted URLs from GG's cache. However,
    1. Does removing a URL from the GG cache mean it gets removed from GG's index, either immediately or later?
    2. I want GG and other engines to avoid crawling parameterised URLs, and to crawl/index only well-formed URLs that have been optimised for my target keywords. I would like to use 'Disallow: /*?' to remove parameterised URLs (sketched below), but the Remove URLs tool cannot handle GG's own asterisk wildcard syntax in robots.txt. Is there a way around this, apart from excluding each unwanted URL explicitly in robots.txt--ideally one that will also work for other engines?
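
    For illustration only--the * wildcard was, at the time, a Googlebot extension rather than part of the original robots.txt standard--the rule I have in mind would sit in robots.txt like this:

        # Googlebot honours the * wildcard; many other engines' bots may not
        User-agent: Googlebot
        Disallow: /*?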

    Thanks in advance.
     
    noumena, Aug 27, 2006 IP
  9. #9


    1. I am also looking for an answer to this and can't find it elsewhere; hopefully someone here can help both of us with it :)
     
    jaree, Oct 5, 2006 IP