Google Crawls Non-Hyperlinked URLs

Discussion in 'Google' started by T0PS3O, May 23, 2005.

  1. #1
    After investigating a couple of mysterious index inclusions of purposefully 'hidden' sites (being developed) I'm going to toss the following statement into the DP crowd...

    GOOGLE CRAWLS NON-HYPERLINKED URLS!

    Check this out (and I'm not the only one):

    http://www.google.co.uk/search?hl=en&lr=&sa=G&q=site:buy-a-mattress.co.uk

    Especially the 3rd link. It's a link I left here on DP back when I was building the site and asking for ideas. It's broken with an * (w*w.....) so VB doesn't hyperlink it, and Google still decided to go have a look.

    Assumption (probably jumping to conclusions):

    Realizing full well how the link voting system is abused nowadays, Google has decided to factor in in-text mentions of URLs as well.

    Which would be odd, since the mention could be in an article about how crap the page was, and then Google would count it as a vote...

    The indexing could have been down to Toolbar visitors but that doesn't explain the asterisked link in the site: results.

    Anyone seen something similar or able to destroy the theory?
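
For illustration only, here is a minimal Python sketch of how any text-scanning crawler could, in principle, pick up URL mentions that are not hyperlinked. Nothing here is known to be Google's method. The names `URL_MENTION` and `find_url_mentions` are hypothetical, and the pattern is deliberately naive.

```python
import re

# Hypothetical, deliberately simple pattern: an optional http(s):// prefix,
# one or more dot-terminated hostname labels, a TLD of two or more letters,
# and an optional path running up to the next whitespace character.
URL_MENTION = re.compile(
    r"(?:https?://)?"
    r"(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+"
    r"[a-z]{2,}"
    r"(?:/\S*)?",
    re.IGNORECASE,
)


def find_url_mentions(text):
    """Return every URL-like substring found in plain text."""
    return [match.group(0) for match in URL_MENTION.finditer(text)]


if __name__ == "__main__":
    post = "Check www.example.com/page and http://example.org today."
    for mention in find_url_mentions(post):
        print(mention)
```

A scanner like this does not care whether the forum software turned the text into a link, which is the behaviour the theory above assumes.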
     
    T0PS3O, May 23, 2005 IP
  2. DangerMouse

    DangerMouse Peon

    Messages:
    275
    Likes Received:
    11
    Best Answers:
    0
    Trophy Points:
    0
    #2
    google toolbar...
     
    DangerMouse, May 23, 2005 IP
  3. Weirfire

    Weirfire Language Translation Company

    Messages:
    6,979
    Likes Received:
    365
    Best Answers:
    0
    Trophy Points:
    280
    #3
    This information could make all kinds of differences to the weighting of a website then surely?

    Would the weight be distributed equally between the mentioned URLs and the linked URLs? Would leaving out the www stop Google finding it?
     
    Weirfire, May 23, 2005 IP
  4. T0PS3O

    T0PS3O Feel Good PLC

    Messages:
    13,219
    Likes Received:
    777
    Best Answers:
    0
    Trophy Points:
    0
    #4
    Interesting you should mention that...

     
    T0PS3O, May 23, 2005 IP
  5. T0PS3O

    T0PS3O Feel Good PLC

    Messages:
    13,219
    Likes Received:
    777
    Best Answers:
    0
    Trophy Points:
    0
    #5
    Well I haven't really thought about the implications yet to be honest, I was first hoping to establish whether what I think I'm seeing is actually happening.

    I've been trying to discredit the theory myself, and the only counter-explanation I can come up with is that I searched site:buy-... (without www) as opposed to site:www.buy- (with the www), and that could perhaps explain the w*w being included.

    But I'm not sure on this yet.
     
    T0PS3O, May 23, 2005 IP
  6. Weirfire

    Weirfire Language Translation Company

    Messages:
    6,979
    Likes Received:
    365
    Best Answers:
    0
    Trophy Points:
    280
    #6
    Well, the mcdarians of DP will be experimenting with this theory tomorrow. You can count on it :)
     
    Weirfire, May 23, 2005 IP
  7. DangerMouse

    DangerMouse Peon

    Messages:
    275
    Likes Received:
    11
    Best Answers:
    0
    Trophy Points:
    0
    #7
    OK.. sorry I didn't finish reading the post - it's late ;)

    But Google didn't crawl the URL - it isn't valid.

    Sounds more like the site: command isn't perfect to me
     
    DangerMouse, May 23, 2005 IP
  8. l234244

    l234244 Peon

    Messages:
    1,225
    Likes Received:
    50
    Best Answers:
    0
    Trophy Points:
    0
    #8
    I read in a post from a search engine conference that some googleguy said they were able to find non-linked sites, although nothing was mentioned about how they did it. I suspect it's the toolbar, though.
     
    l234244, May 23, 2005 IP
  9. tresman

    tresman Well-Known Member

    Messages:
    235
    Likes Received:
    20
    Best Answers:
    0
    Trophy Points:
    138
    #9
    Of course it didn't. There is no page there at all, so what should Google crawl?
     
    tresman, May 23, 2005 IP
  10. NetMidWest

    NetMidWest Peon

    Messages:
    1,677
    Likes Received:
    151
    Best Answers:
    0
    Trophy Points:
    0
    #10
    Cut-n-paste to the browser address bar of a Google toolbar user, or the search box of the Google toolbar I would guess.

    I have been telling customers for years to make certain they not only block pages off with robots.txt but with the robots meta-tag as well, on anything they do not want in the listings.

    Google will still catch the url and add it to the listings sometimes, though without any description or cache.

    When you visit the url you get a 404. Your server configuration may have something to do with allowing w*w to resolve well enough to give the 404.
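
To see what a robots.txt rule actually blocks, here is a small sketch using Python's standard `urllib.robotparser`. The rules, the `/dev/` path and example.com are made-up assumptions, not anyone's real configuration, and `is_crawl_allowed` is a hypothetical helper name. It only models crawl permission; it says nothing about whether a search engine may still list the bare URL.

```python
from urllib.robotparser import RobotFileParser

# Assumed example rules: keep every crawler out of a development directory.
ROBOTS_TXT = """\
User-agent: *
Disallow: /dev/
"""


def is_crawl_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in ("http://example.com/dev/index.html", "http://example.com/index.html"):
        print(url, is_crawl_allowed(ROBOTS_TXT, "Googlebot", url))
```

The robots meta-tag mentioned above is a separate mechanism: a `<meta name="robots" content="noindex">` element placed in the page's head.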
     
    NetMidWest, May 23, 2005 IP
  11. jlawrence

    jlawrence Peon

    Messages:
    1,368
    Likes Received:
    81
    Best Answers:
    0
    Trophy Points:
    0
    #11
    If Google didn't crawl it, then wtf is it doing in an index?
    site: should return pages in the domain. Note: pages that Gbot has visited in the domain, not pages that it just happens to think might be in it.
    * is not a valid FQDN character, and Gbot should f'in know that.
    If Gbot does this regularly, it's one hell of an easy way to inflate your page count. Looks like when buying domains we'd better start checking every page listed.
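
A quick way to check the point about `*`: the sketch below applies the usual letter-digit-hyphen rule for hostname labels (letters, digits and hyphens only, no leading or trailing hyphen, at most 63 characters per label and 253 overall). The function name `is_valid_hostname` is hypothetical, and the check ignores internationalized names, so treat it as an approximation.

```python
import re

# One hostname label: 1 to 63 letters, digits or hyphens,
# not starting or ending with a hyphen.
LABEL = re.compile(r"(?!-)[A-Za-z0-9-]{1,63}(?<!-)")


def is_valid_hostname(name):
    """Return True if name looks like a syntactically valid hostname."""
    if name.endswith("."):
        name = name[:-1]  # a single trailing dot (root) is allowed
    if not name or len(name) > 253:
        return False
    return all(LABEL.fullmatch(label) for label in name.split("."))


if __name__ == "__main__":
    for host in ("www.buy-a-mattress.co.uk", "w*w.buy-a-mattress.co.uk"):
        print(host, is_valid_hostname(host))
```

Under this rule the asterisked form fails on its first label, so a crawler that validated hostnames before listing them would have rejected it.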
     
    jlawrence, May 23, 2005 IP
  12. NetMidWest

    NetMidWest Peon

    Messages:
    1,677
    Likes Received:
    151
    Best Answers:
    0
    Trophy Points:
    0
    #12
    Because the server returns a status code (404), Google sees something at w*w.whatever.tld. It should dump out at some point for being a 404.
     
    NetMidWest, May 23, 2005 IP
  13. Jan

    Jan Peon

    Messages:
    129
    Likes Received:
    0
    Best Answers:
    0
    Trophy Points:
    0
    #13
    A month or so ago someone linked to one of my pages but made a mistake at the end or right side of the URL. The link resulted in a not found but was listed in Google searches, just like the w*w. After some time it disappeared from the index and now seems to have resolved to the correct page! :eek:
    Looks like Google wants to make sure it's not missing anything that might be relevant.
     
    Jan, May 23, 2005 IP
  14. minstrel

    minstrel Illustrious Member

    Messages:
    15,082
    Likes Received:
    1,243
    Best Answers:
    0
    Trophy Points:
    480
    #14
    It doesn't actually return a 404 but a 502 error:

     
    minstrel, May 23, 2005 IP
  15. noppid

    noppid gunnin' for the quota

    Messages:
    4,246
    Likes Received:
    232
    Best Answers:
    0
    Trophy Points:
    135
    #15
    But it's not Spyware! :cool:
     
    noppid, May 23, 2005 IP
  16. minstrel

    minstrel Illustrious Member

    Messages:
    15,082
    Likes Received:
    1,243
    Best Answers:
    0
    Trophy Points:
    480
    #16
    No. And it's nae oatmeal, either!
     
    minstrel, May 23, 2005 IP
  17. Old Welsh Guy

    Old Welsh Guy Notable Member

    Messages:
    2,699
    Likes Received:
    291
    Best Answers:
    0
    Trophy Points:
    205
    #17
    Google is ripping out URLs from anywhere and sending them back. It is also reading server logs and pulling URLs from there. Sorry, but it is late, I am tired, and a little 'relaxed' ;) after watching the British Lions v Argentina, so I have not read the entire thread and links, just read through really quickly.
     
    Old Welsh Guy, May 23, 2005 IP
  18. Homer

    Homer Spirit Walker

    Messages:
    2,396
    Likes Received:
    150
    Best Answers:
    0
    Trophy Points:
    0
    #18
    Tops...good post. I wonder about LSI and LSA. If the page was crap, could they understand this?

    Further to that could they possibly assign an appropriate value (0) to that link based on understanding the article as a derogatory article?
     
    Homer, May 23, 2005 IP
  19. T0PS3O

    T0PS3O Feel Good PLC

    Messages:
    13,219
    Likes Received:
    777
    Best Answers:
    0
    Trophy Points:
    0
    #19
    Technically it's possible. Whether it's reliable on a large, unsupervised, algorithmic scale, I doubt.
     
    T0PS3O, May 26, 2005 IP
  20. Old Welsh Guy

    Old Welsh Guy Notable Member

    Messages:
    2,699
    Likes Received:
    291
    Best Answers:
    0
    Trophy Points:
    205
    #20
    Last year in London, Matt Cutts said an odd thing: that Google gets domain URLs from sources other than indexing links, in order to be the freshest for new sites. He wouldn't be pushed on this (even though we tried). This could mean that G gets domains from page text, or even domain registries?

    I know this is nothing solid but I thought I would mention it.
     
    Old Welsh Guy, May 26, 2005 IP