After investigating a couple of mysterious index inclusions of purposely 'hidden' sites (still under development), I'm going to toss the following statement at the DP crowd: GOOGLE CRAWLS NON-HYPERLINKED URLS!

Check this out (and I'm not the only one): http://www.google.co.uk/search?hl=en&lr=&sa=G&q=site:buy-a-mattress.co.uk - especially the 3rd link. It's a link I left here on DP back when I was building the site and asking for ideas. It's deliberately broken with an asterisk (w*w.....) so vBulletin doesn't hyperlink it, and still Google decided to go have a look.

Assumption (probably jumping to conclusions): knowing full well how the link voting system is abused nowadays, Google has decided to factor in plain-text mentions of URLs as well. Which would be odd, since the mention could be in an article about how crap the page is, and then Google thinks of it as a vote... The indexing could have been down to Toolbar visitors, but that doesn't explain the asterisked link in the site: results.

Anyone seen something similar, or able to destroy the theory?
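To make the theory concrete: if Google really is picking up plain-text URL mentions, it would only need to scan page text for URL-shaped strings instead of following <a href> tags. A rough sketch of that idea in Python (purely illustrative - obviously not Google's actual code, and note it wouldn't even catch the deliberately broken w*w variant):

# Illustrative only: pulling URL-shaped strings out of raw page text,
# independent of any <a href> markup.
import re

URL_PATTERN = re.compile(r'(?:https?://|www\.)\S+', re.IGNORECASE)

def extract_url_mentions(page_text):
    # Returns URL-looking strings; a real crawler would also strip
    # trailing punctuation and validate the hostname.
    return URL_PATTERN.findall(page_text)

sample = "I left a note about www.buy-a-mattress.co.uk here, not hyperlinked."
print(extract_url_mentions(sample))   # ['www.buy-a-mattress.co.uk']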
Surely this information could make all kinds of difference to the weighting of a website then? Would the weight be distributed to mentioned URLs as equally as to linked URLs? Would leaving out the www stop Google from finding it?
Well, I haven't really thought about the implications yet, to be honest; I was first hoping to establish whether what I think I'm seeing is actually happening. I've been trying to discredit the theory myself, and the only counter-explanation I can come up with is that I searched site:buy-... (without www) as opposed to site:www.buy- (with the www), and that could perhaps explain the w*w being included. But I'm not sure on this yet.
OK... sorry, I didn't finish reading the post - it's late. But Google didn't crawl that URL - it isn't valid. Sounds more like the site: command isn't perfect, to me.
I read in a post from a search engine conference that GoogleGuy said they were able to find non-linked sites, although nothing was mentioned about how they do it. I suspect it's the toolbar, though.
Cut-and-paste into the browser address bar of a Google Toolbar user, or into the search box of the Google Toolbar, I would guess. I have been telling customers for years to make certain they block pages off not only with robots.txt but with the robots meta tag as well, on anything they do not want in the listings. Google will sometimes still catch the URL and add it to the listings, though without any description or cache. When you visit the URL you get a 404. Your server configuration may have something to do with allowing w*w to resolve well enough to give the 404.
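For reference, the belt-and-braces setup I mean is the standard pair below (the /hidden-stuff/ path is just a made-up example; the meta tag goes in the <head> of every page you want kept out):

# robots.txt - stops compliant bots from crawling the directory
User-agent: *
Disallow: /hidden-stuff/

<!-- robots meta tag - tells them not to index the page even if they find the URL -->
<meta name="robots" content="noindex, nofollow">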
If Google didn't crawl it, then wtf is it doing in an index? site: should return pages in the domain - note, pages that Googlebot has actually visited in the domain, not pages it just happens to think might be in it. * is not a valid FQDN character, and Googlebot should know that. If Googlebot does this regularly, it's one hell of an easy way to inflate your page count; looks like when buying domains we'd better start checking every page listed.
Because the server returns a status code (404), Google sees something at w*w.whatever.tld. It should drop out of the index at some point for being a 404.
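If you want to see what your own server does for a hostname like that, you can skip DNS and just ask the web server directly with an unexpected Host header. A rough sketch in Python (hypothetical, reusing the domain from the first post and assuming the site answers on port 80):

# Ask the server what it returns for an unexpected hostname,
# bypassing DNS by setting the Host header by hand.
import http.client

conn = http.client.HTTPConnection("buy-a-mattress.co.uk", 80, timeout=10)
conn.request("GET", "/", headers={"Host": "w*w.buy-a-mattress.co.uk"})
resp = conn.getresponse()
print(resp.status, resp.reason)   # e.g. 404 Not Found if there's no catch-all vhost
conn.close()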
A month or so ago someone linked to one of my pages but made a mistake at the right-hand end of the URL. The link resulted in a not-found, but it was listed in Google searches, just like the w*w one. After some time it disappeared from the index and now seems to have resolved to the correct page! Looks like Google wants to make sure it's not missing anything that might be relevant.
Google is ripping URLs out of anywhere it can and feeding them back into the crawl. It is also reading server logs and pulling URLs from there. Sorry, but it is late, I am tired, and a little 'relaxed' after watching the British Lions v Argentina, so I have not read the entire thread and links, just skimmed through really quickly.
Tops... good post. I wonder about LSI and LSA. If the page was crap, could they understand this? Further to that, could they possibly assign an appropriate value (0) to that link, based on understanding that the article is derogatory?
Last year in London, Matt Cutts said an odd thing: he said that Google gets domain URLs from sources other than the links it indexes, in order to be the freshest for new sites. He wouldn't be pushed on this (even though we tried). This could mean that G gets domains from page text, or even from domain registries? I know this is nothing solid, but I thought I would mention it.