Google Crawls Non-Hyperlinked URLs

T0PS3O Feel Good PLC

#1

After investigating a couple of mysterious index inclusions of purposefully 'hidden' sites (being developed) I'm going to toss the following statement into the DP crowd...

GOOGLE CRAWLS NON-HYPERLINKED URLS!

Check this out (and I'm not the only one):

http://www.google.co.uk/search?hl=en&lr=&sa=G&q=site:buy-a-mattress.co.uk

Especially the 3rd link: a link I left here on DP back when I was building the site and asking for ideas. It's broken with an * (w*w.....) so VB doesn't hyperlink it, and still Google decided to go have a look.

Assumption (probably jumping to conclusions):

Realizing full well how the link voting system is abused nowadays, Google has decided to factor in in-text mentions of URLs as well.

Which would be odd since it could have been an article about how crap the page was and then Google thinks of it as a vote...

The indexing could have been down to Toolbar visitors but that doesn't explain the asterisked link in the site: results.

Anyone seen something similar or able to destroy the theory?
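
For anyone who wants to test the in-text mention theory on their own pages, here is a minimal Python sketch of how URL-like strings could be pulled out of plain, non-hyperlinked text. It is only an illustration: the pattern and the find_url_mentions name are made up for this example, example.co.uk is a placeholder domain, and nothing here is known to reflect how Google actually works.

```python
import re

# Hypothetical pattern, NOT Google's logic: matches bare host names such as
# "www.example.co.uk" or "example.com/path", with or without a scheme.
URL_MENTION = re.compile(
    r"(?:https?://)?"        # optional scheme
    r"(?:[a-z0-9*-]+\.)+"    # one or more labels; '*' tolerated to catch w*w-style mentions
    r"[a-z]{2,}"             # top-level domain
    r"(?:/[^\s<>\"']*)?",    # optional path
    re.IGNORECASE,
)


def find_url_mentions(text: str) -> list[str]:
    """Return URL-like strings found in plain text, in order of appearance."""
    # Trailing sentence punctuation is stripped from each match.
    return [m.group(0).rstrip(".,;:!?)") for m in URL_MENTION.finditer(text)]


if __name__ == "__main__":
    post = "Have a look at w*w.example.co.uk and tell me what you think."
    print(find_url_mentions(post))
```

A pattern this loose picks up the asterisked form as well, which would be one way for a broken mention to end up in a list of candidate URLs.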

T0PS3O, May 23, 2005

DangerMouse Peon

#2

google toolbar...

DangerMouse, May 23, 2005

Weirfire Language Translation Company

#3

This information could make all kinds of differences to the weighting of a website then surely?

Would the weight be distributed equally between the mentioned URLs and the linked URLs? Would leaving out the www stop Google finding it?

Weirfire, May 23, 2005

T0PS3O Feel Good PLC

#4

DangerMouse said:

google toolbar...

Interesting you should mention that...

T0PS3O said:

The indexing could have been down to Toolbar visitors but that doesn't explain the asterisked link in the site: results.

T0PS3O, May 23, 2005

T0PS3O Feel Good PLC

#5

Weirfire said:

This information could make all kinds of differences to the weighting of a website then surely?

Would the weight be distributed equally between the mentioned URLs and the linked URLs? Would leaving out the www stop Google finding it?

Well I haven't really thought about the implications yet to be honest, I was first hoping to establish whether what I think I'm seeing is actually happening.

I've been trying to discredit the theory myself, and the only counter-explanation I can come up with is that I searched site:buy-... (without www) as opposed to site:www.buy- (with the www), and that could perhaps explain the w*w being included.

But I'm not sure on this yet.

T0PS3O, May 23, 2005

Weirfire Language Translation Company

#6

T0PS3O said:

Well I haven't really thought about the implications yet to be honest, I was first hoping to establish whether what I think I'm seeing is actually happening.

I've been trying to discredit the theory myself, and the only counter-explanation I can come up with is that I searched site:buy-... (without www) as opposed to site:www.buy- (with the www), and that could perhaps explain the w*w being included.

But I'm not sure on this yet.

Well, the mcdarians of DP will be experimenting with this theory tomorrow. You can count on it.

Weirfire, May 23, 2005

DangerMouse Peon

#7

OK.. sorry I didn't finish reading the post - it's late

But Google didn't crawl the URL - it isn't valid.

Sounds more like the site: command isn't perfect to me

DangerMouse, May 23, 2005

l234244 Peon

#8

I read in a post from a search engine conference that some googleguy said they were able to find non-linked sites, although nothing was mentioned about how they did it. I suspect it's the toolbar though.

l234244, May 23, 2005

tresman Well-Known Member

#9

Of course it didn't. There is no page at all there, what should google crawl then?

tresman, May 23, 2005

NetMidWest Peon

#10

Cut-n-paste to the browser address bar of a Google toolbar user, or the search box of the Google toolbar I would guess.

I have been telling customers for years to make certain they not only block pages off with robots.txt but with the robots meta-tag as well, on anything they do not want in the listings.

Google will still catch the url and add it to the listings sometimes, though without any description or cache.

When you visit the url you get a 404. Your server configuration may have something to do with allowing w*w to resolve well enough to give the 404.
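
To make the two blocking layers mentioned above concrete, here is a rough Python sketch that checks both: whether a robots.txt rule disallows a URL, and whether a page carries a robots meta tag with noindex. The robots.txt content, the /dev/ path, the example.com URLs and the helper names (is_crawl_allowed, has_noindex) are all assumptions made up for the illustration; only the standard library modules are real.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that blocks a development folder for all crawlers.
ROBOTS_TXT = """\
User-agent: *
Disallow: /dev/
"""


def is_crawl_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check a URL against robots.txt rules supplied as a string."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


class RobotsMetaFinder(HTMLParser):
    """Collect directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta" and (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            self.directives.update(
                d.strip().lower() for d in content.split(",") if d.strip()
            )


def has_noindex(html: str) -> bool:
    """Return True if the page asks robots not to index it."""
    finder = RobotsMetaFinder()
    finder.feed(html)
    return "noindex" in finder.directives


if __name__ == "__main__":
    print(is_crawl_allowed(ROBOTS_TXT, "Googlebot", "http://www.example.com/dev/index.html"))
    print(has_noindex('<meta name="robots" content="noindex, nofollow">'))
```

Running both checks against a site under development is a quick way to confirm that each layer is actually in place before relying on it.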

NetMidWest, May 23, 2005

jlawrence Peon

#11

If Google didn't crawl it, then wtf is it doing in the index?
site: should return pages in the domain. Note: pages that Gbot has visited in the domain, not pages that it just happens to think might be in it.
* is not a valid FQDN character, and Gbot should f'in know that.
If Gbot does this regularly, it's one hell of an easy way to inflate your page count. Looks like when buying domains we'd better start checking every page listed.
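
The point about * can be checked mechanically. Below is a minimal sketch of a host name validator based on the usual letters-digits-hyphen rule for host names (labels of 1 to 63 characters, no leading or trailing hyphen, 253 characters overall). The function name is_valid_hostname and the example.co.uk host names are illustrative assumptions, and the check is deliberately simplified.

```python
import re

# One DNS label under the letters-digits-hyphen rule:
# 1 to 63 characters, not starting or ending with a hyphen.
LABEL = re.compile(r"(?!-)[A-Za-z0-9-]{1,63}(?<!-)")


def is_valid_hostname(hostname: str) -> bool:
    """Return True if every label of the host name passes the LDH rule."""
    if hostname.endswith("."):
        hostname = hostname[:-1]  # a single trailing dot (root) is allowed
    if not hostname or len(hostname) > 253:
        return False
    return all(LABEL.fullmatch(label) for label in hostname.split("."))


if __name__ == "__main__":
    for host in ("www.example.co.uk", "w*w.example.co.uk"):
        print(host, is_valid_hostname(host))
```

A crawler applying a filter like this before queueing URLs would reject the asterisked host name outright instead of listing it.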

jlawrence, May 23, 2005

NetMidWest Peon

#12

Because the server returns a status code (404), Google sees something at w*w.whatever.tld. It should dump out at some point for being a 404.

NetMidWest, May 23, 2005

Jan Peon

#13

A month or so ago someone linked to one of my pages but made a mistake at the end or right side of the URL. The link resulted in a not found but was listed in Google searches, just like the w*w. After some time it disappeared from the index and now seems to have resolved to the correct page!
Looks like Google wants to make sure it's not missing anything that might be relevant.

Jan, May 23, 2005

minstrel Illustrious Member

#14

It doesn't actually return a 404 but a 502 error:

Bad Gateway
The following error occurred:
The host name was not found during the DNS lookup. Contact your system administrator if the problem is not found by retrying the URL.
Code: DNS_HOST_NOT_FOUND

minstrel, May 23, 2005

noppid gunnin' for the quota

#15

DangerMouse said:

google toolbar...

But it's not Spyware!

noppid, May 23, 2005

minstrel Illustrious Member

#16

noppid said:

But it's not Spyware!

No. And it's nae oatmeal, either!

minstrel, May 23, 2005

Old Welsh Guy Notable Member

#17

Google is ripping out URLs from anywhere and sending them back. It is also reading server logs and pulling URLs from there. Sorry but it is late, I am tired, and a little 'relaxed' after watching the British Lions v Argentina, so I have not read the entire thread and links, just read through really quickly.

Old Welsh Guy, May 23, 2005

Homer Spirit Walker

#18

T0PS3O said:

Realizing full well how the link voting system is abused nowadays, Google has decided to factor in in-text mentions of URLs as well.

Which would be odd since it could have been an article about how crap the page was and then Google thinks of it as a vote...

Tops...good post. I wonder about LSI and LSA. If the page was crap, could they understand this?

Further to that could they possibly assign an appropriate value (0) to that link based on understanding the article as a derogatory article?

Homer, May 23, 2005

T0PS3O Feel Good PLC

#19

Homer said:

Tops...good post. I wonder about LSI and LSA. If the page was crap, could they understand this?

Further to that could they possibly assign an appropriate value (0) to that link based on understanding the article as a derogatory article?

Technically it's possible. Whether it's reliable on a large unsupervised algorithmic scale I doubt.

T0PS3O, May 26, 2005

Old Welsh Guy Notable Member

#20

Last year in London, Matt Cutts said an odd thing: Google gets domain URLs from sources other than indexing links, in order to be the freshest for new sites. He wouldn't be pushed on this (even though we tried). This could mean that G gets domains from page text, or even domain registries?

I know this is nothing solid but I thought I would mention it.

Old Welsh Guy, May 26, 2005
