Since yesterday, doing a site:www.domain.xx search for my site returns only htm/html pages, not a single PDF. What does that mean? Are PDFs no longer considered valid by G! or what?
Minstrel, sure? I just checked what could be called the mother of all PDFs - the Adobe Reader help file - which is linked from several PR 10 pages, yet it's not in the index. See: http://www.adobe.com/products/acrobat/pdfs/acrruserguide.pdf Or do a search on adobe reader pdf and see how many PDFs you get...
Well Jan, you have a point here... http://www.google.com/search?as_q=i...s_occt=any&as_dt=i&as_sitesearch=&safe=images Search for 'instructions' and limit the file type to PDF. Results 1 - 3 of about 4,640,000 for instructions filetype:pdf. Not showing anything beyond those three :S
http://www.google.com/search?q=schi...=&newwindow=1&c2coff=1&safe=off&start=50&sa=N
http://www.google.com/search?as_q=a...=&as_occt=any&as_dt=i&as_sitesearch=&safe=off
http://www.google.com/search?as_q=a...=&as_occt=any&as_dt=i&as_sitesearch=&safe=off
http://www.google.com/search?hl=en&...2coff=1&as_qdr=all&q=psychopathy+filetype:pdf
Well Tops, you seem to be saying that Googlebot first visits the page just to get the links. Now, the only way I know of for the bot to get the links from a page is to fetch the whole page, return it to the repository, and then parse it for links. So what is the advantage in going back again, when you already have the page in your index? BTW, this is how Google say they do it: the URL server may decide to put this page or that one at a higher or lower priority, but there is really no need to visit a page twice just to get it indexed.
I never said it indexes it twice. It indexes it once, pulls out the URLs, and adds them to the URL queue. That queue gets sorted and crawled one by one. So the bot doesn't go back to the page where it found a URL; it starts from that URL in the list. It continues crawling from the URL queue, not from the page where it found the URL. That's all I am saying.
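Just to illustrate what we're both describing, here's a minimal sketch of that kind of queue-based crawl in plain Python. Nothing here is Google's actual code; all the names (crawl, LinkParser, MAX_PAGES, the example.com seed) are made up for the example. The point is simply that each page is fetched once, stored, its links are pushed onto a URL queue, and the crawler always continues from that queue rather than revisiting the page it found a link on.

```python
# Sketch of a queue-based (breadth-first) crawl: fetch once, store, extract
# links, keep crawling from the URL queue. Illustrative only.

from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

MAX_PAGES = 50  # small cap for the illustration


class LinkParser(HTMLParser):
    """Collects href values from anchor tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_url):
    queue = deque([seed_url])   # the URL queue ("frontier")
    seen = {seed_url}
    repository = {}             # URL -> raw HTML, fetched exactly once

    while queue and len(repository) < MAX_PAGES:
        url = queue.popleft()
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except Exception:
            continue                        # skip pages that fail to fetch
        repository[url] = html              # one fetch is enough to index it

        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)   # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)      # crawl continues from the queue

    return repository


if __name__ == "__main__":
    pages = crawl("http://example.com/")
    print(f"Fetched {len(pages)} pages")
```

So the URL server / scheduler can reorder or prioritize that queue however it likes, but no page needs to be fetched a second time just because a link was discovered on it.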