Here's a question about robots.txt that I couldn't get a clear answer on. I have a WordPress site, like domain.com, and I use pagination, which gives me pages like domain.com/page/1/. If I add this to robots.txt:

Disallow: /page/

will Google still spider domain.com/page/1/, domain.com/page/2/, etc. for links? Or will the spider just ignore them completely? I already have a generated sitemap too (not sure if that tells Google to spider them as well). If they don't get spidered, how would I get the spider to find my posts (domain.com/2007/06/post) and not get a dupe content penalty?
Disallow: /page/ tells the spider to ignore everything inside the /page/ directory, so /page/1/ will be ignored, etc. Not sure how to answer your dupe content question, though, sorry. best, Denise
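For reference, a minimal robots.txt sketch of that rule, using the paths from this thread (note that Disallow matches by URL prefix, so anything whose path starts with /page/ is covered):

User-agent: *
# Any URL whose path begins with /page/ is off-limits to compliant crawlers:
# /page/1/, /page/2/, and so on. All other paths stay crawlable.
Disallow: /page/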
Well, if you Disallow: /page/ then it won't spider that directory, which automatically solves your duplicate content question, doesn't it? Google will only find that content from your normal path, 2006/17/xxx.html.
Yes, but if I disallow /page/, how else will the spider find links? Say it hits the homepage: it'll spider the first 8 links to posts and index those. How will it get to page 2 to spider those posts if it's blocked?
Well, doesn't page 2 have a link similar to 2006/17/2 or something like that? It has to, since pagination is something you build into the site. A normal link structure must be there.
The site structure is like this:

Homepage: domain.com
Page links at the bottom: domain.com/page/2/, domain.com/page/3/
A post: domain.com/yyyy/dd/post-name/

The homepage shows the last 8 posts, which the spider will reach fine. But once it reads robots.txt, domain.com/page/x/ will be blocked, so there will be no way to get to the second page. The /page/ URLs hold the link structure together for the spider (if that makes sense).
Do you understand how Googlebot works? It doesn't matter whether your pages are linked from the homepage or not. Just submit a sitemap to Google and let Googlebot know your internal link structure; that is enough for Googlebot to find each individual page. Just remember to keep everything inter-linked from the homepage for better deep crawling. Good luck.
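For what it's worth, one way to point crawlers at an existing sitemap is a Sitemap line in robots.txt itself; a minimal sketch (the sitemap URL below is a placeholder, adjust it to wherever your sitemap plugin actually writes the file):

User-agent: *
# Keep the paginated archive pages out of the crawl...
Disallow: /page/

# ...but still hand the crawler a full list of post URLs.
Sitemap: http://domain.com/sitemap.xml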
After having gone through this, I have a concern as well. I have been advised to use Disallow: / ? as the disallow line. I understand that the ? could be there to disallow dynamic pages, but what does the / mean? Could someone help, please?
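For context, those two characters do very different things; a minimal sketch of each, shown separately (the wildcard form assumes Google's pattern-matching extension, which not every crawler honors):

# Option 1: Disallow: / on its own blocks the entire site,
# because every URL path starts with /.
User-agent: *
Disallow: /

# Option 2: block only dynamic URLs, i.e. those containing a query string.
User-agent: *
Disallow: /*?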
If you add Disallow: /page/ to your robots.txt file, Google will not crawl pages like /page/1/, /page/2/, etc.
How would one inter-link 900 blog posts from the homepage if robots.txt blocks /page/? The only way the older posts get linked is by following the next link, which is /page/2/, and that is blocked.