Here's a question about robots.txt that I couldn't get a clear answer on. I have a WordPress site, like domain.com, and I use pagination, which gives me pages like domain.com/page/1/. If I add this to robots.txt:

Disallow: /page/

will Google still spider domain.com/page/1/, domain.com/page/2/, etc. for links? Or will the spider just ignore them completely? I already have a generated sitemap too (not sure if that tells Google to spider them as well). If they don't get spidered, how would I get the spider to find my posts (domain.com/2007/06/post) and not get a dupe content penalty?
Disallow: /page/ tells the spider to ignore everything inside the /page/ directory, so /page/1/ will be ignored, etc. Not sure how to answer your dupe content question, though, sorry. best, Denise
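For reference, a minimal robots.txt sketch of that rule, using the paths from this thread (note that Disallow matches by URL prefix, so anything whose path starts with /page/ is covered):

User-agent: *
# Any URL whose path begins with /page/ is off-limits to compliant crawlers:
# /page/1/, /page/2/, and so on. All other paths stay crawlable.
Disallow: /page/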
Well, if you Disallow: /page/ then it won't spider that directory, which automatically solves your duplicate content question, doesn't it? Google will only find that content from your normal path, 2006/17/xxx.html.
Yes, but if I disallow /page/, how else will the spider find links? Say it hits the homepage: it'll spider the first 8 links to posts and index those. How will it get to page 2 to spider those posts if it's blocked?
Well, doesn't page 2 have a link similar to 2006/17/2 or something like that? It has to, since pagination is something you build into the site. A normal link structure must be there.
The site structure is like this:

Homepage: domain.com
Page links at the bottom: domain.com/page/2/, domain.com/page/3/
A post: domain.com/yyyy/dd/post-name/

The homepage shows the last 8 posts, which the spider will reach fine. But once it reads robots.txt, domain.com/page/x/ will be blocked, so there will be no way to get to the second page. The /page/ URLs hold the link structure together for the spider (if that makes sense).
Do you understand how Googlebot works? It doesn't matter whether your pages are linked from the homepage or not. Just submit a sitemap to Google and let Googlebot know your internal link structure; that is enough for Googlebot to find each individual page. Just remember to keep everything inter-linked from the homepage for better deep crawling. Good luck.
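For what it's worth, one way to point crawlers at an existing sitemap is a Sitemap line in robots.txt itself; a minimal sketch (the sitemap URL below is a placeholder, adjust it to wherever your sitemap plugin actually writes the file):

User-agent: *
# Keep the paginated archive pages out of the crawl...
Disallow: /page/

# ...but still hand the crawler a full list of post URLs.
Sitemap: http://domain.com/sitemap.xml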
After having gone through this, I have a concern as well. I have been advised to use Disallow: / ? as the disallow line. I understand that the ? could be there to disallow dynamic pages, but what does the / mean? Could someone help, please?
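For context, those two characters do very different things; a minimal sketch of each, shown separately (the wildcard form assumes Google's pattern-matching extension, which not every crawler honors):

# Option 1: Disallow: / on its own blocks the entire site,
# because every URL path starts with /.
User-agent: *
Disallow: /

# Option 2: block only dynamic URLs, i.e. those containing a query string.
User-agent: *
Disallow: /*?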
If you add Disallow: /page/ to your robots.txt file, Google will not crawl pages like /page/1/, /page/2/, etc.
How would one inter-link 900 blog posts from the homepage if robots.txt blocks /page/? The only way the older posts get linked is by following the next link, which is /page/2/, and that is blocked.