Robots.txt spidering question

Discussion in 'Google' started by badrobot, Aug 13, 2007.

  1. #1
    Here's a question about robots.txt that I couldn't get a clear answer on. I have a WordPress site at domain.com. I use pagination, which gives me pages like domain.com/page/1/. If I add this to robots.txt:

    Disallow: /page/

    Will Google still spider domain.com/page/1/, domain.com/page/2/, and so on for links? Or will the spider ignore them completely? I already have a generated sitemap too (not sure if that will tell Google to spider them as well). If Google doesn't spider those pages, how would I get the spider to find my posts (domain.com/2007/06/post) without getting a dupe content penalty?
     
    badrobot, Aug 13, 2007
  2. emptymirror

    emptymirror Well-Known Member

    #2
    Disallow: /page/ tells the spider to ignore everything inside the /page/ folder. So /page/1/ will be ignored, and so on.
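
    A quick way to check this kind of rule is Python's standard-library urllib.robotparser. This is only a sketch: the robots.txt text and the domain.com URLs below are made up to mirror the thread, and the stdlib parser does simple prefix matching, which may not reproduce Googlebot's behavior exactly.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt mirroring the rule discussed in this thread.
ROBOTS_TXT = """\
User-agent: *
Disallow: /page/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text directly instead of fetching it over HTTP."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed(parser: RobotFileParser, url: str, agent: str = "Googlebot") -> bool:
    """Return True if the given user agent may fetch the URL."""
    return parser.can_fetch(agent, url)


parser = build_parser(ROBOTS_TXT)

# Illustrative URLs only; domain.com stands in for the asker's site.
for url in (
    "http://domain.com/",
    "http://domain.com/page/1/",
    "http://domain.com/page/2/",
    "http://domain.com/pages/1/",
    "http://domain.com/2007/06/post",
):
    print(url, "allowed" if allowed(parser, url) else "blocked")
```

    Under this parser, /pages/1/ stays allowed because Disallow rules are matched as path prefixes, and /pages/ does not start with /page/.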

    Not sure how to answer your dupe content question, though, sorry.

    best,
    Denise
     
    emptymirror, Aug 13, 2007
  3. godmode

    godmode Well-Known Member

    #3
    Well, if you disallow /page/, then it won't spider that directory, and that automatically solves your duplicate content question, doesn't it? Google will only find that content through your normal path, 2006/17/xxx.html
     
    godmode, Aug 13, 2007
  4. badrobot

    badrobot Active Member

    #4
    Yes, but if I disallow /page/, how else will the spider find links? Say it hits the homepage: it'll spider the first 8 links to posts and index those. How will it get to page 2 to spider those posts if it's blocked?
     
    badrobot, Aug 13, 2007
  5. godmode

    godmode Well-Known Member

    #5
    Well, doesn't page 2 have a link similar to 2006/17/2 or something like that? It has to, since pagination is something that you build into the site. A normal link structure must be there.
     
    godmode, Aug 13, 2007
  6. badrobot

    badrobot Active Member

    #6
    The site structure is like this

    Homepage: domain.com

    ---- The page links at the bottom
    domain.com/page/2/
    domain.com/page/3/

    A post is like so: domain.com/yyyy/dd/post-name/

    The homepage will show the last 8 posts, which the spider will get to fine. But once robots.txt is read, domain.com/page/x/ will be blocked, so there will be no way to go to the second page. The /page/ URLs hold the link structure for the spider (if that makes sense).
     
    badrobot, Aug 13, 2007
  7. godmode

    godmode Well-Known Member

    #7
    Do you understand how Googlebot works? It doesn't matter whether your posts are on the homepage or not.

    Just submit a sitemap to Google to let Googlebot know your internal link structure.

    This is enough for Googlebot to find each individual page. Just remember to have everything inter-linked from the homepage for better deep crawling.
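
    As a rough sketch of what such a sitemap could look like, the Python below builds a minimal one that lists post URLs directly, so finding them does not depend on crawling /page/. The post URLs and the build_sitemap helper are hypothetical; on a real WordPress site a sitemap plugin would typically generate this file.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return a minimal sitemap XML string with one <url><loc> per URL."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical post URLs; note that no /page/ URLs are listed.
post_urls = [
    "http://domain.com/2007/06/post-one/",
    "http://domain.com/2007/07/post-two/",
]

sitemap_xml = build_sitemap(post_urls)
print(sitemap_xml)
```

    Listing the posts themselves, rather than the paginated index pages, keeps the sitemap independent of whatever robots.txt says about /page/.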

    Good luck
     
    godmode, Aug 14, 2007
  8. nasskobar

    nasskobar Well-Known Member

    #8
    After going through this thread, I have a question as well. I have been advised to use the following disallow line:

    Disallow: / ?

    I understand that ? could disallow dynamic pages, but what does / mean?

    Could someone help please?
     
    nasskobar, Aug 14, 2007
  9. trichnosis

    trichnosis Prominent Member

    #9
    If you add Disallow: /page/ to your robots.txt file, Google will not crawl pages like /page/1/, /page/2/, etc.
     
    trichnosis, Aug 14, 2007
  10. badrobot

    badrobot Active Member

    #10
    How would one inter-link 900 blog posts from the homepage if robots.txt blocks /page/? The only way the posts get linked is through the next link, /page/2/, which is blocked.
     
    badrobot, Aug 14, 2007
  11. godmode

    godmode Well-Known Member

    #11
    Just link to them from an "archives" page that is linked from the homepage. No need for page 1, page 2.
     
    godmode, Aug 15, 2007