How can I find websites whose robots.txt disallows the Google spider?

Discussion in 'robots.txt' started by jiangnancun, May 13, 2014.

  1. #1
    I want to find websites whose robots.txt disallows the Google spider or other search engine spiders, so that Google does not index their pages.
    For such a site, searching site:Domain.com on google.com returns 0 results.

    Its robots.txt is below:

    User-agent: *
    Disallow: /
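
    A rule set like the one above can also be tested programmatically. The following is a minimal sketch using Python's standard urllib.robotparser; the function name disallows_root is hypothetical, and it only checks the root path "/", which is an assumption about what "blocked" means here.

```python
from urllib.robotparser import RobotFileParser


def disallows_root(robots_txt: str, user_agent: str = "Googlebot") -> bool:
    """Return True if the given robots.txt text forbids user_agent from fetching "/"."""
    parser = RobotFileParser()
    # parse() takes the file content as a list of lines, so no network access is needed
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, "/")


# The rule set quoted in the post blocks every spider, Googlebot included
print(disallows_root("User-agent: *\nDisallow: /"))
```

    An empty Disallow: line allows everything, so the same check returns False for it.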


    Could you tell me which websites are like this? How can I find more websites like this?
     
    jiangnancun, May 13, 2014
  2. sarahk

    sarahk iTamer Staff

    #2
    I'm not sure why you want to find them, but you could always write your own spider that ignores robots.txt. Program it to index only the home page of each site, scrape the links to other sites, and run until you have the desired number of sites.
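
    The spider described above could be sketched roughly as follows, using only the Python standard library. This is an illustration under assumptions, not a finished crawler: the names LinkCollector, fetch_text and find_blocking_sites are hypothetical, "blocked" is taken to mean that "/" is disallowed for Googlebot, and rate limiting is left out.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen
from urllib.robotparser import RobotFileParser


class LinkCollector(HTMLParser):
    """Collects the href values of <a> tags from one HTML page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def fetch_text(url):
    """Download a URL as text; return "" on common network or HTTP errors."""
    try:
        with urlopen(url, timeout=10) as response:
            return response.read().decode("utf-8", errors="replace")
    except (OSError, ValueError):
        return ""


def find_blocking_sites(seed_url, wanted, fetch=fetch_text,
                        agent="Googlebot", max_sites=200):
    """Visit home pages breadth-first and return the sites whose
    robots.txt disallows "/" for the given agent."""
    queue = deque([seed_url])
    seen = set()
    blocking = []
    while queue and len(blocking) < wanted and len(seen) < max_sites:
        parts = urlparse(queue.popleft())
        if parts.scheme not in ("http", "https") or not parts.netloc:
            continue
        home = f"{parts.scheme}://{parts.netloc}/"
        if home in seen:
            continue
        seen.add(home)
        # A missing or unreadable robots.txt parses as "everything allowed"
        rules = RobotFileParser()
        rules.parse(fetch(urljoin(home, "robots.txt")).splitlines())
        if not rules.can_fetch(agent, "/"):
            blocking.append(home)
        # Only the home page is read, regardless of what robots.txt says
        collector = LinkCollector()
        collector.feed(fetch(home))
        for href in collector.hrefs:
            queue.append(urljoin(home, href))
    return blocking
```

    Passing the download function in as the fetch parameter keeps the crawl logic testable without network access; a real run would also want a delay between requests.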

    What will you be doing with the info?
     
    sarahk, May 13, 2014
  3. neroux

    neroux Active Member

    #3
    Same question.
     
    neroux, May 13, 2014