I have a very old backup copy of my website and decided to reinstall it in a subdirectory of my current, updated website (I occasionally need the old site for reference). I password-protected the subdirectory from cPanel so that no one can see it except me. To my surprise, Google has indexed a page from this subdirectory even though it is protected, and that worries me because of the duplicate content. Any clue about this?
It's best to exclude any pages you don't want found in your robots.txt file, as well as adding "noindex" tags to the pages themselves. Google wouldn't have spidered your page by sending Googlebot through your login form, but it could have found the URL some other way, such as through unsecured server logs or an external link.
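As a rough sketch, assuming the old copy lives in a subdirectory called /oldsite/ (a placeholder name, swap in your own), the robots.txt rule and the noindex tag would look something like this:

    # robots.txt at the site root - asks well-behaved crawlers not to fetch the old copy
    User-agent: *
    Disallow: /oldsite/

    <!-- in the <head> of each old page - asks Google to drop the page from its index -->
    <meta name="robots" content="noindex">

One caveat: if the directory is blocked in robots.txt, Googlebot can't fetch the pages to see the noindex tag, so for a URL that is already indexed you may want to let it be crawled once (or use Search Console's removal tool) before blocking it.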
Google can't read the content behind the password, but it can still index a protected URL itself if it finds a link to it. Protecting the directory and keeping individual URLs out of the search results are two different things.
Yes, robots.txt is one of the best ways to stop Google and other major crawlers from crawling unwanted or protected URLs.
I agree with most of the experts above. Robots.txt is the best way to restrict Google from crawling your web pages.