I decided to put a WordPress blog on my site, more as a back end for adding new content than as the main site. This was about two weeks ago. So I put it in its own directory (.com/news/) and started posting. My first posts were "TEXT TEXT TEXT ..." ad infinitum, just to get used to the categories, the URI rewriting, how photos would look, etc. I also put up a file in the root directory called "testex.html," which was a dupe of index.html except that it had a PHP RSS parser showing my latest posts instead of the main content on index.html. There were NO inbound links to this subdirectory, but there was an outbound link to the site homepage (...com/) in the blogroll on every page, and testex.html had outbound links to almost every page on the site (but no inbound links pointed to testex.html). I don't know how it happened, but last night I was in Sitemaps, clicked on "pages indexed," and it came up with all these pages I thought would never get indexed. They were cached on January 15, about 10 days after I downloaded WordPress: pages like "Seventh Post - Image test" with "Text text text text..." in the excerpt beneath. All of these pages return a 404 now - I deleted them before I even knew they were indexed. I know, I should've made a robots.txt exclusion for the WP directory, but I am lazy. And I had no idea G would index these pages so quickly (or at all). Has anyone seen this happen before? More importantly, is there a way to remove pages from the index?
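For reference, the robots.txt exclusion mentioned in the post above can be checked offline before it goes live. This is only a minimal sketch using Python's standard `urllib.robotparser`; the example.com host, the post slug, and the exact rules are assumptions for illustration, not the poster's real setup.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for the setup described above: keep crawlers out of
# the test blog directory and the test copy of the home page.
ROBOTS_TXT = """\
User-agent: *
Disallow: /news/
Disallow: /testex.html
"""


def is_allowed(robots_txt: str, url: str, agent: str = "Googlebot") -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse in memory, no network access
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    # example.com and the post slug are placeholders, not real URLs.
    for path in ("/news/seventh-post/", "/testex.html", "/index.html"):
        url = "http://www.example.com" + path
        print(path, "allowed" if is_allowed(ROBOTS_TXT, url) else "blocked")
```

Keep in mind that a Disallow rule only asks crawlers not to fetch those paths; it would not by itself pull out pages that are already in the index.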
Happens all the time. It's usually caused by viewing the pages in a browser with the Google toolbar built in. There's a theory that the built-in spyware adds any pages you view to Google's crawl list.
It's neither the Google toolbar nor a sitemap generator. More likely the pages were indexed through the Ping-O-Matic notifications that are enabled by default in every WordPress install. Check your blog's options - /wp-admin/options-writing.php
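To give an idea of what such a notification looks like, here is a rough sketch of the XML-RPC `weblogUpdates.ping` request that update services of this kind generally accept. The blog name and URL are placeholders, and this is not WordPress's own code; it only builds the request body and sends nothing over the network.

```python
import xmlrpc.client


def build_ping(blog_name: str, blog_url: str) -> str:
    """Build the XML body of a weblogUpdates.ping call (nothing is sent)."""
    # The two positional parameters are the blog's title and its URL.
    return xmlrpc.client.dumps((blog_name, blog_url),
                               methodname="weblogUpdates.ping")


if __name__ == "__main__":
    # Placeholder values, not the original poster's site.
    print(build_ping("Test Blog", "http://www.example.com/news/"))
```

The point is that the blog's URL is sent to a third-party service every time a post is published, which is enough for crawlers to discover a directory that has no inbound links.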
The Ping-O-Matic notification is a suspect. So are crawl caching proxy servers. Running AdSense would also cause it to get indexed. I used to think that the toolbar might be one of the services feeding the proxy servers, but this post says otherwise.
Yep, it appears that this is the culprit, since my "sitemap generator" is Notepad (I'm still learning). Thanks for the input, I'll be sure to change the options next time I launch with WP.
Several years ago, when the G toolbar was a novelty, there were many rumors about what it was really doing for G. One opinion was that G was gathering information about surfers' activity on the web, and some said that sites without inbound links could be revealed to G through the toolbar. Unfortunately, I can neither confirm nor disprove these rumors.
Google does not use the toolbar to index pages; Matt Cutts talks about this all the time on his blog. The culprit could be one of the automatic pinging services in WordPress, and Google can also pick up URLs from raw server logs if they are publicly accessible. Here is the link to his blog post: http://www.mattcutts.com/blog/debunking-toolbar-doesnt-lead-to-page-being-indexed/