Firstly, hello, this is my first post here! Secondly, I'd like to ask your opinion about something I'm working on. A site I'm trying to fix up for SEO has many pages on the server with a .html extension that are no longer in use, but that have exactly the same content as .php files in the same place. (I hope that makes sense!) So basically each document exists in two forms, .php and .html; only the .php documents are in use now, and they're the ones linked in the navigation/sitemaps. Do you think having both sets on the server will count as duplicate content, even though only one set is actually in use? Should I delete the .html files? (I know I technically should, but there are hundreds and hundreds of them and I've got a lot to do in a short amount of time. If it's not really going to help, I can save that job for a few weeks' time.) Thanks!
Can anybody help me? Further to the .php/.html duplicate pages, I've found today that there are files on the server named, for example, lecture_one.php and lecture one.php (one with a space, one with an underscore). As far as I can tell, only one of each pair is linked, though sometimes it's the space version and sometimes the underscore one. So will all these 'orphan' pages count as duplicate content on the server?
It's not the file extensions that create duplicate content, but the links, titles and descriptions of pages and their actual content. Change the links: lecture_one.php to new_lecture_one.php, or whatever.
Sorry, I'm a little slow at times: all the pages are exact copies. Some are page1.html > page1.php (only one set in use, the others just exist on the server); some are page_1.html > page 1.html (one set in use, but the others exist on the server). (Don't ask me why, I had nothing to do with building this site, I'm just working on the SEO!) So, with a page replicated exactly on the server, even though it's not linked from anywhere, can Google crawl it and see it as duplicate content?
If you are only using one of the two, .php or .html, like you've said, I think there is nothing wrong with that. As long as no identical content is linked, there's nothing you have to worry about. Duplicate content isn't judged only on the URL but on what's in the content itself.
Actually, you're going to want to either delete the HTML files and forward any links to them to the PHP versions with a series of 301 redirects, or resave the PHP files as .html files and have Apache parse the HTML files as if they were PHP files.
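Something like this in .htaccess should cover the 301 option (a rough sketch, assuming Apache with mod_rewrite enabled; test it before deploying):

```apache
RewriteEngine On
# 301-redirect every request for a .html file to its .php counterpart
RewriteRule ^(.*)\.html$ /$1.php [R=301,L]

# Alternative (the "parse .html as PHP" option) -- the exact handler name
# depends on how PHP is installed on your server:
# AddHandler application/x-httpd-php .html
```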
OK Carly, since no one seems to be answering the question directly, I will. I am a web programmer and I can tell you that if there are NO links to those HTML files, Google will not be able to crawl them. And if anybody says otherwise, they don't know what they are talking about. Search engine crawlers follow links; that's how they crawl. They DO NOT search directories in an attempt to create a link to any and all files that exist there. So to answer your duplicate content question: you have nothing to worry about. If there are no links to those files, Google will never crawl or list them.
Thank you, that is as I thought. I presumed this would be the case, just needed to check. I was worried in case the new version (e.g. essay1.php) was exactly the same as an older, cached or previously indexed version (e.g. essay 1.html). RE: the previous message, I already added 301 redirects to the .htaccess.
Duplicate content is strictly prohibited by search engines; your site could be penalized, and at worst it could be banned!
The most important thing with duplicate content is to have a big enough part of each page unique from the search engine's perspective. I would say more than 60% of a page should be different from any other page.
Percentage has nothing to do with it. By your theory I could take 10,000 Wikipedia entries, change the order and wording around (equalling a 60% change) and I'd be fine. Sorry, it doesn't work that way.
I agree with this quote. Part of the reason many websites' sub-pages don't rank high for longer-tail keywords is because the content from their template has a higher word count than the content on any one page. In other words, Google is flagging similar pages in your website with a duplicate content filter because not a high enough percentage of the page has unique content. One thing you can do is copy and paste all the words from the header, sidebar, and footer of your website into Microsoft Word, which will then automatically give you the word count for your template. I would aim to exceed this word count on every page of your site that you want ranked with unique content, OR reduce the word count on your template by converting parts from text to images or removing unnecessary links, so that the percentage of unique content is higher. Hopefully this makes sense the way I explained it.
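To put rough numbers on the percentage idea (a made-up illustration; on a real site you'd extract the rendered text of the template and of each page, and whether search engines actually apply such a threshold is this poster's theory, not documented fact):

```python
# Rough sketch of the "unique content percentage" idea.
# Both text strings here are invented for illustration.
template_text = "Home About Lectures Contact Copyright Example Site"
body_text = "This lecture covers the basics of search engine optimisation in depth"
page_text = template_text + " " + body_text  # roughly what a crawler sees on one page

template_words = len(template_text.split())
page_words = len(page_text.split())
unique_pct = 100 * (page_words - template_words) / page_words

print(f"{unique_pct:.0f}% of the page's words are unique to it")  # -> 61%
```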
^ Thank you, Chris. It's amazing that out of 12 (approx.) replies only 2 or 3 actually answered the question. Yeah. Thanks. Erm... that's the actual reason I posted the thread! I wanted to know if having two sets of the same pages on the server, with only one set in use, could count against us SEO-wise. Maybe I don't make myself very clear!
Yes, but you shouldn't worry about one or two duplicates; it only matters if you are mass-submitting or reproducing the same content.