Hi, Since more and more search engines rely on page content to determine relevance, it is said that meta tags matter less and less for ranking. But many documents are in binary formats, such as PDF and PostScript. Can a search engine read such formats? If not, then meta tags still play a critical role in ranking. If yes, how does it do that? By pattern recognition, treating the file as a black box (which would be tough)? Or does Adobe expose an API that helps a machine understand the format? MS Word files are binary too, yet Microsoft and Google can search their content. Is there an open API for Word so that anyone can recover the text from the binary? I could not find any information on this topic. Please give me some pointers. Thanks, Leon
PDFs, PPTs, DOCs, XLSs, and other types of documents are, well, documents. Just like an HTML file, each format has its own signature. One obvious giveaway is the file name extension. Also, if you open any of these files in a text or hex editor, you will see that it starts with an identifying marker, similar to the <HTML></HTML> tag in HTML. From there it's just a matter of translating the file and indexing it. Translating the file is not difficult once you know the formatting rules. Here's something you may not know: with the right software you can add tags and comments to image formats like JPEG. These in turn can be read by search engines and used to help index those files. Does this help answer your question?
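To make the signature idea concrete, here is a minimal Python sketch of how an indexer might guess a format from the first few bytes of a file rather than from its extension. The function names `sniff_format` and `sniff_file` are made up for illustration, the table covers only a handful of well-known signatures, and a real crawler would check many more formats and would not rely on this alone.

```python
# Well-known leading byte sequences ("magic numbers") for a few formats.
# This table is deliberately tiny; it is an illustration, not a full list.
SIGNATURES = [
    (b"%PDF-", "pdf"),
    (b"%!PS", "postscript"),
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "ole2"),  # legacy .doc/.xls/.ppt container
    (b"\xff\xd8\xff", "jpeg"),
]


def sniff_format(head: bytes) -> str:
    """Guess a file format from its leading bytes (hypothetical helper)."""
    for magic, name in SIGNATURES:
        if head.startswith(magic):
            return name
    # HTML has no fixed magic number, so fall back to a loose text check.
    text = head.lstrip().lower()
    if text.startswith(b"<!doctype html") or text.startswith(b"<html"):
        return "html"
    return "unknown"


def sniff_file(path: str) -> str:
    """Read only the first bytes of a file and guess its format."""
    with open(path, "rb") as handle:
        return sniff_format(handle.read(64))


if __name__ == "__main__":
    print(sniff_format(b"%PDF-1.6\n"))           # pdf
    print(sniff_format(b"  <html><body>hi"))     # html
    print(sniff_format(b"plain text, no magic"))  # unknown
```

Once the format is known, the indexer can hand the file to whatever parser understands that format's rules.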
Get a PDF document. Open it. Select Edit --> Select All --> Copy. Bam, you've got the text ready to paste somewhere else! It's not that hard to get the text!
Komodo, I think my question is partially answered. I cannot find the PDF format rules online, but I did find some software that converts PDF to other formats faithfully, including the pictures. Does Adobe publish the PDF format rules? Do you know where they are? Or does Adobe sell that information to other companies, so that it is not available online? I know Adobe gives away a free IFilter for indexing PDF files. But I am afraid the information an IFilter exposes for a PDF document is not complete; for example, it may not include image information. So I deduce that software such as PDF2Word does not use the IFilter. Now I am really curious about where people can find the PDF format encoding rules. Thanks a lot, Leon
Now you are getting beyond my general knowledge into specifics I do not know. Here is what I can tell you: there are generic PDF creators and readers, but these tend to have basic functionality only. The only complete tools for PDF creation and reading that I am aware of are Adobe Acrobat and Adobe Reader. The PHP programming language has an extensive set of functions for reading, handling, and creating PDF documents. (I've never used these.) You can find a partially working example here. What exactly are you trying to do?
Hi, Komodo, this is the doc: http://partners.adobe.com/public/developer/en/pdf/PDFReference16.pdf Thanks
I use Acrobat to create PDFs and fill in all the meta tags available. The pages show up in the search engines. However, Google does not appear to rank PDF pages. What's up with this?
It depends on how the PDF is created. If it is created "properly", the text is stored as text, but some cheap converters simply make an image of the whole document, and those files can't be read by a search engine spider. I am confused by your comment that these files are "just binary". Everything on a computer is binary, since that is ultimately all a computer understands, but software interprets the binary as something else: an HTML file read by IE or Firefox as a web page, the same file read by Notepad as a text file, or a PDF read by a spider as a text file with embedded images.
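The difference between a "properly" created PDF and an image-only one can be sketched in a few lines of Python. Inside a PDF, page contents are a small drawing language: text is shown with operators such as `Tj`, while a scanned page just paints an image with `Do`. The helper name `extract_shown_text` and both sample streams below are made up for illustration, and this is a rough sketch under generous assumptions: the stream is already uncompressed, the font uses a simple encoding, and the strings contain no escaped or nested parentheses. Real PDFs usually break all three assumptions, which is why spiders use full parsers.

```python
import re
from typing import List

# Matches a literal string followed by the Tj (show text) operator, e.g. "(Hello) Tj".
# Simplification: no handling of escaped parentheses, hex strings, or TJ arrays.
TJ_PATTERN = re.compile(rb"\((.*?)\)\s*Tj", re.S)


def extract_shown_text(content_stream: bytes) -> List[str]:
    """Return the literal strings drawn with Tj in an uncompressed content stream."""
    return [raw.decode("latin-1") for raw in TJ_PATTERN.findall(content_stream)]


# A page whose text is stored as text (hypothetical sample).
text_page = b"BT /F1 12 Tf 72 720 Td (Hello, spider) Tj ET"

# A page that only paints a scanned image (hypothetical sample).
image_page = b"q 612 0 0 792 0 0 cm /Im0 Do Q"

if __name__ == "__main__":
    print(extract_shown_text(text_page))   # ['Hello, spider']
    print(extract_shown_text(image_page))  # []
```

The second page gives the spider nothing to index, even though both pages may look identical on screen.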