Can computers recognize PDF documents, and if so, how?

jiangok2006 Peon

#1

Hi:
Since more and more search engines rely on content to determine relevance, meta tags are said to be less and less important in ranking. But many documents are in binary formats, such as PDF and PostScript. Can a computer recognize such formats? If not, then meta tags still play a critical role in ranking. If yes, how does it do that? By using pattern recognition on something like a black box (that would be tough)? Or is there an API exposed by Adobe that helps a machine understand this format?
MS Word is also a binary format, but Microsoft and Google can search its content. Is there an open API for Word so that anyone can recover the text from the binary?
I did not find any info on this topic. Please give me some pointers.

Thanks
Leon

jiangok2006, Nov 20, 2006 IP

Komodo Tale Peon

#2

PDFs, PPTs, DOCs, XLSs, and other types of documents are, uh, well, uh, they're documents. Just like an HTML file, each format has its signatures. One obvious giveaway is the file name extension. Also, if you look at the raw bytes behind a file, you will usually see that it starts with an identifying marker, similar to the <HTML></HTML> tag in, uh, HTML. From there it's just a matter of translating the file and indexing it. Translating the file is not difficult once you know the formatting rules.
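
To make the "signature" idea concrete, here is a minimal sketch in Python. The byte prefixes in the table are the commonly documented ones for these formats, but the table is far from complete, and the helper names (`sniff_format`, `sniff_file`) are made up for this illustration. A real indexer would check much more than the first few bytes.

```python
# Leading "magic" bytes for a few formats. Illustrative only, not a full list.
SIGNATURES = [
    (b"%PDF-", "PDF"),
    (b"%!PS", "PostScript"),
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "OLE2 container (legacy DOC/XLS/PPT)"),
    (b"\xff\xd8\xff", "JPEG"),
]


def sniff_format(data: bytes) -> str:
    """Guess a format from the first bytes of a file (hypothetical helper)."""
    for magic, name in SIGNATURES:
        if data.startswith(magic):
            return name
    return "unknown"


def sniff_file(path: str) -> str:
    """Read only the start of the file and guess its format."""
    with open(path, "rb") as f:
        return sniff_format(f.read(16))
```

Checking the content this way is more reliable than trusting the file name extension, since an extension can be wrong or missing.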

Here's something you may not know. With the right software you can add tags and comments to image formats like JPEG. These in turn can be read by search engines and used to help index those files.
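
As a rough sketch of how a program could read such comments back: a JPEG file is a sequence of marker segments, and comment text lives in COM segments (marker bytes FF FE). The function below walks the segments up to the start of the image data and collects the comment payloads. The name `jpeg_comments` is made up for this example, it ignores EXIF/IPTC metadata entirely, and it assumes a well-formed header. Whether any particular search engine actually uses these comments is not something this code can tell you.

```python
import struct


def jpeg_comments(data: bytes) -> list[bytes]:
    """Return payloads of COM (0xFFFE) segments found before the scan data."""
    if data[:2] != b"\xff\xd8":  # every JPEG starts with the SOI marker
        raise ValueError("not a JPEG stream")
    comments = []
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            break  # not at a marker: malformed input, stop scanning
        marker = data[pos + 1]
        if marker == 0xFF:  # fill byte before a marker, skip it
            pos += 1
            continue
        if marker in (0xDA, 0xD9):  # start of scan or end of image
            break
        # The 2-byte big-endian length counts itself but not the marker.
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        if marker == 0xFE:
            comments.append(data[pos + 4:pos + 2 + length])
        pos += 2 + length
    return comments
```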

Does this help answer your question?

Komodo Tale, Nov 20, 2006 IP

Nintendo ♬ King of da Wackos ♬

#3

Get a PDF document. Open it. Select Edit --> Select All --> Copy. Bam, you've got the text ready to paste somewhere else!! It's not that hard to get the text!!!

Nintendo, Nov 20, 2006 IP

jiangok2006 Peon

#4


Komodo, I think my question is partially answered.
I cannot find the PDF format rules online, but I did find some software that converts PDF to other formats faithfully, including the pictures.
Has Adobe published the PDF format rules? Do you know where they are?
Or does Adobe sell such information to other companies, so it is not available online?
I know Adobe provides a free IFilter for indexing PDF files. But I am afraid the info the IFilter extracts from a PDF doc is not complete; for example, it may not include image info.
So I deduce that software such as PDF2Word does not use IFilter.
Now I am really curious about where people can find the PDF format encoding rules.

Thanks a lot
Leon

jiangok2006, Nov 21, 2006 IP

Komodo Tale Peon

#5

Now you are getting beyond my general knowledge into specifics I do not know. Here is what I can tell you:

There are generic PDF creators and readers; however, these tend to have basic functionality only.

The only complete tools for PDF creation and reading that I am aware of are Adobe Acrobat and Adobe Reader.

The PHP programming language has an extensive set of functions for reading, handling and creating PDF documents. (I've never used these.) You can find a partially working example here.

What exactly are you trying to do?

Komodo Tale, Nov 21, 2006 IP

jiangok2006 Peon

#6

Hi, Komodo, this is the doc:
http://partners.adobe.com/public/developer/en/pdf/PDFReference16.pdf
Thanks

jiangok2006, Nov 21, 2006 IP

gsh1010 Peon

#7

I use Acrobat to create PDFs and use all the meta tags available.

The pages show up on the search engines.

However, Google does not appear to rank PDF pages.

What's up with this?

gsh1010, Nov 26, 2007 IP

AstarothSolutions Peon

#8

It depends on how the PDF is created. If it is created "properly", the text is stored as text, but some cheap converters simply make an image of the whole document, and those can't be read by a search engine spider.
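
To illustrate what "stored as text" can look like, here is a deliberately naive Python sketch. In a PDF page's content stream, text is drawn with operators such as `Tj`, which takes a string in parentheses. The sketch assumes the content stream has already been located and decompressed, which a real PDF usually requires; it also ignores font encodings, the `TJ` array form, hex strings, and unescaped nested parentheses. The function name `shown_strings` and the sample streams are made up for the example.

```python
import re

# A PDF literal string "(...)" followed by the Tj (show text) operator.
# Backslash escapes such as \( and \) are allowed inside the string.
_TJ = re.compile(rb"\(((?:\\.|[^\\()])*)\)\s*Tj")


def shown_strings(content_stream: bytes) -> list[str]:
    """Collect literal strings drawn with Tj in an uncompressed stream."""
    out = []
    for m in _TJ.finditer(content_stream):
        # Undo the escapes for parentheses and backslashes only.
        raw = re.sub(rb"\\([()\\])", rb"\1", m.group(1))
        out.append(raw.decode("latin-1"))
    return out


# A page with real text: a font is selected and a string is shown.
text_page = b"BT /F1 12 Tf 72 720 Td (Hello, \\(PDF\\) world) Tj ET"
# A page that only paints an image (hypothetical resource name Im0).
image_page = b"q 612 0 0 792 0 0 cm /Im0 Do Q"

print(shown_strings(text_page))
print(shown_strings(image_page))
```

The image-only page yields no strings at all, which is the situation described above: there is nothing for a spider to index unless it runs OCR on the image.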

I am confused by your comments that they are just binary. Everything on a computer is binary, as that is ultimately all a computer understands, but with software it can interpret the binary as something else: an HTML file read by IE/FF as a webpage, the same file read by Notepad as a text file, or a PDF read by a spider as a text file with embedded images.
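
A tiny Python illustration of that point, using a made-up HTML snippet: the same bytes can be viewed as numbers, as hex, or as text, depending only on how the program chooses to interpret them.

```python
# The same bytes, three interpretations.
data = "<html><body>Hi</body></html>".encode("ascii")

as_numbers = list(data[:6])      # raw byte values
as_hex = data[:6].hex()          # what a hex editor would show
as_text = data.decode("ascii")   # what Notepad would show

print(as_numbers)
print(as_hex)
print(as_text)
```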

AstarothSolutions, Nov 26, 2007 IP
