Tuesday, July 6, 2010

PDF - As I Understand It

Earlier I wrote about PDF and the types of PDF files that can be created. Today I want to elaborate on my understanding of PDF.

Ever since I became accustomed to PDF, I have wondered what exactly a PDF file is or, to be more precise, what exactly a "page" in a PDF is. I compared it to Microsoft Word, OpenOffice Writer and many similar programs. One fine day (this was in early 2002), in a conversation with my CEO (Versaware India Pvt. Ltd.), he mentioned, "It's just like paper".

When I returned to my desk, I sat for a while just thinking about that statement. I realised he was absolutely right. It is "paper", similar to the sheets on which we print non-essential stuff day in, day out. The only difference is that a PDF is electronic paper (and nature friendly). Enough of the flashback; back to the real thing.

As per my understanding, a PDF consists of four layers.

1) Content Layer: This is the uppermost layer, which consists of text and/or images. This is the layer that is visible (mostly; I will come back to why I say mostly).
2) Inline Style Layer: This is the layer that decorates the content: the inline styles such as bold, italics, underline, superscript, subscript etc.
3) Content Style Layer: This is the layer that defines the structure of the content: the paragraph styles, fonts, font sizes etc.
4) Canvas Layer: This is the layer that defines the page size, the galley, the margins and the header and footer areas.
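
To make the four layers a little more concrete, here is how I would picture them as a small data model in Python. This is only a sketch of my own mental model: the class and field names (Canvas, ParagraphStyle, Run, Paragraph, Page) are made up for illustration and are not structures defined by the PDF format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Canvas:
    """Canvas layer: page size and margin, in points (1/72 inch)."""
    width: float
    height: float
    margin: float = 72.0


@dataclass
class ParagraphStyle:
    """Content style layer: font and size for a block of text."""
    font: str
    size: float


@dataclass
class Run:
    """Content layer text together with its inline style."""
    text: str
    bold: bool = False
    italic: bool = False


@dataclass
class Paragraph:
    style: ParagraphStyle
    runs: List[Run] = field(default_factory=list)

    def plain_text(self) -> str:
        # Drop all styling and keep only the content.
        return "".join(run.text for run in self.runs)


@dataclass
class Page:
    canvas: Canvas
    paragraphs: List[Paragraph] = field(default_factory=list)


# A US Letter page (612 x 792 points) holding one paragraph.
page = Page(
    canvas=Canvas(width=612, height=792),
    paragraphs=[
        Paragraph(
            style=ParagraphStyle(font="Helvetica", size=12),
            runs=[Run("PDF is "), Run("just like paper", italic=True), Run(".")],
        )
    ],
)
print(page.paragraphs[0].plain_text())
```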

Seriously, I never knew about this until we were experimenting with content extraction from PDF and I asked one of the programmers to extract as much information from a PDF file as possible. To my surprise, all of the above information is stored very systematically within the PDF. It can be reused and repurposed if it is extracted from the PDF along with the content.
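
To give a feel for what "stored systematically" means, here is a minimal Python sketch that pulls the page size, font size and text out of a tiny hand-written fragment in PDF syntax. It is not the extraction tool our programmer wrote, and it is not a real PDF parser: the SAMPLE fragment and the function names are my own illustration, and real PDF files usually compress their content streams, so in practice you would use a proper PDF library instead of regular expressions.

```python
import re

# A hand-written, uncompressed fragment in PDF syntax (not a complete file).
SAMPLE = b"""
<< /Type /Page /MediaBox [0 0 612 792] >>
BT
/F1 12 Tf
72 720 Td
(Hello, paper) Tj
ET
"""

NUMBER = rb"([-\d.]+)"
MEDIABOX = re.compile(
    rb"/MediaBox\s*\[\s*" + rb"\s+".join([NUMBER] * 4) + rb"\s*\]"
)


def page_size(data: bytes):
    """Return (width, height) in points from the first /MediaBox found."""
    match = MEDIABOX.search(data)
    if match is None:
        return None
    x0, y0, x1, y1 = (float(v.decode()) for v in match.groups())
    return (x1 - x0, y1 - y0)


def fonts_used(data: bytes):
    """Return (font resource name, size) for every Tf operator.

    The name (for example F1) is only a label; the real font name lives
    in a separate font dictionary that this sketch does not read.
    """
    found = re.findall(rb"/(\w+)\s+([\d.]+)\s+Tf", data)
    return [(name.decode(), float(size.decode())) for name, size in found]


def shown_text(data: bytes):
    """Return the simple literal strings passed to the Tj operator."""
    found = re.findall(rb"\(([^()\\]*)\)\s*Tj", data)
    return [text.decode("latin-1") for text in found]


print(page_size(SAMPLE))
print(fonts_used(SAMPLE))
print(shown_text(SAMPLE))
```

Here the /MediaBox entry corresponds to what I called the canvas, the Tf operator carries the font and size, and Tj carries the text itself.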

It is very important to note that once a PDF is created, you cannot do much with it; it is the same as a printed page. At most you can add some remarks, annotations or notes. Nothing much more.

PDF is a very good way to store content in an elegantly styled form.

Coming back to "mostly": there are some PDF files where the entire page is an image and the text content of the page is kept either in front of the image or behind it. If it is behind the image, the text will not be visible. This is mostly done to make an image PDF searchable.
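
Besides hiding text behind the image, PDF also has a text rendering mode (set with the Tr operator) where mode 3 means the text is neither filled nor stroked, so it is on the page but invisible. The Python sketch below separates visible from invisible text in a hand-written content stream. The sample stream and the function name are my own assumptions for illustration, and the sketch ignores details a real parser must handle, such as compressed streams and the q/Q operators that save and restore the graphics state.

```python
import re

# A hand-written content stream: draw a full-page image, then add
# invisible text on top of it (text rendering mode 3).
SCANNED_PAGE = b"""
q 612 0 0 792 0 0 cm /Im0 Do Q
BT
3 Tr
/F1 12 Tf
72 720 Td
(text recognised from the scan) Tj
ET
"""

# Matches either "<mode> Tr" or "(<simple string>) Tj".
TOKEN = re.compile(rb"(\d+)\s+Tr\b|\(([^()\\]*)\)\s*Tj\b")


def split_text_by_visibility(content: bytes):
    """Return (visible, hidden) lists of strings shown with Tj."""
    visible, hidden = [], []
    mode = 0  # default rendering mode: fill, i.e. normal visible text
    for match in TOKEN.finditer(content):
        if match.group(1) is not None:
            mode = int(match.group(1))
        else:
            text = match.group(2).decode("latin-1")
            (hidden if mode == 3 else visible).append(text)
    return visible, hidden


visible, hidden = split_text_by_visibility(SCANNED_PAGE)
print("visible:", visible)
print("hidden:", hidden)
```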

I think content, if styled properly, can be extracted to HTML files, and this content can then be used to create ePUB files. By styled properly I mean a page layout that is clearly defined and not too cluttered. This will help the ePUB look closer to the PDF. It is also important to note that making eBooks look pretty will not always help your books. Most devices just go ahead and destroy your layout.
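
As a rough idea of the HTML step, here is a small Python sketch that turns extracted text runs with their inline styles into a simple HTML paragraph. It assumes the extraction has already produced (text, bold, italic) tuples; that tuple format and the function names are my own invention for this example, not the output of any particular tool.

```python
from html import escape


def run_to_html(text, bold=False, italic=False):
    """Wrap one run of text in <b>/<i> tags according to its inline style."""
    out = escape(text)  # make &, < and > safe for HTML
    if italic:
        out = f"<i>{out}</i>"
    if bold:
        out = f"<b>{out}</b>"
    return out


def paragraph_to_html(runs):
    """Join styled runs into a single <p> element."""
    return "<p>" + "".join(run_to_html(*run) for run in runs) + "</p>"


# Hypothetical runs, as they might come out of a PDF extraction step.
runs = [
    ("Keep it ", False, False),
    ("simple", True, False),
    (" & clean!", False, False),
]
html = paragraph_to_html(runs)
print(html)
```

Keeping the markup this plain is deliberate: the less layout you bake in, the less there is for a reading device to destroy.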

Keep it simple! That's the best bet!

Cheers!!
VY!







1 comment:

  1. Hi,

    PDF is the most popular file type used on the internet for viewing documents. PDF supports several types of patterns. The simplest is the tiling pattern, in which a piece of artwork is specified to be drawn repeatedly. Thanks....

