PDF Essentials for Software Developers and the New Gnostice PDFtoolkit 3.0

By: Girish Patil

Abstract: In this article, we will explore various aspects of the new Gnostice PDFtoolkit 3.0. We’ll start with an overview of the PDF format, the design goals and architecture of the new PDF Processor core of PDFtoolkit, some interesting new features of v3.0, the QC systems and finally explore the demo of the new product.

    Overview of the PDF format

As PDFtoolkit is a software library to enable software developers to work on PDF documents, it would be beneficial for the software developer to have a general understanding of the underlying technology: the technology of the Portable Document Format (PDF). This description is by no means an in-depth description of the PDF format; it is only intended to equip the software developer to better use the technology.


A PDF file is, internally, broadly divided into four parts: the Header, the Body, the Cross-Reference Table and the Trailer.

The Header is expected to contain the version number of the PDF format that the file was written to: PDF-1.7 is the newest and corresponds to Acrobat 8, 1.6 to Acrobat 7, and so on. The version number is only a general indication of the version of the PDF specification; PDF readers and processors cannot necessarily rely on it to operate on the file.

The Body holds the content of the PDF that we see on screen, such as text, images, drawings, bookmarks and forms, together with all the information needed to show and use that content properly. All the parts of the Body are divided up into what are known as objects. For example, there is an object for each page in the document, each font used, each image and so on. The page object holds, along with other page attributes, all the drawing commands that represent the page that we see on screen. The drawing commands reference other objects in the Body, such as fonts and images. A point to note here is that many pages can reference the same font, image or other resource. This also tells us that we can optimize the use of resources in a PDF through reuse and so generate a much smaller document.

The Trailer contains some very important keys to the PDF document, without which the document cannot be read at all. Since all content in the Body is held in specialized types of objects, and since those objects are reusable, storing content in objects would make little sense if they could not be accessed randomly. The PDF format is well evolved and well thought out, and it does provide for random access to the objects in the Body. Random access is achieved by storing the absolute byte offset of each object in a table known as the Cross-Reference table (or XRef table), and one of the keys that the Trailer contains is the starting point of this XRef table. One consequence is that when an object is modified in place, the offsets in the XRef table need to be updated for all objects that occur after the modified object in the file. There are, of course, optimized mechanisms to handle this scenario.
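To make the four parts concrete, here is a minimal Python sketch that assembles a tiny PDF in memory and then reads it back the way a processor would: from the Header version, to the startxref offset at the end, to the XRef table, to the objects themselves. The function names are invented for this illustration and have nothing to do with PDFtoolkit's interface. The parser assumes a classic (uncompressed) XRef table with a single subsection.

```python
import re

def build_minimal_pdf() -> bytes:
    """Assemble a tiny one-page PDF, recording each object's byte offset."""
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 200 200] >>",
    ]
    out = bytearray(b"%PDF-1.4\n")                      # Header
    offsets = []
    for num, body in enumerate(objects, start=1):       # Body
        offsets.append(len(out))
        out += b"%d 0 obj\n%s\nendobj\n" % (num, body)
    xref_pos = len(out)                                 # Cross-Reference table
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"                     # entry 0 is always free
    for off in offsets:
        out += b"%010d 00000 n \n" % off                # 20 bytes per entry
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)  # Trailer
    out += b"startxref\n%d\n%%%%EOF\n" % xref_pos
    return bytes(out)

def read_structure(data: bytes):
    """Return (version, xref offset, {object number: byte offset})."""
    version = re.match(rb"%PDF-(\d\.\d)", data).group(1).decode()
    # Reading starts at the end: the last 'startxref' points at the XRef table.
    tail = data[data.rfind(b"startxref"):]
    xref_pos = int(tail.split()[1])
    lines = data[xref_pos:].split(b"\n")
    assert lines[0] == b"xref"
    first, count = map(int, lines[1].split())
    offsets = {}
    for i in range(count):
        off, _gen, flag = lines[2 + i].split()
        if flag == b"n":                                # 'n' = in use, 'f' = free
            offsets[first + i] = int(off)
    return version, xref_pos, offsets

pdf = build_minimal_pdf()
version, xref_pos, offsets = read_structure(pdf)
# Random access: jump straight to any object without scanning the file.
for num, off in offsets.items():
    assert pdf[off:].startswith(b"%d 0 obj" % num)
```

Note that the reader never scans the Body: it seeks to the offset the XRef table gives it, which is what makes object reuse practical in large files.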

As the PDF format evolved, it acquired many useful qualities and technologies that make files more compact and faster to read, and that support some of the standard document control and verification technologies. The following is a short description of a selection of those technologies and techniques:

  • Compression – Compression has been supported in PDF from the first versions, but it was limited to the core content (the streams) of the objects in the Body. The rest of the PDF file remained totally uncompressed: the attributes of the objects (at the start of each object), which can sometimes get quite large, and the XRef table. PDF 1.5 introduced support for compressing these remaining parts as well; when compressed, they are known as object streams and XRef streams. PDF supports several compression algorithms, including the general-purpose Flate and LZW as well as image-specific algorithms such as DCT (JPEG), CCITT fax and JBIG2.
  • Linearization – Linearization is a technique of organizing a PDF document so that PDF viewers can display it before the entire file has been downloaded. This feature is also known as ‘Fast Web View’, and linearizing a large PDF document is extremely useful, especially when it is published for viewing over the web. We can quickly verify whether a PDF document is linearized by checking the ‘Fast Web View’ attribute in the Document Properties dialog (Ctrl+D) when the document is open in Acrobat Reader. We learnt a little earlier in this article that the processing of a PDF document begins at the Trailer, which is located at the end of the document; this is true for a non-linearized PDF document. For the depth this article goes to, we just need to understand that a linearized PDF places some essential keys and contents of the document right at the head of the file.
  • Incremental Update – Incremental Update, or Incremental Write, is another technique in PDF, where updates or changes to the document are made without rewriting or reconstructing the entire document. Only the updates and changes are appended to the existing document, and they are marked appropriately so that PDF processors refer to the new content. This technique is very useful and greatly speeds up the entire process when updating or modifying large PDF documents. More importantly, PDF processors must support this technique if they are to provide full support for digital signature features.
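The incremental update described above can be sketched in a few lines of Python. The sketch appends a replacement object, a one-entry XRef section, and a new trailer whose /Prev key points back at the previous XRef table; the original bytes are left untouched. The helper names are made up for this illustration and are not PDFtoolkit functions. The sketch also assumes a classic XRef table, a flat trailer dictionary, and that an existing object is being replaced (so /Size does not change). The base file is structurally illustrative only.

```python
import re

def make_pdf() -> bytes:
    """Tiny base document: catalog, empty page tree, and an Info dictionary."""
    bodies = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [] /Count 0 >>",
        b"<< /Title (Draft) >>",
    ]
    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for num, body in enumerate(bodies, start=1):
        offsets.append(len(out))
        out += b"%d 0 obj\n%s\nendobj\n" % (num, body)
    xref_pos = len(out)
    out += b"xref\n0 4\n0000000000 65535 f \n"
    for off in offsets:
        out += b"%010d 00000 n \n" % off
    out += b"trailer\n<< /Size 4 /Root 1 0 R /Info 3 0 R >>\n"
    out += b"startxref\n%d\n%%%%EOF\n" % xref_pos
    return bytes(out)

def last_startxref(data: bytes) -> int:
    """Offset of the most recent XRef section (readers start from here)."""
    return int(data[data.rfind(b"startxref"):].split()[1])

def append_update(data: bytes, obj_num: int, new_body: bytes) -> bytes:
    """Replace object `obj_num` by appending, never by rewriting."""
    prev_xref = last_startxref(data)
    # Carry the latest trailer entries forward, dropping any older /Prev.
    trailer = re.findall(rb"trailer\s*<<(.*?)>>", data, re.S)[-1]
    entries = re.sub(rb"/Prev\s+\d+", b"", trailer).strip()
    out = bytearray(data)                       # existing bytes stay as-is
    obj_pos = len(out)
    out += b"%d 0 obj\n%s\nendobj\n" % (obj_num, new_body)
    xref_pos = len(out)
    out += b"xref\n%d 1\n%010d 00000 n \n" % (obj_num, obj_pos)
    out += b"trailer\n<< %s /Prev %d >>\n" % (entries, prev_xref)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_pos
    return bytes(out)

original = make_pdf()
updated = append_update(original, 3, b"<< /Title (Final) >>")
assert updated.startswith(original)             # nothing before the update moved
```

Because the original bytes form an unmodified prefix of the updated file, a digital signature computed over them remains verifiable, which is why signing depends on this technique.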

Now that we have a good understanding of the PDF format, we should be able to better appreciate the implications of performing operations on a PDF file. More importantly, we can see the possibilities that open up when working with PDF documents. That is exactly the aim of the new Gnostice PDF Processor: to enable the software developer to harness the power of PDF.

    The New PDF Processor

The new PDF Processor core is the core engine that powers PDFtoolkit 3.0. The new PDF Processor has been designed and built from the ground up with the following key objectives:

  • To provide robust, error tolerant PDF reading and writing
  • To enable high speed loading, viewing and printing of any type of PDF file, with automatic loading optimization levels (Load for Viewing, Load for Manipulation, Load for Forms access, Load for DocInfo…)
  • To enable software developers to incorporate only the layers of the library they need, into their applications
  • To handle unknown elements and types in such a way that they are not lost in the transformation
  • To be extensible to incorporate new PDF features with ease, without disturbing the programmer interface

The PDF Processor, in its design and organization, closely reflects the PDF format: each PDF object type, starting from the base type, has a corresponding class in the PDF Processor, and the class hierarchy follows the hierarchy of the object types. It is also modular, so that only the parts and layers needed to achieve a desired task have to be used. For example, there is one layer for reading a PDF document and its objects, another for editing, another for viewing and printing, and so on. The advantages that this design and organization provide are exactly those we set out to achieve through our objectives.
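As a rough illustration of this kind of design, the Python sketch below models a few PDF object types as a small class hierarchy and shows how an element the library does not understand can be carried through an edit unchanged. All class and key names here (PdfObject, PdfRaw, PdfPage, XVendorData and so on) are hypothetical and were chosen for the example; they are not the PDF Processor's actual classes.

```python
from dataclasses import dataclass, field

@dataclass
class PdfObject:
    """Base type; every PDF object type derives from it."""
    def serialize(self) -> str:
        raise NotImplementedError

@dataclass
class PdfName(PdfObject):
    value: str
    def serialize(self) -> str:
        return "/" + self.value

@dataclass
class PdfNumber(PdfObject):
    value: float
    def serialize(self) -> str:
        return format(self.value, "g")

@dataclass
class PdfRaw(PdfObject):
    """Holds content the reader does not understand, kept verbatim."""
    text: str
    def serialize(self) -> str:
        return self.text

@dataclass
class PdfDictionary(PdfObject):
    entries: dict = field(default_factory=dict)
    def serialize(self) -> str:
        inner = " ".join(f"/{k} {v.serialize()}" for k, v in self.entries.items())
        return f"<< {inner} >>"

class PdfPage(PdfDictionary):
    """A page is a dictionary with typed access to the keys it knows about."""
    @property
    def rotate(self) -> int:
        entry = self.entries.get("Rotate")
        return int(entry.value) if entry is not None else 0

    @rotate.setter
    def rotate(self, degrees: int) -> None:
        self.entries["Rotate"] = PdfNumber(degrees)

page = PdfPage({
    "Type": PdfName("Page"),
    "Rotate": PdfNumber(0),
    "XVendorData": PdfRaw("(opaque vendor payload)"),   # unknown to this model
})
page.rotate = 90            # edit a known attribute through the typed layer
text = page.serialize()     # the unknown entry is written back untouched
```

The typed subclass only interprets the keys it knows; everything else stays in the underlying dictionary, so editing one attribute cannot silently drop data written by another tool.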


So far, most of the core PDF Processor’s implementation is complete, and the team is now integrating the core into PDFtoolkit 3.0. Further down in this article is a pre-release demonstration of the capabilities of the new PDF Processor.

    Major New Features in PDFtoolkit 3.0

  • Reading of all kinds of PDF files, including PDF 1.7 (which was already supported in 2.x), is now highly optimized, with wide support for the new types in PDF 1.7
  • Instant loading of PDF documents, with automatic optimization at several levels, where if you need to read a document only for accessing the document information/properties, then only those parts of the document will be processed
  • Enhanced, high speed PDF Viewer with form field viewing, including custom drawn form fields
  • Enhanced PDF Printer capabilities, with high speed vector printing, auto page rotation, etc
  • Support for incremental update - appending only changed content without disturbing existing document structure (essential for digital signing, modifying digitally signed documents)
  • Support for most compression algorithms
  • Support for most image formats/types for reading, viewing and printing
  • Support for most font types, when viewing and printing

    QC in the New PDF Processor

Right through the development of the new PDF Processor, we have placed great emphasis on the quality of the product: testing, tuning, optimizing and documenting all the parts. One example of the systems we implemented, in addition to the extensive unit test automation through DUnit that is the brainchild of Shameer, the architect of the new PDF Processor, is an automated testing framework. It automates the allocation and performance testing of all the features and a large set of scenarios, across the thousands of files that we have collected so far. The testing framework uses AutomatedQA’s award-winning AQTime testing tool and its SDK to perform this testing and generate reports that the team can act on.

    The Pre-Release EXE Demo

The PDFtoolkit version 3.0 EXE demo program showcases the new, optimized PDF Processor core of PDFtoolkit 3.0.

This demo enables us to try out some of the complex functions of PDFtoolkit that require high computation and file manipulation, and experience:

  • The high-speed reading and processing of PDF documents
  • Ability to handle large and complex PDF documents
  • Ability to handle a large number of files (when merging PDF documents)

For example, a file such as the Acrobat PDF Specification (31.7 MB, 1310 pages, latest PDF 1.7 format with cross-reference streams) loads almost instantly - a 100x improvement over the earlier version.

Functions exposed in this EXE demo:
  • Read & Edit Document Information/Properties
  • Append Pages to PDF
  • Insert Pages to PDF
  • Extract Pages from PDF
  • Delete Pages from PDF
  • Merge multiple PDFs into one
  • Save new document with Full Rewrite or Incremental Write

Updates to the EXE demo, with more functions exposed, will be provided shortly.

The zipped EXE demo can be downloaded from this link. For more information regarding downloads, features, purchase and more, please follow the links listed below.

PDFtoolkit VCL - Overview, Features, Downloads, Buy Now

PDFtoolkit ActiveX/.NET - Overview, Features, Downloads, Buy Now

More PDFtoolkit Articles - Rearrange pages in PDF, Flatten, Fill & Email PDF

---oO0Oo---
