The Magic of Data hidden in a PDF

The Magic of Data hidden in a PDF

It’s a common myth that the PDF is the final format of a document, where the data and content can only be rendered on a screen or printed on paper. This is far from correct and we will explain why. John Warnock had a vision:

“This project’s goal is to solve a fundamental problem that confronts today’s companies. The problem is concerned with our ability to communicate visual material between different computer applications and systems.”

So, the Portable Document Format was born.

Over 23 years ago Adobe Systems created the PDF, and at the same time gave everybody “Reader” – a PDF Viewing engine. The engine was restricted in functionality, which gave rise to the belief it was a secure end point of content. This belief has yet to change.

The humble PDF has many more powerful tricks up its sleeve. Designed for viewing and printing, the basic characteristic of a PDF is a fixed layout document that ensures pixel perfect rendering on any screen and paper. Text is not wrapped but placed at precise positions in the document – words are often not complete strings. Each letter can be written independently. Tables can be written as a stream of lines and numbers or text. Each character, or glyph, is an individual entity, not belonging to those around it, despite looking exactly as the author intended when viewed or printed – thanks to the power of structured content.

PDF Tagging

PDF Tagging provides fantastic assistance in recognizing this structured content. Yet a tagged PDF makes up less than 18% of the Trillions of PDF files in existence and PDF tagging still doesn’t guarantee the correctness of the PDF structure.

It is common practice, when the content is needed, to physically print the PDF, run it through an Optical Character Recognition (OCR) and physically manipulate the content. Even if the goal is to extract only the text, OCR fails to obtain a vast array of information inside the PDF that can actually assist in the correct extraction of the content.

For the successful extraction of data from within a PDF’s structure, the tool must be able to find the semantics, the content structure, and its logical reading order. The tool must also find the words, lines, paragraphs, tables, columns, rows, and each cell. It must detect vector graphics that make sense only when presented as a whole. Does your extraction tool do this? You don’t want your content changed or manipulated when extracted from a PDF.

So where does this leave us?

How do we go about getting the content from over 3.2 Trillion PDF documents that are out there?

There are many tools available for getting the unstructured content from a PDF, both commercial and open source, but not as many for getting it in a structured form. Structured form extraction tools, deal with extraction in different ways, with varied and mostly worse results. In the ideal case, words, lines, paragraphs, tables, graphs, and pictures as well as mathematical formulas, are not only available when viewing the PDF, but they are required for machine processing and learning and reusable to other applications.

In the case of machine processing, extracted data can be transformed into other formats like XML, JSON or CSV. It can also be used directly for big data processing and machine learning. By working with extracted content that has been correctly structured, it is possible, for example, to read a bank account number and the payable amount on an invoice, identify social security numbers from an application form, read numbers or even find the place in a document where a signature is required and who has to sign it.

Extracted content, extracted data

Extracted content can also benefit the visually impaired, if done correctly. The extracted data can be used to add further structured information and semantics to create PDF/UA (Universal Accessibility) compliant files without any human input. The usual process for creating of a PDF/UA compliant file is a laborious task, involving human intervention to ensure the file correctness. This is usually a very expensive process and unfortunately limited to a small subset of PDF files.

Extracted data can be used in other formats. By retaining the original content of a PDF, it can be manipulated for the required end user device – taking a simple PDF and making it readable on portable devices, without pinching and zooming. This responsive approach is becoming the norm in web pages, so why are leaving the humble PDF in the 90’s? Through the correct extraction of data, and building the semantics of the document, we can also make PDF responsive.

On the surface the myth may be true, but delve a little deeper into the unused power of a PDF, and possibilities are far from final format. There are numerous ways to get data from a PDF, but not all tools are equally effective, correct or accurate. We here at PDFix, have made it our mission to make sure that the information in your PDF isn’t changed or broken. We simply help you gain access to it easier.