PDF Data Extraction

PDF Data Extraction

Nowadays, you often need to use the data from your PDF files elsewhere. Whether it’s copying and pasting some simple text or any complicated structures – we’ve got you covered! PDFix offers multiple methods of extracting the data from your document and directly parsing the page content.

You can watch the linked video for a step-by-step explanation of the entire process.

Extracting raw data

First, we will take a look at the possibilities of extracting Raw data, namely a simple text.
Use any of your preferred selection methods to select the text you want to copy. Next, right-click on the selection and press either copy or copy with formatting. As the name suggests, copy with formatting preserves the format of the text, while copy preserves the Raw text only.

Extracting raw data
Copying the raw text from the selection.

Format panel

Another option is to extract the properties of your PDF through the Format panel. With the Format panel open, pick any content from your PDF, ranging from a single line to the entire document. As soon as you make the selection, you can examine the properties of this content, such as its position, color, or used font. Additionally, you can also copy the raw text or its Unicode representation.

Format panel
Open format panel (right) with the information concerning the selected paragraph.

Convert to HTML

We also offer the possibility of creating an HTML document from your PDF and using that for data extraction. We recommend following our previous blog post and its attached video to learn more about this process.

Convert to JSON

Furthermore, you can convert your document into a JSON file and examine its structure. This can be done by pressing the Convert to JSON button located in the Conversion panel. Once pressed, you will be faced with a dialogue where you can specify the structure of your JSON and the information you want to preserve.

First, you can choose to save any combination of the PDF Objects, Layout, or Tags. If your PDF is well-tagged, we recommend using the Tags option, which builds a JSON based solely on the tag tree of your document. The Layout option is intended for documents that are tagged poorly or not at all. The layout option automatically detects the structure in your document using advanced technologies, including machine learning and uses it to create your JSON. And finally, the PDF Objects can be used if you want to view the individual objects of your document as selected by the object tool.

Convert to JSON
Dialog for specifying the JSON file.

Choose the proper settings for your PDF and press OK to run the conversion and create your JSON.

Convert to JSON - Output

And that is it! If you want to extract the data of multiple documents at the same time, you can follow our further blog post concerning Batch processing and use the Convert to JSON command.