Nowadays, you often need to use the data from your PDF files elsewhere. Whether it’s copying and pasting some simple text or any complicated structures – we’ve got you covered! PDFix offers multiple methods of extracting the data from your document and directly parsing the page content.
Extracting raw data
First, we will take a look at the possibilities of extracting Raw data, namely a simple text.
Use any of your preferred selection methods to select the text you want to copy. Next, right-click on the selection and press either copy or copy with formatting. As the name suggests, copy with formatting preserves the format of the text, while copy preserves the Raw text only.
Another option is to extract the properties of your PDF through the Format panel. With the Format panel open, pick any content from your PDF, ranging from a single line to the entire document. As soon as you make the selection, you can examine the properties of this content, such as its position, color, or used font. Additionally, you can also copy the raw text or its Unicode representation.
Convert to HTML
We also offer the possibility of creating an HTML document from your PDF and using that for data extraction. We recommend following our previous blog post and its attached video to learn more about this process.
Convert to JSON
Furthermore, you can convert your document into a JSON file and examine its structure. This can be done by pressing the Convert to JSON button located in the Conversion panel. Once pressed, you will be faced with a dialogue where you can specify the structure of your JSON and the information you want to preserve.
First, you can choose to save any combination of the PDF Objects, Layout, or Tags. If your PDF is well-tagged, we recommend using the Tags option, which builds a JSON based solely on the tag tree of your document. The Layout option is intended for documents that are tagged poorly or not at all. The layout option automatically detects the structure in your document using advanced technologies, including machine learning and uses it to create your JSON. And finally, the PDF Objects can be used if you want to view the individual objects of your document as selected by the object tool.
Choose the proper settings for your PDF and press OK to run the conversion and create your JSON.
And that is it! If you want to extract the data of multiple documents at the same time, you can follow our further blog post concerning Batch processing and use the Convert to JSON command.