Use Case

Add Tags to PDF to improve text and graphics extraction.

Resources

Download the original PDF document

Integration

The SDK can be integrated into your project in two ways: through the Command Line Utility or programmatically.

Click here to create your free trial license key.

> Command Line

PDFix provides simple, fast, automated PDF processing from the command line. The PDFix Command Line Utility is the easiest way to integrate the SDK functionality into your solution and is available for Windows, macOS and Linux. Learn more about the Command Line Utility.

$ cd /pdfix_mac/bin
$ ./pdfix_app support@pdfix.net 3bE31NaixzFE58ir -addtags /Users/admin/Documents/input.pdf output_csv

Output:

Add Tags to PDF
Success

This command adds tags to the PDF. No options are currently available for the CLI.
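If the command line call needs to be scripted, for example to process several files, it can be wrapped in a small helper. The following is a minimal Python sketch, not part of the PDFix distribution. It only reproduces the argument order of the command shown above. The function names, the placeholder e-mail and license key, and the check for a "Success" line (taken from the sample output) are assumptions for illustration.

```python
import subprocess


def build_addtags_command(app_path, email, license_key, input_pdf, output_path):
    """Build the argument list in the same order as the command shown above."""
    return [str(app_path), email, license_key, "-addtags",
            str(input_pdf), str(output_path)]


def reported_success(stdout):
    """Assumption: the utility prints a line reading 'Success', as in the sample output."""
    return any(line.strip() == "Success" for line in stdout.splitlines())


def add_tags(app_path, email, license_key, input_pdf, output_path):
    """Run the utility once and return True if it reported success."""
    cmd = build_addtags_command(app_path, email, license_key, input_pdf, output_path)
    result = subprocess.run(cmd, capture_output=True, text=True)
    return reported_success(result.stdout)


# Hypothetical usage (paths and credentials are placeholders):
# ok = add_tags("./pdfix_app", "you@example.com", "YOUR_LICENSE_KEY",
#               "/Users/admin/Documents/input.pdf", "output.pdf")
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.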

These code samples show how to add tags to a PDF document. Integrating the code into your project gives you full control over the PDF data processing.

 

Result

General configuration file:

Our engine uses a general configuration file, which should work for the majority of cases. Here's the output using this default configuration file:

Looking at the tag structure, we see that only one table is tagged instead of two separate tables. The same happens with the images: the two graphs are tagged under one figure instead of two. No algorithm works perfectly under all circumstances. Such a partially incorrect tag structure leads to unsatisfactory output, for example when making the PDF accessible or converting the PDF content into a responsive HTML layout.

For these cases, the PDFix SDK allows customization of the output through custom configuration files that affect how particular elements are detected and how the output tag structure is built.

Customizing the output

To improve the tagged output of our sample PDF document, we will use this custom JSON configuration file. To learn more about configuration files, please see the Documentation. When using the SDK programmatically, there are no limits on fitting the output to your needs.

We can see the updated tag structure after applying the custom configuration file. For example, we set custom headings and pointed to the table and image elements to consider during detection. Now we have a more acceptable and usable tagged PDF output.
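To compare the two tagged outputs in numbers rather than by eye, one option is to count the structure elements of each type in both files. The sketch below is an illustration only and does not use the PDFix SDK. It walks the structure tree through the standard /K (kids) and /S (structure type) keys defined by the PDF specification. The helper names are hypothetical, and reading a real file assumes the third-party pypdf package is installed.

```python
from collections import Counter


def count_structure_tags(node, counts=None):
    """Count structure elements per type by following /K entries recursively."""
    if counts is None:
        counts = Counter()
    # Resolve indirect references when the node comes from a PDF library.
    if hasattr(node, "get_object"):
        node = node.get_object()
    if isinstance(node, list):
        for child in node:
            count_structure_tags(child, counts)
    elif isinstance(node, dict):
        if "/S" in node:  # structure elements carry their type in /S
            counts[str(node["/S"])] += 1
        if "/K" in node:  # kids: an element, an array, or content references
            count_structure_tags(node["/K"], counts)
    return counts


def count_tags_in_pdf(path):
    """Assumption: pypdf is available; returns an empty Counter for untagged files."""
    from pypdf import PdfReader  # third-party, imported lazily
    catalog = PdfReader(path).trailer["/Root"]
    if "/StructTreeRoot" not in catalog:
        return Counter()
    return count_structure_tags(catalog["/StructTreeRoot"])


# Hypothetical usage with placeholder file names:
# before = count_tags_in_pdf("tagged_default.pdf")
# after = count_tags_in_pdf("tagged_custom.pdf")
# print(before["/Table"], after["/Table"])
```

For the sample document, the expectation would be that the /Table and /Figure counts change between the default and the custom configuration. Note that this counts raw tag names and does not apply any role mapping the document may define.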

The PDFix SDK uses the generated tag structure, for example, to output the PDF content into a responsive HTML layout. To compare the two outputs, please follow these links:

Open the original responsive HTML output

Open the HTML output after applying custom config file

Add Tags to PDF

Contact us if you need help with integration.