Automated Data Extraction

Scrape structured data from any PDF

PDFix SDK provides several levels of data extraction. Whichever level you use, you can export the data as HTML, XML, JSON, or CSV, or call the PDFix API to feed the data directly into your workflows. Depending on your use case, you can extract plain data only or data enriched with formatting and other information. Try different approaches to find the best results for your use case.

PDFix Data Scraping

The majority of PDF documents today are unstructured: poorly tagged or not tagged at all. PDFix gives you the power to recover the missing structure automatically. Powered by advanced technologies, including machine learning, PDFix detects logical elements such as paragraphs, headings, images, tables, lists, headers and footers, tables of contents, and more.

If the general layout recognition doesn’t provide the expected structure, you can customize the extraction engine with conversion templates tailored to your file set. PDFix SDK pre-generates this template automatically, and you can then adjust it manually to get the desired output.

Reusable data from any PDF document
Detection of high-level elements like tables, headings, lists, and more
Highly customizable
PDFix Data Scraping Showcase
Tagged PDF Showcase
Tagged PDF

If you already have a well-tagged PDF document, PDFix SDK gives you access to the document structure tree and lets you extract structural data directly from it.

Simple and straightforward solution
Insufficient or missing tag structure in most cases
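
To make the idea of reading data from a structure tree concrete, here is a minimal sketch in Python. It does not use the PDFix API: the nested-dictionary tree, the `SAMPLE_TREE` data, and the `iter_elements` helper are all hypothetical stand-ins chosen for illustration. Only the tag names (H1, P, L, LI) come from the standard PDF structure types. The sketch walks the tree depth-first and collects the text of the element types you ask for.

```python
from typing import Iterator

# Hypothetical, simplified stand-in for a tagged PDF structure tree.
# This is NOT the PDFix object model; it only mimics the nesting of tags.
SAMPLE_TREE = {
    "type": "Document",
    "children": [
        {"type": "H1", "text": "Annual Report"},
        {"type": "P", "text": "Revenue grew in all regions."},
        {
            "type": "L",
            "children": [
                {"type": "LI", "text": "Europe"},
                {"type": "LI", "text": "Asia"},
            ],
        },
    ],
}


def iter_elements(node: dict, wanted: set) -> Iterator[tuple]:
    """Depth-first walk yielding (tag, text) for nodes whose tag is in `wanted`."""
    if node.get("type") in wanted and "text" in node:
        yield node["type"], node["text"]
    for child in node.get("children", []):
        yield from iter_elements(child, wanted)


# Collect the headings and list items in document order.
headings_and_items = list(iter_elements(SAMPLE_TREE, {"H1", "LI"}))
```

Because the walk follows the order of the tree, the results come back in logical reading order, which is the main benefit of a well-tagged document. When tags are missing or wrong, a walk like this has nothing reliable to follow.
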
Raw PDF Data

PDFix SDK allows you to parse PDF page content directly. You have access to all page objects as they are stored in the PDF. You can read text chunks, paths, images, and other low-level objects. For each object, there is a set of API methods to get its properties, such as the bounding box, graphics state, text state, etc.

Sufficient for simple use-cases
Unstructured data
Additional post-processing is required
Raw PDF Data Showcase
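
As an illustration of the post-processing that raw page data typically needs, the following Python sketch groups loose text chunks into lines using only their bounding-box positions. It is a sketch under stated assumptions, not PDFix code: the `TextChunk` class, its fields, the `group_into_lines` helper, and the 2-point tolerance are all hypothetical. It assumes PDF user space, where the y coordinate grows upward.

```python
from dataclasses import dataclass


@dataclass
class TextChunk:
    """Hypothetical raw text object: its text plus the bounding-box origin."""
    text: str
    left: float
    bottom: float  # PDF user space: y grows upward, so larger means higher on the page


def group_into_lines(chunks, tolerance=2.0):
    """Group chunks whose baselines lie within `tolerance` points into lines."""
    lines = []  # list of (reference_baseline, [chunks on that line])
    # Visit chunks from the top of the page to the bottom.
    for chunk in sorted(chunks, key=lambda c: -c.bottom):
        if lines and abs(lines[-1][0] - chunk.bottom) <= tolerance:
            lines[-1][1].append(chunk)
        else:
            lines.append((chunk.bottom, [chunk]))
    # Within each line, order chunks left to right and join their text.
    return [
        " ".join(c.text for c in sorted(members, key=lambda c: c.left))
        for _, members in lines
    ]


chunks = [
    TextChunk("world", 60.0, 700.5),
    TextChunk("Hello", 10.0, 700.0),
    TextChunk("Second", 10.0, 680.0),
    TextChunk("line", 55.0, 680.4),
]
lines = group_into_lines(chunks)
```

A simple heuristic like this breaks down on multi-column layouts, rotated text, and tables, which is why raw extraction suits only simple use cases.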

Windows, macOS, Linux

Java, Python, C#, C++