Automated Data Extraction
Scrape structured data from any PDF
PDFix SDK provides several levels of data extraction. Whichever level you use, you can export the data as HTML, XML, JSON, or CSV, or call the PDFix API to use the data directly in your workflows. Depending on your use case, you can extract plain data only or data enriched with formatting and other information. Try different approaches to get the best possible results for your use case.
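To make the export step concrete, here is a minimal sketch of serializing extracted elements to JSON and CSV. It uses only the Python standard library and does not call the PDFix API; the `elements` list and its field names (`type`, `page`, `text`) are assumptions made for illustration, not the actual PDFix output schema.

```python
import csv
import io
import json

# Hypothetical extracted elements. The field names are illustrative only
# and do not reflect the real PDFix export format.
elements = [
    {"type": "heading", "page": 1, "text": "Quarterly Report"},
    {"type": "paragraph", "page": 1, "text": "Revenue grew in all regions."},
]


def to_json(items):
    """Serialize the elements as a pretty-printed JSON array."""
    return json.dumps(items, indent=2, ensure_ascii=False)


def to_csv(items, fields=("type", "page", "text")):
    """Serialize the elements as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    writer.writerows(items)
    return buffer.getvalue()


if __name__ == "__main__":
    print(to_json(elements))
    print(to_csv(elements))
```

JSON keeps nested or enriched data intact, while CSV suits flat, table-like results that go straight into a spreadsheet.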
PDFix Data Scraping
The majority of PDF documents today are unstructured, poorly tagged, or not tagged at all. PDFix gives you the power to rediscover the missing structure automatically. Powered by advanced technologies, including machine learning, PDFix detects logical elements such as paragraphs, headings, images, tables, lists, headers and footers, tables of contents, and more.
If the general layout recognition doesn’t provide the expected structure, you can tune the extraction engine with custom conversion templates based on your file set. PDFix SDK pre-generates this template automatically, and you can then adjust it manually to get the desired output.
|Reusable data from any PDF document|
|Detection of high-level elements like tables, headings, lists, and more|
If you are lucky enough to already have a well-tagged PDF document, PDFix SDK gives you access to the document structure tree and lets you extract structural data from there.
|Simple and straightforward solution|
|Insufficient or missing tag structure in most cases|
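Extracting data from a structure tree comes down to a recursive walk over tagged elements. The sketch below illustrates the idea on a hypothetical in-memory tree built from nested dictionaries; it does not use the PDFix API, and the `tag`, `text`, and `children` keys are assumptions. The tag names themselves (Document, Sect, H1 to H6, P) are standard PDF structure types.

```python
# Hypothetical in-memory structure tree. In practice the tree would be read
# from the tagged PDF; the dictionary layout here is an assumption.
tree = {
    "tag": "Document",
    "children": [
        {"tag": "H1", "text": "Invoice", "children": []},
        {
            "tag": "Sect",
            "children": [
                {"tag": "H2", "text": "Items", "children": []},
                {"tag": "P", "text": "Two widgets.", "children": []},
            ],
        },
    ],
}

# Standard PDF heading structure types.
HEADING_TAGS = {"H", "H1", "H2", "H3", "H4", "H5", "H6"}


def iter_elements(node, depth=0):
    """Yield (depth, node) pairs in document order (depth-first, pre-order)."""
    yield depth, node
    for child in node.get("children", []):
        yield from iter_elements(child, depth + 1)


def headings(root):
    """Collect (tag, text) for every heading element in the tree."""
    return [
        (node["tag"], node.get("text", ""))
        for _, node in iter_elements(root)
        if node["tag"] in HEADING_TAGS
    ]


if __name__ == "__main__":
    for depth, node in iter_elements(tree):
        print("  " * depth + node["tag"], node.get("text", ""))
```

The same walk can collect tables or lists by filtering on other tag names, but it only works as well as the tags in the document allow.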
Raw PDF Data
PDFix SDK allows you to parse PDF page content directly. You have access to all page objects as they are stored in the PDF. You can read text chunks, paths, images, and other low-level objects. For each object, there is a set of API methods to get its properties, such as the bounding box, graphics state, text state, etc.
|Sufficient for simple use-cases|
|Additional post-processing is required|
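As an example of the post-processing that raw data typically needs, the sketch below groups text chunks into lines of text using their bounding boxes. It is a simplified illustration, not the PDFix implementation: the `TextChunk` class and `group_into_lines` function are hypothetical, and coordinates are assumed to be in PDF user space, where the y axis points upward.

```python
from dataclasses import dataclass


@dataclass
class TextChunk:
    """Hypothetical raw text object with its bounding box in PDF user space."""
    text: str
    left: float
    bottom: float
    right: float
    top: float


def group_into_lines(chunks, tolerance=2.0):
    """Group chunks into text lines.

    A chunk joins the current line when its bottom edge lies within
    `tolerance` of the bottom edge of that line's first chunk. Lines are
    returned top to bottom, and chunks inside a line left to right.
    """
    lines = []
    # Larger y means higher on the page, so sort by descending bottom edge.
    for chunk in sorted(chunks, key=lambda c: -c.bottom):
        if lines and abs(lines[-1][0].bottom - chunk.bottom) <= tolerance:
            lines[-1].append(chunk)
        else:
            lines.append([chunk])
    return [
        " ".join(c.text for c in sorted(line, key=lambda c: c.left))
        for line in lines
    ]


sample = [
    TextChunk("world", 60.0, 700.0, 90.0, 712.0),
    TextChunk("Hello", 20.0, 700.5, 55.0, 712.5),
    TextChunk("Next", 20.0, 680.0, 50.0, 692.0),
]

if __name__ == "__main__":
    for line in group_into_lines(sample):
        print(line)
```

Real documents also need handling for multiple columns, rotated text, and superscripts, which is why this level of extraction suits simple use cases best.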