Logical content extraction and conversion

Logical Content Extraction

Make your PDFs accessible.

This is where the magic happens.

If you’ve ever tried to get data out of PDF files, you know how painful it is. There is no easy way to do it. What looks like an image is not an image. You cannot copy text in the right reading order, and what looks like a table is just a collection of isolated elements such as lines, rectangles, and text.

PDFix gives you the power to do that. We turn your simple PDF into a fully responsive document with semantic content and a logical reading order built for you.

  • Document Structure Recognition
  • Intelligent Data Extraction
  • Text Paragraph Detection
  • Images, Graphics Extraction
  • Annotation Extraction
  • Reading Order Detection
  • Whitespace Detection
  • Table Detection (including cells & rows)
  • Text Table Detection (including cells & rows)
  • Header/Footer Detection
  • Table of Contents Detection
  • AcroForm Reading Order Detection (Coming Soon)
  • Chart Detection (Coming Soon)
  • Regular Expression, Pattern Matching

How do we do that?

  1. We take the original PDF document.
  2. We recognize logical text elements: paragraphs, headings, normal text, bulleted lists, numbered lists, and the TOC.
  3. We detect logical graphic elements such as images, tables, lines, and rectangles.
  4. We recognize headers, footers, and the document's main text. Article recognition is in progress.
  5. We set the element hierarchy.
  6. We detect the logical reading order.
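
To make the last two steps more concrete, here is a minimal Python sketch of how recognized elements might be ordered and nested. It is not the PDFix API: the `Element` class, the `reading_order` and `build_hierarchy` helpers, and the sample coordinates are all hypothetical, and the ordering rule (top to bottom, then left to right) assumes a simple single-column page.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Element:
    """Hypothetical logical element; not a PDFix type."""
    kind: str  # e.g. "heading", "paragraph", "table", "image"
    bbox: tuple[float, float, float, float]  # (left, top, right, bottom), origin at top-left
    text: str = ""
    children: list[Element] = field(default_factory=list)


def reading_order(elements: list[Element]) -> list[Element]:
    """Naive single-column order: sort by top edge, then by left edge."""
    return sorted(elements, key=lambda e: (e.bbox[1], e.bbox[0]))


def build_hierarchy(ordered: list[Element]) -> Element:
    """Attach every non-heading element to the most recent heading."""
    root = Element("document", (0, 0, 0, 0))
    current = root
    for el in ordered:
        if el.kind == "heading":
            root.children.append(el)
            current = el
        else:
            current.children.append(el)
    return root


# Elements as they might come out of detection, in arbitrary order.
page = [
    Element("paragraph", (50, 120, 500, 200), "Body text."),
    Element("heading", (50, 80, 500, 110), "Introduction"),
    Element("table", (50, 220, 500, 400)),
]

doc = build_hierarchy(reading_order(page))
for section in doc.children:
    print(section.kind, [child.kind for child in section.children])
```

Real documents with multiple columns, sidebars, or rotated text need a far more careful ordering strategy than a coordinate sort; the sketch only illustrates the shape of the result.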

After successful processing, you have access to all logical elements. You can search text, save all images, export table values into your database, or use the exported elements for conversions to HTML, JSON, Word, Excel, etc.
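
As an illustration of what exporting might look like once a table has been extracted, the sketch below serializes a table to JSON and CSV using only the Python standard library. The `table` dictionary and its `rows` layout are assumptions made for this example, not the format PDFix produces.

```python
import csv
import io
import json

# Hypothetical extracted table: a list of rows, each a list of cell strings.
table = {
    "kind": "table",
    "rows": [
        ["Item", "Qty"],
        ["Widget", "3"],
        ["Gadget", "5"],
    ],
}


def table_to_json(tbl: dict) -> str:
    """Serialize the whole element, including its kind, as JSON."""
    return json.dumps(tbl, indent=2)


def table_to_csv(rows: list) -> str:
    """Write rows as CSV text, ready for a spreadsheet or database import."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()


print(table_to_json(table))
print(table_to_csv(table["rows"]))
```

The CSV output can be loaded by Excel or a database import tool, while the JSON form keeps the element type alongside the cell values.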

Pull data out of PDFs and

  • Search in texts
  • Export images
  • Export tables
  • Export any data you want in structured, usable formats
  • Convert to HTML
  • Convert to JSON
  • Convert to Word, Excel
  • Create PDF/UA accessible PDFs