Master PDF Auto-Tagging with Layout Templates

Introduction

PDFix Template Layout is a rule-based system that allows users to define custom layout recognition and tagging logic for PDFs. This is essential for structured documents (e.g., statements, catalogs, invoices) and for achieving compliance with accessibility standards such as PDF/UA.

This technical documentation provides a comprehensive reference to all features, functions, and nodes used in the Template Layout language, complete with advanced examples, implementation details, and behavioral logic derived from internal SDK methods like split_texts.

General Template Logic

The fundamental approach to building a complex template is to divide the document into logical sections using initial elements. Each section can have its own child template (defined in an element_template node), which overrides the main behavior of the layout recognition algorithm for that part of the document.

Initial elements are created using the element_create function. The most important value to define for an initial element is its bounding box (bbox).

Bounding boxes can be defined in two main ways:

  • Fixed bbox: Using direct coordinates.
  • Start/end bbox: More dynamic, may span multiple anchors.

Bounding box coordinates (left, top, right, bottom) can use:

  • Static float values
  • General context values ($page_num, $page_width, $doc_num_pages, etc.)
  • Parent values ($parent_top, $parent_left, etc.)
  • Anchor references ($A1_bottom, $ANCHOR_right, etc.)
  • Math functions (e.g. SUM($parent_left, 10))
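
The value kinds listed above can be pictured with a small resolver. This is only an illustrative sketch in Python, not how PDFix evaluates templates internally: the helper names resolve and resolve_bbox, the context dictionary, and the page size are assumptions made for the example, and only flat SUM(...) calls are handled.

```python
import re

def resolve(expr, context):
    """Resolve one bbox coordinate: a number, a $variable, or a flat SUM(a, b, ...)."""
    expr = expr.strip()
    match = re.fullmatch(r"SUM\((.*)\)", expr)
    if match:
        # Hypothetical handling: nested function calls are not supported here.
        return sum(resolve(arg, context) for arg in match.group(1).split(","))
    if expr.startswith("$"):
        # Context, parent, and anchor values are all looked up by name.
        return context[expr]
    return float(expr)

def resolve_bbox(bbox, context):
    """Resolve all four coordinate expressions of a bbox."""
    return [resolve(coord, context) for coord in bbox]

# Hypothetical context values for a 612 x 792 page.
context = {"$page_width": 612.0, "$page_height": 792.0, "$parent_left": 36.0}
print(resolve_bbox(["SUM($parent_left, 10)", "688", "$page_width", "$page_height"], context))
```

With these assumed values, the four expressions resolve to plain numbers that describe a strip along the top of the page.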

Why Use a Template?

  • Ensure consistent tagging across similar documents
  • Improve the accuracy of semantic tagging
  • Save time on manual remediation
  • Enhance accessibility and screen reader compatibility
  • Boost efficiency by applying one template to thousands of similar PDFs

Pagemap Overview

The pagemap engine is the core class responsible for page layout recognition. It processes each page from bottom to top and groups primitive objects into logical entities like words, lines, paragraphs, rules, tables, lists, etc.

In PDFix Desktop, each processing step can be visually debugged and adjusted. You can use the settings -> debug_pagemap_stop property in the template to stop the layout engine after any step to inspect intermediate results. This is especially useful when building templates – you can verify every step separately and identify where recognition failed.

For example, setting the stop point to word_update halts after words are recognized. You can then verify whether superscripts and subscripts are properly merged into words. This technique is also helpful for debugging complex processes like paragraph assembly or table recognition.

Functions


element_create

Purpose and Behavior

The element_create function inserts a new layout element into the document structure. Unlike other functions that modify detected elements, this one defines virtual elements manually, based on explicitly set parameters like bounding box and type. These elements can be used as standalone tags, layout hints, containers for tagging, or structural anchors for downstream operations (e.g., TOC positioning or header/footer marking).

Created elements can be tagged, styled, and grouped just like native PDF elements. This is especially useful in cases where automatic detection fails or manual control over structure is required.

The element is only created if the following is defined:

  • valid type
    • pde_text, pde_text_line, pde_image, pde_container, pde_list, pde_line, pde_rect, pde_table, pde_cell, pde_toc, pde_header, pde_footer
  • bbox, start_bbox, or end_bbox
    • Fixed Bounding Box Elements – The simplest way to define an initial element is with a fixed bbox. This method is best for elements like headers, footers, or static sidebars that appear at known locations.
    • Anchor-Based Initial Elements – A more flexible method uses anchors. An initial element’s position and size are defined relative to previously detected anchor elements. This allows dynamic placement based on content structure. The system creates initial elements at the start of page processing and assigns each page object to one of them based on overlap.

If the bbox is zero-sized, the function causes an error during layout detection.

Warning: By using this method, you override the default behavior of the recognition engine. Instead of detecting elements heuristically, the engine will use your pre-defined type, bbox, and flags.

Nested Templates (element_template)

An initial element can contain its own element_template node. If present, this child template completely replaces the parent template for that section. This means that:

  • Functions and values from the main template do not apply within this section
  • Only the child template will be used to process that region

If element_template is not defined, the initial element inherits the parent or global template rules.

Initial element matching modes

  1. Exact bounding box match – if the no_expand flag is set, the element’s bbox is extended using initial_element_expansion.
  2. Overlap-based matching – if no_expand is not set, the layout engine checks if the overlap area with a parent exceeds the initial_element_overlap threshold. If so, the parent is assigned.
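
The overlap-based mode can be sketched as follows. This is a minimal Python illustration under stated assumptions: boxes are (x0, y0, x1, y1) tuples with x0 < x1 and y0 < y1, the overlap ratio is measured against the object's own area, and the function names and the 0.5 threshold are hypothetical. The engine's actual computation may differ.

```python
def overlap_area(a, b):
    """Area of the intersection of two boxes given as (x0, y0, x1, y1)."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return max(width, 0.0) * max(height, 0.0)

def assign_initial_element(obj_bbox, initial_elements, initial_element_overlap):
    """Return the first initial element whose overlap ratio exceeds the threshold."""
    obj_area = (obj_bbox[2] - obj_bbox[0]) * (obj_bbox[3] - obj_bbox[1])
    if obj_area <= 0:
        return None
    for name, bbox in initial_elements.items():
        # Assumption: the ratio is relative to the area of the page object.
        if overlap_area(obj_bbox, bbox) / obj_area > initial_element_overlap:
            return name
    return None

# Hypothetical header region and threshold.
elements = {"header": (0.0, 688.0, 612.0, 792.0)}
print(assign_initial_element((100.0, 680.0, 200.0, 700.0), elements, 0.5))
```

Here 60% of the object lies inside the header region, so it is assigned to that initial element; an object entirely below the region would get no parent.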

Editable Values

  • type – Element type (e.g. pde_text, pde_line, pde_table)
  • bbox – Bounding box for element position and size
  • start_bbox – defines beginning boundary for dynamic elements
  • end_bbox – defines end boundary for dynamic elements
  • name – Unique name for identification and references
  • id – ID used in tagging or alt-text generation
  • flag – Flags like artifact, header, footer for logical role
  • text_flag – Flags for text elements (e.g. no_newline, first_cap, etc.)
  • tag – Semantic tag (P, H1, L, etc.) for accessibility
  • heading – Text style role: normal, h1, h2, h3
  • label – Label level (label, li_1, label_no, etc.)
  • lang – Language of element content (e.g. en-US)
  • actual_text – Replacement text for screen readers
  • alt – Alternate text used for descriptions (especially images)
  • single_instance – Constraints for creating only one instance per page (e.g. font_size)
  • sort_direction – Reading order in containers: 0 = default, 1 = columns, 2 = rows
  • splitter – If element acts as a layout splitter (e.g. pde_cell)
  • element_template – Embedded template definition for nested configuration

Table or Cell specific fields

  • col_num – Number of columns (for tables)
  • row_num – Number of rows (for tables)
  • cell_row – Row position in table (for cell)
  • cell_column – Column position in table (for cell)
  • cell_row_span – Number of rows spanned by the cell
  • cell_column_span – Number of columns spanned by the cell
  • cell_scope – Scope of cell: row, column
  • cell_header – Whether the cell is a header
  • cell_associated_header – Comma-separated list of headers

Thresholds

Template Source: initial element

  • initial_element_overlap
  • initial_element_expansion

Example

This JSON snippet defines virtual elements for tagging content on page 1 of a PDF document using the element_create function. It demonstrates two primary use cases:

  • Creating a dynamic header region with a nested template
  • Manually tagging the text in a specific bounding box, wrapping each line of the NovaBank address in a single P tag

"element_create": [
    {
        "comment": "Tag elements on page n.1",
        "elements": [
            {
                "bbox": [
                    "0",
                    "688",
                    "$page_width",
                    "$page_height"
                ],
                "comment": "Tag header",
                "element_template": {
                    "template": {
                        "pagemap": [
                            {
                                "rd_sort": "2",
                                "rd_sort_direction": "2",
                                "statement": "$if"
                            }
                        ]
                    }
                },
                "type": "pde_header"
            },
            {
                "bbox": [
                    "88.05127716064453",
                    "527.709228515625",
                    "325.1794738769531",
                    "677.7356567382812"
                ],
                "comment": "Tag NovaBank address as single line text",
                "text_flag": "new_line",
                "type": "pde_text"
            }
        ]
    }
]

object_update

Purpose and Behavior

The object_update function parses low-level PDS objects on the current PDF page and prepares them for text extraction and semantic assignment.

This function parses non-text objects from the current page and can apply to the following types:

  • pds_object
  • pds_path
  • pds_image
  • pds_shading
  • pds_form

This function is also where pde_line and pde_rect elements are derived from pds_path objects.

All recognized objects are assigned an initial element based on their bounding box.

Editable Values

  • flag – behavior modifier (artifact, header, footer)

Thresholds

Template Source: pds_object → initial element

  • artifact_similarity
  • element_line_similarity
  • angle_deviation
  • isolated_element_ratio
  • path_object_max
  • path_object_min
  • isolated_text

Best Practice Recommendations

  • If you already know that certain objects are artifacts (like background images, aside boxes, or decorative headers/footers), it’s best to mark them in object_update early. Once an object is flagged as artifact, it will be excluded from further recognition stages.

Example

Mark all objects on the first page whose bottom edge lies above 750 (in page user units) as artifacts.

{
 "object_update": [
   {
     "flag": "artifact",
     "query": {
       "$0_bottom": { "$gt": "750" },
       "$page_num": "1",
       "param": ["pds_object"]
     },
     "statement": "$if"
   }
 ]
}

annot_update

Purpose and Behavior

The annot_update function processes annotations (pdf_annot) on a PDF page and determines how they should be represented in the layout structure – either as form fields, markups, or excluded artifacts.

This function is responsible for:

  • Converting widget annotations (e.g., form fields like checkboxes or text inputs) into pde_form_field elements.
  • Creating layout representations for non-widget annotations such as highlights, comments, or shapes.
  • Assigning each annotation to its initial element container based on bounding box overlap.
  • Filtering out annotations that are excluded from processing via internal flags (kStateExclude).

line_update

Purpose and Behavior

The line_update function processes all recognized line elements (pde_line) and attempts to extend or merge them into longer logical lines. This is especially important for recognizing structural lines used in tables, headers, or boxed layouts.

Merging is controlled by a strict priority-based sequence.

Merge Decision Logic (in exact order)

Lines are eligible to be merged only if:

  1. Same Form Object: Both lines originate from the same form object (m_form_obj).
    If this check fails, no further conditions are evaluated.
  2. Initial Element Relationship:
    • ✅ Line A is the initial_element of Line B → merge.
    • ✅ Line A’s name matches Line B’s parent → merge.
    • ✅ If neither condition applies, both lines must share the same initial_element to be considered.
  3. No Join Restriction:
    • If either line has the flag no_join, they are not merged, regardless of other matches.
  4. Line Orientation and Geometry Check:
    • Lines must pass the line extend test, which evaluates parallelism, overlap, and angle tolerance (as controlled by thresholds like table_line_intersection).

The merging proceeds in two passes:

  • First pass processes only initial lines – to allow them to grow before others are merged.
  • Second pass processes all lines again, now using updated geometry.

The resulting extended lines are then available for tagging or further recognition steps (e.g., table detection).
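
The ordered checks above can be summarized in a short Python sketch. It is not the SDK implementation: the Line class, its fields (form_obj stands in for m_form_obj), and the passes_extend_test callback are hypothetical, and initial elements are modeled as plain names.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Line:
    name: str
    form_obj: Optional[str] = None          # stands in for m_form_obj
    initial_element: Optional[str] = None   # name of the initial element
    parent: Optional[str] = None
    flags: set = field(default_factory=set)

def can_merge(a, b, passes_extend_test):
    """Apply the four merge checks in the documented order."""
    # 1. Same form object; if this fails, nothing else is evaluated.
    if a.form_obj != b.form_obj:
        return False
    # 2. Initial element relationship (any of the three cases).
    related = (
        b.initial_element == a.name
        or b.parent == a.name
        or a.initial_element == b.initial_element
    )
    if not related:
        return False
    # 3. A no_join flag on either line blocks the merge.
    if "no_join" in a.flags or "no_join" in b.flags:
        return False
    # 4. Geometry check (parallelism, overlap, angle), supplied by the caller.
    return passes_extend_test(a, b)

a = Line("A", form_obj="F1", initial_element="box")
b = Line("B", form_obj="F1", initial_element="box")
print(can_merge(a, b, lambda x, y: True))
```

Keeping the geometry test as a callback makes the ordering explicit: the cheap structural checks run first, and the geometric comparison is reached only when they all pass.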

Editable Values

  • name – assigns a unique identifier.
  • parent – sets a link to initial element.
  • label – optional semantic marker.
  • tag – explicitly tags the line for PDF structure.
  • flag – behavior modifier (artifact, header, footer, splitter, no_join).

Thresholds

Template Source: pde_line → initial element

  • table_line_intersection – defines how closely two lines must intersect or align to be eligible for merging in the line extend test.

Best Practice Recommendations

  • Lines marked as artifact here will be excluded from tagging. This is ideal for footers, underlines, or visual dividers.

Example

Mark all detected lines as artifacts.

"line_update": [
  {
    "flag": "artifact",
    "query": {
      "param": [
        "pde_line"
      ]
    },
    "statement": "$if"
  }
]

rect_update

Purpose and Behavior

The rect_update function processes detected rectangle elements (pde_rect) and attempts to merge them into unified rectangular blocks when certain geometric and structural criteria are met. This is particularly useful for simplifying visual background elements like section boxes, shaded containers, and graphic panels.

Merging is conducted recursively within each container, using the rect_update method in two passes: one for initial rectangles and a second for the remaining content.

Merge Decision Logic (in exact order)

A rectangle will be merged with another only if all of the following conditions are met:

  1. Same Form Object:
    Both rectangles must originate from the same form XObject (m_form_obj).
  2. Initial Element Relationship:
    • ✅ Rectangle A is the initial_element of Rectangle B → merge.
    • ✅ Rectangle A’s name matches Rectangle B’s parent → merge.
    • ✅ If none of the above apply, both rectangles must have the same initial element.
  3. Merge Permissions (Flags):
    If either rectangle has the no_join flag set, they are explicitly excluded from merging, regardless of any other logic.
  4. Geometry and Visual Checks:
    • Rectangles must pass the internal extend test, which validates geometric alignment and similarity.
    • Merging is allowed when they are parallel, share the same width or height, and meet internal tolerances for alignment and spacing.

Editable Values

  • name – a unique identifier for reference or hierarchy.
  • parent – links this rectangle logically to another.
  • label – optional descriptive tag.
  • tag – structural tag used in PDF tagging.
  • flag – controls behavior (e.g., artifact, header, no_join).

Thresholds

Template Source: pde_rect → initial element

  • table_line_intersection – defines how closely two lines must intersect or align to be eligible for merging in the rectangle extend test.

Example

Mark gray rectangles whose width and height are both at most 10 as list labels (li_1).

"rect_update": [
    {
        "label": "li_1",
        "query": {
            "$and": [
                {
                    "$0_fill_color": [
                        "100",
                        "100",
                        "100"
                    ]
                },
                {
                    "$0_height": {
                        "$lte": "10"
                    }
                },
                {
                    "$0_width": {
                        "$lte": "10"
                    }
                }
            ],
            "param": [
                "pde_rect"
            ]
        },
        "statement": "$if"
    }
]

object_update_text

Purpose and Behavior

The object_update_text function parses text-based PDS objects from the current page. It is invoked with parameters:

  • pds_object
  • pds_text

The function extracts text runs from the page content, segments them into words, and assigns each word to an appropriate initial element based on its bounding box.

This is one of the earliest and most critical steps in the layout pipeline. If this segmentation fails, all downstream logic (headings, tagging, tables) will be incorrect.

Text runs are later grouped into words based on spatial properties – including inter-character spacing and baseline alignment.

Editable Values

  • flag – controls behavior (artifact, header, footer)

Thresholds

Template Source: page level using values in the main template block.

  • word_space_width_ratio
    Ratio multiplier used to estimate the maximum allowed space between characters within a word. If spacing between characters exceeds this value × minimal char spacing → a new word begins.
  • word_space_width_min_ratio
    Optional lower bound used to constrain the influence of very small font sizes. Applied as: allowed_space = font_size × word_space_width_min_ratio

These values control when and where characters get split into new words – especially important in justified or stylized text.
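
To make the two ratios concrete, the following Python sketch shows one way the threshold could be applied. It assumes that word_space_width_min_ratio acts as a lower bound combined with max(), which is an interpretation of the description above rather than confirmed engine behavior. The function names and sample values are hypothetical.

```python
def max_in_word_gap(min_char_spacing, font_size,
                    word_space_width_ratio, word_space_width_min_ratio=None):
    """Largest gap still treated as 'inside a word' (assumed combination rule)."""
    allowed = word_space_width_ratio * min_char_spacing
    if word_space_width_min_ratio is not None:
        # Assumption: the font-size based value is used as a lower bound.
        allowed = max(allowed, font_size * word_space_width_min_ratio)
    return allowed

def split_into_words(chars, gaps, threshold):
    """chars: characters in order; gaps[i]: space between chars[i] and chars[i + 1]."""
    if not chars:
        return []
    words, current = [], chars[0]
    for ch, gap in zip(chars[1:], gaps):
        if gap > threshold:
            # Gap exceeds the allowed in-word spacing: start a new word.
            words.append(current)
            current = ch
        else:
            current += ch
    words.append(current)
    return words

threshold = max_in_word_gap(0.25, 10.0, 2.0, 0.1)
print(split_into_words(list("PDFUA"), [0.25, 0.25, 1.5, 0.25], threshold))
```

In this example the ratio-based value (0.5) is smaller than the font-size based bound (1.0), so the bound decides, and only the 1.5 gap starts a new word.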

Warning: These are defined in the root-level (global) template, not per-element.

Example

On every page except the first, mark all text objects whose bottom edge is at or above 740 as artifacts.

"object_update": [
    {
        "comment": "Artifact texts except first page",
        "flag": "artifact",
        "query": {
            "$and": [
                {
                    "$page_num": {
                        "$gt": "1"
                    }
                },
                {
                    "$0_bottom": {
                        "$gte": "740"
                    }
                }
            ],
            "param": [
                "pds_text"
            ]
        },
        "statement": "$if"
    }
]

text_run_update

Purpose and Behavior

The text_run_update function modifies properties of individual pde_text_run elements after they are parsed from the page.

Its primary purpose is to assign text state flags based on visual or semantic context – such as subscript/superscript styling.

The text_run_update rule is evaluated per text run, immediately after it is extracted from the PDF stream.

Editable Values

  • text_state_flag
    Can include values like:
    • subscript
    • superscript

These flags do not split the text run from its associated word. The run remains part of the word but is marked for later structural tagging.

This preserves grouped representations like: H₂O instead of: H 2 O

Thresholds

Template Source: page level using values in the main template block.

  • text_line_baseline_ratio
  • angle_deviation

Warning: These are defined in the root-level (global) template, not per-element.

Example

Mark all ArialMT text runs with a font size of 4 as superscript.

"text_run_update": [
    {
        "query": {
            "$and": [
                {
                    "$0_font_name": {
                        "$regex": "ArialMT"
                    }
                },
                {
                    "$0_font_size": "4"
                }
            ],
            "param": [
                "pde_text_run"
            ]
        },
        "statement": "$if",
        "text_state_flag": "superscript"
    }
]

text_run_neighbours

Purpose and Behavior

The text_run_neighbours function determines whether two consecutive pde_text_run elements should be merged into a single word or split.

This function overrides automatic word detection, offering precision control in cases where standard heuristics fail (e.g., tight kerning, styled fragments).

Merge Decision Logic

The function compares two runs:

  1. If join = true → the runs are always joined into the same word.
  2. If join = false → a forced split is applied between the runs, even if they align.
  3. If no explicit rule is matched, the fallback logic checks:
    • Angle consistency (same_angle)
    • Baseline alignment (same_baseline)
    • Spacing break (e.g., visual separator or vertical border)

This lets you force join or split conditions for complex or edge-case text layouts.
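
The three-way decision can be written down as a small Python sketch. The function name and boolean inputs are hypothetical. How the engine computes same_angle, same_baseline, and the spacing break is not shown here.

```python
def runs_share_word(join, same_angle, same_baseline, spacing_break):
    """join is True, False, or None (None means no template rule matched)."""
    if join is True:
        return True    # forced merge
    if join is False:
        return False   # forced split, even if the runs align
    # Fallback: runs stay together only if they align and nothing separates them.
    return same_angle and same_baseline and not spacing_break

print(runs_share_word(None, True, True, False))
print(runs_share_word(False, True, True, False))
```

An explicit rule always wins; the geometric fallback is consulted only when join is left undefined.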

Editable Values

  • join – Boolean flag to force merge or split

Thresholds

Template Source: page level using values in the main template block.

  • text_line_baseline_ratio – internal margin for baseline deviation
  • angle_deviation

Warning: These are defined in the root-level (global) template, not per-element.


word_update

Purpose and Behavior

The word_update function processes individual pde_word elements after word segmentation is complete. It serves three key purposes:

  1. Semantic Flag Assignment
    Applies structural or semantic flags to words based on regex rules. These flags influence downstream layout recognition (e.g., list detection, heading classification, TOC generation).
  2. Filling and Label Detection
    • Uses regex_filling to detect filler-only words (like “…”, “–”).
    • Splits or skips them depending on match length.
    • Detects structured labels (e.g., numbered bullets or Roman/letter markers).

  3. Property Update and Artifact Extraction
    If a word is marked as artifact, header, or footer, it is converted into a separate pde_text element, detached from the text stream, and placed into the corresponding container. This prevents accidental tagging or reading by screen readers.

The word_update function identifies possible list labels and table of contents (TOC) entries among pde_word elements. It marks these elements with corresponding label or toc values and attempts to pair them with a sibling (typically a neighboring word) that represents the list item content or TOC title.

Editable Values

  • name – unique name
  • tag – semantic tag (e.g., Span, Div, Note)
  • flag – behavior modifier (artifact, header, footer, etc.)
  • label – used for logical labeling
  • heading – heading level
  • actual_text – alternate value for screen readers
  • lang – language code
  • word_flag – manual override for system-assigned flags
  • single_instance – suppresses duplicates across layout
  • word_space – sets an exact space width for this word’s font size and font. If word_space is set manually here, it overrides all computed word spacing logic and disables word_space_ratio.

Thresholds

Template Source: pde_word → initial element.

  • word_space_ratio – multiplier for auto-calculated spacing (used unless overridden)
  • word_space_update_max – prevents auto-updates beyond a limit (0 = no updates)

pagemap_regex flags used for semantic detection

  • hyphen – regex_hyphen
  • bullet – regex_bullet, regex_bullet_font
  • colon – regex_colon
  • number – number_chars
  • terminal – regex_terminal
  • capital – regex_first_cap
  • decimal_num – regex_decimal_numbering
  • roman_num – regex_roman_numbering
  • letter_num – regex_letter_numbering
  • page_num – regex_page_number
  • filling – regex_filling
  • comma – regex_comma
  • label – regex_label, label_chars, regex_letter

Example

Mark all dot-leader words as artifacts.

{
 "word_update": [
   {
     "flag": "artifact",
     "query": {
       "$0_text": { "$regex": "^\\.+$" },
       "param": ["pde_word"]
     },
     "statement": "$if"
   }
 ]
}

Change the tag type: tag all words with a font size of 4.5 as Span.

"word_update": [
    {
        "query": {
            "$and": [
                {
                    "$0_font_size": "4.5"
                }
            ],
            "param": [
                "pde_word"
            ]
        },
        "statement": "$if",
        "tag": "Span"
    }
]

Set the actual text based on the word's content: an empty box (□) is read as "No" and a filled box (■) as "Yes".

"word_update": [
    {
        "actual_text": "No",
        "query": {
            "$and": [
                {
                    "$0_text": {
                        "$regex": "□"
                    }
                }
            ],
            "param": [
                "pde_word"
            ]
        },
        "statement": "$if",
        "tag": "Span"
    },
    {
        "actual_text": "Yes",
        "query": {
            "$and": [
                {
                    "$0_text": {
                        "$regex": "■"
                    }
                }
            ],
            "param": [
                "pde_word"
            ]
        },
        "statement": "$if",
        "tag": "Span"
    }
]

Word Spacing Precision Logic

Accurate word spacing is critical for reliable recognition of text lines, paragraph types (e.g. justified vs. simple), and ultimately, for layout structure such as heading alignment or table row segmentation.

Automatic Detection of Word Space

During word recognition, a base word space is automatically estimated for each unique:

  • Font name
  • Font size

This is computed by analyzing inter-character gaps across text runs with matching font properties. It defines what spacing is considered “normal” within a word vs. between words.

This automatically estimated word space width is then used to:

  • Detect word boundaries
  • Classify a line as:
    • Simple: uniform space width
    • Justified: variable space widths
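
The simple versus justified distinction can be illustrated with a small Python sketch. The tolerance value and the comparison rule are assumptions chosen for the example, since the engine's actual criterion is not documented in this section.

```python
def classify_line(space_widths, tolerance=0.1):
    """Return 'simple' for near-uniform inter-word spaces, else 'justified'."""
    if len(space_widths) < 2:
        return "simple"
    smallest, largest = min(space_widths), max(space_widths)
    # Hypothetical rule: spaces count as uniform if they differ by at most
    # `tolerance` times the widest space.
    if largest - smallest <= tolerance * largest:
        return "simple"
    return "justified"

print(classify_line([2.5, 2.5, 2.5]))
print(classify_line([2.5, 4.0, 3.25]))
```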

Ways to Adjust Word Space

1. Exact Override per Word (word_update)

Template Source: pde_word → initial element

"word_update": [
   {
       "query": {
           "$and": [
               {
                   "$0_font_size": "4.5"
               }
           ],
           "param": [
               "pde_word"
           ]
       },
       "statement": "$if",
       "word_space": "4.2"
   }
]

Sets an exact word spacing value for this font-size & font-name combination.

  • Overrides all other logic
  • Disables word_space_ratio

✅ Use this when auto-estimation fails for a specific word or stylized font.

2. Proportional Scaling (word_space_ratio)

Template Source: pde_container → initial element

"pagemap": [
   {
       "word_space_ratio": "1.15"
   }
]

Multiplies the estimated space width by a scaling factor. Useful for small global corrections.

  • Only used if no word_space is defined
  • Applies to all words in the container

3. Post-Line Update Adjustment (text_line_update)

Template Source: pde_text_line → initial element (for word_space)

"text_line_update": [
   {
       "word_space": "4.2",
       "query": {},
       "param": [
           "pde_text_line"
       ],
       "statement": "$if"
   }
]

Once words are grouped into lines, text_line_update can fine-tune or lock the final spacing:

  • If word_space is defined in text_line_update, the spacing for all words in the line is set to that value.
  • If word_space_update_max is defined:
    • It limits re-estimation from line analysis
    • Set to 0 to prevent any automatic spacing changes
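
The precedence among the three adjustments can be sketched in Python. This is a reading of the rules above, not the engine's code: the function and parameter names are hypothetical, and letting the text_line_update value win over the word_update value is an assumption based on it running later in the pipeline.

```python
def effective_word_space(estimated, word_space=None,
                         word_space_ratio=1.0, line_word_space=None):
    """Pick the word space that applies to a word (assumed precedence)."""
    if line_word_space is not None:
        # text_line_update sets the spacing for all words in the line.
        return line_word_space
    if word_space is not None:
        # An exact word_update override disables word_space_ratio.
        return word_space
    # Otherwise scale the automatically estimated value.
    return estimated * word_space_ratio

print(effective_word_space(4.0, word_space_ratio=1.5))
print(effective_word_space(4.0, word_space=4.2, word_space_ratio=1.5))
```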

Best Practice Recommendations

  • Use word_space only when auto-estimation fails consistently for specific fonts.
  • Apply word_space_ratio globally in containers (e.g., invoices or tables).
  • Avoid multiple re-definitions across word_update and text_line_update unless needed for layout corrections.
  • Use word_space_update_max:0 to lock final spacing post-assembly.

Label Detection

Purpose and Behavior

The word_update function detects and classifies list label words (li_1 to li_4) using regex rules and spatial relationships. When a word is marked as a potential label (e.g., 1., A), (iii)), the engine attempts to pair it with a sibling element – usually the associated paragraph or line content.

  • If a word is explicitly marked with a label property in a template rule (e.g., label: li_1), it is treated as a list item at the corresponding nesting level.
  • If no label is set manually, the system attempts automatic detection using regex and heuristics:
    • Valid label formats: numeric (1.), alphabetic (A), Roman numerals (IV), bullets (•), or combinations with brackets or dots.
    • The word is analyzed for regex_label, regex_letter, regex_roman_numbering, and others.
  • Once a word is flagged as a possible label, it is matched to a sibling word based on reading order (RTL or LTR).

Thresholds

Template Source: pde_word → initial element.

  • label_word_detect – Enables automatic label detection. Set to 0 in templates where labeling is irrelevant to prevent unwanted grouping.
  • label_distance_ratio – Used to calculate label-sibling horizontal distance. dist = font_size × label_distance_ratio. It affects how far the algorithm will look horizontally for a matching sibling.
  • label_word_w1 – Weight for vertical alignment similarity between label candidates during clustering.
  • label_word_w2 – Weight for label-sibling offset alignment.
  • label_word_dist_sibling_ratio – Reject labels if sibling is too far. Max distance = font_size × ratio.
  • label_sibling_distance_ratio – Reject false labels if sibling is too close to another word in the line.
  • label_word_distance – Maximum absolute clustering distance between label candidates.
  • label_word_distance_ratio – Used if label_word_distance == 0, scales by page width.
  • concurrent_threads – Multithreaded clustering of label candidates.
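
Two of the distance thresholds above follow simple formulas, shown here as a Python sketch. The function names and sample values are hypothetical. Only the formulas themselves come from the descriptions above.

```python
def label_search_distance(font_size, label_distance_ratio):
    """Horizontal search distance for a label's sibling: font_size x ratio."""
    return font_size * label_distance_ratio

def label_cluster_distance(label_word_distance, label_word_distance_ratio, page_width):
    """Absolute distance if set; otherwise the ratio scaled by page width."""
    if label_word_distance != 0:
        return label_word_distance
    return label_word_distance_ratio * page_width

print(label_search_distance(10.0, 2.5))
print(label_cluster_distance(0, 0.01, 600.0))
```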

pagemap_regex flags used for semantic detection

  • label_chars: Characters like “(“, “)”, “.” used to strip and normalize labels.
  • regex_label: Pattern for detecting list markers.

Editable Values

  • label: label, label_no, li_1 … li_4

Best Practice Recommendations

  • Set label_word_detect = 0 explicitly in templates where label detection is irrelevant (e.g. tables without list-like elements).

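Example

Assign list label levels by marker format: a single lowercase letter in parentheses becomes li_1, a single digit in parentheses becomes li_2, and lowercase Roman numerals (i) to (xx) become li_3.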

"word_update": [
    {
        "label": "li_1",
        "query": {
            "$and": [
                {
                    "$0_text": {
                        "$regex": "^\\([a-z]\\)$"
                    }
                }
            ],
            "param": [
                "pde_word"
            ]
        },
        "statement": "$if"
    },
    {
        "label": "li_2",
        "query": {
            "$and": [
                {
                    "$0_text": {
                        "$regex": "^\\(\\d\\)$"
                    }
                }
            ],
            "param": [
                "pde_word"
            ]
        },
        "statement": "$if"
    },
    {
        "label": "li_3",
        "query": {
            "$and": [
                {
                    "$0_text": {
                        "$regex": "^\\((i|ii|iii|iv|v|vi|vii|viii|ix|x|xi|xii|xiii|xiv|xv|xvi|xvii|xviii|xix|xx)\\)$"
                    }
                }
            ],
            "param": [
                "pde_word"
            ]
        },
        "statement": "$if"
    }
]
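
The three patterns from the example can be checked outside the template with Python's re module. This sketch assumes the template's regex engine treats these simple patterns the same way Python does; the helper name matching_labels is hypothetical. The JSON strings are shown unescaped here (\\( in JSON is \( in the regex).

```python
import re

LABEL_PATTERNS = [
    ("li_1", r"^\([a-z]\)$"),
    ("li_2", r"^\(\d\)$"),
    ("li_3", r"^\((i|ii|iii|iv|v|vi|vii|viii|ix|x|xi|xii|xiii|xiv|xv|xvi|xvii|xviii|xix|xx)\)$"),
]

def matching_labels(text):
    """Return every label level whose pattern matches the word text."""
    return [label for label, pattern in LABEL_PATTERNS if re.search(pattern, text)]

for sample in ["(a)", "(7)", "(iv)", "(i)", "(12)"]:
    print(sample, matching_labels(sample))
```

Note that single-letter Roman numerals such as (i), (v), and (x) match both the li_1 and the li_3 pattern, and multi-digit markers such as (12) match none. Which rule takes effect for an ambiguous word depends on how the engine applies the rules, which this sketch does not assume.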

TOC Detection

Purpose and Behavior

The word_update function also supports detection of Table of Contents (TOC) elements by identifying possible page number words (e.g., 15, 248, xii) and matching them with title words on the same line.

  • Words are flagged as TOC numbers using regex_page_number.
  • A sibling is searched to the left or right (depending on document direction) that likely represents the section heading.
  • TOC entries are grouped and clustered spatially, then marked as a TOC item.

Thresholds

Template Source: pde_word → initial element.

  • toc_detect – Enables TOC detection. Set to 0 in non-TOC templates for performance and precision.
  • toc_word_distance – Absolute clustering cutoff for grouping similar TOC numbers. Defines the tightness of TOC number clustering. This ensures only aligned page number entries are grouped.
  • toc_word_distance_ratio – Used if the absolute distance is unset. Multiplied by page_font_width.
  • concurrent_threads – Parallel processing during clustering of TOC words.

pagemap_regex flags used for semantic detection

  • regex_page_number: Used to detect page numbers.

Editable Values

  • toc: label, label_no

Best Practice Recommendations

  • Setting toc_detect = 0 in non-TOC sections improves performance and avoids misclassifications.

word_neighbours

Purpose and Behavior

The word_neighbours function controls how recognized words (pde_word) are grouped into text lines (pde_text_line). It plays a critical role in reconstructing logical reading order and grouping sequences of text.

Two words will be joined into the same line if:

  1. Their initial elements are compatible.
  2. Neither has the no_join flag.
  3. They have the same text style.
  4. Their writing angles match.
  5. They share the same baseline (within text_line_baseline_ratio × font size).
  6. No splitting object (line, rect, other words) is between them.
  7. The gap between them is ≤ word_space_distance_max or, when that value is zero, ≤ word_space_distance_max_ratio × max font size.

You can override these constraints explicitly using the word_neighbours rule with join: true | false.

If a word has an initial element that is already a pde_text_line, it is automatically inserted into that line. Otherwise, a new line is created, and neighboring words are added if they pass all join conditions.
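
The join decision can be summarized in a short Python sketch. It is a simplified model, not the SDK code: the `Word` class, the default threshold values, and the equality test used for "compatible" initial elements are hypothetical. The precedence of no_join over an explicit join rule is also an assumption, since it is not spelled out here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Word:
    left: float
    right: float
    baseline: float
    font_size: float
    style: str
    angle: float = 0.0
    no_join: bool = False
    initial_element: Optional[str] = None


def max_gap(a: Word, b: Word, distance_max: float, distance_max_ratio: float) -> float:
    """Absolute gap limit, or ratio x max font size when the absolute limit is zero."""
    if distance_max > 0:
        return distance_max
    return distance_max_ratio * max(a.font_size, b.font_size)


def should_join(a: Word, b: Word, *, rule_join: Optional[bool] = None,
                baseline_ratio: float = 0.2, distance_max: float = 0.0,
                distance_max_ratio: float = 0.5,
                splitter_between: bool = False) -> bool:
    """Decide whether word b may follow word a on the same text line."""
    # Assumption: the no_join flag wins over an explicit rule.
    if a.no_join or b.no_join:
        return False
    # An explicit word_neighbours rule (join: true | false) overrides the rest.
    if rule_join is not None:
        return rule_join
    font = max(a.font_size, b.font_size)
    return (
        a.initial_element == b.initial_element  # simplified "compatible" check
        and a.style == b.style
        and a.angle == b.angle
        and abs(a.baseline - b.baseline) <= baseline_ratio * font
        and not splitter_between
        and (b.left - a.right) <= max_gap(a, b, distance_max, distance_max_ratio)
    )
```

Keeping the explicit rule separate from the fallback checks mirrors the text above: a rule with join set replaces the geometric tests, and only an undefined join falls through to baseline, spacing, and flags.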

Editable Values

  • Defined in word_neighbours:
    • join:
      • If true, forcibly joins two words into a line.
      • If false, forcibly prevents joining.
      • If not defined, fallback logic uses baseline, spacing, and flags.
  • Defined in word_update for individual word flags:
    • no_join: If a word has this flag set, it will never be joined into a line, even if all other conditions match.

Thresholds

Template Source: pde_text_line → initial element.

  • text_line_baseline_ratio – Maximum vertical offset allowed between baselines, multiplied by font size.
  • word_space_distance_max – Maximum absolute horizontal gap between words (in user units).
  • word_space_distance_max_ratio – If word_space_distance_max is zero, this ratio × max font size is used.
    • These two thresholds are ignored for RTL label words unless label_word_detect = 0.

Best Practice Recommendations

The word_neighbours method is called only for specific pairs of words, not all pairs.

Example

"word_neighbours": [
    {
        "join": "true",
        "query": {
            "$and": [
                {
                    "$0_font_name": {
                        "$regex": "Arial-BoldMT"
                    }
                },
                {
                    "$1_font_name": {
                        "$regex": "Arial-BoldMT"
                    }
                }
            ],
            "param": [
                "pde_word",
                "pde_word"
            ]
        },
        "statement": "$if"
    }
]

word_connect

Purpose and Behavior

The word_connect function merges consecutive text lines that logically belong together. It is typically used to reconstruct full paragraphs or wrapped lines that were split due to PDF layout quirks.

The function evaluates each pair of pde_text_line elements based on alignment, spacing, and their surrounding context. If certain conditions are met, two lines are merged into one.

Behavioral priorities for merging:

  1. Initial Element: If the initial element of a pde_text_line has the no_join flag set, that line will not be joined with others.
  2. Alignment Detection: Before merging, all lines are grouped by their vertical alignment (left, right, center) with a font-size-based threshold.
  3. Explicit Relationship Check: The function calls word_matches_to_line() to verify whether a word from one line can be logically extended into the next line.
  4. Alignment Count Heuristic: If both lines belong to strong alignment groups (≥3 members), merging is rejected to avoid false positives in structured layouts like tables or columns.
  5. Word Spacing Check: Distance between the rightmost word of the left line and the leftmost word of the right line must be less than calculated word spacing (from get_simple_word_spacing()).

When lines are merged, all their words are combined into a single pde_text_line, and the redundant line is removed from the container.
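
The alignment count heuristic and the spacing check can be illustrated with the Python sketch below. It models left alignment only and uses a hypothetical `Line` class, tolerance ratio, and function names. The real grouping, word_matches_to_line() and get_simple_word_spacing() are internal to the SDK and are not reproduced here, so the word spacing is passed in as a plain number.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Line:
    left: float
    right: float
    font_size: float


def left_group_sizes(lines: List[Line], ratio: float = 0.25) -> List[int]:
    """For each line, return the size of the left-aligned group it belongs to."""
    order = sorted(range(len(lines)), key=lambda i: lines[i].left)
    sizes = [0] * len(lines)
    group: List[int] = []
    for i in order:
        # A left edge further away than ratio x font size starts a new group.
        if group and lines[i].left - lines[group[-1]].left > ratio * lines[i].font_size:
            for j in group:
                sizes[j] = len(group)
            group = []
        group.append(i)
    for j in group:
        sizes[j] = len(group)
    return sizes


def may_merge(a_idx: int, b_idx: int, lines: List[Line],
              word_spacing: float, strong: int = 3) -> bool:
    """Reject merges between two strongly aligned lines, then check the gap."""
    sizes = left_group_sizes(lines)
    if sizes[a_idx] >= strong and sizes[b_idx] >= strong:
        return False
    left_line, right_line = sorted((lines[a_idx], lines[b_idx]), key=lambda l: l.left)
    return right_line.left - left_line.right < word_spacing
```

In this model, three lines sharing a left edge form a strong group and are never merged with each other, while a short fragment just to the right of one of them can still be merged if the gap is below the word spacing.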

Editable Values

  • word_update: e.g. kElemNoJoin, word_flag
  • text_line_update: e.g. m_text_style, bounding boxes
  • Line alignment grouping (implicitly controlled by layout)

Thresholds

Template Source: pde_text_line → initial element.

Word-level decisions from: pde_word → initial element

  • text_line_baseline_ratio: Used in word_matches_to_line() to determine baseline tolerance.
  • word_space_distance_max and word_space_distance_max_ratio: Control maximum allowed space between words/lines. Used in both word_neighbours and word_connect.

⚠️ Note: Merging is only attempted if the spacing threshold is not exceeded and no graphical splitters or layout anomalies are in between.

Best Practice Recommendations

The word_neighbours method is called only for specific pairs of words, not all pairs.

Example

"word_neighbours": [
    {
        "join": "true",
        "query": {
            "$and": [
                {
                    "$0_font_name": {
                        "$regex": "Arial-BoldMT"
                    }
                },
                {
                    "$1_font_name": {
                        "$regex": "Arial-BoldMT"
                    }
                }
            ],
            "param": [
                "pde_word",
                "pde_word"
            ]
        },
        "statement": "$if"
    }
]