Master PDF Auto-Tagging with Layout Templates

Introduction

PDFix Template Layout is a rule-based system that allows users to define custom layout recognition and tagging logic for PDFs. This is essential for structured documents (e.g., statements, catalogs, invoices) and for achieving compliance with accessibility standards such as PDF/UA.

This technical documentation provides a comprehensive reference to all features, functions, and nodes used in the Template Layout language, complete with advanced examples, implementation details, and behavioral logic derived from internal SDK methods like split_texts.

General Template Logic

The fundamental approach to building a complex template is to divide the document into logical sections using initial elements. Each section can have its own child template (defined in an element_template node), which overrides the main behavior of the layout recognition algorithm for that part of the document.

Initial elements are created using the element_create function. The most important value to define for an initial element is its bounding box (bbox).

Bounding boxes can be defined in two main ways:

  • Fixed bbox: Using direct coordinates.
  • Start/end bbox: More dynamic, may span multiple anchors.

Bounding box coordinates (left, top, right, bottom) can use:

  • Static float values
  • General context values ($page_num, $page_width, $doc_num_pages, etc.)
  • Parent values ($parent_top, $parent_left, etc.)
  • Anchor references ($A1_bottom, $ANCHOR_right, etc.)
  • Math functions (e.g. SUM($parent_left, 10))
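
The coordinate sources listed above can be mixed within one bbox. As a rough illustration only (this is not the PDFix evaluator; the `resolve` helper, the `context` dictionary, and the page dimensions are hypothetical), a resolver for static values, `$` variables, and `SUM` might look like this:

```python
import re

def resolve(expr, context):
    """Resolve one bbox coordinate expression to a float.

    Handles static numbers, $variables looked up in `context`,
    and flat SUM(a, b, ...) calls. Nested calls are not supported.
    """
    expr = expr.strip()
    call = re.fullmatch(r"SUM\((.*)\)", expr)
    if call:
        # Split the argument list on commas and add the resolved values.
        return sum(resolve(arg, context) for arg in call.group(1).split(","))
    if expr.startswith("$"):
        # Context, parent, and anchor values are all plain lookups here.
        return float(context[expr])
    return float(expr)

# Hypothetical values for a US Letter page and a parent element.
context = {"$page_width": 612.0, "$page_height": 792.0, "$parent_left": 36.0}
bbox = [resolve(e, context)
        for e in ["SUM($parent_left, 10)", "688", "$page_width", "$page_height"]]
print(bbox)  # [46.0, 688.0, 612.0, 792.0]
```

The sketch only shows how the different value kinds end up as plain numbers before an element is created; the real engine may support more functions and variable kinds.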

Why Use a Template?

  • Ensure consistent tagging across similar documents
  • Improve the accuracy of semantic tagging
  • Save time on manual remediation
  • Enhance accessibility and screen reader compatibility
  • Boost efficiency by applying one template to thousands of similar PDFs

Pagemap Overview

The pagemap engine is the core class responsible for page layout recognition. It processes each page from bottom to top and groups primitive objects into logical entities like words, lines, paragraphs, rules, tables, lists, etc.

In PDFix Desktop, each processing step can be visually debugged and adjusted. You can use the settings -> debug_pagemap_stop property in the template to stop the layout engine after any step to inspect intermediate results. This is especially useful when building templates – you can verify every step separately and identify where recognition failed.

For example, setting the stop point to word_update halts after words are recognized. You can then verify whether superscripts and subscripts are properly merged into words. This technique is also helpful for debugging complex processes like paragraph assembly or table recognition.

Functions


element_create

Purpose and Behavior

The element_create function inserts a new layout element into the document structure. Unlike other functions that modify detected elements, this one defines virtual elements manually, based on explicitly set parameters such as bounding box and type. These elements can be used as standalone tags, layout hints, containers for tagging, or structural anchors for downstream operations (e.g., TOC positioning or header/footer marking).

Created elements can be tagged, styled, and grouped just like native PDF elements. This is especially useful in cases where automatic detection fails or manual control over structure is required.

The element is created only if both of the following are defined:

  • valid type
    • pde_text, pde_text_line, pde_image, pde_container, pde_list, pde_line, pde_rect, pde_table, pde_cell, pde_toc, pde_header, pde_footer
  • bbox, start_bbox, or end_bbox
    • Fixed Bounding Box Elements – The simplest way to define an initial element is with a fixed bbox. This method is best for elements like headers, footers, or static sidebars that appear at known locations.
    • Anchor-Based Initial Elements – A more flexible method uses anchors. An initial element’s position and size are defined relative to previously detected anchor elements. This allows dynamic placement based on content structure. The system creates initial elements at the start of page processing and assigns each page object to one of them based on overlap.

If the bbox is zero-sized, the function causes an error during layout detection.

⚠️ Note: By using this method, you override the default behavior of the recognition engine. Instead of detecting elements heuristically, the engine will use your pre-defined type, bbox, and flags.

Nested Templates (element_template)

An initial element can contain its own element_template node. If present, this child template completely replaces the parent template for that section. This means that:

  • Functions and values from the main template do not apply within this section
  • Only the child template will be used to process that region

If element_template is not defined, the initial element inherits the parent or global template rules.

Initial element matching modes

  1. Exact bounding box match – if the no_expand flag is set, the element’s bbox is extended using initial_element_expansion.
  2. Overlap-based matching – if no_expand is not set, the layout engine checks if the overlap area with a parent exceeds the initial_element_overlap threshold. If so, the parent is assigned.
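
The overlap-based mode can be pictured with a small sketch. It is an assumption-laden illustration, not the engine's code: boxes are taken as (left, bottom, right, top) with y growing upward, the overlap is normalized by the object's own area, and the helper names and sample boxes are hypothetical.

```python
def overlap_ratio(obj, parent):
    """Share of the object's area that lies inside the parent box."""
    left = max(obj[0], parent[0])
    bottom = max(obj[1], parent[1])
    right = min(obj[2], parent[2])
    top = min(obj[3], parent[3])
    intersection = max(0.0, right - left) * max(0.0, top - bottom)
    obj_area = (obj[2] - obj[0]) * (obj[3] - obj[1])
    return intersection / obj_area if obj_area > 0 else 0.0

def assign_parent(obj, parents, initial_element_overlap):
    """Return the first parent whose overlap exceeds the threshold, else None."""
    for name, box in parents.items():
        if overlap_ratio(obj, box) > initial_element_overlap:
            return name
    return None

parents = {"header": (0, 688, 612, 792), "body": (0, 72, 612, 688)}
word = (100, 680, 200, 700)  # straddles the header/body boundary

print(assign_parent(word, parents, 0.5))  # header
```

Here 60% of the word lies inside the header box, so a threshold of 0.5 assigns it to the header; raising the threshold above 0.6 would leave it unassigned in this sketch.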

Editable Values

  • type – Element type (e.g. pde_text, pde_line, pde_table)
  • bbox – Bounding box for element position and size
  • start_bbox – defines beginning boundary for dynamic elements
  • end_bbox – defines end boundary for dynamic elements
  • name – Unique name for identification and references
  • id – ID used in tagging or alt-text generation
  • flag – Flags like artifact, header, footer for logical role
  • text_flag – Flags for text elements (e.g. no_newline, first_cap, etc.)
  • tag – Semantic tag (P, H1, L, etc.) for accessibility
  • heading – Text style role: normal, h1, h2, h3
  • label – Label level (label, li_1, label_no, etc.)
  • lang – Language of element content (e.g. en-US)
  • actual_text – Replacement text for screen readers
  • alt – Alternate text used for descriptions (especially images)
  • single_instance – Constraints for creating only one instance per page (e.g. font_size)
  • sort_direction – Reading order in containers: 0 = default, 1 = columns, 2 = rows
  • splitter – If element acts as a layout splitter (e.g. pde_cell)
  • element_template – Embedded template definition for nested configuration

Table or Cell specific fields

  • col_num – Number of columns (for tables)
  • row_num – Number of rows (for tables)
  • cell_row – Row position in table (for cell)
  • cell_column – Column position in table (for cell)
  • cell_row_span – Number of rows spanned by the cell
  • cell_column_span – Number of columns spanned by the cell
  • cell_scope – Scope of cell: row, column
  • cell_header – Whether the cell is a header
  • cell_associated_header – Comma-separated list of headers

Thresholds

Template Source: initial element

  • initial_element_overlap
  • initial_element_expansion

Example

This JSON snippet defines virtual elements for tagging content on page 1 of a PDF document using the element_create function. It demonstrates two primary use cases:

  • Creating a dynamic header region with a nested template
  • Manually tagging the text in a specific bounding box, so that the lines of the NovaBank address are wrapped in a single P tag

"element_create": [
    {
        "comment": "Tag elements on page n.1",
        "elements": [
            {
                "bbox": [
                    "0",
                    "688",
                    "$page_width",
                    "$page_height"
                ],
                "comment": "Tag header",
                "element_template": {
                    "template": {
                        "pagemap": [
                            {
                                "rd_sort": "2",
                                "rd_sort_direction": "2",
                                "statement": "$if"
                            }
                        ]
                    }
                },
                "type": "pde_header"
            },
            {
                "bbox": [
                    "88.05127716064453",
                    "527.709228515625",
                    "325.1794738769531",
                    "677.7356567382812"
                ],
                "comment": "Tag NovaBank address as single line text",
                "text_flag": "new_line",
                "type": "pde_text"
            }
        ]
    }
]

object_update

🧩 Function Purpose and Behavior

The object_update function processes low-level graphical objects (pds_object) from the current PDF page and prepares them for downstream layout recognition and tagging. It is primarily responsible for classifying visual elements such as lines, rectangles, background images, and shadings into semantic structures.

This is the stage where pds_path objects (such as vector paths) are converted into high-level PDFix elements like pde_line and pde_rect.

Supported object types include:

  • pds_object
  • pds_path
  • pds_image
  • pds_shading
  • pds_form

Each recognized object is assigned an initial_element based on its bounding box and type. These objects are later processed by structural functions like line_update, rect_update, or element_create.

✅ To Force Specific Behavior

Desired Behavior | What to Set
Force object to be ignored from layout analysis | Set "flag": "artifact"
Force object into header/footer grouping | Set "flag": "header" or "footer"
Match specific shapes or bounding boxes | Use custom "$if" query (see Example)
Force path into a line or rect | Adjust similarity thresholds (see below) so they are recognized as pde_line or pde_rect

🚫 To Prevent Specific Behavior

Prevent This | What to Do
Prevent accidental grouping of decorative paths as meaningful layout elements | Increase artifact_similarity or lower path_object_max
Prevent backgrounds from being misclassified | Filter pds_shading with "flag": "artifact" early in object_update
Prevent object from appearing as line or rect | Ensure thresholds like element_line_similarity are not met or use "flag": "artifact"

🔧 Editable Values

These can be set per object or query in your template:

Value | Description
flag | Defines layout role. Accepts values like: "artifact" (object will be excluded), "header" or "footer" (moved to special containers)

⚙️ Thresholds

Defined in the initial_element or page-level config, these control object classification:

Threshold | Description
artifact_similarity | Controls grouping tolerance for objects with similar position/size when marked as artifacts
element_line_similarity | Similarity threshold to detect if a pds_path represents a pde_line
angle_deviation | Max angle difference (in degrees) to cluster elements as horizontal/vertical lines
isolated_element_ratio | Ratio to identify if an element is isolated and not part of a structure
path_object_max | Upper size limit for path to be recognized as one object
path_object_min | Lower size threshold for path to be recognized as one object

📦 Template Source

  • Values are evaluated from the initial_element of each pds_object.

💡 Best Practice Recommendations

  • If you already know certain graphics (e.g., background boxes, top/bottom bars, separator lines) should be excluded from tagging or layout detection, tag them early in object_update with "flag": "artifact".
  • Excluding these early prevents misclassification in later stages like line_update, rect_update, or table detection.
  • You can write precise queries using "$page_num", "$0_top", "$0_bottom", and other bounding box keys to match only specific graphical objects.

🧪 Example

Mark all graphic objects whose bottom edge lies above 750 points on the first page as artifacts.

{
 "object_update": [
   {
     "flag": "artifact",
     "query": {
       "$0_bottom": { "$gt": "750" },
       "$page_num": "1"
     },
     "param": ["pds_object"],
     "statement": "$if"
   }
 ]
}

annot_update

🧩 Purpose and Behavior

The annot_update function processes all pdf_annot objects on the current PDF page and decides how they should be represented in the page layout. It ensures annotations are either:

  • Converted to structured layout elements (e.g., pde_form_field, pde_annot)
  • Skipped from further processing if irrelevant
  • Tagged correctly for accessibility or export

This function operates as the bridge between low-level PDF annotations and high-level semantic tagging.

Key operations include:

  • Widget annotations (like form checkboxes or text inputs) are transformed into pde_form_field elements.
  • Non-widget annotations (such as comments, highlights, or drawing shapes) become pde_annot layout elements.
  • All annotations are inserted into the page map structure and attached to their closest initial container using bounding box proximity.
  • Excluded annotations (those with internal kStateExclude flags) are skipped entirely.
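
The routing described in this list can be summarized in a few lines. This is a minimal sketch, not the SDK implementation: the `Annot` class and its `excluded` field are hypothetical stand-ins for a pdf_annot object and its internal exclude state.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Annot:
    subtype: str            # PDF annotation subtype, e.g. "Widget", "Highlight"
    excluded: bool = False  # stands in for the internal exclude state

def classify_annot(annot: Annot) -> Optional[str]:
    """Return the layout element type for an annotation, or None if skipped."""
    if annot.excluded:
        return None  # excluded annotations never reach the page map
    if annot.subtype == "Widget":
        return "pde_form_field"  # form checkboxes, text inputs, ...
    return "pde_annot"  # comments, highlights, drawing shapes, ...

print(classify_annot(Annot("Widget")))           # pde_form_field
print(classify_annot(Annot("Highlight")))        # pde_annot
print(classify_annot(Annot("Text", True)))       # None
```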

line_update

🧩 Function Purpose and Behavior

The line_update function evaluates graphical lines (pde_line) extracted from pds_path objects and assigns them semantic meaning or layout roles. These lines can later influence element segmentation, table detection, or be marked as artifacts.

Each line is matched to an initial element based on its bounding box and geometric similarity. This function also filters lines that are too short, too slanted, or redundant.

🚫 To Prevent Behavior

Condition | Effect
Lines from different form_obj or XObject | Cannot be merged
Lines with different initial elements | Merge skipped unless a parent-child relationship exists
Line has no_join flag | Prevents merging with any other line

✅ To Force Behavior

What You Want to Do | How to Do It
Force it into a container | Use parent to explicitly attach it to a named element
Ensure it is merged into another line | Make both lines share the same initial element
Extend lines automatically | Lines are only merged if they pass internal checks for graphic style and geometric alignment. They must also originate from the same XObject. If merging fails due to this constraint, consider using the Flatten Form XObject feature to normalize content structure.

🔧 Editable Values

Key | Description
flag | Semantic classification, e.g. "artifact", "header", "footer". Set "artifact" to move the line to the artifact layer and exclude it from layout grouping
label | Label recognition flag (label, li_1, label_no, etc.)
tag | Tag type (e.g., Figure, Span) used for accessibility
name | Internal or debug-friendly name for the line
parent | Reference to the name of another element; links this line to its container

⚙️ Thresholds

Threshold | Description
angle_deviation | Maximum allowed deviation in degrees to consider the angle as same line
table_line_intersection | Defines how closely two lines must intersect or align to be eligible for merging in the line extend test.
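
To make the role of angle_deviation concrete, the sketch below classifies a segment as horizontal or vertical when its angle stays within the allowed deviation. It only illustrates the idea; the function name and the exact comparison are assumptions, not the engine's test.

```python
import math

def line_orientation(x1, y1, x2, y2, angle_deviation):
    """Return "horizontal", "vertical", or None for a slanted segment."""
    # Fold the direction into [0, 180) so that drawing direction is ignored.
    angle = math.degrees(math.atan2(y2 - y1, x2 - x1)) % 180.0
    if min(angle, 180.0 - angle) <= angle_deviation:
        return "horizontal"
    if abs(angle - 90.0) <= angle_deviation:
        return "vertical"
    return None  # too slanted to count as a ruling line

print(line_orientation(0, 0, 100, 1, 2.0))    # horizontal (about 0.57 degrees)
print(line_orientation(0, 0, 1, 100, 2.0))    # vertical
print(line_orientation(0, 0, 100, 100, 2.0))  # None
```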

📦 Template Source

  • pde_line → initial element

💡 Best Practice Recommendations

  • Lines marked as artifact here will be excluded from tagging. This is ideal for footers, underlines, or visual dividers.

📝 Example

Mark all detected lines as artifacts.

"line_update": [
    {
        "flag": "artifact",
        "query": {
            "param": [
                "pde_line"
            ]
        },
        "statement": "$if"
    }
]

rect_update

Purpose and Behavior

The rect_update function processes detected rectangle elements (pde_rect) and attempts to merge them into unified rectangular blocks when certain geometric and structural criteria are met. This is particularly useful for simplifying visual background elements like section boxes, shaded containers, and graphic panels.

✅ To Force Behavior

What You Want to Do | How to Do It
Force it into a container | Use parent to explicitly attach it to a named element
Ensure it is merged into another rectangle | Make both have matching initial elements or initial parent
Extend rectangles automatically | Rectangles are only merged if they pass internal checks for graphic style, geometric alignment, and structural similarity. They must also originate from the same XObject. If merging fails due to this constraint, consider using the Flatten Form XObject feature to normalize content structure.

🚫 To Prevent Behavior

Condition | Effect
Rectangles from different form_obj or XObject | Cannot be merged
Rectangles with different initial elements | Merge skipped unless a parent-child relationship exists
Rectangle has no_join flag | Prevents merging with any other rectangle
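
The three blocking conditions in this table can be read as a pre-check that runs before any geometric test. The sketch below encodes them under stated assumptions: the `Rect` class and its fields are hypothetical, and passing this check is necessary but not sufficient, since the style and alignment checks still apply.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Rect:
    xobject: str                             # owning form XObject (or page)
    initial_element: str                     # name of the assigned initial element
    parent_of_initial: Optional[str] = None  # parent of that initial element
    flags: set = field(default_factory=set)

def may_merge(a: Rect, b: Rect) -> bool:
    """Structural pre-check only; geometry and graphic style are not tested."""
    if "no_join" in a.flags or "no_join" in b.flags:
        return False
    if a.xobject != b.xobject:
        return False
    same_initial = a.initial_element == b.initial_element
    parent_child = (a.parent_of_initial == b.initial_element
                    or b.parent_of_initial == a.initial_element)
    return same_initial or parent_child

box = Rect("page", "sidebar")
shade = Rect("page", "sidebar_title", parent_of_initial="sidebar")
print(may_merge(box, shade))                            # True (parent-child)
print(may_merge(box, Rect("Fm1", "sidebar")))           # False (other XObject)
```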

🔧 Editable Values

Name | Description
flag | Layout role (artifact, header, footer, etc.)
label | Label recognition flag (label, li_1, label_no, etc.)
tag | Semantic tag (Art, Figure, etc.)
name | User-defined identifier for referencing or debugging
parent | Assign rect into a known template-defined container

⚙️ Thresholds

Threshold | Description
table_line_intersection | Defines how closely two lines must intersect or align to be eligible for merging in the rectangle extend test.

📦 Template Source

  • pde_rect → initial element

📝 Example

Mark gray rectangles whose width and height are both at most 10 points as first-level list labels (li_1).

"rect_update": [
    {
        "label": "li_1",
        "query": {
            "$and": [
                {
                    "$0_fill_color": [
                        "100",
                        "100",
                        "100"
                    ]
                },
                {
                    "$0_height": {
                        "$lte": "10"
                    }
                },
                {
                    "$0_width": {
                        "$lte": "10"
                    }
                }
            ],
            "param": [
                "pde_rect"
            ]
        },
        "statement": "$if"
    }
]

object_update_text

Purpose and Behavior

The object_update_text function parses text-based PDS objects from the current page. It is invoked with parameters:

  • pds_object
  • pds_text

The function extracts text runs from the page content, segments them into words, and assigns each word to an appropriate initial element based on its bounding box.

This is one of the earliest and most critical steps in the layout pipeline. If this segmentation fails, all downstream logic (headings, tagging, tables) will be incorrect.

Text runs are later grouped into words based on spatial properties – including inter-character spacing and baseline alignment.

Editable Values

  • flag – controls behavior (artifact, header, footer)

Thresholds

Template Source: page level using values in the main template block

  • word_space_width_ratio
    Ratio multiplier used to estimate the maximum allowed space between characters within a word. If spacing between characters exceeds this value × minimal char spacing → a new word begins.
  • word_space_width_min_ratio
    Optional lower bound used to constrain the influence of very small font sizes. Applied as: allowed_space = font_size × word_space_width_min_ratio

These values control when and where characters get split into new words – especially important in justified or stylized text.

⚠️ Note: These are defined in the root-level (global) template, not per-element.
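
The two ratios can be illustrated with a simplified splitter. This is a sketch under assumptions, not the engine's algorithm: characters are given as (text, left, right) tuples on one baseline, the smallest positive gap stands in for the "minimal char spacing", and taking the larger of the two limits is a guess at how the lower bound is combined.

```python
def split_words(chars, font_size, word_space_width_ratio,
                word_space_width_min_ratio=0.0):
    """Group (text, left, right) characters into words by gap size."""
    if not chars:
        return []
    gaps = [nxt[1] - cur[2] for cur, nxt in zip(chars, chars[1:])]
    positive = [g for g in gaps if g > 0]
    min_gap = min(positive) if positive else 0.0
    # Allowed in-word gap: scaled minimal spacing, bounded below by font size.
    allowed = max(min_gap * word_space_width_ratio,
                  font_size * word_space_width_min_ratio)
    words, current = [], chars[0][0]
    for (text, _, _), gap in zip(chars[1:], gaps):
        if gap > allowed:
            words.append(current)  # gap too wide: start a new word
            current = text
        else:
            current += text
    words.append(current)
    return words

chars = [("a", 0.0, 5.0), ("b", 5.5, 10.5), ("c", 14.0, 19.0), ("d", 19.5, 24.5)]
print(split_words(chars, font_size=10.0, word_space_width_ratio=3.0,
                  word_space_width_min_ratio=0.1))  # ['ab', 'cd']
```

With a minimal gap of 0.5 and a ratio of 3.0 the allowed in-word gap is 1.5, so only the 3.5-wide gap starts a new word.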

Example

Mark all text objects whose bottom edge is at or above 740 points as artifacts on every page except the first.

"object_update": [
    {
        "comment": "Artifact texts except first page",
        "flag": "artifact",
        "query": {
            "$and": [
                {
                    "$page_num": {
                        "$gt": "1"
                    }
                },
                {
                    "$0_bottom": {
                        "$gte": "740"
                    }
                }
            ],
            "param": [
                "pds_text"
            ]
        },
        "statement": "$if"
    }
]

text_run_update

Purpose and Behavior

The text_run_update function modifies properties of individual pde_text_run elements after they are parsed from the page.

Its primary purpose is to assign text state flags based on visual or semantic context – such as subscript/superscript styling.

The text_run_update rule is evaluated per text run, immediately after it is extracted from the PDF stream.

Editable Values

  • text_state_flag
    Can include values like:
    • subscript
    • superscript

These flags do not split the text run from its associated word. The run remains part of the word but is marked for later structural tagging.

This preserves grouped representations like: H₂O instead of: H 2 O

Thresholds

Template Source: page level using values in the main template block

  • text_line_baseline_ratio
  • angle_deviation

Warning: These are defined in the root-level (global) template, not per-element.

Example

Mark all ArialMT text runs with a font size of 4 points as superscript.

"text_run_update": [
    {
        "query": {
            "$and": [
                {
                    "$0_font_name": {
                        "$regex": "ArialMT"
                    }
                },
                {
                    "$0_font_size": "4"
                }
            ],
            "param": [
                "pde_text_run"
            ]
        },
        "statement": "$if",
        "text_state_flag": "superscript"
    }
]

text_run_neighbours

Purpose and Behavior

The text_run_neighbours function determines whether two consecutive pde_text_run elements should be merged into a single word or split.

This function overrides automatic word detection, offering precision control in cases where standard heuristics fail (e.g., tight kerning, styled fragments).

Merge Decision Logic

The function compares two runs:

  1. If join = true → the runs are always joined into the same word.
  2. If join = false → a forced split is applied between the runs, even if they align.
  3. If no explicit rule is matched, the fallback logic checks:
    • Angle consistency (same_angle)
    • Baseline alignment (same_baseline)
    • Spacing break (e.g., visual separator or vertical border)

This lets you force join or split conditions for complex or edge-case text layouts.
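
The decision order above can be written as a short function. It is a minimal sketch: `join` is None when no rule matched, and requiring all three fallback checks to pass is an assumption about how the engine combines them.

```python
from typing import Optional

def should_join(join: Optional[bool], same_angle: bool,
                same_baseline: bool, spacing_break: bool) -> bool:
    """Decide whether two consecutive text runs belong to the same word."""
    if join is not None:
        return join  # an explicit rule always wins, in either direction
    # Fallback: runs must agree in angle and baseline with no break between.
    return same_angle and same_baseline and not spacing_break

print(should_join(True, False, False, True))   # True  (forced join)
print(should_join(False, True, True, False))   # False (forced split)
print(should_join(None, True, True, False))    # True  (fallback passes)
```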

Editable Values

  • join – Boolean flag to force merge or split

Thresholds

Template Source: page level using values in the main template block.

  • text_line_baseline_ratio – internal margin for baseline deviation
  • angle_deviation

⚠️ Note: These are defined in the root-level (global) template, not per-element.


word_update

Purpose and Behavior

The word_update function processes individual pde_word elements after word segmentation is complete. It serves three key purposes:

  1. Semantic Flag Assignment
    Applies structural or semantic flags to words based on regex rules. These flags influence downstream layout recognition (e.g., list detection, heading classification, TOC generation).
  2. Filling and Label Detection
    • Uses regex_filling to detect filler-only words (like “…”, “–”).
    • Splits or skips them depending on match length.
    • Detects structured labels (e.g., numbered bullets or Roman/letter markers).
  3. Property Update and Artifact Extraction
    If a word is marked as artifact, header, or footer, it is converted into a separate pde_text element, detached from the text stream, and placed into the corresponding container. This prevents accidental tagging or reading by screen readers.

The word_update function identifies possible list labels and table of contents (TOC) entries among pde_word elements. It marks these elements with corresponding label or toc values and attempts to pair them with a sibling (typically a neighboring word) that represents the list item content or TOC title.

Editable Values

  • name – unique name
  • tag – semantic tag (e.g., Span, Div, Note)
  • flag – behavior modifier (artifact, header, footer, etc.)
  • label – used for logical labeling
  • heading – heading level
  • actual_text – alternate value for screen readers
  • lang – language code
  • word_flag – manual override for system-assigned flags
  • single_instance – suppresses duplicates across layout
  • word_space – sets an exact space width for this word’s font size and font. If word_space is set manually here, it overrides all computed word spacing logic and disables word_space_ratio.

Thresholds

Template Source: pde_word → initial element

  • word_space_ratio – multiplier for auto-calculated spacing (used unless overridden)
  • word_space_update_max – prevents auto-updates beyond a limit (0 = no updates)

pagemap_regex flags used for semantic detection

  • hyphen – regex_hyphen
  • bullet – regex_bullet, regex_bullet_font
  • colon – regex_colon
  • number – number_chars
  • terminal – regex_terminal
  • capital – regex_first_cap
  • decimal_num – regex_decimal_numbering
  • roman_num – regex_roman_numbering
  • letter_num – regex_letter_numbering
  • page_num – regex_page_number
  • filling – regex_filling
  • comma – regex_comma
  • label – regex_label, label_chars, regex_letter

Example

Mark all dot-leader words as artifacts.

{
 "word_update": [
   {
     "flag": "artifact",
     "query": {
       "$0_text": { "$regex": "^\\.+$" }
     },
     "param": ["pde_word"],
     "statement": "$if"
   }
 ]
}

Change the tag type of all words with a font size of 4.5 to Span.

"word_update": [
    {
        "query": {
            "$and": [
                {
                    "$0_font_size": "4.5"
                }
            ],
            "param": [
                "pde_word"
            ]
        },
        "statement": "$if",
        "tag": "Span"
    }
]

Set the actual text based on the word's text: an empty box (□) is read as "No" and a filled box (■) as "Yes".

"word_update": [
    {
        "actual_text": "No",
        "query": {
            "$and": [
                {
                    "$0_text": {
                        "$regex": "□"
                    }
                }
            ],
            "param": [
                "pde_word"
            ]
        },
        "statement": "$if",
        "tag": "Span"
    },
    {
        "actual_text": "Yes",
        "query": {
            "$and": [
                {
                    "$0_text": {
                        "$regex": "■"
                    }
                }
            ],
            "param": [
                "pde_word"
            ]
        },
        "statement": "$if",
        "tag": "Span"
    }
]

Word Spacing Precision Logic

Accurate word spacing is critical for reliable recognition of text lines, paragraph types (e.g. justified vs. simple), and ultimately, for layout structure such as heading alignment or table row segmentation.

Automatic Detection of Word Space

During word recognition, a base word space is automatically estimated for each unique:

  • Font name
  • Font size

This is computed by analyzing inter-character gaps across text runs with matching font properties. It defines what spacing is considered “normal” within a word vs. between words.

This automatically estimated word space width is then used to:

  • Detect word boundaries
  • Classify a line as:
    • Simple: uniform space width
    • Justified: variable space widths

Ways to Adjust Word Space

1. Exact Override per Word (word_update)

Template Source: pde_word → initial element

"word_update": [
   {
       "query": {
           "$and": [
               {
                   "$0_font_size": "4.5"
               }
           ],
           "param": [
               "pde_word"
           ]
       },
       "statement": "$if",
       "word_space": "4.2"
   }
]

Sets an exact word spacing value for this font-size & font-name combination.

  • Overrides all other logic
  • Disables word_space_ratio

Use this when auto-estimation fails for a specific word or stylized font.

2. Proportional Scaling (word_space_ratio)

Template Source: pde_container → initial element

"pagemap": [
   {
       "word_space_ratio": "1.15"
   }
]

Multiplies the estimated space width by a scaling factor. Useful for small global corrections.

  • Only used if no word_space is defined
  • Applies to all words in the container

3. Post-Line Update Adjustment (text_line_update)

Template Source: pde_text_line → initial element (for word_space)

"text_line_update": [
   {
       "word_space": "4.2",
       "query": {},
       "param": [
           "pde_text_line"
       ],
       "statement": "$if"
   }
]

Once words are grouped into lines, text_line_update can fine-tune or lock the final spacing:

  • If word_space is defined in text_line_update, the spacing for all words in the line is set to that value.
  • If word_space_update_max is defined:
    • It limits re-estimation from line analysis
    • Set to 0 to prevent any automatic spacing changes
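
Taken together, the three adjustment routes form a precedence order. The sketch below is one plausible reading of it, not the engine's code: the function and parameter names are hypothetical, and letting the text_line_update value win over the per-word value is an assumption based on it being applied later.

```python
def effective_word_space(estimated, word_space=None,
                         word_space_ratio=1.0, line_word_space=None):
    """Pick the word space that applies after all overrides."""
    if line_word_space is not None:
        return line_word_space   # set in text_line_update, applied last
    if word_space is not None:
        return word_space        # exact override, ratio is ignored
    return estimated * word_space_ratio  # scaled automatic estimate

print(effective_word_space(3.0, word_space_ratio=2.0))                  # 6.0
print(effective_word_space(3.0, word_space=4.2, word_space_ratio=2.0))  # 4.2
print(effective_word_space(3.0, word_space=4.2, line_word_space=5.0))   # 5.0
```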

Best Practice Recommendations

  • Use word_space only when auto-estimation fails consistently for specific fonts.
  • Apply word_space_ratio globally in containers (e.g., invoices or tables).
  • Avoid multiple re-definitions across word_update and text_line_update unless needed for layout corrections.
  • Use word_space_update_max:0 to lock final spacing post-assembly.

Label Detection

Purpose and Behavior

The word_update function detects and classifies list label words (li_1 to li_4) using regex rules and spatial relationships. When a word is marked as a potential label (e.g., 1., A), (iii)), the engine attempts to pair it with a sibling element – usually the associated paragraph or line content.

  • If a word is explicitly marked with a label property in a template rule (e.g., label: li_1), it is treated as a list item at the corresponding nesting level.
  • If no label is set manually, the system attempts automatic detection using regex and heuristics:
    • Valid label formats: numeric (1.), alphabetic (A), Roman numerals (IV), bullets (•), or combinations with brackets or dots.
    • The word is analyzed for regex_label, regex_letter, regex_roman_numbering, and others.
  • Once a word is flagged as a possible label, it is matched to a sibling word based on reading order (RTL or LTR).

Thresholds

Template Source: pde_word → initial element

  • label_word_detect – Enables automatic label detection. Set to 0 in templates where labeling is irrelevant to prevent unwanted grouping.
  • label_distance_ratio – Used to calculate label-sibling horizontal distance. dist = font_size × label_distance_ratio. It affects how far the algorithm will look horizontally for a matching sibling.
  • label_word_w1 – Weight for vertical alignment similarity between label candidates during clustering.
  • label_word_w2 – Weight for label-sibling offset alignment.
  • label_word_dist_sibling_ratio – Reject labels if sibling is too far. Max distance = font_size × ratio.
  • label_sibling_distance_ratio – Reject false labels if sibling is too close to another word in the line.
  • label_word_distance – Maximum absolute clustering distance between label candidates.
  • label_word_distance_ratio – Used if label_word_distance == 0, scales by page width.
  • concurrent_threads – Multithreaded clustering of label candidates.
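
Several of these thresholds are simple products of a ratio and a reference size. The helpers below restate the formulas from the list in code; the function names and sample numbers are hypothetical, and the comparison direction in `sibling_acceptable` is an assumption.

```python
def label_search_distance(font_size, label_distance_ratio):
    """Horizontal distance searched for a sibling: font_size x ratio."""
    return font_size * label_distance_ratio

def clustering_distance(label_word_distance, label_word_distance_ratio,
                        page_width):
    """Absolute distance if set, otherwise the ratio scaled by page width."""
    if label_word_distance == 0:
        return label_word_distance_ratio * page_width
    return label_word_distance

def sibling_acceptable(gap, font_size, label_word_dist_sibling_ratio):
    """Reject a label whose sibling is farther than font_size x ratio."""
    return gap <= font_size * label_word_dist_sibling_ratio

print(label_search_distance(10.0, 2.5))      # 25.0
print(clustering_distance(0, 0.5, 600.0))    # 300.0
print(clustering_distance(4.0, 0.5, 600.0))  # 4.0
print(sibling_acceptable(30.0, 10.0, 2.0))   # False
```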

pagemap_regex flags used for semantic detection

  • label_chars: Characters like “(“, “)”, “.” used to strip and normalize labels.
  • regex_label: Pattern for detecting list markers.

Editable Values

  • label: label, label_no, li_1 … li_4

Best Practice Recommendations

  • Explicitly set label_word_detect = 0 in templates where label detection is irrelevant (e.g. tables without list-like elements).

Example

"word_update": [
    {
        "label": "li_1",
        "query": {
            "$and": [
                {
                    "$0_text": {
                        "$regex": "^\\([a-z]\\)$"
                    }
                }
            ],
            "param": [
                "pde_word"
            ]
        },
        "statement": "$if"
    },
    {
        "label": "li_2",
        "query": {
            "$and": [
                {
                    "$0_text": {
                        "$regex": "^\\(\\d\\)$"
                    }
                }
            ],
            "param": [
                "pde_word"
            ]
        },
        "statement": "$if"
    },
    {
        "label": "li_3",
        "query": {
            "$and": [
                {
                    "$0_text": {
                        "$regex": "^\\((i|ii|iii|iv|v|vi|vii|viii|ix|x|xi|xii|xiii|xiv|xv|xvi|xvii|xviii|xix|xx)\\)$"
                    }
                }
            ],
            "param": [
                "pde_word"
            ]
        },
        "statement": "$if"
    }
]
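
To see which words each rule in the example would pick up, the three patterns can be tried outside the template. This is a small Python sketch using the standard re module. The patterns are copied from the example with the JSON escaping removed; the helper name matching_labels is hypothetical, and the sketch makes no claim about how the engine resolves a word that matches more than one rule.

```python
import re

# Patterns copied from the word_update example (JSON escaping removed).
RULES = [
    ("li_1", r"^\([a-z]\)$"),
    ("li_2", r"^\(\d\)$"),
    ("li_3", r"^\((i|ii|iii|iv|v|vi|vii|viii|ix|x|xi|xii|xiii|xiv|xv|xvi|xvii|xviii|xix|xx)\)$"),
]


def matching_labels(word_text: str) -> list[str]:
    """Return every label whose regex matches the word text."""
    return [label for label, pattern in RULES if re.search(pattern, word_text)]


for sample in ["(a)", "(3)", "(ii)", "(i)", "(12)"]:
    print(sample, matching_labels(sample))
```

Single-letter Roman numerals such as (i), (v) and (x) satisfy both the li_1 and the li_3 pattern, and a two-digit marker such as (12) matches none of the three rules, so it is worth testing patterns against real label text before relying on them.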

TOC Detection

Purpose and Behavior

The word_update function also supports detection of Table of Contents (TOC) elements by identifying possible page number words (e.g., 15, 248, xii) and matching them with title words on the same line.

  • Words are flagged as TOC numbers using regex_page_number.
  • A sibling is searched to the left or right (depending on document direction) that likely represents the section heading.
  • TOC entries are grouped and clustered spatially, then marked as a TOC item.
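
The first two steps above can be illustrated with a short Python sketch. Everything in it is an assumption made for illustration: the PAGE_NUMBER pattern is a hypothetical stand-in for regex_page_number (whose real definition is not shown here), the function name is invented, and the spatial clustering step is left out entirely.

```python
import re

# Hypothetical stand-in for regex_page_number: digits or lowercase Roman numerals.
PAGE_NUMBER = re.compile(r"^(\d+|[ivxlcdm]+)$")


def find_toc_entry(line_words: list[str], rtl: bool = False):
    """Return (title, page_number) if the line carries a page-number-like word.

    line_words holds the words of one line in visual left-to-right order.
    For LTR documents the number is expected on the right, for RTL on the left.
    """
    if len(line_words) < 2:
        return None
    number = line_words[0] if rtl else line_words[-1]
    title_words = line_words[1:] if rtl else line_words[:-1]
    if not PAGE_NUMBER.match(number):
        return None
    return " ".join(title_words), number


print(find_toc_entry(["Introduction", "15"]))
print(find_toc_entry(["Preface", "xii"]))
print(find_toc_entry(["Plain", "sentence"]))
```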

Thresholds

Template Source: pde_word → initial element

  • toc_detect – Enables TOC detection. Set to 0 in non-TOC templates for performance and precision.
  • toc_word_distance – Absolute clustering cutoff for grouping similar TOC numbers. Defines the tightness of TOC number clustering. This ensures only aligned page number entries are grouped.
  • toc_word_distance_ratio – Used if the absolute distance is unset. Multiplied by page_font_width.
  • concurrent_threads – Parallel processing during clustering of TOC words.

pagemap_regex flags used for semantic detection

  • regex_page_number: Used to detect page numbers.

Editable Values

  • toc: label, label_no

Best Practice Recommendations

  • Set toc_detect = 0 in non-TOC sections to improve performance and avoid misclassifications.

word_neighbours

Purpose and Behavior

The word_neighbours function controls how recognized words (pde_word) are grouped into text lines (pde_text_line). It plays a critical role in reconstructing logical reading order and grouping sequences of text.

Two words will be joined into the same line if:

  1. Their initial elements are compatible.
  2. Neither has the no_join flag.
  3. They have the same text style.
  4. Their writing angles match.
  5. They share the same baseline (within text_line_baseline_ratio × font size).
  6. No splitting object (line, rect, other words) is between them.
  7. The gap between them is ≤ word_space_distance_max (or, if that value is zero, word_space_distance_max_ratio × max font size).

You can override these constraints explicitly using the word_neighbours rule with join: true | false.

If a word has an initial element that is already a pde_text_line, it is automatically inserted into that line. Otherwise, a new line is created, and neighboring words are added if they pass all join conditions.
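
The join conditions can be read as a chain of checks. The Python sketch below is one possible reading, not the SDK implementation. The Word class and the function names are hypothetical. "Compatible" initial elements are simplified to "equal", the explicit join rule is assumed to be evaluated before everything else, and the baseline tolerance is assumed to use the larger of the two font sizes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Word:
    """Hypothetical, simplified stand-in for a pde_word."""
    initial_element: str
    text_style: str
    angle: float
    baseline: float
    font_size: float
    left: float
    right: float
    no_join: bool = False


def max_word_gap(a: Word, b: Word, distance_max: float, distance_max_ratio: float) -> float:
    """Absolute gap limit if set, otherwise ratio x the larger font size."""
    if distance_max != 0:
        return distance_max
    return distance_max_ratio * max(a.font_size, b.font_size)


def can_join(a: Word, b: Word, *, baseline_ratio: float, distance_max: float = 0.0,
             distance_max_ratio: float = 0.0, splitter_between: bool = False,
             join_rule: Optional[bool] = None) -> bool:
    """Return True if word b may be appended to the line that ends with word a."""
    if join_rule is not None:                     # explicit word_neighbours rule (assumed to win)
        return join_rule
    if a.initial_element != b.initial_element:    # 1. "compatible" simplified to "equal"
        return False
    if a.no_join or b.no_join:                    # 2. no_join flag
        return False
    if a.text_style != b.text_style:              # 3. same text style
        return False
    if a.angle != b.angle:                        # 4. same writing angle
        return False
    tolerance = baseline_ratio * max(a.font_size, b.font_size)
    if abs(a.baseline - b.baseline) > tolerance:  # 5. shared baseline
        return False
    if splitter_between:                          # 6. no splitting object in between
        return False
    gap = b.left - a.right                        # 7. horizontal gap within the limit
    return gap <= max_word_gap(a, b, distance_max, distance_max_ratio)


w1 = Word("body", "regular", 0.0, 100.0, 10.0, 50.0, 80.0)
w2 = Word("body", "regular", 0.0, 100.5, 10.0, 83.0, 120.0)
print(can_join(w1, w2, baseline_ratio=0.1, distance_max_ratio=0.5))
```

In this sample the baselines differ by 0.5 (tolerance 1.0) and the gap is 3 (limit 5), so the two words end up on the same line.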

Editable Values

  • Defined in word_neighbours:
    • join:
      • If true, forcibly joins two words into a line.
      • If false, forcibly prevents joining.
      • If not defined, fallback logic uses baseline, spacing, and flags.
  • Defined in word_update for individual word flags:
    • no_join: If a word has this flag set, it will never be joined into a line, even if all other conditions match.

Thresholds

Template Source: pde_text_line → initial element

  • text_line_baseline_ratio – Maximum vertical offset allowed between baselines, multiplied by font size.
  • word_space_distance_max – Maximum absolute horizontal gap between words (in user units).
  • word_space_distance_max_ratio – If word_space_distance_max is zero, this ratio × max font size is used.
    • These two thresholds are ignored for RTL label words unless label_word_detect = 0.

⚠️ Note: The word_neighbours method is called only for specific pairs of words, not for all pairs.

⚠️ Note: Merging is only attempted if the spacing threshold is not exceeded and no graphical splitters or layout anomalies are in between.

Best Practice Recommendations

  • To ensure two lines are treated as part of the same paragraph, create an initial pde_text element to unify them.

Example

"word_neighbours": [
    {
        "join": "true",
        "query": {
            "$and": [
                {
                    "$0_font_name": {
                        "$regex": "Arial-BoldMT"
                    }
                },
                {
                    "$1_font_name": {
                        "$regex": "Arial-BoldMT"
                    }
                }
            ],
            "param": [
                "pde_word",
                "pde_word"
            ]
        },
        "statement": "$if"
    }
]

word_connect

Purpose and Behavior

The word_connect function merges consecutive text lines that logically belong together. It is typically used to reconstruct full paragraphs or wrapped lines that were split due to PDF layout quirks.

The function evaluates pairs of pde_text_line elements based on alignment, spacing, and their surrounding context. If certain conditions are met, two lines are merged into one.

When lines are merged, all their words are combined into a single pde_text_line, and the redundant line is removed from the container.
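
The merge step itself is a small container operation. As a rough Python sketch (the list-of-lists model and the function name are hypothetical, and the conditions that decide whether to merge are not modelled):

```python
def merge_lines(container: list[list[str]], first: int, second: int) -> None:
    """Move all words of line `second` into line `first`, then drop the emptied line."""
    container[first].extend(container[second])
    del container[second]


lines = [["The", "quick"], ["brown", "fox"], ["Next", "paragraph"]]
merge_lines(lines, 0, 1)
print(lines)
```

After the call the container holds two lines instead of three, and the first line carries the words of both originals in order.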

Editable Values

Defined in word_neighbours:

  • join:
    • If true, forcibly joins two words into a line.
    • If false, forcibly prevents joining.
    • If not defined, fallback logic uses baseline, spacing, and flags.

Thresholds

Template Source: pde_text_line → initial element

Word-level decisions from: pde_word → initial element

  • text_line_baseline_ratio: Used in word_matches_to_line() to determine baseline tolerance.
  • word_space_distance_max and word_space_distance_max_ratio: Control maximum allowed space between words. Used in both word_neighbours and word_connect.

text_line_update

Purpose and Behavior

The text_line_update function processes detected text lines (pde_text_line) and updates their structural roles, properties, and downstream tagging behavior. It is a critical step before paragraph recognition, as it ensures each line is correctly interpreted and labeled.

This function performs the following key operations:

  • Assigns properties such as name, tag, flag, label, heading, lang, and splitter.
  • Recognizes and processes label-based lines (e.g. list items) previously marked during word_update.
  • If a text line is flagged as artifact, header, or footer, it is excluded from structural grouping and moved into the appropriate container (artifact, header, footer).
  • Detects and handles filling characters (e.g., repeated dots) using the text_line_split_filling function. This can split out decorative patterns unless split=false or the line has a no_split flag.
  • Analyzes underlines by evaluating proximity of pde_line graphics to the baseline of the text line.
  • Constructs chunks (homogeneous blocks) based on consistent word spacing within the line. These chunks are later used to decide line breaking and column detection. Lines marked no_split or with text initial elements are not broken during this stage.
  • Lines with pde_text as their initial element are never split, ensuring strict preservation of manually defined blocks.
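
Chunk construction can be pictured as grouping the words of a line by the gaps between them. The Python sketch below is a simplified illustration, not the SDK algorithm: it assumes a single absolute gap limit (in the spirit of text_line_chunk_distance_max), represents words as hypothetical (left, right, text) tuples, and ignores the ratio-based thresholds.

```python
def build_chunks(words: list[tuple[float, float, str]], max_gap: float) -> list[list[str]]:
    """Group words (left, right, text), sorted left to right, into chunks.

    A new chunk starts whenever the gap to the previous word exceeds max_gap.
    """
    chunks: list[list[str]] = []
    prev_right = None
    for left, right, text in words:
        if prev_right is None or left - prev_right > max_gap:
            chunks.append([text])
        else:
            chunks[-1].append(text)
        prev_right = right
    return chunks


line = [(0.0, 20.0, "Item"), (23.0, 40.0, "name"), (120.0, 150.0, "Unit"), (153.0, 170.0, "price")]
print(build_chunks(line, max_gap=10.0))
```

The 80-unit gap between "name" and "Unit" starts a second chunk, which is the kind of signal later stages can use for column detection.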

Editable Values

  • name
  • tag
  • flag
  • actual_text
  • lang
  • label
  • heading
  • text_line_flag
  • splitter
  • single_instance
  • word_space

Thresholds

Template Source: pde_text_line → initial element

  • text_line_underline_distance — maximum allowed vertical distance to consider a line as underlined.
  • text_line_underline_char_distance_ratio — adjusts underline detection based on font size.
  • text_line_chunk_distance_max — maximum allowed distance between words to consider them in the same chunk.
  • text_line_chunk_distance_max_ratio — scaled by font size for dynamic control of chunk merging.
  • text_line_chunk_distance — absolute chunk separation threshold.
  • text_line_chunk_distance_ratio — relative distance for detecting chunk boundaries.

Best Practice Recommendations

  • The word_neighbours method is called only for specific pairs of words, not for all pairs.

text_create

Purpose and Behavior

The text_create function assembles detected text lines into paragraph-level text blocks (pde_text).

  • If a pde_text_line has an initial element of type pde_text, it is automatically added to that pde_text block.
  • Lines that do not belong to the same initial element are never joined.
  • The function respects the join parameter in the text_line_neighbours function:
    • If join: true, the two lines are always merged (as long as they are in the same container).
    • If join: false, the lines will never be merged.
  • If the text_line has the no_join flag, it is excluded from paragraph assembly.
  • Lines are merged only if their text style values match (if styles are defined).
  • Font size must also match between lines. If they differ slightly, the threshold text_line_join_font_size_distance determines the acceptable range.
  • Paragraph merging is blocked if any visual splitters (such as pde_line, pde_rect, or other layout-dividing elements) are detected between the lines.
    • For example, if a pde_line has been declared as an initial element between two pde_text_line elements, those text lines will remain separate.
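
The rules above can be combined into a single yes/no decision for a pair of lines. The following Python sketch is one interpretation under stated assumptions: the TextLine class and the function name are hypothetical, the order of the checks is assumed, and only the threshold name text_line_join_font_size_distance comes from this documentation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TextLine:
    """Hypothetical, simplified stand-in for a pde_text_line."""
    initial_element: str
    container: str
    font_size: float
    text_style: Optional[str] = None
    no_join: bool = False


def can_merge(a: TextLine, b: TextLine, *, font_size_distance: float,
              splitter_between: bool = False, join_rule: Optional[bool] = None) -> bool:
    """Return True if lines a and b may end up in the same pde_text block."""
    if a.initial_element != b.initial_element:   # different initial elements never join
        return False
    if join_rule is not None:                    # text_line_neighbours join: true | false
        return join_rule and a.container == b.container
    if a.no_join or b.no_join:                   # no_join flag excludes the line
        return False
    styles_defined = a.text_style is not None and b.text_style is not None
    if styles_defined and a.text_style != b.text_style:
        return False
    # text_line_join_font_size_distance: largest accepted font size difference
    if abs(a.font_size - b.font_size) > font_size_distance:
        return False
    return not splitter_between                  # visual splitters block merging


first = TextLine("p1", "page", 10.0)
second = TextLine("p1", "page", 10.4)
print(can_merge(first, second, font_size_distance=0.5))
print(can_merge(first, second, font_size_distance=0.2))
```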

Editable Values

  • name
  • tag
  • flag
  • actual_text
  • lang
  • label
  • heading
  • text_flag
  • single_instance

Thresholds

Template Source: pde_text_line → initial element

  • text_line_join_font_size_distance – Maximum font size difference allowed for two lines to be joined.

text_split

Function Purpose and Behavior

The text_split function decides whether a pde_text block should be split into smaller parts, typically individual lines or even words. This segmentation helps improve paragraph detection and logical reading order.

🔹 How to Control Splitting

Splitting can be driven explicitly via template functions and flags, or implicitly via heuristics based on layout and spacing. The rules below are applied in priority order.

✅ To Force Splitting

You can force the text to split under any of the following conditions (higher priority listed first):

  1. Function-Based Rules
    • text_line_neighbours between two lines:
      • Set split: true → lines will be split at this point.
    • text_line_update for a single line:
      • Set split: true → the text will split at this line.
  2. Line Flags
    • The line has text_flag: new_line (or regex flag kTextLineFlagNewLine).
      • It is always split from the previous line.
  3. Heuristic Detection
    • For single-line texts:
      • It may be split into words if:
        • Average spacing between words is below the text_split_distance threshold.
        • Words are similar in length (e.g., character count).
    • For multi-line texts:
      • If all lines are single, non-hyphenated words → strong candidate for splitting.
      • If overall inter-line distance score (get_text_lines_distance) is below text_split_distance.

🚫 To Prevent Splitting

Text blocks will not be split in any of these conditions:

  1. Function-Based Rules
    • text_line_neighbours between two lines:
      • Set join: true → explicitly prevents splitting.
  2. Line Flags
    • The line has text_flag: no_new_line (or regex flag no_new_line).
      • It is never split from the previous line.
  3. Text-Level Flags or Structure
    • The pde_text has the flag no_split.
    • The pde_text’s initial element is of type list → splitting is disabled.
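  4. Initial Elements
    • The lines have an initial element (e.g., pde_text_line → initial element), which blocks splitting.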

🔧 Editable Values (in Template)

Value | Where Used | Purpose
split | text_line_update, text_line_neighbours | Force split at this line or between two lines
join | text_line_neighbours | Prevent split between two adjacent lines
text_flag | Any line (e.g., new_line, no_new_line) | Line-level control via flags
flag | Any text (e.g., no_split) | Marks the text block as unsplittable

⚙️ Thresholds

Threshold Name | Description
text_split_distance | Distance threshold for deciding splits based on line or word spacing
Word space estimate | Based on font (font_name + font_size); used for heuristic new_line tagging

📦 Template Source

  • text_line_update → affects individual lines
  • text_line_neighbours → controls splitting between adjacent lines
  • Distance and spacing thresholds → inherited from the initial element of the pde_text_line
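
The split controls for a pair of adjacent lines can be summarised as a short decision function. This Python sketch is only an illustration. This section does not spell out how "force" and "prevent" rules rank against each other, so the order used here (function rules, then line flags, then the text-level no_split flag, then the distance heuristic) is an assumption, as are the function and parameter names.

```python
from typing import Optional


def split_between_lines(*, rule_split: bool = False, rule_join: bool = False,
                        line_flag: Optional[str] = None, text_no_split: bool = False,
                        line_distance_score: float = 0.0,
                        text_split_distance: float = 0.0) -> bool:
    """Return True if the text should be split between two adjacent lines."""
    # 1. Function-based rules (text_line_neighbours / text_line_update).
    if rule_join:
        return False
    if rule_split:
        return True
    # 2. Line flags.
    if line_flag == "no_new_line":
        return False
    if line_flag == "new_line":
        return True
    # 3. Text-level flag.
    if text_no_split:
        return False
    # 4. Heuristic: split when the distance score is below text_split_distance.
    return line_distance_score < text_split_distance


print(split_between_lines(rule_split=True, text_no_split=True))
print(split_between_lines(line_flag="no_new_line", line_distance_score=0.1, text_split_distance=0.5))
print(split_between_lines(line_distance_score=0.1, text_split_distance=0.5))
```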

text_update

Purpose and Behavior

The text_update function processes all detected pde_text elements within a container and updates their metadata, classification, and structural role within the document.

This function performs the following operations:

  1. Calculates paragraph similarity using the isolated_text_ratio threshold.
  2. Assigns properties such as name, tag, flag, label, heading, lang, and splitter.
  3. If a text is flagged as artifact, header, or footer, excludes it from structural grouping and moves it into the appropriate container (artifact, header, footer).
  4. Detects captions: a text is treated as a caption if it matches regex_table_caption, regex_image_caption, regex_chart_caption, or regex_note_caption.

Editable Values

  • name
  • tag
  • flag
  • actual_text
  • lang
  • label
  • heading
  • text_flag
  • text_style
  • single_instance

Thresholds

Template Source: pde_text → initial element

  • isolated_text_ratio

pagemap_regex flags used for semantic detection

  • regex_table_caption
  • regex_image_caption
  • regex_chart_caption
  • regex_note_caption

🧩 text_split

🧠 Function Purpose and Behavior

The text_split function controls whether a pde_text block should be split into smaller paragraphs or lines. This affects paragraph detection and reading order.

The function uses several rules and thresholds, as well as template-defined overrides. It is triggered during layout processing and modifies how pde_text elements are structured internally.


✅ To Force Splitting

To split lines or words from a pde_text, use any of the following:

  • "split": true inside a text_line_update or text_line_neighbours template function.
  • A low threshold value in "text_split_distance" to encourage separation based on spacing.
  • A text_line_flag containing "new_line" on the relevant line.

🚫 To Prevent Splitting

Avoid splitting by:

  • Adding "flag": "no_split" to the pde_text object.
  • Setting text_line_flag to "no_new_line".
  • Ensuring lines have an initial element (e.g., pde_text_line → initial element), which blocks any splitting.
  • Using "join": true in text_line_neighbours to force keeping lines together.

🔧 Editable Values

You can set the following values to control splitting:

  • "flag" — use "no_split" to prevent any split.
  • "text_line_flag" — use "new_line" or "no_new_line" on pde_text_line.
  • "split" — apply within text_line_update or text_line_neighbours.

⚙️ Thresholds

These control splitting heuristics:

  • "text_split_distance" — distance between words/lines below which splitting is likely.

📦 Template Source

  • From: pde_text_line → initial element
  • Functions:
    • text_line_update
    • text_line_neighbours
  • Value checks:
    • text_line_flag
    • split
    • join