Master PDF Auto-Tagging with Layout Templates
Introduction
PDFix Template Layout is a rule-based system that allows users to define custom layout recognition and tagging logic for PDFs. This is essential for structured documents (e.g., statements, catalogs, invoices) and for achieving compliance with accessibility standards such as PDF/UA.
This technical documentation provides a comprehensive reference to all features, functions, and nodes used in the Template Layout language, complete with advanced examples, implementation details, and behavioral logic derived from internal SDK methods like split_texts.
General Template Logic
The fundamental approach to building a complex template is to divide the document into logical sections using initial elements. Each section can have its own child template (defined in an element_template node), which overrides the main behavior of the layout recognition algorithm for that part of the document.
Initial elements are created using the element_create function. The most important value to define for an initial element is its bounding box (bbox).
Bounding boxes can be defined in two main ways:
- Fixed bbox: Using direct coordinates.
- Start/end bbox: More dynamic, may span multiple anchors.
Bounding box coordinates (left, top, right, bottom) can use:
- Static float values
- General context values ($page_num, $page_width, $doc_num_pages, etc.)
- Parent values ($parent_top, $parent_left, etc.)
- Anchor references ($A1_bottom, $ANCHOR_right, etc.)
- Math functions (e.g. SUM($parent_left, 10))
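To make the coordinate forms above concrete, here is a minimal Python sketch of how such expressions could be resolved to numbers. It is not the PDFix evaluator: the `resolve` helper, the context dictionary, and the page size are assumptions for illustration, and only a flat `SUM(...)` is handled.

```python
import re

def resolve(expr, ctx):
    """Resolve one coordinate: a float literal, a $variable, or a flat SUM(a, b, ...)."""
    expr = expr.strip()
    match = re.fullmatch(r"SUM\((.*)\)", expr)
    if match:
        # Nested function calls are not handled in this sketch.
        return sum(resolve(part, ctx) for part in match.group(1).split(","))
    if expr.startswith("$"):
        # Look up context, parent, or anchor values by name (without the "$").
        return float(ctx[expr[1:]])
    return float(expr)

# Hypothetical values; in a real template they come from the page and its parents.
ctx = {"page_width": 612.0, "page_height": 792.0, "parent_left": 36.0}
coords = [resolve(c, ctx) for c in ["0", "688", "$page_width", "SUM($parent_left, 10)"]]
print(coords)
```

Resolving each of the four coordinates independently is what lets a single bbox mix static values, context values, and math functions.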
Why Use a Template?
- Ensure consistent tagging across similar documents
- Improve the accuracy of semantic tagging
- Save time on manual remediation
- Enhance accessibility and screen reader compatibility
- Boost efficiency by applying one template to thousands of similar PDFs
Pagemap Overview
The pagemap engine is the core class responsible for page layout recognition. It processes each page from bottom to top and groups primitive objects into logical entities like words, lines, paragraphs, rules, tables, lists, etc.
In PDFix Desktop, each processing step can be visually debugged and adjusted. You can use the settings -> debug_pagemap_stop property in the template to stop the layout engine after any step to inspect intermediate results. This is especially useful when building templates – you can verify every step separately and identify where recognition failed.
For example, setting the stop point to word_update halts after words are recognized. You can then verify whether superscripts and subscripts are properly merged into words. This technique is also helpful for debugging complex processes like paragraph assembly or table recognition.
Functions
- element_create
- object_update
- annot_update
- line_update
- rect_update
- object_update_text
- text_run_update
- text_run_neighbours
- word_update
- word_neighbours
- word_connect
- text_line_update
- text_create
- text_split
- text_update
element_create
Purpose and Behavior
The element_create function inserts a new layout element into the document structure. Unlike other functions that modify detected elements, this one defines virtual elements manually, based on explicitly set parameters like bounding box and type. These elements can be used as standalone tags, layout hints, containers for tagging, or structural anchors for downstream operations (e.g., TOC positioning or header/footer marking).
Created elements can be tagged, styled, and grouped just like native PDF elements. This is especially useful in cases where automatic detection fails or manual control over structure is required.
The element is only created if the following are defined:
- A valid type: pde_text, pde_text_line, pde_image, pde_container, pde_list, pde_line, pde_rect, pde_table, pde_cell, pde_toc, pde_header, pde_footer
- A bbox, start_bbox, or end_bbox
- Fixed Bounding Box Elements – The simplest way to define an initial element is with a fixed bbox. This method is best for elements like headers, footers, or static sidebars that appear at known locations.
- Anchor-Based Initial Elements – A more flexible method uses anchors. An initial element’s position and size are defined relative to previously detected anchor elements. This allows dynamic placement based on content structure. The system creates initial elements at the start of page processing and assigns each page object to one of them based on overlap.
If the bbox is zero-sized, the function causes an error during layout detection.
⚠️ Note: By using this method, you override the default behavior of the recognition engine. Instead of detecting elements heuristically, the engine will use your pre-defined type, bbox, and flags.
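The creation requirements above can be checked before a template is run. The following Python sketch is a hypothetical pre-flight check, not part of the PDFix SDK: the `check_element` helper is an assumption, and it can only test bboxes whose four coordinates are plain numbers (expressions such as `$page_width` are skipped).

```python
VALID_TYPES = {
    "pde_text", "pde_text_line", "pde_image", "pde_container", "pde_list",
    "pde_line", "pde_rect", "pde_table", "pde_cell", "pde_toc",
    "pde_header", "pde_footer",
}

def _is_number(value):
    try:
        float(value)
        return True
    except (TypeError, ValueError):
        return False

def check_element(element):
    """Return a list of problems that would keep the element from being created."""
    problems = []
    if element.get("type") not in VALID_TYPES:
        problems.append("missing or unknown type")
    if not any(key in element for key in ("bbox", "start_bbox", "end_bbox")):
        problems.append("no bbox, start_bbox, or end_bbox")
    bbox = element.get("bbox")
    if bbox and len(bbox) == 4 and all(_is_number(v) for v in bbox):
        x0, y0, x1, y1 = (float(v) for v in bbox)
        if x0 == x1 or y0 == y1:
            # A zero-sized bbox is reported as an error during layout detection.
            problems.append("zero-sized bbox")
    return problems

print(check_element({"type": "pde_text", "bbox": ["10", "10", "10", "50"]}))
```

Running a check like this over every entry in `elements` catches zero-sized boxes before the layout engine does.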
Nested Templates (element_template)
An initial element can contain its own element_template node. If present, this child template completely replaces the parent template for that section. This means that:
- Functions and values from the main template do not apply within this section
- Only the child template will be used to process that region
If element_template is not defined, the initial element inherits the parent or global template rules.
Initial element matching modes
- Exact bounding box match – if the no_expand flag is set, the element’s bbox is extended using initial_element_expansion.
- Overlap-based matching – if no_expand is not set, the layout engine checks if the overlap area with a parent exceeds the initial_element_overlap threshold. If so, the parent is assigned.
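The overlap-based mode can be pictured with a small Python sketch. The exact measure the engine uses is not documented here, so this assumes the overlap is the intersection area divided by the object's own area; the helper names and the 0.4 threshold are hypothetical.

```python
def overlap_ratio(obj, element):
    """Share of the object's area covered by the element.

    Boxes are (x0, y0, x1, y1) tuples with x0 < x1 and y0 < y1.
    """
    ox0, oy0, ox1, oy1 = obj
    ex0, ey0, ex1, ey1 = element
    width = min(ox1, ex1) - max(ox0, ex0)
    height = min(oy1, ey1) - max(oy0, ey0)
    if width <= 0 or height <= 0:
        return 0.0  # boxes do not intersect
    obj_area = (ox1 - ox0) * (oy1 - oy0)
    return (width * height) / obj_area if obj_area > 0 else 0.0

def is_assigned(obj, element, initial_element_overlap):
    # The parent is assigned only when the overlap exceeds the threshold.
    return overlap_ratio(obj, element) > initial_element_overlap

print(is_assigned((0, 0, 10, 10), (5, 0, 20, 10), 0.4))
```

Under this assumption, raising initial_element_overlap makes assignment stricter: objects that only partly touch an initial element stay outside it.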
Editable Values
- type – Element type (e.g. pde_text, pde_line, pde_table)
- bbox – Bounding box for element position and size
- start_bbox – Defines beginning boundary for dynamic elements
- end_bbox – Defines end boundary for dynamic elements
- name – Unique name for identification and references
- id – ID used in tagging or alt-text generation
- flag – Flags like artifact, header, footer for logical role
- text_flag – Flags for text elements (e.g. no_newline, first_cap, etc.)
- tag – Semantic tag (P, H1, L, etc.) for accessibility
- heading – Text style role: normal, h1, h2, h3
- label – Label level (label, li_1, label_no, etc.)
- lang – Language of element content (e.g. en-US)
- actual_text – Replacement text for screen readers
- alt – Alternate text used for descriptions (especially images)
- single_instance – Constraints for creating only one instance per page (e.g. font_size)
- sort_direction – Reading order in containers: 0 = default, 1 = columns, 2 = rows
- splitter – If element acts as a layout splitter (e.g. pde_cell)
- element_template – Embedded template definition for nested configuration
Table or Cell specific fields
- col_num – Number of columns (for tables)
- row_num – Number of rows (for tables)
- cell_row – Row position in table (for cell)
- cell_column – Column position in table (for cell)
- cell_row_span – Number of rows spanned by the cell
- cell_column_span – Number of columns spanned by the cell
- cell_scope – Scope of cell: row, column
- cell_header – Whether the cell is a header
- cell_associated_header – Comma-separated list of headers
Thresholds
Template Source: initial element
initial_element_overlap
initial_element_expansion
Example
This JSON snippet defines virtual elements for tagging content on page 1 of a PDF document using the element_create function. It demonstrates two primary use cases:
- Creating a dynamic header region with a nested template
- Manually tagging the text inside a specific bounding box, so that each line of the NovaBank address is wrapped in a single P tag
"element_create": [
{
"comment": "Tag elements on page n.1",
"elements": [
{
"bbox": [
"0",
"688",
"$page_width",
"$page_height"
],
"comment": "Tag header",
"element_template": {
"template": {
"pagemap": [
{
"rd_sort": "2",
"rd_sort_direction": "2",
"statement": "$if"
}
]
}
},
"type": "pde_header"
},
{
"bbox": [
"88.05127716064453",
"527.709228515625",
"325.1794738769531",
"677.7356567382812"
],
"comment": "Tag NovaBank address as single line text",
"text_flag": "new_line",
"type": "pde_text"
}
]
}
]
object_update
🧩 Function Purpose and Behavior
The object_update function processes low-level graphical objects (pds_object) from the current PDF page and prepares them for downstream layout recognition and tagging. It is primarily responsible for classifying visual elements such as lines, rectangles, background images, and shadings into semantic structures.
This is the stage where pds_path objects (such as vector paths) are converted into high-level PDFix elements like pde_line and pde_rect.
Supported object types include:
pds_object
pds_path
pds_image
pds_shading
pds_form
Each recognized object is assigned an initial_element based on its bounding box and type. These objects are later processed by structural functions like line_update, rect_update, or element_create.
✅ To Force Specific Behavior
Desired Behavior | What to Set |
---|---|
Force object to be ignored from layout analysis | Set "flag": "artifact" |
Force object into header/footer grouping | Set "flag": "header" or "footer" |
Match specific shapes or bounding boxes | Use custom "$if" query (see Example) |
Force path into a line or rect | Adjust similarity thresholds (see below) so they are recognized as pde_line or pde_rect |
🚫 To Prevent Specific Behavior
Prevent This | What to Do |
---|---|
Prevent accidental grouping of decorative paths as meaningful layout elements | Increase artifact_similarity or lower path_object_max |
Prevent backgrounds from being misclassified | Filter pds_shading with "flag": "artifact" early in object_update |
Prevent object from appearing as line or rect | Ensure thresholds like element_line_similarity are not met or use "flag": "artifact" |
🔧 Editable Values
These can be set per object or query in your template:
Value | Description |
---|---|
flag | Defines layout role. Accepts values like: "artifact" (object will be excluded), "header" or "footer" (moved to special containers) |
⚙️ Thresholds
Defined in the initial_element or page-level config, these control object classification:
Threshold | Description |
---|---|
artifact_similarity | Controls grouping tolerance for objects with similar position/size when marked as artifacts |
element_line_similarity | Similarity threshold to detect if a pds_path represents a pde_line |
angle_deviation | Max angle difference (in degrees) to cluster elements as horizontal/vertical lines |
isolated_element_ratio | Ratio to identify if an element is isolated and not part of a structure |
path_object_max | Upper size limit for path to be recognized as one object |
path_object_min | Lower size threshold for path to be recognized as one object |
📦 Template Source
- Values are evaluated from the initial_element of each pds_object.
💡 Best Practice Recommendations
- If you already know certain graphics (e.g., background boxes, top/bottom bars, separator lines) should be excluded from tagging or layout detection, tag them early in object_update with "flag": "artifact".
- Excluding these early prevents misclassification in later stages like line_update, rect_update, or table detection.
- You can write precise queries using "$page_num", "$0_top", "$0_bottom", and other bounding box keys to match only specific graphical objects.
🧪 Example
Mark all graphic objects on the first page whose bottom edge lies above 750 as artifacts.
{
"object_update": [
{
"flag": "artifact",
"query": {
"$0_bottom": { "$gt": "750" },
"$page_num": "1"
},
"param": ["pds_object"],
"statement": "$if"
}
]
}
annot_update
🧩 Purpose and Behavior
The annot_update function processes all pdf_annot objects on the current PDF page and decides how they should be represented in the page layout. It ensures annotations are either:
- Converted to structured layout elements (e.g., pde_form_field, pde_annot)
- Skipped from further processing if irrelevant
- Tagged correctly for accessibility or export
This function operates as the bridge between low-level PDF annotations and high-level semantic tagging.
Key operations include:
- ✅ Widget annotations (like form checkboxes or text inputs) are transformed into pde_form_field elements.
- ✅ Non-widget annotations (such as comments, highlights, or drawing shapes) become pde_annot layout elements.
- 📦 All annotations are inserted into the page map structure and attached to their closest initial container using bounding box proximity.
- 🚫 Excluded annotations (those with internal kStateExclude flags) are skipped entirely.
line_update
🧩 Function Purpose and Behavior
The line_update function evaluates graphical lines (pde_line) extracted from pds_path objects and assigns them semantic meaning or layout roles. These lines can later influence element segmentation, table detection, or be marked as artifacts.
Each line is matched to an initial element based on its bounding box and geometric similarity. This function also filters lines that are too short, too slanted, or redundant.
🚫 To Prevent Behavior
Condition | Effect |
---|---|
Lines from different form_obj or XObject | Cannot be merged |
Lines with different initial elements | Merge skipped unless a parent-child relationship exists |
Line has no_join flag | Prevents merging with any other line |
✅ To Force Behavior
What You Want to Do | How to Do It |
---|---|
Force it into a container | Use parent to explicitly attach it to a named element |
Ensure it is merged into another line | Make one of the lines an initial element |
Extend lines automatically | Lines are only merged if they pass internal checks for graphic style and geometric alignment. They must also originate from the same XObject. If merging fails due to this constraint, consider using the Flatten Form XObject feature to normalize content structure. |
🔧 Editable Values
Key | Description |
---|---|
flag | Semantic classification, e.g. "artifact", "header", "footer". Set "artifact" to move it to the artifact layer and exclude from layout grouping |
label | Label recognition flag (label, li_1, label_no, etc.) |
tag | Tag type (e.g., Figure, Span) used for accessibility |
name | Internal or debug-friendly name for the line |
parent | Reference to the name of another element — links this line to its container |
⚙️ Thresholds
Threshold | Description |
---|---|
angle_deviation | Maximum allowed deviation in degrees to consider the angle as same line |
| Defines how closely two lines must intersect or align to be eligible for merging in the line extend test. |
📦 Template Source
pde_line → initial element
💡 Best Practice Recommendations
- Lines marked as artifact here will be excluded from tagging. This is ideal for footers, underlines, or visual dividers.
📝 Example
Mark all detected lines as artifacts.
"line_update": [
{
"@statement": "$if",
"@query": {
"@param<type>": "query_param",
"param": [
["pde_line"]
]
},
"@flag": "artifact"
}
]
rect_update
Purpose and Behavior
The rect_update function processes detected rectangle elements (pde_rect) and attempts to merge them into unified rectangular blocks when certain geometric and structural criteria are met. This is particularly useful for simplifying visual background elements like section boxes, shaded containers, and graphic panels.
✅ To Force Behavior
What You Want to Do | How to Do It |
---|---|
Force it into a container | Use parent to explicitly attach it to a named element |
Ensure it is merged into another rectangle | Make both have matching initial elements or initial parent |
Extend rectangles automatically | Rectangles are only merged if they pass internal checks for graphic style, geometric alignment, and structural similarity. They must also originate from the same XObject. If merging fails due to this constraint, consider using the Flatten Form XObject feature to normalize content structure. |
🚫 To Prevent Behavior
Condition | Effect |
---|---|
Rectangles from different form_obj or XObject | Cannot be merged |
Rectangles with different initial elements | Merge skipped unless a parent-child relationship exists |
Rectangle has no_join flag | Prevents merging with any other rectangle |
🔧 Editable Values
Name | Description |
---|---|
flag | Layout role (artifact, header, footer, etc.) |
label | Label recognition flag (label, li_1, label_no, etc.) |
tag | Semantic tag (Art, Figure, etc.) |
name | User-defined identifier for referencing or debugging |
parent | Assign rect into a known template-defined container |
⚙️ Thresholds
Threshold | Description |
---|---|
| Defines how closely two lines must intersect or align to be eligible for merging in the rectangle extend test. |
📦 Template Source
pde_rect → initial element
📝 Example
Mark gray rectangles whose width and height are both at most 10 as list labels (li_1).
"rect_update": [
{
"label": "li_1",
"query": {
"$and": [
{
"$0_fill_color": [
"100",
"100",
"100"
]
},
{
"$0_height": {
"$lte": "10"
}
},
{
"$0_width": {
"$lte": "10"
}
}
],
"param": [
"pde_rect"
]
},
"statement": "$if"
}
]
object_update_text
Purpose and Behavior
The object_update_text function parses text-based PDS objects from the current page. It is invoked with parameters:
pds_object
pds_text
The function extracts text runs from the page content, segments them into words, and assigns each word to an appropriate initial element based on its bounding box.
This is one of the earliest and most critical steps in the layout pipeline. If this segmentation fails, all downstream logic (headings, tagging, tables) will be incorrect.
Text runs are later grouped into words based on spatial properties – including inter-character spacing and baseline alignment.
Editable Values
- flag – controls behavior (artifact, header, footer)
Thresholds
Template Source: page level using values in the main template block
- word_space_width_ratio – Ratio multiplier used to estimate the maximum allowed space between characters within a word. If spacing between characters exceeds this value × minimal char spacing, a new word begins.
- word_space_width_min_ratio – Optional lower bound used to constrain the influence of very small font sizes. Applied as: allowed_space = font_size × word_space_width_min_ratio
These values control when and where characters get split into new words – especially important in justified or stylized text.
⚠️ Note: These are defined in the root-level (global) template, not per-element.
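The two thresholds can be illustrated with a short Python sketch. It is a simplified model rather than the engine's segmentation code: the `split_into_words` helper is hypothetical, and combining the two limits with `max` is an assumption about how the lower bound is applied.

```python
def split_into_words(chars, gaps, min_char_spacing, font_size,
                     word_space_width_ratio, word_space_width_min_ratio=0.0):
    """Split characters into words using the gap after each character.

    gaps[i] is the horizontal space between chars[i] and chars[i + 1].
    """
    if not chars:
        return []
    # Largest gap still treated as intra-word spacing.
    allowed_space = word_space_width_ratio * min_char_spacing
    # Assumption: the font-size based value acts as a floor on that limit.
    allowed_space = max(allowed_space, font_size * word_space_width_min_ratio)

    words, current = [], chars[0]
    for char, gap in zip(chars[1:], gaps):
        if gap > allowed_space:
            words.append(current)  # gap too wide: a new word begins
            current = char
        else:
            current += char
    words.append(current)
    return words

gaps = [0.5, 0.5, 3.0, 0.5, 0.5]
print(split_into_words("PDFTAG", gaps, 0.5, 10, 2.0))
print(split_into_words("PDFTAG", gaps, 0.5, 10, 2.0, 0.4))
```

In the second call the floor (10 × 0.4 = 4.0) is wider than the 3.0 gap, so no split occurs. This is the kind of effect to expect when tuning these values for justified or stylized text.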
Example
On every page except the first, mark all text objects whose bottom edge is at or above 740 as artifacts.
"object_update": [
{
"comment": "Artifact texts except first page",
"flag": "artifact",
"query": {
"$and": [
{
"$page_num": {
"$gt": "1"
}
},
{
"$0_bottom": {
"$gte": "740"
}
}
],
"param": [
"pds_text"
]
},
"statement": "$if"
}
]
text_run_update
Purpose and Behavior
The text_run_update function modifies properties of individual pde_text_run elements after they are parsed from the page.
Its primary purpose is to assign text state flags based on visual or semantic context – such as subscript/superscript styling.
The text_run_update rule is evaluated per text run, immediately after it is extracted from the PDF stream.
Editable Values
- text_state_flag – Can include values like:
  - subscript
  - superscript
These flags do not split the text run from its associated word. The run remains part of the word but is marked for later structural tagging.
This preserves grouped representations like: H₂O instead of: H 2 O
Thresholds
Template Source: page level using values in the main template block
text_line_baseline_ratio
angle_deviation
Warning: These are defined in the root-level (global) template, not per-element.
Example
Mark all ArialMT text runs with a font size of 4 as superscript.
"text_run_update": [
{
"query": {
"$and": [
{
"$0_font_name": {
"$regex": "ArialMT"
}
},
{
"$0_font_size": "4"
}
],
"param": [
"pde_text_run"
]
},
"statement": "$if",
"text_state_flag": "superscript"
}
]
text_run_neighbours
Purpose and Behavior
The text_run_neighbours function determines whether two consecutive pde_text_run elements should be merged into a single word or split.
This function overrides automatic word detection, offering precision control in cases where standard heuristics fail (e.g., tight kerning, styled fragments).
Merge Decision Logic
The function compares two runs:
- If join = true → the runs are always joined into the same word.
- If join = false → a forced split is applied between the runs, even if they align.
- If no explicit rule is matched, the fallback logic checks:
- Angle consistency (same_angle)
- Baseline alignment (same_baseline)
- Spacing break (e.g., visual separator or vertical border)
This lets you force join or split conditions for complex or edge-case text layouts.
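The decision order can be summarized in a few lines of Python. This is a sketch of the logic as described, not the SDK implementation: the `should_join` helper is hypothetical, and requiring all three fallback checks to pass together is an assumption.

```python
def should_join(join, same_angle, same_baseline, spacing_break):
    """Decide whether two consecutive text runs end up in the same word.

    join is True or False when a template rule matched, or None when no rule applies.
    """
    if join is not None:
        # An explicit rule wins over every geometric check.
        return join
    # Fallback: the runs must agree in angle and baseline, with no separator between them.
    return same_angle and same_baseline and not spacing_break

print(should_join(None, same_angle=True, same_baseline=True, spacing_break=False))
print(should_join(False, same_angle=True, same_baseline=True, spacing_break=False))
```

Modelling `join` as a tri-state value keeps the "no rule matched" case distinct from an explicit `false`, which is what allows a rule to force a split between runs that would otherwise align.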
Editable Values
- join – Boolean flag to force merge or split
Thresholds
Template Source: page level using values in the main template block.
- text_line_baseline_ratio – internal margin for baseline deviation
- angle_deviation
⚠️ Note: These are defined in the root-level (global) template, not per-element.
word_update
Purpose and Behavior
The word_update function processes individual pde_word elements after word segmentation is complete. It serves three key purposes:
- Semantic Flag Assignment: Applies structural or semantic flags to words based on regex rules. These flags influence downstream layout recognition (e.g., list detection, heading classification, TOC generation).
- Filling and Label Detection:
  - Uses regex_filling to detect filler-only words (like “…”, “–”).
  - Splits or skips them depending on match length.
  - Detects structured labels (e.g., numbered bullets or Roman/letter markers).
- Property Update and Artifact Extraction: If a word is marked as artifact, header, or footer, it’s converted into a separate pde_text element, detached from the text stream, and placed into the corresponding container. This prevents accidental tagging or reading by screen readers.
The word_update function identifies possible list labels and table of contents (TOC) entries among pde_word elements. It marks these elements with corresponding label or toc values and attempts to pair them with a sibling (typically a neighboring word) that represents the list item content or TOC title.
Editable Values
- name – unique name
- tag – semantic tag (e.g., Span, Div, Note)
- flag – behavior modifier (artifact, header, footer, etc.)
- label – used for logical labeling
- heading – heading level
- actual_text – alternate value for screen readers
- lang – language code
- word_flag – manual override for system-assigned flags
- single_instance – suppresses duplicates across layout
- word_space – sets an exact space width for this word’s font-size/font. If word_space is set manually here, it overrides all computed word spacing logic and disables the word_space_ratio.
Thresholds
Template Source: pde_word → initial element
- word_space_ratio – multiplier for auto-calculated spacing (used unless overridden)
- word_space_update_max – prevents auto-updates beyond a limit (0 = no updates)
pagemap_regex flags used for semantic detection
- hyphen – regex_hyphen
- bullet – regex_bullet, regex_bullet_font
- colon – regex_colon
- number – number_chars
- terminal – regex_terminal
- capital – regex_first_cap
- decimal_num – regex_decimal_numbering
- roman_num – regex_roman_numbering
- letter_num – regex_letter_numbering
- page_num – regex_page_number
- filling – regex_filling
- comma – regex_comma
- label – regex_label, label_chars, regex_letter
Example
Mark all dot-leader words as artifacts.
{
"word_update": [
{
"flag": "artifact",
"query": {
"$0_text": { "$regex": "^\\.+$" }
},
"param": ["pde_word"],
"statement": "$if"
}
]
}
Change Tag Type.
"word_update": [
{
"query": {
"$and": [
{
"$0_font_size": "4.5"
}
],
"param": [
"pde_word"
]
},
"statement": "$if",
"tag": "Span"
}
]
Modify Actual Text based on character properties (text).
"word_update": [
{
"actual_text": "No",
"query": {
"$and": [
{
"$0_text": {
"$regex": "□"
}
}
],
"param": [
"pde_word"
]
},
"statement": "$if",
"tag": "Span"
},
{
"actual_text": "Yes",
"query": {
"$and": [
{
"$0_text": {
"$regex": "■"
}
}
],
"param": [
"pde_word"
]
},
"statement": "$if",
"tag": "Span"
}
]
Word Spacing Precision Logic
Accurate word spacing is critical for reliable recognition of text lines, paragraph types (e.g. justified vs. simple), and ultimately, for layout structure such as heading alignment or table row segmentation.
Automatic Detection of Word Space
During word recognition, a base word space is automatically estimated for each unique:
- Font name
- Font size
This is computed by analyzing inter-character gaps across text runs with matching font properties. It defines what spacing is considered “normal” within a word vs. between words.
This automatically estimated word space width is then used to:
- Detect word boundaries
- Classify a line as:
- Simple: uniform space width
- Justified: variable space widths
Ways to Adjust Word Space
1. Exact Override per Word (word_update)
Template Source: pde_word → initial element
"word_update": [
{
"query": {
"$and": [
{
"$0_font_size": "4.5"
}
],
"param": [
"pde_word"
]
},
"statement": "$if",
"word_space": "4.2"
}
]
Sets an exact word spacing value for this font-size & font-name combination.
- Overrides all other logic
- Disables word_space_ratio
Use this when auto-estimation fails for a specific word or stylized font.
2. Proportional Scaling (word_space_ratio)
Template Source: pde_container → initial element
"pagemap": [
{
"word_space_ratio": "1.15"
}
]
Multiplies the estimated space width by a scaling factor. Useful for small global corrections.
- Only used if no word_space is defined
- Applies to all words in the container
3. Post-Line Update Adjustment (text_line_update)
Template Source: pde_line → initial element (for word_space)
"text_line_update": [
{
"word_space": "4.2",
"query": {},
"param": [
"pde_text_line"
],
"statement": "$if"
}
]
Once words are grouped into lines, text_line_update can fine-tune or lock the final spacing:
- If word_space is defined in text_line_update, the spacing for all words in the line is set to that value.
- If word_space_update_max is defined:
  - It limits re-estimation from line analysis
  - Set to 0 to prevent any automatic spacing changes
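The three adjustment methods can be read as a precedence chain. The Python sketch below is a simplified model under stated assumptions: the `effective_word_space` helper and its parameter names are hypothetical, and it assumes a line-level word_space replaces whatever was set per word.

```python
def effective_word_space(estimated, word_space_ratio=1.0,
                         word_space=None, line_word_space=None):
    """Pick the word space the engine would work with under these assumptions.

    estimated       -- auto-estimated space for the font name and size
    word_space      -- exact override from word_update, if any
    line_word_space -- exact override from text_line_update, if any
    """
    if line_word_space is not None:
        # Set after line assembly, applies to all words in the line.
        return line_word_space
    if word_space is not None:
        # An exact override disables word_space_ratio.
        return word_space
    # No override: scale the estimate proportionally.
    return estimated * word_space_ratio

print(effective_word_space(4.0, word_space_ratio=1.15, word_space=4.2))
```

Keeping the ratio as the last fallback mirrors the rule that word_space_ratio is only used if no word_space is defined.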
Best Practice Recommendations
- Use word_space only when auto-estimation fails consistently for specific fonts.
- Apply word_space_ratio globally in containers (e.g., invoices or tables).
- Avoid multiple re-definitions across word_update and text_line_update unless needed for layout corrections.
- Use word_space_update_max:0 to lock final spacing post-assembly.
Label Detection
Purpose and Behavior
The word_update function detects and classifies list label words (li_1 to li_4) using regex rules and spatial relationships. When a word is marked as a potential label (e.g., 1., A), (iii)), the engine attempts to pair it with a sibling element – usually the associated paragraph or line content.
- If a word is explicitly marked with a label property in a template rule (e.g., label: li_1), it is treated as a list item at the corresponding nesting level.
- If no label is set manually, the system attempts automatic detection using regex and heuristics:
  - Valid label formats: numeric (1.), alphabetic (A), Roman numerals (IV), bullets (•), or combinations with brackets or dots.
  - The word is analyzed for regex_label, regex_letter, regex_roman_numbering, and others.
- Once a word is flagged as a possible label, it is matched to a sibling word based on reading order (RTL or LTR).
Thresholds
Template Source: pde_word → initial element
- label_word_detect – Enables automatic label detection. Set to 0 in templates where labeling is irrelevant to prevent unwanted grouping.
- label_distance_ratio – Used to calculate label-sibling horizontal distance. dist = font_size × label_distance_ratio. It affects how far the algorithm will look horizontally for a matching sibling.
- label_word_w1 – Weight for vertical alignment similarity between label candidates during clustering.
- label_word_w2 – Weight for label-sibling offset alignment.
- label_word_dist_sibling_ratio – Reject labels if sibling is too far. Max distance = font_size × ratio.
- label_sibling_distance_ratio – Reject false labels if sibling is too close to another word in the line.
- label_word_distance – Maximum absolute clustering distance between label candidates.
- label_word_distance_ratio – Used if label_word_distance == 0, scales by page width.
- concurrent_threads – Multithreaded clustering of label candidates.
pagemap_regex flags used for semantic detection
- label_chars: Characters like “(”, “)”, “.” used to strip and normalize labels.
- regex_label: Pattern for detecting list markers.
Editable Values
label: label, label_no, li_1 … li_4
Best Practice Recommendations
Set label_word_detect = 0 explicitly in templates where label detection is irrelevant (e.g. tables without list-like elements).
Example
"word_update": [
{
"label": "li_1",
"query": {
"$and": [
{
"$0_text": {
"$regex": "^\\([a-z]\\)$"
}
}
],
"param": [
"pde_word"
]
},
"statement": "$if"
},
{
"label": "li_2",
"query": {
"$and": [
{
"$0_text": {
"$regex": "^\\(\\d\\)$"
}
}
],
"param": [
"pde_word"
]
},
"statement": "$if"
},
{
"label": "li_3",
"query": {
"$and": [
{
"$0_text": {
"$regex": "^\\((i|ii|iii|iv|v|vi|vii|viii|ix|x|xi|xii|xiii|xiv|xv|xvi|xvii|xviii|xix|xx)\\)$"
}
}
],
"param": [
"pde_word"
]
},
"statement": "$if"
}
]
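Before relying on label patterns like these, it can help to test them against sample strings. The Python sketch below reuses the three regular expressions from the example. It uses Python's `re` module, whose behavior may differ in detail from the template engine's regex flavor, and the `matching_labels` helper is hypothetical.

```python
import re

ROMAN = "i|ii|iii|iv|v|vi|vii|viii|ix|x|xi|xii|xiii|xiv|xv|xvi|xvii|xviii|xix|xx"

# (label, pattern) pairs copied from the template example above.
RULES = [
    ("li_1", r"^\([a-z]\)$"),
    ("li_2", r"^\(\d\)$"),
    ("li_3", r"^\((" + ROMAN + r")\)$"),
]

def matching_labels(text):
    """Return every label whose pattern matches the word text."""
    return [label for label, pattern in RULES if re.search(pattern, text)]

for sample in ["(a)", "(3)", "(iv)", "(i)", "(12)"]:
    print(sample, matching_labels(sample))
```

Single-letter Roman numerals such as "(i)", "(v)", and "(x)" satisfy both the li_1 and the li_3 pattern, so it is worth checking which level such labels receive in the actual output.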
TOC Detection
Purpose and Behavior
The word_update function also supports detection of Table of Contents (TOC) elements by identifying possible page number words (e.g., 15, 248, xii) and matching them with title words on the same line.
- Words are flagged as TOC numbers using regex_page_number.
- A sibling is searched to the left or right (depending on document direction) that likely represents the section heading.
- TOC entries are grouped and clustered spatially, then marked as a TOC item.
Thresholds
Template Source: pde_word → initial element
- toc_detect – Enables TOC detection. Set to 0 in non-TOC templates for performance and precision.
- toc_word_distance – Absolute clustering cutoff for grouping similar TOC numbers. Defines the tightness of TOC number clustering. This ensures only aligned page number entries are grouped.
- toc_word_distance_ratio – Used if the absolute distance is unset. Multiplied by page_font_width.
- concurrent_threads – Parallel processing during clustering of TOC words.
pagemap_regex flags used for semantic detection
- regex_page_number: Used to detect page numbers.
Editable Values
toc: label, label_no
Best Practice Recommendations
Setting toc_detect = 0 is essential for improving performance and avoiding misclassifications in non-TOC sections.
word_neighbours
Purpose and Behavior
The word_neighbours function controls how recognized words (pde_word) are grouped into text lines (pde_text_line). It plays a critical role in reconstructing logical reading order and grouping sequences of text.
Two words will be joined into the same line if:
- Their initial elements are compatible.
- Neither has the no_join flag.
- They have the same text style.
- Their writing angles match.
- They share the same baseline (within text_line_baseline_ratio × font).
- No splitting object (line, rect, other words) is between them.
- The gap between them is ≤ word_space_distance_max or word_space_distance_max_ratio.
You can override these constraints explicitly using the word_neighbours rule with join: true | false.
If a word has an initial element that is already a pde_text_line, it is automatically inserted into that line. Otherwise, a new line is created, and neighboring words are added if they pass all join conditions.
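The baseline and gap conditions can be sketched in Python as follows. This models only two of the listed checks (style, angle, flags, and splitting objects are left out). The word dictionaries and helper names are hypothetical, and using the larger font size for the baseline tolerance is an assumption.

```python
def max_gap(font_a, font_b, word_space_distance_max, word_space_distance_max_ratio):
    """Absolute limit if set, otherwise ratio x the larger font size."""
    if word_space_distance_max > 0:
        return word_space_distance_max
    return word_space_distance_max_ratio * max(font_a, font_b)

def can_join(a, b, text_line_baseline_ratio,
             word_space_distance_max, word_space_distance_max_ratio, join=None):
    """Check whether word b may follow word a on the same line (left to right)."""
    if join is not None:
        return join  # explicit word_neighbours rule
    font = max(a["font_size"], b["font_size"])
    if abs(a["baseline"] - b["baseline"]) > text_line_baseline_ratio * font:
        return False  # baselines too far apart
    gap = b["left"] - a["right"]
    return gap <= max_gap(a["font_size"], b["font_size"],
                          word_space_distance_max, word_space_distance_max_ratio)

a = {"left": 0, "right": 20, "baseline": 100.0, "font_size": 10}
b = {"left": 23, "right": 40, "baseline": 100.5, "font_size": 10}
print(can_join(a, b, 0.1, 0, 0.5))
print(can_join(a, b, 0.1, 2, 0.5))
```

With the absolute limit set to 0, the ratio-based limit of 5 allows the 3-unit gap. With an absolute limit of 2, the same pair is rejected.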
Editable Values
- Defined in word_neighbours:
  - join:
    - If true, forcibly joins two words into a line.
    - If false, forcibly prevents joining.
    - If not defined, fallback logic uses baseline, spacing, and flags.
- Defined in word_update for individual word flags:
  - no_join: If a word has this flag set, it will never be joined into a line, even if all other conditions match.
Thresholds
Template Source: pde_text_line → initial element
- text_line_baseline_ratio – Maximum vertical offset allowed between baselines, multiplied by font size.
- word_space_distance_max – Maximum absolute horizontal gap between words (in user units).
- word_space_distance_max_ratio – If word_space_distance_max is zero, this ratio × max font size is used.
  - These two thresholds are ignored for RTL label words unless label_word_detect = 0.
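The relationship between the two spacing thresholds can be written as a small helper. This is only a sketch of the rule stated above; the function name effective_word_gap is hypothetical, and the RTL label exception is not modelled.

```python
def effective_word_gap(word_space_distance_max: float,
                       word_space_distance_max_ratio: float,
                       max_font_size: float) -> float:
    """Return the largest gap (in user units) at which two words may still join."""
    # A non-zero absolute threshold is used directly.
    if word_space_distance_max != 0:
        return word_space_distance_max
    # Otherwise the ratio is scaled by the larger font size of the two words.
    return word_space_distance_max_ratio * max_font_size
```

With word_space_distance_max set to 0, a ratio of 0.5, and a 12 pt font, the allowed gap is 6 user units.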
⚠️ Note: The word_neighbours method is called only for specific pairs of words, not for all pairs.
⚠️ Note: Merging is only attempted if the spacing threshold is not exceeded and no graphical splitters or layout anomalies are in between.
Best Practice Recommendations
- To ensure two lines are treated as part of the same paragraph, create an initial pde_text element to unify them.
Example
"word_neighbours": [
  {
    "join": "true",
    "query": {
      "$and": [
        {
          "$0_font_name": {
            "$regex": "Arial-BoldMT"
          }
        },
        {
          "$1_font_name": {
            "$regex": "Arial-BoldMT"
          }
        }
      ],
      "param": [
        "pde_word",
        "pde_word"
      ]
    },
    "statement": "$if"
  }
]
word_connect
Purpose and Behavior
The word_connect function merges consecutive text lines that logically belong together. It is typically used to reconstruct full paragraphs or wrapped lines that were split due to PDF layout quirks.
The function evaluates pairs of pde_text_line elements based on alignment, spacing, and their surrounding context. If certain conditions are met, two lines are merged into one.
When lines are merged, all their words are combined into a single pde_text_line, and the redundant line is removed from the container.
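The merge step itself can be pictured with plain lists. This sketch assumes a container is a list of lines and each line is a list of word strings; merge_lines is a hypothetical helper, and the alignment and spacing checks that decide whether to merge are not shown.

```python
def merge_lines(container: list, first: int, second: int) -> None:
    """Move all words of container[second] into container[first], then drop the emptied line."""
    if first == second:
        raise ValueError("cannot merge a line with itself")
    # Combine the words of both lines into the surviving line.
    container[first].extend(container[second])
    # Remove the redundant line from the container.
    del container[second]
```

Merging lines 0 and 1 of [["Total", "amount"], ["due:"], ["Other"]] leaves [["Total", "amount", "due:"], ["Other"]].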
Editable Values
Defined in word_neighbours:
- join:
  - If true, forcibly joins two words into a line.
  - If false, forcibly prevents joining.
  - If not defined, fallback logic uses baseline, spacing, and flags.
Thresholds
Template Source: pde_text_line → initial element
Word-level decisions from: pde_word → initial element
- text_line_baseline_ratio: Used in word_matches_to_line() to determine baseline tolerance.
- word_space_distance_max and word_space_distance_max_ratio: Control maximum allowed space between words. Used in both word_neighbours and word_connect.
text_line_update
Purpose and Behavior
The text_line_update function processes detected text lines (pde_text_line) and updates their structural roles, properties, and downstream tagging behavior. It is a critical step before paragraph recognition, as it ensures each line is correctly interpreted and labeled.
This function performs the following key operations:
- Assigns properties such as name, tag, flag, label, heading, lang, and splitter.
- Recognizes and processes label-based lines (e.g. list items) previously marked during word_update.
- If a text line is flagged as artifact, header, or footer, it is excluded from structural grouping and moved into the appropriate container (artifact, header, footer).
- Detects and handles filling characters (e.g., repeated dots) using the text_line_split_filling function. This can split out decorative patterns unless split=false or the line has a no_split flag.
- Analyzes underlines by evaluating proximity of pde_line graphics to the baseline of the text line.
- Constructs chunks (homogenous blocks) based on consistent word spacing within the line. These chunks are later used to decide line breaking and column detection. Lines marked no_split or with text initial elements are not broken during this stage.
- Lines with pde_text as their initial element are never split, ensuring strict preservation of manually defined blocks.
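To illustrate the filling-character step, the sketch below splits a run of dots out of a line of text. It is not the text_line_split_filling implementation: the FILLER pattern (four or more dots) and the split_filling helper are assumptions chosen for the example, and real lines are word objects rather than strings.

```python
import re

# Assumed filler pattern: four or more dots, each optionally followed by one space.
FILLER = re.compile(r"(?:\.\s?){4,}")


def split_filling(text: str, split: bool = True, no_split: bool = False) -> list:
    """Separate decorative dot leaders from the surrounding text."""
    # split=false or a no_split flag leaves the line untouched.
    if not split or no_split:
        return [text]
    parts = []
    pos = 0
    for match in FILLER.finditer(text):
        if match.start() > pos:
            parts.append(text[pos:match.start()].strip())
        parts.append(match.group().strip())
        pos = match.end()
    if pos < len(text):
        parts.append(text[pos:].strip())
    # Drop empty fragments left over from stripping.
    return [p for p in parts if p]
```

A table-of-contents style line such as "Introduction ........ 12" is separated into the title, the dot leader, and the page number.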
Editable Values
name
tag
flag
actual_text
lang
label
heading
text_line_flag
splitter
single_instance
word_space
Thresholds
Template Source: pde_text_line → initial element
- text_line_underline_distance – maximum allowed vertical distance to consider a line as underlined.
- text_line_underline_char_distance_ratio – adjusts underline detection based on font size.
- text_line_chunk_distance_max – maximum allowed distance between words to consider them in the same chunk.
- text_line_chunk_distance_max_ratio – scaled by font size for dynamic control of chunk merging.
- text_line_chunk_distance – absolute chunk separation threshold.
- text_line_chunk_distance_ratio – relative distance for detecting chunk boundaries.
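Chunk construction can be approximated as grouping words whose horizontal gap stays within a maximum distance. The sketch below uses a single absolute threshold in the spirit of text_line_chunk_distance_max; the build_chunks helper and the (left, right, text) tuple layout are hypothetical, and the ratio-based thresholds are not modelled.

```python
def build_chunks(words, chunk_distance_max: float):
    """Group words, given as (left, right, text) sorted left to right, into chunks."""
    chunks = []
    for left, right, text in words:
        # A word joins the current chunk when the gap to the previous word is small enough.
        if chunks and left - chunks[-1]["right"] <= chunk_distance_max:
            chunks[-1]["texts"].append(text)
            chunks[-1]["right"] = right
        else:
            # Otherwise the word starts a new chunk.
            chunks.append({"right": right, "texts": [text]})
    return [chunk["texts"] for chunk in chunks]
```

With a threshold of 10 user units, the words "Item" and "name" that sit 3 units apart form one chunk, while "Price" 80 units further right starts a second chunk, which is the kind of gap that later hints at a column boundary.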
Best Practice Recommendations
- The word_neighbours method is called only for specific pairs of words, not all pairs.
text_create
Purpose and Behavior
The text_create function assembles detected text lines into paragraph-level text blocks (pde_text).
- If a pde_text_line has an initial element of type pde_text, it is automatically added to that pde_text block.
- Lines that do not belong to the same initial element are never joined.
- The function respects the join parameter in the text_line_neighbours function:
  - If join: true, the two lines are always merged (as long as they are in the same container).
  - If join: false, the lines will never be merged.
- If the text_line has the no_join flag, it is excluded from paragraph assembly.
- Lines are merged only if their text style values match (if styles are defined).
- Font size must also match between lines. If they differ slightly, the threshold text_line_join_font_size_distance determines the acceptable range.
- Paragraph merging is blocked if any visual splitters (such as pde_line, pde_rect, or other layout-dividing elements) are detected between the lines.
  - For example, if a pde_line has been declared as an initial element between two pde text lines, those text lines will remain separate.
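The paragraph-assembly rules can be condensed into one decision function. This is a sketch under stated assumptions, not the SDK logic: can_join_lines and its parameters are hypothetical, text style matching is left out, and the order in which an explicit join value, the no_join flag, and splitters are checked is an assumption, since the rules above do not fix it.

```python
from typing import Optional


def can_join_lines(font_size_a: float, font_size_b: float, *,
                   font_size_distance: float,
                   same_initial_element: bool = True,
                   no_join: bool = False,
                   splitter_between: bool = False,
                   join: Optional[bool] = None) -> bool:
    """Decide whether two adjacent text lines may share one pde_text block."""
    # Lines from different initial elements are never joined.
    if not same_initial_element:
        return False
    # Assumption: an explicit join value from text_line_neighbours is checked next.
    if join is not None:
        return join
    # A no_join flag or a visual splitter between the lines blocks merging.
    if no_join or splitter_between:
        return False
    # Font sizes must match within text_line_join_font_size_distance.
    return abs(font_size_a - font_size_b) <= font_size_distance
```

Lines set in 10 pt and 10.4 pt join when the allowed distance is 0.5, while 10 pt and 12 pt lines join only if a rule sets join to true.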
Editable Values
name
tag
flag
actual_text
lang
label
heading
text_flag
single_instance
Thresholds
Template Source: pde_text_line → initial element
- text_line_join_font_size_distance – Maximum font size difference allowed for two lines to be joined.
Best Practice Recommendations
Example
text_split
Function Purpose and Behavior
The text_split function decides whether a pde_text block should be split into smaller parts, typically individual lines or even words. This segmentation helps improve paragraph detection and logical reading order.
🔹 How to Control Splitting
Splitting can be driven explicitly via template functions and flags, or implicitly via heuristics based on layout and spacing. The rules below are applied in priority order.
✅ To Force Splitting
You can force the text to split under any of the following conditions (higher priority listed first):
- Function-Based Rules
  - text_line_neighbours between two lines:
    - Set split: true → lines will be split at this point.
  - text_line_update for a single line:
    - Set split: true → the text will split at this line.
- Line Flags
  - The line has text_flag: new_line (or regex flag kTextLineFlagNewLine).
    - It is always split from the previous line.
- Heuristic Detection
  - For single-line texts:
    - It may be split into words if:
      - Average spacing between words is below the text_split_distance threshold.
      - Words are similar in length (e.g., character count).
  - For multi-line texts:
    - If all lines are single, non-hyphenated words → strong candidate for splitting.
    - If overall inter-line distance score (get_text_lines_distance) is below text_split_distance.
🚫 To Prevent Splitting
Text blocks will not be split in any of these conditions:
- Function-Based Rules
  - text_line_neighbours between two lines:
    - Set join: true → explicitly prevents splitting.
- Line Flags
  - The line has text_flag: no_new_line (or regex flag no_new_line).
    - It is never split from the previous line.
- Text-Level Flags or Structure
  - The pde_text has the flag no_split.
  - The pde_text's initial element is of type list → splitting is disabled.
- Initial Elements
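One way to read the force and prevent rules together is as a single ordered check. The sketch below is an interpretation, not the documented algorithm: should_split and its parameters are hypothetical, the relative priority of forcing and preventing rules at the same level is an assumption, and the word-length and single-word heuristics are reduced to one distance comparison.

```python
from typing import Optional


def should_split(*, split: Optional[bool] = None,
                 join: Optional[bool] = None,
                 line_flag: Optional[str] = None,
                 text_no_split: bool = False,
                 distance_score: Optional[float] = None,
                 text_split_distance: float = 0.0) -> bool:
    """Return True when a text block should be split at the given line."""
    # 1. Function-based rules (text_line_update / text_line_neighbours).
    if split is True:
        return True
    if join is True:
        return False
    # 2. Line flags.
    if line_flag == "new_line":
        return True
    if line_flag == "no_new_line":
        return False
    # 3. Text-level flags such as no_split disable the heuristics.
    if text_no_split:
        return False
    # 4. Heuristic: split when the distance score is below the threshold.
    return distance_score is not None and distance_score < text_split_distance
```

Under these assumptions, a line flagged new_line splits even when the text carries no_split, because line flags are checked before text-level flags; a real template should be verified against actual output where rules conflict.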
🔧 Editable Values (in Template)
Value | Where Used | Purpose |
---|---|---|
split | text_line_update , text_line_neighbours | Force split at this line or between two lines |
join | text_line_neighbours | Prevent split between two adjacent lines |
text_flag | Any line (e.g., new_line , no_new_line ) | Line-level control via flags |
flag | Any text (e.g., no_split ) | Marks the text block as unsplittable |
⚙️ Thresholds
Threshold Name | Description |
---|---|
text_split_distance | Distance threshold for deciding splits based on line or word spacing |
Word space estimate | Based on font (font_name + font_size ); used for heuristic new_line tagging |
📦 Template Source
- text_line_update → affects individual lines
- text_line_neighbours → controls splitting between adjacent lines
- Distance and spacing thresholds → inherited from the initial element of the pde_text_line
text_update
Purpose and Behavior
The text_update function processes all detected pde_text elements within a container and updates their metadata, classification, and structural role within the document.
This function performs the following key operations:
- Calculates paragraph similarity using the isolated_text_ratio threshold.
- Assigns properties such as name, tag, flag, label, heading, lang, and splitter.
- If a text is flagged as artifact, header, or footer, it is excluded from structural grouping and moved into the appropriate container (artifact, header, footer).
- Caption detection: the text is tested against regex_table_caption, regex_image_caption, regex_chart_caption, and regex_note_caption.
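Caption detection can be pictured as testing the start of a paragraph against one pattern per caption type. The patterns below are invented for illustration only; the real expressions are whatever the template defines under the four pagemap_regex keys, and detect_caption is a hypothetical helper.

```python
import re
from typing import Optional

# Hypothetical patterns; a real template supplies its own expressions for these keys.
CAPTION_PATTERNS = {
    "regex_table_caption": re.compile(r"^Table\s+\d+", re.IGNORECASE),
    "regex_image_caption": re.compile(r"^(Figure|Fig\.)\s*\d+", re.IGNORECASE),
    "regex_chart_caption": re.compile(r"^Chart\s+\d+", re.IGNORECASE),
    "regex_note_caption": re.compile(r"^Note\b", re.IGNORECASE),
}


def detect_caption(text: str) -> Optional[str]:
    """Return the name of the first caption pattern that matches, or None."""
    stripped = text.strip()
    for name, pattern in CAPTION_PATTERNS.items():
        if pattern.match(stripped):
            return name
    return None
```

A paragraph starting with "Table 3:" is reported under regex_table_caption, while ordinary body text matches none of the patterns.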
Editable Values
name
tag
flag
actual_text
lang
label
heading
text_flag
text_style
single_instance
Thresholds
Template Source: pde_text → initial element
isolated_text_ratio
pagemap_regex flags used for semantic detection
regex_table_caption
regex_image_caption
regex_chart_caption
regex_note_caption
Best Practice Recommendations
Example
🧩 text_split
🧠 Function Purpose and Behavior
The text_split function controls whether a pde_text block should be split into smaller paragraphs or lines. This affects paragraph detection and reading order.
The function uses several rules and thresholds, as well as template-defined overrides. It is triggered during layout processing and modifies how pde_text elements are structured internally.
✅ To Force Splitting
To split lines or words from a pde_text, use any of the following:
- Set "split": true inside a text_line_update or text_line_neighbours template function.
- Set a low threshold value in "text_split_distance" to encourage separation based on spacing.
- Ensure text_line_flag contains "new_line" on the relevant line.
🚫 To Prevent Splitting
Avoid splitting by:
- Adding "flag": "no_split" to the pde_text object.
- Setting text_line_flag to "no_new_line".
- Ensuring lines have an initial element (e.g., pde_text_line → initial element), which blocks any splitting.
- Using "join": true in text_line_neighbours to force keeping lines together.
🔧 Editable Values
You can set the following values to control splitting:
- "flag": use "no_split" to prevent any split.
- "text_line_flag": use "new_line" or "no_new_line" on pde_text_line.
- "split": apply within text_line_update or text_line_neighbours.
⚙️ Thresholds
These control splitting heuristics:
- "text_split_distance": distance between words/lines below which splitting is likely.
📦 Template Source
- From: pde_text_line → initial element
- Functions:
  - text_line_update
  - text_line_neighbours
- Value checks:
  - text_line_flag
  - split
  - join