Master PDF Auto-Tagging with Layout Templates
Introduction
PDFix Template Layout is a rule-based system that allows users to define custom layout recognition and tagging logic for PDFs. This is essential for structured documents (e.g., statements, catalogs, invoices) and for achieving compliance with accessibility standards such as PDF/UA.
This technical documentation provides a comprehensive reference to all features, functions, and nodes used in the Template Layout language, complete with advanced examples, implementation details, and behavioral logic derived from internal SDK methods like split_texts.
General Template Logic
The fundamental approach to building a complex template is to divide the document into logical sections using initial elements. Each section can have its own child template (defined in an element_template node), which overrides the main behavior of the layout recognition algorithm for that part of the document.
Initial elements are created using the element_create function. The most important value to define for an initial element is its bounding box (bbox).
Bounding boxes can be defined in two main ways:
- Fixed bbox: Using direct coordinates.
- Start/end bbox: More dynamic, may span multiple anchors.
Bounding box coordinates (left, top, right, bottom) can use:
- Static float values
- General context values ($page_num, $page_width, $doc_num_pages, etc.)
- Parent values ($parent_top, $parent_left, etc.)
- Anchor references ($A1_bottom, $ANCHOR_right, etc.)
- Math functions (e.g. SUM($parent_left, 10))
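To make the value kinds listed above concrete, here is a minimal Python sketch of how such coordinate expressions could be resolved to numbers. It is an illustration only, not the PDFix evaluator: the function name resolve_coordinate, the context dictionary, and the restriction to a flat SUM(...) call are all assumptions made for this example.

```python
import re
from typing import Dict

def resolve_coordinate(expr: str, context: Dict[str, float]) -> float:
    """Resolve one bbox coordinate written as a float, a $variable, or SUM(...)."""
    expr = expr.strip()
    sum_call = re.fullmatch(r"SUM\((.*)\)", expr)
    if sum_call:
        # Flat argument list only; nested calls are not handled in this sketch.
        args = sum_call.group(1).split(",")
        return sum(resolve_coordinate(arg, context) for arg in args)
    if expr.startswith("$"):
        # Hypothetical lookup; raises KeyError for unknown variables.
        return float(context[expr[1:]])
    return float(expr)

# Hypothetical context values for one page and one parent element.
context = {"page_width": 612.0, "page_height": 792.0, "parent_left": 72.0}
bbox = [
    resolve_coordinate(c, context)
    for c in ("SUM($parent_left, 10)", "688", "$page_width", "$page_height")
]
print(bbox)  # [82.0, 688.0, 612.0, 792.0]
```

The sketch keeps every coordinate as a string until resolution, mirroring how the template examples in this document quote all values.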
Why Use a Template?
- Ensure consistent tagging across similar documents
- Improve the accuracy of semantic tagging
- Save time on manual remediation
- Enhance accessibility and screen reader compatibility
- Boost efficiency by applying one template to thousands of similar PDFs
Pagemap Overview
The pagemap engine is the core class responsible for page layout recognition. It processes each page from bottom to top and groups primitive objects into logical entities like words, lines, paragraphs, rules, tables, lists, etc.
In PDFix Desktop, each processing step can be visually debugged and adjusted. You can use the settings -> debug_pagemap_stop property in the template to stop the layout engine after any step to inspect intermediate results. This is especially useful when building templates – you can verify every step separately and identify where recognition failed.
For example, setting the stop point to word_update halts after words are recognized. You can then verify whether superscripts and subscripts are properly merged into words. This technique is also helpful for debugging complex processes like paragraph assembly or table recognition.
Functions
- element_create
- object_update
- annot_update
- line_update
- rect_update
- object_update_text
- text_run_update
- text_run_neighbours
- word_update
element_create
Purpose and Behavior
The element_create function inserts a new layout element into the document structure. Unlike other functions that modify detected elements, this one defines virtual elements manually, based on explicitly set parameters like bounding box and type. These elements can be used as standalone tags, layout hints, containers for tagging, or structural anchors for downstream operations (e.g., TOC positioning or header/footer marking).
Created elements can be tagged, styled, and grouped just like native PDF elements. This is especially useful in cases where automatic detection fails or manual control over structure is required.
The element is only created if the following are defined:
- A valid type: pde_text, pde_text_line, pde_image, pde_container, pde_list, pde_line, pde_rect, pde_table, pde_cell, pde_toc, pde_header, pde_footer
- A bbox, start_bbox, or end_bbox:
  - Fixed Bounding Box Elements – The simplest way to define an initial element is with a fixed bbox. This method is best for elements like headers, footers, or static sidebars that appear at known locations.
  - Anchor-Based Initial Elements – A more flexible method uses anchors. An initial element’s position and size are defined relative to previously detected anchor elements. This allows dynamic placement based on content structure. The system creates initial elements at the start of page processing and assigns each page object to one of them based on overlap.
If the bbox is zero-sized, the function causes an error during layout detection.
Warning: By using this method, you override the default behavior of the recognition engine. Instead of detecting elements heuristically, the engine will use your pre-defined type, bbox, and flags.
Nested Templates (element_template)
An initial element can contain its own element_template node. If present, this child template completely replaces the parent template for that section. This means that:
- Functions and values from the main template do not apply within this section
- Only the child template will be used to process that region
If element_template is not defined, the initial element inherits the parent or global template rules.
Initial element matching modes
- Exact bounding box match – if the no_expand flag is set, the element’s bbox is extended using initial_element_expansion.
- Overlap-based matching – if no_expand is not set, the layout engine checks if the overlap area with a parent exceeds the initial_element_overlap threshold. If so, the parent is assigned.
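The overlap-based mode can be pictured with the following Python sketch. It assumes that the overlap is measured as the intersection area divided by the object's own area and that the best-overlapping initial element wins; the source only states that the overlap must exceed initial_element_overlap, so both choices, along with all names used here, are illustrative assumptions.

```python
from typing import Dict, Optional, Tuple

BBox = Tuple[float, float, float, float]  # two opposite corners: (x0, y0, x1, y1)

def _span(a: float, b: float) -> Tuple[float, float]:
    """Return the two values in ascending order."""
    return (a, b) if a <= b else (b, a)

def intersection_area(a: BBox, b: BBox) -> float:
    """Area shared by two axis-aligned boxes, independent of corner order."""
    ax0, ax1 = _span(a[0], a[2])
    ay0, ay1 = _span(a[1], a[3])
    bx0, bx1 = _span(b[0], b[2])
    by0, by1 = _span(b[1], b[3])
    width = min(ax1, bx1) - max(ax0, bx0)
    height = min(ay1, by1) - max(ay0, by0)
    return max(width, 0.0) * max(height, 0.0)

def assign_initial_element(
    obj: BBox, elements: Dict[str, BBox], overlap_threshold: float
) -> Optional[str]:
    """Pick the initial element whose overlap ratio with obj exceeds the threshold."""
    obj_area = intersection_area(obj, obj)
    if obj_area == 0:
        return None
    best_name, best_ratio = None, overlap_threshold
    for name, bbox in elements.items():
        ratio = intersection_area(obj, bbox) / obj_area  # assumed ratio definition
        if ratio > best_ratio:
            best_name, best_ratio = name, ratio
    return best_name

# Hypothetical initial elements, loosely based on the header/address example.
elements = {"header": (0, 688, 612, 792), "address": (88, 527, 325, 677)}
print(assign_initial_element((100, 700, 200, 720), elements, 0.5))  # header
```

An object that only partly overlaps an initial element is left unassigned when the threshold is high, which is the behavior the threshold is meant to control.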
Editable Values
- type – Element type (e.g. pde_text, pde_line, pde_table)
- bbox – Bounding box for element position and size
- start_bbox – defines beginning boundary for dynamic elements
- end_bbox – defines end boundary for dynamic elements
- name – Unique name for identification and references
- id – ID used in tagging or alt-text generation
- flag – Flags like artifact, header, footer for logical role
- text_flag – Flags for text elements (e.g. no_newline, first_cap, etc.)
- tag – Semantic tag (P, H1, L, etc.) for accessibility
- heading – Text style role: normal, h1, h2, h3
- label – Label level (label, li_1, label_no, etc.)
- lang – Language of element content (e.g. en-US)
- actual_text – Replacement text for screen readers
- alt – Alternate text used for descriptions (especially images)
- single_instance – Constraints for creating only one instance per page (e.g. font_size)
- sort_direction – Reading order in containers: 0 = default, 1 = columns, 2 = rows
- splitter – If element acts as a layout splitter (e.g. pde_cell)
- element_template – Embedded template definition for nested configuration
Table or Cell specific fields
- col_num – Number of columns (for tables)
- row_num – Number of rows (for tables)
- cell_row – Row position in table (for cell)
- cell_column – Column position in table (for cell)
- cell_row_span – Number of rows spanned by the cell
- cell_column_span – Number of columns spanned by the cell
- cell_scope – Scope of cell: row, column
- cell_header – Whether the cell is a header
- cell_associated_header – Comma-separated list of headers
Thresholds
Template Source: initial element
initial_element_overlap
initial_element_expansion
Example
This JSON snippet defines virtual elements for tagging content on page 1 of a PDF document using the element_create function. It demonstrates two primary use cases:
- Creating a dynamic header region with a nested template
- Manually tagging the text in a specific bounding box, wrapping each line of the NovaBank address in a single P tag
"element_create": [
  {
    "comment": "Tag elements on page n.1",
    "elements": [
      {
        "bbox": [
          "0",
          "688",
          "$page_width",
          "$page_height"
        ],
        "comment": "Tag header",
        "element_template": {
          "template": {
            "pagemap": [
              {
                "rd_sort": "2",
                "rd_sort_direction": "2",
                "statement": "$if"
              }
            ]
          }
        },
        "type": "pde_header"
      },
      {
        "bbox": [
          "88.05127716064453",
          "527.709228515625",
          "325.1794738769531",
          "677.7356567382812"
        ],
        "comment": "Tag NovaBank address as single line text",
        "text_flag": "new_line",
        "type": "pde_text"
      }
    ]
  }
]
object_update
Purpose and Behavior
The object_update function parses low-level PDS objects on the current PDF page and prepares them for text extraction and semantic assignment.
This function parses non-text objects from the current page and can apply to the following types:
- pds_object
- pds_path
- pds_image
- pds_shading
- pds_form
This function is also where pde_line and pde_rect elements are derived from pds_path objects.
All recognized objects are assigned an initial element based on their bounding box.
Editable Values
- flag – behavior modifier (artifact, header, footer)
Thresholds
Template Source: pds_object → initial element
artifact_similarity
element_line_similarity
angle_deviation
isolated_element_ratio
path_object_max
path_object_min
isolated_text
Best Practice Recommendations
- If you already know that certain objects are artifacts (like background images, aside boxes, or decorative headers/footers), it’s best to mark them in object_update early. Once an object is flagged as artifact, it will be excluded from further recognition stages.
Example
Mark all text objects located above 750px on the first page as artifacts.
{
  "object_update": [
    {
      "flag": "artifact",
      "query": {
        "$0_bottom": { "$gt": "750" },
        "$page_num": "1"
      },
      "param": ["pds_object"],
      "statement": "$if"
    }
  ]
}
annot_update
Purpose and Behavior
The annot_update function processes annotations (pdf_annot) on a PDF page and determines how they should be represented in the layout structure – either as form fields, markups, or excluded artifacts.
This function is responsible for:
- Converting widget annotations (e.g., form fields like checkboxes or text inputs) into pde_form_field elements.
- Creating layout representations for non-widget annotations such as highlights, comments, or shapes.
- Assigning each annotation to its initial element container based on bounding box overlap.
- Filtering out annotations that are excluded from processing via internal flags (kStateExclude).
line_update
Purpose and Behavior
The line_update function processes all recognized line elements (pde_line) and attempts to extend or merge them into longer logical lines. This is especially important for recognizing structural lines used in tables, headers, or boxed layouts.
Merging is controlled by a strict priority-based sequence.
Merge Decision Logic (in exact order)
Lines are eligible to be merged only if:
- Same Form Object: Both lines originate from the same form object (m_form_obj). If this check fails, no further conditions are evaluated.
- Initial Element Relationship:
  - ✅ Line A is the initial_element of Line B → merge.
  - ✅ Line A’s name matches Line B’s parent → merge.
  - ✅ If neither condition applies, both lines must share the same initial_element to be considered.
- No Join Restriction:
  - If either line has the flag no_join, they are not merged, regardless of other matches.
- Line Orientation and Geometry Check:
  - Lines must pass the line extend test, which evaluates parallelism, overlap, and angle tolerance (as controlled by thresholds like table_line_intersection).
The merging proceeds in two passes:
- First pass processes only initial lines – to allow them to grow before others are merged.
- Second pass processes all lines again, now using updated geometry.
The resulting extended lines are then available for tagging or further recognition steps (e.g., table detection).
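The ordered checks described above can be read as a chain of gates. The Python sketch below is one possible reading, not the SDK implementation: the Line class, its fields, and the extend_test callback are hypothetical stand-ins, initial elements are represented by name, and the sketch assumes that no_join vetoes a merge even when the initial-element relationship holds.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional, Set

@dataclass
class Line:
    name: str
    form_obj: Optional[str] = None         # stand-in for m_form_obj
    parent: Optional[str] = None
    initial_element: Optional[str] = None  # referenced by name in this sketch
    flags: Set[str] = field(default_factory=set)

def can_merge(a: Line, b: Line, extend_test: Callable[[Line, Line], bool]) -> bool:
    """Apply the merge gates in the documented order."""
    # 1. Same form object, otherwise stop immediately.
    if a.form_obj != b.form_obj:
        return False
    # 2. Initial element relationship (any of the three cases qualifies).
    related = (
        b.initial_element == a.name
        or b.parent == a.name
        or a.initial_element == b.initial_element
    )
    if not related:
        return False
    # 3. no_join on either line blocks the merge.
    if "no_join" in a.flags or "no_join" in b.flags:
        return False
    # 4. Geometry check, supplied by the caller in this sketch.
    return extend_test(a, b)

top = Line(name="top_rule", initial_element="table_1")
segment = Line(name="segment", initial_element="table_1")
print(can_merge(top, segment, lambda a, b: True))  # True
```

Passing the geometry test in as a callback keeps the sketch independent of how parallelism and overlap are actually measured.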
Editable Values
- name – assigns a unique identifier.
- parent – sets a link to initial element.
- label – optional semantic marker.
- tag – explicitly tags the line for PDF structure.
- flag – behavior modifier (artifact, header, footer, splitter, no_join).
Thresholds
Template Source: pde_line → initial element
- table_line_intersection – defines how closely two lines must intersect or align to be eligible for merging in the line extend test.
Best Practice Recommendations
- Lines marked as artifact here will be excluded from tagging. This is ideal for footers, underlines, or visual dividers.
Example
Mark all detected lines as artifacts.
"line_update": [
  {
    "@statement": "$if",
    "@query": {
      "@param<type>": "query_param",
      "param": [
        ["pde_line"]
      ]
    },
    "@flag": "artifact"
  }
]
rect_update
Purpose and Behavior
The rect_update function processes detected rectangle elements (pde_rect) and attempts to merge them into unified rectangular blocks when certain geometric and structural criteria are met. This is particularly useful for simplifying visual background elements like section boxes, shaded containers, and graphic panels.
Merging is conducted recursively within each container, using the rect_update method in two passes: one for initial rectangles and a second for the remaining content.
Merge Decision Logic (in exact order)
A rectangle will be merged with another only if all of the following conditions are met:
- Same Form Object: Both rectangles must originate from the same form XObject (m_form_obj).
- Initial Element Relationship:
  - ✅ Rectangle A is the initial_element of Rectangle B → merge.
  - ✅ Rectangle A’s name matches Rectangle B’s parent → merge.
  - ✅ If none of the above apply, both rectangles must have the same initial element.
- Merge Permissions (Flags): If either rectangle has the no_join flag set, they are explicitly excluded from merging, regardless of any other logic.
- Geometry and Visual Checks:
  - Rectangles must pass the internal extend test, which validates geometric alignment and similarity.
  - Merging is allowed when they are parallel, share the same width or height, and meet internal tolerances for alignment and spacing.
Editable Values
- name – a unique identifier for reference or hierarchy.
- parent – links this rectangle logically to another.
- label – optional descriptive tag.
- tag – structural tag used in PDF tagging.
- flag – controls behavior (e.g., “artifact”, “header”, “no_join”).
Thresholds
Template Source: pde_rect → initial element
- table_line_intersection – defines how closely two lines must intersect or align to be eligible for merging in the rectangle extend test.
Example
Label gray rectangles whose width and height are both at most 10px as first-level list labels (li_1).
"rect_update": [
  {
    "label": "li_1",
    "query": {
      "$and": [
        {
          "$0_fill_color": [
            "100",
            "100",
            "100"
          ]
        },
        {
          "$0_height": {
            "$lte": "10"
          }
        },
        {
          "$0_width": {
            "$lte": "10"
          }
        }
      ],
      "param": [
        "pde_rect"
      ]
    },
    "statement": "$if"
  }
]
object_update_text
Purpose and Behavior
The object_update_text function parses text-based PDS objects from the current page. It is invoked with parameters:
- pds_object
- pds_text
The function extracts text runs from the page content, segments them into words, and assigns each word to an appropriate initial element based on its bounding box.
This is one of the earliest and most critical steps in the layout pipeline. If this segmentation fails, all downstream logic (headings, tagging, tables) will be incorrect.
Text runs are later grouped into words based on spatial properties – including inter-character spacing and baseline alignment.
Editable Values
- flag – controls behavior (artifact, header, footer)
Thresholds
Template Source: page level using values in the main template block.
- word_space_width_ratio – Ratio multiplier used to estimate the maximum allowed space between characters within a word. If spacing between characters exceeds this value × minimal char spacing → a new word begins.
- word_space_width_min_ratio – Optional lower bound used to constrain the influence of very small font sizes. Applied as: allowed_space = font_size × word_space_width_min_ratio
These values control when and where characters get split into new words – especially important in justified or stylized text.
Warning: These are defined in the root-level (global) template, not per-element.
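As a rough illustration of how the two thresholds could interact, the Python sketch below splits a run of glyphs into words. It assumes that the lower bound is combined with the ratio-based limit by taking the larger of the two, and that the minimal character spacing is supplied by the caller; neither detail is stated in this document, and the function and parameter layout are hypothetical.

```python
from typing import List, Tuple

Glyph = Tuple[str, float, float]  # (character, left edge, right edge)

def split_words(
    glyphs: List[Glyph],
    font_size: float,
    min_char_spacing: float,
    word_space_width_ratio: float,
    word_space_width_min_ratio: float = 0.0,
) -> List[str]:
    """Start a new word whenever the gap to the previous glyph exceeds the allowed space."""
    allowed_space = max(
        word_space_width_ratio * min_char_spacing,
        font_size * word_space_width_min_ratio,  # assumed lower bound
    )
    words: List[str] = []
    current = ""
    previous_right = None
    for char, left, right in glyphs:
        if previous_right is not None and left - previous_right > allowed_space:
            words.append(current)
            current = ""
        current += char
        previous_right = right
    if current:
        words.append(current)
    return words

# Hypothetical glyph positions: gaps of 0.5, 3.0 and 0.5 units.
glyphs = [("H", 0.0, 5.0), ("i", 5.5, 7.0), ("y", 10.0, 15.0), ("o", 15.5, 20.0)]
print(split_words(glyphs, 10.0, 0.5, 3.0, 0.1))  # ['Hi', 'yo']
```

Raising word_space_width_ratio in this sketch widens the allowed gap, so loosely spaced glyphs stay in one word.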
Example
Mark all text objects positioned at or above 740px as artifacts on every page except the first.
"object_update": [
  {
    "comment": "Artifact texts except first page",
    "flag": "artifact",
    "query": {
      "$and": [
        {
          "$page_num": {
            "$gt": "1"
          }
        },
        {
          "$0_bottom": {
            "$gte": "740"
          }
        }
      ],
      "param": [
        "pds_text"
      ]
    },
    "statement": "$if"
  }
]
text_run_update
Purpose and Behavior
The text_run_update function modifies properties of individual pde_text_run elements after they are parsed from the page.
Its primary purpose is to assign text state flags based on visual or semantic context – such as subscript/superscript styling.
The text_run_update rule is evaluated per text run, immediately after it is extracted from the PDF stream.
Editable Values
- text_state_flag – Can include values like: subscript, superscript
These flags do not split the text run from its associated word. The run remains part of the word but is marked for later structural tagging.
This preserves grouped representations like: H₂O instead of: H 2 O
Thresholds
Template Source: page level using values in the main template block.
text_line_baseline_ratio
angle_deviation
Warning: These are defined in the root-level (global) template, not per-element.
Example
Mark all text runs whose font name matches ArialMT and whose font size is 4px as superscript.
"text_run_update": [
  {
    "query": {
      "$and": [
        {
          "$0_font_name": {
            "$regex": "ArialMT"
          }
        },
        {
          "$0_font_size": "4"
        }
      ],
      "param": [
        "pde_text_run"
      ]
    },
    "statement": "$if",
    "text_state_flag": "superscript"
  }
]
text_run_neighbours
Purpose and Behavior
The text_run_neighbours function determines whether two consecutive pde_text_run elements should be merged into a single word or split.
This function overrides automatic word detection, offering precision control in cases where standard heuristics fail (e.g., tight kerning, styled fragments).
Merge Decision Logic
The function compares two runs:
- If join = true → the runs are always joined into the same word.
- If join = false → a forced split is applied between the runs, even if they align.
- If no explicit rule is matched, the fallback logic checks:
  - Angle consistency (same_angle)
  - Baseline alignment (same_baseline)
  - Spacing break (e.g., visual separator or vertical border)
This lets you force join or split conditions for complex or edge-case text layouts.
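The decision above is effectively three-valued: an explicit true, an explicit false, or no matching rule. A small Python sketch of that shape follows; the function name and the boolean inputs are assumptions for illustration, and how the engine actually computes the fallback checks is not shown here.

```python
from typing import Optional

def should_join_runs(
    rule_join: Optional[bool],
    same_angle: bool,
    same_baseline: bool,
    has_spacing_break: bool,
) -> bool:
    """Explicit rule first, geometric fallback otherwise."""
    if rule_join is not None:
        # A matched text_run_neighbours rule wins in either direction.
        return rule_join
    # No rule matched: all fallback checks must agree.
    return same_angle and same_baseline and not has_spacing_break

# A forced split overrides perfectly aligned runs.
print(should_join_runs(False, True, True, False))  # False
```

Using None for "no rule matched" keeps a forced split distinguishable from the absence of a rule.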
Editable Values
- join – Boolean flag to force merge or split
Thresholds
Template Source: page level using values in the main template block.
- text_line_baseline_ratio – internal margin for baseline deviation
- angle_deviation
Warning: These are defined in the root-level (global) template, not per-element.
word_update
Purpose and Behavior
The word_update function processes individual pde_word elements after word segmentation is complete. It serves three key purposes:
- Semantic Flag Assignment: Applies structural or semantic flags to words based on regex rules. These flags influence downstream layout recognition (e.g., list detection, heading classification, TOC generation).
- Filling and Label Detection:
  - Uses regex_filling to detect filler-only words (like “…”, “–”).
  - Splits or skips them depending on match length.
  - Detects structured labels (e.g., numbered bullets or Roman/letter markers).
- Property Update and Artifact Extraction: If a word is marked as artifact, header, or footer, it’s converted into a separate pde_text element, detached from the text stream, and placed into the corresponding container. This prevents accidental tagging or reading by screen readers.
The word_update function identifies possible list labels and table of contents (TOC) entries among pde_word elements. It marks these elements with corresponding label or toc values and attempts to pair them with a sibling (typically a neighboring word) that represents the list item content or TOC title.
Editable Values
- name – unique name
- tag – semantic tag (e.g., Span, Div, Note)
- flag – behavior modifier (artifact, header, footer, etc.)
- label – used for logical labeling
- heading – heading level
- actual_text – alternate value for screen readers
- lang – language code
- word_flag – manual override for system-assigned flags
- single_instance – suppresses duplicates across layout
- word_space – sets an exact space width for this word’s font-size/font. If word_space is set manually here, it overrides all computed word spacing logic and disables the word_space_ratio.
Thresholds
Template Source: pde_word → initial element.
- word_space_ratio – multiplier for auto-calculated spacing (used unless overridden)
- word_space_update_max – prevents auto-updates beyond a limit (0 = no updates)
pagemap_regex flags used for semantic detection
- hyphen – regex_hyphen
- bullet – regex_bullet, regex_bullet_font
- colon – regex_colon
- number – number_chars
- terminal – regex_terminal
- capital – regex_first_cap
- decimal_num – regex_decimal_numbering
- roman_num – regex_roman_numbering
- letter_num – regex_letter_numbering
- page_num – regex_page_number
- filling – regex_filling
- comma – regex_comma
- label – regex_label, label_chars, regex_letter
Example
Mark all dot-leader words as artifacts.
{
  "word_update": [
    {
      "flag": "artifact",
      "query": {
        "$0_text": { "$regex": "^\\.+$" }
      },
      "param": ["pde_word"],
      "statement": "$if"
    }
  ]
}
Change the tag type: tag words with a font size of 4.5 as Span.
"word_update": [
  {
    "query": {
      "$and": [
        {
          "$0_font_size": "4.5"
        }
      ],
      "param": [
        "pde_word"
      ]
    },
    "statement": "$if",
    "tag": "Span"
  }
]
Set the actual text based on the word's text: the empty box (□) is read as "No" and the filled box (■) as "Yes".
"word_update": [
  {
    "actual_text": "No",
    "query": {
      "$and": [
        {
          "$0_text": {
            "$regex": "□"
          }
        }
      ],
      "param": [
        "pde_word"
      ]
    },
    "statement": "$if",
    "tag": "Span"
  },
  {
    "actual_text": "Yes",
    "query": {
      "$and": [
        {
          "$0_text": {
            "$regex": "■"
          }
        }
      ],
      "param": [
        "pde_word"
      ]
    },
    "statement": "$if",
    "tag": "Span"
  }
]
Word Spacing Precision Logic
Accurate word spacing is critical for reliable recognition of text lines, paragraph types (e.g. justified vs. simple), and ultimately, for layout structure such as heading alignment or table row segmentation.
Automatic Detection of Word Space
During word recognition, a base word space is automatically estimated for each unique:
- Font name
- Font size
This is computed by analyzing inter-character gaps across text runs with matching font properties. It defines what spacing is considered “normal” within a word vs. between words.
This automatically estimated word space width is then used to:
- Detect word boundaries
- Classify a line as:
  - Simple: uniform space width
  - Justified: variable space widths
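The simple versus justified distinction can be sketched in a few lines of Python. The tolerance value and the rule "uniform means all gaps within a fixed tolerance of each other" are assumptions made for this illustration; the engine's actual criterion is not described in this document.

```python
from typing import List

def classify_line(space_widths: List[float], tolerance: float = 0.1) -> str:
    """Label a line 'simple' when its inter-word gaps are near-uniform, else 'justified'."""
    if len(space_widths) < 2:
        # A single gap (or none) cannot vary, so treat the line as simple.
        return "simple"
    spread = max(space_widths) - min(space_widths)
    return "simple" if spread <= tolerance else "justified"

print(classify_line([2.5, 2.5, 2.52]))  # simple
print(classify_line([2.5, 3.4, 2.9]))   # justified
```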
Ways to Adjust Word Space
1. Exact Override per Word (word_update)
Template Source: pde_word → initial element
"word_update": [
  {
    "query": {
      "$and": [
        {
          "$0_font_size": "4.5"
        }
      ],
      "param": [
        "pde_word"
      ]
    },
    "statement": "$if",
    "word_space": "4.2"
  }
]
Sets an exact word spacing value for this font-size & font-name combination.
- Overrides all other logic
- Disables word_space_ratio
✅ Use this when auto-estimation fails for a specific word or stylized font.
2. Proportional Scaling (word_space_ratio)
Template Source: pde_container → initial element
"pagemap": [
  {
    "word_space_ratio": "1.15"
  }
]
Multiplies the estimated space width by a scaling factor. Useful for small global corrections.
- Only used if no word_space is defined
- Applies to all words in the container
3. Post-Line Update Adjustment (text_line_update)
Template Source: pde_line → initial element (for word_space)
"text_line_update": [
  {
    "word_space": "4.2",
    "query": {},
    "param": [
      "pde_text_line"
    ],
    "statement": "$if"
  }
]
Once words are grouped into lines, text_line_update can fine-tune or lock the final spacing:
- If word_space is defined in text_line_update, the spacing for all words in the line is set to that value.
- If word_space_update_max is defined:
  - It limits re-estimation from line analysis
  - Set to 0 to prevent any automatic spacing changes
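The three adjustment mechanisms can be summarized as a precedence chain. The Python sketch below assumes that a line-level word_space, being applied last, takes priority over a word-level word_space, and that word_space_ratio only scales the estimate when no explicit value exists. The first assumption is an interpretation of the processing order rather than something this document states outright, and the function and parameter names are hypothetical.

```python
from typing import Optional

def effective_word_space(
    estimated: float,
    word_space_ratio: float = 1.0,
    word_override: Optional[float] = None,   # word_space from word_update
    line_override: Optional[float] = None,   # word_space from text_line_update
) -> float:
    """Resolve the word space used for a word under the assumed precedence."""
    if line_override is not None:
        return line_override
    if word_override is not None:
        # An explicit word_space disables word_space_ratio.
        return word_override
    return estimated * word_space_ratio

print(effective_word_space(4.0, 1.15, word_override=4.2))  # 4.2
```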
Best Practice Recommendations
- Use word_space only when auto-estimation fails consistently for specific fonts.
- Apply word_space_ratio globally in containers (e.g., invoices or tables).
- Avoid multiple re-definitions across word_update and text_line_update unless needed for layout corrections.
- Use word_space_update_max:0 to lock final spacing post-assembly.
Label Detection
Purpose and Behavior
The word_update function detects and classifies list label words (li_1 to li_4) using regex rules and spatial relationships. When a word is marked as a potential label (e.g., 1., A), (iii)), the engine attempts to pair it with a sibling element – usually the associated paragraph or line content.
- If a word is explicitly marked with a label property in a template rule (e.g., label: li_1), it is treated as a list item at the corresponding nesting level.
- If no label is set manually, the system attempts automatic detection using regex and heuristics:
  - Valid label formats: numeric (1.), alphabetic (A), Roman numerals (IV), bullets (•), or combinations with brackets or dots.
  - The word is analyzed for regex_label, regex_letter, regex_roman_numbering, and others.
- Once a word is flagged as a possible label, it is matched to a sibling word based on reading order (RTL or LTR).
Thresholds
Template Source: pde_word → initial element.
- label_word_detect – Enables automatic label detection. Set to 0 in templates where labeling is irrelevant to prevent unwanted grouping.
- label_distance_ratio – Used to calculate label-sibling horizontal distance. dist = font_size × label_distance_ratio. It affects how far the algorithm will look horizontally for a matching sibling.
- label_word_w1 – Weight for vertical alignment similarity between label candidates during clustering.
- label_word_w2 – Weight for label-sibling offset alignment.
- label_word_dist_sibling_ratio – Reject labels if sibling is too far. Max distance = font_size × ratio.
- label_sibling_distance_ratio – Reject false labels if sibling is too close to another word in the line.
- label_word_distance – Maximum absolute clustering distance between label candidates.
- label_word_distance_ratio – Used if label_word_distance == 0, scales by page width.
- concurrent_threads – Multithreaded clustering of label candidates.
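Two of the thresholds above are defined by simple arithmetic, which the following Python sketch spells out. The function names and the sample values are hypothetical; only the formulas dist = font_size × label_distance_ratio and the fallback from label_word_distance to label_word_distance_ratio × page width come from the descriptions above.

```python
def label_search_distance(font_size: float, label_distance_ratio: float) -> float:
    """Horizontal distance searched for a label's sibling."""
    return font_size * label_distance_ratio

def label_cluster_distance(
    label_word_distance: float,
    label_word_distance_ratio: float,
    page_width: float,
) -> float:
    """Absolute clustering distance, falling back to a page-width ratio when unset (0)."""
    if label_word_distance != 0:
        return label_word_distance
    return label_word_distance_ratio * page_width

# Hypothetical values: 10pt text on a 600-unit-wide page.
print(label_search_distance(10.0, 2.5))        # 25.0
print(label_cluster_distance(0, 0.01, 600.0))  # 6.0
```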
pagemap_regex flags used for semantic detection
- label_chars: Characters like “(”, “)”, “.” used to strip and normalize labels.
- regex_label: Pattern for detecting list markers.
Editable Values
label: label, label_no, li_1 … li_4
Best Practice Recommendations
Automatic label detection should be explicitly disabled (label_word_detect = 0) in templates where it is irrelevant (e.g. tables without list-like elements).
Example
"word_update": [
  {
    "label": "li_1",
    "query": {
      "$and": [
        {
          "$0_text": {
            "$regex": "^\\([a-z]\\)$"
          }
        }
      ],
      "param": [
        "pde_word"
      ]
    },
    "statement": "$if"
  },
  {
    "label": "li_2",
    "query": {
      "$and": [
        {
          "$0_text": {
            "$regex": "^\\(\\d\\)$"
          }
        }
      ],
      "param": [
        "pde_word"
      ]
    },
    "statement": "$if"
  },
  {
    "label": "li_3",
    "query": {
      "$and": [
        {
          "$0_text": {
            "$regex": "^\\((i|ii|iii|iv|v|vi|vii|viii|ix|x|xi|xii|xiii|xiv|xv|xvi|xvii|xviii|xix|xx)\\)$"
          }
        }
      ],
      "param": [
        "pde_word"
      ]
    },
    "statement": "$if"
  }
]
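Before putting patterns like these into a template, it can help to try them against sample labels. The Python sketch below does that with the standard re module, using the three patterns from the example above with the JSON escaping removed. It assumes Python's regex dialect behaves the same as the engine's for these simple patterns, and the helper matching_labels is hypothetical. How the engine resolves a word that matches several rules is not covered here.

```python
import re

# The three patterns from the template example (JSON escaping removed).
LABEL_RULES = [
    ("li_1", re.compile(r"^\([a-z]\)$")),
    ("li_2", re.compile(r"^\(\d\)$")),
    ("li_3", re.compile(
        r"^\((i|ii|iii|iv|v|vi|vii|viii|ix|x|xi|xii|xiii|xiv|xv"
        r"|xvi|xvii|xviii|xix|xx)\)$"
    )),
]

def matching_labels(text: str) -> list:
    """Return every label whose pattern matches the word text."""
    return [label for label, pattern in LABEL_RULES if pattern.search(text)]

for word in ["(a)", "(3)", "(ii)", "(i)", "(10)"]:
    print(word, matching_labels(word))
```

Running this shows that single-letter Roman numerals such as (i), (v), and (x) match both the li_1 and li_3 patterns, and that (10) matches none because the li_2 pattern accepts a single digit only. Both are worth checking against the intended list structure.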
TOC Detection
Purpose and Behavior
The word_update function also supports detection of Table of Contents (TOC) elements by identifying possible page number words (e.g., 15, 248, xii) and matching them with title words on the same line.
- Words are flagged as TOC numbers using regex_page_number.
- A sibling is searched to the left or right (depending on document direction) that likely represents the section heading.
- TOC entries are grouped and clustered spatially, then marked as a TOC item.
Thresholds
Template Source: pde_word → initial element.
- toc_detect – Enables TOC detection. Set to 0 in non-TOC templates for performance and precision.
- toc_word_distance – Absolute clustering cutoff for grouping similar TOC numbers. Defines the tightness of TOC number clustering. This ensures only aligned page number entries are grouped.
- toc_word_distance_ratio – Used if the absolute distance is unset. Multiplied by page_font_width.
- concurrent_threads – Parallel processing during clustering of TOC words.
pagemap_regex flags used for semantic detection
- regex_page_number: Used to detect page numbers.
Editable Values
toc: label, label_no
Best Practice Recommendations
Setting toc_detect = 0 is essential for improving performance and avoiding misclassifications in non-TOC sections.
word_neighbours
Purpose and Behavior
The word_neighbours function controls how recognized words (pde_word) are grouped into text lines (pde_text_line). It plays a critical role in reconstructing logical reading order and grouping sequences of text.
Two words will be joined into the same line if:
- Their initial elements are compatible.
- Neither has the no_join flag.
- They have the same text style.
- Their writing angles match.
- They share the same baseline (within text_line_baseline_ratio × font size).
- No splitting object (line, rect, other words) is between them.
- The gap between them is ≤ word_space_distance_max, or ≤ word_space_distance_max_ratio × max font size when word_space_distance_max is zero.
You can override these constraints explicitly using the word_neighbours rule with join: true | false.
If a word has an initial element that is already a pde_text_line, it is automatically inserted into that line. Otherwise, a new line is created, and neighboring words are added if they pass all join conditions.
Editable Values
- Defined in word_neighbours:
  - join:
    - If true, forcibly joins two words into a line.
    - If false, forcibly prevents joining.
    - If not defined, fallback logic uses baseline, spacing, and flags.
- Defined in word_update for individual word flags:
  - no_join: If a word has this flag set, it will never be joined into a line, even if all other conditions match.
Thresholds
Template Source: pde_text_line → initial element.
- text_line_baseline_ratio – Maximum vertical offset allowed between baselines, multiplied by font size.
- word_space_distance_max – Maximum absolute horizontal gap between words (in user units).
- word_space_distance_max_ratio – If word_space_distance_max is zero, this ratio × max font size is used.
  - These two thresholds are ignored for RTL label words unless label_word_detect = 0.
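The baseline and gap thresholds can be combined into a single check, sketched below in Python. The Word class and function names are hypothetical, the sketch covers only the two numeric conditions (not flags, style, angle, or splitting objects), and it assumes the larger of the two font sizes is used for the baseline tolerance, which this document does not specify.

```python
from dataclasses import dataclass

@dataclass
class Word:
    left: float
    right: float
    baseline: float
    font_size: float

def max_gap(a: Word, b: Word, distance_max: float, distance_max_ratio: float) -> float:
    """Absolute limit if set, otherwise ratio × the larger font size."""
    if distance_max != 0:
        return distance_max
    return distance_max_ratio * max(a.font_size, b.font_size)

def passes_thresholds(
    a: Word,
    b: Word,
    text_line_baseline_ratio: float,
    word_space_distance_max: float,
    word_space_distance_max_ratio: float,
) -> bool:
    """Check baseline tolerance and horizontal gap for a left-to-right word pair."""
    tolerance = text_line_baseline_ratio * max(a.font_size, b.font_size)  # assumed
    same_baseline = abs(a.baseline - b.baseline) <= tolerance
    gap = b.left - a.right
    limit = max_gap(a, b, word_space_distance_max, word_space_distance_max_ratio)
    return same_baseline and gap <= limit

first = Word(left=72.0, right=100.0, baseline=700.0, font_size=10.0)
second = Word(left=104.0, right=130.0, baseline=700.5, font_size=10.0)
print(passes_thresholds(first, second, 0.1, 0.0, 0.5))  # True
```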
Best Practice Recommendations
The word_neighbours method is called only for specific pairs of words, not all pairs.
Example
"word_neighbours": [
  {
    "join": "true",
    "query": {
      "$and": [
        {
          "$0_font_name": {
            "$regex": "Arial-BoldMT"
          }
        },
        {
          "$1_font_name": {
            "$regex": "Arial-BoldMT"
          }
        }
      ],
      "param": [
        "pde_word",
        "pde_word"
      ]
    },
    "statement": "$if"
  }
]
word_connect
Purpose and Behavior
The word_connect function merges consecutive text lines that logically belong together. It is typically used to reconstruct full paragraphs or wrapped lines that were split due to PDF layout quirks.
The function evaluates each pair of pde_text_line elements based on alignment, spacing, and their surrounding context. If certain conditions are met, two lines are merged into one.
Behavioral priorities for merging:
- Initial Element: If a pde_text_line has the no_join flag set, its linked pde_text_line will not be joined with others.
- Alignment Detection: Before merging, all lines are grouped by their vertical alignment (left, right, center) with a font-size-based threshold.
- Explicit Relationship Check: The function calls word_matches_to_line() to verify whether a word from one line can be logically extended into the next line.
- Alignment Count Heuristic: If both lines belong to strong alignment groups (≥3 members), merging is rejected to avoid false positives in structured layouts like tables or columns.
- Word Spacing Check: Distance between the rightmost word of the left line and the leftmost word of the right line must be less than calculated word spacing (from get_simple_word_spacing()).
When lines are merged, all their words are combined into a single pde_text_line, and the redundant line is removed from the container.
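Two of the priorities above, the alignment count heuristic and the word spacing check, reduce to a short predicate. The Python sketch below shows only those two conditions under stated assumptions: the group sizes and the calculated word spacing are passed in by the caller, and the function name may_connect_lines is hypothetical. The no_join flag and the word_matches_to_line() check are left out.

```python
def may_connect_lines(
    left_group_size: int,
    right_group_size: int,
    gap: float,
    word_spacing: float,
    strong_group_min: int = 3,
) -> bool:
    """Reject merges between two strongly aligned lines, then apply the spacing check."""
    both_strong = (
        left_group_size >= strong_group_min and right_group_size >= strong_group_min
    )
    if both_strong:
        # Likely table columns or similar structured layout.
        return False
    return gap < word_spacing

# Two lines that each share their alignment with many others stay separate.
print(may_connect_lines(5, 4, gap=2.0, word_spacing=3.0))  # False
print(may_connect_lines(1, 4, gap=2.0, word_spacing=3.0))  # True
```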
Editable Values
- word_update: e.g. kElemNoJoin, word_flag
- text_line_update: e.g. m_text_style, bounding boxes
- Line alignment grouping (implicitly controlled by layout)
Thresholds
Template Source: pde_text_line → initial element.
Word-level decisions from: pde_word → initial element
- text_line_baseline_ratio: Used in word_matches_to_line() to determine baseline tolerance.
- word_space_distance_max and word_space_distance_max_ratio: Control maximum allowed space between words/lines. Used in both word_neighbours and word_connect.
⚠️ Note: Merging is only attempted if the spacing threshold is not exceeded and no graphical splitters or layout anomalies are in between.