Summary of Key Features:
- Nested Templates: Allow for additional processing rules to be applied to elements after they are created.
- Math Expressions: Enable dynamic calculations for element properties like bounding box coordinates.
- Anchors: Mark specific elements for later reference in math formulas or other rules.
- Dynamic Bounding Box Calculations: Position elements relative to other elements or page dimensions.
- Start-End Bounding Box Calculations: Position elements whose length is determined dynamically by other start/end elements.
Summary of Changes:
Metadata
- Changed { "metadata": { "version": "2" } }
Settings
- Removed { "settings": { "retain_pdfua": "" } }
- Added { "settings": { "rtl": "false" } }
  - Prepared for future support of right-to-left reading order languages.
- Renamed key { "settings": { "substructure_form_xobject": "…" } }
  - Old name: "substructure_form_XObject"
Pagemap
New keys
{
  "template": {
    "pagemap": [
      {
        "form_table_detect": "1",
        "graphic_table_detect": "1",
        "col_num": "0",
        "label_image_detect": "1",
        "label_word_detect": "1",
        "label_word_dist_sibling_ratio": "4",
        "text_table_alignment_distance": "0.4",
        "text_table_alignment_type": "2",
        "text_table_col_alignment_type": "1",
        "text_table_col_distance": "0.8",
        "text_table_col_similarity": "0.36",
        "text_table_col_similarity_type": "0",
        "text_table_column_similarity": "0.5",
        "text_table_detect": "1",
        "text_table_image_col_gs": "1",
        "text_table_image_col_height_max": "0",
        "text_table_image_col_height_max_ratio": "2",
        "text_table_image_col_height_min": "0",
        "text_table_image_col_height_min_ratio": "1",
        "text_table_image_col_w1": "1",
        "text_table_image_col_width_max": "0",
        "text_table_image_col_width_max_ratio": "4",
        "text_table_image_col_width_min": "0",
        "text_table_image_col_width_min_ratio": "1",
        "text_table_image_similarity": "0.7",
        "text_table_image_similarity_w1": "1",
        "text_table_image_similarity_w2": "1",
        "text_table_paragraph_similarity": "0.7",
        "text_table_row_alignment_type": "1",
        "text_table_text_col_w1": "1",
        "text_table_text_col_w2": "1",
        "text_table_text_col_width_max": "0",
        "text_table_text_col_width_max_ratio": "8",
        "text_table_text_col_width_min": "0",
        "text_table_text_col_width_min_ratio": "1",
        "toc_detect": "1",
        "toc_word_distance_ratio": "1"
      }
    ]
  }
}
Name | Default Value | Description |
form_table_detect | 1 | Recognize form fields as tables. Replaces the old key table_detect_form. |
graphic_table_detect | 1 | Graphic table detection. Possible values: 0 or 1. Setting 0 prevents generating tables from paths. |
col_num | 0 | Number of columns. If 0, columns are detected automatically; otherwise the col_num value is used for layout detection. |
label_image_detect | 1 | Graphic label detection. Possible values: 0 or 1. Setting 0 prevents generating labels from paths. |
label_word_detect | 1 | Text label detection. Possible values: 0 or 1. Setting 0 prevents generating labels from words. |
label_word_dist_sibling_ratio | 4 | Threshold ratio, multiplied by the sibling's font size, that sets the maximum gap at which a label and its sibling element are joined. If the distance exceeds this value, the label and its sibling remain separate. |
text_table_<…> | | Replaces the previous sect_table_<…> keys. |
toc_detect | 1 | Table of contents detection. Possible values: 0 or 1. Setting 0 prevents generating a TOC from words. |
toc_word_distance_ratio | 0 | Threshold ratio that determines when TOC entries are clustered together. The value is multiplied by the average page font width. |
Changed default values
{
  "template": {
    "pagemap": [
      {
        "initial_element_expansion": "1",
        "label_image_distance": "4",
        "label_image_w4": "4",
        "label_image_width_min_ratio": "4",
        "text_line_distance_max_ratio": "2",
        "toc_word_distance": "0",
        "word_distance_ratio": "0.4",
        "word_space_distance_max_ratio": "4"
      }
    ]
  }
}
Removed Keys
{
  "template": {
    "pagemap": [
      {
        "label_image_w6": "1",
        "label_image_w7": "1",
        "label_image_w8": "0",
        "label_image_w9": "0",
        "label_sibling_dist_ratio": "1.2",
        "sect_table_alignment_distance": "0.4",
        "sect_table_alignment_type": "2",
        "sect_table_col_alignment_type": "1",
        "sect_table_col_distance": "0.8",
        "sect_table_col_similarity": "0.36",
        "sect_table_col_similarity_type": "0",
        "sect_table_column_similarity": "0.5",
        "sect_table_image_col_gs": "1",
        "sect_table_image_col_height_max": "0",
        "sect_table_image_col_height_max_ratio": "2",
        "sect_table_image_col_height_min": "0",
        "sect_table_image_col_height_min_ratio": "1",
        "sect_table_image_col_w1": "1",
        "sect_table_image_col_width_max": "0",
        "sect_table_image_col_width_max_ratio": "4",
        "sect_table_image_col_width_min": "0",
        "sect_table_image_col_width_min_ratio": "1",
        "sect_table_image_similarity": "0.7",
        "sect_table_image_similarity_w1": "1",
        "sect_table_image_similarity_w2": "1",
        "sect_table_paragraph_similarity": "0.7",
        "sect_table_row_alignment_type": "1",
        "sect_table_text_col_w1": "1",
        "sect_table_text_col_w2": "1",
        "sect_table_text_col_width_max": "0",
        "sect_table_text_col_width_max_ratio": "8",
        "sect_table_text_col_width_min": "0",
        "sect_table_text_col_width_min_ratio": "1",
        "table_detect_form": "1",
        "table_detect_sect": "1",
        "text_line_subscript_len": "6",
        "text_line_subscript_space_ratio": "0.5"
      }
    ]
  }
}
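The renames listed above can be applied to an existing template mechanically. The following Python sketch is only an illustration and is not part of the product: the helper names migrate_template and migrate_pagemap_entry are hypothetical, and it assumes that "metadata", "settings" and "template" are top-level keys of the template JSON. It applies only the changes this page states explicitly (version "2", retain_pdfua removed, rtl added, substructure_form_XObject renamed, sect_table_ keys renamed to text_table_, table_detect_form renamed to form_table_detect). Removed keys without a stated replacement, such as table_detect_sect or label_sibling_dist_ratio, are left untouched for manual review.

```python
import copy

SECT_PREFIX = "sect_table_"
TEXT_PREFIX = "text_table_"
# Only the one-to-one rename that the key table states explicitly.
DIRECT_RENAMES = {"table_detect_form": "form_table_detect"}


def migrate_pagemap_entry(entry):
    """Return a copy of one pagemap entry with the documented renames applied."""
    migrated = {}
    for key, value in entry.items():
        if key.startswith(SECT_PREFIX):
            key = TEXT_PREFIX + key[len(SECT_PREFIX):]
        key = DIRECT_RENAMES.get(key, key)
        migrated[key] = value
    return migrated


def migrate_template(doc):
    """Hypothetical helper: return a copy of `doc` using the version 2 key names."""
    doc = copy.deepcopy(doc)
    doc.setdefault("metadata", {})["version"] = "2"

    settings = doc.setdefault("settings", {})
    settings.pop("retain_pdfua", None)      # removed in version 2
    settings.setdefault("rtl", "false")     # added in version 2
    if "substructure_form_XObject" in settings:
        # Same key, now spelled with a lower-case "xobject".
        settings["substructure_form_xobject"] = settings.pop("substructure_form_XObject")

    template = doc.get("template")
    if template and "pagemap" in template:
        template["pagemap"] = [migrate_pagemap_entry(e) for e in template["pagemap"]]
    return doc


old = {"template": {"pagemap": [{"sect_table_col_distance": "0.8", "table_detect_form": "1"}]}}
print(migrate_template(old)["template"]["pagemap"])
```

The helper works on a deep copy, so the original dictionary can still be compared against the migrated one.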
Pagemap Regex
New keys
Added a new pagemap_regex section for "rtl": "true", in preparation for future support of right-to-left (RTL) languages.
Functions
New Functions
Function Name | Description |
object_update | This test is triggered when a page content object is processed. Replaces the old functions form_object_process, path_object_process, image_object_process, and text_obect_process. The object type is passed as a parameter. |
text_run_update | Updates a text run element after text objects are processed. |
text_run_neighbours | This test is triggered when forming text lines from text runs. |
word_neighbours | This test is triggered when forming text lines from words. Replaces the word_spacing function. |
Removed Functions
Function Name | Description |
form_object_process, path_object_process, image_object_process, text_obect_process | Replaced by the object_update function, with the object type passed as a parameter. |
word_spacing, text_line_add_word | Replaced by the word_neighbours function. |
single_instance_detect | TBD |
1. Nested Templates in element_create
Nested templates allow for additional processing rules to be applied to specific elements after they are created. This is particularly useful for defining complex behaviors for elements like text lines or tables.
Example:
"element_create": [
  {
    "comment": "Initial elements on the top of Page 2 above the Datum table",
    "disable": "true",
    "elements": [
      {
        "bbox": [
          "51.52302932739258",
          "639.4674682617188",
          "547.4321899414062",
          "721.3561401367188"
        ],
        "comment": "IBAN, BIC",
        "flag": "no_table",
        "element_template": {
          "template": {
            "text_line_update": [
              {
                "query": {
                  "param": [
                    "pde_text_line"
                  ]
                },
                "statement": "$if",
                "text_line_flag": "new_line"
              }
            ]
          }
        },
        "type": "pde_text"
      }
    ],
    "query": {
      "$and": [
        {
          "$page_num": "2"
        }
      ]
    },
    "statement": "$if"
  }
]
Explanation:
- element_template: A nested template that defines additional rules for the element.
  - text_line_update: Updates text lines within the element.
    - query: Defines conditions for applying the update.
      - param: Specifies the type of element being processed (pde_text_line).
    - statement: Determines the evaluation logic ($if in this case).
    - text_line_flag: Marks the text line with a specific flag (new_line in this case).
2. Math expressions
Math formulas like SUM(...) and MINUS(...) are used to dynamically calculate values for element properties such as bounding box coordinates.
Supported functions:
Function | Syntax | Description |
SUM | SUM(value1, value2) | Returns the sum of numbers. Equivalent to the `+` operator. |
MINUS | MINUS(value1, value2) | Returns the difference of two numbers. Equivalent to the `-` operator. |
ABS | ABS(value1) | Returns the absolute value of a number. |
MULTIPLY | MULTIPLY(value1, value2) | Returns the product of two numbers. Equivalent to the `*` operator. |
DIVIDE | DIVIDE(value1, value2) | Returns one number divided by another. Equivalent to the `/` operator. |
MIN | MIN(value1, value2) | Returns the smaller of the two values. |
MAX | MAX(value1, value2) | Returns the larger of the two values. |
MOD | MOD(dividend, divisor) | Returns the result of the modulo operator, the remainder after a division operation. |
FLOOR | FLOOR(value1) | Rounds a number down to the nearest integer multiple of specified significance. |
CEILING | CEILING(value1) | Rounds a number up to the nearest integer multiple of specified significance. |
Example:
"element_create": [
  {
    "elements": [
      {
        "bbox": [
          "$A1_left",
          "MINUS($A1_top,40)",
          "SUM($A1_right,100)",
          "$A1_top"
        ],
        "type": "pde_image"
      }
    ],
    "query": {},
    "statement": "$if"
  }
]
Explanation:
- bbox: The bounding box coordinates are calculated using math formulas.
  - $A1_left: Refers to the left coordinate of the anchor element A1.
  - MINUS($A1_top,40): Subtracts 40 from the top coordinate of the anchor element A1.
  - SUM($A1_right,100): Adds 100 to the right coordinate of the anchor element A1.
  - $A1_top: Refers to the top coordinate of the anchor element A1.
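To make the evaluation of such expressions concrete, here is a minimal Python sketch of an evaluator for the functions listed in the table. It is an illustration under stated assumptions, not the product's implementation: the names FUNCTIONS, tokenize and evaluate are hypothetical, nested calls are assumed to be allowed, and the single-argument FLOOR and CEILING are assumed to round to a whole number. The anchor coordinates in the usage lines are made-up values.

```python
import math
import re

# Function table following the list of supported functions.
# Assumption: FLOOR and CEILING with one argument round to a whole number.
FUNCTIONS = {
    "SUM": lambda a, b: a + b,
    "MINUS": lambda a, b: a - b,
    "ABS": lambda a: abs(a),
    "MULTIPLY": lambda a, b: a * b,
    "DIVIDE": lambda a, b: a / b,
    "MIN": lambda a, b: min(a, b),
    "MAX": lambda a, b: max(a, b),
    "MOD": lambda a, b: a % b,
    "FLOOR": lambda a: float(math.floor(a)),
    "CEILING": lambda a: float(math.ceil(a)),
}

# A token is a $variable, a function name, a number, or punctuation.
TOKEN = re.compile(r"\s*(\$\w+|[A-Z]+|-?\d+(?:\.\d+)?|[(),])")


def tokenize(text):
    """Split an expression string into tokens, rejecting unknown characters."""
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if not match:
            raise ValueError(f"unexpected input at {text[pos:]!r}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def evaluate(text, variables):
    """Evaluate one expression; `variables` maps names like "$A1_top" to numbers."""
    tokens = tokenize(text)

    def parse(i):
        tok = tokens[i]
        if tok.startswith("$"):
            return float(variables[tok]), i + 1
        if tok in FUNCTIONS:
            if tokens[i + 1] != "(":
                raise ValueError(f"expected '(' after {tok}")
            args, i = [], i + 2
            while True:
                value, i = parse(i)          # arguments may be nested calls
                args.append(value)
                if tokens[i] == ")":
                    return FUNCTIONS[tok](*args), i + 1
                if tokens[i] != ",":
                    raise ValueError(f"expected ',' or ')' but found {tokens[i]!r}")
                i += 1
        return float(tok), i + 1             # plain number

    value, end = parse(0)
    if end != len(tokens):
        raise ValueError("trailing input after expression")
    return value


# Hypothetical anchor coordinates, used only for illustration.
anchors = {"$A1_left": 50.0, "$A1_top": 700.0, "$A1_right": 300.0}
bbox = [evaluate(e, anchors)
        for e in ["$A1_left", "MINUS($A1_top,40)", "SUM($A1_right,100)", "$A1_top"]]
print(bbox)  # [50.0, 660.0, 400.0, 700.0]
```

With these sample values, the created element is 40 units tall, sits directly below the top edge of A1, and extends 100 units past its right edge.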
3. Anchor elements
Anchor elements are used to mark specific elements for later reference. To mark an object as an anchor, assign it "flag": "anchor" and a "name". The anchor's properties can then be referenced in other template rules. For example, an anchor named "A1" can be referenced as $A1_left or $A1_page_num.
Example:
"text_line_update": [
  {
    "comment": "Create anchor: Saldenmitteilung from a text_line object",
    "flag": "anchor",
    "name": "A1",
    "query": {
      "$and": [
        {
          "$0_fill_color": [
            "255",
            "0",
            "0"
          ]
        },
        {
          "$0_font_size": "11.5"
        }
      ],
      "param": [
        "pde_text_line"
      ]
    },
    "statement": "$if"
  }
]
Explanation:
- flag: Marks the element as an anchor.
- name: Assigns a name (A1) to the anchor for later reference.
- query: Defines conditions for creating the anchor.
  - $and: Logical AND operator.
    - $0_fill_color: Checks if the fill color of the element is [255, 0, 0] (red).
    - $0_font_size: Checks if the font size is 11.5.
  - param: Specifies the type of element being processed (pde_text_line).
- statement: Determines the evaluation logic ($if in this case).
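The naming scheme for anchor references can be pictured as a lookup table keyed by "$<name>_<property>". The Python sketch below is only an illustration: anchor_variables is a hypothetical helper, the bounding box is assumed to be ordered [left, bottom, right, top] as the bbox arrays in the examples suggest, and only the property suffixes that appear on this page (left, bottom, right, top, page_num) are included. The coordinates are sample values.

```python
def anchor_variables(name, bbox, page_num):
    """Build the "$<name>_<property>" lookup table for one anchor.

    Assumption: bbox is ordered [left, bottom, right, top].
    """
    left, bottom, right, top = bbox
    return {
        f"${name}_left": left,
        f"${name}_bottom": bottom,
        f"${name}_right": right,
        f"${name}_top": top,
        f"${name}_page_num": page_num,
    }


# Sample values for an anchor named A1 found on page 2.
table = anchor_variables("A1", [51.5, 639.5, 547.4, 721.4], 2)
print(table["$A1_left"], table["$A1_page_num"])  # 51.5 2
```

Tables for several anchors can be merged into one dictionary, since every key carries the anchor name as a prefix.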
4. Dynamic Bounding Box Calculations
Dynamic bounding box calculations are used to position elements relative to other elements or page dimensions. A dimension or coordinate of an anchor is used in this case.
Example:
"element_create": [
  {
    "elements": [
      {
        "bbox": [
          "$A1_left",
          "MINUS($A1_top,40)",
          "SUM($A1_right,100)",
          "$A1_top"
        ],
        "type": "pde_image"
      }
    ],
    "query": {},
    "statement": "$if"
  }
]
Explanation:
- bbox: The bounding box coordinates are dynamically calculated.
  - $A1_left: Refers to the left coordinate of the anchor element A1.
  - MINUS($A1_top,40): Subtracts 40 from the top coordinate of the anchor element A1.
  - SUM($A1_right,100): Adds 100 to the right coordinate of the anchor element A1.
  - $A1_top: Refers to the top coordinate of the anchor element A1.
5. Start-End Bounding Box Calculations
Dynamic start/end bounding box calculations are used to position elements relative to other elements or page dimensions by defining the start_bbox and end_bbox.
Example:
"element_create": [
  {
    "comment": "Initial elements on the top of Page 2 above the Datum table",
    "disable": "true",
    "elements": [
      {
        "comment": "Creates table below KostenANCHOR",
        "end_bbox": [
          "0",
          "$AngabenANCHOR_top",
          "$page_width",
          "$AngabenANCHOR_top"
        ],
        "start_bbox": [
          "0",
          "MINUS($KostenANCHOR_bottom,50)",
          "$page_width",
          "MINUS($KostenANCHOR_bottom,50)"
        ],
        "type": "pde_table"
      }
    ],
    "query": {
      "$and": [
        {
          "$page_num": "2"
        }
      ]
    },
    "statement": "$if"
  }
]
Explanation:
- start_bbox: The bounding box that marks where the element starts.
  - 0: Refers to the left-most coordinate on the page.
  - MINUS($KostenANCHOR_bottom,50): Subtracts 50 from the bottom coordinate of the anchor element KostenANCHOR.
  - $page_width: Refers to the right-most coordinate on the page.
  - MINUS($KostenANCHOR_bottom,50): Subtracts 50 from the bottom coordinate of the anchor element KostenANCHOR.
- end_bbox: The bounding box that marks where the element ends.
  - 0: Refers to the left-most coordinate on the page.
  - $AngabenANCHOR_top: Refers to the top coordinate of the anchor element AngabenANCHOR.
  - $page_width: Refers to the right-most coordinate on the page.
  - $AngabenANCHOR_top: Refers to the top coordinate of the anchor element AngabenANCHOR.
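One way to picture the result is as the region that spans from the start box to the end box. The Python sketch below is an assumption made for illustration only: this page does not specify how start_bbox and end_bbox are combined, so span_bbox (a hypothetical helper) simply takes the smallest box that covers both, with boxes ordered [left, bottom, right, top]. The page width and anchor coordinates are sample values.

```python
def span_bbox(start_bbox, end_bbox):
    """Smallest box [left, bottom, right, top] that covers both input boxes."""
    return [
        min(start_bbox[0], end_bbox[0]),
        min(start_bbox[1], end_bbox[1]),
        max(start_bbox[2], end_bbox[2]),
        max(start_bbox[3], end_bbox[3]),
    ]


# Sample values standing in for $page_width and the two anchors.
page_width = 595.0
kosten_bottom = 600.0   # $KostenANCHOR_bottom
angaben_top = 200.0     # $AngabenANCHOR_top

# Both boxes are zero-height horizontal lines across the full page width.
start = [0.0, kosten_bottom - 50, page_width, kosten_bottom - 50]
end = [0.0, angaben_top, page_width, angaben_top]

print(span_bbox(start, end))  # [0.0, 200.0, 595.0, 550.0]
```

With these sample values the table region starts 50 units below KostenANCHOR and runs down to the top of AngabenANCHOR, so its height follows the distance between the two anchors on each page.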