Template Language for PDF AutoTagging version 2 (Beta)

Summary of Key Features:

  1. Nested Templates: Allow for additional processing rules to be applied to elements after they are created.
  2. Math Expressions: Enable dynamic calculations for element properties like bounding box coordinates.
  3. Anchors: Mark specific elements for later reference in math formulas or other rules.
  4. Dynamic Bounding Box Calculations: Position elements relative to other elements or page dimensions.
  5. Start-End Bounding Box Calculations: Position elements whose extent is determined dynamically by separate start and end elements.

Summary of Changes:

Metadata

  • Changed { "metadata": { "version": "2" } }

Settings

  • Removed { "settings": { "retain_pdfua": "" } }
  • Added { "settings": { "rtl": "false" } }
    • Prepared for future support of right-to-left reading order languages.
  • Renamed key { "settings": { "substructure_form_xobject": "…" } }
    • Old name: "substructure_form_XObject"

Pagemap

New keys

{
    "template": {
        "pagemap": [
            {
                "form_table_detect": "1",
                "graphic_table_detect": "1",
                "col_num": "0",
                "label_image_detect": "1",
                "label_word_detect": "1",
                "label_word_dist_sibling_ratio": "4",
                "text_table_alignment_distance": "0.4",
                "text_table_alignment_type": "2",
                "text_table_col_alignment_type": "1",
                "text_table_col_distance": "0.8",
                "text_table_col_similarity": "0.36",
                "text_table_col_similarity_type": "0",
                "text_table_column_similarity": "0.5",
                "text_table_detect": "1",
                "text_table_image_col_gs": "1",
                "text_table_image_col_height_max": "0",
                "text_table_image_col_height_max_ratio": "2",
                "text_table_image_col_height_min": "0",
                "text_table_image_col_height_min_ratio": "1",
                "text_table_image_col_w1": "1",
                "text_table_image_col_width_max": "0",
                "text_table_image_col_width_max_ratio": "4",
                "text_table_image_col_width_min": "0",
                "text_table_image_col_width_min_ratio": "1",
                "text_table_image_similarity": "0.7",
                "text_table_image_similarity_w1": "1",
                "text_table_image_similarity_w2": "1",
                "text_table_paragraph_similarity": "0.7",
                "text_table_row_alignment_type": "1",
                "text_table_text_col_w1": "1",
                "text_table_text_col_w2": "1",
                "text_table_text_col_width_max": "0",
                "text_table_text_col_width_max_ratio": "8",
                "text_table_text_col_width_min": "0",
                "text_table_text_col_width_min_ratio": "1",
                "toc_detect": "1",
                "toc_word_distance_ratio": "1"
            }
        ]
    }
}
Name | Default Value | Description
form_table_detect | 1 | Recognize form fields as tables. Replaces the old key table_detect_form.
graphic_table_detect | 1 | Graphic table detection. Possible values: 0 or 1. If set to 0, tables are not generated from paths.
col_num | 0 | Number of columns. If 0, columns are detected automatically; otherwise col_num is used for layout detection.
label_image_detect | 1 | Graphic label detection. Possible values: 0 or 1. If set to 0, labels are not generated from paths.
label_word_detect | 1 | Text label detection. Possible values: 0 or 1. If set to 0, labels are not generated from words.
label_word_dist_sibling_ratio | 4 | Maximum gap allowed between a label and its sibling element for the two to be joined, expressed as a ratio multiplied by the sibling's font size. If the distance exceeds this value, the label and its sibling remain separate.
text_table_<…> |  | Replace the previous sect_table_<…> keys.
toc_detect | 1 | Table of contents detection. Possible values: 0 or 1. If set to 0, a TOC is not generated from words.
toc_word_distance_ratio | 0 | Threshold ratio that determines when TOC entries are clustered together. The value is multiplied by the average page font width.

Changed default values

{
    "template": {
        "pagemap": [
            {
                "initial_element_expansion": "1",
                "label_image_distance": "4",
                "label_image_w4": "4",
                "label_image_width_min_ratio": "4",
                "text_line_distance_max_ratio": "2",
                "toc_word_distance": "0",
                "word_distance_ratio": "0.4",
                "word_space_distance_max_ratio": "4"
            }
        ]
    }
}

Removed Keys

{
    "template": {
        "pagemap": [
            {
                "label_image_w6": "1",
                "label_image_w7": "1",
                "label_image_w8": "0",
                "label_image_w9": "0",
                "label_sibling_dist_ratio": "1.2",
                "sect_table_alignment_distance": "0.4",
                "sect_table_alignment_type": "2",
                "sect_table_col_alignment_type": "1",
                "sect_table_col_distance": "0.8",
                "sect_table_col_similarity": "0.36",
                "sect_table_col_similarity_type": "0",
                "sect_table_column_similarity": "0.5",
                "sect_table_image_col_gs": "1",
                "sect_table_image_col_height_max": "0",
                "sect_table_image_col_height_max_ratio": "2",
                "sect_table_image_col_height_min": "0",
                "sect_table_image_col_height_min_ratio": "1",
                "sect_table_image_col_w1": "1",
                "sect_table_image_col_width_max": "0",
                "sect_table_image_col_width_max_ratio": "4",
                "sect_table_image_col_width_min": "0",
                "sect_table_image_col_width_min_ratio": "1",
                "sect_table_image_similarity": "0.7",
                "sect_table_image_similarity_w1": "1",
                "sect_table_image_similarity_w2": "1",
                "sect_table_paragraph_similarity": "0.7",
                "sect_table_row_alignment_type": "1",
                "sect_table_text_col_w1": "1",
                "sect_table_text_col_w2": "1",
                "sect_table_text_col_width_max": "0",
                "sect_table_text_col_width_max_ratio": "8",
                "sect_table_text_col_width_min": "0",
                "sect_table_text_col_width_min_ratio": "1",
                "table_detect_form": "1",
                "table_detect_sect": "1",
                "text_line_subscript_len": "6",
                "text_line_subscript_space_ratio": "0.5"
            }
        ]
    }
}
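
The renames listed so far can be applied mechanically to an existing version 1 template. The following is only a minimal sketch, not an official migration tool: migrate_template and migrate_pagemap_entry are hypothetical helper names, and the sketch assumes that "metadata", "settings" and "template" are top-level keys of the template JSON. It covers only the renames stated above (the sect_table_ prefix, table_detect_form, substructure_form_XObject) plus the version bump and the removal of retain_pdfua. Removed keys without a documented replacement are left in place for manual review.

```python
import copy

# Renames stated in the change summary; every other key is left untouched.
RENAMED_KEYS = {"table_detect_form": "form_table_detect"}
OLD_PREFIX, NEW_PREFIX = "sect_table_", "text_table_"


def migrate_pagemap_entry(entry):
    """Return a copy of one pagemap entry with v1 key names mapped to v2 names."""
    migrated = {}
    for key, value in entry.items():
        if key in RENAMED_KEYS:
            key = RENAMED_KEYS[key]
        elif key.startswith(OLD_PREFIX):
            key = NEW_PREFIX + key[len(OLD_PREFIX):]
        migrated[key] = value
    return migrated


def migrate_template(doc):
    """Hypothetical helper: apply the documented v1 -> v2 changes to a template dict."""
    out = copy.deepcopy(doc)  # never modify the caller's data
    out.setdefault("metadata", {})["version"] = "2"

    settings = out.get("settings", {})
    settings.pop("retain_pdfua", None)  # removed in version 2
    if "substructure_form_XObject" in settings:
        settings["substructure_form_xobject"] = settings.pop("substructure_form_XObject")

    template = out.get("template", {})
    if "pagemap" in template:
        template["pagemap"] = [migrate_pagemap_entry(e) for e in template["pagemap"]]
    return out
```

Typical use would be to load the old template with json.load, pass the resulting dict to migrate_template, and write the result back with json.dump.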

Pagemap Regex

New keys

Added a new pagemap_regex section for "rtl": "true", for future support of RTL languages.

Functions

New Functions

Function Name | Description
object_update | Triggered when a page content object is tested. Replaces the old functions form_object_process, path_object_process, image_object_process, and text_obect_process; the object type is passed as a parameter.
text_run_update | Updates a text run element after processing text objects.
text_run_neighbours | Triggered when forming text lines from text runs.
word_neighbours | Triggered when forming text lines from words. Replaces the word_spacing function.

Removed Functions

Function Name | Description
form_object_process, path_object_process, image_object_process, text_obect_process | Replaced by the object_update function, which receives the object type as a parameter.
word_spacing, text_line_add_word | Replaced by the word_neighbours function.
single_instance_detect | TBD


1. Nested Templates in element_create

Nested templates allow for additional processing rules to be applied to specific elements after they are created. This is particularly useful for defining complex behaviors for elements like text lines or tables.

Example:

"element_create": [
    {
        "comment": "Initial elements on the top of Page 2 above the Datum table",
        "disable": "true",
        "elements": [
            {
                "bbox": [
                    "51.52302932739258",
                    "639.4674682617188",
                    "547.4321899414062",
                    "721.3561401367188"
                ],
                "comment": "IBAN, BIC",
                "flag": "no_table",
                "element_template": {
                    "template": {
                        "text_line_update": [
                            {
                                "query": {
                                    "param": [
                                        "pde_text_line"
                                    ]
                                },
                                "statement": "$if",
                                "text_line_flag": "new_line"
                            }
                        ]
                    }
                },
                "type": "pde_text"
            }
        ],
        "query": {
            "$and": [
                {
                    "$page_num": "2"
                }
            ]
        },
        "statement": "$if"
    }
]

Explanation:

  • element_template: A nested template that defines additional rules for the element.
    • text_line_update: Updates text lines within the element.
      • query: Defines conditions for applying the update.
        • param: Specifies the type of element being processed (pde_text_line).
      • statement: Determines the evaluation logic ($if in this case).
      • text_line_flag: Marks the text line with a specific flag (new_line in this case).

2. Math expressions

Math formulas like SUM(...) and MINUS(...) are used to dynamically calculate values for element properties such as bounding box coordinates.

Supported functions:

Function | Syntax | Description
SUM | SUM(value1, value2) | Returns the sum of two numbers. Equivalent to the `+` operator.
MINUS | MINUS(value1, value2) | Returns the difference of two numbers. Equivalent to the `-` operator.
ABS | ABS(value1) | Returns the absolute value of a number.
MULTIPLY | MULTIPLY(value1, value2) | Returns the product of two numbers. Equivalent to the `*` operator.
DIVIDE | DIVIDE(value1, value2) | Returns one number divided by another. Equivalent to the `/` operator.
MIN | MIN(value1, value2) | Returns the smaller of two numbers.
MAX | MAX(value1, value2) | Returns the larger of two numbers.
MOD | MOD(dividend, divisor) | Returns the result of the modulo operator, the remainder after a division operation.
FLOOR | FLOOR(value1) | Rounds a number down to the nearest integer.
CEILING | CEILING(value1) | Rounds a number up to the nearest integer.

Example:

"element_create": [
    {
        "elements": [
            {
                "bbox": [
                    "$A1_left",
                    "MINUS($A1_top,40)",
                    "SUM($A1_right,100)",
                    "$A1_top"
                ],
                "type": "pde_image"
            }
        ],
        "query": {},
        "statement": "$if"
    }
]

Explanation:

  • bbox: The bounding box coordinates are calculated using math formulas.
    • $A1_left: Refers to the left coordinate of the anchor element A1.
    • MINUS($A1_top,40): Subtracts 40 from the top coordinate of the anchor element A1.
    • SUM($A1_right,100): Adds 100 to the right coordinate of the anchor element A1.
    • $A1_top: Refers to the top coordinate of the anchor element A1.
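
The way such expressions resolve to numbers can be illustrated with a small stand-alone evaluator. This is a sketch for reasoning about templates, not the engine's actual implementation: evaluate is a hypothetical function, the anchor values are made up, nested calls are assumed to be allowed, and MOD follows Python's `%` semantics, which may differ from the engine for negative operands.

```python
import math
import re

# Function table mirroring the supported functions listed above.
FUNCTIONS = {
    "SUM": lambda a, b: a + b,
    "MINUS": lambda a, b: a - b,
    "ABS": abs,
    "MULTIPLY": lambda a, b: a * b,
    "DIVIDE": lambda a, b: a / b,
    "MIN": min,
    "MAX": max,
    "MOD": lambda a, b: a % b,  # assumption: Python remainder semantics
    "FLOOR": math.floor,
    "CEILING": math.ceil,
}

# Tokens: $variable, FUNCTION name, number, or one of ( ) ,
TOKEN = re.compile(r"\s*(\$\w+|[A-Z]+|-?\d+(?:\.\d+)?|[(),])")


def tokenize(text):
    text = text.strip()
    pos, tokens = 0, []
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if not match:
            raise ValueError(f"unexpected input: {text[pos:]!r}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def _parse(tokens, i, variables):
    """Parse one value starting at tokens[i]; return (value, next_index)."""
    if i >= len(tokens):
        raise ValueError("unexpected end of expression")
    tok = tokens[i]
    if tok.startswith("$"):
        return float(variables[tok[1:]]), i + 1
    if tok.isalpha():
        if tok not in FUNCTIONS:
            raise ValueError(f"unknown function: {tok}")
        if i + 1 >= len(tokens) or tokens[i + 1] != "(":
            raise ValueError(f"expected '(' after {tok}")
        args, i = [], i + 2
        while True:
            value, i = _parse(tokens, i, variables)
            args.append(value)
            if i >= len(tokens):
                raise ValueError("missing ')'")
            if tokens[i] == ")":
                return float(FUNCTIONS[tok](*args)), i + 1
            if tokens[i] != ",":
                raise ValueError(f"expected ',' or ')', got {tokens[i]!r}")
            i += 1
    return float(tok), i + 1  # plain number; raises ValueError on stray punctuation


def evaluate(expression, variables):
    """Evaluate one template expression against a dict of variable values."""
    tokens = tokenize(expression)
    value, end = _parse(tokens, 0, variables)
    if end != len(tokens):
        raise ValueError("trailing input after expression")
    return value


# Made-up anchor coordinates, for illustration only.
anchor = {"A1_left": 100, "A1_top": 700, "A1_right": 300}
bbox = [evaluate(e, anchor) for e in
        ["$A1_left", "MINUS($A1_top,40)", "SUM($A1_right,100)", "$A1_top"]]
print(bbox)  # [100.0, 660.0, 400.0, 700.0]
```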

3. Anchor elements

Anchor elements mark specific elements for later reference. Assign "flag": "anchor" and a "name" to the object that should become an anchor; its properties can then be referenced in other template rules.
For example, $A1_left and $A1_page_num refer to properties of the anchor named "A1".

Example:

"text_line_update": [
    {
        "comment": "Create anchor: Saldenmitteilung from a text_line object",
        "flag": "anchor",
        "name": "A1",
        "query": {
            "$and": [
                {
                    "$0_fill_color": [
                        "255",
                        "0",
                        "0"
                    ]
                },
                {
                    "$0_font_size": "11.5"
                }
            ],
            "param": [
                "pde_text_line"
            ]
        },
        "statement": "$if"
    }
]

Explanation:

  • flag: Marks the element as an anchor.
  • name: Assigns a name (A1) to the anchor for later reference.
  • query: Defines conditions for creating the anchor.
    • $and: Logical AND operator; all listed conditions must hold.
      • $0_fill_color: Checks if the fill color of the element is [255, 0, 0] (red).
      • $0_font_size: Checks if the font size is 11.5.
    • param: Specifies the type of element being processed (pde_text_line).
  • statement: Determines the evaluation logic ($if in this case).
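
The relationship between an anchor rule and the variables it exposes can be sketched as follows. This is a simplified model, not the engine's matching logic: matches, anchor_variables and collect_anchors are hypothetical helpers, elements are represented as plain dicts with made-up values, the "$0_" prefix is assumed to address a property of the element under test, and the bounding box is assumed to be ordered [left, bottom, right, top].

```python
def to_number(value):
    """Template values are strings; compare them numerically."""
    if isinstance(value, (list, tuple)):
        return [float(v) for v in value]
    return float(value)


def matches(query, element):
    """True if the element has a listed type and satisfies every $and condition."""
    if element.get("type") not in query.get("param", []):
        return False
    for condition in query.get("$and", []):
        for key, expected in condition.items():
            prop = key[len("$0_"):] if key.startswith("$0_") else key.lstrip("$")
            if prop not in element or to_number(element[prop]) != to_number(expected):
                return False
    return True


def anchor_variables(name, element):
    """Variables exposed by an anchor, keyed without the leading '$'."""
    left, bottom, right, top = element["bbox"]  # assumed coordinate order
    return {
        f"{name}_left": left,
        f"{name}_bottom": bottom,
        f"{name}_right": right,
        f"{name}_top": top,
        f"{name}_page_num": element["page_num"],
    }


def collect_anchors(rules, elements):
    """Register the first matching element of each anchor rule."""
    variables = {}
    for rule in rules:
        if rule.get("flag") != "anchor":
            continue
        for element in elements:
            if matches(rule.get("query", {}), element):
                variables.update(anchor_variables(rule["name"], element))
                break
    return variables
```

With a red 11.5 pt text line on page 2, the rule above would yield variables such as A1_left and A1_page_num, which other rules then reference as $A1_left and $A1_page_num.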

4. Dynamic Bounding Box Calculations

Dynamic bounding box calculations are used to position elements relative to other elements or page dimensions. A dimension or coordinate of an anchor is used in this case.

Example:

"element_create": [
    {
        "elements": [
            {
                "bbox": [
                    "$A1_left",
                    "MINUS($A1_top,40)",
                    "SUM($A1_right,100)",
                    "$A1_top"
                ],
                "type": "pde_image"
            }
        ],
        "query": {},
        "statement": "$if"
    }
]

Explanation:

  • bbox: The bounding box coordinates are dynamically calculated.
    • $A1_left: Refers to the left coordinate of the anchor element A1.
    • MINUS($A1_top,40): Subtracts 40 from the top coordinate of the anchor element A1.
    • SUM($A1_right,100): Adds 100 to the right coordinate of the anchor element A1.
    • $A1_top: Refers to the top coordinate of the anchor element A1.

5. Start-End Bounding Box Calculations

Dynamic start/end bounding box calculations are used to position elements relative to other elements or page dimensions by defining the start_bbox and end_bbox.

Example:

"element_create": [
    {
        "comment": "Initial elements on the top of Page 2 above the Datum table",
        "disable": "true",
        "elements": [
            {
                "comment": "Creates table below KostenANCHOR",
                "end_bbox": [
                    "0",
                    "$AngabenANCHOR_top",
                    "$page_width",
                    "$AngabenANCHOR_top"
                ],
                "start_bbox": [
                    "0",
                    "MINUS($KostenANCHOR_bottom,50)",
                    "$page_width",
                    "MINUS($KostenANCHOR_bottom,50)"
                ],
                "type": "pde_table"
            }
        ],
        "query": {
            "$and": [
                {
                    "$page_num": "2"
                }
            ]
        },
        "statement": "$if"
    }
]

Explanation:

  • start_bbox: The bounding box that marks where the element starts.
    • 0: Refers to the left-most coordinate on the page.
    • MINUS($KostenANCHOR_bottom,50): Subtracts 50 from the bottom coordinate of the anchor element KostenANCHOR.
    • $page_width: Refers to the right-most coordinate on the page.
    • MINUS($KostenANCHOR_bottom,50): The same value again, so the start box is a horizontal line 50 units below KostenANCHOR.
  • end_bbox: The bounding box that marks where the element ends.
    • 0: Refers to the left-most coordinate on the page.
    • $AngabenANCHOR_top: Refers to the top coordinate of the anchor element AngabenANCHOR.
    • $page_width: Refers to the right-most coordinate on the page.
    • $AngabenANCHOR_top: The same value again, so the end box is a horizontal line at the top of AngabenANCHOR.
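
One way to picture the result is as the rectangle that spans both boxes. The sketch below rests on two assumptions that the text above does not confirm: that coordinates are ordered [left, bottom, right, top] with the origin at the bottom-left of the page, and that the created element covers the union of start_bbox and end_bbox. span_region is a hypothetical helper and all numbers are made up.

```python
def span_region(start_bbox, end_bbox):
    """Smallest [left, bottom, right, top] rectangle containing both boxes."""
    return [
        min(start_bbox[0], end_bbox[0]),
        min(start_bbox[1], end_bbox[1]),
        max(start_bbox[2], end_bbox[2]),
        max(start_bbox[3], end_bbox[3]),
    ]


# Made-up values standing in for $page_width and the two anchors.
page_width = 595.0
kosten_bottom = 600.0   # $KostenANCHOR_bottom
angaben_top = 200.0     # $AngabenANCHOR_top

start_bbox = [0.0, kosten_bottom - 50, page_width, kosten_bottom - 50]
end_bbox = [0.0, angaben_top, page_width, angaben_top]

print(span_region(start_bbox, end_bbox))  # [0.0, 200.0, 595.0, 550.0]
```

Under these assumptions the table would cover the full page width, from 50 units below KostenANCHOR down to the top of AngabenANCHOR.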