Template

A template serves as a tool for tailoring the content recognition algorithm, allowing for customization to match the specific requirements of the document.

Recognition can be tailored to specific objects by considering their type, style, position, or other properties.

For example, a template can help identify headings by analyzing the font name and size.

PDFix Template

General Settings

General template settings

key type value
retain_pdfua bool True
page_tag string NonStruct
substructure_form_XObject bool True

Example:

{
  "template": {
    "settings": {
      "retain_pdfua": true,
      "page_tag": "NonStruct",
      "substructure_form_XObject": true
    }
  }
}
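
The following is a minimal Python sketch of how a tool might read these settings from a template file. It assumes that the values in the table above are the defaults applied when a key is missing; the helper name `read_settings` and the sample value `"Sect"` are illustrative only and are not part of PDFix.

```python
import json

# Values copied from the general settings table (assumed to be the defaults).
DEFAULT_SETTINGS = {
    "retain_pdfua": True,
    "page_tag": "NonStruct",
    "substructure_form_XObject": True,
}


def read_settings(template_json: str) -> dict:
    """Return the general settings, using the table values for missing keys."""
    data = json.loads(template_json)
    settings = data.get("template", {}).get("settings", {})
    return {**DEFAULT_SETTINGS, **settings}


# Hypothetical template that overrides only page_tag.
example = '{"template": {"settings": {"page_tag": "Sect"}}}'
print(read_settings(example))
```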

Threshold Values

key value description
preflight_columns 1 Preflight columns during the preflight method. Set to 0 if this information is not required.
preflight_column_width_distance 20 Columns width detection precision.
preflight_column_alignment_distance 4 Columns alignment detection precision.
preflight_artifact_font_size_min 32 Minimum font size for artifact
preflight_artifact_w1 1 Horizontal alignment weight.
preflight_artifact_w2 1 Vertical alignment weight.
preflight_artifact_w3 1 Element width weight.
preflight_artifact_w4 1 Element height(for images) or font size(for text) weight.
preflight_artifact_w5 1 Page numbers weight.
preflight_artifact_distance 0.7 Maximum distance, in the interval [0,1], at which elements can be an artifact/header/footer.
preflight_artifact_cluster_points 2 Minimal number of points within the preflight_artifact_distance radius.
concurrent_threads 0 The number of concurrent threads. If zero, the number of concurrent threads supported by the implementation is used. If it’s set to 1, no parallel algorithms are used.
text_only 0 Process only texts in pagemap.
rotation_detect 1 Detect and correct page rotation for reading.
background_color_red 255 Page background color – red.
background_color_green 255 Page background color – green.
background_color_blue 255 Page background color – blue.
background_color_diff 2 Page background color max color component difference.
bbox_expansion 4 Bounding box expansion – half of kTrTextHeight.
angle_deviation 0.015707963267949 Maximum angle deviation for horizontal and vertical elements.
header_ratio 0.15 Maximum percentage of a header height. Possible values from interval [0,1].
footer_ratio 0.15 Maximum percentage of a footer height. Possible values from interval [0,1].
artifact_w1 1 Artifact page border distance weight.
artifact_w2 1 Artifact image area weight.
artifact_border_distance_max 2 Maximum distance of artifact to page border.
artifact_font_size 40 All text with a font size larger than this is an artifact.
artifact_similarity 0.7 Minimum similarity value when object or element is an artifact normalized to interval [0,1].
object_intersection 0.1 Expansion for page objects intersection.
object_isolated 16 Maximum isolated object height and width. It’s used for joining isolated objects.
path_object_max 2000 Maximum number of subsequence path objects that are still paths.
path_object_min 100 Minimum number of subsequence path objects that are still paths.
isolated_text_ratio 10 Maximum isolated text width ratio. Is multiplied with the font size.
isolated_text 80 Maximum isolated text width.
isolated_element_ratio 6 Maximum isolated element width/height ratio. Is multiplied with the font size.
initial_element_expansion 0 Initial element bounding box expansion when searching children. Size in points. If it is zero, half of the default page font size is used.
element_isolated_w1 1 Element paragraph weight.
element_isolated_w2 1 Element width weight.
element_isolated_caption 1 If set to 1 and the element contains a caption (table, image, chart, note), do not mark it as an isolated element.
element_isolated_width_min 0 Minimal value of bbox width for isolated element. If zero, element_isolated_width_min_ratio is used. Size in points.
element_isolated_width_min_ratio 4 Minimal value of bbox width for isolated element, multiplied by the average page font size.
element_isolated_width_max 0 Maximal value of bbox width for isolated element. If zero, element_isolated_width_max_ratio is used. Size in points.
element_isolated_width_max_ratio 10 Maximal value of bbox width for isolated element, multiplied by the average page font size.
element_isolated_similarity 0.7 Minimum similarity value when element is isolated normalized to interval [0,1].
element_isolated_image_w1 1 Image vs page area weight.
element_isolated_image_w2 1 Elements isolated similarity weight.
element_isolated_image_w3 1 Images area vs join image area weight.
element_isolated_image_similarity 0.7 Minimum similarity value when isolated elements can be added to an image.
element_line_w1 1 Line width weight.
element_line_width_max 8 Maximal value of line width. If zero, element_line_width_max_ratio is used. Size in points.
element_line_width_max_ratio 1 Maximal value of line width, multiplied by the average page font size.
element_line_similarity 0.6 Minimum similarity value when element is recognized as a line, normalized to interval [0,1].
element_alignment_ratio 0.5 Ratio between baseline and bounding box alignments. Bounding box alignment precision is multiplied by element_alignment_ratio.
rect_image_similarity 0.7 Minimum similarity value when rectangle is image normalized to interval [0,1].
image_background_text 1 Text bounding box expansion.
image_overlap_distance 1 Maximum distance value when graphic page objects can be joined. Distance in points.
image_join_distance 8 Maximum distance value when graphic page objects can be joined. Distance in points.
char_clip_ratio 0.5 Minimal ratio of the clipping area of the character compared to its original size.
word_space_width_ratio 0.6 Default char space width ratio (percentage) based on the font size.
word_space_width_min_ratio 0.1 Minimal acceptable space width ratio.
word_space_type 0 The space’s advance width detection type. If 0, the font space width is calculated as the average of all character widths. If -1, the maximum width is used. If -2, the space character width is used. If the value is greater than 0, the font size (em) multiplied by this number is used.
word_space_distance_max 0 Maximum distance between chunks. If zero, word_space_distance_max_ratio is used. Size in points.
word_space_distance_max_ratio 10 Maximum distance between chunks. The value is multiplied by simple word spacing between words.
word_space_update 0.1 Maximum distance of word spacing. Size in points.
word_distance 0 Maximum distance value in word spacing detection. If zero, word_distance_ratio is used. Size in points.
word_distance_ratio 0.1 Maximum distance value in word spacing detection. The value is multiplied by word font size.
word_space_splitter_ratio 2 Minimum space before splitter. The value is multiplied by most used font size.
word_splitter_distance 4 Maximum threshold value for word splitters detections. Real distance in points.
text_line_baseline_ratio 0.1 Maximum baseline shift. Value multiplies minimal font. Baseline shift moves individual characters up or down in relation to other text on the same line.
text_line_underline_distance 2.6 Distance of the underline line and text baseline. Size in points.
text_line_underline_char_distance_ratio 0.1 Distance of the underline line start/end point and character bounding box. The value is multiplied by line font size. Size in points.
text_line_subscript_space_ratio 0.5 Subscript space multiplied with word space.
text_line_subscript_font_ratio 1 Maximum subscript font size, multiplied by the line font size.
text_line_subscript_len 6 Maximum number of characters in subscript.
text_line_join_font_size_distance 0 Distance of two fonts in points, when two lines with different fonts can be joined.
text_line_distance_max 0 Maximum distance between lines. If zero, text_line_distance_max_ratio is used. Size in points.
text_line_distance_max_ratio 3.6 Maximum distance between lines. The value is multiplied by line font size.
text_line_join_distance 2 Maximum threshold value in line spacing detection for specific font size. The higher value allows creating paragraph with variable line spacings. The value is multiplied by font size.
text_chunk_distance_max 0 Maximum distance between chunks. If zero, text_chunk_distance_max_ratio is used. Size in points.
text_chunk_distance_max_ratio 6 Maximum distance between chunks. The value is multiplied by simple word spacing between words.
text_chunk_distance 0 Maximum distance value when text chunks are vertically aligned. If zero, text_chunk_distance_ratio is used. Size in points.
text_chunk_distance_ratio 0.42 Maximum distance value when text chunks are vertically aligned. The value is multiplied by page font width.
text_chunk_simple_distance 0.4 Maximum distance value when text chunks create simple line. Normalized to interval [0,1].
text_chunk_word_distance 0.1 Maximum distance value when single line text has to be split to words. Normalized to interval [0,1].
text_height 8 Minimal text height on the page.
text_simple_similarity 0.96 Minimum similarity value when text lines create a simple paragraph normalized to interval [0,1].
text_justify_similarity 0.96 Minimum similarity value when text lines create a justify paragraph normalized to interval [0,1].
text_table_similarity 0.65 Minimum similarity value when text lines create a table normalized to interval [0,1].
text_paragraph_similarity 0.7 Minimum similarity value when text is paragraph normalized to interval [0,1].
text_split_distance 0.2 Dissimilarity boundary value when text lines creates a paragraph.
text_heading_h1 0 Minimal font size for H1.
text_heading_h2 0 Minimal font size for H2.
text_heading_h3 0 Minimal font size for H3.
text_column_similarity 0.7 Minimum similarity value that text creates a column normalized to interval [0,1].
label_alignment_h 2 Maximum deviation of horizontal label alignment.
label_distance_ratio 10 Distance of the label and text. Is multiplied with the page most used font size.
label_baseline_ration 0.14 Multiplies minimal font. Maximum deviation of horizontal label aligned to text.
label_image_w1 1 Labels alignment weight. Accepted only if labels overlap.
label_image_w2 1 Labels and sibling distance weight.
label_image_w3 1 Siblings font size difference weight. Accepted only if it’s not equal.
label_image_w4 0.25 Labels width difference weight.
label_image_w5 0.25 Labels height difference weight.
label_image_w6 1 Label and sibling height weight.
label_image_w7 1 Label width weight.
label_image_w8 0 Labels type weight. If set to 1, accept only same type labels.
label_image_w9 0 Labels graphic state weight. If set to 1, labels have to have same graphics state.
label_image_w10 0 Siblings text state weight. If set to 1, siblings have to have same text state.
label_image_width_min 0 Minimal value of bbox width for label object. If zero, label_image_width_min_ratio is used. Size in points.
label_image_width_min_ratio 1 Minimal value of bbox width for label object, multiplied by the average page font size.
label_image_width_max 0 Maximal value of bbox width for label object. If zero, label_image_width_max_ratio is used. Size in points.
label_image_width_max_ratio 6 Maximal value of bbox width for label object, multiplied by the average page font size.
label_image_distance 8 Minimum distance value when graphic page object is label. Distance in points.
label_word_w1 1 Labels alignment weight.
label_word_w2 1 Labels and sibling distance weight.
label_word_distance 0 Maximum distance value when a word is a label. Distance in points.
label_word_distance_ratio 1 Maximum distance value when a word is a label. The value is multiplied by the average page font width.
label_sibling_dist_ratio 1.2 Minimum distance of the left label sibling. The ratio is multiplied by the right sibling distance.
toc_word_w1 1 Tocs words alignment weight.
toc_word_distance 0.6 Minimum distance value when word is a label. Distance in points.
graphic_table_alignment_distance 0.8 Maximum alignment distance value when elements can create a table. Distance in points.
graphic_table_split_w1 1 Table texts paragraph weight.
graphic_table_split_w2 1 Table texts horizontal alignment weight.
graphic_table_split_w3 1 Columns width weight.
graphic_table_split_w4 0.5 Number of columns weight.
graphic_table_split_w5 0.5 Number of rows weight.
graphic_table_split_w6 1 Page area weight.
graphic_table_split_col_max 5 Maximal number of columns when table can be split.
graphic_table_split_row_max 5 Maximal number of rows when table can be split.
graphic_table_split_similarity 0.7 Minimum similarity value when graphic table has to be split.
graphic_table_split_layout_similarity 0.7 Minimum similarity value when graphic table has to be split.
graphic_table_chart_similarity 0.3 Minimum similarity value when graphic table is a chart.
graphic_table_image_w1 -1 Images area weight. If -1, number of images is used.
graphic_table_image_w2 -1 Images weight. If -1, number of images is used.
graphic_table_image_w3 -1 Chart similarity weight. If -1, number of paths is used.
graphic_table_image_w4 1 Texts vertical alignment weight.
graphic_table_image_w5 1 Table size weight.
graphic_table_image_similarity 0.7 Minimum similarity value when graphic table has an image.
sect_table_row_alignment_type 1 Table row alignment type [0 – strong, 1 – average, 2 – weak].
sect_table_col_alignment_type 1 Table column alignment type [0 – strong, 1 – average, 2 – weak].
sect_table_col_similarity_type 0 Table column similarity type [0 – column alignment distance, 1 – element distance, 2 – element size, 3 – max].
sect_table_col_distance 0.8 Maximum deviation value for detecting the nearest distances for table columns. Real distance in points.
sect_table_col_similarity 0.36 Minimum similarity value when elements create table column.
sect_table_alignment_type 2 Table column alignment type [0 – strong, 1 – average, 2 – weak]. Select strong for strictly aligned table elements.
sect_table_alignment_distance 0.4 Maximum threshold value for detection section tables.
sect_table_text_col_w1 1 Text column paragraph weight.
sect_table_text_col_w2 1 Text column width weight.
sect_table_text_col_width_min 0 Minimal value of bbox width for text in table column. If zero, sect_table_text_col_width_min_ratio is used. Size in points.
sect_table_text_col_width_min_ratio 1 Minimal value of bbox width for text in table column, multiplied by the average page font size.
sect_table_text_col_width_max 0 Maximal value of bbox width for text in table column. If zero, sect_table_text_col_width_max_ratio is used. Size in points.
sect_table_text_col_width_max_ratio 8 Maximal value of bbox width for text in table column, multiplied by the average page font size.
sect_table_image_col_w1 1 Image column weight.
sect_table_image_col_gs 1 If set to 1, image column has to have same graphics state.
sect_table_image_col_width_min 0 Minimal value of bbox width for image in table column. If zero, sect_table_image_col_width_min_ratio is used. Size in points.
sect_table_image_col_width_min_ratio 1 Minimal value of bbox width for image in table column, multiplied by the average page font size.
sect_table_image_col_width_max 0 Maximal value of bbox width for image in table column. If zero, sect_table_image_col_width_max_ratio is used. Size in points.
sect_table_image_col_width_max_ratio 4 Maximal value of bbox width for image in table column, multiplied by the average page font size.
sect_table_image_col_height_min 0 Minimal value of bbox height for image in table column. If zero, sect_table_image_col_height_min_ratio is used.
sect_table_image_col_height_min_ratio 1 Minimal value of bbox height for image in table column, multiplied by the average page font size.
sect_table_image_col_height_max 0 Maximal value of bbox height for image in table column. If zero, sect_table_image_col_height_max_ratio is used.
sect_table_image_col_height_max_ratio 2 Maximal value of bbox height for image in table column, multiplied by the average page font size.
sect_table_column_similarity 0.5 Minimum similarity value when elements create table column.
sect_table_image_similarity_w1 1 Sect table image similarity area weight.
sect_table_image_similarity_w2 1 Sect table image similarity chart weight.
sect_table_image_similarity 0.7 Minimum similarity value when section table is image normalized to interval [0,1].
sect_table_paragraph_similarity 0.7 Minimum similarity value when section table is paragraph normalized to interval [0,1].
table_span_col_ratio 0.1 Intersection percentage of colspan element. Possible values from interval [0,1].
table_span_row_ratio 0.2 Intersection percentage of rowspan element. Possible values from interval [0,1].
table_alignment_h 1 Maximum deviation (in points) of horizontal table aligned elements.
table_alignment_v 4 Maximum deviation (in points) of vertical table aligned elements.
table_line_intersection 1 Expansion (in points) for lines intersection. It’s used in table detection.
table_detect_form 1 Recognize form fields as tables.
table_detect_sect 1 Texts (not graphic) tables detection. Possible values: 0
caption_distance 80 Distance of the caption and the image/table.
caption_alignment_h 4 Maximum deviation (in points) in caption and nearest element alignment.
caption_alignment_v 4 Maximum deviation (in points) in caption and nearest element alignment.
mc_detect 1 Update elements language, alternate description and actual text based on kb. The default value is 1, but it can be set to 0 as an optimization when the alternate description is not required.
rd_sort 0 Sort elements: 0 – built-in, 1 – original content positions, 2 – by x and y coordinates.
rd_sort_direction 0 Sort elements: 0 – built-in, 1 – prefer columns, 2 – prefer rows.
rd_column_distance 0.8 Maximum threshold value for columns detection. Real distance in points.
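
Many thresholds above come in pairs: an absolute value in points and a `_ratio` value that is used when the absolute value is zero. The Python sketch below illustrates that fallback rule as described in the table. The function name `resolve_limit` and the 10 pt average font size are assumptions made for the illustration, not part of PDFix.

```python
def resolve_limit(absolute: float, ratio: float, avg_font_size: float) -> float:
    """Return the limit in points.

    A non-zero absolute value is used directly; zero means
    "use the ratio multiplied by the average page font size".
    """
    if absolute != 0:
        return absolute
    return ratio * avg_font_size


# element_isolated_width_min = 0, element_isolated_width_min_ratio = 4
print(resolve_limit(0, 4, 10.0))  # 40.0, the ratio is used

# element_line_width_max = 8, element_line_width_max_ratio = 1
print(resolve_limit(8, 1, 10.0))  # 8, the absolute value wins
```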

Example:

{
  "template": {
    "pagemap": [
      {
        "rect_image_similarity": 0.7,
        "preflight_columns": 1,
        "background_color_blue": 255
      }
    ]
  }
}
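
The `word_space_type` threshold in the table above selects one of four ways to derive the space width. The Python sketch below restates those four cases. The function name, the character-width dictionary, and the sample widths are hypothetical, and the real implementation may differ in detail.

```python
def space_width(word_space_type: float,
                char_widths: dict[str, float],
                font_size: float) -> float:
    """Pick the space width according to the word_space_type description."""
    widths = list(char_widths.values())
    if word_space_type > 0:
        # Positive value: font size (em) multiplied by this number.
        return font_size * word_space_type
    if word_space_type == 0:
        # Average of all character widths.
        return sum(widths) / len(widths)
    if word_space_type == -1:
        # Maximum character width.
        return max(widths)
    if word_space_type == -2:
        # Width of the space character itself.
        return char_widths[" "]
    raise ValueError(f"unsupported word_space_type: {word_space_type}")


# Hypothetical glyph widths in points for a tiny font.
sample_widths = {"a": 5.0, "m": 8.0, " ": 2.0}
for kind in (0, -1, -2, 0.6):
    print(kind, space_width(kind, sample_widths, font_size=10.0))
```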

Regular Expressions

key value
regex_hyphen -$
regex_bullet ^[\\u2010\\u2011\\u2212\\u005E\\u005B\\uF0A7\\uF097\\uF0BB\\u25CF\\u2022\\u25D8\\u25CB\\u25D9\\u2023\\u2043\\uF0B7\\u2212\\u204C\\u204D\\u25E6\\u29BE\\u29BF\\u21E8\\u25BA\\u25C4\\u2219\\u25A0\\uF06C\\u25A1\\u005D\\u25C6]$
regex_bullet_font (Wingdings)|(Symbol)
regex_label ^[\\[\\(]?((M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3}))|(\\d+)|([a-zA-Z]))[\\)\\]\\. ]$
label_chars .()[]
regex_decimal_numbering ^[\\[\\(]?(?:\\d{1,4}\\.){0,5}\\d{0,4}\\s?[\\)\\]\\.]?$
regex_roman_numbering ^[\\[\\(]?M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})[\\)\\]\\.]?$
regex_letter_numbering ^[\\[\\(]?[A-Za-z][\\)\\]\\.]$
regex_filling [._]{2,}
regex_filling_chars ._
regex_page_number (^\\d+$)|(^M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$)
regex_first_cap ^[A-Z]
regex_terminal [\\.\\!\\?]$
regex_table_caption ((^table)|(^tab\\.))
regex_image_caption ((^image)|(^img\\.)|(^figure)|(^fig\\.))
regex_chart_caption ((^chart)|(^map))
regex_note_caption ((^source\\:)|(^note\\:))
regex_toc_caption ((^content)|(^toc))
regex_colon :$
regex_comma [,;]$
regex_letter ^[A-Za-z]$
number_chars -+.,%\\u20AC$\\u00A5\\u00A3
numbering_splitter_chars .()[]

Example:

{
  "template": {
    "pagemap_regex": [
      {
        "regex_roman_numbering": "^[\\[\\(]?M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})[\\)\\]\\.]?$",
        "regex_chart_caption": "((^chart)|(^map))"
      }
    ]
  }
}
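
Before adding a pattern to a template, it can help to try it against sample strings. The Python sketch below decodes the JSON example above and tests both patterns with Python's `re` module. This assumes that Python's regex engine behaves like the one used by PDFix for these particular patterns, which the documentation does not state; the helper names are illustrative.

```python
import json
import re

# The example template, kept as JSON text so the double backslashes
# are decoded exactly as a JSON parser would decode them.
template_text = r'''
{
  "template": {
    "pagemap_regex": [
      {
        "regex_roman_numbering": "^[\\[\\(]?M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})[\\)\\]\\.]?$",
        "regex_chart_caption": "((^chart)|(^map))"
      }
    ]
  }
}
'''

patterns = json.loads(template_text)["template"]["pagemap_regex"][0]
roman = re.compile(patterns["regex_roman_numbering"])
chart = re.compile(patterns["regex_chart_caption"])


def is_roman_numbering(text: str) -> bool:
    return roman.search(text) is not None


def is_chart_caption(text: str) -> bool:
    return chart.search(text) is not None


for sample in ("XIV.", "(IX)", "Chart"):
    print(sample, is_roman_numbering(sample))
for sample in ("chart 3: revenue", "map of Europe", "a chart"):
    print(sample, is_chart_caption(sample))
```

Note that the decoded pattern contains single backslashes (for example `\[`); the doubled form is only needed inside JSON strings.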

Functions

form_object_process

The test is triggered when the page form object is parsed.

keys and values:

path_object_process

The test is triggered when the page path object is parsed.

keys and values:

object_update

The test is triggered when the page image object is tested.

keys and values:

image_object_process

The test is triggered when the page image object is parsed.

keys and values:

line_update

Updates a line element after detecting horizontal and vertical lines.

keys and values:

rect_update

Updates a rectangle element after detecting rectangles.

keys and values:

word_update

Updates a word element after detecting words.

keys and values:

word_spacing

Detects the spacing between two words.

keys and values:

text_line_add_word

The test is triggered when forming text lines from words.

keys and values:

text_line_update

Updates a text line element after detecting text lines.

keys and values:

text_line_split_filling

Splits a text line based on filling.

keys and values:

text_line_neighbours

Tests whether two neighbouring text lines can form a paragraph.

keys and values:

text_update

Updates the text element after detecting paragraphs.

keys and values:

element_create

Create user-defined elements.

keys and values:

element_update

Updates an element after detecting basic elements.

keys and values:

table_recognize

The test is triggered during the table recognition process for detecting columns and rows. This test can be called multiple times!

keys and values:

caption_join

Finds the element for text that has been marked as a caption.

keys and values:

table_update

Updates a table after the whole process of table detection is done.

keys and values:

table_detect_header

Defines the rules for detecting table headers and tests whether the text is a header in a table.

keys and values:

table_split

Updates the table after the entire table detection process is completed.

keys and values:

single_instance_detect

Detects handles for elements that require a single occurrence only.

keys and values:

alt_update

Sets an alternate description for the element. The alternate description is established in a specific order. To skip a step, set the default value to false for that step.

keys and values:

actual_text_update

Sets the actual text for the element. The actual text is established in a specific order. To skip a step, set the default value to false for that step.

keys and values:

artifact_update

Marks an element as an artifact.

keys and values:

reading_order_detect

Detects reading order options.

keys and values:

label_update

Update elements marked as labels to include them as part of the list.

keys and values:

list_update

Tests if a list is correct.

keys and values:

tag_list

Handles the process of tagging lists.

keys and values:

tag_image

Handles the process of tagging images. For repurposing and accessibility purposes, a Figure element should have either an Alt entry or an ActualText entry in its structure element dictionary. If both are absent, the default behavior is to tag the Figure with an empty alt attribute.

keys and values:

tag_table

Handles the process of tagging tables. For repurposing and accessibility purposes, a table should have headers. If no headers are detected, the default behavior is to leave the table without any <th> elements.

keys and values:

tag_annot

Handles the process of tagging annotations.

keys and values:

tag_update

Updates the tag after it has been created.

keys and values:

annot_update

Updates the annotation tag after it has been created.

keys and values:

Schema

statement

The conditional statement type of the query. The statement determines whether query evaluation stops once the query passes.

  • values:
    • ['$if', '$elif', '$else']
  • default value: $if

keys and values:

  • “$if”
  • “$elif”
  • “$else”

$if

Can be used in all functions. General logic of IF statement.

  • type: statement

$elif

Can be used in all functions. General logic of ELSE IF statement.

  • type: statement

$else

Can be used in all functions. General logic of ELSE statement.

  • type: statement
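
The documentation only states that the statement decides whether evaluation stops after a query passes. The Python sketch below shows one plausible reading, modeled on ordinary if / else-if / else chains. The rule representation, the function name `evaluate_chain`, and the font size limits are assumptions for illustration and do not reflect the actual template syntax.

```python
from typing import Callable, Optional

# (statement, predicate, result); the predicate is None for "$else".
Rule = tuple[str, Optional[Callable[[dict], bool]], str]


def evaluate_chain(rules: list[Rule], obj: dict) -> list[str]:
    """Apply rules in order, honoring $if / $elif / $else chaining."""
    results: list[str] = []
    chain_passed = False
    for statement, predicate, result in rules:
        if statement == "$if":
            # "$if" starts a new chain and is always evaluated.
            passed = predicate(obj)
            chain_passed = passed
        elif statement == "$elif":
            # Evaluated only if nothing earlier in the chain passed.
            passed = (not chain_passed) and predicate(obj)
            chain_passed = chain_passed or passed
        elif statement == "$else":
            # Applies only if nothing earlier in the chain passed.
            passed = not chain_passed
            chain_passed = True
        else:
            raise ValueError(f"unknown statement: {statement}")
        if passed:
            results.append(result)
    return results


# Hypothetical heading rules based on font size.
heading_rules: list[Rule] = [
    ("$if", lambda o: o["font_size"] >= 18, "h1"),
    ("$elif", lambda o: o["font_size"] >= 14, "h2"),
    ("$else", None, "normal"),
]
print(evaluate_chain(heading_rules, {"font_size": 15}))
```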

query

The query defines thresholds and operations for a pagemap detection.

  • type: query

keys and values:

  • param params:

    • pds_object_params

      • A parameter that represents PdsObject. The value starts with the character $, followed by a number (e.g., $0_width). The number represents the index of the parameter in the param array.
    • pde_element_params

      • A parameter that represents PdeElement. The value starts with the character $, followed by a number (e.g., $0_width). The number represents the index of the parameter in the param array.
    • pds_struct_elem_params

      • A parameter that represents PdsStructElem. The value starts with the character $, followed by a number (e.g., $0_width). The number represents the index of the parameter in the param array.
    • pdf_annot_params

      • A parameter that represents PdfAnnots. The value starts with the character $, followed by a number (e.g., $0_width). The number represents the index of the parameter in the param array.
    • pdf_rect

    • pdf_rgb

    • int

      • Parameter that represents integer.
    • bool

      • Parameter that represents boolean value.
    • float

      • Parameter that represents floating value.
    • string

      • Parameter that represents string value.
  • var params:

  • logical_operators
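
Parameter references such as `$0_width` combine an index into the param array with a property name. The Python sketch below splits such a reference into its two parts. The exact grammar is inferred from the single example `$0_width`; the regular expression and the function name are assumptions.

```python
import re

# "$", a parameter index, "_", then the property name (inferred format).
PARAM_REF = re.compile(r"^\$(\d+)_(\w+)$")


def parse_param_ref(value: str) -> tuple[int, str]:
    """Split a reference like "$0_width" into (0, "width")."""
    match = PARAM_REF.match(value)
    if match is None:
        raise ValueError(f"not a parameter reference: {value!r}")
    return int(match.group(1)), match.group(2)


print(parse_param_ref("$0_width"))
print(parse_param_ref("$1_font_size"))
```

A general variable such as `$page_num` has no index, so this helper rejects it with a `ValueError`.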

param

Define the number and type of input parameters.

  • type: query_param

keys and values:

int

Parameter that represents integer.

  • type: int

bool

Parameter that represents boolean value.

  • type: bool

float

Parameter that represents floating value.

  • type: float

string

Parameter that represents string value.

  • type: string

var

User-defined variables. Use scheme:arithmetic_operators to define variables.

  • type: var

keys and values:

logical_operators

Available logical operators.

  • type: string
  • values:
    • ['$and', '$or', '$not']

keys and values:

  • “$and” params:

    • “$and”

    • “$or”

    • “$not”

    • condition

  • “$or” params:

    • “$and”

    • “$or”

    • “$not”

    • condition

  • “$not” params:

    • “$and”

    • “$or”

    • “$not”

    • condition

$and

Logical AND.

  • type: logical_operator

keys and values:

  • “$and”
  • “$or”
  • “$not”
  • condition

$or

Logical OR.

  • type: logical_operator

keys and values:

  • “$and”
  • “$or”
  • “$not”
  • condition

$not

Logical NOT.

  • type: logical_operator

keys and values:

  • “$and”
  • “$or”
  • “$not”
  • condition
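
Because each logical operator accepts further logical operators or conditions, queries form a tree. The Python sketch below evaluates such a tree recursively. It assumes each node is either an already-evaluated boolean condition or a one-key dictionary mapping an operator to a list of child nodes, and that `$not` takes exactly one child. That shape is a guess, not the documented template syntax.

```python
def evaluate_logic(node) -> bool:
    """Recursively evaluate a tree of $and / $or / $not nodes."""
    if isinstance(node, bool):
        # Leaf: a condition that has already been evaluated.
        return node
    if len(node) != 1:
        raise ValueError("a node must contain exactly one operator")
    op, children = next(iter(node.items()))
    if op == "$and":
        return all(evaluate_logic(child) for child in children)
    if op == "$or":
        return any(evaluate_logic(child) for child in children)
    if op == "$not":
        if len(children) != 1:
            raise ValueError("$not expects exactly one child in this sketch")
        return not evaluate_logic(children[0])
    raise ValueError(f"unknown logical operator: {op}")


query = {"$and": [True, {"$or": [False, True]}, {"$not": [False]}]}
print(evaluate_logic(query))  # True
```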

comparison_operators

Available comparison operators.

  • type: string
  • values:
    • ['$eq', '$ne', '$lt', '$lte', '$gt', '$gte', '$regex', '$in', '$nin']

keys and values:

  • “$eq”
  • “$ne”
  • “$lt”
  • “$lte”
  • “$gt”
  • “$gte”
  • “$regex”
  • “$in”
  • “$nin”

$eq

Equality comparison operator.

  • type: comparison_operator
  • types: [“int“, “float“, “string“]

$ne

Inequality comparison operator.

  • type: comparison_operator
  • types: [“int“, “float“, “string“]

$lt

Less than comparison operator.

  • type: comparison_operator
  • types: [“int“, “float“]

$lte

Less than or equals operator.

  • type: comparison_operator
  • types: [“int“, “float“]

$gt

Greater than operator.

  • type: comparison_operator
  • types: [“int“, “float“]

$gte

Greater than or equals operator.

  • type: comparison_operator
  • types: [“int“, “float“]

$regex

Regular expression predicate.

  • type: comparison_operator
  • types: [“string“]

$in

Contains value operator.

  • type: comparison_operator
  • types: []

$nin

Does not contain value operator.

  • type: comparison_operator
  • types: []
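
The comparison operators map naturally onto ordinary comparisons. The Python sketch below is one way to model them. It assumes that `$regex` performs a partial (search) match and that `$in` / `$nin` test membership in a list of candidate values; neither detail is confirmed by the documentation, and the `compare` helper is hypothetical.

```python
import operator
import re

COMPARATORS = {
    "$eq": operator.eq,
    "$ne": operator.ne,
    "$lt": operator.lt,
    "$lte": operator.le,
    "$gt": operator.gt,
    "$gte": operator.ge,
    # Assumed: the pattern may match anywhere in the value.
    "$regex": lambda value, pattern: re.search(pattern, value) is not None,
    # Assumed: the operand is a list of candidate values.
    "$in": lambda value, options: value in options,
    "$nin": lambda value, options: value not in options,
}


def compare(op: str, value, operand) -> bool:
    """Apply a comparison operator to a value and an operand."""
    return COMPARATORS[op](value, operand)


print(compare("$gte", 12.0, 12))
print(compare("$regex", "Arial-Bold", "Bold$"))
print(compare("$in", "H1", ["H1", "H2"]))
```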

arithmetic_operators

Available arithmetic operators.

  • type: string
  • values:
    • ['$sum', '$diff', '$abs', '$mul', '$div', '$min', '$max', '$mod']

keys and values:

$sum

Sum of two numbers.

  • type: arithmetic_operator
  • types: [“int“, “float“]

keys and values:

$diff

Difference of two numbers.

  • type: arithmetic_operator
  • types: [“int“, “float“]

keys and values:

$abs

Absolute value of a single number.

  • type: arithmetic_operator
  • types: [“int“, “float“]

keys and values:

$mul

Multiply two numbers.

  • type: arithmetic_operator
  • types: [“int“, “float“]

keys and values:

$div

Division of two numbers.

  • type: arithmetic_operator
  • types: [“int“, “float“]

keys and values:

$min

Get minimal value from any number of values.

  • type: arithmetic_operator
  • types: [“int“, “float“]

keys and values:

$max

Get maximal value from any number of values.

  • type: arithmetic_operator
  • types: [“int“, “float“]

keys and values:

$mod

Get the modulus from a division of values.

  • type: arithmetic_operator
  • types: [“int“, “int“]

keys and values:
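
The arithmetic operators listed above can be modeled as plain functions. The Python sketch below does this under two assumptions: operands are applied in the order given (so `$diff` is first minus second), and `$div` is true division. The `apply_arithmetic` helper is illustrative and is not a PDFix API.

```python
ARITHMETIC = {
    "$sum": lambda a, b: a + b,
    "$diff": lambda a, b: a - b,   # assumed operand order: first minus second
    "$abs": lambda a: abs(a),
    "$mul": lambda a, b: a * b,
    "$div": lambda a, b: a / b,    # assumed: true division
    "$min": lambda *values: min(values),
    "$max": lambda *values: max(values),
    "$mod": lambda a, b: a % b,    # documented for integers only
}


def apply_arithmetic(op: str, *operands):
    """Apply an arithmetic operator to its operands."""
    return ARITHMETIC[op](*operands)


print(apply_arithmetic("$sum", 2, 3))
print(apply_arithmetic("$min", 4, 1, 9))
print(apply_arithmetic("$mod", 7, 3))
```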

pds_object_params

List of all pds_object types, can be used as parameter in QUERY->PARAM.

keys and values:

pds_text

Text page object

keys and values:

pds_struct_elem_params

List of all pds_tag types, can be used as parameter in QUERY->PARAM.

keys and values:

pdf_annot_params

List of all pdf_annot types, can be used as parameter in QUERY->PARAM.

keys and values:

pde_element_params

List of all pde_element types, can be used as parameter in QUERY->PARAM.

keys and values:

general_vars

General variables can be used without parameters. They represent the general state during processing, contain information about the current page and the document, and can be used in any query.

  • type: string

keys and values:

  • “$page_num”
  • “$page_width”
  • “$page_height”
  • “$page_rotation”
  • “$doc_num_pages”
  • “$doc_lang”
  • “$doc_title”

$page_num

Page number.

  • type: int

$page_width

Page cropbox width.

  • type: float

$page_height

Page cropbox height.

  • type: float

$page_rotation

Page rotation.

  • type: int
  • values:
    • [0, 90, 180, 270]

$doc_num_pages

Document number of pages.

  • type: int

$doc_lang

Document language.

  • type: string

$doc_title

Document title.

  • type: string

values

General values used in JSON default template.

keys and values:

actual_text

Actual text.

  • type: string

alt

Alternate description typically used for Figure tags.

  • type: string

lang

The language identifier of the document or tag.

  • type: string

id

The unique identifier of the tag.

  • type: string

expansion

The expanded form of an abbreviation.

  • type: string

has_content

A value identifying whether the object or tag has associated page content.

  • type: bool
  • values:
    • ['true', 'false']

tag_type

Tag type defined by a string or regular expression. Use .* to match all tags.

  • type: string
  • values:
    • ['Annot', 'Art', 'Artifact', 'Aside', 'BibEntry', 'BlockQuote', 'Caption', 'Code', 'Div', 'Document', 'DocumentFragment', 'Em', 'FENote', 'Figure', 'Form', 'Formula', 'H', 'H1', 'H2', 'H3', 'H4', 'H5', 'H6', 'Index', 'L', 'Lbl', 'LBody', 'LI', 'Link', 'NonStruct', 'Note', 'P', 'Part', 'Private', 'Quote', 'RB', 'Reference', 'RP', 'RT', 'Ruby', 'Sect', 'Span', 'Strong', 'Sub', 'Table', 'TBody', 'TD', 'TFoot', 'TH', 'THead', 'Title', 'TOC', 'TOCI', 'TR', 'Warichu', 'WP', 'WT']
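
Since `tag_type` may be a plain string or a regular expression, a single matching routine can cover both. The Python sketch below assumes the pattern must match the whole tag name, which is why `.*` matches every tag. Whether PDFix uses full or partial matching is not stated, and `tag_matches` is a hypothetical helper.

```python
import re


def tag_matches(tag_type_pattern: str, tag_name: str) -> bool:
    """Return True if the whole tag name matches the pattern."""
    return re.fullmatch(tag_type_pattern, tag_name) is not None


print(tag_matches(".*", "Figure"))     # True: matches all tags
print(tag_matches("H[1-6]", "H3"))     # True: any numbered heading
print(tag_matches("Table", "TD"))      # False: plain string, different tag
```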

annot_type

Annotation type defined by a string or regular expression. Use .* to match all annotations.

  • type: string
  • values:
    • ['Text', 'Link', 'FreeText', 'Line', 'Square', 'Circle', 'Polygon', 'PolyLine', 'Highlight', 'Underline', 'Squiggly', 'StrikeOut', 'Stamp', 'Caret', 'Ink', 'Popup', 'FileAttachment', 'Sound', 'Movie', 'Widget', 'Screen', 'PrinterMark', 'TrapNet', 'Watermark', '3D', 'Redact', 'Projection', 'RichMedia']

contents

A string value specifying the annotation contents.

  • type: string

annot_flag

A comma-delimited string value specifying the annotation flags.

  • type: string
  • values:
    • ['invisible', 'hidden', 'print', 'no_zoom', 'no_rotate', 'no_view', 'read_only', 'locked', 'toggle', 'contents']
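
A comma-delimited `annot_flag` value can be split and checked against the listed flag names before it is placed in a template. The Python sketch below is a small validation helper written for illustration; it tolerates spaces around commas, which is an assumption.

```python
KNOWN_FLAGS = {
    "invisible", "hidden", "print", "no_zoom", "no_rotate",
    "no_view", "read_only", "locked", "toggle", "contents",
}


def parse_annot_flags(value: str) -> set[str]:
    """Split a comma-delimited flag string and reject unknown names."""
    flags = {part.strip() for part in value.split(",") if part.strip()}
    unknown = flags - KNOWN_FLAGS
    if unknown:
        raise ValueError(f"unknown annotation flags: {sorted(unknown)}")
    return flags


print(sorted(parse_annot_flags("print, read_only")))
```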

title

Title.

  • type: string

angle

Angle.

  • type: float

bbox

Parameter that represents the bounding box of an object, formatted as an array: [left, bottom, right, top].

  • type: bbox

keys and values:
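
The array order [left, bottom, right, top] determines how width and height are derived from a bounding box. The Python sketch below shows the arithmetic. It assumes the usual PDF coordinate system in which y grows upward, so top is greater than bottom; the sample coordinates are made up.

```python
def bbox_size(bbox: list[float]) -> tuple[float, float]:
    """Return (width, height) of a [left, bottom, right, top] box."""
    left, bottom, right, top = bbox
    return right - left, top - bottom


# Hypothetical text line near the top of a US Letter page.
print(bbox_size([72.0, 700.0, 300.0, 720.0]))  # (228.0, 20.0)
```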

cell_column

The column number of the cell in the table.

  • type: int

cell_row

The row number of the cell in the table.

  • type: int

col_num

Number of columns in the table.

  • type: int

children_num

Number of associated child page objects.

  • type: int

artifact

True if the object has an Artifact content mark, false otherwise.

  • type: bool
  • values:
    • ['true', 'false']

mcid

The MCID content mark number if it exists, -1 otherwise.

  • type: int

fill_color

The fill color of an object.

  • type: rgb

keys and values:

stroke_color

The stroke color of an object.

  • type: rgb

keys and values:

flag

The flag value defines a specific property for an object, which is essential for further processing.

  • type: string
  • values:
    • ['no_join', 'no_split', 'artifact', 'header', 'footer', 'splitter', 'no_table', 'table']

font_name

The name of the font used in the text object.

  • type: string

font_size

The size of the font used in the text object.

  • type: float

red

The red component of an RGB color.

  • type: int

green

The green component of an RGB color.

  • type: int

blue

The blue component of an RGB color.

  • type: int

Marks the object as a table header.

  • type: bool
  • values:
    • ['true', 'false']

heading

Sets the text heading style.

  • type: string
  • values:
    • ['normal', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'h7', 'h8', 'note', 'title']

width

The object’s width dimension.

  • type: float

height

The object’s height dimension.

  • type: float

label

Marks the element as a list label.

  • type: string
  • values:
    • ['', 'label', 'li_1', 'li_2', 'li_3', 'li_4', 'label_no']

left

The left coordinate of the object.

  • type: float

The left coordinate of the object.

  • type: float

top

The top coordinate of the object.

  • type: float

bottom

The bottom coordinate of the object.

  • type: float

pdf_rect

Parameter that represents the bounding box of an object, formatted as an array: [left, bottom, right, top].

  • type: rec

keys and values:

pdf_rgb

Parameter that represents the RGB color of an object, formatted as an array: [red, green, blue].

  • type: rgb

keys and values:

reflow

Text reflow. If set to false, each line is treated as a new line.

  • type: bool
  • values:
    • ['true', 'false']

row_num

The number of rows in the table.

  • type: int

table_type

The table type represented as a value from the PdfTableType enum.

  • type: string
  • values:
    • ['graphic', 'isolated', 'row', 'col', 'form']

tag

The tag name to be used.

  • type: string

text

The text to be used as a value.

  • type: string

text_flag

The flag to be used for the text element, specifying a value similar to the regex flags.

  • type: string
  • values:
    • ['table_caption', 'image_caption', 'chart_caption', 'note_caption', 'filling', 'uppercase']

text_line_flag

The flag to be used for the text line element, specifying a value similar to the regex flags.

  • type: string
  • values:
    • ['hyphen', 'new_line', 'indent', 'terminal', 'drop_cap', 'filling', 'uppercase']

type

Marks the object or element with a type, aiding in the recognition of the correct element type.

  • type: string
  • values:
    • ['pds_object', 'pds_text', 'pds_path', 'pds_image', 'pds_shading', 'pds_form', 'pde_element', 'pde_text', 'pde_text_line', 'pde_word', 'pde_text_run', 'pde_image', 'pde_container', 'pde_list', 'pde_line', 'pde_rect', 'pde_table', 'pde_cell', 'pde_toc', 'pde_header', 'pde_footer', 'pde_form_field', 'pde_annot']

word_flag

The flag to be used for the word element, specifying a value similar to the regex flags.

  • type: string
  • values:
    • ['hyphen', 'bullet', 'colon', 'number', 'subscript', 'superscript', 'terminal', 'capital', 'image', 'decimal_num', 'roman_num', 'letter_num', 'page_num', 'filling', 'uppercase', 'comma', 'no_unicode']

suffix

Container holding all unique suffixes used for naming in the default JSON template.

keys and values:

condition

Condition types used in the query.

keys and values:

condition_value

{0_width : 100}

comparison

{0_width : {$lt : 100}}

keys and values:

  • “$eq”

comparison_array

{0_width : [{$lt : 100}, {$gt : 100}, …]}

keys and values:

  • “$gt”
  • “$lt”
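
The three condition forms above (a plain value, a single comparison, and an array of comparisons) can be read as one recursive check. The Python sketch below illustrates that reading and is not the PDFix evaluator: the `matches` function is hypothetical, and because this section does not state whether array entries are combined with AND or OR, the combinator is passed in explicitly.

```python
import operator

# Comparison operators named in this section.
OPERATORS = {"$eq": operator.eq, "$lt": operator.lt, "$gt": operator.gt}


def matches(value, condition, combine=all):
    """Check a property value against one of the three condition forms."""
    if isinstance(condition, list):
        # comparison_array: combine is all (AND) or any (OR); an assumption.
        return combine(matches(value, item, combine) for item in condition)
    if isinstance(condition, dict):
        # comparison: every operator in the object must hold.
        return all(OPERATORS[op](value, operand)
                   for op, operand in condition.items())
    # condition_value: a plain value is treated as an equality test.
    return value == condition


exact = matches(100, 100)
narrow = matches(50, {"$lt": 100})
outside = matches(150, [{"$lt": 100}, {"$gt": 100}], combine=any)
```

With `combine=any`, the array from the example above matches every width except exactly 100.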

keywords

Container holding all unique keywords used in the default JSON template.

keys and values:

general

Holds general data such as version, date, id, SDK version, …

template

Holds all functions.

query

Can be used in all functions. Each QUERY must have a child PARAM, which holds an array of parameters that specify the query objects.

param

Child of the QUERY. Each QUERY must include a PARAM that specifies the object types used for evaluation.

  • type: array_param

statement

The if statement should be used in function nodes. Based on the statement, the query evaluation stops upon pass or fail. If the if statement is not present, the condition is considered disabled.

  • type: string
  • values:
    • ['$if', '$elif', '$else']

disable

Can be used in all main function nodes. If the value is true, the node is not executed. The default value is false.

  • type: bool
  • values:
    • ['true', 'false']
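
To make the default explicit, the following Python sketch filters a list of nodes so that only those without disable set to true remain. The node layout (dictionaries with a `name` key) and the `active_nodes` helper are hypothetical and serve only to illustrate the rule.

```python
def active_nodes(nodes):
    """Keep nodes that are not disabled; a missing 'disable' key means false."""
    return [node for node in nodes if not node.get("disable", False)]


# Hypothetical nodes; only the 'disable' key comes from this section.
nodes = [
    {"name": "first"},
    {"name": "second", "disable": True},
    {"name": "third", "disable": False},
]
executed = [node["name"] for node in active_nodes(nodes)]
```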

purpose

Describes the user-defined purpose or description of the QUERY.

  • type: string

insert

Values to be added as the default for the node.

keys and values:

