Fine-Tune Auto-Tagging and Ensure PDF/UA Compliance
PDFix Template
General Settings
General template settings
| key | type | value |
|---|---|---|
| rtl | bool | False |
| substructure_form_xobject | bool | True |
| page_tag | string | NonStruct |
| debug_pagemap_stop | string | |
Example:
{
"template": {
"settings": {
"rtl": false,
"substructure_form_xobject": true,
"page_tag": "NonStruct",
"debug_pagemap_stop": ""
}
}
}
Threshold Values
| key | value | description |
|---|---|---|
| preflight_artifact_font_size_min | 32 | Minimum font size for an artifact. |
| preflight_artifact_w1 | 1 | Horizontal alignment weight. |
| preflight_artifact_w2 | 1 | Vertical alignment weight. |
| preflight_artifact_w3 | 1 | Element width weight. |
| preflight_artifact_w4 | 1 | Element height (for images) or font size (for text) weight. |
| preflight_artifact_w5 | 1 | Page numbers weight. |
| preflight_artifact_distance | 0.7 | Maximum distance, in the interval [0,1], at which elements can be an artifact/header/footer. |
| preflight_artifact_cluster_points | 2 | Minimal number of points within the preflight_artifact_distance radius. |
| concurrent_threads | 0 | The number of concurrent threads. If zero, the number of concurrent threads supported by the implementation is used. If it’s set to 1, no parallel algorithms are used. |
| text_only | 0 | Process only texts in pagemap. |
| rotation_detect | 1 | Detect and correct page rotation for reading. |
| background_color_red | 255 | Page background color – red. |
| background_color_green | 255 | Page background color – green. |
| background_color_blue | 255 | Page background color – blue. |
| background_color_diff | 2 | Page background color max color component difference. |
| bbox_expansion | 2 | Bounding box expansion – half of kTrTextHeight. |
| angle_deviation | 0.015707963267949 | Maximum angle deviation for horizontal and vertical elements. |
| header_ratio | 0.15 | Maximum percentage of a header height. Possible values from interval [0,1]. |
| footer_ratio | 0.15 | Maximum percentage of a footer height. Possible values from interval [0,1]. |
| artifact_w1 | 1 | Artifact page border distance weight. |
| artifact_w2 | 1 | Artifact image area weight. |
| artifact_border_distance_max | 2 | Maximum distance of artifact to page border. |
| artifact_similarity | 0.7 | Minimum similarity value when object or element is an artifact, normalized to interval [0,1]. |
| path_object_max | 2000 | Maximum number of consecutive path objects that are still paths. |
| path_object_min | 100 | Minimum number of consecutive path objects that are still paths. |
| initial_element_expansion | 1 | Initial element bounding box expansion when searching children. Size in points. If it is zero, half of the default page font size is used. |
| initial_element_overlap | 0.5 | Minimum percentage of covered area of element by the initial element. |
| annot_char_overlap | 0.05 | Minimum percentage of covered area of character by the annotation. |
| isolated_text_ratio | 10 | Maximum isolated text width ratio. The value is multiplied by the font size. |
| isolated_text | 80 | Maximum isolated text width. |
| isolated_element_ratio | 6 | Maximum isolated element width/height ratio. The value is multiplied by the font size. |
| element_isolated_w1 | 1 | Element paragraph weight. |
| element_isolated_w2 | 1 | Element width weight. |
| element_isolated_caption | 1 | If set to 1 and the element contains a caption (table, image, chart, note), do not mark it as an isolated element. |
| element_isolated_width_min | 0 | Minimal value of bbox width for isolated element. If zero, element_isolated_width_min_ratio is used. Size in points. |
| element_isolated_width_min_ratio | 4 | Minimal value of bbox width for isolated element, multiplied by the average page font size. |
| element_isolated_width_max | 0 | Maximal value of bbox width for isolated element. If zero, element_isolated_width_max_ratio is used. Size in points. |
| element_isolated_width_max_ratio | 10 | Maximal value of bbox width for isolated element, multiplied by the average page font size. |
| element_isolated_similarity | 0.7 | Minimum similarity value when element is isolated, normalized to interval [0,1]. |
| element_isolated_image_w1 | 1 | Image vs page area weight. |
| element_isolated_image_w2 | 1 | Elements isolated similarity weight. |
| element_isolated_image_w3 | 1 | Images area vs join image area weight. |
| element_isolated_image_similarity | 0.7 | Minimum similarity value when isolated elements can be added to an image. |
| element_line_w1 | 1 | Line width weight. |
| element_line_width_max | 8 | Maximal value of line width. If zero, element_line_width_max_ratio is used. Size in points. |
| element_line_width_max_ratio | 1 | Maximal value of line width, multiplied by the average page font size. |
| element_line_similarity | 0.6 | Minimum similarity value when element is recognized as line, normalized to interval [0,1]. |
| element_alignment_ratio | 0.5 | Ratio between baseline and bounding box alignments. Bounding box alignment precision is multiplied by element_alignment_ratio. |
| rect_image_similarity | 0.7 | Minimum similarity value when the rectangle should be an image, normalized to interval [0,1]. |
| rect_line_similarity | 0.5 | Minimum similarity value when the rectangle should be a line, normalized to interval [0,1]. |
| image_background_text | 1 | Text bounding box expansion. |
| image_overlap_distance | 1 | Maximum distance value when graphic page objects can be joined. Distance in points. |
| image_join_distance | 8 | Defines the maximum allowed distance (in points) between small images for them to be considered joinable. This parameter helps fine-tune the grouping of small image elements into a cohesive larger visual block based on their spatial proximity. |
| char_clip_ratio | 0.5 | Minimal ratio of the clipping area of the character compared to its original size. |
| word_space_width_ratio | 0.6 | A multiplier that determines the threshold for identifying inter-word spaces by comparing the gap between characters to the typical width of a space character. It scales the space width so that small variations in spacing can be interpreted as either a valid word separator or a mere character gap. |
| word_space_width_min_ratio | 0.1 | An additional multiplier that sets a minimum threshold for the allowed space between words. It ensures that, even when minimal character spacing is detected, the computed gap used to determine word boundaries does not fall below a baseline value relative to the font size. |
| word_space_distance_max | 0 | Maximum word space distance in points. |
| word_space_distance_max_ratio | 0 | Maximum word space distance. The value is multiplied by word font size. |
| word_space_ratio | 1 | Ratio that defines if the text line is simple or justified. |
| word_space_update_min | 0.2 | Minimum ratio of detected word spacing. |
| word_space_update_max | 4 | Maximum ratio of detected word spacing. If set to 0, the word spacing update from lines is not applied. |
| word_space_update_distance | 0.04 | Distance for clustering word spaces in text line update. |
| word_splitter_ratio | 2 | Minimum space before splitter. The value is multiplied by the most used font size. |
| word_splitter_distance | 4 | Maximum threshold value for word splitter detection. Real distance in points. |
| word_overlap | 0.9 | Minimum overlap percentage (0-1) required between bounding boxes to consider words as duplicates. A word must cover at least this percentage of another word’s area to be considered overlapping. |
| text_line_baseline_ratio | 0.1 | Maximum baseline shift. The value is multiplied by the minimal font size. Baseline shift moves individual characters up or down in relation to other text on the same line. |
| text_line_underline_distance | 2.6 | Distance between the underline and the text baseline. Size in points. |
| text_line_underline_char_distance_ratio | 0.1 | Distance between the underline start/end point and the character bounding box. The value is multiplied by line font size. Size in points. |
| text_line_subscript_font_ratio | 1 | This ratio is used to calculate the maximum allowed baseline difference for joining a subscript with its main word. Specifically, multiply the word’s font size by this ratio to get a threshold. |
| text_line_join_font_size_distance | 0 | Difference between two font sizes, in points, at which two lines with different fonts can still be joined. |
| text_line_distance_max | 0 | Maximum distance between lines. If zero, text_line_distance_max_ratio is used. Size in points. |
| text_line_distance_max_ratio | 2 | Maximum distance between lines. The value is multiplied by line font size. |
| text_line_join_distance | 2 | Maximum threshold value in line spacing detection for a specific font size. A higher value allows creating paragraphs with variable line spacing. The value is multiplied by font size. |
| text_line_chunk_distance_max | 0 | Maximum distance between chunks. If zero, text_line_chunk_distance_max_ratio is used. Size in points. |
| text_line_chunk_distance_max_ratio | 6 | Maximum distance between chunks. The value is multiplied by simple word spacing between words. |
| text_line_chunk_distance | 0 | A fixed threshold parameter used by the clustering algorithm to group word spaces in a line. When set to a nonzero value, it directly defines the threshold that determines whether adjacent word spaces are similar enough to be considered part of the same cluster. If zero, text_line_chunk_distance_ratio is used. Size in points. |
| text_line_chunk_distance_ratio | 0.4 | A relative multiplier that comes into play when the fixed threshold (text_line_chunk_distance) is zero. It calculates the threshold by multiplying the line’s font size by the ratio, thereby adapting the clustering sensitivity to the text size. |
| text_chunk_distance | 0 | Maximum distance value when text chunks are vertically aligned. If zero, text_chunk_distance_ratio is used. Size in points. |
| text_chunk_distance_ratio | 0.42 | Maximum distance value when text chunks are vertically aligned. The value is multiplied by page font width. |
| text_chunk_simple_distance | 0.4 | Maximum distance value when text chunks create a simple line. Normalized to interval [0,1]. |
| text_chunk_word_distance | 0.1 | Maximum distance value when single-line text has to be split into words. Normalized to interval [0,1]. |
| text_height | 8 | Minimal text height on the page. |
| text_simple_similarity | 0.96 | Minimum similarity value when text lines create a simple paragraph, normalized to interval [0,1]. |
| text_justify_similarity | 0.96 | Minimum similarity value when text lines create a justified paragraph, normalized to interval [0,1]. |
| text_table_similarity | 0.65 | Minimum similarity value when text lines create a table, normalized to interval [0,1]. |
| text_paragraph_similarity | 0.7 | Minimum similarity value when text is a paragraph, normalized to interval [0,1]. |
| text_split_distance | 0.2 | Dissimilarity boundary value when text lines create a paragraph. |
| text_column_similarity | 0.7 | Minimum similarity value when text creates a column, normalized to interval [0,1]. |
| label_image_detect | 1 | Graphic labels detection. Possible values: 0 |
| label_word_detect | 1 | Text labels detection. Possible values: 0 |
| label_alignment_h | 2 | Maximum deviation of horizontal label alignment. |
| label_distance_ratio | 10 | Distance between the label and the text. The value is multiplied by the most used font size on the page. |
| label_baseline_ration | 0.14 | Maximum deviation of a horizontal label aligned to text. The value is multiplied by the minimal font size. |
| label_image_w1 | 1 | Controls how much vertical alignment matters when clustering labels. A higher value enforces stricter alignment, while a lower value allows more variation. |
| label_image_w2 | 1 | Controls how much the distance between a label and its associated text influences clustering. A higher value enforces stricter proximity, ensuring labels are closely linked to their text. |
| label_image_w3 | 1 | This weight controls how much the label’s width consistency matters in clustering. A higher value enforces that labels should have the same width, while a lower value allows more variation in width between labels. |
| label_image_w4 | 1 | This weight determines how important the height consistency of labels is when clustering. A higher value enforces that labels should have the same height, while a lower value allows more flexibility in height differences. |
| label_image_w5 | 0.5 | This weight adjusts how important the height relationship is between the image label and its associated text. A higher value means the height alignment between the label and the text is more significant in clustering decisions. |
| label_image_width_min | 0 | Specifies a fixed minimum width in points. If set to zero, the label_image_width_min_ratio is used instead. |
| label_image_width_min_ratio | 0 | Defines the minimum width as a multiple of the average font size. Useful when label size varies with font size. |
| label_image_width_max | 0 | Specifies a fixed maximum width in points. If set to zero, the label_image_width_max_ratio is used instead. |
| label_image_width_max_ratio | 6 | Defines the maximum width as a multiple of the average page font size. This ratio is applied when label_image_width_max is zero. |
| label_image_distance | 4 | Clustering threshold in points that decides when labels should be grouped together. A higher value makes clustering more flexible, allowing distant labels to merge, while a lower value keeps clusters tight and separate. |
| label_word_w1 | 1 | Controls how much vertical alignment matters when clustering labels. A higher value enforces stricter alignment, while a lower value allows more variation. |
| label_word_w2 | 1 | Controls how much the distance between a label and its associated text influences clustering. A higher value enforces stricter proximity, ensuring labels are closely linked to their text. |
| label_word_dist_sibling_ratio | 4 | This threshold, defined as a ratio multiplied by the sibling’s font size, sets the maximum gap allowed between a label and its sibling element for them to be joined together. If the distance exceeds this value, the label and its sibling remain separate. |
| label_word_distance | 0 | Clustering threshold in points that decides when labels should be grouped together. A higher value makes clustering more flexible, allowing distant labels to merge, while a lower value keeps clusters tight and separate. |
| label_word_distance_ratio | 1 | Clustering threshold value that decides when labels should be grouped together. The value is multiplied by the average page font width. |
| toc_detect | 1 | TOC detection. Possible values: 0 |
| toc_word_distance | 0 | Controls how much vertical alignment matters when clustering TOC words. A higher value enforces stricter alignment, ensuring TOC elements are well-structured. |
| toc_word_distance_ratio | 1 | Threshold ratio that determines when TOC entries should be clustered together. The value is multiplied by the average page font width. |
| graphic_table_detect | 1 | Graphic tables detection. Possible values: 0 |
| graphic_table_detect_row | 1 | Row graphic tables detection. |
| graphic_table_detect_col | 1 | Column graphic tables detection. |
| graphic_table_alignment_distance | 0.8 | Maximum alignment distance value when elements can create a table. Distance in points. |
| graphic_table_split_w1 | 1 | Table texts paragraph weight. |
| graphic_table_split_w2 | 1 | Table texts horizontal alignment weight. |
| graphic_table_split_w3 | 1 | Columns width weight. |
| graphic_table_split_w4 | 0.5 | Number of columns weight. |
| graphic_table_split_w5 | 0.5 | Number of rows weight. |
| graphic_table_split_w6 | 1 | Page area weight. |
| graphic_table_split_col_max | 5 | Maximal number of columns when table can be split. |
| graphic_table_split_row_max | 5 | Maximal number of rows when table can be split. |
| graphic_table_split_similarity | 0.7 | Minimum similarity value when graphic table has to be split. |
| graphic_table_split_layout_similarity | 0.7 | Minimum similarity value when graphic table has to be split. |
| graphic_table_chart_similarity | 0.3 | Minimum similarity value when graphic table is a chart. |
| graphic_table_image_w1 | -1 | Images area weight. If -1, number of images is used. |
| graphic_table_image_w2 | -1 | Images weight. If -1, number of images is used. |
| graphic_table_image_w3 | -1 | Chart similarity weight. If -1, number of paths is used. |
| graphic_table_image_w4 | 1 | Texts vertical alignment weight. |
| graphic_table_image_w5 | 1 | Table size weight. |
| graphic_table_image_similarity | 0.7 | Minimum similarity value when graphic table has an image. |
| text_table_detect | 1 | Text (not graphic) tables detection. Possible values: 0 |
| text_table_detect_row | 1 | Row text (not graphic) tables detection. |
| text_table_detect_col | 1 | Column text (not graphic) tables detection. |
| text_table_row_alignment_type | 1 | Table row alignment type [0 – strong, 1 – average, 2 – weak]. |
| text_table_col_alignment_type | 1 | Table column alignment type [0 – strong, 1 – average, 2 – weak]. |
| text_table_col_similarity_type | 0 | Table column similarity type [0 – column alignment distance, 1 – element distance, 2 – element size, 3 – max]. |
| text_table_col_distance | 0.8 | Maximum deviation value for detecting the nearest distances for table columns. Real distance in points. |
| text_table_col_similarity | 0.36 | Minimum similarity value when elements create table column. |
| text_table_alignment_type | 2 | Table column alignment type [0 – strong, 1 – average, 2 – weak]. Select strong for strictly aligned table elements. |
| text_table_alignment_distance | 0.4 | Maximum threshold value for detecting text tables. |
| text_table_text_col_w1 | 1 | Text column paragraph weight. |
| text_table_text_col_w2 | 1 | Text column width weight. |
| text_table_text_col_width_min | 0 | Minimal value of bbox width for text in table column. If zero, text_table_text_col_width_min_ratio is used. Size in points. |
| text_table_text_col_width_min_ratio | 1 | Minimal value of bbox width for text in table column, multiplied by the average page font size. |
| text_table_text_col_width_max | 0 | Maximal value of bbox width for text in table column. If zero, text_table_text_col_width_max_ratio is used. Size in points. |
| text_table_text_col_width_max_ratio | 8 | Maximal value of bbox width for text in table column, multiplied by the average page font size. |
| text_table_image_col_w1 | 1 | Image column weight. |
| text_table_image_col_gs | 1 | If set to 1, image column has to have the same graphics state. |
| text_table_image_col_width_min | 0 | Minimal value of bbox width for image in table column. If zero, text_table_image_col_width_min_ratio is used. Size in points. |
| text_table_image_col_width_min_ratio | 1 | Minimal value of bbox width for image in table column, multiplied by the average page font size. |
| text_table_image_col_width_max | 0 | Maximal value of bbox width for image in table column. If zero, text_table_image_col_width_max_ratio is used. Size in points. |
| text_table_image_col_width_max_ratio | 4 | Maximal value of bbox width for image in table column, multiplied by the average page font size. |
| text_table_image_col_height_min | 0 | Minimal value of bbox height for image in table column. If zero, text_table_image_col_height_min_ratio is used. |
| text_table_image_col_height_min_ratio | 1 | Minimal value of bbox height for image in table column, multiplied by the average page font size. |
| text_table_image_col_height_max | 0 | Maximal value of bbox height for image in table column. If zero, text_table_image_col_height_max_ratio is used. |
| text_table_image_col_height_max_ratio | 2 | Maximal value of bbox height for image in table column, multiplied by the average page font size. |
| text_table_column_similarity | 0.5 | Minimum similarity value when elements create table column. |
| text_table_image_similarity_w1 | 1 | Sect table image similarity area weight. |
| text_table_image_similarity_w2 | 1 | Sect table image similarity chart weight. |
| text_table_image_similarity | 0.7 | Minimum similarity value when text table is an image, normalized to interval [0,1]. |
| text_table_paragraph_similarity | 0.7 | Minimum similarity value when text table is a paragraph, normalized to interval [0,1]. |
| table_update_delete_empty | 1 | Delete empty rows and columns. |
| table_update_split_by_cell | 0 | Split elements that should originally have been split. This usually happens when a paragraph is recognized instead of single lines, or when images (bullets) are joined together. |
| table_update_split_by_row | 0 | Split table texts to lines. |
| table_update_split_label | 0 | Split labels in tables. |
| table_update_span_empty | 1 | Span empty cells. |
| table_update_span_row | 0 | Join rows based on the maximum row span. |
| table_update_span_row_first | 0 | If set to true, rows are merged together first using span. |
| table_update_join | 0 | Join texts in a single cell. |
| table_update_cell_header | 1 | Detect headers. |
| table_span_col_ratio | 0.1 | Intersection percentage of colspan element. Possible values from interval [0,1]. |
| table_span_row_ratio | 0.2 | Intersection percentage of rowspan element. Possible values from interval [0,1]. |
| table_alignment_h | 1 | Maximum deviation (in points) of horizontal table aligned elements. |
| table_alignment_v | 4 | Maximum deviation (in points) of vertical table aligned elements. |
| table_line_intersection | 1 | Expansion (in points) for lines intersection. It’s used in table detection. |
| form_table_detect | 1 | Recognize form fields as tables. |
| caption_distance | 80 | Distance of the caption and the image/table. |
| caption_alignment_h | 4 | Maximum deviation (in points) in caption and nearest element alignment. |
| caption_alignment_v | 4 | Maximum deviation (in points) in caption and nearest element alignment. |
| mc_detect | 1 | Update elements language, alternate description and actual text based on kb. The default value is 1, but it can be set to 0 as an optimization when an alternate description is not required. |
| rd_sort | 0 | Sort elements: 0 – built-in, 1 – original content positions, 2 – by x and y coordinates, 3 – by rd_index. |
| rd_sort_direction | 0 | Sort elements: 0 – built-in, 1 – prefer columns, 2 – prefer rows. |
| rd_column_distance | 0.8 | Maximum threshold value for column detection. Real distance in points. |
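Several keys in the table come in pairs: a fixed value in points and a `_ratio` value that is used when the fixed value is zero. The following Python sketch illustrates that fallback using defaults from the table. The helper name `effective_threshold` and the 10 pt average page font size are hypothetical and are not part of the PDFix template or SDK.

```python
def effective_threshold(fixed: float, ratio: float, font_size: float) -> float:
    """Return the fixed value (points), or ratio * font_size when fixed is zero."""
    return fixed if fixed != 0 else ratio * font_size


# Hypothetical input: an average page font size of 10 pt.
page_font_size = 10.0

# element_isolated_width_min is 0, so element_isolated_width_min_ratio (4) applies.
width_min = effective_threshold(0.0, 4.0, page_font_size)

# element_isolated_width_max is 0, so element_isolated_width_max_ratio (10) applies.
width_max = effective_threshold(0.0, 10.0, page_font_size)

# element_line_width_max is 8 (nonzero), so element_line_width_max_ratio is ignored.
line_width_max = effective_threshold(8.0, 1.0, page_font_size)

print(width_min, width_max, line_width_max)
```

Under these assumptions, leaving the fixed keys at zero makes the limits scale with the page's font size, while a nonzero fixed value pins them regardless of the text size.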
Example:
{
"template": {
"pagemap": [
{
"graphic_table_image_w5": 1,
"text_table_image_col_height_max_ratio": 2,
"preflight_artifact_distance": 0.7
}
]
}
}
Regular Expressions
| key | value |
|---|---|
| regex_hyphen | \\w+-$ |
| regex_bullet | ^[\\u2010\\u2011\\u2212\\u005E\\u005B\\uF0A7\\uF097\\uF0BB\\u25CF\\u2022\\u25D8\\u25CB\\u25D9\\u2023\\u2043\\uF0B7\\u2212\\u204C\\u204D\\u25E6\\u29BE\\u29BF\\u21E8\\u25BA\\u25C4\\u2219\\u25A0\\uF06C\\u25A1\\u005D\\u25C6]$ |
| regex_bullet_font | (Wingdings)\|(Symbol) |
| regex_label | ^[\\[\\(]?((M{0,4}(CM\|CD\|D?C{0,3})(XC\|XL\|L?X{0,3})(IX\|IV\|V?I{0,3}))\|(\\d+)\|([a-zA-Z]))[\\)\\]\\. ]$ |
| label_chars | .()[] |
| regex_decimal_numbering | ^[\\[\\(]?(?:\\d{1,4}\\.){0,5}\\d{0,4}\\s?[\\)\\]\\.]?$ |
| regex_roman_numbering | ^[\\[\\(]?M{0,4}(CM\|CD\|D?C{0,3})(XC\|XL\|L?X{0,3})(IX\|IV\|V?I{0,3})[\\)\\]\\.]?$ |
| regex_letter_numbering | ^[\\[\\(]?[A-Za-z][\\)\\]\\.]$ |
| regex_filling | [._]{2,} |
| regex_filling_chars | ._ |
| regex_page_number | (^\\d+$)\|(^M{0,4}(CM\|CD\|D?C{0,3})(XC\|XL\|L?X{0,3})(IX\|IV\|V?I{0,3})$) |
| regex_first_cap | ^[A-Z] |
| regex_terminal | [\\.\\!\\?]$ |
| regex_table_caption | ((^table)\|(^tab\\.)) |
| regex_image_caption | ((^image)\|(^img\\.)\|(^figure)\|(^fig\\.)) |
| regex_chart_caption | ((^chart)\|(^map)) |
| regex_note_caption | ((^source\\:)\|(^note\\:)) |
| regex_toc_caption | ((^content)\|(^toc)) |
| regex_colon | :$ |
| regex_comma | [,;]$ |
| regex_letter | ^[A-Za-z]$ |
| number_chars | -+.,%\\u20AC$\\u00A5\\u00A3 |
| numbering_splitter_chars | .()[] |
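Before changing one of these patterns, it can help to try it on sample strings. The sketch below decodes three of the default values as JSON strings (so `\\d` becomes `\d`) and tests them with Python's `re` module. Treating the patterns as compatible with Python's regex syntax is an assumption, since the regex engine used by PDFix is not stated here, and the sample strings are made up.

```python
import json
import re

# Values copied from the table as JSON strings. The table's "\|" is Markdown
# escaping for "|", so it is written as a plain "|" here.
raw = r'''
{
  "regex_page_number": "(^\\d+$)|(^M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$)",
  "regex_decimal_numbering": "^[\\[\\(]?(?:\\d{1,4}\\.){0,5}\\d{0,4}\\s?[\\)\\]\\.]?$",
  "regex_filling": "[._]{2,}"
}
'''

# json.loads turns each doubled backslash into a single one before compiling.
patterns = {key: re.compile(value) for key, value in json.loads(raw).items()}


def matches(key: str, text: str) -> bool:
    """Return True when the named pattern is found anywhere in the text."""
    return patterns[key].search(text) is not None


for key, sample in [
    ("regex_page_number", "12"),
    ("regex_page_number", "XIV"),
    ("regex_page_number", "12a"),
    ("regex_decimal_numbering", "1.2.3"),
    ("regex_filling", "Introduction ..... 5"),
]:
    print(key, repr(sample), matches(key, sample))
```

Note that, at least under Python's semantics, the Roman-numeral branch of `regex_page_number` also matches an empty string, because every group in it is optional.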
Example:
{
"template": {
"pagemap_regex": [
{
"label_chars": ".()[]",
"regex_colon": ":$"
}
]
}
}
Functions
element_create
Create user-defined elements.
keys and values:
object_update
The test is triggered when the page content object is tested.
keys and values:
text_run_update
Updates a text run element after processing text objects.
keys and values:
text_run_neighbours
This test is triggered when forming text lines from text runs.
keys and values:
line_update
Updates a line element after detecting horizontal and vertical lines.
keys and values:
rect_update
Updates a rectangle element after detecting rectangles.
keys and values:
element_graphic_neighbours
Tests if two neighbouring path elements can form a single graphic table.
keys and values:
element_graphic_update
Updates line, rectangle, and graphic table elements after they are detected.
keys and values:
word_update
Updates a word element after detecting words.
keys and values:
word_neighbours
This test is triggered when forming text lines from words.
keys and values:
text_line_update
Updates a text line element after detecting text lines.
keys and values:
text_line_neighbours
Tests if two neighbouring text lines can form a paragraph.
keys and values:
text_update
Updates the text element after detecting paragraphs.
keys and values:
image_update
Updates an image after detecting basic images from page objects.
keys and values:
element_update
Updates an element after detecting basic elements.
keys and values:
table_update
Updates a table after the whole process of table detection is done.
keys and values:
cell_update
Updates a table cell after the whole process of table detection is done.
keys and values:
table_split
Updates the table after the entire table detection process is completed.
keys and values:
alt_update
Sets an alternate description for the element. The alternate description is established in a specific order. To skip a step, set the default value to false for that step.
keys and values:
actual_text_update
Sets the actual text for the element. The actual text is established in a specific order. To skip a step, set the default value to false for that step.
keys and values:
artifact_update
Marks an element as an artifact.
keys and values:
label_update
Update elements marked as labels to include them as part of the list.
keys and values:
list_update
Tests if a list is correct.
keys and values:
tag_image
Handles the process of tagging images. For repurposing and accessibility purposes, a Figure element should have either an Alt entry or an ActualText entry in its structure element dictionary. If both are absent, the default behavior is to tag the Figure with an empty alt attribute.
keys and values:
tag_table
Handles the process of tagging tables. For repurposing and accessibility purposes, a table should have headers. If no headers are detected, the default behavior is to leave the table without any
keys and values:
tag_update
Updates the tag after it has been created.
keys and values:
annot_update
Updates the annotation tag after it has been created.
keys and values:
Schema
statement
The conditional statement type of the query. The statement determines whether query evaluation stops once the condition passes.
- values:
['$if', '$elif', '$else']
- default value: $if
keys and values:
- “$if”
- “$elif”
- “$else”
$if
Can be used in all functions. Applies a rule when a condition is true.
- type: statement
$elif
Can be used in all functions. Applies a rule when a condition is true.
- type: statement
$else
Can by used in all functions. Applies a rule when a condition is not true.
- type: statement
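The three statement types read much like a conventional if/elif/else chain. The Python sketch below models that reading: a `$if` starts a chain, a `$elif` is tested only when no earlier branch passed, and a `$else` applies when nothing passed. This interpretation, the `apply_rules` helper, and the tuple-based rule format are all assumptions made for illustration. They do not show how PDFix stores or evaluates template rules.

```python
from typing import Any, Callable, List, Optional, Tuple

# Hypothetical rule shape: (statement, condition, result).
# The condition is None for "$else".
Rule = Tuple[str, Optional[Callable[[dict], bool]], Any]


def apply_rules(rules: List[Rule], element: dict) -> List[Any]:
    """Return the results of every rule that applies to the element."""
    applied = []
    branch_taken = False  # True once a branch of the current chain has passed
    for statement, condition, result in rules:
        if statement == "$if":
            passed = condition(element)  # "$if" starts a new chain
            branch_taken = passed
        elif statement == "$elif":
            passed = not branch_taken and condition(element)
            branch_taken = branch_taken or passed
        elif statement == "$else":
            passed = not branch_taken
            branch_taken = True
        else:
            raise ValueError(f"unknown statement: {statement}")
        if passed:
            applied.append(result)
    return applied


rules = [
    ("$if", lambda e: e["width"] > 100, "wide"),
    ("$elif", lambda e: e["width"] > 50, "medium"),
    ("$else", None, "narrow"),
]
print(apply_rules(rules, {"width": 70}))
```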
query
The query defines thresholds and operations for a pagemap detection.
- type: query
keys and values:
- “param” params:
  - A parameter that represents PdsObject. The value starts with the character $, followed by a number (e.g., $0_width). The number represents the index of the parameter in the param array.
  - A parameter that represents PdeElement. The value starts with the character $, followed by a number (e.g., $0_width). The number represents the index of the parameter in the param array.
  - A parameter that represents PdsStructElem. The value starts with the character $, followed by a number (e.g., $0_width). The number represents the index of the parameter in the param array.
  - A parameter that represents PdfAnnots. The value starts with the character $, followed by a number (e.g., $0_width). The number represents the index of the parameter in the param array.
  - “pdf_rect”
  - “pdf_rgb”
  - “int”: Parameter that represents integer.
  - “bool”: Parameter that represents boolean value.
  - “float”: Parameter that represents floating value.
  - “string”: Parameter that represents string value.
- “var” params:
  - “0_value”
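The `$0_width` naming convention described above packs two things into one string: the index into the param array and a property name. A minimal Python sketch of splitting such a reference follows. The helpers `parse_param_ref` and `resolve`, and the use of plain dictionaries to stand in for parameters, are hypothetical. Only the `$<index>_<name>` shape comes from the description.

```python
import re
from typing import Any, Dict, List, Tuple

# "$", then the parameter index, then "_", then the property name.
PARAM_REF = re.compile(r"^\$(\d+)_(\w+)$")


def parse_param_ref(ref: str) -> Tuple[int, str]:
    """Split a reference such as "$0_width" into (0, "width")."""
    match = PARAM_REF.match(ref)
    if match is None:
        raise ValueError(f"not an indexed parameter reference: {ref!r}")
    return int(match.group(1)), match.group(2)


def resolve(ref: str, params: List[Dict[str, Any]]) -> Any:
    """Look up the referenced property on the parameter at the given index."""
    index, name = parse_param_ref(ref)
    return params[index][name]


# Two stand-in parameters: index 0 and index 1 of a hypothetical param array.
params = [{"width": 120.0, "height": 14.0}, {"width": 80.0, "height": 12.0}]
print(parse_param_ref("$0_width"))
print(resolve("$1_height", params))
```

References that do not start with a number after `$`, such as the general variable `$page_num`, are rejected by this parser and would need separate handling.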
param
Define the number and type of input parameters.
- type: query_param
keys and values:
- “pds_object_params“
- “pde_element_params“
- “pds_struct_elem_params“
- “pdf_annot_params“
- “pdf_rect“
- “pdf_rgb“
- “int“
- “bool“
- “float“
- “string“
int
Parameter that represents integer.
- type: int
bool
Parameter that represents boolean value.
- type: bool
float
Parameter that represents floating value.
- type: float
string
Parameter that represents string value.
- type: string
var
User-defined variables. Use macros to define variables.
- type: var
keys and values:
- “0_value”
logical_operators
Available logical operators.
- type: string
- values:
['$and', '$or', '$not']
keys and values:
- “$and” params:
  - “$and”
  - “$or”
  - “$not”
- “$or” params:
  - “$and”
  - “$or”
  - “$not”
- “$not” params:
  - “$and”
  - “$or”
  - “$not”
$and
Logical AND. All sub-conditions must be true.
- type: logical_operator
keys and values:
- “$and”
- “$or”
- “$not”
- “condition“
$or
Logical OR. At least one sub-condition must be true.
- type: logical_operator
keys and values:
- “$and”
- “$or”
- “$not”
- “condition“
$not
Logical NOT.
- type: logical_operator
keys and values:
- “$and”
- “$or”
- “$not”
- “condition“
comparison_operators
Available comparison operators.
- type: string
- values:
['$eq', '$ne', '$lt', '$lte', '$gt', '$gte', '$regex', '$in', '$nin']
keys and values:
- “$eq”
- “$ne”
- “$lt”
- “$lte”
- “$gt”
- “$gte”
- “$regex”
- “$in”
- “$nin”
$eq
Equal to value.
$ne
Not equal to value.
$lt
Less than value.
$lte
Less than or equal to value.
$gt
Greater than value.
$gte
Greater than or equal to value.
$regex
Regular expression predicate.
- type: comparison_operator
- types: [“string“]
$in
Contain value operator.
- type: comparison_operator
- types: [“bbox“]
$nin
Not contain value operator.
- type: comparison_operator
- types: [“bbox“]
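To make the operator semantics concrete, the following Python sketch evaluates nested conditions built from the logical operators and most of the comparison operators listed above. The tuple-based condition format is a hypothetical stand-in chosen for the example and is not the JSON layout of a PDFix template. `$in` and `$nin` are left out because their bbox semantics are not spelled out here, and using `re.search` for `$regex` is an assumption.

```python
import operator
import re

# Comparison operators mapped to Python equivalents.
COMPARISONS = {
    "$eq": operator.eq,
    "$ne": operator.ne,
    "$lt": operator.lt,
    "$lte": operator.le,
    "$gt": operator.gt,
    "$gte": operator.ge,
    "$regex": lambda value, pattern: re.search(pattern, value) is not None,
}


def evaluate(node) -> bool:
    """Evaluate a condition tree.

    Hypothetical node shapes:
      ("$and", [node, ...])  all sub-conditions must be true
      ("$or",  [node, ...])  at least one sub-condition must be true
      ("$not", node)         negates the sub-condition
      (op, actual, expected) a single comparison
    """
    head = node[0]
    if head == "$and":
        return all(evaluate(child) for child in node[1])
    if head == "$or":
        return any(evaluate(child) for child in node[1])
    if head == "$not":
        return not evaluate(node[1])
    op, actual, expected = node
    return COMPARISONS[op](actual, expected)


# "width greater than 100 AND text does not start with Fig"
condition = ("$and", [
    ("$gt", 120.0, 100),
    ("$not", ("$regex", "Table 1", "^Fig")),
])
print(evaluate(condition))
```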
pds_object_params
List of all pds_object types, can be used as parameter in QUERY->PARAM.
keys and values:
- “pds_text” params:
- “pds_path”
- “pds_form”
- “pds_object” params:
pds_text
Text page object
keys and values:
pds_struct_elem_params
List of all pds_tag types, can be used as parameter in QUERY->PARAM.
keys and values:
- “pds_struct_elem” params:
pdf_annot_params
List of all pdf_annot types, can be used as parameter in QUERY->PARAM.
keys and values:
- “pdf_annot” params:
pde_element_params
List of all pde_element types, can be used as parameter in QUERY->PARAM.
keys and values:
- “pde_text” params:
- “pde_word”
- “pde_image” params:
- “pde_list”
- “pde_rect”
- “pde_cell” params:
- “pde_toc”
- “pde_line” params:
- “pde_table” params:
- “pde_element” params:
general_vars
General variables can be used without parameters. They represent the general state during processing, contain information about the current page and the document, and can be used in any query.
- type: string
keys and values:
- “$page_num”
- “$page_width”
- “$page_height”
- “$page_font_size”
- “$page_min_font_size”
- “$page_max_font_size”
- “$page_rotation”
- “$page_rtl”
- “$page_anchor”
- “$doc_num_pages”
- “$doc_lang”
- “$doc_title”
- “$doc_anchor”
$page_num
Page number.
- type: int
$page_width
Page cropbox width.
- type: float
$page_height
Page cropbox height.
- type: float
$page_font_size
Average font size on the page.
- type: float
$page_min_font_size
Minimal font size on the page.
- type: float
$page_max_font_size
Maximal font size on the page.
- type: float
$page_rotation
Page rotation.
- type: int
- values:
[0, 90, 180, 270]
$page_rtl
Page contains RTL content.
- type: bool
$page_anchor
Anchors already detected on the page.
- type: string
$doc_num_pages
Number of pages in the document.
- type: int
$doc_lang
Document language.
- type: string
$doc_title
Document title.
- type: string
$doc_anchor
Anchors already detected in the document.
- type: string
values
General values used in the JSON default template.
keys and values:
- “type”
- “alt”
- “lang”
- “id”
- “tag_type”
- “contents”
- “title”
- “name”
- “angle”
- “bbox” params:
- “cell_row”
- “col_num”
- “artifact”
- “mcid”
- “has_fill”
- “fill_color” params:
- “stroke_color” params:
- “flag”
- “red”
- “green”
- “blue”
- “heading”
- “width”
- “height”
- “label”
- “left”
- “right”
- “top”
- “bottom”
- “pdf_rect” params:
- “pdf_rgb” params:
- “reflow”
- “row_num”
- “text”
type
Type.
- type: string
- values:
['pds_object', 'pds_text', 'pds_path', 'pds_image', 'pds_shading', 'pds_form', 'pde_element', 'pde_text', 'pde_text_line', 'pde_word', 'pde_text_run', 'pde_image', 'pde_container', 'pde_list', 'pde_line', 'pde_rect', 'pde_table', 'pde_cell', 'pde_toc', 'pde_header', 'pde_footer', 'pde_form_field', 'pde_annot', 'pds_struct_elem', 'pdf_annot']
alt
Alternate description typically used for Figure tags.
- type: string
actual_text
Actual text.
- type: string
lang
The language identifier.
- type: string
id
The unique identifier of the tag.
- type: string
associated_header
The unique identifier of the associated header. For multiple associated headers, use a composed string such as a|b|c|d.
- type: string
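As a small illustration of the composed-string format, the following Python sketch splits and joins header identifiers on the | delimiter. The helper names are hypothetical and are not part of the template or any SDK; dropping empty segments is an assumption made here for robustness.

```python
def split_associated_headers(value: str) -> list[str]:
    """Split a composed string such as 'a|b|c|d' into individual header ids."""
    # Assumption: blank segments (e.g. from 'a||b' or a trailing '|') are ignored.
    return [part.strip() for part in value.split("|") if part.strip()]


def join_associated_headers(ids: list[str]) -> str:
    """Build the composed string from a list of header ids."""
    return "|".join(ids)


if __name__ == "__main__":
    print(split_associated_headers("a|b|c|d"))
    print(join_associated_headers(["hdr_1", "hdr_2"]))
```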
expansion
The expanded form of an abbreviation.
- type: string
has_content
A value identifying whether the object or tag has associated page content.
- type: bool
- values:
['true', 'false']
tag_type
Tag type defined by a string or regular expression. Use .* to match all tags.
- type: string
- values:
['Annot', 'Art', 'Artifact', 'Aside', 'BibEntry', 'BlockQuote', 'Caption', 'Code', 'Div', 'Document', 'DocumentFragment', 'Em', 'FENote', 'Figure', 'Form', 'Formula', 'H', 'H1', 'H2', 'H3', 'H4', 'H5', 'H6', 'Index', 'L', 'Lbl', 'LBody', 'LI', 'Link', 'NonStruct', 'Note', 'P', 'Part', 'Private', 'Quote', 'RB', 'Reference', 'RP', 'RT', 'Ruby', 'Sect', 'Span', 'Strong', 'Sub', 'Table', 'TBody', 'TD', 'TFoot', 'TH', 'THead', 'Title', 'TOC', 'TOCI', 'TR', 'Warichu', 'WP', 'WT']
annot_type
Annotation type defined by a string or regular expression. Use .* to match all annotations.
- type: string
- values:
['Text', 'Link', 'FreeText', 'Line', 'Square', 'Circle', 'Polygon', 'PolyLine', 'Highlight', 'Underline', 'Squiggly', 'StrikeOut', 'Stamp', 'Caret', 'Ink', 'Popup', 'FileAttachment', 'Sound', 'Movie', 'Widget', 'Screen', 'PrinterMark', 'TrapNet', 'Watermark', '3D', 'Redact', 'Projection', 'RichMedia']
contents
A string value specifying the annotation contents.
- type: string
annot_flag
A comma-delimited string value specifying the annotation flags.
- type: string
- values:
['invisible', 'hidden', 'print', 'no_zoom', 'no_rotate', 'no_view', 'read_only', 'locked', 'toggle', 'contents']
title
Title.
- type: string
name
Unique name to identify element later.
- type: string
angle
Angle.
- type: float
bbox
Parameter that represents the bounding box of an object, formatted as an array: [left, bottom, right, top]. Each coordinate can be defined by a float number, general variables, anchor variables, or mathematical functions with previously defined variables. Each bounding box can be associated with only one anchor.
- type: bbox
keys and values:
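To make the [left, bottom, right, top] ordering concrete, here is a minimal Python sketch that models a bounding box and derives one from page dimensions, in the spirit of using $page_width and $page_height as coordinates. The BBox class and top_band helper are hypothetical illustrations, not part of the template language.

```python
from typing import NamedTuple


class BBox(NamedTuple):
    """Bounding box in the order used by the template: left, bottom, right, top."""
    left: float
    bottom: float
    right: float
    top: float

    @property
    def width(self) -> float:
        return self.right - self.left

    @property
    def height(self) -> float:
        return self.top - self.bottom


def top_band(page_width: float, page_height: float, ratio: float) -> BBox:
    """Return the box covering the top `ratio` share of a page (hypothetical helper)."""
    return BBox(0.0, page_height * (1 - ratio), page_width, page_height)


if __name__ == "__main__":
    band = top_band(600.0, 800.0, 0.25)
    print(list(band), band.width, band.height)
```

Note that the bottom coordinate comes before the top one, so the height is top minus bottom.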
cell_column
The column number of the cell in the table.
- type: int
cell_row
The row number of the cell in the table.
- type: int
cell_row_span
The cell row span.
- type: int
cell_column_span
The cell column span.
- type: int
cell_scope
The cell scope.
- type: string
- values:
['row', 'column', 'both']
col_num
Number of columns in the table.
- type: int
children_num
Number of associated child objects.
- type: int
object_num
Number of associated page objects.
- type: int
artifact
True if the object has an Artifact content mark, false otherwise.
- type: bool
- values:
['true', 'false']
mcid
MCID content mark number if it exists, -1 otherwise.
- type: int
has_fill
True if the fill color is set.
- type: bool
- values:
['true', 'false']
fill_color
The fill color of an object.
- type: rgb
keys and values:
has_stroke
True if the stroke color is set.
- type: bool
- values:
['true', 'false']
stroke_color
The stroke color of an object.
- type: rgb
keys and values:
flag
The flag value defines a specific property for an object, which is essential for further processing.
- type: string
- values:
['no_join', 'no_split', 'artifact', 'header', 'footer', 'splitter', 'no_table', 'no_image', 'no_expand', 'continuous', 'anchor']
numbering
Set the list numbering attribute.
- type: string
- values:
['None', 'Unordered', 'Disc', 'Circle', 'Square', 'Ordered', 'Decimal', 'UpperRoman', 'LowerRoman', 'UpperAlpha', 'LowerAlpha', 'Description']
single_instance
Properties to compare, delimited by |. If an element with the same properties already exists, only the first instance is tagged.
- type: string
- values:
['type', 'width', 'height', 'left', 'right', 'top', 'bottom', 'bbox', 'font_size', 'font_name', 'text', 'fill_color', 'stroke_color', 'angle', 'alt', 'actual_text', 'flag', 'word_flag', 'text_line_flag', 'text_flag', 'lang', 'cell_column', 'cell_row', 'cell_column_span', 'cell_row_span', 'cell_scope', 'row_num', 'col_num']
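The first-instance rule can be sketched in Python as follows. This is only an illustration of the described behavior under the assumption that elements are plain dictionaries keyed by property name; the function name is hypothetical and the real implementation may compare properties differently (for example with tolerances).

```python
def first_instances(elements: list[dict], single_instance: str) -> list[dict]:
    """Keep only the first element for each combination of the listed properties."""
    keys = [k for k in single_instance.split("|") if k]
    seen = set()
    kept = []
    for element in elements:
        # repr() makes unhashable values such as bbox lists usable in a set.
        signature = tuple(repr(element.get(k)) for k in keys)
        if signature in seen:
            continue  # an element with the same properties was already kept
        seen.add(signature)
        kept.append(element)
    return kept


if __name__ == "__main__":
    items = [
        {"type": "pde_text", "text": "Confidential", "font_size": 9.0},
        {"type": "pde_text", "text": "Confidential", "font_size": 9.0},
        {"type": "pde_text", "text": "Chapter 1", "font_size": 18.0},
    ]
    print(first_instances(items, "text|font_size"))
```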
word_space
Updates the word spacing for the font, in points.
- type: float
font_name
The name of the font used in the text object.
- type: string
font_size
The size of the font used in the text object.
- type: float
red
The red component of an RGB color.
- type: int
green
The green component of an RGB color.
- type: int
blue
The blue component of an RGB color.
- type: int
cell_header
Marks the object as a table header.
- type: bool
- values:
['true', 'false']
cell_associated_header
Cell associated headers delimited by |.
- type: string
heading
Sets the text heading style.
- type: string
- values:
['normal', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'h7', 'h8', 'note', 'title']
width
The object’s width dimension.
- type: float
height
The object’s height dimension.
- type: float
label
Marks the element as a list label.
- type: string
- values:
['label', 'li_1', 'li_2', 'li_3', 'li_4', 'label_no']
left
The left coordinate of the object.
- type: float
right
The right coordinate of the object.
- type: float
top
The top coordinate of the object.
- type: float
bottom
The bottom coordinate of the object.
- type: float
baseline_x
The baseline x coordinate of the text object.
- type: float
baseline_y
The baseline y coordinate of the text object.
- type: float
pdf_rect
Parameter that represents the bounding box of an object, formatted as an array: [left, bottom, right, top].
- type: rec
keys and values:
pdf_rgb
Parameter that represents the RGB color of an object, formatted as an array: [red, green, blue].
- type: rgb
keys and values:
reflow
Text reflow. If set to false, each line is treated as a new line.
- type: bool
- values:
['true', 'false']
row_num
The number of rows in the table.
- type: int
table_type
The table type represented as a value from the PdfTableType enum.
- type: string
- values:
['graphic', 'isolated', 'row', 'col', 'form']
text
The text to be used as a value.
- type: string
text_flag
The flag to be used for the text element, specifying a value similar to the regex flags.
- type: string
- values:
['table_caption', 'image_caption', 'chart_caption', 'note_caption', 'filling', 'uppercase', 'new_line', 'no_new_line']
text_line_flag
The flag to be used for the text line element, specifying a value similar to the regex flags.
- type: string
- values:
['hyphen', 'new_line', 'indent', 'terminal', 'drop_cap', 'filling', 'uppercase', 'no_new_line']
text_state_flag
The flag to be used for the text state.
- type: string
- values:
['underline', 'strikeout', 'highlight', 'subscript', 'superscript', 'no_unicode', 'white_space', 'unicode']
word_flag
The flag to be used for the word element, specifying a value similar to the regex flags.
- type: string
- values:
['hyphen', 'bullet', 'colon', 'number', 'subscript', 'superscript', 'terminal', 'capital', 'image', 'decimal_num', 'roman_num', 'letter_num', 'page_num', 'filling', 'uppercase', 'comma', 'no_unicode']
suffix
Container holding all unique suffixes used for naming in the JSON default template.
keys and values:
condition
Condition types used in the query.
keys and values:
- “comparison” params:
  - “$eq”
- “comparison_array” params:
  - “$gt”
  - “$lt”
-
condition_value
{0_width : 100}
comparison
{0_width : {$lt : 100}}
keys and values:
- “$eq”
comparison_array
{0_width : [{$lt : 100}, {$gt : 100}, …]}
keys and values:
- “$gt”
- “$lt”
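The three condition shapes above (a plain value, a single comparison object, and an array of comparisons) can be modeled with a short Python sketch. It covers only the value side of a condition; how a key such as 0_width is resolved to an actual number is outside its scope. The source does not state whether the entries of a comparison_array are combined with AND or OR, so the sketch takes the combining function as a parameter. The function name and the default of `all` are assumptions.

```python
import operator

# Only the operators named in this part of the reference are modeled.
OPERATORS = {"$eq": operator.eq, "$gt": operator.gt, "$lt": operator.lt}


def matches(value, condition, combine=all) -> bool:
    """Evaluate a condition_value, comparison, or comparison_array against a value."""
    if isinstance(condition, dict):
        # comparison: {"$lt": 100}
        return all(OPERATORS[op](value, operand) for op, operand in condition.items())
    if isinstance(condition, list):
        # comparison_array: [{"$gt": 100}, {"$lt": 200}]; combine is an assumption
        return combine(matches(value, item) for item in condition)
    # condition_value: plain equality, e.g. 100
    return value == condition


if __name__ == "__main__":
    print(matches(100, 100))
    print(matches(50, {"$lt": 100}))
    print(matches(150, [{"$gt": 100}, {"$lt": 200}]))
    print(matches(150, [{"$lt": 100}, {"$gt": 100}], combine=any))
```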
keywords
Container holding all unique keywords used in the JSON default template.
keys and values:
general
Holds general data such as version, date, id, SDK version, …
template
Holds all functions.
query
Can be used in all functions. Each QUERY must have a child PARAM, which holds an array of parameters for the specified query objects.
param
Child of the QUERY. Each QUERY must include a PARAM that specifies the object types used for evaluation.
- type: array_param
statement
The if statement should be used in function nodes. Based on the statement, the query evaluation stops upon pass or fail. If the if statement is not present, the condition is considered disabled.
- type: string
- values:
['$if', '$elif', '$else']
disable
Can be used in all main function nodes. If the value is true, the node is not executed. The default value is false.
- type: bool
- values:
['true', 'false']
purpose
Describes the user-defined purpose or description of the QUERY.
- type: string
insert
Values to be added as the default for the node.
keys and values:
math_expressions
Mathematical functions used to define custom variables.
- type: string
- values:
['SUM()', 'MINUS()', 'ABS()', 'MULTIPLY()', 'DIVIDE()', 'MIN()', 'MAX()', 'MOD()', 'FLOOR()', 'CEILING()']
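The reference lists only the function names, so the following Python sketch is a guess at their meaning based on those names alone. The argument counts, the behavior of MOD for negative operands, and the `call` helper are all assumptions; the template's own syntax for passing arguments is not shown in this section.

```python
import math

# Assumed semantics, inferred from the function names only.
FUNCTIONS = {
    "SUM": lambda *args: sum(args),
    "MINUS": lambda a, b: a - b,
    "ABS": lambda a: abs(a),
    "MULTIPLY": lambda *args: math.prod(args),
    "DIVIDE": lambda a, b: a / b,
    "MIN": lambda *args: min(args),
    "MAX": lambda *args: max(args),
    "MOD": lambda a, b: a % b,  # Python's % follows the sign of the divisor
    "FLOOR": lambda a: math.floor(a),
    "CEILING": lambda a: math.ceil(a),
}


def call(name: str, *args):
    """Apply one of the listed functions by name (hypothetical helper)."""
    return FUNCTIONS[name](*args)


if __name__ == "__main__":
    page_height = 800.0
    # A custom variable: 15 percent of the page height, rounded down.
    print(call("FLOOR", call("MULTIPLY", page_height, 0.15)))
    print(call("MAX", call("SUM", 1, 2), 5))
```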