Configurations for Text Recognition

Use these configurations to define the settings for text recognition.

Text: Advanced Settings

For some documents, it may be necessary to define further properties for text recognition in the configuration, which is found under Text > Advanced Settings.

This is not necessary for most documents and is also not recommended. However, if it is necessary to define additional properties, DocuWare Support will be happy to assist you in setting the correct properties for your specific documents.

Here you will find the properties listed with the possible values:

Property

Possible values

AutoDeskew

  • true (default), false

Optional. If set to true, the document pages are deskewed before text recognition is done. The original document pages will not be modified.

AutoRotate

  • true (default), false

If set to true, the document pages are rotated before text recognition is done. The original document will not be modified.

DespeckleMode

  • true (default), false

If set to true, the document pages are despeckled before text recognition is done. The original document page will not be modified.

FaxImageMode

  • true, false (default)

To be used only with Module property set to MOR. If the image file to be loaded is a fax message transmitted in Standard or Draft mode with a low resolution, set value to true.

FillingMethod

Defines the text font to be recognized.

  • DEFAULT: No restrictions for recognition modules.

  • DASHDIGIT: See for example Dash Digit Font (Module: MAT).

  • DRAFTDOT9: Denotes a 9-pin draft dot-matrix printout (Module: PLUS3, PLUS2, DOT, MTX).

  • DRAFTDOT24: Denotes a 24-pin draft dot-matrix printout (Module: PLUS3, PLUS2, MOR, FRX, MTX).

  • DOTDIGIT: See for example Dot Digit Font (Module: MAT).

  • OCRA: See for example OCRA font (Module: MOR, MTX, MAT, RER).

  • OCRB: See for example OCRB font (Module: MOR, MTX, MAT, RER).

  • OMNIFONT (default): Denotes a machine printed text with any font not highly stylized (Module: PLUS3, PLUS2, MOR, FRX, MTX).

  • OMR: Denotes a zone with one or more checkboxes that are judged to be marked or unmarked (Module: OMR).
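
The module restrictions listed above can be checked before a configuration is saved. The following is a minimal sketch, not part of DocuWare: the lookup table copies the modules named in the list, while the helper name is_compatible and the decision to accept Module = AUTO for every filling method are assumptions made for illustration.

```python
# Modules named for each FillingMethod value in the list above.
# DEFAULT has no restrictions, so it maps to None.
FILLING_METHOD_MODULES = {
    "DEFAULT": None,
    "DASHDIGIT": {"MAT"},
    "DRAFTDOT9": {"PLUS3", "PLUS2", "DOT", "MTX"},
    "DRAFTDOT24": {"PLUS3", "PLUS2", "MOR", "FRX", "MTX"},
    "DOTDIGIT": {"MAT"},
    "OCRA": {"MOR", "MTX", "MAT", "RER"},
    "OCRB": {"MOR", "MTX", "MAT", "RER"},
    "OMNIFONT": {"PLUS3", "PLUS2", "MOR", "FRX", "MTX"},
    "OMR": {"OMR"},
}


def is_compatible(filling_method: str, module: str) -> bool:
    """Hypothetical helper: True if the module is listed for the filling method."""
    allowed = FILLING_METHOD_MODULES[filling_method]
    # Assumption: AUTO lets the engine choose, so it is never rejected here.
    if allowed is None or module == "AUTO":
        return True
    return module in allowed
```

For example, is_compatible("OCRA", "MAT") is true, while is_compatible("DASHDIGIT", "MOR") is false because DASHDIGIT is listed only for the MAT module.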

Filter

Specifies a character set to be recognized. Parameter: flags. Value: a hexadecimal number from 0x01 to 0x2F, formed by combining the flags below.

  • 0x01: Digits

  • 0x02: Uppercase letters

  • 0x04: Lowercase letters

  • 0x08: Punctuation characters, other characters

  • 0x10: Other characters

  • 0x20: Characters specified in the FilterPlus option

Samples:

  • 0x06: All characters of an alphabet

  • 0x07: Alphanumeric characters

  • 0x21: Numbers and characters that are defined in FilterPlus

  • 0x1F: All characters (default)

  • 0x2F: All characters (default) plus the characters specified in FilterPlus
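
The sample values are bitwise combinations of the individual flags. The sketch below shows how such a value can be computed. The flag values are taken from the list above; the class name Filter, the member names, and the helper filter_value are illustrative assumptions and not DocuWare identifiers.

```python
from enum import IntFlag


class Filter(IntFlag):
    """Flag values as listed for the Filter property (member names are illustrative)."""
    DIGITS = 0x01
    UPPERCASE = 0x02
    LOWERCASE = 0x04
    PUNCTUATION = 0x08
    OTHER = 0x10
    FILTER_PLUS = 0x20


def filter_value(*flags: Filter) -> str:
    """Combine flags with bitwise OR and format the result as a hexadecimal string."""
    combined = 0
    for flag in flags:
        combined |= flag
    return f"0x{int(combined):02X}"
```

Combining UPPERCASE and LOWERCASE yields 0x06, adding DIGITS yields 0x07, and combining DIGITS with FILTER_PLUS yields 0x21, which matches the samples listed above.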

FilterPlus

  • Any string (default: empty string)

Specifies a set of individual characters that extends the set of characters that can be recognized. The string should contain characters that are not part of the selected languages.

Module

  • AUTO (default)

  • ASIAN: Provides recognition services for the CCJK languages with horizontal or vertical text direction: Japanese, Korean, Traditional Chinese and Simplified Chinese. Also recognizes Arabic text. Can handle short embedded English texts within either CCJK or Arabic text.

  • DOT: Designed only for draft-quality 9-pin dot-matrix texts.

  • FRX: Recognizes machine-printed text, i.e. from printed publications, laser or inkjet printers, and electric typewriters, output from mechanical typewriters in good condition, LQ or NLQ output from dot-matrix printers.

  • MAT: Reads certain groups of fixed-font characters specifically designed for text recognition or imaging applications where no two characters have similar shapes. Each group of characters has its own fill method.

  • MOR: Recognizes machine printed text, i.e. from printed publications, laser or ink-jet printers, and electric typewriters. Output from mechanical typewriters in good condition may also be acceptable. It can also be used for LQ or NLQ output from dot-matrix printers.

  • MTX: Recognizes machine-printed text, i.e. from printed publications, laser or ink-jet printers, and electric typewriters, good quality output from mechanical typewriters, letter or near-letter quality output from dot-matrix printers, and draft quality output.

  • OMR: Used for recognizing optical marks (checkmarks) in questionnaires, ballot papers, educational tests, and reporting or ordering sheets, where the documents to be processed are form-like and usually filled by hand.

  • PLUS2 and PLUS3: The PLUS2W and PLUS3W engines are voting engines that combine the results of the other OMNIFONT text recognition engines of the CSDK. In different trade-off modes, they use different engine combinations. This recognition module recognizes machine-printed text, i.e., from printed publications, laser or inkjet printers, and electric typewriters as well as output from mechanical typewriters in good condition.

NonGriddedTableDetect

  • true (default), false

If set to true, tables that have no grid lines will be detected more confidently.

OcrPageMaximum

  • Any integer (default: 25)

Specifies the number of pages that are extracted by OCR/DynaPDF in DocuWare Desktop. If this key is set, it overwrites the value OcrPageMaximum. Only used if the text recognition settings are used in a document processing configuration of DocuWare.

ProcessingMode

  • AUTO (default)

  • NORMAL

  • GRAPHICS_ONLY

  • PDF_PM_TEXT_ONLY

  • PDF_PM_TEXT_ONLY_EXT

  • PDF_PM_AS_IMAGE

Should only be used if the text recognition settings are used in a document processing configuration of DocuWare.

RecognitionMode

  • ALWAYSRECOGNIZE (default): Combines the characters from the text recognition result with the PDF text.

  • ALWAYSGETTEXT: Uses the PDF text.

  • ALWAYSRECOGNIZEASIMAGE: Uses the PDF text relying on the text recognition result only to determine the spaces between words (quickest).

  • MOSTLYGETTEXT: Same as ALWAYSGETTEXT, except that it behaves like ALWAYSRECOGNIZE for any PDF page on which a font character coding problem is detected.

Specifies the usage of text data coming from normal PDF files. Should only be used if the text recognition settings are used in a document processing configuration of DocuWare.

RejectionSymbol

  • Any character that is not part of the text to be recognized (default: '~')

Specifies the character to be used as a symbol for the unrecognized and thus rejected characters in the final output document.
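
Because the rejection symbol marks every unrecognized character, counting it in the recognized text gives a rough quality indicator. The following sketch assumes that the recognized text is available as a plain string; the function name rejection_rate is hypothetical and not part of DocuWare.

```python
def rejection_rate(text: str, rejection_symbol: str = "~") -> float:
    """Return the share of non-whitespace characters that are rejection symbols."""
    if len(rejection_symbol) != 1:
        raise ValueError("rejection_symbol must be a single character")
    # Whitespace is ignored so that layout does not dilute the rate.
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        return 0.0
    rejected = sum(1 for ch in visible if ch == rejection_symbol)
    return rejected / len(visible)
```

This only works as intended if the chosen symbol does not occur in the document text itself, which is why the property asks for a character that is not part of the text to be recognized.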

ReturnAllLines

  • true (default), false

If set to true, all lines that can be detected by text recognition will be returned in the textshot (including the lines in tables).

SureText

  • true, false (default)

If set to true, text in zones that are marked as noisy is also recognized.

ThresholdForImageConversion

  • 0-255 (default: 128)

Determines which pixels are converted to black or white during the text recognition pre-processing. Can be used when bright characters with low contrast are not recognized because they are converted to white in the image pre-processing. Using a value lower than 128 can remove bright lines, for example, and this can affect the recognition quality.
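
The effect of the threshold can be illustrated with a simple fixed-threshold binarization. This is only a sketch: it assumes 8-bit grayscale values where 0 is black and 255 is white, and it assumes that pixels darker than the threshold become black while all others become white. The exact comparison used by the recognition engine is not documented here, and the function name binarize is hypothetical.

```python
def binarize(gray_rows, threshold=128):
    """Convert 8-bit grayscale rows to black (0) and white (255) pixels.

    Assumption: values below the threshold become black, all others white.
    """
    if not 0 <= threshold <= 255:
        raise ValueError("threshold must be between 0 and 255")
    return [
        [0 if value < threshold else 255 for value in row]
        for row in gray_rows
    ]
```

Under these assumptions, a bright character pixel with the gray value 150 turns white at the default threshold of 128 and is lost, but stays black when the threshold is raised to 160.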

TreatGraphicAsFlow

  • true, false (default)

If set to true, graphic zones will be treated as flow zones. Should be specified if the textshot has red (graphic) zones containing text.

UseFreeFormInPageDescriptor

  • true, false (default)

If set to true, better results can be achieved for zones containing characters of different sizes. In addition, the ZonehandlingModule property must be set to STANDARD.

UseOcrForNativePdf

  • true, false (default)

Should only be applied if the text recognition settings are used in a document processing configuration of DocuWare. If set to true, the Kofax (formerly Nuance) Toolkit is used instead of DynaPDF to extract text from native PDFs. In combination with RecognitionMode = ALWAYSRECOGNIZEASIMAGE and ProcessingMode = PDF_PM_AS_IMAGE, this setting forces text extraction to be done with the Kofax CSDK: all pages of a native PDF are rendered, and OCR technology is used for text extraction. The text contained in the native PDF is ignored. This setting is appropriate for native PDFs where text extraction with the DynaPDF Toolkit returns nonsense characters.
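
The combination described above can be checked with a small helper. This sketch assumes that the advanced settings are available as a dictionary of property names and string values, which is an illustrative representation and not a DocuWare format. The function name forces_full_ocr is hypothetical; the fallback values are the defaults listed for each property.

```python
def forces_full_ocr(settings: dict) -> bool:
    """True if the three properties are set so that native PDF text is ignored."""
    # Missing keys fall back to the documented default of each property.
    use_ocr = settings.get("UseOcrForNativePdf", "false").lower() == "true"
    as_image_mode = settings.get("RecognitionMode", "ALWAYSRECOGNIZE") == "ALWAYSRECOGNIZEASIMAGE"
    as_image_processing = settings.get("ProcessingMode", "AUTO") == "PDF_PM_AS_IMAGE"
    return use_ocr and as_image_mode and as_image_processing
```

The check returns true only when all three properties are set as described. Setting UseOcrForNativePdf alone is not enough for it to pass.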

ZonehandlingModule

  • AUTO (default): Uses LEGACY if TradeOff is set to FAST, otherwise uses STANDARD.

  • LEGACY

  • STANDARD

  • FAST

Specifies the algorithm used for decomposing the page layout. Changing the page layout algorithm can be helpful if no text is recognized for an area of a document page that contains text.
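
The rule for the AUTO value can be written as a short function. This is a sketch of the documented rule only: the function name resolve_zone_handling is hypothetical, and an empty string stands in for any TradeOff value other than FAST, since the other TradeOff values are not listed here.

```python
def resolve_zone_handling(zone_handling: str = "AUTO", trade_off: str = "") -> str:
    """Return the page layout algorithm that is effectively used."""
    # An explicitly chosen algorithm is used as configured.
    if zone_handling != "AUTO":
        return zone_handling
    # AUTO: LEGACY if TradeOff is FAST, otherwise STANDARD.
    return "LEGACY" if trade_off == "FAST" else "STANDARD"
```

With the default AUTO, a TradeOff of FAST therefore resolves to LEGACY, and any other TradeOff resolves to STANDARD.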