Use these configurations to define the settings for text recognition.
Text: Advanced Settings
For some documents, it may be necessary to define further properties for text recognition in the configuration, which is found under Text > Advanced Settings.
This is not necessary for most documents and is also not recommended. However, if it is necessary to define additional properties, DocuWare Support will be happy to assist you in setting the correct properties for your specific documents.
Here you will find the properties listed with the possible values:
Property | Possible values |
AutoDeskew |
Optional. If set to true, the document pages are deskewed before text recognition is done. The original document pages will not be modified. |
AutoRotate |
If set to true, the document pages are rotated before text recognition is done. The original document will not be modified. |
DespeckleMode |
If set to true, the document pages are despeckled before text recognition is done. The original document page will not be modified. |
FaxImageMode |
Use only when the Module property is set to MOR. If the image file to be loaded is a fax message transmitted in Standard or Draft mode with a low resolution, set the value to true. |
FillingMethod | Defines the text font to be recognized.
|
Filter | Specifies a character set to be recognized. Parameter: flags. Value: hexadecimal, ranging from 0x01 to 0x2F.
Samples:
|
FilterPlus |
Specifies a set of individual characters enlarging the character set of the characters that can be recognized. The string should contain characters that are not part of the selected languages. |
Module |
|
NonGriddedTableDetect |
If set to true, tables that have no grid lines will be detected more confidently. |
OcrPageMaximum |
Specifies the number of pages that are extracted by OCR/DynaPDF in DocuWare Desktop. If this key is set, it overwrites the value OcrPageMaximum. Only used if the text recognition settings are used in a document processing configuration of DocuWare. |
ProcessingMode |
Should only be used if the text recognition settings are used in a document processing configuration of DocuWare. |
RecognitionMode |
Specifies the usage of text data coming from normal PDF files. Should only be used if the text recognition settings are used in a document processing configuration of DocuWare. |
RejectionSymbol |
Specifies the character to be used as a symbol for the unrecognized and thus rejected characters in the final output document. |
ReturnAllLines |
If set to true, all lines that can be detected by text recognition will be returned in the textshot (including the lines in tables). |
SureText |
If set to true, text in zones that are marked as noisy is also recognized. |
ThresholdForImageConversion |
Determines which pixels are converted to black or white during the text recognition pre-processing. Can be used when bright characters with low contrast are not recognized because they are converted to white in the image pre-processing. Using a value lower than 128 can remove bright lines, for example, and this can affect the recognition quality. |
TreatGraphicAsFlow |
If set to true, graphic zones will be treated as flow zones. Should be specified if the textshot has red (graphic) zones containing text. |
UseFreeFormInPageDescriptor |
If set to true, better results can be achieved for zones containing characters of different sizes. In addition, the ZonehandlingModule property must be set to STANDARD. |
UseOcrForNativePdf |
Should only be applied if the text recognition settings are used in a document processing configuration of DocuWare. If set to true, the Kofax (formerly Nuance) Toolkit is used instead of DynaPDF for text extraction from native PDFs. In combination with RecognitionMode = ALWAYSRECOGNIZEASIMAGE and ProcessingMode = PDF_PM_AS_IMAGE, this setting forces text extraction to be done with the Kofax CSDK: all pages of a native PDF are rendered and OCR technology is used for text extraction. The text contained in the native PDF is ignored. This setting is appropriate for native PDFs where text extraction with the DynaPDF Toolkit returns nonsense characters. |
ZonehandlingModule |
Specifies the algorithm used for decomposing the page layout. Changing the page layout algorithm can be helpful if no text is recognized for an area of a document page that contains text. |
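
To illustrate what ThresholdForImageConversion does, here is a minimal sketch of black-and-white conversion with a threshold. It is not DocuWare's implementation: the binarize function, the pixel values, and the rule that values at or above the threshold become white are assumptions made for illustration, and 128 is used as the reference value only because the property description compares against it.

```python
def binarize(pixels, threshold=128):
    """Map 8-bit grayscale values to black (0) or white (255).

    Assumption for this sketch: values at or above the threshold
    become white, values below it become black.
    """
    return [255 if p >= threshold else 0 for p in pixels]


# One row of a scanned page (made-up values): white paper (250),
# a bright low-contrast stroke (150), and a dark stroke (40).
row = [250, 150, 40, 150, 250]

# With the reference value 128, the bright stroke turns white and is lost.
default_result = binarize(row)

# A higher threshold keeps the bright stroke as black pixels.
raised_result = binarize(row, 180)

# A lower threshold turns more pixels white; here even the dark stroke vanishes.
lowered_result = binarize(row, 30)
```

In this sketch, raising the threshold preserves bright, low-contrast characters, while lowering it removes progressively more content, which matches the warning that values below 128 can affect recognition quality.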