OCR PDF or images

Generate searchable PDF from an image PDF or scanned images.

Input Parameters

Required Parameters

Source file content:
Data Type = string (byte – base 64 string)
Content of the file to OCR

Source file name with extension:
Data Type = string
The source file name with extension or just the extension (with a leading period ‘.’)
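
As a rough illustration of the two required inputs, the sketch below base 64 encodes a file's bytes and pairs them with a file name. The key names `sourceFileContent` and `sourceFileName` are hypothetical placeholders; the actual field names depend on how the service is called.

```python
import base64


def build_required_parameters(file_bytes: bytes, file_name: str) -> dict:
    """Return the two required inputs as a JSON-ready dictionary.

    The key names are hypothetical placeholders, not confirmed field names.
    """
    # Accept either a full name ("scan.pdf") or just an extension (".pdf").
    if "." not in file_name or file_name.endswith("."):
        raise ValueError("expected a name with an extension, e.g. 'scan.pdf' or '.pdf'")
    return {
        # Hypothetical key: the file content as a base 64 string.
        "sourceFileContent": base64.b64encode(file_bytes).decode("ascii"),
        # Hypothetical key: the file name, or only the extension with a leading period.
        "sourceFileName": file_name,
    }


params = build_required_parameters(b"%PDF-1.4 example bytes", "scan.pdf")
print(params["sourceFileName"])
```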

Optional Parameters

Password:
Data Type = string
The password to open the source PDF file

Language:
Data Type = string
Selecting one of the options below sets the language to be used for the OCR processing. The default language is English.

    "English"
    "German"
    "French"
    "Russian"
    "Swedish"
    "Spanish"
    "Italian"
    "Russian_English"
    "Ukrainian"
    "Serbian"
    "Croatian"
    "Polish"
    "Danish"
    "Portuguese"
    "Dutch"
    "Czech"
    "Romanian"
    "Hungarian"
    "Bulgarian"
    "Slovenian"
    "Latvian"
    "Lithuanian"
    "Estonian"
    "Turkish"

Auto-rotate:
Data Type = boolean
Auto-rotate the image so that all text is oriented normally.

Binarize:
Data Type = integer
This value should generally only be used under guidance from technical support. It can control the way that color images are processed and force binarization with a particular threshold. A value of 200 has been shown to generally give good results in testing, but this should be confirmed with “typical” customer documents. By setting this to -1 an alternative method is used which will attempt to separate the text from any background images or colors. This can give improved OCR results for certain documents such as newspaper and magazine pages.
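
To give a feel for what a fixed binarization threshold does, here is a minimal sketch on a tiny grayscale grid. It assumes that values below the threshold become black (1) and all others white (0); the engine's actual rule and the alternative -1 method are not documented here and may differ.

```python
def binarize(gray_rows, threshold=200):
    """Map each grayscale value (0-255) to 1 (black) or 0 (white).

    Assumption: values strictly below the threshold count as black.
    """
    return [[1 if value < threshold else 0 for value in row] for row in gray_rows]


page = [[250, 180, 30], [255, 210, 199]]
print(binarize(page))  # [[0, 1, 1], [0, 0, 1]]
```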

Black pixel limit:
Data Type = float
Contact technical support (support@aquaforest.com) for guidance on using this property.

Blank page threshold:
Data Type = integer
Use this to set the minimum number of “On Pixels” that must be present in the image for a page not to be considered blank. A value of -1 will turn off blank page detection.
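
The blank page rule described above can be sketched as a simple pixel count. This assumes a bi-tonal image stored as rows of 0 and 1, with 1 standing for an “On Pixel”; it illustrates the rule and is not the engine's implementation.

```python
def is_blank_page(bitonal_rows, blank_page_threshold):
    """Return True when the page has fewer on pixels than the threshold."""
    if blank_page_threshold == -1:
        # A value of -1 turns blank page detection off.
        return False
    on_pixels = sum(sum(row) for row in bitonal_rows)
    return on_pixels < blank_page_threshold


nearly_empty = [[0, 0, 0], [0, 1, 0]]
print(is_blank_page(nearly_empty, 5))   # True
print(is_blank_page(nearly_empty, -1))  # False
```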

Box size:
Data Type = integer
This option is ideal for forms where boxes around text can sometimes cause an area to be identified as graphics. This option removes boxes from the temporary copy of the image used by the OCR engine. It does not remove boxes from the final image. Technically, this option removes connected elements with a minimum area (in pixels and defined by this property). This option is currently only applied for bi-tonal images.

Deskew:
Data Type = boolean
Deskew (straighten) the image.

Despeckle:
Data Type = integer
This removes all disconnected elements within the image that have height or width in pixels less than the specified figure. The maximum value is 9 and the default value is 0.
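
The following sketch shows one way to read the despeckle rule: find each connected black element and clear it when its height or width is below the given figure. The 4-connected flood fill and the 0/1 grid representation are assumptions made for illustration; the engine's own notion of a disconnected element may differ.

```python
def despeckle(grid, size):
    """Clear connected black elements whose height or width is below `size`."""
    rows, cols = len(grid), len(grid[0])
    result = [row[:] for row in grid]
    seen = [[False] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != 1 or seen[r][c]:
                continue
            # Flood fill (4-connected, an assumption) to collect one element.
            stack, element = [(r, c)], []
            seen[r][c] = True
            while stack:
                y, x = stack.pop()
                element.append((y, x))
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and grid[ny][nx] == 1 and not seen[ny][nx]):
                        seen[ny][nx] = True
                        stack.append((ny, nx))
            height = max(y for y, _ in element) - min(y for y, _ in element) + 1
            width = max(x for _, x in element) - min(x for _, x in element) + 1
            if height < size or width < size:
                for y, x in element:
                    result[y][x] = 0
    return result
```

With the default value of 0 nothing is removed, because no element can be smaller than one pixel in either direction.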

Grayscale quality:
Data Type = integer
Contact technical support (support@aquaforest.com) for guidance on using this property.

Jbig2EncFlags:
Data Type = string
These are the flags that will be passed to the application used to generate JBIG2 versions of images used in PDF generation (assuming this compression is enabled). This option should generally only be used under guidance from technical support.

LibTiffSavePageAsBmp:
Data Type = boolean
Sometimes if there is an image which is 1bpp and has LZW compression, the pre-processing can cause the color of the image to be inverted (black to white and white to black). Set this to true to avoid this.

Maximum deskew:
Data Type = float
Maximum angle by which a page will be deskewed. This option should generally only be used under guidance from technical support (support@aquaforest.com).

Minimum deskew confidence:
Data Type = string
This option should generally only be used under guidance from technical support (support@aquaforest.com).

Morph:
Data Type = string
Morphological options that will be applied to the binarized image before OCR. If left empty, no option is applied. Common options include those listed below; for more options, please contact support@aquaforest.com:

Possible values

    d2.2: 2x2 dilation applied to all black pixel areas, useful for faint prints.
    e2.2: 2x2 erosion applied to all black pixel areas, useful for heavy prints.
    c2.2: closing process that performs a 2x2 dilation followed by a 2x2 erosion with the result that holes and gaps in the characters are filled.

Contact technical support (support@aquaforest.com) for guidance on using this property.
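
To illustrate what the 2x2 dilation, erosion and closing named above do to black pixels, here is a small sketch on a 0/1 grid (1 = black). The position of the 2x2 window relative to each pixel and the white margin used at the borders are assumptions; the engine's exact behaviour is not documented here.

```python
def _pixel(grid, r, c):
    """Return the pixel at (r, c), treating anything outside the grid as white (0)."""
    if 0 <= r < len(grid) and 0 <= c < len(grid[0]):
        return grid[r][c]
    return 0


def dilate_2x2(grid):
    """Each black pixel grows into a 2x2 block (to the right and downwards)."""
    return [[max(_pixel(grid, r - dr, c - dc) for dr in (0, 1) for dc in (0, 1))
             for c in range(len(grid[0]))] for r in range(len(grid))]


def erode_2x2(grid):
    """A pixel stays black only if the whole 2x2 block starting at it is black."""
    return [[min(_pixel(grid, r + dr, c + dc) for dr in (0, 1) for dc in (0, 1))
             for c in range(len(grid[0]))] for r in range(len(grid))]


def close_2x2(grid):
    """Dilate then erode; a white margin is added first so borders survive the erosion."""
    padded = [row + [0] for row in grid] + [[0] * (len(grid[0]) + 1)]
    closed = erode_2x2(dilate_2x2(padded))
    return [row[:-1] for row in closed[:-1]]


# A stroke with a one-pixel-wide gap in the middle column.
stroke = [[1, 1, 0, 1, 1] for _ in range(3)]
print(close_2x2(stroke))
```

In this example the closing fills the one-pixel gap, which is the effect the c2.2 description refers to.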

Remove Blank Pages:
Data Type = boolean
Remove blank pages when BlankPageThreshold is greater than -1 and ConvertToTiff is true.

Remove Lines:
Data Type = boolean
Remove lines from images for better recognition.

Save Pre-despeckle:
Data Type = boolean
This will use the original image (i.e. before applying pre-processing) in the output PDF.

Compress PDF (MRC):
Data Type = boolean
This enables Mixed Raster Compression, which can dramatically reduce the output size of PDFs comprising color scans. Note that this option is only suitable when the source is not a PDF, or when ConvertToTiff is used.

Mrc Background Factor:
Data Type = integer
Sampling size for the background portion of the image. The higher the number, the larger the size of the image blocks used for averaging, which will result in a reduction in size but also quality. Default value is 3.

Mrc Foreground Factor:
Data Type = integer
Sampling size for the foreground portion of the image. The higher the number, the larger the size of the image blocks used for averaging, which will result in a reduction in size but also quality. Default value is 3.

Mrc Quality:
Data Type = integer
JPEG quality setting (percentage value 1 – 100) for use in saving the background and foreground images. Default value is 75.

Pdf To Image Bpp:
Data Type = string
The Bits Per Pixel to use for the rasterized PDF page when using engine 1. This only applies for documents that are processed using ConvertToTiff. The default value for this property is taken from the PDF page.

Possible values

    "Bpp_1"
    "Bpp_24"

Pdf To Image Compression:
Data Type = string
The compression to set to the images extracted or rasterized from each page of the source PDF file. These images are then OCRed to create the searchable PDF. The default value for this property is taken from each page in the source PDF file.

Possible values

    "CCITT4"
    "LZW"

PDF To Image DPI:
Data Type = string
The DPI to set to the images rasterized from each page of the source PDF file. These images are then OCRed to create the searchable PDF. The default value for this property is taken from each page in the source PDF file.

Possible values

    "DPI_72"
    "DPI_100"
    "DPI_150"
    "DPI_200"
    "DPI_300"
    "DPI_400"
    "DPI_500"
    "DPI_600"

Pdf To Image Force Vector Check:
Data Type = boolean
This setting is useful when dealing with documents that contain vector objects (e.g. CAD drawings). By default, pages that contain only vector objects are rasterized. Pages that do not have any images but contain vector objects as well as electronic text are skipped from rasterization. However, sometimes there can be a page that contains vector objects (CAD drawings) but its title may be in electronic text. To force rasterizing pages like these, set this property to true.
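
The rule above, for pages that contain no images, can be summarised in a short sketch. The function name and boolean inputs are hypothetical, and pages that do contain images are outside what this sketch covers.

```python
def rasterize_image_free_page(has_vector_objects, has_electronic_text,
                              force_vector_check=False):
    """Decide whether a page without images is rasterized (illustrative only)."""
    if not has_vector_objects:
        # Assumption: with no vector objects there is nothing to rasterize here.
        return False
    if not has_electronic_text:
        # Pages that contain only vector objects are rasterized by default.
        return True
    # Vector objects plus electronic text: skipped unless the check is forced.
    return force_vector_check


print(rasterize_image_free_page(True, True))        # False
print(rasterize_image_free_page(True, True, True))  # True
```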

Pdf To Image Include Text:
Data Type = boolean
When set to False this will prevent the conversion of real text (i.e. electronically generated as opposed to text that is part of a scanned image) from being rendered in the page images extracted from the PDF. This is because the text is already searchable and so generally does not require OCR. The value can be set to True however if the OCR is required on this real text.

Pdf To Image Max Res:
Data Type = integer
The maximum resolution of the rasterized images. If the resolution retrieved from the PDF page is bigger than this value, it will be set to this value. The default value for this property is 600.

Pdf To Image Min Res:
Data Type = integer
The minimum resolution of the rasterized images. If the resolution retrieved from the PDF page is lower than this value, it will be set to this value. The default value for this property is 200.
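
Taken together, the minimum and maximum resolution settings amount to clamping the resolution read from the PDF page. A minimal sketch, using the documented defaults of 200 and 600 (the function name is hypothetical):

```python
def effective_resolution(page_dpi, min_res=200, max_res=600):
    """Clamp the resolution retrieved from the PDF page to [min_res, max_res]."""
    return max(min_res, min(page_dpi, max_res))


for dpi in (72, 300, 1200):
    print(dpi, "->", effective_resolution(dpi))
```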

No Pictures:
Data Type = boolean
By default, if an area of the document is identified as a graphic area, no OCR processing is run on that area. However, certain documents may include areas or boxes that are identified as “graphic” or “picture” areas but that actually contain useful text. Setting NoPictures to True causes the engine to ignore areas identified as pictures, whilst setting it to False forces OCR of areas identified as pictures.

Tables:
Data Type = boolean
When set to true, this option attempts to OCR text within table cells.

Text Layer Filter Height:
Data Type = integer
Contact technical support (support@aquaforest.com) for guidance on using this property.

Text Layer Filter Height Inverted:
Data Type = integer
Contact technical support (support@aquaforest.com) for guidance on using this property.

Text Layer Filter Percentage:
Data Type = float
Contact technical support (support@aquaforest.com) for guidance on using this property.

Text Layer Filter Percentage Inverted:
Data Type = float
Contact technical support (support@aquaforest.com) for guidance on using this property.

Text Layer Filter Ratio:
Data Type = float
Contact technical support (support@aquaforest.com) for guidance on using this property.

Text Layer Filter Ratio Inverted:
Data Type = float
Contact technical support (support@aquaforest.com) for guidance on using this property.

Text Layer Filter Width:
Data Type = integer
Contact technical support (support@aquaforest.com) for guidance on using this property.

Text Layer Filter Width Inverted:
Data Type = integer
Contact technical support (support@aquaforest.com) for guidance on using this property.

Text Layer Max Boxes:
Data Type = integer
Contact technical support (support@aquaforest.com) for guidance on using this property.

Author:
Data Type = string
Set a custom Author in the output PDF document properties.

Creation Date:
Data Type = string
Set a custom creation date in the output PDF document properties. The date string must be in the format ‘yyyy-MM-dd HH:mm:ss’.

Modified Date:
Data Type = string
Set a custom modified date in the output PDF document properties. The date string must be in the format ‘yyyy-MM-dd HH:mm:ss’.
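
For the creation and modified dates, the pattern ‘yyyy-MM-dd HH:mm:ss’ corresponds to the Python `strftime` format `%Y-%m-%d %H:%M:%S`. A short sketch of producing such a string (the helper name is hypothetical):

```python
from datetime import datetime


def format_pdf_date(moment: datetime) -> str:
    """Format a datetime as 'yyyy-MM-dd HH:mm:ss' (24-hour clock, zero padded)."""
    return moment.strftime("%Y-%m-%d %H:%M:%S")


print(format_pdf_date(datetime(2024, 3, 7, 9, 5, 0)))  # 2024-03-07 09:05:00
```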

Retain creation date:
Data Type = boolean
Retains the creation date of the source file in the output PDF document properties.

Retain modified date:
Data Type = boolean
Retains the modified date of the source file in the output PDF document properties.

Retain bookmarks:
Data Type = boolean
Retains any bookmarks from the source file in the output when using ConvertToTiff.

Retain metadata:
Data Type = boolean
Retains any metadata from the source file in the output when using ConvertToTiff.

Retain viewer preferences:
Data Type = boolean
Retains any PDF Viewer Preferences, Page Mode and Page Layout from source file in the output when using ConvertToTiff.

Dotmatrix:
Data Type = boolean
Set this to true to improve recognition of dot-matrix fonts. Default value is false. If set to true for fonts that are not dot-matrix, recognition can be poor.

Enable debug output:
Data Type = boolean
Enables debug output.

PDF/A Output:
Data Type = boolean
Whether or not to output as PDF/A.

PDF/A Version:
Data Type = string
The PDF/A version.

Possible values

    "PDF_A1b"
    "PDF_A2b"
    "PDF_A3b"

Validate PDF/A:
Data Type = boolean
Whether or not to validate the PDF/A document after conversion.

Convert To Tiff:
Data Type = boolean
Each page in the PDF document is rasterized to a TIFF image.

Create Process:
Data Type = boolean
Set this to true to launch the process through P/Invoke.

Error mode:
Data Type = integer
Contact technical support (support@aquaforest.com) for guidance on using this property.

Restart Engine Every:
Data Type = integer
Contact technical support (support@aquaforest.com) for guidance on using this property.

Tidy-up mode:
Data Type = integer
Contact technical support (support@aquaforest.com) for guidance on using this property.

Dictionary Lookup:
Data Type = integer
Contact technical support (support@aquaforest.com) for guidance on using this property.

Flip detect:
Data Type = integer
Contact technical support (support@aquaforest.com) for guidance on using this property.

Heuristics:
Data Type = integer
Contact technical support (support@aquaforest.com) for guidance on using this property.

Word match threshold:
Data Type = float
Contact technical support (support@aquaforest.com) for guidance on using this property.

Aquaforest Image Timeout:
Data Type = integer
Contact technical support (support@aquaforest.com) for guidance on using this property.

MRC Timeout:
Data Type = integer
Contact technical support (support@aquaforest.com) for guidance on using this property.

OCR Timeout:
Data Type = integer
Contact technical support (support@aquaforest.com) for guidance on using this property.

Ocr Process Setup Timeout:
Data Type = integer
Contact technical support (support@aquaforest.com) for guidance on using this property.

Pipe Client Connection Timeout:
Data Type = integer
Contact technical support (support@aquaforest.com) for guidance on using this property.

Output Parameters

Processed file content:
Data Type = string (byte – base 64 string)
PDF file generated by the Aquaforest PDF converter.

Log file content:
Data Type = string
The log contents of the operation.

Error message:
Data Type = string
The error message, if any, returned by the operation.

Is Successful:
Data Type = boolean
Whether the operation was successful or not.

License Info:
Data Type = string
Information about your API subscription key. It contains:

LicenseType
CallsRemaining
CallsMade
RenewalDate
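
A caller would typically check the success flag before decoding the processed file content from base 64. The sketch below assumes the result has already been parsed into a dictionary; the key names `isSuccessful`, `errorMessage` and `processedFileContent` are hypothetical stand-ins for the output parameters listed above.

```python
import base64


def extract_pdf(result: dict) -> bytes:
    """Return the decoded PDF bytes, or raise if the operation did not succeed.

    The key names are hypothetical placeholders, not confirmed field names.
    """
    if not result.get("isSuccessful"):
        raise RuntimeError(result.get("errorMessage") or "OCR failed without an error message")
    return base64.b64decode(result["processedFileContent"])


sample = {
    "isSuccessful": True,
    "processedFileContent": base64.b64encode(b"%PDF-1.4").decode("ascii"),
}
print(extract_pdf(sample))
```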