Aquaforest PDF Connector User Guide


The Aquaforest PDF Extractor provides a group of actions that use the information available in PDF files to perform simple operations in Office 365 and Microsoft Flow.

Getting Started

Create account

First, create an Aquaforest PDF API account. This account is used to manage Aquaforest PDF Actions and the Aquaforest PDF API. Use an active email address, because the subscription will be linked to that address. If you already have an account, just sign in.

Generate API key

  • Log in to the developer portal and go to the products page, then click on the product you want to subscribe to.
  • Click the subscribe button.
  • Click the confirm button to confirm your request or the cancel button to cancel the request.
  • View your keys on the profile page.

Licensing

The licensing options and their limitations are listed below.

Standard (Free)

Sign up
  • 500 Monthly Operations
  • 3 MB File Size Limit
  • 1 call per page for OCR Actions
  • 1 call for every 10 pages processed in Split by Barcode Actions
  • 1 per page for other Actions
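
As a rough illustration of how the Standard plan limits above translate into consumed operations, here is a minimal sketch. The action labels (`"ocr"`, `"split_by_barcode"`, `"other"`) are hypothetical names chosen for this example, and rounding up partial blocks of 10 pages is an assumption, since the guide does not say how partial blocks are counted.

```python
import math

def estimate_calls(action: str, pages: int) -> int:
    """Estimate the operations one action run consumes for a file with `pages` pages."""
    if pages < 0:
        raise ValueError("pages must be non-negative")
    if action == "split_by_barcode":
        # 1 call for every 10 pages processed; rounding up is an assumption.
        return math.ceil(pages / 10)
    # OCR actions and all other actions are listed as 1 call per page.
    return pages

# A 25-page file: 25 calls for OCR, 3 calls for Split by Barcode.
print(estimate_calls("ocr", 25), estimate_calls("split_by_barcode", 25))
```

With 500 monthly operations, a calculation like this can help decide whether a batch fits in the free plan before running it.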

Custom

Contact support@aquaforest.com for details on the custom plan

Microsoft Flow

Microsoft Flow Setup

  • When adding a new action in Flow, search for Aquaforest PDF. This will show you a list of the available Aquaforest Microsoft Flow Actions.
  • You will be asked for a Connection Name and an API Key. Give your connection a name and use the primary key generated in the Generate API key section.

Microsoft Flow Actions

Get Text From PDF

Extracts text from PDF files. The extracted text can be used to rename the file in Flow, or as an input to other processes. Properties such as the location of the text on the page and regular expressions can be used to fine-tune the result.

Input Parameters

Required Parameters
  • File Name:
    Data Type = string
    The name of the source file, this will be used for the file name template.
  • File Content:
    Data Type = string (byte - base 64 string)
    The content of the source file, this should be converted to a base64 string if you are passing it from code, otherwise Microsoft Flow handles this aspect.
  • Text Result Template:
    Data Type = string
    Template for the output text result if a text match is found, any occurrence of variables in the list below will be replaced by the appropriate value at runtime.
    • %VALUE1%: The text extracted from the first zone. If no zone was provided, all the text in the page will be returned.
    • %VALUE2%, ..., %VALUEn%: The text extracted from the nth zone.
  • No Text Match Template:
    Data Type = string
    Template for the text to be returned if a text match is not found
Optional Parameters
  • Text Zones:
    Data Type = Object []
    A collection of variables that can be used to extract text information from PDF files, each member of this collection contains the properties listed below. Each member of this collection should produce a text output that corresponds to %VALUEn% of the Text Result Template discussed above.
    • Text Location:
      Data Type = string
      This represents the coordinates of a rectangle that covers the text you want us to extract. You can use this page to get the coordinates in relation to your input files.
    • Text Page Number:
      Data Type = integer
      Provide a page number to extract text from, if empty we will try each page until we get a match.
    • Text Pattern:
      Data Type = string
      If a regular expression is provided here, we will match any extracted text to it and return the match.
    • Text Select:
      Data Type = string
      Use this to further refine the extracted text. Select the option that matches your requirements.
      • text in zone: This option will select all the text that was extracted.
      • word after value: If this option is selected, Kingfisher will return the word that appears immediately after the expression supplied below.
      • word before value: If this option is selected, Kingfisher will return the word that appears immediately before the expression supplied below.
      • all text in line after value: If this option is selected, Kingfisher will return all the words that appear on the same line after the expression supplied below.
      • all text in line before value: If this option is selected, Kingfisher will return all the words that appear on the same line before the expression supplied below.
      • all text in zone after value: If this option is selected, Kingfisher will return all the words that appear in the selected zone after the expression supplied below.
      • all text in zone before value: If this option is selected, Kingfisher will return all the words that appear in the selected zone before the expression supplied below.
    • Text Value:
      Data Type = string[]
      Provide one or more value(s) here to be used with the property above, we will return the first text value that matches the rule stated above.
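
To make the Text Select options above more concrete, the following sketch imitates them on plain strings. It is not the connector's implementation: the function name `select_text` is hypothetical, and matching the value as a literal substring and trying the values in the order given are assumptions made for illustration.

```python
from typing import List, Optional

OPTIONS = {
    "text in zone", "word after value", "word before value",
    "all text in line after value", "all text in line before value",
    "all text in zone after value", "all text in zone before value",
}

def select_text(zone_text: str, option: str, values: List[str]) -> Optional[str]:
    """Apply a Text Select option using the first value found in the zone text."""
    if option not in OPTIONS:
        raise ValueError(f"unknown option: {option}")
    if option == "text in zone":
        return zone_text
    for value in values:
        idx = zone_text.find(value)  # literal match (assumption)
        if idx == -1:
            continue  # try the next value
        before, after = zone_text[:idx], zone_text[idx + len(value):]
        if option == "word after value":
            words = after.split()
            return words[0] if words else None
        if option == "word before value":
            words = before.split()
            return words[-1] if words else None
        if option == "all text in line after value":
            return after.split("\n", 1)[0].strip()
        if option == "all text in line before value":
            return before.rsplit("\n", 1)[-1].strip()
        if option == "all text in zone after value":
            return after.strip()
        return before.strip()  # "all text in zone before value"
    return None

zone = "Invoice No: 12345 Date\nTotal: 99.50 GBP"
print(select_text(zone, "word after value", ["Invoice No:"]))
print(select_text(zone, "all text in line after value", ["Total:"]))
```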

Output Parameters

  • Text Result:
    Data Type = string
    A string generated from applying the extracted text to the file template provided.
  • Success:
    Data Type = boolean
    A boolean value specifying if the operation was successful or not.
  • Licence:
    Data Type = string
    Information about your API subscription key, it contains:
    • LicenseType
    • CallsRemaining
    • RenewalDate
  • Error:
    Data Type = string
    Contains the Error message returned by the operation if any exist.
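
The relationship between the Text Result Template, the No Text Match Template and the Text Result output can be pictured with a small sketch. The function names are hypothetical, and two behaviours are assumptions rather than documented facts: placeholders with no corresponding zone are left unchanged, and "no match" is treated as "no zone produced any text".

```python
import re
from typing import List

def render_template(template: str, zone_values: List[str]) -> str:
    """Replace %VALUE1%..%VALUEn% with the text extracted from each zone."""
    def substitute(match: "re.Match[str]") -> str:
        index = int(match.group(1))
        if 1 <= index <= len(zone_values):
            return zone_values[index - 1]
        # No zone for this placeholder: leave it as-is (assumption).
        return match.group(0)
    return re.sub(r"%VALUE(\d+)%", substitute, template)

def text_result(template: str, no_match_template: str, zone_values: List[str]) -> str:
    """Pick the template to use depending on whether any zone produced text."""
    if not any(zone_values):
        return no_match_template
    return render_template(template, zone_values)

print(text_result("%VALUE1%_%VALUE2%.pdf", "unmatched.pdf", ["INV-001", "Acme"]))
```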

Get Barcode Value

Extracts barcodes from PDF files. The extracted information can be used to rename the file in Flow, or as an input to other processes. Properties such as the location of the barcode on the page, the barcode format and regular expressions can be used to fine-tune the result.

Input Parameters

Required Parameters
  • File Name:
    Data Type = string
    The name of the source file, this will be used for the file name template.
  • File Content:
    Data Type = string (byte - base 64 string)
    The content of the source file, this should be converted to a base64 string if you are passing it from code, otherwise Microsoft Flow handles this aspect.
  • Barcode Result Template:
    Data Type = string
    Template for the output text result if a barcode is found, any occurrence of variables in the list below will be replaced by the appropriate value at runtime.
    • %VALUE1%: The text extracted from the first zone. If no zone was provided, all the text in the page will be returned.
    • %VALUE2%, ..., %VALUEn%: The text extracted from the nth zone.
  • No Barcode Template:
    Data Type = string
    Template for the output text result if no barcode is found
Optional Parameters
  • Barcode Zones:
    Data Type = Object []
    A collection of variables that can be used to extract barcode information from PDF files, each member of this collection contains the properties listed below. Each member of this collection should produce a text output that corresponds to %VALUEn% of the Barcode Result Template discussed above.
    • Barcode Location:
      Data Type = string
      This represents the coordinates of a rectangle that covers the barcode you want us to extract. You can use this page to get the coordinates in relation to your input files.
    • Barcode Page Number:
      Data Type = integer
      Provide a page number to extract barcode from, if empty we will try each page until we get a match.
    • Barcode Pattern:
      Data Type = string
      If a regular expression is provided here, we will match any extracted barcode to it and return the match.
    • Barcode Type:
      Data Type = string[]
      Specify the types of Barcode you want to identify
      "All 1D", "AZTEC", "CODABAR", "CODE 128", "CODE 39","CODE 93", "DATA MATRIX", "EAN 13", 
      "EAN 8", "ITF","MAXICODE", "MSI", "PDF 417", "PLESSEY", "QR CODE","RSS 14", "RSS EXPANDED",
      "UPC A", "UPC E", "UPC EAN EXTENSION"
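
The interaction between Barcode Page Number and Barcode Pattern described above can be sketched as follows. This is only an illustration on already-decoded values: the function `first_match` is hypothetical, page numbers are assumed to be 1-based, and using a search (rather than a full-string match) is also an assumption.

```python
import re
from typing import List, Optional

def first_match(page_values: List[List[str]], pattern: str,
                page_number: Optional[int] = None) -> Optional[str]:
    """Return the first barcode text matching `pattern`.

    page_values[i] holds the barcode values already read from page i + 1.
    """
    regex = re.compile(pattern)
    if page_number is not None:
        pages = [page_values[page_number - 1]]  # only the requested page
    else:
        pages = page_values  # no page given: try each page in order
    for values in pages:
        for value in values:
            match = regex.search(value)
            if match:
                return match.group(0)
    return None

pages = [["ABC"], ["INV-2024-0042", "XYZ"]]
print(first_match(pages, r"INV-\d{4}-\d{4}"))
```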
                                                      

Output Parameters

  • Barcode:
    Data Type = string
    A string generated from applying the extracted text to the file template provided.
  • Success:
    Data Type = boolean
    A boolean value specifying if the operation was successful or not.
  • Licence:
    Data Type = string
    Information about your API subscription key, it contains:
    • LicenseType
    • CallsRemaining
    • RenewalDate
  • Error:
    Data Type = string
    Contains the Error message returned by the operation if any exist.

Split PDF By Barcode

Uses barcode values in a PDF file to split it. You can also generate file names for the split files based on the barcode values.

Input Parameters

Required Parameters
  • File Name:
    Data Type = string
    The name of the source file, this will be used for the file name template.
  • File Content:
    Data Type = string (byte - base 64 string)
    The content of the source file, this should be converted to a base64 string if you are passing it from code, otherwise Microsoft Flow handles this aspect.
  • Text Result Template:
    Data Type = string
    Template for the output text result if a text match is found, any occurrence of variables in the list below will be replaced by the appropriate value at runtime.
    • %VALUE1%: The text extracted from the first zone. If no zone was provided, all the text in the page will be returned.
    • %VALUE2%, ..., %VALUEn%: The text extracted from the nth zone.
  • No Text Match Template:
    Data Type = string
    Template for the text to be returned if a text match is not found
Optional Parameters
  • Text Zones:
    Data Type = Object []
    A collection of variables that can be used to extract text information from PDF files, each member of this collection contains the properties listed below. Each member of this collection should produce a text output that corresponds to %VALUEn% of the Text Result Template discussed above.
    • Text Location:
      Data Type = string
      This represents the coordinates of a rectangle that covers the text you want us to extract. You can use this page to get the coordinates in relation to your input files.
    • Text Page Number:
      Data Type = integer
      Provide a page number to extract text from, if empty we will try each page until we get a match.
    • Text Pattern:
      Data Type = string
      If a regular expression is provided here, we will match any extracted text to it and return the match.
    • Text Select:
      Data Type = string
      Use this to further refine the extracted text. Select the option that matches your requirements.
      • text in zone: This option will select all the text that was extracted.
      • word after value: If this option is selected, Kingfisher will return the word that appears immediately after the expression supplied below.
      • word before value: If this option is selected, Kingfisher will return the word that appears immediately before the expression supplied below.
      • all text in line after value: If this option is selected, Kingfisher will return all the words that appear on the same line after the expression supplied below.
      • all text in line before value: If this option is selected, Kingfisher will return all the words that appear on the same line before the expression supplied below.
      • all text in zone after value: If this option is selected, Kingfisher will return all the words that appear in the selected zone after the expression supplied below.
      • all text in zone before value: If this option is selected, Kingfisher will return all the words that appear in the selected zone before the expression supplied below.
    • Text Value:
      Data Type = string[]
      Provide one or more value(s) here to be used with the property above, we will return the first text value that matches the rule stated above.
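
When the File Content parameter above is supplied from code rather than from Microsoft Flow, the file bytes need to be converted to a base64 string first. A minimal sketch using Python's standard library is shown here; the dictionary keys simply mirror the parameter labels in this guide and are not confirmed API field names.

```python
import base64

def encode_file_content(data: bytes) -> str:
    """Return the base64 text form expected for the File Content parameter."""
    return base64.b64encode(data).decode("ascii")

def build_inputs(file_name: str, data: bytes) -> dict:
    """Bundle the two file-related required parameters."""
    # Keys are illustrative labels, not confirmed API field names.
    return {"File Name": file_name, "File Content": encode_file_content(data)}

inputs = build_inputs("invoices.pdf", b"%PDF-1.4")
print(inputs["File Content"])
```

In practice the bytes would come from reading the PDF in binary mode, for example `open(path, "rb").read()`.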

Output Parameters

  • Split Output Files:
    Data Type = object[]
    Array of Split Files with their corresponding file names.
    • File Content:
      Data Type = string (byte - base 64 string)
      A base 64 string representation of the split file.
    • File Name:
      Data Type = string
      File name for the split file above
  • Success:
    Data Type = boolean
    A boolean value specifying if the operation was successful or not.
  • Licence:
    Data Type = string
    Information about your API subscription key, it contains:
    • LicenseType
    • CallsRemaining
    • RenewalDate
  • Error:
    Data Type = string
    Contains the Error message returned by the operation if any exist.
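
If the Split Output Files array is consumed from code, each entry can be decoded from base64 and written to disk. The sketch below assumes each entry is a dictionary keyed by the labels used in this guide ("File Name" and "File Content"); the actual field names in the response may differ.

```python
import base64
from pathlib import Path
from typing import Dict, List

def save_split_files(split_files: List[Dict[str, str]], out_dir: str) -> List[Path]:
    """Decode each split file and write it under out_dir."""
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    written = []
    for item in split_files:
        # Path(...).name drops any directory part so files stay inside out_dir.
        path = target / Path(item["File Name"]).name
        path.write_bytes(base64.b64decode(item["File Content"]))
        written.append(path)
    return written
```

Checking the Success output before calling a helper like this avoids writing files from a failed operation.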

Split PDF By Text

Uses text matches in a PDF file to split it. You can also generate file names for the split files based on the text matches.

Input Parameters

Required Parameters
  • File Name:
    Data Type = string
    The name of the source file, this will be used for the file name template.
  • File Content:
    Data Type = string (byte - base 64 string)
    The content of the source file, this should be converted to a base64 string if you are passing it from code, otherwise Microsoft Flow handles this aspect.
  • Text Result Template:
    Data Type = string
    Template for the output text result if a text match is found, any occurrence of variables in the list below will be replaced by the appropriate value at runtime.
    • %VALUE1%: The text extracted from the first zone. If no zone was provided, all the text in the page will be returned.
    • %VALUE2%, ..., %VALUEn%: The text extracted from the nth zone.
  • No Text Match Template:
    Data Type = string
    Template for the text to be returned if a text match is not found
Optional Parameters
  • Text Zones:
    Data Type = Object []
    A collection of variables that can be used to extract text information from PDF files, each member of this collection contains the properties listed below. Each member of this collection should produce a text output that corresponds to %VALUEn% of the Text Result Template discussed above.
    • Text Location:
      Data Type = string
      This represents the coordinates of a rectangle that covers the text you want us to extract. You can use this page to get the coordinates in relation to your input files.
    • Text Page Number:
      Data Type = integer
      Provide a page number to extract text from, if empty we will try each page until we get a match.
    • Text Pattern:
      Data Type = string
      If a regular expression is provided here, we will match any extracted text to it and return the match.
    • Text Select:
      Data Type = string
      Use this to further refine the extracted text. Select the option that matches your requirements.
      • text in zone: This option will select all the text that was extracted.
      • word after value: If this option is selected, Kingfisher will return the word that appears immediately after the expression supplied below.
      • word before value: If this option is selected, Kingfisher will return the word that appears immediately before the expression supplied below.
      • all text in line after value: If this option is selected, Kingfisher will return all the words that appear on the same line after the expression supplied below.
      • all text in line before value: If this option is selected, Kingfisher will return all the words that appear on the same line before the expression supplied below.
      • all text in zone after value: If this option is selected, Kingfisher will return all the words that appear in the selected zone after the expression supplied below.
      • all text in zone before value: If this option is selected, Kingfisher will return all the words that appear in the selected zone before the expression supplied below.
    • Text Value:
      Data Type = string[]
      Provide one or more value(s) here to be used with the property above, we will return the first text value that matches the rule stated above.

Output Parameters

  • Split Output Files:
    Data Type = object[]
    Array of Split Files with their corresponding file names.
    • File Content:
      Data Type = string (byte - base 64 string)
      A base 64 string representation of the split file.
    • File Name:
      Data Type = string
      File name for the split file above
  • Success:
    Data Type = boolean
    A boolean value specifying if the operation was successful or not.
  • Licence:
    Data Type = string
    Information about your API subscription key, it contains:
    • LicenseType
    • CallsRemaining
    • RenewalDate
  • Error:
    Data Type = string
    Contains the Error message returned by the operation if any exist.

OCR PDF or Images

Generate searchable PDF from an image PDF or scanned images.

Input Parameters

Required Parameters
  • Source file content:
    Data Type = string (byte - base 64 string)
    Content of the file to OCR
  • Source file name with extension:
    Data Type = string
    The source file name with extension or just the extension (with a leading period '.')
Optional Parameters
  • Password:
    Data Type = string
    The password to open the source PDF file
  • Auto-rotate:
    Data Type = boolean
    Auto-rotate the image. This ensures that all text is oriented normally.
  • Binarize:
    Data Type = integer
    This value should generally only be used under guidance from technical support. It can control the way that color images are processed and force binarization with a particular threshold. A value of 200 has been shown to generally give good results in testing, but this should be confirmed with "typical" customer documents. By setting this to -1 an alternative method is used which will attempt to separate the text from any background images or colors. This can give improved OCR results for certain documents such as newspaper and magazine pages.
  • Black pixel limit:
    Data Type = float
    Contact technical support (support@aquaforest.com) for guidance on using this property.
  • Blank page threshold:
    Data Type = integer
    Use this to set the minimum number of "On Pixels" that must be present in the image for a page not to be considered blank. A value of -1 will turn off blank page detection.
  • Box size:
    Data Type = integer
    This option is ideal for forms where boxes around text can sometimes cause an area to be identified as graphics. This option removes boxes from the temporary copy of the image used by the OCR engine. It does not remove boxes from the final image. Technically, this option removes connected elements with a minimum area (in pixels and defined by this property). This option is currently only applied for bi-tonal images.
  • Deskew:
    Data Type = boolean
    Deskew (straighten) the image.
  • Despeckle:
    Data Type = integer
    This removes all disconnected elements within the image that have height or width in pixels less than the specified figure. The maximum value is 9 and the default value is 0.
  • Grayscale quality:
    Data Type = integer
    Contact technical support (support@aquaforest.com) for guidance on using this property.
  • Jbig2EncFlags:
    Data Type = string
    These are the flags that will be passed to the application used to generate JBIG2 versions of images used in PDF generation (assuming this compression is enabled). This option should generally only be used under guidance from technical support.
  • LibTiffSavePageAsBmp:
    Data Type = boolean
    Sometimes if there is an image which is 1bpp and has LZW compression, the pre-processing can cause the colour of the image to be inverted (black to white and white to black). Set this to true to avoid this.
  • Maximum deskew:
    Data Type = float
    Maximum angle by which a page will be deskewed. This option should generally only be used under guidance from technical support (support@aquaforest.com).
  • Minimum deskew confidence:
    Data Type = string
    This option should generally only be used under guidance from technical support (support@aquaforest.com).
  • Morph:
    Data Type = string
    Morphological options that will be applied to the binarized image before OCR. If left empty, none are applied. Common options include the one listed below; for more options, please contact support@aquaforest.com.
    • d2.2: 2x2 dilation applied to all black pixel areas, useful for faint prints.
  • Remove Blank Pages:
    Data Type = boolean
    Remove blank pages when BlankPageThreshold is greater than -1 and ConvertToTiff is true.
  • Remove Lines:
    Data Type = boolean
    Remove lines from images for better recognition.
  • Save Pre-despeckle:
    Data Type = boolean
    This will use the original image (i.e. before applying pre-processing) in the output PDF.
  • Compress PDF (MRC):
    Data Type = boolean
    This enables Mixed Raster Compression which can dramatically reduce the output size of PDFs comprising color scans. Note that this option is only suitable when the source is not a PDF or using ConvertToTiff.
  • Mrc Background Factor:
    Data Type = integer
    Sampling size for the background portion of the image. The higher the number, the larger the size of the image blocks used for averaging which will result in a reduction in size but also quality. Default value is 3
  • Mrc Foreground Factor:
    Data Type = integer
    Sampling size for the foreground portion of the image. The higher the number, the larger the size of the image blocks used for averaging which will result in a reduction in size but also quality. Default value is 3
  • Mrc Quality:
    Data Type = integer
    JPEG quality setting (percentage value 1 - 100) for use in saving the background and foreground images. Default value is 75
  • Pdf To Image Bpp:
    Data Type = string
    The Bits Per Pixel to use for the rasterized PDF page when using engine 1. This only applies for documents that are processed using ConvertToTiff. The default value for this property is taken from the PDF page.
    Possible values: "Bpp_1", "Bpp_24"
  • Pdf To Image Compression:
    Data Type = string
    The compression to set to the images extracted or rasterized from each page of the source PDF file. These images are then OCRed to create the searchable PDF. The default value for this property is taken from each page in the source PDF file.
    Possible values: "CCITT4", "LZW"
  • PDF To Image DPI:
    Data Type = string
    The DPI to set to the images rasterized from each page of the source PDF file. These images are then OCRed to create the searchable PDF. The default value for this property is taken from each page in the source PDF file.
    Possible values: "DPI_72", "DPI_100", "DPI_150", "DPI_200", "DPI_300", "DPI_400", "DPI_500", "DPI_600"
  • Pdf To Image Force Vector Check:
    Data Type = boolean
    This setting is useful when dealing with documents that contains vector objects (e.g. CAD drawings). By default, pages that contain only vector objects are rasterized. Pages that do not have any images but contain vector objects as well as electronic text are skipped from rasterization. However, sometimes there can be a page that contains vector objects (CAD drawings) but its title may be in electronic text. To force rasterizing pages like these, set this property to true.
  • Pdf To Image Include Text:
    Data Type = boolean
    When set to False this will prevent the conversion of real text (i.e. electronically generated as opposed to text that is part of a scanned image) from being rendered in the page images extracted from the PDF. This is because the text is already searchable and so generally does not require OCR. The value can be set to True however if the OCR is required on this real text.
  • Pdf To Image Max Res:
    Data Type = integer
    The maximum resolution of the rasterized images. If the resolution retrieved from the PDF page is bigger than this value, it will be set to this value. The default value for this property is 600.
  • Pdf To Image Min Res:
    Data Type = integer
    The minimum resolution of the rasterized images. If the resolution retrieved from the PDF page is lower than this value, it will be set to this value. The default value for this property is 200.
  • No Pictures:
    Data Type = boolean
    By default, if an area of the document is identified as a graphic area then no OCR processing is run on that area. However, certain documents may include areas or boxes that are identified as "graphic" or "picture" areas but that actually do contain useful text. Setting NoPictures to True will cause it to ignore areas identified as pictures whilst setting it to False will force OCR of areas identified as pictures.
  • Tables:
    Data Type = boolean
    When set to true, this option tries to OCR text within table cells.
  • Text Layer Filter Height:
    Data Type = integer
    Contact technical support (support@aquaforest.com) for guidance on using this property.
  • Text Layer Filter Height Inverted:
    Data Type = integer
    Contact technical support (support@aquaforest.com) for guidance on using this property.
  • Text Layer Filter Percentage:
    Data Type = float
    Contact technical support (support@aquaforest.com) for guidance on using this property.
  • Text Layer Filter Percentage Inverted:
    Data Type = float
    Contact technical support (support@aquaforest.com) for guidance on using this property.
  • Text Layer Filter Ratio:
    Data Type = float
    Contact technical support (support@aquaforest.com) for guidance on using this property.
  • Text Layer Filter Ratio Inverted:
    Data Type = float
    Contact technical support (support@aquaforest.com) for guidance on using this property.
  • Text Layer Filter Width:
    Data Type = integer
    Contact technical support (support@aquaforest.com) for guidance on using this property.
  • Text Layer Filter Width Inverted:
    Data Type = integer
    Contact technical support (support@aquaforest.com) for guidance on using this property.
  • Text Layer Max Boxes:
    Data Type = integer
    Contact technical support (support@aquaforest.com) for guidance on using this property.
  • Author:
    Data Type = string
    Set a custom Author in the output PDF document properties.
  • Creation Date:
    Data Type = string
    Set a custom creation date in the output PDF document properties. The date string must be in the format 'yyyy-MM-dd HH:mm:ss'.
  • Modified Date:
    Data Type = string
    Set a custom modified date in the output PDF document properties. The date string must be in the format 'yyyy-MM-dd HH:mm:ss'.
  • Retain creation date:
    Data Type = boolean
    Retains the creation date of the source file in the output PDF document properties.
  • Retain modified date:
    Data Type = boolean
    Retains the modified date of the source file in the output PDF document properties.
  • Retain bookmarks:
    Data Type = boolean
    Retains any bookmarks from the source file in the output when using ConvertToTiff.
  • Retain metadata:
    Data Type = boolean
    Retains any metadata from the source file in the output when using ConvertToTiff.
  • Retain viewer preferences:
    Data Type = boolean
    Retains any PDF Viewer Preferences, Page Mode and Page Layout from source file in the output when using ConvertToTiff.
  • Dotmatrix:
    Data Type = boolean
    Set this to true to improve recognition of dot-matrix fonts. Default value is false. If set to true for non dot-matrix fonts then the recognition can be poor.
  • Enable debug output:
    Data Type = boolean
    Enables debug output.
  • PDF/A Output:
    Data Type = boolean
    Whether or not to output as PDF/A.
  • PDF/A Version:
    Data Type = string
    The PDF/A version.
    Possible values: "PDF_A1b", "PDF_A2b", "PDF_A3b"
  • Validate PDF/A:
    Data Type = boolean
    Whether or not to validate the PDF/A document after conversion
  • Convert To Tiff:
    Data Type = boolean
    Each page in the PDF document is rasterized to a TIFF image.
  • Create Process:
    Data Type = boolean
    Set this to true if you want to launch the process through P/Invoke.
  • Error mode:
    Data Type = integer
    Contact technical support (support@aquaforest.com) for guidance on using this property.
  • Restart Engine Every:
    Data Type = integer
    Contact technical support (support@aquaforest.com) for guidance on using this property.
  • Tidy-up mode:
    Data Type = integer
    Contact technical support (support@aquaforest.com) for guidance on using this property.
  • Dictionary Lookup:
    Data Type = integer
    Contact technical support (support@aquaforest.com) for guidance on using this property.
  • Flip detect:
    Data Type = integer
    Contact technical support (support@aquaforest.com) for guidance on using this property.
  • Heuristics:
    Data Type = integer
    Contact technical support (support@aquaforest.com) for guidance on using this property.
  • Word match threshold:
    Data Type = float
    Contact technical support (support@aquaforest.com) for guidance on using this property.
  • Aquaforest Image Timeout:
    Data Type = integer
    Contact technical support (support@aquaforest.com) for guidance on using this property.
  • MRC Timeout:
    Data Type = integer
    Contact technical support (support@aquaforest.com) for guidance on using this property.
  • OCR Timeout:
    Data Type = integer
    Contact technical support (support@aquaforest.com) for guidance on using this property.
  • Ocr Process Setup Timeout:
    Data Type = integer
    Contact technical support (support@aquaforest.com) for guidance on using this property.
  • Pipe Client Connection Timeout:
    Data Type = integer
    Contact technical support (support@aquaforest.com) for guidance on using this property.
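
Two of the optional parameters above follow simple rules that are easy to get wrong when values are prepared in code: the Creation Date and Modified Date strings must use the format 'yyyy-MM-dd HH:mm:ss', and the rasterization resolution is kept between Pdf To Image Min Res (default 200) and Pdf To Image Max Res (default 600). The helpers below are a hedged sketch of those two rules with hypothetical function names; they do not call the connector.

```python
from datetime import datetime

def format_pdf_date(value: datetime) -> str:
    """Format a datetime as 'yyyy-MM-dd HH:mm:ss' for Creation Date / Modified Date."""
    return value.strftime("%Y-%m-%d %H:%M:%S")

def effective_resolution(page_dpi: int, min_res: int = 200, max_res: int = 600) -> int:
    """Clamp the resolution read from the PDF page to the [min_res, max_res] range."""
    return max(min_res, min(page_dpi, max_res))

print(format_pdf_date(datetime(2024, 3, 5, 9, 7, 2)))
print(effective_resolution(72), effective_resolution(1200))
```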

Output Parameters

  • Processed file content:
    Data Type = string (byte - base 64 string)
    PDF File generated by the Aquaforest PDF converter.
  • Log file content:
    Data Type = string
    The log contents of the operation.
  • Error message:
    Data Type = string
    The error message returned by the operation, if any.
  • Is Successful:
    Data Type = boolean
    Whether the operation was successful or not.