Estimating OCR Conversion Processing Time with Autobahn DX

At Aquaforest we are often asked questions such as “I have 1 million documents I need to convert – how long will it take?” or “I need to convert 30,000 documents per day – how many servers will I need?”.  This article gives a straightforward method that can be used to provide broad estimates for conversion times.

Step 1 – Scope the Conversion

The first step is to collect the following information about the conversion exercise and the processing options that will be used:

Document-Related Data

  • Number of documents
  • Average number of pages per document
  • Image Color (Black & White, Grayscale, Full Color)
  • Type of input file (TIFF, PDF …)
  • Typical resolution (200dpi, 300dpi…)
  • Typical page size (US Letter, A4, A3…)
  • Typical text density

Processing Options

  • Auto-rotate
  • Deskew
  • Line Removal
  • Despeckle
  • PDF/A
  • Compression

Step 2 – Estimating Complexity

Based on the information collected in Step 1, a broad “complexity rating” can be assigned to the project by reviewing the guide below.

Low

  • TIFF Image Source
  • 200 DPI
  • US Letter/A4
  • Medium Text Density
  • Bitonal / Black & White Images
  • Deskew
  • Line Removal
  • PDF/A

Medium

  • PDF File Source
  • 300+ DPI
  • High Text Density
  • Grayscale Images
  • Auto-rotate
  • Bitonal Compression

High

  • Color TIFF or PDF Images
  • Large Format

Very High

  • Color MRC Compression
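One possible way to turn the guide above into a single rating is sketched below. Two things here are assumptions rather than Aquaforest guidance: the grouping of factors under each rating follows one reading of the guide, and the rule "the project takes the rating of its most demanding factor" is simply a conservative choice. The factor names are hypothetical labels.

```python
# Ratings in increasing order of complexity.
RATING_ORDER = ["Low", "Medium", "High", "Very High"]

# Assumed factor-to-rating mapping, based on one reading of the guide.
FACTOR_RATING = {
    "tiff source": "Low",
    "200 dpi": "Low",
    "letter/a4": "Low",
    "medium text density": "Low",
    "bitonal images": "Low",
    "deskew": "Low",
    "line removal": "Low",
    "pdf/a": "Low",
    "pdf source": "Medium",
    "300+ dpi": "Medium",
    "high text density": "Medium",
    "grayscale images": "Medium",
    "auto-rotate": "Medium",
    "bitonal compression": "Medium",
    "color images": "High",
    "large format": "High",
    "color mrc compression": "Very High",
}


def complexity_rating(factors):
    """Return the highest rating among the given factors (assumed rule)."""
    ratings = [FACTOR_RATING[f] for f in factors]
    return max(ratings, key=RATING_ORDER.index, default="Low")


print(complexity_rating(["tiff source", "300+ dpi", "deskew"]))  # Medium
```

Taking the maximum errs on the side of a slower estimate; a project with many Medium factors may in practice behave more like a High one, so the result is best treated as a starting point.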

Step 3 – Estimating Pages per CPU Core Hour (PPCCH)

OCR and compression processing can be highly CPU-intensive, so Aquaforest recommends using a high-performance server with Intel i5 or better processors. The table below gives indicative PPCCH figures for each complexity rating.

Complexity PPCCH
Low 3000
Medium 1500
High 900
Very High 600
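For planning purposes, the PPCCH figures above can be turned into a daily throughput per core. The sketch below assumes the server runs conversions around the clock (24 hours per day) unless told otherwise; the function name and the `hours_per_day` parameter are illustrative.

```python
# PPCCH figures from the table above.
PPCCH = {"Low": 3000, "Medium": 1500, "High": 900, "Very High": 600}


def pages_per_day(complexity, cores=1, hours_per_day=24):
    """Approximate pages converted per day for a given rating and core count."""
    return PPCCH[complexity] * cores * hours_per_day


print(pages_per_day("Medium"))           # 36000
print(pages_per_day("Very High", 4, 8))  # 19200
```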

Step 4 – Determining How Many CPU Cores Can Be Used

Whilst the theoretical maximum number of CPU cores available is Number of Servers x Number of CPU Cores per Server, there are a number of factors that will reduce this.

Firstly, even if the system is largely dedicated to OCR use, it is recommended to leave at least 1 CPU core available for non-OCR system use. In addition, if other applications and services are running, it is prudent to be conservative about the number of cores available.

Secondly, by default Autobahn DX jobs will only use one CPU core.  It is possible to increase this to a maximum of 10 cores (using the Cores setting in the Convert TIFF to PDF or OCR PDF Job Steps).  Note that this assumes an Autobahn DX Multi-Core license is available.

Another option that is less often used is the Threads setting, which is intended for documents that have a large number of pages (200+). It works by splitting the document into 2 or 4 chunks and processing each in parallel.

Yet another alternative approach is to configure multiple concurrent jobs in Autobahn DX, i.e. have 2 or 4 jobs that can run in parallel.
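The core-counting rules above can be sketched as follows. This is only an illustration: the function names are hypothetical, reserving one core per server is taken from the recommendation above, and treating the 10-core maximum as a per-job cap is an interpretation of the Cores setting.

```python
MAX_CORES_PER_JOB = 10  # maximum quoted above for the Cores setting


def available_cores(servers, cores_per_server, reserved_per_server=1):
    """Cores left for OCR after reserving some on each server for system use."""
    per_server = max(cores_per_server - reserved_per_server, 0)
    return servers * per_server


def cores_for_single_job(cores_per_server, reserved_per_server=1):
    """Largest sensible Cores value for one job on one server (assumed cap)."""
    free = max(cores_per_server - reserved_per_server, 0)
    return min(free, MAX_CORES_PER_JOB)


print(available_cores(2, 8))      # 14
print(cores_for_single_job(16))   # 10
```

If other applications share the servers, a larger `reserved_per_server` value gives a more conservative figure.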

Step 5 – Estimating Time Required

The number of hours required is:

P / (PPCCH * C)

Where P = Number of Pages, C = Number of usable CPU Cores, PPCCH = Pages per CPU Core Hour

For example:

A medium-complexity conversion job of 500,000 pages, with the job making use of two CPU cores: 500,000 / (1500 * 2) ≈ 167 hours, or roughly seven days of continuous processing.
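The same arithmetic can be written as a short script, together with the inverse calculation for the "how many pages per day" style of question. The function names are illustrative, and the daily example assumes 30,000 documents of 10 pages each processed around the clock; both of those figures are assumptions made purely for the example.

```python
import math


def hours_required(pages, ppcch, cores):
    """Hours = P / (PPCCH * C)."""
    return pages / (ppcch * cores)


def cores_needed(pages_per_day, ppcch, hours_per_day=24):
    """Smallest whole number of cores that keeps up with a daily page volume."""
    return math.ceil(pages_per_day / (ppcch * hours_per_day))


# Worked example from the text: medium complexity, 500,000 pages, two cores.
print(round(hours_required(500_000, 1500, 2)))  # 167

# Assumed daily volume: 30,000 documents x 10 pages at medium complexity.
print(cores_needed(30_000 * 10, 1500))  # 9
```

The core count returned is for OCR work only, so the cores reserved for system use in Step 4 need to be added when sizing servers.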

 

Neil Pitman founded Aquaforest Limited in 2001 and is the chief architect for the company’s PDF and OCR software products used by thousands of organizations ranging from NASA to the Dutch Ministerie van Justitie. Neil has 30 years’ experience in the software industry in the UK and USA in the areas of database systems, document management and software development tools and has served on the IDT committees of the British Standards Institute (BSI) and was a co-author of the BSI’s 2007 publication on the Long Term Preservation of Digital Documents.
