
How to create searchable PDF files with ViewCompanion Premium

Optical Character Recognition (OCR) with Tesseract in ViewCompanion

OCR (Optical Character Recognition) technology enables users to convert various types of documents—such as scanned paper documents, PDFs, or images captured by a digital camera—into editable and searchable data. OCR reduces the need for manual data entry, significantly saving time and minimizing human errors.

ViewCompanion Premium can convert scanned PDF documents into searchable PDFs using optical character recognition (OCR) with the Tesseract OCR engine. OCR is also supported for image formats such as TIFF and other raster image files.
Tesseract is a free, open-source OCR engine that is included in the ViewCompanion Premium 64-bit installation. It can be used for both personal and commercial work, supports over 100 languages, and can be trained to recognize additional ones.
The picture below shows a typical scanned document opened in ViewCompanion:

If your scanned document is old, it may have stains, browning, or other age-related deterioration. Browning, also known as foxing, as shown in the picture above, can be removed before OCR using the built-in Defoxing or Binarization filters.
Please note that if your file is a scanned PDF, you must press the Edit PDF as Image button before using these filters or the OCR function.

Since we primarily need the text, we used the Binarization filter in this demonstration, which produces a black-and-white image, as shown below:

After running the filter, some stains may remain.
You can remove the remaining noise by using the clear area tool and the clear polygon tool.
When you're ready to run OCR, click the OCR button in the Premium tab, as shown below:
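To illustrate why binarization helps and why some stains can survive it, here is a minimal sketch of global thresholding with Otsu's method in plain Python. This is a generic technique chosen for illustration. The algorithm behind ViewCompanion's Binarization filter is not documented here, and the function names `otsu_threshold` and `binarize` are hypothetical.

```python
def otsu_threshold(pixels):
    """Return the gray level (0-255) that best separates dark from light pixels.

    Generic Otsu sketch: picks the level that maximizes the
    between-class variance of the two resulting pixel groups.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_threshold, best_variance = 0, -1.0
    weight_dark, sum_dark = 0, 0
    for level in range(256):
        weight_dark += hist[level]
        sum_dark += level * hist[level]
        if weight_dark == 0:
            continue  # no pixels at or below this level yet
        weight_light = total - weight_dark
        if weight_light == 0:
            break  # every pixel is already in the dark group
        mean_dark = sum_dark / weight_dark
        mean_light = (sum_all - sum_dark) / weight_light
        variance = weight_dark * weight_light * (mean_dark - mean_light) ** 2
        if variance > best_variance:
            best_variance, best_threshold = variance, level
    return best_threshold


def binarize(pixels, threshold):
    """Map each gray value to 0 (black) or 255 (white)."""
    return [0 if p <= threshold else 255 for p in pixels]


# Hypothetical gray values: dark text (~25) on light paper (~210).
row = [20, 25, 30, 200, 210, 220]
black_and_white = binarize(row, otsu_threshold(row))
```

Light browning ends up above the threshold and turns white, but a stain that is about as dark as the text falls below it and stays black, which is why manual cleanup may still be needed afterwards.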

After a while you will be prompted to enter a file name for the resulting PDF file. When the OCR conversion is complete, you can open the resulting PDF file in, for example, Acrobat to verify that it is now searchable:
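The same check can also be scripted, which is handy when many files have been converted. The sketch below is not part of ViewCompanion. It assumes the third-party pypdf package is installed, and the helper names and the file name `scan_ocr.pdf` are hypothetical.

```python
def extract_page_texts(pdf_path):
    """Return the extracted text of every page in the PDF.

    Assumption: the third-party pypdf package is installed
    (pip install pypdf). It is imported here so the helpers
    below can be used without it.
    """
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    return [page.extract_text() or "" for page in reader.pages]


def is_searchable(page_texts):
    """True if at least one page contains extractable text."""
    return any(text.strip() for text in page_texts)


def pages_containing(page_texts, term):
    """Return 1-based page numbers whose text contains term (case-insensitive)."""
    needle = term.casefold()
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if needle in text.casefold()
    ]


# Example usage (hypothetical file name):
#   texts = extract_page_texts("scan_ocr.pdf")
#   print(is_searchable(texts))
#   print(pages_containing(texts, "invoice"))
```

A scanned PDF without a text layer typically yields empty strings for every page, so `is_searchable` returns False for it before OCR and True afterwards.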

Tesseract OCR Engine

Starting with version 17, ViewCompanion Premium 64-bit includes the Tesseract OCR engine.
If you're using an older version you can download the Tesseract installer for Windows from UB Mannheim:
https://github.com/UB-Mannheim/tesseract/wiki
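Once a standalone Tesseract is installed, you can confirm that it works by calling its command-line program directly. The following is a minimal sketch that assumes the `tesseract` executable is on your PATH. It uses Tesseract's standard `pdf` output config and `-l` language option, but it does not show how ViewCompanion itself invokes the engine, and the function and file names are hypothetical.

```python
import shutil
import subprocess


def build_tesseract_command(image_path, output_base, language="eng"):
    """Build the command line for a searchable-PDF run.

    Tesseract appends ".pdf" to output_base when the "pdf"
    config is given as the last argument.
    """
    return ["tesseract", image_path, output_base, "-l", language, "pdf"]


def run_ocr(image_path, output_base, language="eng"):
    """Run Tesseract and return the name of the PDF it writes."""
    if shutil.which("tesseract") is None:
        raise FileNotFoundError("tesseract executable not found on PATH")
    command = build_tesseract_command(image_path, output_base, language)
    subprocess.run(command, check=True)  # raises CalledProcessError on failure
    return output_base + ".pdf"


# Example usage (hypothetical file names):
#   run_ocr("scan.tif", "scan_ocr")          # English
#   run_ocr("scan.tif", "scan_ocr", "deu")   # German, if that language data is installed
```

Running `tesseract --list-langs` in a command prompt shows which language data files are available in your installation.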

ViewCompanion Premium

ViewCompanion Premium is a powerful tool for viewing, converting, and processing PDF, CAD, and image files, with advanced OCR support powered by the Tesseract engine. It can extract text from scanned documents and images, making files searchable and easy to reuse.
In addition to OCR, it offers features like PDF conversion, batch processing, and document cleanup to streamline your workflow.

Do you need to add OCR to your own application?

scConverter is a flexible SDK that can be easily integrated into your application using a COM interface or standard DLL import.
With built-in support for the Tesseract OCR engine, it can convert image files into searchable PDF documents with accurate text recognition.
Learn more about the scConverter SDK:

scConverter Conversion SDK Description