Cleaning Up Scanned Documents with Open Source Tools

by kaeru — published 2021/11/22 22:06:00 GMT+8, last modified 2021-11-24T16:00:41+08:00

As more and more Malaysian government information goes off-line under the current government, more work is needed to scan and digitize documents. In the current digital landscape of Malaysia, documents that are not available on-line may as well be inaccessible to the public. Sifting through hard copies of large amounts of information is also not really a feasible proposition for researchers. Digital formats allow the public and researchers to quickly search and categorize hundreds of thousands of pages of documents.

The source documents will not always be nicely scanned, OCR’ed and in PDF format. More often than not, we can expect some of them to be photos of text taken with camera phones. These images need to be cleaned up somewhat before we can make them available and usable.

Tools and Utilities Used

ScanTailor
pdftoppm
img2pdf
ocrmypdf
tesseract
GNU parallel
pdftk

Example of skewed text from scanned Malaysian parliamentary documents

Splitting PDF pages into images

Scanned documents often arrive as PDF files, usually without OCR, and need to be split into individual page images before processing.

Using pdftoppm:

pdftoppm -tiff -r 300 file.pdf imagename

Where 300 is the resolution in DPI and imagename is the prefix for the output files, which can be anything you like.

pdftoppm is part of the poppler-utils package
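When there are many PDFs to split, the same command can be wrapped in a loop. The following is a minimal sketch, not part of the original workflow: the function name split_pdfs and the pages/<name>/page output layout are assumptions chosen for illustration.

```shell
# Split every PDF in a directory into TIFF page images.
# Usage: split_pdfs SRC_DIR [DPI] [OUT_DIR]
# The function name and the OUT_DIR/<pdf-name>/page-NN.tif layout are
# illustrative assumptions, not something pdftoppm requires.
split_pdfs() {
    local src_dir="${1:-.}"
    local dpi="${2:-300}"
    local out_dir="${3:-pages}"
    local pdf name
    for pdf in "$src_dir"/*.pdf; do
        [ -e "$pdf" ] || continue              # glob matched nothing
        name="$(basename "$pdf" .pdf)"
        mkdir -p "$out_dir/$name"              # one folder per source PDF
        pdftoppm -tiff -r "$dpi" "$pdf" "$out_dir/$name/page"
    done
}
```

Calling split_pdfs scans 300 would place the pages of scans/report.pdf under pages/report/, which keeps the images of different documents apart when they are later loaded into ScanTailor.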

Deskewing

Once all the PDF pages are split into images, you need to deskew them, detect the content area, split pages (if they were scanned as a two-page book spread) and finally output them nicely formatted with margins.

The brilliant tool ScanTailor does all of this automatically for single or multiple pages. It can be installed via Flatpak.

Scantailor deskewing text — Scantailor deskewing automatically

deskewed text — Deskewed image after running ScanTailor

Putting it all together

ScanTailor outputs the cleaned-up scans as TIFF images, which need to be converted into PDF and joined together. We can use the img2pdf utility for this.

img2pdf *.tif -o joined_document.pdf
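The shell expands *.tif in plain alphabetical order, so files whose page numbers are not zero-padded (page-10.tif sorting before page-2.tif) can end up in the wrong order. The sketch below is one way to guard against that. The function name join_pages is made up for this example, and it assumes a sort that supports -V (GNU coreutils) and file names without newlines.

```shell
# Join TIFF pages into one PDF in natural page order.
# Usage: join_pages TIFF_DIR OUTPUT_PDF
# Assumptions: sort supports -V (GNU coreutils); no newlines in file names.
join_pages() {
    (
        dir="$1"
        output="$2"
        # Version sort puts page-2.tif before page-10.tif
        list="$(printf '%s\n' "$dir"/*.tif | sort -V)"
        IFS='
'
        set -f          # no further glob expansion while splitting the list
        set -- $list    # one positional parameter per file name
        [ -e "$1" ] || { echo "no TIFF files found in $dir" >&2; exit 1; }
        img2pdf "$@" -o "$output"
    )
}
```

For example, join_pages out joined_document.pdf joins every TIFF in the out directory. The body runs in a subshell so that the changes to IFS and globbing do not leak into the calling shell.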

Create PDF with OCR Text with ocrmypdf

tesseract is a command line OCR tool that supports multiple languages. ocrmypdf converts PDF pages into images that tesseract can process, then merges the recognized text back into a PDF that users can search and copy and paste text from.

ocrmypdf -l msa+eng input.pdf output_ocr.pdf

In this example, the -l option selects the pre-trained language data for Bahasa Melayu (msa) and English (eng) for better text accuracy, since Malaysian documents are often a mix of both. A list of language codes can be found on the tesseract page.
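Before starting a long OCR run, it can be worth checking that the requested language data is actually installed. A small sketch follows, using tesseract --list-langs. The helper name have_langs is hypothetical and not part of either tool.

```shell
# Check that every language in a "msa+eng" style string is installed.
# have_langs is a hypothetical helper name used only for this example.
have_langs() {
    local installed lang
    # Older tesseract releases print the list on stderr, so capture both
    installed="$(tesseract --list-langs 2>&1)"
    for lang in $(echo "$1" | tr '+' ' '); do
        if ! printf '%s\n' "$installed" | grep -qx "$lang"; then
            echo "missing tesseract language data: $lang" >&2
            return 1
        fi
    done
}
```

It could be used as have_langs msa+eng && ocrmypdf -l msa+eng input.pdf output_ocr.pdf, so that OCR only starts when both languages are available.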

Both utilities are available as Ubuntu/Debian packages. You can pick specific languages or scripts for tesseract-ocr, but apt-get install tesseract-ocr-all will install all of them.
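For a folder full of PDFs, the ocrmypdf command can be run in a loop. This is a rough sketch under a few assumptions: the function name ocr_all, the default ocr output folder and the skip-if-already-done check are choices made for this example, not behaviour of ocrmypdf itself.

```shell
# OCR every PDF in a directory, writing results to a separate folder.
# Usage: ocr_all SRC_DIR [OUT_DIR] [LANGS]
# ocr_all and the default "ocr" output folder are illustrative assumptions.
ocr_all() {
    local src_dir="${1:-.}"
    local out_dir="${2:-ocr}"
    local langs="${3:-msa+eng}"
    local pdf target
    mkdir -p "$out_dir"
    for pdf in "$src_dir"/*.pdf; do
        [ -e "$pdf" ] || continue          # glob matched nothing
        target="$out_dir/$(basename "$pdf")"
        if [ -e "$target" ]; then
            continue                       # already done, safe to re-run
        fi
        ocrmypdf -l "$langs" "$pdf" "$target"
    done
}
```

With GNU parallel installed, something like parallel -j 2 ocrmypdf -l msa+eng {} ocr/{/} ::: scans/*.pdf should do similar work on two files at a time, assuming the ocr folder already exists. Since ocrmypdf already uses several CPU cores per file, a low job count is probably enough.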

The final result

The skewed document is now readable and searchable by both people and computers, with accurate OCR text.

deskewed parliamentary text — Cleaned up document on pardocs.sinarproject.org

Plain text preview — plain text thanks to OCR step on pardocs.sinarproject.org

Other useful tools

Another useful command line tool, listed earlier, is pdftk, which can merge, split and fix PDF documents.
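A few common pdftk invocations, wrapped in small functions as a sketch. The function and file names are placeholders made up for this example. The pdftk operations themselves (cat, page ranges, and plain output for repair) are standard pdftk usage.

```shell
# Merge several PDFs into one: merge_pdfs OUTPUT INPUT...
merge_pdfs() {
    local out="$1"
    shift
    pdftk "$@" cat output "$out"
}

# Extract a page range such as 1-10: extract_pages INPUT RANGE OUTPUT
extract_pages() {
    pdftk "$1" cat "$2" output "$3"
}

# Rewrite a damaged PDF; pdftk attempts to repair it while copying
repair_pdf() {
    pdftk "$1" output "$2"
}
```

For example, merge_pdfs hansard.pdf part1.pdf part2.pdf joins two scanned parts, and extract_pages hansard.pdf 1-10 summary.pdf pulls out the first ten pages.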