Cleaning Up Scanned Documents with Open Source Tools
As more and more Malaysian government information goes offline under the current government, more work is needed to scan and digitize documents. In Malaysia's current digital landscape, documents that are not available online may as well be inaccessible to the public. Sifting through hard copies of large amounts of information is also not a feasible proposition for researchers. Digital formats allow the public and researchers to quickly search and categorize hundreds of thousands of pages of documents.
The source documents will not always be nicely scanned, OCR'ed and in PDF format. More often than not, we can expect them to be photos of text taken with camera phones. These images need to be cleaned up before we can make them available and usable.
Tools and Utilities Used
- ScanTailor
- pdftoppm
- img2pdf
- ocrmypdf
- tesseract
- GNU parallel
- pdftk
Splitting PDF pages into images
Scanned documents often arrive as PDF files, usually without OCR, and need to be split into individual page images before processing.
Using pdftoppm:
pdftoppm -tiff -r 300 file.pdf imagename
Where 300 is the resolution in DPI and imagename is the output filename prefix, which can be anything you like.
pdftoppm is part of the poppler-utils package.
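When there are several PDFs to split, the same command can be wrapped in a small shell function. This is a minimal sketch rather than a fixed recipe: the function name split_pdf and the pages/<name>/ output layout are assumptions made for illustration, and only the pdftoppm flags come from the command above.

```shell
# Split one PDF into TIFF page images, one directory per document.
# split_pdf and the pages/<name>/ layout are illustrative choices.
split_pdf() {
    local pdf="$1"
    local dpi="${2:-300}"                # default to 300 DPI, as above
    local name
    name="$(basename "$pdf" .pdf)"       # report.pdf -> report
    mkdir -p "pages/$name"
    # Page files are written with the prefix pages/<name>/<name>
    pdftoppm -tiff -r "$dpi" "$pdf" "pages/$name/$name"
}
```

For example, for f in *.pdf; do split_pdf "$f"; done would split every PDF in the current directory, keeping each document's pages in its own folder.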
Deskewing
Once the PDF pages are split into images, you will need to deskew them, detect content, split pages (if scanned as a two-page book spread) and finally output them nicely formatted with margins.
The brilliant tool ScanTailor does all of this automatically for single or multiple pages. It can be installed via Flatpak.
Putting it all together
ScanTailor's final output is the cleaned-up scans as TIFF images, which need to be converted to PDF and joined together. We can use the img2pdf utility for this.
img2pdf *.tif -o joined_document.pdf
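One thing to watch for: the shell expands *.tif in plain alphabetical order, so page-10.tif sorts before page-2.tif when page numbers are not zero-padded. The sketch below joins pages in numeric order instead. It assumes GNU find, sort and xargs (for -print0, -z/-V and -0), and join_pages is a hypothetical helper name.

```shell
# Join every .tif in a directory into one PDF, in natural page order.
# join_pages is a hypothetical helper; it assumes GNU find/sort/xargs.
join_pages() {
    local dir="$1"
    local out="$2"
    # NUL-separated names survive spaces; sort -V puts 2 before 10.
    find "$dir" -maxdepth 1 -name '*.tif' -print0 \
        | sort -z -V \
        | xargs -0 img2pdf -o "$out"
}
```

A call such as join_pages out joined_document.pdf would then take the place of the glob-based command. If the page files are already zero-padded, the plain *.tif form is just as good.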
Create PDF with OCR Text with ocrmypdf
tesseract is a command line OCR tool that supports multiple languages. ocrmypdf converts PDF pages into images for tesseract to process, then merges the recognized text back into a PDF that users can search and copy and paste text from.
ocrmypdf -l msa+eng input.pdf output_ocr.pdf
In this example, -l selects the pre-trained language data for Bahasa Melayu (msa) and English (eng), which improves text accuracy for Malaysian documents that are often a mix of both. A list of language codes can be found on the tesseract page.
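To run the same OCR step over a whole directory of PDFs, a loop like the one below is one option. It is a sketch under assumptions: ocr_all and its two directory arguments are made up for illustration, and only the -l msa+eng option is taken from the example above.

```shell
# OCR every PDF in src_dir, writing results to out_dir under the same name.
# ocr_all is a hypothetical helper, not part of ocrmypdf.
ocr_all() {
    local src_dir="$1"
    local out_dir="$2"
    local pdf
    mkdir -p "$out_dir"
    for pdf in "$src_dir"/*.pdf; do
        [ -e "$pdf" ] || continue        # no PDFs: the glob stays unexpanded
        ocrmypdf -l msa+eng "$pdf" "$out_dir/$(basename "$pdf")" \
            || echo "OCR failed: $pdf" >&2
    done
}
```

With GNU parallel from the tools list, something like parallel ocrmypdf -l msa+eng {} ocr/{/} ::: scans/*.pdf should spread the same work across several files at once, where {} is the input path and {/} its basename.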
Both utilities are available as Ubuntu/Debian packages. You can pick specific languages/scripts for tesseract-ocr, but apt-get install tesseract-ocr-all will install all of them.
The final result
The skewed document is now readable and searchable by both people and computers, with accurate OCR text.
Other useful tools
Another useful command line tool listed earlier, for merging, splitting and fixing PDF documents, is pdftk.
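A few common pdftk invocations are sketched below as shell functions, so the argument order is easy to see. The function names are made up for this sketch; cat, burst and output are pdftk's own keywords.

```shell
# Merge several PDFs into one: pdf_merge out.pdf a.pdf b.pdf ...
pdf_merge() {
    local out="$1"
    shift
    pdftk "$@" cat output "$out"
}

# Extract a page range, e.g. pdf_pages in.pdf 1-5 out.pdf
pdf_pages() {
    pdftk "$1" cat "$2" output "$3"
}

# Split a PDF into one file per page (page_001.pdf, page_002.pdf, ...)
pdf_burst() {
    pdftk "$1" burst output page_%03d.pdf
}

# Rewrite a damaged PDF; pdftk tries to repair it while writing the copy
pdf_repair() {
    pdftk "$1" output "$2"
}
```

For instance, pdf_merge joined.pdf part1.pdf part2.pdf concatenates two OCR'ed parts into a single document.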