Making sense of scanned books


I am, as of late, frequently in receipt of PDFs containing scanned images of text from diverse sources.

There are at least two things that I like to do with these PDFs. One is to optimise them for reading on an ebook device; I use the excellent k2pdfopt to achieve this end. The other is to OCR them for personal use, particularly journal articles whose embedded text is garbled to such an extent that it is impossible to add highlight annotations. For this task I use tesseract.
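For the OCR step, tesseract takes an image and an output base name, and passing pdf as the final argument selects searchable-PDF output. A minimal sketch — the ocr_page name and the page-001.png filename are illustrative, not from my actual scans:

```shell
# Illustrative helper: OCR one page image into a searchable PDF.
# tesseract <image> <outputbase> pdf  writes <outputbase>.pdf with a
# text layer underneath the original image.
ocr_page() {
    tesseract "$1" "${1%.*}-ocr" pdf
}

# e.g. ocr_page page-001.png  would produce page-001-ocr.pdf
```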

Now, both of these tools work best when the source is cleaned up and consists of rectilinear columns of text. This is rarely the case with hand-prepared scans of books.


In order to remedy this, I use unpaper, which processes image files: removing scanner artifacts such as those left behind by misaligning the page on the scanner glass, deskewing crooked pages, and even splitting two-page spreads into single leaves.
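The spread-splitting mode deserves a quick sketch. As I understand unpaper's options, --layout double declares that the input holds two pages side by side, and --output-pages 2 writes each half to its own file; the split_spread name and filenames below are my own (check unpaper --help for your version):

```shell
# Sketch: split a scanned two-page spread into two single-leaf images.
# Flags as documented for unpaper 6.x; deskewing is on by default.
split_spread() {
    unpaper --layout double --output-pages 2 "$1" "${1%.ppm}-leaf-%d.ppm"
}

# e.g. split_spread spread.ppm  would write spread-leaf-1.ppm and
# spread-leaf-2.ppm
```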

unpaper does a great job, but is designed to take PPM-format image files as input. So your first step is to get the source PDF into PPM. There is a utility called pdftoppm that achieves this, though I believe it is also possible with the excellent ImageMagick (convert). I use ImageMagick to re-create the PDF from the cleaned-up and deskewed PPMs.
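For the ImageMagick route, a sketch — this assumes your ImageMagick build can read PDFs (it delegates to Ghostscript for that), and the pdf_to_ppm name and 300 dpi density are illustrative choices, not from the original:

```shell
# Sketch: rasterise each page of a PDF to a zero-padded PPM sequence.
# -density sets the rasterisation resolution; %03d in the output name
# makes ImageMagick number the pages 000, 001, ...
pdf_to_ppm() {
    convert -density 300 "$1" "${1%.pdf}-%03d.ppm"
}

# e.g. pdf_to_ppm scan.pdf  would write scan-000.ppm, scan-001.ppm, ...
```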

The workflow is thus:

pdftoppm scan.pdf scan-ppm 

which results in sequentially numbered files corresponding to each PDF page (e.g. scan-ppm-001.ppm, …), then:

unpaper scan-ppm-%03d.ppm out-%03d.ppm

where the printf-style %03d placeholder indicates that the page numbers are zero-padded on the left to three digits, and then:

convert out-*.ppm scan-fixed.pdf
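The three steps above can be gathered into a single shell function. The commands are exactly those from the workflow; only the clean_scan wrapper name is mine, and it assumes pdftoppm, unpaper and ImageMagick's convert are all on your PATH:

```shell
# Sketch of the full pipeline: PDF -> PPM pages -> cleaned pages -> PDF.
clean_scan() {
    pdftoppm "$1" scan-ppm                    # writes scan-ppm-001.ppm, ...
    unpaper scan-ppm-%03d.ppm out-%03d.ppm    # deskew and clean each page
    convert out-*.ppm "${1%.pdf}-fixed.pdf"   # reassemble into a PDF
}

# usage: clean_scan scan.pdf   (produces scan-fixed.pdf)
```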