March 2024

OCRing a scanned book


I recently found a scanned PDF of an older book. It was somewhat low-resolution with scaling or compression artefacts around the text. Many pages were skewed. The contrast was not high.

First Attempt (not good)

The basic way to OCR a PDF is to use tesseract. There are other options out there, but they did not seem as consistent or functional. I tried some ML-based models and looked at the documentation for some of the Chinese-made OCR options.

tesseract cannot take a PDF as input and OCR it straight to a PDF. First, the pages must be converted to images using pdftoppm. This creates one image file per page, which need to be looped over with a for loop, running tesseract on each one. You can then use pdfunite to combine the resulting OCR’d PDFs.
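Here is a rough sketch of that loop. The function name, the 300 DPI value, and the temporary directory are placeholders of mine, and it assumes pdftoppm, tesseract, and pdfunite are all on the PATH.

```shell
#!/bin/sh
# Sketch: OCR a scanned PDF page by page with tesseract, then merge.
# Assumes poppler-utils (pdftoppm, pdfunite) and tesseract are installed.
# The function name and the 300 DPI value are illustrative choices.

ocr_scanned_pdf() {
    input=$1
    output=$2
    workdir=$(mktemp -d) || return 1

    # Render every page to a PNG: page-01.png, page-02.png, ...
    pdftoppm -r 300 -png "$input" "$workdir/page" || return 1

    # OCR each page image into a one-page searchable PDF next to it.
    for img in "$workdir"/page-*.png; do
        tesseract "$img" "${img%.png}" pdf || return 1
    done

    # pdftoppm zero-pads the page numbers, so the glob sorts in page order.
    pdfunite "$workdir"/page-*.pdf "$output" || return 1
    rm -rf "$workdir"
}

# Usage: ocr_scanned_pdf book.pdf book-ocr.pdf
```

This only reproduces the first attempt, so it inherits the same problem with skewed pages.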

Unfortunately, tesseract doesn’t work well on skewed text, so when you highlight the text in Zotero, “t h e h i g h l i g h t l o o k s l i k e t h i s.”

Other Thoughts

I also found that the pdftoppm images showed hidden parts of the PDF. Someone’s hand and the scanning table were in there! It was necessary to crop the PDF. imagemagick didn’t work well for this: it changed the resolution after cropping. I ended up using ghostscript after getting the size of the PDF and then changing the CropBox. You have to specify the -dUseCropBox flag, otherwise it

Ignore the above. Do not use CropBox, MediaBox, etc. to try removing hidden parts of a PDF, because they don’t work. Even though I got rid of the extra part using the -dUseCropBox flag and verified it by running pdfinfo -box file.pdf, when I open the file in Inkscape I can clearly see the extra parts of the image. That’s because these bounding boxes only control what is displayed. They don’t touch the underlying image data, so the data can’t be removed that way. pdftoppm uses these bounding boxes when it creates PNGs, so the hidden part does not show up in them, but I realized it hadn’t been deleted when the resultant PDF was the same size as before.
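A small sketch of that check, assuming pdfinfo from poppler is installed. The function name is made up for illustration. It prints the MediaBox and CropBox lines followed by the file size in bytes, so you can compare a PDF before and after "cropping".

```shell
#!/bin/sh
# Sketch: print a PDF's MediaBox/CropBox lines and its size in bytes.
# Assumes pdfinfo (poppler-utils) is installed; the function name is hypothetical.

show_boxes_and_size() {
    # pdfinfo -box adds the page boundary boxes to its usual output.
    pdfinfo -box "$1" | grep -E '^(MediaBox|CropBox)'
    # File size in bytes; if it barely changes, the image data is still there.
    wc -c < "$1"
}

# Usage: show_boxes_and_size test.pdf; show_boxes_and_size output.pdf
```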


Here is what -dUseCropBox is supposed to do btw.

Sets the page size to the CropBox rather than the MediaBox. Unlike the other “page boundary” boxes, CropBox does not have a defined meaning, it simply provides a rectangle to which the page contents will be clipped (cropped). By convention, it is often, but not exclusively, used to aid the positioning of content on the (usually larger, in these cases) media.

Though this command isn’t useful for actually cropping the PDF, I realized that it could be combined with pdftoppm to turn the PDF into a bunch of images with the extra parts removed.

gs -o output.pdf -sDEVICE=pdfwrite -dUseCropBox -f test.pdf
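To sketch the pdftoppm half of that idea: render the pages to PNGs so that only the visible area ends up in the images, then wrap the images back into a PDF. The function names, the 300 DPI value, and the use of img2pdf for the last step are assumptions on my part, not something from my actual run.

```shell
#!/bin/sh
# Sketch: turn a PDF into page images limited to the CropBox area,
# then rebuild a PDF from those images.
# Assumes pdftoppm (poppler-utils) and img2pdf are installed.
# Function names and the 300 DPI value are illustrative.

render_cropped_pages() {
    pdf=$1      # e.g. the output.pdf written by the gs command above
    outdir=$2
    mkdir -p "$outdir" || return 1
    # -cropbox makes pdftoppm render the CropBox rather than the MediaBox.
    pdftoppm -cropbox -r 300 -png "$pdf" "$outdir/page"
}

images_to_pdf() {
    imgdir=$1
    output=$2
    # img2pdf embeds the images as they are, without re-encoding them.
    img2pdf "$imgdir"/page-*.png -o "$output"
}

# Usage:
#   render_cropped_pages output.pdf pages
#   images_to_pdf pages cropped-for-real.pdf
```

Because the new PDF is built from the rendered images, the hidden parts are really gone from it, at the cost of rasterizing at whatever resolution you pick.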

It was also necessary to split each page in half. ghostscript is a bit too complicated for this, so I went with mutool instead. This command was really fast somehow.

mutool poster -x 2 output-2.pdf output.pdf

Note: the above commands were specific to my workflow and I won’t go into detail about the specific usage. Use an LLM, man pages, the --help option, and documentation to do what works for your case.

A Better Option

I found ocrmypdf to be the best option out there. It uses tesseract under the hood, but it adds a lot of other useful features. I found the deskewing to be the most helpful. It used to support removing the background (and maybe denoising along with it?), but that is no longer supported due to outdated dependencies.
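A minimal sketch of the call, assuming ocrmypdf is installed. The wrapper function and the default of eng for the language are placeholders of mine.

```shell
#!/bin/sh
# Sketch: OCR a PDF with ocrmypdf, straightening skewed pages first.
# Assumes ocrmypdf is installed; the function name is hypothetical and
# the language defaults to "eng" unless a third argument is given.

deskew_and_ocr() {
    # --deskew straightens each page before tesseract sees it.
    ocrmypdf --deskew --language "${3:-eng}" "$1" "$2"
}

# Usage: deskew_and_ocr scanned.pdf searchable.pdf
```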

ocrmypdf cautions against certain tools:

We caution against using ImageMagick or Ghostscript to convert images to PDF, since they may transcode images or produce downsampled images, sometimes without warning.

One downside was that it didn’t appear to have any options to increase the contrast or upsample.

Final Workflow

It should look something like this:

  • Have original PDF
  • Crop any hidden parts of PDF to reduce image size
  • Upsample, reduce noise, increase contrast if needed
  • Split into 2 pages
  • Use ocrmypdf with deskewing options
  • Create table of contents
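Strung together, the steps I have actually run could look something like the sketch below. It assumes gs, mutool, and ocrmypdf are installed and that the CropBox has already been set. The function name and intermediate file names are made up. It leaves out the upsampling, denoising, contrast, and table of contents steps, and the gs step only hides the extra area rather than deleting it.

```shell
#!/bin/sh
# Sketch: crop to the CropBox, split each spread in two, then deskew and OCR.
# Assumes gs, mutool and ocrmypdf are installed and the CropBox is already set.
# Function and file names are illustrative only.

ocr_book() {
    src=$1
    dest=$2
    work=$(mktemp -d) || return 1

    # Set the page size to the CropBox (hides, but does not delete, the rest).
    gs -o "$work/cropped.pdf" -sDEVICE=pdfwrite -dUseCropBox -f "$src" || return 1

    # Cut every page into 2 tiles along the x axis (left and right halves).
    mutool poster -x 2 "$work/cropped.pdf" "$work/split.pdf" || return 1

    # Straighten the pages and add the OCR text layer.
    ocrmypdf --deskew "$work/split.pdf" "$dest" || return 1

    rm -rf "$work"
}

# Usage: ocr_book test.pdf book-ocr.pdf
```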

Other things I tried

For upsampling, I looked at video2x but settled on using Real-ESRGAN on the command line. There was also a script I found on reddit called readablePDF, which depends on a tool called scantailor. I tried installing scantailor using a forked brew version, but one of the buttons on the GUI didn’t work. It comes as a GUI program rather than a command-line tool.

Upcoming

I haven’t yet done the upsampling, noise reduction, contrast increase, or table of contents in a satisfactory way. I tried using macOS’s default PDF export with a Quartz filter to increase the contrast, but the result wasn’t very legible. I think the scan needs to be upsampled and denoised first, otherwise the stray pixels around the text make the contrast increase look unclear. For the table of contents, I’m not sure what tools can directly edit that, but I’m going to explore pandoc and other tools. Maybe you can convert from one format to another, edit it in an ePub editor like Sigil, and then convert it back to PDF. But it might be messy, idk.

I look forward to sharing my results once I find a good solution.