I recently found a scanned PDF of an older book. It was somewhat low-resolution with scaling or compression artefacts around the text. Many pages were skewed. The contrast was not high.
First Attempt (not good)
The basic way to OCR a PDF is to use tesseract. There are other options out there, but they do not seem as consistent or functional. I tried some ML-based models and looked at the documentation for some of the Chinese-made OCR options.
tesseract cannot OCR a PDF directly into a PDF. First, the pages must be converted to images using pdftoppm. This creates one image file per page, and these need to be looped over with a for loop, running tesseract on each one. You can then use pdfunite to combine the OCR’d PDFs.
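As a rough sketch, the three steps above could be wrapped in a shell function like the one below. The function name ocr_pdf_naive and the 300 dpi setting are my own assumptions for illustration, not something the tools require.

```shell
# Sketch: OCR a scanned PDF page by page, then merge the results.
# Assumes pdftoppm and pdfunite (poppler-utils) and tesseract are installed.
# ocr_pdf_naive is a hypothetical helper name.
ocr_pdf_naive() {
    input=$1      # scanned PDF to read
    output=$2     # searchable PDF to create
    workdir=$(mktemp -d) || return 1

    # Render every page to a PNG. 300 dpi is an assumed resolution for OCR.
    pdftoppm -r 300 -png "$input" "$workdir/page" || return 1

    # tesseract adds the ".pdf" suffix to the output base name itself.
    for img in "$workdir"/page-*.png; do
        tesseract "$img" "${img%.png}" pdf || return 1
    done

    # pdftoppm zero-pads page numbers, so the glob expands in page order.
    pdfunite "$workdir"/page-*.pdf "$output" || return 1
    rm -rf "$workdir"
}
```

Usage would be something like: ocr_pdf_naive scan.pdf scan-ocr.pdf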
Unfortunately, tesseract doesn’t work well on skewed text, so when you highlight the text in Zotero, “t h e h i g h l i g h t l o o k s l i k e t h i s.”
Other Thoughts
I also found that the pdftoppm images showed hidden parts of the PDF. There was someone’s hand and the scanning table there! It was necessary to crop the PDF. imagemagick didn’t work well: it changed the resolution after cropping. I ended up using ghostscript after getting the size of the PDF and then changing the CropBox. You have to specify the -dUseCropBox flag, otherwise it
Ignore the above. Do not use CropBox, MediaBox, etc. to try to remove hidden parts of a PDF, because it doesn’t work. Even though I got rid of the extra part using the -dUseCropBox flag and verified it by running pdfinfo -box file.pdf, when I open it in Inkscape I can clearly see the extra parts of the image. That’s because these bounding boxes only control which region of the page is shown; they don’t change the underlying image data, so the hidden parts can’t be deleted that way. pdftoppm uses these bounding boxes when creating PNGs, so its output does not reveal the hidden part, but I realized the data wasn’t deleted when the resulting PDF was the same size.
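For anyone who wants to repeat this check, here is a small sketch. The helper names are made up for illustration, and comparing file sizes is only a rough heuristic (re-encoding can change the size too), not proof either way.

```shell
# Print only the MediaBox and CropBox lines that pdfinfo reports.
# show_boxes is a hypothetical helper name.
show_boxes() {
    pdfinfo -box "$1" | grep -E '^(MediaBox|CropBox):'
}

# Print the size of a file in bytes (tr strips the padding some wc builds add).
size_in_bytes() {
    wc -c < "$1" | tr -d ' '
}

# Succeed only if the "cropped" file is actually smaller than the original.
# If it is not, the hidden image data is probably still inside the PDF.
crop_shrank_file() {
    [ "$(size_in_bytes "$2")" -lt "$(size_in_bytes "$1")" ]
}
```

Usage would be something like: show_boxes output.pdf, then crop_shrank_file test.pdf output.pdf || echo "hidden data probably still there"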
Here is what -dUseCropBox is supposed to do btw.
Sets the page size to the CropBox rather than the MediaBox. Unlike the other “page boundary” boxes, CropBox does not have a defined meaning, it simply provides a rectangle to which the page contents will be clipped (cropped). By convention, it is often, but not exclusively, used to aid the positioning of content on the (usually larger, in these cases) media.
Though this flag isn’t useful for actually cropping the PDF, I realized that it could be combined with pdftoppm to turn the PDF into a bunch of images with the extra parts removed.
gs -o output.pdf -sDEVICE=pdfwrite -dUseCropBox -f test.pdf
It was also necessary to split each page in half. ghostscript is a bit too complicated for this, so I went with mutool instead. This command was really fast somehow.
mutool poster -x 2 output-2.pdf output.pdf
Note: the above commands were specific to my workflow and I won’t go into detail about their usage. Use an LLM, man pages, the --help option, and documentation to do what works for your case.
A Better Option
I found ocrmypdf to be the best option out there. It uses tesseract under the hood, but it adds a lot of other useful features. I found the deskewing to be the most helpful. It used to support removing the background (and maybe denoising along with it?), but that is not supported anymore due to outdated dependencies.
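A minimal sketch of the deskewing run, assuming ocrmypdf is installed. The wrapper name ocr_with_deskew and the default language of eng are my own choices for illustration; the language codes are the ones tesseract uses.

```shell
# Sketch: OCR a PDF with ocrmypdf and straighten skewed pages first.
# ocr_with_deskew is a hypothetical wrapper name.
ocr_with_deskew() {
    lang=${3:-eng}   # optional third argument: tesseract language code
    ocrmypdf --deskew --language "$lang" "$1" "$2"
}
```

Usage would be something like: ocr_with_deskew input.pdf output.pdf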
ocrmypdf cautions against certain tools:
We caution against using ImageMagick or Ghostscript to convert images to PDF, since they may transcode images or produce downsampled images, sometimes without warning.
One downside was that it didn’t appear to have any options to increase the contrast or upsample.
Final Workflow
It should look something like this:
- Have original PDF
- Crop any hidden parts of PDF to reduce image size
- Upsample, reduce noise, increase contrast if needed
- Split into 2 pages
- Use ocrmypdf with deskewing options
- Create table of contents
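The parts of this list that I already have commands for could be chained roughly as in the sketch below. It only covers the CropBox, split, and OCR steps; the upsampling, denoising, contrast, and table of contents steps are left out because I have not settled on tools for them. The function name and the temporary file names are hypothetical.

```shell
# Sketch: chain the CropBox, split, and OCR steps from the list above.
# Assumes gs (ghostscript), mutool, and ocrmypdf are installed.
# scan_to_searchable_pdf is a hypothetical helper name.
scan_to_searchable_pdf() {
    input=$1
    output=$2
    tmp=$(mktemp -d) || return 1

    # Clip each page to its CropBox. As noted earlier, this does not
    # delete the hidden image data, it only changes the page size.
    gs -o "$tmp/cropped.pdf" -sDEVICE=pdfwrite -dUseCropBox -f "$input" || return 1

    # Split each scanned spread into a left and a right page.
    mutool poster -x 2 "$tmp/cropped.pdf" "$tmp/split.pdf" || return 1

    # OCR the result, straightening skewed pages first.
    ocrmypdf --deskew "$tmp/split.pdf" "$output" || return 1

    rm -rf "$tmp"
}
```

Usage would be something like: scan_to_searchable_pdf book-scan.pdf book-ocr.pdf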
Other things I tried
For upsampling, I looked at video2x but settled on using Real-ESRGAN on the command line. There was also a script I found on Reddit called readablePDF, which depends on a tool called scantailor. I tried installing it using a forked brew version, but one of the buttons in the GUI didn’t work. It comes as a GUI program rather than a command-line tool.
Upcoming
I haven’t yet done upsampling, noise reduction, contrast enhancement, or the table of contents in a satisfactory way. I tried using macOS’s default PDF export with a Quartz filter to increase the contrast, but the result wasn’t very legible. I think it needs to be upsampled and denoised first, otherwise the stray pixels around the text make the contrast increase unclear. For the table of contents, I’m not sure what tools can directly edit that, but I’m going to explore pandoc and other tools. Maybe you can convert from one format to another, edit it in an ePub editor like Sigil, and then convert it back to PDF. But it might be messy, idk.
I look forward to sharing my results once I find a good solution.