ocrPDF

The program ocrPDF adds a text layer to a graphics-only PDF file. It uses the Tesseract OCR engine for text recognition.

Frequently asked questions

Why not use the “pdf” output mode of tesseract directly? Why not use the program OCRmyPDF?

Both programs do a fine job in general, but both re-compress the image data found in the PDF file. For PDFs that contain JBIG2-encoded data, this often increases the file size by a factor of about 10. The program ocrPDF is able to deal with JBIG2 data and increases the file size only moderately, by the amount needed to include the text data.
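To judge whether this matters for a given file, it can help to check whether the PDF uses JBIG2 compression at all. The following Python snippet is a rough sketch and not part of ocrPDF: it scans the raw bytes of a file for the standard PDF filter name `/JBIG2Decode`. The function names are made up for illustration. The check is only a heuristic, because a PDF that stores its image dictionaries inside compressed object streams may hide the filter name from a plain byte search.

```python
from pathlib import Path

# Name of the JBIG2 stream filter as defined in the PDF specification.
JBIG2_FILTER = b"/JBIG2Decode"


def contains_jbig2(pdf_bytes: bytes) -> bool:
    """Return True if the filter name /JBIG2Decode occurs in the raw PDF bytes.

    Heuristic only: a False result does not prove the absence of JBIG2 data,
    since the dictionaries naming the filter may themselves be compressed.
    """
    return JBIG2_FILTER in pdf_bytes


def file_contains_jbig2(path: str) -> bool:
    """Hypothetical convenience wrapper that reads a file and applies the check."""
    return contains_jbig2(Path(path).read_bytes())
```

If the check reports JBIG2 data, re-compressing the images is likely to inflate the file noticeably, which is the case described above.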