How To Ocr An Existing Pdf



OCR stands for “Optical Character Recognition” and it is how a computer (or iPad) converts a picture of a document into a fully searchable PDF file. Why is this important? Because the next step in my workflow is that I want to pull the PDF into GoodReader or PDF Expert so that I can annotate the file with text highlights, underline, etc.

I am looking for an offline scriptable tool that makes an existing PDF file searchable by running OCR on it, replacing the original non-searchable file with the searchable version, and can run unattended.

E.g., www.pdfscannerapp.com - does exactly what I need, but it's GUI only - not scriptable.

I am aware that Evernote makes PDF files searchable, but they remain searchable only when within Evernote.

I am not looking for perfect OCR, even a moderately acceptable OCR is fine, but I would prefer a small utility rather than a bulky software package.

(I am aware of a similar, but different question on AD: Looking for Software to Scan or Convert to Searchable and Signable PDF - however, I don't need to sign or fill PDFs, and my requirement is that the solution is scriptable)

EDIT:

1) Several utilities allow structured text extraction, however in order to be extracted, the text must be there; I am mainly referring to PDFs that are wrapped bitmaps, as is the case with plain PDFs generated by scanners.

2) I am not necessarily looking for a free solution, and I would be more than happy to pay for a good utility that just does what I need, but I am not looking for bulky applications with a million features that include an OCR feature but whose cost does not justify buying them just for the OCR functionality.

3) As stated above, I am not looking for perfect OCR, just moderately acceptable OCR. Unfortunately, in my experience, Tesseract falls well below that threshold. By 'moderately acceptable' I mean OCR that can, say, read a utility bill well enough that at least the account number (customer number) is recognized correctly.

EDIT: 'scriptable' or 'automatable', that is, able to be triggered automatically and run unattended without human input whatsoever.

magma

13 Answers

It's not entirely clear to me what your requirements are for being able to 'script' this from the 'command line'.

If you are talking about automation, then that is possible with any number of utilities.

ABBYY FineReader Express + Keyboard Maestro + Hazel

I use ABBYY FineReader Express + Keyboard Maestro + Hazel like so:

  1. Hazel monitors a given folder for any new PDFs.

  2. If a PDF is found, it is opened in 'ABBYY FineReader Express'.

  3. Keyboard Maestro then automates the process of turning the PDF into a searchable PDF (OCR) and saves the file to a different directory.
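
The folder-watching step can be approximated without Hazel. The following is a minimal sketch in Python, assuming simple polling is acceptable; the names `find_new_pdfs`, `watch`, and `handle_pdf` are hypothetical, and this is not how Hazel itself works.

```python
import time
from pathlib import Path
from typing import Callable, List, Set


def find_new_pdfs(folder: Path, seen: Set[Path]) -> List[Path]:
    """Return PDFs in `folder` that have not been reported before."""
    new_files = sorted(
        p for p in folder.iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf" and p not in seen
    )
    seen.update(new_files)  # remember them so each file is handled once
    return new_files


def watch(folder: Path, handle_pdf: Callable[[Path], None],
          interval: float = 5.0) -> None:
    """Poll `folder` forever and pass each new PDF to `handle_pdf`."""
    seen: Set[Path] = set()
    while True:
        for pdf in find_new_pdfs(folder, seen):
            handle_pdf(pdf)  # hypothetical hook: run the OCR tool of your choice
        time.sleep(interval)
```

As in the workflow above, the handler should save its result to a different directory so that the output is not picked up as a new input. A file that is still being written by the scanner may be seen too early; checking that its size has stopped changing is one way to guard against that.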

Now, if you don't own Hazel and Keyboard Maestro already, your initial costs are going to rise pretty quickly (although I depend on both so much I consider them a bargain).

PDFPen + AppleScript + Folder Actions

You could do something similar with PDFPen (or PDFPenPro) and folder actions and AppleScript. See https://gist.github.com/prenagha/1355037 for one example.

Marco Arment did a survey of OCR apps for Mac and found that PDFPen had great results and was easy to automate.

A Google search for 'PDFpen applescript OCR' will turn up a number of alternatives.

TJ Luoma

What you want is Tesseract OCR. It's an open source OCR engine whose development has been sponsored by Google, and it supports a variety of platforms. It also has a native command line interface. It's exactly what you're looking for and is available from MacPorts as well as Homebrew.

Project Home: https://github.com/tesseract-ocr

How to install on OS X: http://blog.matt-swain.com/post/26419042500/installing-tesseract-ocr-on-mac-os-x-lion

Usage example (Tesseract reads image files, so convert the PDF pages to images first): tesseract input.tiff output -l eng
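
To get a searchable PDF rather than plain text, Tesseract can be given the `pdf` config name after the output base. A minimal sketch in Python, assuming the page has already been converted to an image such as TIFF and that `tesseract` is on the PATH; the function names are hypothetical.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List


def tesseract_pdf_command(image: Path, output_base: Path,
                          lang: str = "eng") -> List[str]:
    """Build the argv for a Tesseract run that writes a searchable PDF."""
    # `pdf` is a Tesseract config name selecting PDF output;
    # Tesseract appends ".pdf" to output_base itself.
    return ["tesseract", str(image), str(output_base), "-l", lang, "pdf"]


def run_tesseract(image: Path, output_base: Path, lang: str = "eng") -> Path:
    """Run Tesseract and return the path of the PDF it should have written."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract not found on PATH")
    subprocess.run(tesseract_pdf_command(image, output_base, lang), check=True)
    return output_base.with_name(output_base.name + ".pdf")
```

For example, `run_tesseract(Path("page.tiff"), Path("page-ocr"))` would be expected to leave `page-ocr.pdf` next to the script.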

CousinCocaine
Daniel Kocevski

Disclaimer: NOT AN OCR SOLUTION (but this answer is still useful for extracting text from a PDF)

There is an Apache Software Foundation project called Apache Tika:

A toolkit that detects and extracts metadata and structured text content from various documents using existing parser libraries

They support PDF text extraction using PDFBox:

allows creation of new PDF documents, manipulation of existing documents and the ability to extract content from documents. Apache PDFBox also includes several command line utilities

And they recently also added support for OCR (via Tesseract).

For a text-based solution, PDFBox makes it very simple to extract text from a PDF:

  • Download the pdfbox-app package from https://pdfbox.apache.org/downloads.html
  • run the ExtractText command on it:

    java -jar pdfbox-app-x.y.z.jar ExtractText myNiceBook.pdf myNiceBook.txt

It also has some other nice options that you can see in ExtractText docs.
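
Text extraction is also a cheap way to decide whether a PDF needs OCR at all: a scanned PDF that is only a wrapped bitmap yields little or no text. A minimal sketch in Python, assuming the `ExtractText` invocation shown above (other PDFBox releases may use a different syntax); `has_text_layer`, `extract_text`, and the 20-character threshold are hypothetical choices.

```python
import subprocess
import tempfile
from pathlib import Path


def has_text_layer(extracted_text: str, min_chars: int = 20) -> bool:
    """Treat a PDF as searchable if extraction yields enough non-space characters."""
    return sum(not ch.isspace() for ch in extracted_text) >= min_chars


def extract_text(pdf: Path, pdfbox_jar: Path) -> str:
    """Run PDFBox ExtractText into a temporary file and return its contents."""
    with tempfile.TemporaryDirectory() as tmp:
        out = Path(tmp) / "out.txt"
        subprocess.run(
            ["java", "-jar", str(pdfbox_jar), "ExtractText", str(pdf), str(out)],
            check=True,
        )
        return out.read_text(encoding="utf-8", errors="replace")
```

A batch script could then skip every file for which `has_text_layer(extract_text(pdf, jar))` is true and send only the rest to an OCR tool.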

brutuscat

I would recommend DEVONThink Pro Office. It is an excellent application and has very good AppleScript support. Alas only the 'Pro Office' version has the OCR capability - so you'll have to shell out £100 ($150).

It would be overkill if you're only using it for scripted OCR - but it's a very good app.

[edit] - ah just re-read your post - it would definitely be overkill!

If you just want OCR from the shell, you could try talking to ABBYY, whose engine DEVON licenses.

Diggory

You can make your existing PDF searchable by converting it into a text file. For that you need at least ImageMagick, Ghostscript (for PDF conversion) and the Tesseract OCR tool.

A command-line example:
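
A minimal sketch of such a pipeline in Python, assuming ImageMagick's `convert` (named `magick` in ImageMagick 7) and `tesseract` are on the PATH. The function names, the 300 DPI setting, and the particular `convert` options are assumptions for illustration.

```python
import subprocess
from pathlib import Path
from typing import List, Tuple


def pipeline_commands(pdf: Path, workdir: Path, dpi: int = 300,
                      lang: str = "eng") -> Tuple[List[List[str]], Path]:
    """Build the two commands: rasterize the PDF, then OCR the image to text."""
    tiff = workdir / (pdf.stem + ".tiff")
    out_base = workdir / pdf.stem
    # ImageMagick delegates PDF reading to Ghostscript; the result is a
    # (possibly multi-page) TIFF with a flat white background.
    rasterize = ["convert", "-density", str(dpi), str(pdf), "-depth", "8",
                 "-background", "white", "-alpha", "off", str(tiff)]
    # Tesseract appends ".txt" to the output base by default.
    recognize = ["tesseract", str(tiff), str(out_base), "-l", lang]
    return [rasterize, recognize], workdir / (pdf.stem + ".txt")


def run_pipeline(pdf: Path, workdir: Path) -> Path:
    """Run both steps and return the path of the resulting text file."""
    commands, text_file = pipeline_commands(pdf, workdir)
    for cmd in commands:
        subprocess.run(cmd, check=True)
    return text_file
```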

This can be extended further to your needs.

To install the required tools on OS X, you can use Homebrew (the formulae are named imagemagick, ghostscript and tesseract).

On Linux use apt-get or yum instead of brew.

For more OCR tools, check: OCR on Linux systems


kenorb

A solution that is easy to set up and that produces an output PDF of the same quality as the input file, at a reasonable size, is OCRmyPDF:
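
The basic OCRmyPDF invocation takes an input file and an output file. To match the question's requirement of replacing the original, one option is to write to a temporary file and swap it in only on success. A minimal sketch in Python, assuming `ocrmypdf` is on the PATH; `ocr_in_place` and its `command` parameter are hypothetical.

```python
import os
import subprocess
import tempfile
from pathlib import Path
from typing import Sequence


def ocr_in_place(pdf: Path, command: Sequence[str] = ("ocrmypdf",)) -> None:
    """Run `command INPUT OUTPUT` and replace the original only on success."""
    # Create the temporary file next to the original so the final
    # os.replace stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(suffix=".pdf", dir=pdf.parent)
    os.close(fd)
    tmp = Path(tmp_name)
    try:
        subprocess.run([*command, str(pdf), str(tmp)], check=True)
        os.replace(tmp, pdf)
    finally:
        # If the OCR step failed, discard the temp file and keep the original.
        if tmp.exists():
            tmp.unlink()
```

Because a failed run raises before the swap, the original scan is never lost to a partial output.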

user127022

Stack Overflow has related questions under PDF parsing that cover tools such as PDFBox and Apache Tika, which uses PDFBox. The Ruby code below extracts text from a PDF. The scan needs good enough resolution for this type of code to work robustly, so get a scanner with a high enough resolution and then see whether some of the software works.

Examples

SO threads

[Edit]

I am not sure whether I have understood your problem now. Do you want to add an OCR layer to different kinds of material, such as random photos, screenshots, PDFs without an OCR layer and so on? I don't know the solution, but I am sure someone does, so I asked a specific question about how to do it with Automator and some OCR software:

hhh

For this type of self-directed application, I'm a big fan of Hazel.

It makes it extremely easy to script actions without needing to learn a more command-line-oriented tool like Perl or Python, and paired with the OCR engine of your choice (mine is currently PDFpenPro) you should have no problems getting your files processed with minimal fuss.

Both of these are paid software, but the utility of both extends far past this one case. In my situation, with the labor involved in digitizing my past scanned records (and ongoing paper), the price of these tools is far outweighed by the time I would have spent programming this myself, and now that I own both tools, I can do many other tasks with them.

bmike

PDFScannerApp does have unofficial scripting support. Contact the author for the Automator action.

kenorb
ndf

I use Adobe Acrobat to OCR in batch. My duplex scanner can OCR after scanning, but the OCR technology in Acrobat is more accurate in my opinion. I just point it at the folder of PDFs that have no OCR, and Acrobat re-saves each PDF as a searchable PDF that now includes a text layer. I don't know of a way to OCR via the command line, but I can automate the GUI end by using AutoHotkey. It is neither as reliable nor as fast as the command line, but it does the job once you set up a workflow action to minimize the GUI interaction.

On the Mac, AppleScript does what AutoHotkey does on the PC, although I haven't tried it on my Mac yet.

AutoHotkey comes with a recorder, so most of the script writing is done for you, with a little bit of editing for refinement and perhaps looping if you want that.

I've been experimenting with OCRing images but haven't fully automated the process through Acrobat yet. The command line is ideal, but I haven't found a quality OCR engine that exceeds Acrobat, so I stick with Acrobat for now.

Sun

I stumbled upon this recently: http://ocrkit.com/faq.html

You have to pay after 14 days though

Charlton

I got high quality Drag & Drop conversion working using Docker.

If you:

  1. install Docker for your Mac and
  2. then create a new Automator app
  3. with these contents inside a 'Run a Shell Script' action. Choose Pass Input: 'as arguments'

/bin/bash script text:

You should then be able to drag and drop PDFs onto it, and you'll get a similarly named PDF with '-ocr' appended to the file name.

I imagine it could easily be modified to return a file to Automator to copy somewhere as well. More details are available about the fine OCRmyPDF Docker package and the main tool (also mentioned in a different answer).

You can test it in Automator itself with 'Get specified Finder items' action as input to this.

The first time it runs, it may take more time, as it will need to download the Docker images for OCRmyPDF (invisibly). In Terminal, you can alternatively run docker pull jbarlow83/ocrmypdf to speed up the first run. A typical run takes about 10 seconds per high-DPI page, and the results can be read aloud by text-to-speech even if there are tables or diagrams. Before OCRing, I crop using Sejda so that nonsense margin words from other pages are removed.

The --force-ocr argument tells the tool to ignore and overwrite any earlier OCR attempts, which in my cases are usually only partial and useless.
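
The command that such a script runs for each dropped file can be sketched as follows. This is a hedged illustration in Python rather than the Automator script itself: `docker_ocr_command` is a hypothetical helper, and mounting the PDF's folder at `/data` inside the container is an assumption.

```python
from pathlib import Path
from typing import List, Tuple


def docker_ocr_command(pdf: Path,
                       image: str = "jbarlow83/ocrmypdf") -> Tuple[List[str], Path]:
    """Build a `docker run` argv that OCRs `pdf` into a '-ocr' sibling file."""
    pdf = pdf.resolve()
    out_name = f"{pdf.stem}-ocr{pdf.suffix}"
    cmd = [
        "docker", "run", "--rm",
        # Assumption: expose the PDF's folder to the container as /data.
        "-v", f"{pdf.parent}:/data",
        image,
        "--force-ocr",  # ignore and overwrite any earlier OCR layer
        f"/data/{pdf.name}",
        f"/data/{out_name}",
    ]
    return cmd, pdf.parent / out_name
```

Passing the returned list to `subprocess.run(cmd, check=True)` would perform the conversion, provided Docker is running.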

thadk

OCRKit has both AppleScript support and a CLI. From their help page:

AppleScript

You can also script OCRKit to integrate it into your specific workflow, for example to process incoming files arriving via a shared folder from an MFP copy machine, and simply tell OCRKit to open and thus process them via AppleScript:

Command line

Since OCRKit version 2.5, direct command-line scripting is supported. This greatly simplifies the use of OCRKit in batch processing, allows more options to be set, and is also more robust and cross-platform than AppleScript.

Since OCRKit version 16.9 additional command line options are supported:

-r, --recursive directory

Scan directory recursively for new files. Skips files from OCRKit, with text layer or vector graphics.

--pattern 'regex'

Pattern used to match filenames during recursive scans. Defaults to %.pdf$, recommendation for TIFF is %.tiff?$

--log file

Write log file information and statistics during recursive scan to file.

--password secret

Use secret password to decrypt PDF files during batch processing.

--test-run [ fast ]

Run batch processing in test mode only, to test PDF files or to obtain the page count for estimating total processing time. 'fast' checks only the first page of each file instead of going through all pages for image and vector analysis.

--tag name

Use the extended attribute name to tag the processing state of files during batch processing. macos:OCRKit (%s) will use native macOS Finder tags instead, or simply macos:OCRKit to omit the state attribute. The state attribute takes the values, in order: started, analyzed, processed; it can also be encrypted.

xilopaint
