What is OCR? How to extract text from a scanned PDF

A scanned PDF is just a photograph of a page. The letters you see are pixels — the PDF does not know they represent words. OCR changes that.

What does OCR stand for?

OCR stands for Optical Character Recognition. It is the technology that analyses an image and identifies characters, turning a picture of text into actual, selectable, searchable text.

OCR is what happens when you photograph a receipt and your banking app reads the amount automatically. It is also what allows Google to search inside scanned books, and what courts use to digitise decades of paper records.

When do you need OCR?

You scanned a contract or invoice and cannot select or copy any text
You received a PDF that opens as a series of images rather than text
You want to search inside a document but the search returns nothing
You need to extract data from a scanned form or table

How OCR works

Modern OCR engines (including the open-source Tesseract engine used by AmorePDF) process an image in several steps:

Pre-processing: the image is de-skewed (straightened) and its contrast is enhanced
Segmentation: the engine identifies regions of text, separating them from images and white space
Character recognition: the text in each region is run through trained models, which return the most likely character for each shape
Post-processing: the output is checked against language dictionaries to correct likely misreadings
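To make the first two steps concrete, here is a minimal sketch in plain Python. It is a toy illustration, not how Tesseract is implemented: the page is assumed to be a small grid of grayscale values, the fixed threshold of 128 is an arbitrary choice, and the function names `binarise` and `segment_lines` are made up for this example.

```python
from typing import List, Tuple

Image = List[List[int]]  # grayscale rows: 0 = black, 255 = white


def binarise(image: Image, threshold: int = 128) -> List[List[int]]:
    """Pre-processing: mark each pixel as ink (1) or background (0)."""
    return [[1 if pixel < threshold else 0 for pixel in row] for row in image]


def segment_lines(binary: List[List[int]]) -> List[Tuple[int, int]]:
    """Segmentation: return (first_row, last_row) for each band of rows that contains ink."""
    lines = []
    start = None
    for y, row in enumerate(binary):
        has_ink = any(row)
        if has_ink and start is None:
            start = y  # a new text line begins
        elif not has_ink and start is not None:
            lines.append((start, y - 1))  # a blank row ends the line
            start = None
    if start is not None:
        lines.append((start, len(binary) - 1))
    return lines


# A tiny 6x4 "page" with two lines of "text" separated by a blank row.
page = [
    [255, 255, 255, 255],
    [250,  20,  30, 255],
    [255,  10, 240, 255],
    [255, 255, 255, 255],
    [255,  40,  40,  35],
    [255, 255, 255, 255],
]
text_lines = segment_lines(binarise(page))
print(text_lines)  # [(1, 2), (4, 4)]
```

Real engines use adaptive thresholds and full layout analysis rather than a single cut-off and a row scan, but the idea is the same: clean the pixels first, then find where the text is.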

The result is machine-readable text. It can be exported on its own, or embedded in the PDF as a text layer that sits over, or replaces, the original image.
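The dictionary check in the post-processing step can be sketched with Python's standard `difflib` module. This is only an illustration of the idea: the four-word vocabulary, the 0.75 similarity cut-off, and the helper name `correct_words` are assumptions made for the example, and real engines use much larger language models.

```python
import difflib
from typing import Iterable, List


def correct_words(words: Iterable[str], vocabulary: List[str], cutoff: float = 0.75) -> List[str]:
    """Replace each unknown word with its closest vocabulary entry, if one is similar enough."""
    corrected = []
    for word in words:
        if word.lower() in vocabulary:
            corrected.append(word)  # already a known word
            continue
        matches = difflib.get_close_matches(word.lower(), vocabulary, n=1, cutoff=cutoff)
        # Keep the original when nothing is close (numbers, names, codes).
        corrected.append(matches[0] if matches else word)
    return corrected


vocabulary = ["invoice", "total", "amount", "due"]
raw_output = ["lnvoice", "tota1", "amount", "2024"]  # typical l/I and 1/l confusions
fixed = correct_words(raw_output, vocabulary)
print(fixed)  # ['invoice', 'total', 'amount', '2024']
```

Note that the number `2024` is left untouched because nothing in the vocabulary resembles it. That is the behaviour you want: a dictionary should repair likely misreadings, not rewrite everything it does not recognise.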

How to extract text from a scanned PDF

Our OCR PDF tool runs the entire recognition process in your browser using WebAssembly. No file is ever uploaded to a server.

Step 1 — Open the tool

Go to OCR PDF.

Step 2 — Load your scanned PDF

Drop the file onto the upload area. The tool loads a preview of the pages.

Step 3 — Select the language

Choose the language of the document. Accuracy improves significantly when the correct language model is selected. Common options include English, French, German, Italian, Spanish, Portuguese, and many more.
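For reference, Tesseract identifies its language models by three-letter codes, and several can be combined with a plus sign when a document mixes languages. The sketch below maps the languages listed above to those codes. The helper name `language_argument` is hypothetical, and whether the tool uses these codes in exactly this way internally is an assumption.

```python
from typing import Iterable

# Tesseract language-model codes for the languages mentioned above.
TESSERACT_LANGUAGE_CODES = {
    "English": "eng",
    "French": "fra",
    "German": "deu",
    "Italian": "ita",
    "Spanish": "spa",
    "Portuguese": "por",
}


def language_argument(names: Iterable[str]) -> str:
    """Build a Tesseract-style language string such as 'eng+fra'."""
    codes = []
    for name in names:
        if name not in TESSERACT_LANGUAGE_CODES:
            raise ValueError(f"No language code known for {name!r}")
        codes.append(TESSERACT_LANGUAGE_CODES[name])
    return "+".join(codes)


print(language_argument(["English"]))            # eng
print(language_argument(["English", "French"]))  # eng+fra
```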

Step 4 — Run OCR and download

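Click Extract text. Processing usually takes a few seconds per page; because the work happens in your browser, the exact time depends on your device. When it finishes, download the result as a plain text file (.txt).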

OCR accuracy: what to expect

Accuracy depends on scan quality:

| Scan quality | Expected accuracy |
|---|---|
| Clean, high-contrast (300 dpi+) | 98–99% |
| Good quality (200 dpi) | 93–97% |
| Low contrast or skewed | 80–92% |
| Handwriting | 50–85% (varies widely) |

Printed text in good condition is recognised with very high accuracy. Handwriting is harder and results vary by style.
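If you want to check accuracy on your own scans, one common approach is to compare the OCR output with a hand-typed reference and count the character edits needed to turn one into the other. The sketch below does that with a standard edit-distance calculation. It assumes accuracy is measured per character, which may not be how the figures above were obtained, and the function names are chosen for this example.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep when equal
            ))
        previous = current
    return previous[-1]


def character_accuracy(reference: str, recognised: str) -> float:
    """Share of reference characters recognised correctly, between 0.0 and 1.0."""
    if not reference:
        return 1.0 if not recognised else 0.0
    errors = edit_distance(reference, recognised)
    return max(0.0, 1 - errors / len(reference))


reference = "Total due: 100.00"
recognised = "Tota1 due: 1O0.00"  # two misread characters: l -> 1 and 0 -> O
accuracy = character_accuracy(reference, recognised)
print(f"{accuracy:.1%}")  # 88.2%
```

Two wrong characters in a 17-character line already pull accuracy down to about 88%, which is why small gains in scan quality matter so much on short fields such as amounts and dates.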

OCR vs. extracting existing text

If your PDF already contains a text layer — it was created digitally, not scanned — you do not need OCR. Use Extract Text instead, which copies the existing text directly without any image processing. It is faster and produces a perfectly accurate result.

Not sure which one you need? Open the PDF in a browser and try to select some text. If you can select and copy it, use Extract Text. If the cursor does not grab any text, the document is a scan and needs OCR.
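The same check can be automated when you have many files. The sketch below assumes you have already pulled the text of each page with a PDF library of your choice (that step is not shown), and it simply looks at whether the pages came back empty. The function name `choose_tool` and the `min_chars` threshold are hypothetical.

```python
from typing import Sequence


def choose_tool(page_texts: Sequence[str], min_chars: int = 1) -> str:
    """Suggest 'Extract Text' when every page has a text layer, otherwise 'OCR PDF'.

    page_texts holds the text extracted from each page by a PDF library;
    a scanned page typically yields an empty string.
    """
    if not page_texts:
        return "OCR PDF"  # nothing extractable at all
    for text in page_texts:
        visible = "".join(text.split())  # ignore whitespace-only results
        if len(visible) < min_chars:
            return "OCR PDF"  # at least one page is image-only
    return "Extract Text"


print(choose_tool(["Invoice 42", "Total due: 100.00"]))  # Extract Text
print(choose_tool(["Invoice 42", "   "]))                # OCR PDF
```

Treating a document as a scan as soon as one page is empty is a deliberately cautious choice: mixed files, where some pages are digital and others scanned, are common, and running OCR on them loses nothing.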

Other OCR tools

OCR Images — run OCR on image files (PNG, JPG) rather than PDFs
Extract Scanned Text — optimised for multi-page scanned documents

Conclusion

OCR is a powerful but often overlooked tool. It turns a static image into a living document you can search, copy, and edit. AmorePDF runs OCR entirely in your browser — privately, for free, with no limits.

→ Extract text from a scanned PDF now