UtilitySmith

How to Make a Scanned PDF Searchable

You need to find the termination clause in a scanned contract. You press Ctrl+F, type a word, and get nothing. The PDF is technically a document — you can scroll it, print it, zoom in — but as far as your computer is concerned, the pages are just photographs. There's no text, only pixels.

This is the scanned PDF problem. OCR — Optical Character Recognition — is the fix.

Why Ctrl+F doesn't work on scanned PDFs

A PDF is a container format. It can hold text, vector graphics, fonts, images, form fields, and metadata — in almost any combination.

A PDF created from a word processor or design tool contains actual text data. Select text in it and you're interacting with real characters stored in the file. Ctrl+F searches those characters.

A scanned PDF is different. The scanner captures a photograph of each page. The PDF contains those photographs and nothing else. There's no text layer — no characters, no words, nothing for search to find. The text you can see on screen is visual information in an image, not text data in a file.

This is why almost everything about a scanned PDF is frustrating: you can't search it, select text from it, copy a quote, or have a screen reader read it aloud. It's a picture of a document pretending to be a document.

OCR fixes this by reading the images and reconstructing the text, then adding that text back to the PDF as a searchable, selectable layer.

What OCR actually does

OCR software analyses each page image, identifies characters, assembles them into words and lines, and records the position of each word on the page. It then writes that text into the PDF behind the image — invisible, but selectable.
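As a rough illustration of what that text layer records, the sketch below models each recognised word together with its position on the page, then searches the words the way Ctrl+F would. The `Word` fields and the `find_word` helper are hypothetical names chosen for this example; they are not the output format of any particular OCR engine.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    x: float           # left edge on the page
    y: float           # top edge on the page
    width: float
    height: float
    confidence: float  # 0-100, as an engine might report it


def find_word(words, query):
    """Return every recognised word containing the query, ignoring case."""
    needle = query.lower()
    return [word for word in words if needle in word.text.lower()]


# One line of a hypothetical contract page after OCR.
page = [
    Word("Termination", 72, 140, 96, 14, 96.0),
    Word("of", 172, 140, 14, 14, 98.0),
    Word("Agreement", 190, 140, 84, 14, 95.0),
]

hits = find_word(page, "termination")
for hit in hits:
    print(f"found {hit.text!r} at x={hit.x}, y={hit.y}")
```

Because every hit carries coordinates, a PDF viewer can highlight the match on top of the page image even though the text itself is never drawn.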

The result is a searchable PDF: visually identical to the original scan, but with text that your computer can read. Press Ctrl+F and it finds words. Click to select and you can copy a sentence. A screen reader can read the whole document.

The key word is reconstructs. OCR is not retrieving text that was originally there; it's inferring it from pixel patterns, so every result carries some uncertainty. A clean, high-contrast scan of a typeset document in a standard font will typically come back at 90–95% accuracy. A faxed photocopy of a photocopy might come back at 50%, with half the words wrong.

This is why a confidence score on each page is more useful than a single "it worked" signal.
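One way to picture a per-page score: average the confidence the engine reports for each word, then flag the pages that fall below a cut-off. This is a minimal sketch under assumptions. The averaging rule, the 80-point `REVIEW_THRESHOLD`, and the function names are all illustrative, and real engines may compute their page score differently.

```python
from statistics import mean

REVIEW_THRESHOLD = 80.0  # hypothetical cut-off, not a standard value


def page_confidence(word_confidences):
    """Average the per-word confidences (0-100) into one page score."""
    return mean(word_confidences) if word_confidences else 0.0


def pages_to_review(pages, threshold=REVIEW_THRESHOLD):
    """Return the 1-based numbers of pages scoring below the threshold."""
    return [
        number
        for number, confidences in enumerate(pages, start=1)
        if page_confidence(confidences) < threshold
    ]


# Made-up word confidences for a three-page document.
document = [
    [96.0, 94.0, 98.0],  # clean page
    [52.0, 61.0, 47.0],  # degraded page
    [91.0, 89.0, 93.0],
]

flagged = pages_to_review(document)
print(flagged)  # [2]
```

A single pass/fail result for the whole file would hide the fact that only page 2 needs a second look.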

The accuracy spectrum

OCR accuracy depends almost entirely on input quality. Three factors dominate:

Resolution. Text scanned at 300 DPI (dots per inch) gives OCR engines enough pixel detail to distinguish similar characters — l, 1, I; O and 0; rn and m. Below 150 DPI, those distinctions collapse and error rates spike.

Contrast. Black text on white paper is ideal. Coloured paper, aged yellowing, water damage, show-through from the other side of the page — all reduce contrast and increase errors.

Font regularity. Standard serif and sans-serif fonts at 10pt and above are what OCR engines are trained on. Unusual typefaces, very small text, all-caps blocks, and text that's been warped or stretched cause problems.
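The resolution point above comes down to simple arithmetic: a typographic point is 1/72 inch, so the number of pixels available for a line of text scales directly with DPI. The sketch below works this out for 10pt text. `font_height_px` is a name made up for this example, and the figure is the font's nominal size; individual lowercase letters are shorter than that.

```python
POINTS_PER_INCH = 72  # one typographic point is 1/72 inch


def font_height_px(font_size_pt, dpi):
    """Pixel height of the font's nominal size at a given scan resolution."""
    return font_size_pt / POINTS_PER_INCH * dpi


for dpi in (150, 300, 600):
    print(f"{dpi} DPI: 10pt text is about {font_height_px(10, dpi):.0f} px tall")
```

Halving the resolution from 300 to 150 DPI halves the pixel height of every character, which leaves far less detail to separate an l from a 1.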

What typically works well: Modern office documents, printed contracts, utility bills, typed letters, invoices, academic papers, and forms with printed labels.

What doesn't work: Handwriting (which requires completely different machine learning models), very old documents with irregular typefaces, heavily degraded faxes, stylised fonts used for headings or logos, and text overlaid on images.

The honest approach is to run OCR, see the confidence score, and decide. A 92% result is generally reliable. A 64% result suggests that roughly one word in three contains an error: usable for rough reference, not for quoting in a contract negotiation.
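To make those percentages concrete, the sketch below converts a score into "one wrong word in every N". It assumes the score can be read as word-level accuracy, which is a simplification: a confidence score is the engine's own estimate, not a measured error rate. `words_per_error` is a hypothetical helper name.

```python
import math


def words_per_error(score_percent):
    """If the score is read as word accuracy, how many words per wrong word?"""
    error_rate = 1 - score_percent / 100
    return math.inf if error_rate <= 0 else 1 / error_rate


print(round(words_per_error(92), 1))  # 12.5
print(round(words_per_error(64), 1))  # 2.8
```

At 92% that is about one questionable word in every twelve or thirteen; at 64% it is closer to one in three.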

Which output format to choose

OCR produces text. What you do with that text depends on why you're OCRing the document.

Searchable PDF is usually the right answer for archiving and sharing. The original page images are preserved exactly — the document looks unchanged. An invisible text layer is added behind each page at the word positions OCR identified. You get search, selection, and copy-paste without altering the appearance. This is the format to use for signed contracts, scanned invoices, and any document where the original layout matters.

Plain text gives you just the words, without any layout structure. Paragraphs are preserved where OCR can identify them; tables collapse into a sequence of values. Use this when you need to paste content into another document, run it through a search tool, or process it programmatically.

Word document (.docx) produces editable paragraphs. Text is there and you can change it, but the original formatting — fonts, columns, tables, headers — isn't reconstructed. Use this when you need to edit the content rather than just read it.

The searchable PDF is the gold standard. It's the only format that preserves the original document faithfully while making it functional.

When OCR isn't what you need

Before running OCR, it's worth checking whether your PDF already has a text layer. A lot of people OCR PDFs that don't need it.

A PDF created by printing to PDF from Word or Google Docs already has selectable text. So do PDFs exported from InDesign, PDFs generated by accounting software, and most PDFs you download from websites.

The test: try selecting text on any page. If it works, you don't need OCR — just copy from the existing text layer.
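The same check can be approximated in code. Text in a PDF has to reference a font, so a file with no font entries at all is very likely image-only. The sketch below is a crude heuristic using only the standard library, not a PDF parser: it scans the raw bytes, plus any streams it can inflate, for a `/Font` name. The function name is hypothetical, and the two inputs are toy byte strings rather than complete PDF files.

```python
import re
import zlib

# Matches the raw bytes between "stream" and "endstream" in a PDF file.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)


def probably_has_text_layer(pdf_bytes):
    """Heuristic: text in a PDF needs a font, so look for a /Font entry."""
    chunks = [pdf_bytes]
    for match in STREAM_RE.finditer(pdf_bytes):
        try:
            # Newer PDFs may keep dictionaries in Flate-compressed streams.
            chunks.append(zlib.decompressobj().decompress(match.group(1)))
        except zlib.error:
            pass  # Not Flate data (for example a JPEG page image); skip it.
    return any(b"/Font" in chunk for chunk in chunks)


# Toy fragment with a compressed font dictionary.
searchable = (
    b"%PDF-1.7\n1 0 obj\n<< /Filter /FlateDecode >>\nstream\n"
    + zlib.compress(b"<< /Type /Font /BaseFont /Helvetica >>")
    + b"\nendstream\nendobj\n"
)

# Toy fragment holding only a JPEG-style image stream.
scan_only = (
    b"%PDF-1.7\n1 0 obj\n<< /Subtype /Image /Filter /DCTDecode >>\nstream\n"
    + b"\xff\xd8\xff\xe0 image bytes"
    + b"\nendstream\nendobj\n"
)

print(probably_has_text_layer(searchable))  # True
print(probably_has_text_layer(scan_only))   # False
```

A heuristic like this can be fooled, for instance by a scan that embeds an unused font, so a proper PDF library is the better choice for anything beyond a quick triage.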

If you drop the file into UtilitySmith's OCR tool, it checks automatically. If the PDF already has a text layer, it extracts the text directly — no OCR run, instant result, perfect accuracy.

The upload problem with online OCR tools

Search for "OCR PDF online" and you'll find dozens of tools. Most of them upload your file to a server, process it there, and send back the result.

This is worth pausing on. Scanned documents are frequently sensitive: contracts, payslips, medical records, bank statements, identity documents. Uploading them to an unknown server — whose data policies you haven't read, whose security you can't audit — is a significant privacy concession.

Browser-based OCR has become viable because WebAssembly allows a full OCR engine to run directly in your browser tab. The file never leaves your computer. The engine downloads once (around 10MB) and runs locally on every subsequent document. There's no upload, no server, no third party involved.

UtilitySmith's OCR tool was built this way because the document categories most often OCR'd — contracts, tax documents, medical records — are precisely the ones that shouldn't be uploaded to a random server. The trade-off is that it's slower than server-side processing: 3–4 seconds per page rather than sub-second. For most documents, that's acceptable. For a 20-page scan it's around a minute.
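The timing is easy to estimate from the per-page figure. The sketch below simply multiplies page count by the 3–4 second range quoted above; `estimate_seconds` is a hypothetical helper, and actual speed will vary with the device and the scan.

```python
def estimate_seconds(page_count, seconds_per_page=(3, 4)):
    """Return (fastest, slowest) total processing time in seconds."""
    fastest, slowest = seconds_per_page
    return page_count * fastest, page_count * slowest


low, high = estimate_seconds(20)
print(f"20 pages: {low}-{high} seconds")  # 20 pages: 60-80 seconds
```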

Practical tips for better results

Check the scan quality first. Open the PDF and zoom in to 200%. If individual letters look clear, OCR will do well. If they're blurry or broken, you'll get errors. Re-scanning at a higher resolution is sometimes faster than manually correcting OCR output.

300 DPI is the baseline. If you control the scanner, 300 DPI is the minimum for reliable OCR. 400–600 DPI produces better results on small or unusual text. Higher than 600 rarely improves accuracy and significantly increases file size.
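The file-size cost follows from the pixel count, which grows with the square of the resolution. The sketch below computes the pixel dimensions of an A4 page at two settings. `page_pixels` is a name chosen for this example, and the final file size also depends on colour depth and compression, which it ignores.

```python
MM_PER_INCH = 25.4


def page_pixels(dpi, width_mm=210, height_mm=297):
    """Pixel dimensions of a scanned page (defaults to A4) at a given DPI."""
    width_px = round(width_mm / MM_PER_INCH * dpi)
    height_px = round(height_mm / MM_PER_INCH * dpi)
    return width_px, height_px


for dpi in (300, 600):
    width, height = page_pixels(dpi)
    megapixels = width * height / 1e6
    print(f"{dpi} DPI: {width} x {height} = {megapixels:.1f} megapixels")
```

Doubling the DPI roughly quadruples the number of pixels per page, from about 8.7 to about 34.8 megapixels here.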

Straighten the page. Pages scanned even a few degrees off-axis produce worse results. Most OCR tools include rotation correction, but a well-aligned scan saves processing time and reduces errors.

Scanned in landscape by mistake? Most OCR tools detect orientation automatically. If yours doesn't, rotate before OCRing.

Don't expect tables to come out as tables. OCR reconstructs characters, not layout structure. A table in the image becomes a sequence of values in the output. In plain text and DOCX output, columns merge confusingly. In searchable PDF output, the text at least appears in roughly the right positions. Proper table reconstruction from images requires different technology and isn't available in browser-based tools yet.
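A small sketch shows why. If the only thing known about each word is its text and position, the simplest plain-text export sorts the words into reading order and joins them, and the row and column boundaries vanish. The word tuples and the `to_plain_text` helper are illustrative assumptions, not the export logic of any specific tool.

```python
# Each OCR'd word as (text, x, y); together they form a two-row,
# three-column table on the page.
words = [
    ("Item", 50, 100), ("Qty", 200, 100), ("Price", 300, 100),
    ("Widget", 50, 120), ("2", 200, 120), ("9.50", 300, 120),
]


def to_plain_text(words):
    """Join words in reading order: top to bottom, then left to right."""
    ordered = sorted(words, key=lambda word: (word[2], word[1]))
    return " ".join(text for text, _x, _y in ordered)


flat = to_plain_text(words)
print(flat)  # Item Qty Price Widget 2 9.50
```

The searchable PDF avoids this only because it keeps the coordinates and places each word back where it was found; it still has no notion of cells.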

One page at a time for important documents. If you're OCRing a contract that you'll be quoting from, OCR a page, read the confidence score, skim the output for obvious errors, and note any questionable sections before relying on the text.

The short version

  • Scanned PDFs are images — they have no text data, which is why Ctrl+F doesn't work
  • OCR reads the images and adds an invisible text layer so the document becomes searchable
  • Accuracy depends on scan quality: 300 DPI, high contrast, standard fonts work well; handwriting and poor-quality scans don't
  • Choose searchable PDF to preserve the original layout; plain text or DOCX if you need to edit
  • Check first whether your PDF already has a text layer — if it does, you don't need OCR
  • Avoid tools that upload your documents to a server; browser-based OCR processes everything locally

Frequently asked

How do I know if a PDF is scanned or not? Try selecting text on the page. If you can highlight individual words, it has a text layer and doesn't need OCR. If you can only select the entire page as an image block, it's a scan.

Can I OCR a photo taken on my phone? Yes. A phone photo of a document works well if taken in good lighting, from directly above, with the text filling most of the frame. iPhone HEIC files are supported as well as JPG and PNG.

Is 90% accuracy good enough? Depends on the use case. For search — finding where a word appears — 90% accuracy is usually fine. For quoting or relying on specific numbers (amounts, dates, reference numbers), proofread the OCR output against the original before trusting it.

Will the searchable PDF look different from the original? No. The page images are preserved exactly as they are. The text layer is invisible at zero opacity. Open the output in any PDF viewer and it's visually identical to the original scan.

Why is OCR slow? OCR is image analysis. The engine examines every page image in detail to identify characters. In the browser, each page typically takes 3–5 seconds, so a 20-page document takes a minute or more. Server-side tools can be faster, but the recognition work is computationally heavy whichever tool you use.

Can I OCR a multi-page PDF? Yes. The tool processes each page in sequence, showing a confidence score for each. The full output combines all pages with the text layer added throughout.

What happens with tables and columns? OCR reconstructs text in reading order but doesn't reconstruct table structure. In a searchable PDF, the text appears at roughly the right positions. In plain text or DOCX output, columns are usually interleaved confusingly. Proper table extraction requires separate technology.

Does OCR work on password-protected PDFs? No. A PDF that needs a password to open can't be read without it. Remove the password first, then run OCR.

Can I make handwritten notes searchable? Not reliably. Handwriting recognition requires different machine learning models trained specifically on handwritten text, and these are far larger and more complex than what runs in a browser. This tool handles printed text only.

Does OCR read right-to-left languages? This tool supports English only in its current version. Multi-language support, including right-to-left scripts such as Arabic and Hebrew, is planned for a future update.