Author Topic: Where to start with PDFs and OCR?  (Read 1684 times)

landline

  • Newbie
  • Posts: 5
Where to start with PDFs and OCR?
« on: May 29, 2020, 10:15:27 pm »
So I've got a business problem I'm trying to solve.  It starts with pulling in transaction data that vendors provide in PDF form, but unfortunately the PDFs in question contain only scanned images rather than text.  I'll need to run OCR on the documents to pull out the text data before I do anything with it.  Luckily the PDFs are mostly standardized.

Unless there's a piece of code floating around that will do everything I want, it looks like I'll need to:

  • Extract the images out of the PDF file, probably one page/image at a time
  • Process that extracted image to pull the text data out, likely with something like TTesseractOCR4
  • Repeat until all pages have been processed
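A minimal sketch of those three steps, under some loud assumptions: it is written in Python rather than Pascal, it shells out to the Poppler `pdftoppm` and Tesseract `tesseract` command-line tools (both assumed to be installed and on the PATH), and it renders each page to a PNG instead of extracting the embedded image. The function names are made up for illustration.

```python
import subprocess
import tempfile
from pathlib import Path


def render_cmd(pdf_path, out_prefix, dpi=300):
    """Build the pdftoppm command that renders every page to a PNG."""
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf_path), str(out_prefix)]


def ocr_cmd(image_path, lang="eng"):
    """Build the tesseract command; the output base 'stdout' prints the text."""
    return ["tesseract", str(image_path), "stdout", "-l", lang]


def page_number(image_path):
    """pdftoppm names pages <prefix>-<n>.png; pull out n for sorting."""
    return int(Path(image_path).stem.rsplit("-", 1)[1])


def ocr_pdf(pdf_path, dpi=300, lang="eng"):
    """Return one string of recognised text per page, in page order."""
    with tempfile.TemporaryDirectory() as tmp:
        prefix = Path(tmp) / "page"
        # Step 1: one image per page
        subprocess.run(render_cmd(pdf_path, prefix, dpi), check=True)
        images = sorted(Path(tmp).glob("page-*.png"), key=page_number)
        # Steps 2 and 3: OCR each page image in turn
        return [
            subprocess.run(ocr_cmd(img, lang), check=True,
                           capture_output=True, text=True).stdout
            for img in images
        ]
```

Calling `ocr_pdf("statement.pdf")` (a hypothetical file name) would give a list with one text blob per page. Around 300 dpi is a common starting point for OCR, but the right value depends on the scans.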

So, questions:

1.  How would you approach opening a PDF file and extracting its contents, assuming it's just full of scanned images?
2.  Is there a better tool than TTesseractOCR4 to turn those scans into text?
3.  In the perfect world, there would be a tool that could do all of this in one step - point it at a PDF file and have it kick out a text file, or filled TMemo, or a linked list of strings, or something.  Is there?

Thanks folks. 

Zittergie

  • Full Member
  • ***
  • Posts: 114
    • XiX Music Player
Re: Where to start with PDFs and OCR?
« Reply #1 on: May 30, 2020, 07:00:49 pm »
Hi,

I am working on a program that takes invoice PDFs and converts them to UBL-XML data.
I use MuPDF https://www.zittergie.be/software/mupdflib-for-pascal/ for reading and showing the PDF and use Tesseract for OCR.

It works very well.
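For anyone who wants to try the MuPDF route before wiring up the library, here is a rough sketch that uses MuPDF's `mutool draw` command-line tool to turn pages into PNGs ready for Tesseract. This is not the code from the program described above: it is Python, it assumes `mutool` is on the PATH, and the helper names and the default output pattern are invented for the example.

```python
import subprocess


def mutool_draw_cmd(pdf_path, out_pattern="page-%d.png", dpi=300, pages=None):
    """Build a 'mutool draw' command line.

    -r sets the resolution in dpi, -o sets the output name, and %d in the
    output name is replaced by the page number. 'pages' is an optional
    page list such as "1-3,5".
    """
    cmd = ["mutool", "draw", "-r", str(dpi), "-o", out_pattern, str(pdf_path)]
    if pages:
        cmd.append(pages)
    return cmd


def render_with_mupdf(pdf_path, out_pattern="page-%d.png", dpi=300, pages=None):
    """Render the requested pages to image files; raises if mutool fails."""
    subprocess.run(mutool_draw_cmd(pdf_path, out_pattern, dpi, pages), check=True)
```

Each resulting PNG can then be handed to Tesseract one page at a time.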

I will post some code next week.

Maybe we can make something universal out of it into a component, or units.

Greetz,
Bart

Be the difference that makes a difference

MarkMLl

  • Hero Member
  • *****
  • Posts: 6676
Re: Where to start with PDFs and OCR?
« Reply #2 on: May 30, 2020, 07:43:33 pm »
Before anything else, if I were you I'd take a very careful look at the toolchain that archive.org has put together for handling PDFs. I'm not sure whether they've got OCR in there, and a lot might depend on what OS you expect to run... which you've not bothered to tell us.

MarkMLl
MT+86 & Turbo Pascal v1 on CCP/M-86, multitasking with LAN & graphics in 128Kb.
Pet hate: people who boast about the size and sophistication of their computer.
GitHub repositories: https://github.com/MarkMLl?tab=repositories

landline

  • Newbie
  • Posts: 5
Re: Where to start with PDFs and OCR?
« Reply #3 on: May 30, 2020, 09:05:21 pm »
Thanks, folks.  2 places to look that I hadn't seen before.

And this will be running on Windows.  :)

lucamar

  • Hero Member
  • *****
  • Posts: 4219
Re: Where to start with PDFs and OCR?
« Reply #4 on: May 30, 2020, 09:19:38 pm »
[...] I'm not sure whether they've got OCR in there [...]

Yes, they have; it's easy to see the standard OCR errors in their plain text files. ;)

Other people with a similar toolchain are bitsavers.org, though their OCRing is (still) a little "experimental".
« Last Edit: May 30, 2020, 09:22:36 pm by lucamar »
Turbo Pascal 3 CP/M - Amstrad PCW 8256 (512 KB !!!) :P
Lazarus/FPC 2.0.8/3.0.4 & 2.0.12/3.2.0 - 32/64 bits on:
(K|L|X)Ubuntu 12..18, Windows XP, 7, 10 and various DOSes.

MarkMLl

  • Hero Member
  • *****
  • Posts: 6676
Re: Where to start with PDFs and OCR?
« Reply #5 on: May 30, 2020, 10:19:12 pm »
GRIN Actually, I might have been thinking of Bitsavers rather than archive.org, but I do tend to lump them together. The important thing is that various other people have looked at this, although it's obviously far easier if the PDF has been generated (e.g. it is the output of a printing stage) rather than scanned... with printed-signed-scanned files being somewhere in the middle since at least they've not come from a dog-eared manual.
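One cheap way to tell the generated case from the scanned one is to ask for the text layer first and only fall back to OCR when almost nothing comes back. The sketch below is an assumption-laden illustration in Python: it relies on Poppler's `pdftotext` being on the PATH, and the 20-character threshold and function names are arbitrary choices, not anything established.

```python
import subprocess


def has_text_layer(extracted, min_chars=20):
    """Heuristic: a generated PDF yields real text, a pure scan almost none.

    Whitespace is ignored, including the form feeds pdftotext emits
    between pages.
    """
    return sum(not ch.isspace() for ch in extracted) >= min_chars


def extract_text_layer(pdf_path):
    """Return whatever text pdftotext finds; '-' sends output to stdout."""
    result = subprocess.run(
        ["pdftotext", "-layout", str(pdf_path), "-"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout
```

If `has_text_layer(extract_text_layer(path))` is true, the text can be used directly; otherwise the file is probably a scan and needs the OCR step.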

MarkMLl

 
