
Author Topic: Where to start with PDFs and OCR?  (Read 445 times)

landline

  • New member
  • *
  • Posts: 5
Where to start with PDFs and OCR?
« on: May 29, 2020, 10:15:27 pm »
So I've got a business problem I'm trying to solve.  It starts with pulling in transaction data that vendors provide in PDF form, but unfortunately the PDFs in question contain only scanned images rather than text.  I'll need to run OCR on the documents to pull out the text data before I can do anything with it.  Luckily the PDFs are mostly standardized.

Unless there's a piece of code floating around that will do everything I want, it looks like I'll need to:

  • Extract the images out of the PDF file, probably one page/image at a time
  • Process that extracted image to pull the text data out, likely with something like TTesseractOCR4
  • Repeat until all pages have been processed
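
The three steps above can be sketched roughly as follows. This is only a minimal illustration in Python, assuming the Poppler `pdftoppm` tool and the `tesseract` command-line program are both installed and on the PATH; neither tool is named in the post, and the function names are made up for the example. It renders each page to a PNG instead of extracting the embedded scan, which is a simplification (Poppler's `pdfimages` would be the tool for the latter).

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def render_command(pdf_path, out_prefix, dpi=300):
    """Command line for Poppler's pdftoppm: one PNG per page."""
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf_path), str(out_prefix)]


def ocr_command(image_path, lang="eng"):
    """Command line for the tesseract CLI, printing recognized text to stdout."""
    return ["tesseract", str(image_path), "stdout", "-l", lang]


def page_number(image_path):
    """pdftoppm names files like 'page-07.png'; sort on the numeric suffix."""
    return int(Path(image_path).stem.rsplit("-", 1)[1])


def pdf_to_text(pdf_path, dpi=300, lang="eng"):
    """Render every page, OCR each one in order, return a list of page texts."""
    pages = []
    with tempfile.TemporaryDirectory() as tmp:
        prefix = Path(tmp) / "page"
        # Step 1: turn the PDF into one image per page.
        subprocess.run(render_command(pdf_path, prefix, dpi), check=True)
        # Steps 2 and 3: OCR each page image until none are left.
        for image in sorted(Path(tmp).glob("page-*.png"), key=page_number):
            result = subprocess.run(ocr_command(image, lang), check=True,
                                    capture_output=True, encoding="utf-8")
            pages.append(result.stdout)
    return pages


if __name__ == "__main__" and len(sys.argv) > 1:
    for number, text in enumerate(pdf_to_text(sys.argv[1]), start=1):
        print(f"--- page {number} ---")
        print(text)
```

Run as `python pdf_ocr.py scan.pdf` (the script name is arbitrary). 300 dpi is a common starting point for OCR of scanned pages, but the best value depends on the scans.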

So, questions:

1.  How would you approach opening a PDF file and extracting its contents, assuming it's just full of scanned images?
2.  Is there a better tool than TTesseractOCR4 to turn those scans into text?
3.  In the perfect world, there would be a tool that could do all of this in one step - point it at a PDF file and have it kick out a text file, or filled TMemo, or a linked list of strings, or something.  Is there?

Thanks folks. 

Zittergie

  • Full Member
  • ***
  • Posts: 112
    • XiX Music Player
Re: Where to start with PDFs and OCR?
« Reply #1 on: May 30, 2020, 07:00:49 pm »
Hi,

I am working on a program that takes invoice PDFs and converts them to UBL-XML data.
I use MuPDF (https://www.zittergie.be/software/mupdflib-for-pascal/) for reading and displaying the PDF, and Tesseract for OCR.

It works very well.

I will post some code next week.
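
This is not the code promised above, only a rough illustration of the same MuPDF plus Tesseract combination. It is a sketch in Python that assumes MuPDF's command-line tool `mutool` and the `tesseract` program are installed and on the PATH, not the Pascal binding linked in this post; the function names and the `page-N.png` naming are invented for the example.

```python
import subprocess
from pathlib import Path


def mutool_render_command(pdf_path, out_dir, dpi=300):
    """mutool draw writes one PNG per page; %d is replaced by the page number."""
    pattern = Path(out_dir) / "page-%d.png"
    return ["mutool", "draw", "-r", str(dpi), "-o", str(pattern), str(pdf_path)]


def tesseract_to_file_command(image_path):
    """tesseract <image> <outputbase> writes <outputbase>.txt beside the image."""
    image_path = Path(image_path)
    return ["tesseract", str(image_path), str(image_path.with_suffix(""))]


def ocr_invoice(pdf_path, out_dir):
    """Render with MuPDF, OCR with Tesseract, return the .txt paths in page order."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(mutool_render_command(pdf_path, out_dir), check=True)
    images = sorted(out_dir.glob("page-*.png"),
                    key=lambda p: int(p.stem.split("-")[1]))
    for image in images:
        subprocess.run(tesseract_to_file_command(image), check=True)
    return [image.with_suffix(".txt") for image in images]
```

Keeping one text file per page makes it easy to check the OCR output against the page image before any invoice fields are parsed.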

Maybe we can turn it into something universal, as a component or a set of units.

Greetz,
Bart

Be the difference that makes a difference

MarkMLl

  • Hero Member
  • *****
  • Posts: 925
Re: Where to start with PDFs and OCR?
« Reply #2 on: May 30, 2020, 07:43:33 pm »
Before anything else, if I were you I'd take a very careful look at the toolchain that archive.org has put together for handling PDFs. I'm not sure whether they've got OCR in there, and a lot might depend on what OS you expect to run... which you've not bothered to tell us.

MarkMLl
Turbo Pascal v1 on CCP/M-86, multitasking with LAN and graphics in 128Kb.
Pet hate: people who boast about the size and sophistication of their computer.

landline

  • New member
  • *
  • Posts: 5
Re: Where to start with PDFs and OCR?
« Reply #3 on: May 30, 2020, 09:05:21 pm »
Thanks, folks.  2 places to look that I hadn't seen before.

And this will be running on Windows.  :)

lucamar

  • Hero Member
  • *****
  • Posts: 2920
Re: Where to start with PDFs and OCR?
« Reply #4 on: May 30, 2020, 09:19:38 pm »
[...] I'm not sure whether they've got OCR in there [...]

Yes, they have; it's easy to see the standard OCR errors in their plain text files. ;)

Other people with a similar toolchain are bitsavers.org, though their OCRing is (still) a little "experimental".
« Last Edit: May 30, 2020, 09:22:36 pm by lucamar »
Turbo Pascal 3 CP/M - Amstrad PCW 8256 (512 KB !!!) :P
Lazarus 2.0.8/FPC 3.0.4 - 32/64 bits on:
(K|L|X)Ubuntu 12..18, Windows XP, 7, 10 and various DOSes.

MarkMLl

  • Hero Member
  • *****
  • Posts: 925
Re: Where to start with PDFs and OCR?
« Reply #5 on: May 30, 2020, 10:19:12 pm »
GRIN Actually, I might have been thinking of Bitsavers rather than archive.org, but I do tend to lump them together. The important thing is that various other people have looked at this, although it's obviously far easier if the PDF has been generated (e.g. is the output of a printing stage) rather than scanned... with printed-signed-scanned files being somewhere in the middle, since at least they've not come from a dog-eared manual.

MarkMLl
Turbo Pascal v1 on CCP/M-86, multitasking with LAN and graphics in 128Kb.
Pet hate: people who boast about the size and sophistication of their computer.

 
