So I've got a business problem I'm trying to solve. It starts with pulling in transaction data that's provided by vendors in PDF form, but unfortunately the PDFs in question just contain scanned images rather than text. I'll need to run OCR on the documents to pull out the text data before I do anything with it. Luckily the PDFs are mostly standardized.
Unless there's a piece of code floating around that will do everything I want, it looks like I'll need to:
- Extract the images out of the PDF file, probably one page/image at a time
- Process that extracted image to pull the text data out, likely with something like TTesseractOCR4
- Repeat until all pages have been processed
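To make that loop concrete, here is a rough sketch of what I have in mind. It's in Python rather than Delphi, purely for illustration. It assumes Poppler's `pdftoppm` and the `tesseract` command-line tool are both installed and on the PATH. The function names (`render_command`, `ocr_command`, `page_number`, `pdf_to_text`) are made up for this example, and the 300 DPI and `eng` defaults are guesses, not tested settings.

```python
import subprocess
import tempfile
from pathlib import Path


def render_command(pdf_path, out_prefix, dpi=300):
    """Build the pdftoppm call that renders every page of the PDF to a PNG."""
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf_path), str(out_prefix)]


def ocr_command(image_path, lang="eng"):
    """Build the tesseract call that prints the recognized text to stdout."""
    return ["tesseract", str(image_path), "stdout", "-l", lang]


def page_number(image_path):
    """Read the page number from a pdftoppm output name like 'page-012.png'."""
    return int(Path(image_path).stem.rsplit("-", 1)[1])


def pdf_to_text(pdf_path, dpi=300, lang="eng"):
    """Return a list with one string of OCR text per page (hypothetical helper)."""
    pages = []
    with tempfile.TemporaryDirectory() as tmp:
        prefix = Path(tmp) / "page"
        # Step 1: render each page of the scanned PDF to its own image.
        subprocess.run(render_command(pdf_path, prefix, dpi), check=True)
        # Steps 2 and 3: OCR each page image, in page order.
        for image in sorted(Path(tmp).glob("page-*.png"), key=page_number):
            result = subprocess.run(
                ocr_command(image, lang),
                check=True,
                capture_output=True,
                text=True,
            )
            pages.append(result.stdout)
    return pages
```

Calling `pdf_to_text("invoice.pdf")` (a placeholder file name) would then give back a list of strings, one per page, which is roughly the output I'm after.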
So, questions:
1. How would you approach opening a PDF file and extracting its contents, assuming it's just full of scanned images?
2. Is there a better tool than TTesseractOCR4 to turn those scans into text?
3. In a perfect world, there would be a tool that could do all of this in one step: point it at a PDF file and have it kick out a text file, a filled TMemo, a linked list of strings, or something similar. Is there one?
Thanks folks.