Need a way to search a pdf for text
wpflum:
Is there anything out there that will let me directly search a PDF (one that contains text, not just images) for a text string? Barring that, how about something that will let me, in Lazarus, convert the first few pages to text so I can search them in code?
What I'm looking to do is write a program that scans a directory of ebook PDFs, which may or may not be named correctly, and pulls the ISBN out of the publisher page. Then I intend to write a scraper to pull that info from some online source, rename the file with a standardized title, and drop a summary into a database for later use in an ebook library program I want to write.
I know I can use pdftotext on Linux, which is the platform the program will run on, but I'm not sure how I might integrate that into a Lazarus program, and I'd rather have an internal solution so I can make it cross-platform if necessary.
Any ideas?
Leledumbo:
Here's a link explaining what the PDF structure looks like: http://www.planetpdf.com/developer/article.asp?ContentID=navigating_the_internal_struct
and the official reference from Adobe (WARNING: prepare aspirins):
http://www.adobe.com/devnet/pdf/pdf_reference.html
AFAIK there are a lot of PDF-to-text converters out there for Windows. Examples:
http://www.a-pdf.com/text/
http://www.somepdf.com/some-pdf-to-txt-converter.html
http://pdf-to-html-word.com/pdf-to-text/
You can use TProcess to run the converter and then process the text files it produces.
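As a rough sketch of the TProcess route, here is how the call might look using RunCommand from the Process unit (a convenience wrapper around TProcess). This assumes the converter is the Linux pdftotext tool and that it is on the PATH; the helper name PdfFirstPagesToText and the page limit of 5 are made up for illustration. The -l option sets the last page to convert, and the trailing "-" makes pdftotext write to standard output instead of a text file.

```pascal
program PdfTextSketch;

{$mode objfpc}{$H+}

uses
  SysUtils, Process;

// Hypothetical helper: runs "pdftotext -l <LastPage> <PdfPath> -" and
// hands back whatever the tool wrote to standard output.
function PdfFirstPagesToText(const PdfPath: string; LastPage: Integer;
  out PageText: string): Boolean;
begin
  // RunCommand waits for the process to finish, captures its output,
  // and reports False if the run did not succeed.
  Result := RunCommand('pdftotext',
    ['-l', IntToStr(LastPage), PdfPath, '-'], PageText);
end;

var
  Extracted: string;
begin
  if ParamCount < 1 then
  begin
    WriteLn('usage: pdftextsketch <file.pdf>');
    Halt(1);
  end;
  // Only the first 5 pages: the publisher page is normally near the front.
  if PdfFirstPagesToText(ParamStr(1), 5, Extracted) then
    WriteLn(Extracted)
  else
    WriteLn('pdftotext failed or is not installed');
end.
```

Reading the text straight from standard output avoids temporary files. With a converter that can only write to a file, the same call would be followed by loading that file, for example into a TStringList.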
Chronos:
You could create a PDF reader component for Free Pascal, or use the API of some other program, such as Acrobat Reader.
There is a wiki page about a PDF creator, PowerPDF: http://wiki.freepascal.org/PowerPDF It would not be directly useful for your needs, but it may still be helpful.
http://wiki.lazarus.freepascal.org/fpvectorial is a package for reading vector images, and it can read PDF as well.
Another possibility would be to port an existing Delphi component.
wpflum:
For the moment I went with using TProcess to run the external command pdftotext and piping back the results. I also used TProcess to run a Perl program that uses some scrapers to pull the info off the web for each PDF that contains a good ISBN, which is working well enough at the moment. I'd prefer to keep it all in Pascal, but if I screw around too much with trying to do this I'll NEVER get it done :D
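For anyone wanting the "good ISBN" check in Pascal, here is a minimal sketch of one way to do it. It is not the code used above; the function names are invented for illustration. It strips hyphens and spaces from a candidate string, then applies the standard ISBN-10 and ISBN-13 check-digit rules.

```pascal
program IsbnCheckSketch;

{$mode objfpc}{$H+}

// Keeps only digits and a possible 'X' check character (ISBN-10);
// hyphens, spaces and anything else are dropped.
function StripIsbn(const Raw: string): string;
var
  C: Char;
begin
  Result := '';
  for C in Raw do
    if C in ['0'..'9'] then
      Result := Result + C
    else if C in ['x', 'X'] then
      Result := Result + 'X';
end;

// ISBN-10: digits weighted 10 down to 1 must sum to a multiple of 11.
// 'X' stands for 10 and is only allowed in the last position.
function IsValidIsbn10(const S: string): Boolean;
var
  I, Sum, V: Integer;
begin
  Result := False;
  if Length(S) <> 10 then Exit;
  Sum := 0;
  for I := 1 to 10 do
  begin
    if S[I] in ['0'..'9'] then
      V := Ord(S[I]) - Ord('0')
    else if (S[I] = 'X') and (I = 10) then
      V := 10
    else
      Exit;
    Sum := Sum + V * (11 - I);
  end;
  Result := (Sum mod 11) = 0;
end;

// ISBN-13: digits weighted 1,3,1,3,... must sum to a multiple of 10.
function IsValidIsbn13(const S: string): Boolean;
var
  I, Sum, V: Integer;
begin
  Result := False;
  if Length(S) <> 13 then Exit;
  Sum := 0;
  for I := 1 to 13 do
  begin
    if not (S[I] in ['0'..'9']) then Exit;
    V := Ord(S[I]) - Ord('0');
    if Odd(I) then
      Sum := Sum + V
    else
      Sum := Sum + 3 * V;
  end;
  Result := (Sum mod 10) = 0;
end;

function IsValidIsbn(const Raw: string): Boolean;
var
  S: string;
begin
  S := StripIsbn(Raw);
  Result := IsValidIsbn10(S) or IsValidIsbn13(S);
end;

begin
  WriteLn(IsValidIsbn('0-306-40615-2'));      // valid ISBN-10
  WriteLn(IsValidIsbn('978-0-306-40615-7'));  // valid ISBN-13
  WriteLn(IsValidIsbn('978-0-306-40615-8'));  // wrong check digit
end.
```

Running the checksum on each candidate found in the pdftotext output filters out most look-alike numbers before any web lookup is attempted.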