Need a way to search a pdf for text

wpflum:
Is there anything out there that will let me directly search a PDF (one that contains real text, not just images) for a text string? Barring that, how about something that will let me, in Lazarus, convert the first few pages to text so I can search them in code?

What I'm looking to do is write a program that scans a directory of ebook PDFs, which may or may not be named correctly, and pulls the ISBN out of the publisher page. Then I intend to write a scraper that looks up that ISBN in some online source, so I can rename the file with a standardized title and drop a summary into a database for later use in an ebook library program I want to write.

I know I can use pdftotext on Linux, which is the platform the program will run on, but I'm not sure how I might integrate that into a Lazarus program, and I'd rather have an internal solution so I can port it to other platforms if necessary.

Any ideas?

Leledumbo:
Here's a link explaining what the internal structure of a PDF looks like: http://www.planetpdf.com/developer/article.asp?ContentID=navigating_the_internal_struct
and the official reference from Adobe (WARNING: have aspirin ready):
http://www.adobe.com/devnet/pdf/pdf_reference.html

AFAIK there are a lot of PDF-to-text converters out there for Windows. Examples:
http://www.a-pdf.com/text/
http://www.somepdf.com/some-pdf-to-txt-converter.html
http://pdf-to-html-word.com/pdf-to-text/

You can use TProcess to run the converter and then process the text files it produces.
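
A minimal sketch of that approach, under some assumptions: it uses RunCommand, a helper in the Process unit that wraps TProcess, and it assumes the pdftotext command-line tool (mentioned elsewhere in this thread) is installed and on the PATH. The names PdfToText and PdfFirstPages and the five-page limit are made up for illustration. Instead of writing a text file, it asks pdftotext to print to standard output and captures that.

```pascal
program PdfFirstPages;

{$mode objfpc}{$H+}

uses
  SysUtils, Process;

// Hypothetical helper: returns the text of pages 1..LastPage,
// or '' if pdftotext cannot be started or exits with an error.
function PdfToText(const FileName: string; LastPage: Integer): string;
var
  Txt: string;
begin
  Txt := '';
  // '-l N' stops after page N; the trailing '-' sends the text to stdout.
  if RunCommand('pdftotext',
       ['-l', IntToStr(LastPage), FileName, '-'], Txt) then
    Result := Txt
  else
    Result := '';
end;

begin
  if ParamCount < 1 then
  begin
    WriteLn('usage: PdfFirstPages <file.pdf>');
    Halt(1);
  end;
  // Assumption: the publisher page is within the first five pages.
  Write(PdfToText(ParamStr(1), 5));
end.
```

RunCommand blocks until the converter finishes, which is probably fine for a batch scan of a directory. A GUI program that must stay responsive would more likely drive TProcess directly or run this in a thread.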

Chronos:
You could create a PDF reader component for Free Pascal, or use the API of another program such as Acrobat Reader.

There is a wiki page about PowerPDF, a PDF creation package: http://wiki.freepascal.org/PowerPDF It will not do what you need directly, but it could be a helpful reference.
http://wiki.lazarus.freepascal.org/fpvectorial is a package for reading vector images, and it can read PDF as well.

Another possibility would be to port an existing Delphi component.

wpflum:
For the moment I went with using TProcess to run the external command pdftotext and piping back the results. I also used TProcess to run a Perl program that uses some scrapers to pull the info off the web for each PDF that contains a good ISBN, which is working well enough at the moment. I'd prefer to keep it all in Pascal, but if I screw around too much trying to do this I'll NEVER get it done  :D
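
For the "good ISBN" check, here is a rough sketch of one way to pull an ISBN out of the text that pdftotext hands back. It is not the code actually used in this thread. ValidIsbn and ExtractIsbn are hypothetical names, and the approach (look for the word "ISBN", collect the digits after it, verify the standard ISBN-10 or ISBN-13 check digit) is an assumption about what would work on typical publisher pages. It does not handle labels such as "ISBN-13:", where the "13" would be read as part of the number.

```pascal
program FindIsbn;

{$mode objfpc}{$H+}

uses
  SysUtils, StrUtils;

// True if S (digits only, optional final 'X' for ISBN-10) has a valid check digit.
function ValidIsbn(const S: string): Boolean;
var
  i, Sum, D: Integer;
begin
  Result := False;
  Sum := 0;
  if Length(S) = 10 then
  begin
    // ISBN-10: weights 10 down to 1, total divisible by 11; 'X' counts as 10.
    for i := 1 to 10 do
    begin
      if (i = 10) and (S[i] = 'X') then
        D := 10
      else if S[i] in ['0'..'9'] then
        D := Ord(S[i]) - Ord('0')
      else
        Exit;
      Sum := Sum + D * (11 - i);
    end;
    Result := Sum mod 11 = 0;
  end
  else if Length(S) = 13 then
  begin
    // ISBN-13: weights alternate 1, 3; total divisible by 10.
    for i := 1 to 13 do
    begin
      if not (S[i] in ['0'..'9']) then
        Exit;
      D := Ord(S[i]) - Ord('0');
      if Odd(i) then
        Sum := Sum + D
      else
        Sum := Sum + 3 * D;
    end;
    Result := Sum mod 10 = 0;
  end;
end;

// Returns the first valid ISBN that follows the word 'ISBN' in Src, or ''.
function ExtractIsbn(const Src: string): string;
var
  Up, Cand: string;
  P, i: Integer;
begin
  Result := '';
  Up := UpperCase(Src);
  P := Pos('ISBN', Up);
  while P > 0 do
  begin
    Cand := '';
    i := P + 4;
    // Collect digits and 'X', skipping spaces, hyphens and colons.
    while (i <= Length(Up)) and (Length(Cand) < 13) and
          (Up[i] in ['0'..'9', 'X', ' ', '-', ':']) do
    begin
      if Up[i] in ['0'..'9', 'X'] then
        Cand := Cand + Up[i];
      Inc(i);
    end;
    if ValidIsbn(Cand) then
      Exit(Cand);
    P := PosEx('ISBN', Up, P + 4);
  end;
end;

begin
  // Prints 9780306406157
  WriteLn(ExtractIsbn('Printed in the USA. ISBN 978-0-306-40615-7 (pbk.)'));
end.
```

Checking the check digit is what makes an ISBN "good" here: a number that fails it is likely a misread or some other number on the page, so it is not worth sending to the scraper.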
