Lazarus
Programming => LCL => Topic started by: wittbo on September 17, 2020, 05:51:20 pm
-
I'm looking for a component (if there is one) that can extract all text from a given PDF. It should load the PDF from a file and extract the plain text into a string list.
I know there were one or two threads on this nearly 8 years ago. Recently I saw something about fpPDF or fPDF, but I imagine these components generate PDFs rather than read them.
Does anyone have experience with this topic, or any hints on where to look for more information?
-
A PDF can hold text in various forms, possibly even as a picture, so this is quite complex. I did not find any component for this.
Last time I used OCR for the conversion, because my PDF stored the text as a picture.
-
OK, thanks for pointing that out.
For my purposes it would be enough to limit myself to PDFs with real text, no OCR.
-
You can use a command-line tool with TProcess to do the conversion -> see pdftotext and poppler-utils
https://www.linuxuprising.com/2019/05/how-to-convert-pdf-to-text-on-linux-gui.html
I have seen that these utilities work with simpler PDFs, as long as the text is not embedded as graphics.
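As a rough illustration of the TProcess route, here is a minimal sketch that uses RunCommand from the Process unit (a convenience wrapper around TProcess) to call pdftotext and load the result into a TStringList. It assumes pdftotext is installed and can be found on the PATH; the function name ExtractPdfText and the program name are made up for this example, and the -layout and -enc options are optional choices, not requirements.

```pascal
program PdfToStrings;

{$mode objfpc}{$H+}

uses
  Classes, SysUtils, Process;

// Hypothetical helper: runs the external pdftotext tool and puts the
// extracted text into Lines. Assumes pdftotext (poppler-utils or Xpdf)
// is installed and on the PATH; otherwise pass a full path instead.
function ExtractPdfText(const PdfFile: string; Lines: TStrings): Boolean;
var
  Captured: string;
begin
  // Passing '-' as the output file makes pdftotext write to stdout,
  // which RunCommand collects in Captured.
  Result := RunCommand('pdftotext',
    ['-layout', '-enc', 'UTF-8', PdfFile, '-'], Captured);
  if Result then
    Lines.Text := Captured;
end;

var
  Lines: TStringList;
begin
  if ParamCount < 1 then
  begin
    WriteLn('Usage: PdfToStrings <file.pdf>');
    Halt(1);
  end;
  Lines := TStringList.Create;
  try
    if ExtractPdfText(ParamStr(1), Lines) then
      WriteLn('Extracted ', Lines.Count, ' lines')
    else
      WriteLn('pdftotext failed or could not be started');
  finally
    Lines.Free;
  end;
end.
```

On macOS a GUI application may not inherit the shell's PATH, so giving the full path to the pdftotext binary is probably the safer choice there.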
-
There are pdf2text solutions, but you would need to call an executable from your program.
Another option could be to print the PDF automatically to a text printer and capture the output in a file (which can be done in code) :D
(But for that a text-printer driver has to be present.)
You could use the txtwrite device from Ghostscript. (For this you would need to supply a version of Ghostscript, installed or portable.)
The last option would be to decode the PDF (whose content is stored in streams) yourself.
Other topics:
https://forum.lazarus.freepascal.org/index.php?topic=46859.0
https://www.lazarusforum.de/viewtopic.php?t=2659
https://stackoverflow.com/questions/3650957/how-to-extract-text-from-a-pdf
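The Ghostscript txtwrite option mentioned above could look roughly like the following sketch. It assumes the Ghostscript console executable is called gs and is on the PATH (the usual name on Linux and macOS; Windows builds typically use gswin64c.exe). GsExtractText and the program name are hypothetical names for this example, and the text file is simply written next to the PDF.

```pascal
program GsToText;

{$mode objfpc}{$H+}

uses
  Classes, SysUtils, Process;

// Hypothetical helper: converts PdfFile to plain text in TxtFile using
// Ghostscript's txtwrite device. The executable name 'gs' is an assumption;
// adjust it (or use a full path) for your installation.
function GsExtractText(const PdfFile, TxtFile: string): Boolean;
var
  Captured: string;
begin
  // -q suppresses startup messages; -dNOPAUSE and -dBATCH make
  // Ghostscript process the file and exit without prompting.
  Result := RunCommand('gs',
    ['-q', '-dNOPAUSE', '-dBATCH', '-sDEVICE=txtwrite',
     '-sOutputFile=' + TxtFile, PdfFile], Captured);
end;

var
  Lines: TStringList;
  TxtFile: string;
begin
  if ParamCount < 1 then
  begin
    WriteLn('Usage: GsToText <file.pdf>');
    Halt(1);
  end;
  // Write the text next to the PDF, e.g. report.pdf -> report.txt
  TxtFile := ChangeFileExt(ParamStr(1), '.txt');
  Lines := TStringList.Create;
  try
    if GsExtractText(ParamStr(1), TxtFile) and FileExists(TxtFile) then
    begin
      Lines.LoadFromFile(TxtFile);
      WriteLn('Extracted ', Lines.Count, ' lines');
    end
    else
      WriteLn('Ghostscript failed or could not be started');
  finally
    Lines.Free;
  end;
end.
```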
-
Quote: "You can use a command-line tool with TProcess to do the conversion -> see pdftotext and poppler-utils"
I've used pdftotext and pdftohtml on a massive scale to convert fairly complex legislation very successfully, except where there were images (e.g. for mathematical formulae), which then required manual intervention.
-
That sounds interesting. Does anyone know the price of pdf2text or is it free for private use?
-
Hi!
pdftotext and its friends are open source.
They come with nearly all Linux distros.
Have a look at
https://poppler.freedesktop.org/
Winni
-
Quote: "That sounds interesting. Does anyone know the price of pdf2text or is it free for private use?"
It depends. Are you on a Mac? Is it installed by default on macOS (as it often is on Linux)?
https://en.m.wikipedia.org/wiki/Pdftotext
There are also lots of other versions floating around. You would need to look at the licenses to see if it's freely usable.
Calling a GPL program as an executable from your closed-source program is usually allowed.
Calling a GPL library from your closed-source program usually is not.
But that could spark a whole different license discussion :D
-
Thanks to all for your help.
@rvk: yes, I'm on mac.
When looking for pdftotext for macOS I found the site of Carsten Bluem (https://www.bluem.net/files/pdftotext.dmg), who extracted the pdftotext command-line tool, which is part of the "Xpdf" software (http://www.xpdfreader.com), and provides it as an easy-to-install package. So, no license problem, it's open source.
-
Quote: "So, no license problem, it's open source."
Open source doesn't necessarily mean no problem.
Xpdf is licensed under GPL v2 and GPL v3.
http://www.xpdfreader.com/opensource.html
You can't dynamically link to it (DLL, .so etc.) without open-sourcing your own software.
You can install it separately, call the executable, and keep your own software closed source.
But even with open source you always need to examine the license if you want to keep your software closed source. (By calling the executable you are 'safe' in this case.)
-
Quote: "Xpdf is licensed under GPL v2 and GPL v3.
http://www.xpdfreader.com/opensource.html"
Better link: https://www.glyphandcog.com/opensource.html according to my man pages (FreeBSD and Solaris).