Lazarus

Programming => LCL => Topic started by: wittbo on September 17, 2020, 05:51:20 pm

Title: Extract text from PDF
Post by: wittbo on September 17, 2020, 05:51:20 pm: I'm looking for a component (if there is one) that can extract all text from a given PDF. It should load the PDF from a file and extract the plain text into a stringlist.
I know there were one or two threads on this subject nearly 8 years ago. Recently I saw something about fpPDF or fPDF, but I imagine that these components generate a PDF rather than read one.
Does anyone have experience with this topic, or any hints on where to look for more information?
Title: Re: Extract text from PDF
Post by: af0815 on September 17, 2020, 07:46:24 pm: A PDF can hold text in various forms, sometimes only as a picture, so this is quite complex. I did not find any component for this.
Last time I used OCR for the conversion, because my PDFs contained the text as a picture.
Title: Re: Extract text from PDF
Post by: wittbo on September 17, 2020, 09:47:02 pm: OK, thanks for the objection.
For my purpose it would be enough to limit myself to the PDFs with real text, no OCR.
Title: Re: Extract text from PDF
Post by: af0815 on September 17, 2020, 10:03:38 pm: You can use a command-line tool with TProcess to do the conversion -> see pdftotext and poppler-utils

https://www.linuxuprising.com/2019/05/how-to-convert-pdf-to-text-on-linux-gui.html

I have seen that these utilities work with simpler PDFs, as long as the text is not embedded as graphics.
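
A minimal sketch of the "call pdftotext as an external process" idea, written in Python only for illustration; the same pattern applies to TProcess in Lazarus. It assumes Poppler's pdftotext is installed and on the PATH. The function names `build_pdftotext_command` and `extract_lines` are made up for this example.

```python
import shutil
import subprocess


def build_pdftotext_command(pdf_path, keep_layout=True):
    """Build the argument list for pdftotext (hypothetical helper)."""
    cmd = ["pdftotext", "-enc", "UTF-8"]
    if keep_layout:
        # -layout tries to preserve the physical layout of the page
        cmd.append("-layout")
    # "-" as the output file makes pdftotext write to stdout
    cmd += [pdf_path, "-"]
    return cmd


def extract_lines(pdf_path):
    """Run pdftotext and return the extracted text as a list of lines."""
    if shutil.which("pdftotext") is None:
        raise FileNotFoundError("pdftotext not found on PATH")
    result = subprocess.run(
        build_pdftotext_command(pdf_path),
        capture_output=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout.decode("utf-8", errors="replace").splitlines()
```

Calling `extract_lines("document.pdf")` would return something comparable to a stringlist. PDFs that only contain scanned images will yield little or no text.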
Title: Re: Extract text from PDF
Post by: rvk on September 17, 2020, 10:07:21 pm: There are pdf2text solutions but you would need to call some executable from your program.

Another option could be to print the PDF automatically to a text printer and capture the output in a file (which can be done in code) :D
(But for that a text-printer driver should be present.)

You could use the txtwrite device from Ghostscript. (For this you would need to supply a version of Ghostscript, installed or portable.)
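
As a rough sketch, the Ghostscript route could look like this (Python for illustration only). It assumes the executable is called `gs`, as on Linux and macOS; on Windows the console binary is usually named differently, so the name is a parameter. The helper names are hypothetical.

```python
import subprocess


def build_gs_txtwrite_command(pdf_path, gs_executable="gs"):
    """Build a Ghostscript command line that uses the txtwrite device."""
    return [
        gs_executable,
        "-q",                 # suppress startup messages
        "-dNOPAUSE",          # do not pause after each page
        "-dBATCH",            # exit after processing the input
        "-sDEVICE=txtwrite",  # text extraction output device
        "-sOutputFile=-",     # write the result to stdout
        pdf_path,
    ]


def extract_text_with_gs(pdf_path, gs_executable="gs"):
    """Run Ghostscript and return the extracted text as one string."""
    result = subprocess.run(
        build_gs_txtwrite_command(pdf_path, gs_executable),
        capture_output=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout.decode("utf-8", errors="replace")
```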

The last option would be to decode the PDF yourself (its page content is stored in streams).
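
To give an idea of what decoding the streams yourself involves, here is a deliberately naive sketch (Python for illustration, all names hypothetical). It only handles Flate-compressed content streams and plain `(text) Tj` operators. It ignores escaped parentheses, `TJ` arrays, hex strings, font encodings, object streams, and encryption, so it will fail on many real-world PDFs.

```python
import re
import zlib

# Raw bytes between the "stream" and "endstream" keywords
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)
# A literal string followed by the Tj (show text) operator, e.g. (Hello) Tj
TJ_RE = re.compile(rb"\((.*?)\)\s*Tj", re.DOTALL)


def extract_tj_strings(pdf_bytes):
    """Return the literal strings shown with Tj in Flate-compressed streams."""
    texts = []
    for match in STREAM_RE.finditer(pdf_bytes):
        raw = match.group(1)
        try:
            # decompressobj tolerates the end-of-line bytes before "endstream"
            content = zlib.decompressobj().decompress(raw)
        except zlib.error:
            continue  # not a Flate stream (image data, other filter, ...)
        for literal in TJ_RE.findall(content):
            # Assumption: single-byte encoding; real fonts may map bytes differently
            texts.append(literal.decode("latin-1"))
    return texts
```

In practice this is the part that tools like pdftotext handle for you, which is why calling one of them is usually the easier path.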

Other topics:
https://forum.lazarus.freepascal.org/index.php?topic=46859.0
https://www.lazarusforum.de/viewtopic.php?t=2659
https://stackoverflow.com/questions/3650957/how-to-extract-text-from-a-pdf
Title: Re: Extract text from PDF
Post by: trev on September 18, 2020, 02:09:36 am: Quote from: af0815 on September 17, 2020, 10:03:38 pm
You can use a command-line tool with TProcess to do the conversion -> see pdftotext and poppler-utils

I've used pdftotext and pdftohtml on a massive scale to convert fairly complex legislation very successfully unless there were any images (eg for mathematical formulae) which then required manual intervention.
Title: Re: Extract text from PDF
Post by: wittbo on September 19, 2020, 04:53:25 pm: That sounds interesting. Does anyone know the price of pdf2text or is it free for private use?
Title: Re: Extract text from PDF
Post by: winni on September 19, 2020, 05:01:44 pm: Hi!

pdftotext and its friends are open source.

They come with nearly all Linux distros.

Have a look at

https://poppler.freedesktop.org/

Winni
Title: Re: Extract text from PDF
Post by: rvk on September 19, 2020, 05:05:02 pm: Quote from: wittbo on September 19, 2020, 04:53:25 pm
That sounds interesting. Does anyone know the price of pdf2text or is it free for private use?
It depends. Are you on Mac? Is it installed by default on Mac (as it often is on Linux)?

https://en.m.wikipedia.org/wiki/Pdftotext

There are also lots of other versions floating around. You would need to look at the licenses to see if it's freely usable.

Calling a GPL program as an executable from your closed-source program is usually allowed.
Calling a GPL library from your closed-source program usually is not.
But that could spark a whole different license discussion :D
Title: Re: Extract text from PDF
Post by: wittbo on September 19, 2020, 11:39:09 pm: Thanks to all for your help.

@rvk: yes, I'm on mac.

When looking for pdftotext for macOS I found the site of Carsten Bluem (https://www.bluem.net/files/pdftotext.dmg), who extracted the pdftotext command-line tool from the "Xpdf" software (http://www.xpdfreader.com) and offers it as an easy-to-use installer package. So, no license problem, it's open source.
Title: Re: Extract text from PDF
Post by: rvk on September 20, 2020, 09:21:20 pm: Quote from: wittbo on September 19, 2020, 11:39:09 pm
So, no license problem, it's open source.
Open source doesn't necessarily mean no problem.
Xpdf is licensed under GPL v2 and GPL v3.
http://www.xpdfreader.com/opensource.html

You can't dynamically link (dll, .so etc.) to it without open sourcing your own software.
You can install it separately, call the executable, and keep your own software closed source.

But even with open source you always need to examine the licence if you want to keep your software closed source. (With calling the executable you are 'safe' in this case.)
Title: Re: Extract text from PDF
Post by: trev on September 21, 2020, 03:02:58 am: Quote from: rvk on September 20, 2020, 09:21:20 pm
Xpdf is licensed under GPL v2 and GPL v3.
http://www.xpdfreader.com/opensource.html

Better link: https://www.glyphandcog.com/opensource.html according to my man pages (FreeBSD and Solaris).