
Author Topic: Extract text from PDF  (Read 960 times)

wittbo

  • Full Member
  • ***
  • Posts: 141
Extract text from PDF
« on: September 17, 2020, 05:51:20 pm »
I'm looking for a component (if there is one) that can extract all the text from a given PDF. It should load the PDF from a file and extract the plain text into a stringlist.
I know there were one or two threads in this direction nearly 8 years ago. Recently I saw something about fpPDF or fPDF, but I imagine that these components generate PDFs rather than read them.
Does anyone have experience with this topic, or hints on where to look for more information?
-wittbo-
MBAir with MacOS 10.14.6 / Lazarus 2.0.10
iMac with MacOS 10.13.6 / Lazarus 2.0.2

af0815

  • Hero Member
  • *****
  • Posts: 586
Re: Extract text from PDF
« Reply #1 on: September 17, 2020, 07:46:24 pm »
A PDF can hold text in various forms, sometimes only as a picture, so this is quite complex. I have not found any component for this.
The last time I needed this I used OCR for the conversion, because my PDFs contained the text as pictures.
regards
Andreas

wittbo

  • Full Member
  • ***
  • Posts: 141
Re: Extract text from PDF
« Reply #2 on: September 17, 2020, 09:47:02 pm »
OK, thanks for pointing that out.
For my purpose it would be enough to limit myself to PDFs with real text, no OCR.
-wittbo-

af0815

  • Hero Member
  • *****
  • Posts: 586
Re: Extract text from PDF
« Reply #3 on: September 17, 2020, 10:03:38 pm »
You can run a command-line tool with TProcess to do the conversion -> see pdftotext and poppler-utils

https://www.linuxuprising.com/2019/05/how-to-convert-pdf-to-text-on-linux-gui.html

I have seen that these utilities work with simpler PDFs, as long as the text is not embedded as graphics.
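
The TProcess route can be sketched roughly as follows. This is only a minimal sketch, not a tested component: it assumes that a pdftotext binary (from Xpdf or poppler-utils) can be found on the PATH, and the names ExtractPdfText and PdfToStrings are made up for illustration. It uses RunCommand from the Process unit, which is a convenience wrapper around TProcess that also drains the output pipe for you.

```pascal
program PdfToStrings;

{$mode objfpc}{$H+}

uses
  Classes, SysUtils, Process;

// Hypothetical helper: runs pdftotext on PdfFile and stores the extracted
// text in Lines. Returns False if the tool is missing or reports an error.
// Assumption: "pdftotext" is reachable via the PATH of this process.
function ExtractPdfText(const PdfFile: string; Lines: TStrings): Boolean;
var
  Captured: AnsiString;
begin
  // "-" as the output file name makes pdftotext write to stdout;
  // "-layout" asks it to keep the physical page layout where possible.
  Result := RunCommand('pdftotext', ['-layout', PdfFile, '-'], Captured);
  if Result then
    Lines.Text := Captured;
end;

var
  Lines: TStringList;
begin
  if ParamCount < 1 then
    WriteLn('Usage: PdfToStrings <file.pdf>')
  else
  begin
    Lines := TStringList.Create;
    try
      if ExtractPdfText(ParamStr(1), Lines) then
        WriteLn('Extracted ', Lines.Count, ' lines')
      else
        WriteLn('pdftotext failed or was not found');
    finally
      Lines.Free;
    end;
  end;
end.
```

A GUI application on macOS may start with a shorter PATH than a terminal session, so passing the full path to the pdftotext binary instead of the bare name is probably the safer choice there.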



« Last Edit: September 17, 2020, 10:05:51 pm by af0815 »
regards
Andreas

rvk

  • Hero Member
  • *****
  • Posts: 4384
Re: Extract text from PDF
« Reply #4 on: September 17, 2020, 10:07:21 pm »
There are pdf2text solutions, but you would need to call an executable from your program.

Another option could be to print the PDF automatically to a text printer and capture the output in a file (which can be done in code)  :D
(But for that, a text-printer driver must be present.)

You could use the txtwrite device from Ghostscript (for this you would need to supply a version of Ghostscript, installed or portable).
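
For the Ghostscript route, a comparable sketch could look like this. It assumes the Ghostscript executable is called gs and is on the PATH (on Windows the console binary is usually named gswin64c.exe or gswin32c.exe instead); GhostscriptText and GsTxtWrite are illustrative names only.

```pascal
program GsTxtWrite;

{$mode objfpc}{$H+}

uses
  SysUtils, Process;

// Hypothetical helper: runs Ghostscript with the txtwrite device and
// returns the extracted text in Text. Returns False on any failure.
// Assumption: the executable is named "gs" and is reachable via the PATH.
function GhostscriptText(const PdfFile: string; out Text: AnsiString): Boolean;
begin
  // -q suppresses the startup banner, -dNOPAUSE/-dBATCH run without
  // prompts and exit afterwards, -sOutputFile=- sends the text to stdout.
  Result := RunCommand('gs',
    ['-q', '-dNOPAUSE', '-dBATCH', '-sDEVICE=txtwrite', '-sOutputFile=-',
     PdfFile],
    Text);
end;

var
  Text: AnsiString;
begin
  if ParamCount < 1 then
    WriteLn('Usage: GsTxtWrite <file.pdf>')
  else if GhostscriptText(ParamStr(1), Text) then
    Write(Text)
  else
    WriteLn('Ghostscript failed or was not found');
end.
```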

The last option would be to decode the PDF (whose content is stored in streams) yourself.

Other topics:
https://forum.lazarus.freepascal.org/index.php?topic=46859.0
https://www.lazarusforum.de/viewtopic.php?t=2659
https://stackoverflow.com/questions/3650957/how-to-extract-text-from-a-pdf

trev

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 1044
  • Former Delphi 1-7, 10.2 User
Re: Extract text from PDF
« Reply #5 on: September 18, 2020, 02:09:36 am »
Quote from: af0815 on September 17, 2020, 10:03:38 pm
You can run a command-line tool with TProcess to do the conversion -> see pdftotext and poppler-utils

I've used pdftotext and pdftohtml on a massive scale to convert fairly complex legislation, very successfully, unless there were images (e.g. for mathematical formulae), which then required manual intervention.
o Lazarus v2.1.0 r63871, FPC v3.3.1 r47164, macOS 10.14.6 (with sup update), Xcode 11.3.1
o Lazarus v2.1.0 r61574, FPC v3.3.1 r42318, FreeBSD 12.1 amd64 (VMware Fusion VM)
o FPC 3.0.4, FreeBSD 12.2-STABLE r365646 amd64
o Lazarus v2.1.0 r61574, FPC v3.0.4, Ubuntu 20.04 (Parallels VM)

wittbo

  • Full Member
  • ***
  • Posts: 141
Re: Extract text from PDF
« Reply #6 on: September 19, 2020, 04:53:25 pm »
That sounds interesting. Does anyone know the price of pdf2text or is it free for private use?
-wittbo-

winni

  • Hero Member
  • *****
  • Posts: 1985
Re: Extract text from PDF
« Reply #7 on: September 19, 2020, 05:01:44 pm »
Hi!

pdftotext and its friends are open source.

They come with nearly all Linux distros.

Have a look at

https://poppler.freedesktop.org/

Winni

rvk

  • Hero Member
  • *****
  • Posts: 4384
Re: Extract text from PDF
« Reply #8 on: September 19, 2020, 05:05:02 pm »
Quote from: wittbo on September 19, 2020, 04:53:25 pm
That sounds interesting. Does anyone know the price of pdf2text or is it free for private use?

It depends. You are on Mac? Is it installed by default on Mac (like it often is on Linux)?

https://en.m.wikipedia.org/wiki/Pdftotext

There are also lots of other versions floating around. You would need to look at the licenses to see if it's freely usable.

Calling a GPL program as an executable from your closed-source program is usually allowed.
Calling a GPL library from your closed-source program usually is not.
But that could spark a whole different license discussion  :D
« Last Edit: September 19, 2020, 05:20:30 pm by rvk »

wittbo

  • Full Member
  • ***
  • Posts: 141
Re: Extract text from PDF
« Reply #9 on: September 19, 2020, 11:39:09 pm »
Thanks to all for your help.

@rvk: yes, I'm on mac.

When looking for pdftotext for macOS I found the site of Carsten Bluem (https://www.bluem.net/files/pdftotext.dmg), who took the pdftotext command-line tool from the "Xpdf" software (http://www.xpdfreader.com) and made it available as an easy-to-install package. So, no license problem, it's open source.
-wittbo-

rvk

  • Hero Member
  • *****
  • Posts: 4384
Re: Extract text from PDF
« Reply #10 on: September 20, 2020, 09:21:20 pm »
Quote from: wittbo on September 19, 2020, 11:39:09 pm
So, no license problem, it's open source.

Open source doesn't necessarily mean no problem.
Xpdf is licensed under GPL v2 and GPL v3.
http://www.xpdfreader.com/opensource.html

You can't dynamically link (.dll, .so etc.) to it without open-sourcing your own software.
You can install it separately, call the executable, and keep your own software closed source.

But even with open source you always need to examine the license if you want to keep your software closed source. (By calling the executable you are 'safe' in this case.)



trev

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 1044
  • Former Delphi 1-7, 10.2 User
Re: Extract text from PDF
« Reply #11 on: September 21, 2020, 03:02:58 am »
Quote from: rvk on September 20, 2020, 09:21:20 pm
Xpdf is licensed under GPL v2 and GPL v3.
http://www.xpdfreader.com/opensource.html

Better link: https://www.glyphandcog.com/opensource.html according to my man pages (FreeBSD and Solaris).

 
