[DONE but not solved] From PDF to Textfile

madref

Hero Member
Posts: 949
..... A day not Laughed is a day wasted !!

« on: September 25, 2019, 12:00:14 pm »

Is it possible to convert a PDF file into a text file?

GameSheet-1920-9930.pdf (62.37 kB - downloaded 84 times.)

« Last Edit: October 03, 2019, 05:49:42 pm by madref »

You treat a disease, you win, you lose.
You treat a person and I guarantee you, you win, no matter the outcome.

Lazarus 3.99 (rev main_3_99-649-ge13451a5ab) FPC 3.3.1 x86_64-darwin-cocoa
Mac OS X Monterey

rvk

Hero Member
Posts: 6169

Re: From PDF to Textfile

« Reply #1 on: September 25, 2019, 12:14:50 pm »

You could use pdftotext from xpdfbin.
(With the -layout parameter it produces the attached text file.)

(No idea if there is something native to FPC)
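
A minimal sketch of how that conversion could be scripted, assuming pdftotext is installed and on the PATH. Python is used here only to keep the example short, and the function names pdftotext_command and convert_pdf are made up for illustration.

```python
import subprocess
from pathlib import Path


def pdftotext_command(pdf_path, txt_path=None):
    """Build the argument list for pdftotext with the -layout option."""
    pdf = Path(pdf_path)
    # Default output name: same name as the PDF, with a .txt extension.
    txt = Path(txt_path) if txt_path else pdf.with_suffix(".txt")
    return ["pdftotext", "-layout", str(pdf), str(txt)]


def convert_pdf(pdf_path, txt_path=None):
    """Run pdftotext and return the path of the text file it wrote."""
    cmd = pdftotext_command(pdf_path, txt_path)
    # check=True raises CalledProcessError if pdftotext exits with an error.
    subprocess.run(cmd, check=True)
    return Path(cmd[-1])
```

Calling convert_pdf("GameSheet-1920-9930.pdf") would then write GameSheet-1920-9930.txt next to the PDF. The same command line can also be started from an FPC program, for example with TProcess.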

GameSheet-1920-9930.txt (10.42 kB - downloaded 72 times.)

madref

Re: From PDF to Textfile

« Reply #2 on: September 28, 2019, 01:15:35 pm »

I have this sample project which Howard wrote for me, but I find it very difficult to follow what he has done.
He wrote a program for me that 'reads' a website and parses the individual penalties into my database, and I have built it into this database.
This is one such website: https://www.nijb.nl/nijbsheet.php?GameID=52702&ShowGameSheet=1

Unfortunately, Howard doesn't have the time to help me, so I am returning to the forum.

Can anyone put me on the right track to bring the PDF text above into my sample project?

ForAschwin3.zip (136.92 kB - downloaded 65 times.)

madref

Re: From PDF to Textfile

« Reply #3 on: October 02, 2019, 11:02:13 am »

no one?

MarkMLl

Hero Member
Posts: 6692

Re: From PDF to Textfile

« Reply #4 on: October 02, 2019, 11:37:15 am »

Quote from: madref on September 25, 2019, 12:00:14 pm

Is it possible to convert a PDF file into a text file?

In the general case: no.

Depending on the content of the PDF, the operating system, and what other stuff is installed (OCR etc.): perhaps.

If you have a PDF which is entirely encapsulated text together with fount and formatting information, it might be possible to extract the original. If it's a scanned document saved as a bitmap, particularly if it's older stuff or the paper was soiled, then the best you'll get is a bitmap. Everything between those extremes relies on OCR to some extent.
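
One rough way to tell which kind of PDF you are dealing with is to run a text extractor first and look at how much text comes out. The sketch below assumes that an image-only PDF yields little more than whitespace and form feeds; the function name has_text_layer and the threshold of 20 characters are arbitrary choices for illustration, not a tested rule.

```python
def has_text_layer(extracted_text, min_chars=20):
    """Guess whether extracted text came from a PDF that contains real text."""
    # Count characters that are not whitespace (form feeds count as whitespace).
    visible = sum(1 for ch in extracted_text if not ch.isspace())
    # Below the threshold we assume the PDF is a scan and would need OCR.
    return visible >= min_chars
```

If this returns False, the file is probably a scan and plain extraction will not get you anywhere without OCR.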

MT+86 & Turbo Pascal v1 on CCP/M-86, multitasking with LAN & graphics in 128Kb.
Pet hate: people who boast about the size and sophistication of their computer.
GitHub repositories: https://github.com/MarkMLl?tab=repositories

rvk

Re: From PDF to Textfile

« Reply #5 on: October 02, 2019, 11:38:55 am »

The problem is that the original uses an HTML table structure, so you know exactly where each number is.

The txt variant doesn't have <td> etc. It doesn't even have tabs. So you would need to read every line and specify at which position each number sits.

To top it all off, it seems that not all columns and column headers line up. For instance, the Nr header is left-justified and the actual number is right-justified.

Code:

Pos Nr          Name                                           Lic              Sit       G          Time         G    A     A          Time         Nr       Duration     Penalty     Start     End
GO      31      Barendregt Tom                                 99998            PP1       1          21:07        28   15    11         05:27        10       5            BOAR        05:27     10:27
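
A quick way to look at the fields in such a row is to split on runs of two or more spaces. This is only a sketch (split_columns is a made-up name, and the row is shortened to its first seven fields): it assumes no cell is empty and no value contains a double space, and it already shows that the header does not split the same way as the data.

```python
import re

# First part of the data row from the sample above.
ROW = "GO      31      Barendregt Tom      99998      PP1      1      21:07"


def split_columns(line):
    """Split a layout-preserved line on runs of two or more whitespace characters."""
    return re.split(r"\s{2,}", line.strip())


fields = split_columns(ROW)          # "Barendregt Tom" stays in one piece
header = split_columns("Pos Nr          Name")  # "Pos Nr" ends up as one field
```

Because "Pos" and "Nr" are separated by a single space, the header yields one field fewer than the data row, so header names cannot simply be paired with values this way.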

In theory you can read the file until you encounter "Pos Nr". Then read that line and record the index position of every 'field'. Then read the following lines and extract the information from those positions. That's how Howard did it for you for the HTML. But converting the example (or writing it from scratch) is going to take quite some time, and someone has to put that time in.
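
The steps above can be sketched roughly as follows. This is not Howard's code; it uses Python only for brevity, all function names are hypothetical, and the header and row are a shortened, artificially aligned stand-in for the real file. In the real text file a value may start before its header does, in which case the slice boundaries would need adjusting by hand.

```python
import re


def column_starts(header):
    """Return (name, start index) for every word in the header line."""
    return [(m.group(), m.start()) for m in re.finditer(r"\S+", header)]


def slice_row(row, starts):
    """Cut a data row at the header positions and strip each piece."""
    values = []
    for idx, (_, start) in enumerate(starts):
        # A field runs up to the start of the next header, or to the line end.
        end = starts[idx + 1][1] if idx + 1 < len(starts) else len(row)
        values.append(row[start:end].strip())
    return values


def parse_table(lines, marker="Pos Nr"):
    """Find the header line, then slice every following non-empty line."""
    for i, line in enumerate(lines):
        if line.lstrip().startswith(marker):
            starts = column_starts(line)
            # Names are kept as a list: the real header repeats G, A, Time, Nr.
            names = [name for name, _ in starts]
            rows = []
            for row in lines[i + 1:]:
                if not row.strip():
                    break  # a blank line ends the table in this sketch
                rows.append(slice_row(row, starts))
            return names, rows
    return [], []


# Hypothetical, shortened sample: left-justified headers, right-justified Nr.
HEADER = "{:<4}{:<6}{:<16}{}".format("Pos", "Nr", "Name", "Lic")
ROW = "{:<4}{:>4}  {:<16}{}".format("GO", 31, "Barendregt Tom", 99998)

names, rows = parse_table(["some title line", HEADER, ROW, ""])
```

Slicing from one header start to the next copes with a right-justified number as long as the number stays inside its own column; that is the part that would need checking line by line on the real game sheets.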

Why can't you do it?

Isn't the information in this PDF also available as HTML on the same website?

Quote from: MarkMLl on October 02, 2019, 11:37:15 am

Quote from: madref on September 25, 2019, 12:00:14 pm
Is it possible to convert a PDF file into a text file?
In the general case: no.

Actually the PDF is just text and can be converted to txt with pdftotext like I showed in my post.
The problem is reading and interpreting the information afterwards.

madref

Re: From PDF to Textfile

« Reply #6 on: October 02, 2019, 07:10:37 pm »

The HTML site comes from a Dutch website and the PDF comes from a Belgian website.
Both are in Dutch, but they have different origins.

@rvk
The package you pointed me to also has a program called pdftohtml. Could that work for me?

rvk

Re: From PDF to Textfile

« Reply #7 on: October 03, 2019, 10:28:13 am »

Quote from: madref on October 02, 2019, 07:10:37 pm

The package you pointed me to also has a program called pdftohtml. Could that work for me?

I'm not sure if that makes it any easier. I don't know if you realize how much time was spent on converting the original.

Attached is the output of pdftohtml.

Looking at the first two rows, you can see that "Pos Nr Name" all end up in one column.
Furthermore, the output is not a table but a lot of <div>s.

I think it is more work to extract the information from that than from the .txt variant.

page1.txt (93.31 kB - downloaded 62 times.)

madref

Re: From PDF to Textfile

« Reply #8 on: October 03, 2019, 05:49:21 pm »

I think I am going to abandon this, because manual input is much quicker than the labour I would need to put into this.

Thanks for all your input.
