Author Topic: [DONE but not solved] From PDF to Textfile  (Read 4437 times)

madref

  • Hero Member
  • *****
  • Posts: 949
  • ..... A day not Laughed is a day wasted !!
    • Nursing With Humour
[DONE but not solved] From PDF to Textfile
« on: September 25, 2019, 12:00:14 pm »
Is it possible to convert a PDF file into a text file?
« Last Edit: October 03, 2019, 05:49:42 pm by madref »
You treat a disease, you win, you lose.
You treat a person and I guarantee you, you win, no matter the outcome.

Lazarus 3.99 (rev main_3_99-649-ge13451a5ab) FPC 3.3.1 x86_64-darwin-cocoa
Mac OS X Monterey

rvk

  • Hero Member
  • *****
  • Posts: 6169
Re: From PDF to Textfile
« Reply #1 on: September 25, 2019, 12:14:50 pm »
You could use pdftotext from xpdfbin (the Xpdf command-line tools).
With the -layout parameter it produces the attached text file.

(No idea if there is something native to FPC)
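
A minimal sketch of driving pdftotext from a script, shown in Python rather than Pascal purely for brevity (from FPC the same command line could be launched with TProcess). The helper names and file names are hypothetical; only the pdftotext executable and its -layout option come from the post above, and the tool is assumed to be on the PATH.

```python
import shutil
import subprocess


def pdftotext_command(pdf_path, txt_path, exe="pdftotext"):
    """Build the argument list: pdftotext -layout <input.pdf> <output.txt>."""
    return [exe, "-layout", pdf_path, txt_path]


def convert(pdf_path, txt_path):
    """Run pdftotext and raise if the tool is missing or reports a failure."""
    if shutil.which("pdftotext") is None:
        raise FileNotFoundError("pdftotext was not found on the PATH")
    # check=True turns a non-zero exit code into CalledProcessError
    subprocess.run(pdftotext_command(pdf_path, txt_path), check=True)
```

Calling convert("gamesheet.pdf", "gamesheet.txt") (hypothetical file names) would then write the layout-preserving text next to the PDF.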

madref

Re: From PDF to Textfile
« Reply #2 on: September 28, 2019, 01:15:35 pm »

I have this sample project that Howard wrote for me, but I find it very difficult to follow what he has done.
He wrote a program that can 'read' a website and parse the individual penalties into my database, and I have integrated it into this database.
This is one such website: https://www.nijb.nl/nijbsheet.php?GameID=52702&ShowGameSheet=1

Unfortunately for me, Howard doesn't have the time to help me, so I am returning to the forum.

Can anyone put me on the right track to get the PDF text above into my sample project?


madref

Re: From PDF to Textfile
« Reply #3 on: October 02, 2019, 11:02:13 am »
No one?


MarkMLl

  • Hero Member
  • *****
  • Posts: 6692
Re: From PDF to Textfile
« Reply #4 on: October 02, 2019, 11:37:15 am »
Quote from: madref
Is it possible to convert a PDF file into a text file?

In the general case: no.

Depending on the content of the PDF, the operating system, and what other stuff is installed (OCR etc.): perhaps.

If you have a PDF which is entirely encapsulated text together with fount and formatting information, it might be possible to extract the original. If it's a scanned document saved as a bitmap, particularly if it's older stuff or the paper was soiled, then the best you'll get is a bitmap. Everything between those extremes relies on OCR to some extent.
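
A rough way to tell the two extremes apart, sketched in Python for illustration only: run a text extractor first and look at what comes back. The function name and the threshold of 20 characters are assumptions; the idea is simply that an image-only PDF yields little or no extractable text, while a text-based PDF yields plenty.

```python
def has_text_layer(extracted_text, min_chars=20):
    """Heuristic: True if the extracted text holds at least min_chars letters or digits.

    Whitespace and page-break characters are ignored, so output that consists
    only of blank pages counts as "no text".
    """
    return sum(ch.isalnum() for ch in extracted_text) >= min_chars
```

If this returns False for the extractor's output, OCR is probably the only route left.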
MT+86 & Turbo Pascal v1 on CCP/M-86, multitasking with LAN & graphics in 128Kb.
Pet hate: people who boast about the size and sophistication of their computer.
GitHub repositories: https://github.com/MarkMLl?tab=repositories

rvk

Re: From PDF to Textfile
« Reply #5 on: October 02, 2019, 11:38:55 am »
The problem is that the original uses an HTML table structure, so you can be sure where certain numbers are.

The txt variant doesn't have <td> etc. It doesn't even have tabs. So you would need to read every line and work out which number sits at which position.

To top it all off, it seems that not all columns and column headers line up exactly. For instance, the Nr header is left-justified and the actual number is right-justified.
Code: [Select]
Pos Nr          Name                                           Lic              Sit       G          Time         G    A     A          Time         Nr       Duration     Penalty     Start     End
GO      31      Barendregt Tom                                 99998            PP1       1          21:07        28   15    11         05:27        10       5            BOAR        05:27     10:27

In theory you can read the file until you encounter "Pos Nr". Then read that line and record the index position of every 'field'. Then read the following lines and extract the information. That's how Howard did it for you for the HTML. But converting the example (or writing it from scratch) is going to take quite some time, and someone has to put that time in.
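
The idea above can be sketched as follows, in Python for illustration only (the same logic maps onto Pos/Copy/Trim in Pascal). All function names are hypothetical. The sketch assumes each value starts at or to the right of its own header word and ends before the next header word; a value that starts left of its header would be clipped, which is exactly the alignment risk mentioned above. Fields are returned as (name, value) pairs because header names such as G, A, Time and Nr occur more than once.

```python
import re


def column_starts(header_line):
    """Return (name, start_index) for every whitespace-separated header word."""
    return [(m.group(), m.start()) for m in re.finditer(r"\S+", header_line)]


def split_row(row_line, columns):
    """Slice a data row at the header start positions and trim each piece."""
    fields = []
    for i, (name, start) in enumerate(columns):
        # A field runs up to the next header's start; the last one runs to end of line.
        end = columns[i + 1][1] if i + 1 < len(columns) else None
        fields.append((name, row_line[start:end].strip()))
    return fields


def parse_report(lines):
    """Skip everything up to the 'Pos Nr' header, then parse the non-blank rows after it."""
    columns = None
    rows = []
    for line in lines:
        if columns is None:
            if line.lstrip().startswith("Pos Nr"):
                columns = column_starts(line)
            continue
        if line.strip():
            rows.append(split_row(line, columns))
    return rows
```

Feeding it the lines of the pdftotext output (for example via open("gamesheet.txt").read().splitlines(), with a hypothetical file name) gives one list of pairs per data row. A real game sheet would still need rules for where the table ends and for repeated headers on later pages.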

Why can't you do it?

Is this information in this PDF file not provided via the same website as HTML?

Quote from: madref
Is it possible to convert a PDF file into a text file?
Quote from: MarkMLl
In the general case: no.

Actually, this PDF is just text and can be converted to .txt with pdftotext, as I showed in my post.
The problem is reading and interpreting the information afterwards.

madref

Re: From PDF to Textfile
« Reply #6 on: October 02, 2019, 07:10:37 pm »
The HTML site comes from a Dutch website and the PDF comes from a Belgian website.
Both are in Dutch, but they have different origins.

@rvk
The package you pointed me to also has a program called pdftohtml. Could that work for me?

rvk

Re: From PDF to Textfile
« Reply #7 on: October 03, 2019, 10:28:13 am »
Quote from: madref
The method you showed me has also a program called pdftohtml can this work for me?

I'm not sure if that makes it any easier. I don't know if you realize how much time was spent on converting the original.

Attached is the output of pdftohtml.

Looking at the first two rows, you can see that "Pos Nr Name" all end up in one column.
Furthermore, the page is not converted to a table but to a lot of <div>s.

I think extracting the data from that is more work than from the .txt variant.

madref

Re: From PDF to Textfile
« Reply #8 on: October 03, 2019, 05:49:21 pm »
I think I am going to abandon this, because manual input is much quicker than the work I would need to put into this.


Thanks for all your input.