
Author Topic: Reading data files that are Chaotic.  (Read 1789 times)

OC DelGuy

  • Full Member
  • ***
  • Posts: 208
  • 123
Reading data files that are Chaotic.
« on: April 09, 2025, 07:42:22 pm »
I have a text file with travel directions to different places.
I have no code as of yet because I'm barely starting out by studying the text file.

There are many sections to the file.  Each section is like this:
First there is a Destination line.        The destination line has no header.
Then there's the Trailer line.             This line also lacks a header.
Third is the Start Location.               This is the first line with a header: "Location:".
Fourth is the time line.                     The Header is "Time:".  This is the time allowed to go from start to finish (in Minutes).
Fifth is the stop line.                        The Header is "Stop:".  This line only shows up sometimes; it is left out when the trip is too short to need a stop.
The last part is the directions.          This part has turn by turn directions.  The header, "Directions:", is alone in the first line and the directions follow in successive lines.

So, I'm thinking that each section will be a record.  So I'll Start with:  (I just thought of this while writing this post)

Code: Pascal
  Type
    TDirs = Record
      Dest  : String;
      Trail : String;
      Loc   : String;
      Time  : Integer;   // Integer because I have to convert the minutes into Hours and Minutes.
      Stop  : String;
      Dirs  : String;
    End;

  Var
    Dir : Array of TDirs;
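A minimal sketch of the minutes conversion mentioned in the Time comment. It assumes the value after "Time:" can carry a thousands separator, as in "1,584" from the sample data, which StrToInt would reject as it stands. ParseMinutes and MinutesToHM are hypothetical helper names, not part of any library.

```pascal
program MinutesDemo;

{$mode objfpc}{$H+}

uses
  SysUtils;

// Hypothetical helper: removes "," thousands separators before converting,
// because StrToInt('1,584') raises EConvertError.
function ParseMinutes(const S: String): Integer;
begin
  Result := StrToInt(StringReplace(Trim(S), ',', '', [rfReplaceAll]));
end;

// Hypothetical helper: formats a minute count as hours and minutes.
function MinutesToHM(Minutes: Integer): String;
begin
  Result := Format('%d h %.2d min', [Minutes div 60, Minutes mod 60]);
end;

begin
  WriteLn(MinutesToHM(ParseMinutes('1,584')));  // prints "26 h 24 min"
  WriteLn(MinutesToHM(ParseMinutes('223')));    // prints "3 h 43 min"
end.
```

Keeping Time as an Integer in the record and formatting only for display keeps the stored value easy to compare or sum.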


Problem:
The directions are a bunch of headerless lines and there is no separation of lines between the sections.  This means that the first two lines of the next section, being headerless, look like the last two lines of the current directions.  The only way to tell the sections apart is the "Location:" header, and by the time I reach it I'm already two lines into the next section.  How do I read the file and figure out where the sections actually separate?

The file has hundreds of directions, but here is an example:
Code: Text
  Bob's Diner
  Reefer
  Location: Mike's Beef Barn
  Time: 1,584
  Stop: Highway Hotel.
  Directions:
  Go south on Broadway.
  Get on freeway 30 South. Exit Main St.
  Go south on Main to Highway Hotel.
  Go north on Main.  Go east on Highway 15.  Exit Market St.
  1522 Market St.
  Comfy Pillows and Mattresses
  Dry Van
  Location: Innovative Furniture
  Time: 223
  Directions:
  Turn right and go to Seville Ave.
  Turn left on St. Michael St.
  Go to end at the Comfy Sign.
  Washington County Construction
  Low Boy
  Location: Marble Quarry in Gainesville.
  Time: 1,135
  Stop: Mountainside Motel.
  Directions:
  Make a U-turn on Alameda Ave. and a right on Chantilly Lane.
  Get on freeway 67 East.
  Pass the Hospital and Exit Seward St.
  Turn left on Coral St. to Mountainside Motel on right.
  Exit motel and go east on Jefferson and then right on Belhaven.
  Follow Joelle Ln. to Hwy 49.
  Turn right at the Red Hill Junction.  Go east on Harewood Rd.
  Turn left on Blue Heron St.
  8964 Red Rooster St.


As you can see, the file is very chaotic.  I can start by reading the first two lines and just put them in their corresponding variables.  Then I can read the next ones by searching for the headers and put them where they belong.  But then, with the directions, they're all over the place!  And I can't figure out how to keep the first two lines of the next set of directions out of the current set.  As a human I can see that the first set ends with the address, the second because it says go to the Comfy Sign, and the last one ends at the EOF.  But I can't program that for hundreds of directions that are all different.  I'll be writing until next year!

So, how can I read a file, put the information in my record, but then backtrack and Delete the last two lines, and then re-read them from the file and put them in the next record?
Free Pascal Lazarus Version #: 2.2.4
Date: 24 SEP 2022
FPC Version: 3.2.2
Revision: Lazarus_2_2_4
x86_64-win64-win32/win64

cdbc

  • Hero Member
  • *****
  • Posts: 2216
    • http://www.cdbc.dk
Re: Reading data files that are Chaotic.
« Reply #1 on: April 09, 2025, 08:20:08 pm »
Hi
Hmmm... How big would this mess be, like file-size-wise?!?
You could run through the file line by line, keeping the last 2 lines cached in memory, then at least you can use 'Location:' as a break-point, where the next record starts 2 lines back, namely the 2 lines you've got cached...
Somewhat like 'a poor man's look-ahead', but looking back instead.
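A minimal sketch of that look-back idea, assuming every record really does open with exactly two headerless lines before "Location:". Regroup is a hypothetical helper that works on an in-memory array instead of a file to keep the example short; the sample lines are shortened from the first post.

```pascal
program LookBackDemo;

{$mode objfpc}{$H+}

uses
  Classes, SysUtils;

const
  Sample: array[0..10] of String = (
    'Bob''s Diner', 'Reefer', 'Location: Mike''s Beef Barn', 'Time: 1,584',
    'Directions:', 'Go south on Broadway.', '1522 Market St.',
    'Comfy Pillows and Mattresses', 'Dry Van',
    'Location: Innovative Furniture', 'Time: 223');

// Hypothetical helper: copies Lines to a new list and turns each headerless
// Destination/Trailer pair into a single 'Record: ...' marker line.
function Regroup(const Lines: array of String): TStringList;
var
  Held: array[0..1] of String;  // the two newest lines, not yet committed
  HeldCount, i: Integer;
begin
  Result := TStringList.Create;
  HeldCount := 0;
  for i := 0 to High(Lines) do
    if Pos('Location:', Trim(Lines[i])) = 1 then
    begin
      // Assumption: both held lines exist; they open the NEW record.
      Result.Add('Record: ' + Held[0] + ' / ' + Held[1]);
      Result.Add(Lines[i]);
      HeldCount := 0;
    end
    else
    begin
      if HeldCount = 2 then
      begin
        // A third line arrived, so the oldest held line is safe to commit.
        Result.Add(Held[0]);
        Held[0] := Held[1];
        HeldCount := 1;
      end;
      Held[HeldCount] := Lines[i];
      Inc(HeldCount);
    end;
  // End of input: whatever is still held belongs to the last record.
  for i := 0 to HeldCount - 1 do
    Result.Add(Held[i]);
end;

var
  Regrouped: TStringList;
  S: String;
begin
  Regrouped := Regroup(Sample);
  try
    for S in Regrouped do
      WriteLn(S);
  finally
    Regrouped.Free;
  end;
end.
```

A line is only committed to the current record once two newer lines have been seen, so the Destination and Trailer of the next record never leak into the directions.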
Regards Benny
If it ain't broke, don't fix it ;)
PCLinuxOS(rolling release) 64bit -> KDE5 -> FPC 3.2.2 -> Lazarus 3.6 up until Jan 2024 from then on it's both above &: KDE5/QT5 -> FPC 3.3.1 -> Lazarus 4.99

Remy Lebeau

  • Hero Member
  • *****
  • Posts: 1530
    • Lebeau Software
Re: Reading data files that are Chaotic.
« Reply #2 on: April 09, 2025, 08:30:39 pm »
Quote from: OC DelGuy
Code: Pascal
  TDirs = Record
    ...
    Dirs  : String;
  End;

Why a single String for the Dirs and not an array of strings, or a TStringList?
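For illustration, a minimal sketch of the record with the directions held as an array of strings. The record is cut down to two fields, and AddDirection and DirsAsText are hypothetical helper names.

```pascal
program DirsAsListDemo;

{$mode objfpc}{$H+}

uses
  SysUtils;

type
  TDirs = record
    Dest : String;
    Dirs : array of String;  // one element per direction line
  end;

// Hypothetical helper: appends one direction line to the record.
procedure AddDirection(var R: TDirs; const Line: String);
begin
  SetLength(R.Dirs, Length(R.Dirs) + 1);
  R.Dirs[High(R.Dirs)] := Line;
end;

// Hypothetical helper: joins the lines into one string for display.
function DirsAsText(const R: TDirs): String;
var
  i: Integer;
begin
  Result := '';
  for i := 0 to High(R.Dirs) do
  begin
    if i > 0 then
      Result := Result + LineEnding;
    Result := Result + R.Dirs[i];
  end;
end;

var
  R: TDirs;
begin
  R.Dest := 'Bob''s Diner';
  AddDirection(R, 'Go south on Broadway.');
  AddDirection(R, '1522 Market St.');
  WriteLn(Length(R.Dirs), ' direction lines for ', R.Dest);
  WriteLn(DirsAsText(R));
end.
```

With separate elements, dropping or moving the last two lines of a record is a SetLength call rather than string surgery.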

Quote from: OC DelGuy
"The directions are a bunch of headerless lines and there is no separation of lines between the sections."

Where are you getting this file from, and why didn't the designer choose to provide a separator between different sections?
Remy Lebeau
Lebeau Software - Owner, Developer
Internet Direct (Indy) - Admin, Developer (Support forum)

Fibonacci

  • Hero Member
  • *****
  • Posts: 754
  • Internal Error Hunter
Re: Reading data files that are Chaotic.
« Reply #3 on: April 09, 2025, 09:14:43 pm »
How about regex (thanks GPT)

https://regex101.com/r/LbdpbF/1
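The pattern behind the link is not reproduced here. As a much smaller, line-by-line illustration of the regex idea, this sketch uses the TRegExpr class from FPC's RegExpr unit to recognise the four headed lines; the pattern and the SplitHeader helper are assumptions made for this example, not the linked expression.

```pascal
program HeaderRegexDemo;

{$mode objfpc}{$H+}

uses
  SysUtils, RegExpr;

// Hypothetical helper: returns True and fills Key/Value when Line is one of
// the four headed lines; headerless lines return False.
function SplitHeader(const Line: String; out Key, Value: String): Boolean;
var
  Re: TRegExpr;
begin
  Key := '';
  Value := '';
  Re := TRegExpr.Create;
  try
    // Assumed pattern: optional indent, a known header, a colon, the rest.
    Re.Expression := '^\s*(Location|Time|Stop|Directions):\s*(.*)$';
    Result := Re.Exec(Line);
    if Result then
    begin
      Key := Re.Match[1];
      Value := Re.Match[2];
    end;
  finally
    Re.Free;
  end;
end;

const
  Sample: array[0..3] of String = (
    'Reefer', 'Location: Mike''s Beef Barn', 'Time: 1,584', 'Directions:');

var
  Line, Key, Value: String;
begin
  for Line in Sample do
    if SplitHeader(Line, Key, Value) then
      WriteLn(Key, ' => "', Value, '"')
    else
      WriteLn('(no header) ', Line);
end.
```

This only classifies single lines; deciding which headerless lines belong to which record still needs one of the look-back or backward approaches discussed in this thread.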

TRon

  • Hero Member
  • *****
  • Posts: 4377
Re: Reading data files that are Chaotic.
« Reply #4 on: April 09, 2025, 10:04:21 pm »
Quote from: Fibonacci
"How about regex (thanks GPT)"
Holy macaroni. Let's comment and maintain that  :)

(it is a nice solution though)

A simple state machine will suffice (just a quick mockup; it could be better).
Code: Pascal
program chaostheory;

{$mode objfpc}{$h+}

uses
  classes, sysutils;

procedure HandleFile(Filename: string);
var
  Lines        : TStringList;
  Line         : string;
  LineIdx      : SizeInt = 0;
  State        : integer = 0;
  DirsCnt      : integer = 0;
  DirsIdxStart : integer;
  DirsIdx      : integer;
begin
  Lines := TStringList.Create;
  try
    Lines.LoadFromFile(Filename);

    while LineIdx < Lines.Count do
    begin
      Line := Lines[LineIdx];

      // state 0: look for the Location header; the two lines before it are
      // the headerless Destination and Trailer lines
      if State = 0 then
      begin
        if Line.Trim.StartsWith('Location:') and (LineIdx >= 2) then
        begin
          writeln('========================================');
          writeln('Destination   = ', Lines[LineIdx-2]);
          writeln('Trailer       = ', Lines[LineIdx-1]);
          writeln('Location      = ', Line);
          inc(State);
        end;
      end

      else
      // state 1: the Time header must follow
      if State = 1 then
      begin
        if Line.Trim.StartsWith('Time:') then
        begin
          writeln('Time          = ', Line);
          inc(state);
        end
        else writeln('ERROR: Time expected');
      end

      else
      // state 2: optional Stop header
      if State = 2 then
      begin
        if Line.Trim.StartsWith('Stop:') then
        begin
          writeln('Stop          = ', Line);
          inc(state);
        end
        else
        // if stop does not appear at the start of this line then assume it is
        // omitted which means handle directions instead
        if Line.Trim.StartsWith('Directions:') then
        begin
          inc(state);
          continue;
        end
        else writeln('ERROR: Stop or Directions expected');
      end
      else

      // state 3: the Directions header itself
      if State = 3 then
      begin
        // skip line
        if Line.Trim.StartsWith('Directions:') then
        begin
          inc(state);
          DirsIdxStart := LineIdx+1;
          DirsCnt := 0;
        end
        else writeln('ERROR: Directions expected');
      end
      else

      // state 4: count direction lines until the next Location header
      if State = 4 then
      begin
        if Line.Trim.StartsWith('Location:') then
        begin
          // the last two counted lines belong to the next record
          for DirsIdx := DirsIdxStart to DirsIdxStart + pred(dirsCnt-2)
            do writeln('Direction[', DirsIdx:2, '] = ', Lines[DirsIdx]);
          State := 0;
          // do not skip handling location
          continue;
        end
        else
          inc(DirsCnt);
      end;
      inc(lineIdx);
    end;

    // last directions: no record follows, so every counted line is a
    // direction and none must be held back (blank lines are skipped)
    if (State = 4) and (DirsCnt > 0) then
      for DirsIdx := DirsIdxStart to DirsIdxStart + Pred(dirsCnt) do
        if not Lines[DirsIdx].Trim.IsEmpty
          then writeln('Direction[', DirsIdx:2, '] = ', Lines[DirsIdx]);
  finally
    Lines.Free;
  end;
end;


begin
  HandleFile('data.txt');
end.

Today is tomorrow's yesterday.

Zvoni

  • Hero Member
  • *****
  • Posts: 2982
Re: Reading data files that are Chaotic.
« Reply #5 on: April 10, 2025, 08:24:00 am »
Alternative algorithm
1) Load the whole file into a StringList (or whatever else, Array....)
2) Go backwards through the StringList to fill your Record
3) When you reach "Location" you KNOW what the next 2 lines are.....

What Remy said for "Directions": use an array of String or a StringList. With my approach the keyword is "insert" at position 0, since you are stepping backwards through the directions. That way you'll get them in the correct order.
That said: I'd use a StringList (unsorted of course!)

To "collect" everything: probably a StringList, with the "Destination" as the entry and the Record (a pointer to it) as the Object associated with that entry.

No idea about Performance. Probably poor compared to the others, since "Inserts" are expensive
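A minimal sketch of the backward walk, not the attached proof of concept. It assumes every record has a "Directions:" line and two headerless lines above "Location:". ParseBackward and ValueOf are hypothetical helpers, and each record is reduced to a one-line summary to keep the example short.

```pascal
program BackwardDemo;

{$mode objfpc}{$H+}

uses
  Classes, SysUtils;

const
  Sample: array[0..12] of String = (
    'Bob''s Diner', 'Reefer', 'Location: Mike''s Beef Barn', 'Time: 1,584',
    'Directions:', 'Go south on Broadway.', '1522 Market St.',
    'Comfy Pillows and Mattresses', 'Dry Van',
    'Location: Innovative Furniture', 'Time: 223', 'Directions:',
    'Go to end at the Comfy Sign.');

// Hypothetical helper: the text after the first ':' of a headed line.
function ValueOf(const Line: String): String;
begin
  Result := Trim(Copy(Line, Pos(':', Line) + 1, MaxInt));
end;

// Walks Lines from the end and adds one summary line per record to Summary.
procedure ParseBackward(Lines, Summary: TStrings);
var
  Dirs: TStringList;
  S: String;
  i: Integer;
  InDirs: Boolean;
begin
  Dirs := TStringList.Create;
  try
    InDirs := True;              // the file ends inside a directions block
    i := Lines.Count - 1;
    while i >= 0 do
    begin
      S := Trim(Lines[i]);
      if InDirs then
      begin
        if S = 'Directions:' then
          InDirs := False        // this record's header lines come next
        else if S <> '' then
          Dirs.Insert(0, S);     // prepend so the original order is kept
      end
      else if (Pos('Location:', S) = 1) and (i >= 2) then
      begin
        // Going up, the two lines above Location are Trailer and Destination.
        Summary.Insert(0, Format('%s | %s | %s | %d direction lines',
          [Trim(Lines[i - 2]), Trim(Lines[i - 1]), ValueOf(S), Dirs.Count]));
        Dirs.Clear;
        InDirs := True;
        Dec(i, 2);               // skip the two lines just consumed
      end;
      // 'Time:' and 'Stop:' lines fall through here; a full version would
      // store them in the record in the same way.
      Dec(i);
    end;
  finally
    Dirs.Free;
  end;
end;

var
  Lines, Summary: TStringList;
  S: String;
begin
  Lines := TStringList.Create;
  Summary := TStringList.Create;
  try
    for S in Sample do
      Lines.Add(S);
    ParseBackward(Lines, Summary);
    for S in Summary do
      WriteLn(S);
  finally
    Summary.Free;
    Lines.Free;
  end;
end.
```

Because the walk starts at the end of the file, the last record needs no special case: every headerless line met before a "Directions:" line is a direction.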

EDIT: Proof of concept attached
And i know it's leaking like a sieve...
« Last Edit: April 10, 2025, 10:00:57 am by Zvoni »
One System to rule them all, One Code to find them,
One IDE to bring them all, and to the Framework bind them,
in the Land of Redmond, where the Windows lie
---------------------------------------------------------------------
Code is like a joke: If you have to explain it, it's bad

cdbc

Re: Reading data files that are Chaotic.
« Reply #6 on: April 10, 2025, 08:30:46 am »
Hi
As @TRon said: "Use a state-machine", he even gave a nice example...
Regards Benny

OC DelGuy

Re: Reading data files that are Chaotic.
« Reply #7 on: April 10, 2025, 11:19:13 pm »
Quote from: cdbc
"Hmmm... How big would this mess be, like file-size-wise?!?"
614 KB

Quote from: Remy Lebeau
"Why a single String for the Dirs and not an array of strings, or a TStringList?"
I just wrote the Record as I was posting.  As of yet there is no code.  So I was thinking of the directions going in a TMemo.  So I just thought: "TMemo, just a real long string or a bunch of successive strings read as a real long string."

Quote from: Remy Lebeau
"Where are you getting this file from, and why didn't the designer choose to provide a separator between different sections?"
The file is a PDF.  The first line (Destination) is big type with bold print.  And the last line is double-spaced from the next record.  That was the separator.
Just figure a little secretary typing away on a little DOS box using WordPerfect back in 1989.   :o :D :D

Quote from: Zvoni
"1) Load the whole file into a StringList (or whatever else, Array....)
2) Go backwards through the StringList to fill your Record
3) When you reach "Location" you KNOW what the next 2 lines are....."
Holy Genius, Batman!

Remy Lebeau

Re: Reading data files that are Chaotic.
« Reply #8 on: April 11, 2025, 08:02:12 am »
Quote from: OC DelGuy
"The file is a PDF.  The first line (Destination) is big type with bold print.  And the last line is double-spaced from the next record.  That was the separator."

There are libraries to parse PDF files. That might be a more reliable solution in the long run so you don't have to guess at the data.

OC DelGuy

Re: Reading data files that are Chaotic.
« Reply #9 on: April 11, 2025, 04:08:48 pm »
Quote from: Remy Lebeau
"There are libraries to parse PDF files."
Do you have any names and links?  Maybe I can use one.

cdbc

Re: Reading data files that are Chaotic.
« Reply #10 on: April 11, 2025, 04:47:48 pm »
Hi
'muPDF' springs to mind, just a couple of days ago @Boleeman posted an app, that he made with this library...
If it can read and show a pdf, then it sure as h*ll can open and extract text too.

edit: Found it.... HERE Reply #7.
Regards Benny
« Last Edit: April 11, 2025, 04:53:55 pm by cdbc »

Zvoni

Re: Reading data files that are Chaotic.
« Reply #11 on: April 11, 2025, 07:09:18 pm »
Pdfium is another

OC DelGuy

Re: Reading data files that are Chaotic.
« Reply #12 on: April 21, 2025, 01:21:24 pm »
So I worked on this a while, and this is what I have.  It works fine up until the end.  At the end it stops and says: "List index out of bounds."
Everywhere I increment the "i" variable I've prefaced it with If i < sl.Count Then Inc(i);
I just can't see what's wrong.
I even checked the data, it's all in the right place.  I checked about 10 records in different parts of the file and they all come out the way they should.  All the data is in the right fields.  It's just that at the end, it goes out of bounds.
Help!

Code: Pascal
unit ListMaker;

{$mode objfpc}{$H+}

interface

uses
  Classes, SysUtils, Forms, Controls, Graphics, Dialogs, StdCtrls;

type

  { TForm1 }

  TForm1 = class(TForm)
    BtnLoad: TButton;
    BtnDisplay: TButton;
    BtnSave: TButton;
    Edit1: TEdit;
    Edit10: TEdit;
    Edit2: TEdit;
    Edit3: TEdit;
    Edit4: TEdit;
    Edit5: TEdit;
    Edit6: TEdit;
    Edit7: TEdit;
    Edit8: TEdit;
    Edit9: TEdit;
    Label1: TLabel;
    Label10: TLabel;
    Label2: TLabel;
    Label3: TLabel;
    Label4: TLabel;
    Label5: TLabel;
    Label6: TLabel;
    Label7: TLabel;
    Label8: TLabel;
    Label9: TLabel;
    ListBox1: TListBox;
    Memo1: TMemo;
    procedure BtnLoadClick(Sender: TObject);
  private

  public

  end;

  TDirs = packed record
    Dest: String;
    Trail: String;
    Loc: String;
    Time: String;
    Mileage: String;
    Trac: String;
    MCNum: String;
    Stop: String;
    Diesel: String;
    Scales: String;
    Dirs: String;
  end;

var
  Dir: array of TDirs;
  Form1: TForm1;

implementation

{$R *.lfm}

{ TForm1 }

procedure TForm1.BtnLoadClick(Sender: TObject);
var
  sl : TStringList;
  i, x, y : Integer;
  Temp: Array[1..30] of String;
begin
  sl := TStringList.Create;
  Memo1.Lines.Clear;
  try
    sl.LoadFromFile('Routes.txt');
    SetLength(Dir, 0);
    i := 0;
    ListBox1.Items.Assign(sl);
    while i < sl.Count do
    begin
      if Trim(sl[i]) = '' then
      begin
        If i < sl.Count Then Inc(i);
        Continue;
      end;  // If Trim
      SetLength(Dir, Length(Dir) + 1);
      with Dir[High(Dir)] do
      begin
        Dest := Trim(sl[i]);
        If i < sl.Count Then Inc(i);
        Trail := Trim(sl[i]);
        If i < sl.Count Then Inc(i);
        Loc := Copy(sl[i], Pos(':', sl[i]) + 2, MaxInt);
        If i < sl.Count Then Inc(i);
        Time := Copy(sl[i], Pos(':', sl[i]) + 2, MaxInt);
        If i < sl.Count Then Inc(i);
        Mileage := Copy(sl[i], Pos(':', sl[i]) + 2, MaxInt);
        If i < sl.Count Then Inc(i);
        Trac := Copy(sl[i], Pos(':', sl[i]) + 2, MaxInt);
        If i < sl.Count Then Inc(i);
        MCNum := Copy(sl[i], Pos(':', sl[i]) + 2, MaxInt);
        If i < sl.Count Then Inc(i);
      End;  // First With Statement
        Memo1.Lines.Add('FieldName:' + Trim(Copy(sl[i], 0, Pos(':', sl[i]) - 1)) + '---');
        If Trim(Copy(sl[i], 0, Pos(':', sl[i]) - 1)) = 'Stop' Then
          Begin
            Dir[High(Dir)].Stop := Copy(sl[i], Pos(':', sl[i]) + 2, MaxInt);
            If i < sl.Count Then Inc(i);
          End;
      with Dir[High(Dir)] do
      begin
        Diesel := Copy(sl[i], Pos(':', sl[i]) + 2, MaxInt);
        If i < sl.Count Then Inc(i);
        Scales := Copy(sl[i], Pos(':', sl[i]) + 2, MaxInt);
        If i < sl.Count Then Inc(i);
        Dirs := Copy(sl[i], Pos(':', sl[i]) + 2, MaxInt);
        If i < sl.Count Then Inc(i);
      end;  // Second With Statement

        // Directions
        x:=1;
        Dir[High(Dir)].Dirs := '';
        Memo1.Lines.Add(IntToStr(Pos('Location:', Trim(sl[i]))));
        While (i < sl.Count) and (Pos('Location:', Trim(sl[i])) = 0) and (x<=Length(Temp)) Do
          Begin
            Memo1.Lines.Add('Herein lies i: ' + IntToStr(i) + '.      And the count is: ' + IntToStr(sl.Count));
            Memo1.Lines.Add('----- ' + IntToStr(x) + ' -----' + Trim(sl[i]));
            Temp[x]:= Trim(sl[i]);
            Memo1.Lines.Add('Hereafter lies i: ' + IntToStr(i) + '.      And the count is: ' + IntToStr(sl.Count));
            If i < sl.Count Then Inc(i);
            Memo1.Lines.Add('After the Inc i: ' + IntToStr(i) + '.      And the count is: ' + IntToStr(sl.Count));
            Inc(x);
          End;
        Memo1.Lines.Add('Found Location ' + Trim(sl[i]));
        For y := 1 to x - 3 Do Dir[High(Dir)].Dirs := Dir[High(Dir)].Dirs + Temp[y];

      i := i - 2;

    end;  // While i < sl.Count
  finally
    sl.Free;
  end;
end;

end.

dseligo

  • Hero Member
  • *****
  • Posts: 1522
Re: Reading data files that are Chaotic.
« Reply #13 on: April 21, 2025, 01:43:24 pm »
Quote from: OC DelGuy
"I just can't see what's wrong."

I didn't analyze what you are trying to do in your code, but it isn't correct for sure.

I.e.:
Code: Pascal
        If i < sl.Count Then Inc(i);
        Trail := Trim(sl[i]);

You are checking if variable 'i' is less than sl.Count.

Let's assume sl.Count is 100 and variable 'i' is 99. So you increase it to 100.

In next line you are accessing string list with sl[ i ]. Index here shouldn't be larger than 99.

And you have many of these increments in your code.

Maybe you should do it differently, something like this:
Code: Pascal
        Inc(i);
        If i >= sl.Count Then Exit; // or Break or whatever you do when you are at the end
        Trail := Trim(sl[i]);
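Since that pattern repeats for every field, it can be wrapped once. A minimal sketch, assuming a hypothetical helper named Advance; the two-line list only stands in for the real file.

```pascal
program SafeAdvanceDemo;

{$mode objfpc}{$H+}

uses
  Classes, SysUtils;

// Hypothetical helper: moves Index to the next line and reports whether that
// line exists. The bounds check happens AFTER the increment, so a True
// result always means Lines[Index] is safe to read.
function Advance(Lines: TStrings; var Index: Integer): Boolean;
begin
  Inc(Index);
  Result := Index < Lines.Count;
end;

var
  sl: TStringList;
  i: Integer;
begin
  sl := TStringList.Create;
  try
    sl.Add('Bob''s Diner');
    sl.Add('Reefer');
    i := 0;
    WriteLn('Dest:  ', sl[i]);
    if Advance(sl, i) then
      WriteLn('Trail: ', sl[i]);
    if not Advance(sl, i) then
      WriteLn('End of data at index ', i);  // i is 2 here, equal to sl.Count
  finally
    sl.Free;
  end;
end.
```

Each field read then becomes "if not Advance(sl, i) then Break;" followed by the access, which keeps the check and the increment in one place.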

Edit: I didn't use code tags with variable 'i'.
« Last Edit: April 21, 2025, 01:46:44 pm by dseligo »

paweld

  • Hero Member
  • *****
  • Posts: 1420
Re: Reading data files that are Chaotic.
« Reply #14 on: April 21, 2025, 01:48:23 pm »
change all occurrences:
Code: Pascal
  If i < sl.Count Then Inc(i);
to:
Code: Pascal
  If i < sl.Count - 1 Then Inc(i);
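A small sketch of what the clamped increment does at the end of the list, using a three-line list as a stand-in and a hypothetical wrapper named ClampedInc: the index stops at Count - 1, so sl[i] stays valid, but it also never reaches sl.Count.

```pascal
program ClampDemo;

{$mode objfpc}{$H+}

uses
  Classes, SysUtils;

// Hypothetical wrapper around the clamped increment, for the demo only.
function ClampedInc(Index, Count: Integer): Integer;
begin
  if Index < Count - 1 then
    Inc(Index);
  Result := Index;
end;

var
  sl: TStringList;
  i, Step: Integer;
begin
  sl := TStringList.Create;
  try
    sl.Add('A');
    sl.Add('B');
    sl.Add('C');
    i := 0;
    for Step := 1 to 5 do
    begin
      i := ClampedInc(i, sl.Count);
      // From the second step on, i stays at 2 and the line stays 'C'.
      WriteLn('step ', Step, ': i = ', i, ', line = ', sl[i]);
    end;
  finally
    sl.Free;
  end;
end.
```

A loop whose exit condition is "i < sl.Count" may therefore need a separate end-of-data check once the index is clamped this way.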
Best regards / Pozdrawiam
paweld

 
