Author Topic: DataProblems Maybe  (Read 6274 times)

JLWest

  • Hero Member
  • *****
  • Posts: 595
Re: DataProblems Maybe
« Reply #30 on: April 24, 2019, 05:11:16 pm »
Once the problems with loading the files are solved, that should be easy; just a matter of
Code: [Select]
  Data.WhateverField := ExtractWord(X, TheLine)
and then generating the other (calculated?) fields.
I have a question. If the data is UTF-8, then the record's short strings and ExtractWord do not work, so it gets converted to ANSI. Does that throw away the UTF-8 encoding, so that fancy letters (Greek, hieroglyphs, etc.) can no longer be written?

"Once the problems with loading the files are solved,"

Yea, I need to convert the UTF-8 BOM file to ASCII text. Then this problem will be solved.

FPC 3.2.0, Lazarus IDE v2.0.4
 Windows 10 Pro
 Intel i7 770K CPU 4.2GHz 32702MB Ram
GeForce GTX 1080 Graphics - 8 Gig
4.1 TB

JLWest

  • Hero Member
  • *****
  • Posts: 595
Re: DataProblems Maybe
« Reply #31 on: April 24, 2019, 05:29:07 pm »
I'm going to write a program to load the 38,000 records into a Memo1 box using lucamar's LoadListFromFile procedure, then into a listbox, and finally write them to an ASCII text file.

If I can get this to work I will move the LoadListFromFile procedure back in the processing chain and use it to convert the file when I extract the 38,000 records  from the 7.9 million records.

« Last Edit: April 24, 2019, 05:30:47 pm by JLWest »

lucamar

  • Hero Member
  • *****
  • Posts: 2081
Re: DataProblems Maybe
« Reply #32 on: April 24, 2019, 05:33:18 pm »
UTF8 BOM   <--- No Idea what that is.

BOM means Byte-Order Mark; by convention it is the Unicode character U+FEFF (zero-width no-break space). Head over to Wikipedia to learn more: Byte order mark

Quote
The data is extracted from a file of 7.9 million records. And I guess the 7.9 million records are UTF8 BOM.

No. The mark is for the file and there should be only one, at the beginning.

Quote
"If you can't avoid having the files with the UTF-8 BOM, you can load them first in a TMemo and assign the Memo.Lines to the listbox items."

Don't understand the "and assign the Memo.Lines to the listbox items" part.

Listbox1.Items.Add(Line) :=   Memo.Lines ???

Rather like this:
Code: [Select]
Listbox1.Items.AddStrings(Memo.Lines, True)
Quote
If I load the 38,000 records into a Memo1 and then load them into a listbox and then save them to a text file.

Will that get rid of the UTF-8 BOM in the text file?

Yes, it should. Lazarus never saves a UTF-8 BOM (unless it's already there, of course).

Quote
Not against pre-processing the file into ASCII if there is a way.

Easiest way? Use file streams and copy from the fourth byte of the source. If you wait a little I will write you a function to do it.

I'm going to write a program to load the 38,000 records into a Memo1 box using lucamar's LoadListFromFile procedure, then into a listbox, and finally write them to an ASCII text file.

Before doing that, wait a little time and I'll write you a more convenient function involving much less memory. Allow me half an hour or so, OK?


I have a question. If the data is UTF-8, then the record's short strings and ExtractWord do not work, so it gets converted to ANSI. Does that throw away the UTF-8 encoding, so that fancy letters (Greek, hieroglyphs, etc.) can no longer be written?

No, the problem here is that the file has a BOM that is not being taken into account. FPC and Lazarus can, in general, deal perfectly well with mixing strings of various types in different encodings. If anything, the compiler will warn that some automatic conversion or other may result in lost data, which means that yes: assigning hieroglyphics to a short string may not work well.

If the need arises one can always convert the UTF-8 string to a string declared with an appropriate codepage, from which conversion to a short string is normally direct, char to char. Of course, if the short string has to be stored to a file (for example) one should take care to store some sort of reminder of the codepage it is in ... or convert it again to UTF-8 and store that.
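
The conversion chain described above can be sketched roughly as follows. This is only an illustration under assumptions: the type name TGreekString, the choice of code page 1253 and the sample text are made up for the example, and on Unix-like systems the conversion relies on the cwstring unit being installed as the widestring manager.

```pascal
program CodePageDemo;
{$mode objfpc}{$H+}
{$codepage utf8}   // tells the compiler the source literals are UTF-8

uses
  {$ifdef unix}cwstring,{$endif}  // needed on Unix for code page conversion
  SysUtils;

type
  // Hypothetical type for this example: an AnsiString tagged as Windows-1253 (Greek)
  TGreekString = type AnsiString(1253);

var
  U, Back: UTF8String;
  G: TGreekString;
  S: ShortString;
begin
  U := 'αβγ';   // three Greek letters, two bytes each in UTF-8
  G := U;       // the compiler inserts a UTF-8 -> CP1253 conversion here
  S := G;       // copied into the short string; the code page is no longer recorded
  Back := G;    // and back to UTF-8 again, e.g. before storing to a file

  WriteLn('UTF-8 length : ', Length(U));
  WriteLn('CP1253 length: ', Length(G));
  WriteLn('Short length : ', Length(S));
  WriteLn('Round trip ok: ', Back = U);
end.
```

Because the short string itself carries no code page information, the program has to remember separately that S holds CP1253 bytes, which is the "reminder" mentioned above.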
« Last Edit: April 24, 2019, 05:35:35 pm by lucamar »
Turbo Pascal 3 CP/M - Amstrad PCW 8256 (512 KB !!!) :P
Lazarus 2.0.2/2.0.4  - FPC 3.0.4 on:
(K|L)Ubuntu 12..16, Windows XP SP3, various DOSes.

lucamar

  • Hero Member
  • *****
  • Posts: 2081
Re: DataProblems Maybe
« Reply #33 on: April 24, 2019, 06:08:23 pm »
Ok, done. A five-minute, no-frills, etc. BOM-cleaning proc.

Code: Pascal  [Select]
procedure CorrectFile(
  const ASrcName: String; const ADestName: String);
var
  SrcFile, DestFile: TFileStream;
  BOM: String[3] = '   ';
begin
  SrcFile := TFileStream.Create(ASrcName, fmOpenRead or fmShareDenyWrite);
  try
    DestFile := TFileStream.Create(ADestName, fmCreate or fmShareExclusive);
    try
      SrcFile.Read(BOM[1], 3); { Read over the BOM }
      DestFile.CopyFrom(SrcFile, SrcFile.Size - 3); {and copy the rest}
    finally
      DestFile.Free;
    end;
  finally
    SrcFile.Free;
  end;
end;

One caveat: Make sure that you don't pass the same name as source and destination. Rather rename the original to, for example, "ByAirport.original" and call the procedure as:
Code: [Select]
CorrectFile('ByAirport.original', 'ByAirport.txt')
Also note that there is no test to see if the source file actually has a BOM: it just skips the first three bytes, whatever they are. You should add a test like the one in my post above.
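
A possible variant with that test added might look like the sketch below. It is not the procedure posted above: the name CopyWithoutBOM and the constant UTF8_BOM are made up for this example, and it only recognises the UTF-8 mark (EF BB BF). If the first three bytes are anything else, the file is copied unchanged.

```pascal
program StripBomDemo;
{$mode objfpc}{$H+}

uses
  Classes, SysUtils;

const
  // The three bytes of a UTF-8 byte-order mark
  UTF8_BOM: array[0..2] of Byte = ($EF, $BB, $BF);

{ Copies ASrcName to ADestName, dropping a leading UTF-8 BOM if present.
  Returns True when a BOM was found and skipped. }
function CopyWithoutBOM(const ASrcName, ADestName: String): Boolean;
var
  SrcFile, DestFile: TFileStream;
  Head: array[0..2] of Byte;
  Got: Integer;
begin
  SrcFile := TFileStream.Create(ASrcName, fmOpenRead or fmShareDenyWrite);
  try
    DestFile := TFileStream.Create(ADestName, fmCreate or fmShareExclusive);
    try
      Got := SrcFile.Read(Head, SizeOf(Head));
      Result := (Got = SizeOf(Head)) and
                CompareMem(@Head, @UTF8_BOM, SizeOf(Head));
      if not Result then
        SrcFile.Position := 0;  // no BOM: copy from the very start
      // CopyFrom with a count of 0 rewinds and copies the whole stream,
      // so it is only called when bytes actually remain.
      if SrcFile.Size > SrcFile.Position then
        DestFile.CopyFrom(SrcFile, SrcFile.Size - SrcFile.Position);
    finally
      DestFile.Free;
    end;
  finally
    SrcFile.Free;
  end;
end;

begin
  if not FileExists('ByAirport.original') then
    WriteLn('ByAirport.original not found')
  else if CopyWithoutBOM('ByAirport.original', 'ByAirport.txt') then
    WriteLn('BOM removed')
  else
    WriteLn('No BOM found; file copied unchanged');
end.
```

As with the original procedure, the source and destination names must differ. The copy goes stream to stream, so the file is never loaded into a memo or list.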
« Last Edit: April 24, 2019, 06:12:46 pm by lucamar »

lucamar

  • Hero Member
  • *****
  • Posts: 2081
Re: DataProblems Maybe
« Reply #34 on: April 24, 2019, 06:42:16 pm »
Oops! Didn't see this:

i write more wrong all ways skip  :D
Code: Pascal  [Select]
  ...
    FileStream:= TFileStream.Create(Filename, fmOpenRead);
    FileStream.Position:= 3;
    Lines.Clear;
    Lines.LoadFromStream(FileStream);
    FileStream.Free;
  ...

The problem with the above code (besides not checking if there is actually a BOM) is that FileStream.Position should be set to four (4), not three (3). :)

Thausand

  • Full Member
  • ***
  • Posts: 234
Re: DataProblems Maybe
« Reply #35 on: April 24, 2019, 06:59:24 pm »
I have a question. If the data is UTF-8, then the record's short strings and ExtractWord do not work, so it gets converted to ANSI. Does that throw away the UTF-8 encoding, so that fancy letters (Greek, hieroglyphs, etc.) can no longer be written?

No, the problem here is that the file has a BOM that is not being taken into account. FPC and Lazarus can, in general, deal perfectly well with mixing strings of various types in different encodings. If anything, the compiler will warn that some automatic conversion or other may result in lost data, which means that yes: assigning hieroglyphics to a short string may not work well.

If the need arises one can always convert the UTF-8 string to a string declared with an appropriate codepage, from which conversion to a short string is normally direct, char to char. Of course, if the short string has to be stored to a file (for example) one should take care to store some sort of reminder of the codepage it is in ... or convert it again to UTF-8 and store that.
Thank you for the answer and the explanation, lucamar. OK, now I know  :)

Oops! Didn't see this:
That's fine, I added a lot to the post later  :)

Quote
The problem with the above code (besides not checking if there is actually a BOM) is that FileStream.Position should be set to four (4), not three (3). :)
??

The BOM in JLWest's file has 3 bytes:
byte 1, position 0
byte 2, position 1
byte 3, position 2

So the ANSI text starts at position 3, because Position is an index (starting at 0)?


I wrote a very crude BOM detector; please don't use it. It's only a quick and dirty example.
Code: Pascal  [Select]
// Note: Print and proclines are not defined in this excerpt.
procedure HaveBOM(aStream: TStream);
type
  TBummer = record Name:String;BOM:RawByteString;end;
const
  // Longer signatures come first so that, e.g., UTF-32 (LE) is not
  // mistaken for UTF-16 (LE), which shares its first two bytes.
  Bummer: array[0..14] of TBummer =
  (
    (Name: 'UTF-7'      ; BOM: #$2B#$2F#$76#$38#$2D),
    (Name: 'UTF-32 (BE)'; BOM: #$00#$00#$FE#$FF),
    (Name: 'UTF-32 (LE)'; BOM: #$FF#$FE#$00#$00),
    (Name: 'UTF-7'      ; BOM: #$2B#$2F#$76#$38),
    (Name: 'UTF-7'      ; BOM: #$2B#$2F#$76#$39),
    (Name: 'UTF-7'      ; BOM: #$2B#$2F#$76#$2B),
    (Name: 'UTF-7'      ; BOM: #$2B#$2F#$76#$2F),
    (Name: 'UTF-EBCDIC' ; BOM: #$DD#$73#$66#$73),
    (Name: 'GB-18030'   ; BOM: #$84#$31#$95#$33),
    (Name: 'UTF-8'      ; BOM: #$EF#$BB#$BF),
    (Name: 'UTF-1'      ; BOM: #$F7#$64#$4C),
    (Name: 'SCSU'       ; BOM: #$0E#$FE#$FF),
    (Name: 'BOCU-1'     ; BOM: #$FB#$EE#$28),
    (Name: 'UTF-16 (BE)'; BOM: #$FE#$FF),
    (Name: 'UTF-16 (LE)'; BOM: #$FF#$FE)
  );
var
  Buffer : AnsiString;
  Bum    : TBummer;
begin
  if AStream.Size > 4 then SetLength(Buffer, 5) else
  begin
    Print('File is too short for reliable BOM detection, trying anyway');
    SetLength(Buffer, AStream.Size);
  end;
  // Guard against an empty stream, where Buffer[1] does not exist
  if Length(Buffer) > 0 then
    AStream.Read(Buffer[1], Length(Buffer));

  for Bum in Bummer do
  begin
    if Buffer.StartsWith(Bum.BOM) then
    begin
      Print('File has a BOM: ', Bum.Name);
      // Leave the stream just past the BOM, not past the whole buffer
      AStream.Position := AStream.Position - Length(Buffer) + Length(Bum.BOM);
      exit;
    end;
  end;
  Print('File has no BOM');

  // no BOM: rewind to where reading started
  AStream.Position := AStream.Position - Length(Buffer);
end;


procedure procfiles(Filenames: array of string);
var
  FileStream : TFileStream;
  Filename   : string;
  Lines      : TStringList;
begin
  Lines:= TStringList.Create;
  try
    for Filename in Filenames do
    begin
      Print('--------------------------------');
      Print('proc file %s', [filename]);
      Print('--------------------------------');
      FileStream:= TFileStream.Create(Filename, fmOpenRead);
      try
        HaveBOM(FileStream);  // leaves the stream positioned after any BOM
        Lines.Clear;
        Lines.LoadFromStream(FileStream);
      finally
        FileStream.Free;
      end;
      proclines(lines);  // processes the lines as in the JLWest example; not shown here
    end;
  finally
    Lines.Free;
  end;
end;

begin
  procfiles(['ByAirport.txt','Composite.txt']);
end.
« Last Edit: April 24, 2019, 07:05:42 pm by Thausand »

JLWest

  • Hero Member
  • *****
  • Posts: 595
Re: DataProblems Maybe
« Reply #36 on: April 24, 2019, 07:28:38 pm »
@lucamar

I will implement the

procedure CorrectFile( const ASrcName: String; const ADestName: String);

 Thanks

JLWest

  • Hero Member
  • *****
  • Posts: 595
Re: DataProblems Maybe
« Reply #37 on: April 24, 2019, 08:22:28 pm »
Worked I think.

Just tested but appears to be working.

Thanks

JLWest

  • Hero Member
  • *****
  • Posts: 595
Re: DataProblems Maybe
« Reply #38 on: April 24, 2019, 09:42:16 pm »
Yes the conversion from UTF8 BOM worked and now I make the call to
RCD := Decompose(RCD); and it passes me back a RCD with all of the fields filled in perfectly with correct data.

lucamar

  • Hero Member
  • *****
  • Posts: 2081
Re: DataProblems Maybe
« Reply #39 on: April 24, 2019, 10:56:48 pm »
Yes the conversion from UTF8 BOM worked and now I make the call to
RCD := Decompose(RCD); and it passes me back a RCD with all of the fields filled in perfectly with correct data.

Glad it all worked. :)


Quote from: lucamar
The problem with the above code (besides not checking if there is actually a BOM) is that FileStream.Position should be set to four (4), not three (3). :)
??

The BOM in JLWest's file has 3 bytes:
byte 1, position 0
byte 2, position 1
byte 3, position 2

So the ANSI text starts at position 3, because Position is an index (starting at 0)?

Hmm ... you're right, Stream.Position is zero-based, so three is the correct value.

Don't know what I was thinking about  :-[
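
For anyone who wants to check the zero-based behaviour for themselves, here is a small sketch using a memory stream instead of a file. The sample bytes (a UTF-8 BOM followed by the ASCII text 'KJFK') are invented for the example.

```pascal
program PositionDemo;
{$mode objfpc}{$H+}

uses
  Classes;

const
  // Made-up sample data: UTF-8 BOM followed by the four ASCII bytes 'KJFK'
  Sample: array[0..6] of Byte = ($EF, $BB, $BF, $4B, $4A, $46, $4B);
var
  Stream: TMemoryStream;
  Head: array[0..2] of Byte;
  B: Byte;
begin
  Stream := TMemoryStream.Create;
  try
    Stream.WriteBuffer(Sample, SizeOf(Sample));
    Stream.Position := 0;

    // Consume the bytes at positions 0, 1 and 2 (the BOM)
    Stream.ReadBuffer(Head, SizeOf(Head));
    WriteLn('Position after reading the BOM: ', Stream.Position);

    // Position 3 is therefore the first byte after the BOM
    Stream.Position := 3;
    Stream.ReadBuffer(B, 1);
    WriteLn('Byte at position 3: ', Chr(B));
  finally
    Stream.Free;
  end;
end.
```

Running it should report a position of 3 after the BOM has been read, and 'K' as the byte found at position 3.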
« Last Edit: April 24, 2019, 10:58:36 pm by lucamar »

JLWest

  • Hero Member
  • *****
  • Posts: 595
Re: DataProblems Maybe
« Reply #40 on: April 25, 2019, 03:44:24 am »
I would like to thank all for the help I received.

When I get this working (the program finished) I would be willing to place it on my GDrive along with the large data set so you can run it and see what I'm trying to accomplish.

Just as an aside I wouldn't mind in the least if someone would look at the code with a critical eye and make suggestions. I don't mind the criticism, I know I'm not very good and at my age not likely to improve much. But I do read the code and try my best to understand it.

It's very easy to see from the code provided during these problems there are elegant ways of doing things and there are brute blunt force trauma ways (that's me).

I have learned that the Hero Members are experts. What takes me days they do in 10 min. The difficult they do in a minute, the very difficult they do with ease and the impossible takes 15 minutes.

Thank you all.




lucamar

  • Hero Member
  • *****
  • Posts: 2081
Re: DataProblems Maybe
« Reply #41 on: April 25, 2019, 03:51:59 am »
I have learned that the Hero Members are experts. What takes me days they do in 10 min. The difficult they do in a minute, the very difficult they do with ease and the impossible takes 15 minutes.

Well, I thank you. Even though that "hero member" thing just means I have posted way too much in the forum :D

For some strange reason your problems and doubts almost always resonate with me in such a way that I feel compelled to give my utmost to find a nice solution. And I've so much fun doing it ...
« Last Edit: April 25, 2019, 03:55:52 am by lucamar »

Thausand

  • Full Member
  • ***
  • Posts: 234
Re: DataProblems Maybe
« Reply #42 on: April 25, 2019, 03:14:32 pm »
Hmm ... you're right, Stream.Position is zero-based, so three is the correct value.

Don't know what I was thinking about  :-[
I think I did not fully understand what I wrote myself. Many times I don't think it through and write something wrong. So we can shake hands  :D


Yes the conversion from UTF8 BOM worked and now I make the call to
RCD := Decompose(RCD); and it passes me back a RCD with all of the fields filled in perfectly with correct data.
That is very good!! Congratulations. Now you can write more programs  :)

Thausand

  • Full Member
  • ***
  • Posts: 234
Re: DataProblems Maybe
« Reply #43 on: April 25, 2019, 03:33:02 pm »
I would like to thank all for the help I received.
Helping is good for me too, because I learn from it :) So I thank you for posting the problem  :)

Quote
I don't mind the criticism, I know I'm not very good and at my age not likely to improve much. But I do read the code and try my best to understand it.
I think things improve once you learn that comparing your results with ours is of little use when we do not have the same data. Having good data and knowing your data is very important. So if your result and the helpers' results are not the same, provide the data for testing  :)

If you understand the code you write, then that is OK. If the code is only for you and does not have to work for anyone else, then it is not very important.

JLWest

  • Hero Member
  • *****
  • Posts: 595
Re: DataProblems Maybe
« Reply #44 on: April 25, 2019, 10:10:51 pm »
@lucamar

I still have a few residual UTF-8 BOM issues, so I wonder if I can fix this once and for all.

Here is an amateur diagram of the data set I require:
   
   Apt.Dat file 7.9 million records. (I suspect a UTF-8 BOM file and the root of the problem)

       Airports.txt Extracted from the 7.9 million Apt.Dat file  36,000 + records
 
          APList.txt Extracted from the 7.9 million Apt.Dat 36,000 +

          Regional.txt   Extracted from the Airports.txt  9,000  + records

              AirportsRegions.txt extracted from the Regional.txt    9,000 + Same size as
                   Regional.txt record for record just different format

               RegionByCountry extracted from the AirportsRegions.txt 9,000 +

So my question is:

Can I write a program using your
procedure CorrectFile( const ASrcName: String; const ADestName: String); to convert the 7.9 million records in the Apt.Dat file to ASCII?

Because all the data I use comes from the 7.9 million record file, if I fix it I won't be having any more data issues.

 
