Author Topic: Again: Reading non-unicode text files  (Read 4711 times)

ArminLinder

  • Sr. Member
  • ****
  • Posts: 314
  • Keep it simple.
Again: Reading non-unicode text files
« on: January 08, 2017, 11:00:44 pm »
Hi all,

quick question:

- I have a text file coming from somewhere, containing names, one per line. When I read it using FPC, the German umlauts (äöüÄÖÜ) get replaced with "?" characters. Looking at it in a hex editor, I see that the file is encoded with 1 byte per character; the codepage is evidently ISO/IEC 8859-1 ("ü" = #fc, ...), and line ends are CR-LF (#0d#0a). Not an unusual file format, I guess.

How can I read this file into Free Pascal line by line, keeping the special characters intact?

I tried:
Code: Pascal
  var F:Text;

  ...

  readln(F,Buffer);

with Buffer being of type ShortString, AnsiString, and UTF8String; no success.

Code: Pascal
  var S:TStringList;

  ...

  S := TStringList.Create;
  S.LoadFromFile(Filename);
  ...

No success either. Using any of the conversion routines such as AnsiToUTF8 also fails, of course, since the damage is already done after reading. I searched the FPC docs up and down for a way to set a codepage for reading, without success.

This problem has been described several times, and some people trying to help produced quite lengthy rants about flimsy codepage support in FPC, but no solution. The docs weren't helpful either; it seems that problems reading such files aren't of interest to anyone.

Simple question: how can I read those files properly?

Thx

Armin.
« Last Edit: January 09, 2017, 12:09:58 am by Nimral »
Lazarus 3.3.2 on Windows 7,10,11, Debian 10.8 "Buster", macOS Catalina, macOS BigSur, VMWare Workstation 15, Raspberry Pi

Bart

  • Hero Member
  • *****
  • Posts: 5274
    • Bart en Mariska's Webstek
Re: Again: Reading non-unicode text files
« Reply #1 on: January 08, 2017, 11:38:32 pm »
See the LConvEncoding unit; it has conversion routines to UTF-8 for many codepages.

Bart

ArminLinder

Re: Again: Reading non-unicode text files
« Reply #2 on: January 09, 2017, 12:21:03 am »
Hi Bart,

Many thanks for the hint, it saved my day (uhm, night  :))

After some browsing of the LazUTF8 unit I stumbled across the WinCPToUTF8 function, which did the trick for my needs:

Code: Pascal
  Var F:Text;
      Buffer:ShortString;

  begin
    Result := false;
    Try
      Try
        AssignFile(F,FileName);
        Reset(F);
        While Not eof(F) do
          begin
          readln(F,Buffer);
          if trim(Buffer) <> '' Then List.Add(WinCPToUTF8(Buffer));
          end;
        Result := true;
      except on e:Exception do
        ShowMessage(e.ToString);
      end;
    finally
      Close(F);
    end;

  end;

After a few quick tests, this seems to read the files, and keep the Umlauts.

Greetings from Bavaria ...

Armin.
« Last Edit: January 09, 2017, 12:24:03 am by Nimral »

ASerge

  • Hero Member
  • *****
  • Posts: 2222
Re: Again: Reading non-unicode text files
« Reply #3 on: January 10, 2017, 12:12:02 am »
Code: Pascal
  Var F:Text;
      Buffer:ShortString;
  begin
    Result := false;
    Try
      Try
        AssignFile(F,FileName);
        Reset(F);
        //...
      except on e:Exception do
        ShowMessage(e.ToString);
      end;
    finally
      Close(F);
    end;
  end;
A warning about the use of try..finally here.
Right design:
Code: Pascal
  AllocateResource;
  try
    // use the resource
  finally
    ReleaseResource;
  end;
Bad design:
Code: Pascal
  try
    AllocateResource;
  finally
    ReleaseResource;
  end;
If AllocateResource fails in the bad design, the finally part still runs and releases a resource that was never acquired.
In your case it must be:
Code: Pascal
  //...
  Reset(F);
  try
    //...
  finally
    CloseFile(F);
  end;

It may also be safer to declare Buffer as AnsiString, even if all the names are short for now.

ArminLinder

Re: Again: Reading non-unicode text files
« Reply #4 on: January 11, 2017, 12:20:32 am »
Hi Serge,

thanks for the Close/CloseFile hint ... a bad habit from the past, I guess :-) It has been a while since I last worked with Pascal.

Regarding your suggestion to reset the file before the try block: what if Reset fails? In that case I get an unhandled exception. But I see your point: CloseFile is only valid if Reset was successful. But where does this lead? Three nested try blocks?

Code: Pascal
  Function LoadNamesList(FileName:String;var List:TStringList):boolean;

  Var F:Text;
      Buffer:String;

  begin
    Result := false;
    AssignFile(F,FileName);  // cannot possibly fail, right?
    Try
      Reset(F);
      Try
        Try
          While Not eof(F) do
            begin
              readln(F,Buffer);
              if trim(Buffer) <> '' Then List.Add(WinCPToUTF8(Buffer));
            end;
          Result := true;
        except on e:Exception do     // handle Read errors
          ShowMessage(e.ToString);
        end;
      finally
        CloseFile(F);
      end
    except on e:Exception do // handle reset errors
      ShowMessage(e.ToString);
    end;
  end;

Yuck. Is this really the only way to get a text read loop with exception handling?

Thanks for your help,

Armin.

Bart

Re: Again: Reading non-unicode text files
« Reply #5 on: January 11, 2017, 12:29:38 am »
Doesn't WinCPToUtf8() on StringList.Text solve your issue?

Code:
  SL.LoadFromFile(AFilename);
  SL.Text := WinCPToUTF8(SL.Text);

Bart
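
For readers who want to combine Bart's whole-file approach with the per-line checks Armin mentions later in the thread, the following is a minimal sketch, not a tested solution. It assumes the LConvEncoding unit (Bart's earlier suggestion, part of Lazarus' LazUtils package) is on the unit path, and it uses that unit's ISO_8859_1ToUTF8 function in place of WinCPToUTF8, since the file was identified as ISO 8859-1. It also assumes that LoadFromFile stores the file's bytes unconverted, which may depend on the FPC version. The helper name LoadNames and the file name taken from the command line are hypothetical.

```pascal
program LoadNamesWhole;

{$mode objfpc}{$H+}

uses
  Classes, SysUtils,
  LConvEncoding; // Lazarus LazUtils package, assumed to be on the unit path

// Hypothetical helper: loads an ISO 8859-1 file and keeps the trimmed,
// non-empty lines as UTF-8 strings in List.
procedure LoadNames(const FileName: string; List: TStrings);
var
  Raw: TStringList;
  i: Integer;
  Line: string;
begin
  Raw := TStringList.Create;
  try
    Raw.LoadFromFile(FileName);              // assumption: bytes arrive unconverted
    Raw.Text := ISO_8859_1ToUTF8(Raw.Text);  // convert the whole file once
    for i := 0 to Raw.Count - 1 do
    begin
      Line := Trim(Raw[i]);                  // further per-line checks would go here
      if Line <> '' then
        List.Add(Line);
    end;
  finally
    Raw.Free;
  end;
end;

var
  Names: TStringList;
begin
  Names := TStringList.Create;
  try
    LoadNames(ParamStr(1), Names);
    WriteLn(Names.Count, ' names loaded');
  finally
    Names.Free;
  end;
end.
```

Converting before trimming keeps the per-line logic working on UTF-8 text only, at the cost of holding the whole file in memory.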

J-G

  • Hero Member
  • *****
  • Posts: 953
Re: Again: Reading non-unicode text files
« Reply #6 on: January 11, 2017, 01:17:46 am »
Quote from ArminLinder: Regarding your suggestion to reset the file before the try block ... what if Reset fails? In that case I get an unhandled exception.
You should never have an 'unhandled' exception (but you know that anyway).

IOResult is always available; check it right after Reset(F).
To do that, you need to turn I/O checking off and back on around the call, viz.:
Code: Pascal
  {$I-}  Reset(F);  {$I+}
  if IOResult <> 0 then
    begin
      // show a message that the file couldn't be opened
      // and deal with the consequences,
      // i.e. handle the exception!
    end
  else
    begin
      // code to handle the file now that it's open
    end;
FPC 3.0.0 - Lazarus 1.6 &
FPC 3.2.2  - Lazarus 2.2.0 
Win 7 Ult 64
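
Filled out into a complete program, the IOResult approach might look like the sketch below. It assumes the file name is passed as the first command-line parameter and simply counts non-empty lines; both choices are illustrative and not from the thread. IOResult is copied into a variable because reading it resets the stored error code.

```pascal
program OpenWithIOResult;

{$mode objfpc}{$H+}

uses
  SysUtils; // turns later I/O errors into exceptions while {$I+} is active

var
  F: Text;
  Line: AnsiString;
  Code: Integer;
  Count: Integer;
begin
  AssignFile(F, ParamStr(1));

  {$I-}
  Reset(F);            // no run-time error here, even if the file is missing
  {$I+}
  Code := IOResult;    // read once: IOResult is cleared by the call

  if Code <> 0 then
  begin
    WriteLn('Cannot open file, I/O error ', Code);
    Halt(1);
  end;

  // The file is open at this point, so CloseFile in finally is always valid.
  Count := 0;
  try
    while not Eof(F) do
    begin
      ReadLn(F, Line);
      if Line <> '' then
        Inc(Count);
    end;
  finally
    CloseFile(F);
  end;

  WriteLn(Count, ' non-empty lines');
end.
```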

Lutz Mändle

  • Jr. Member
  • **
  • Posts: 65
Re: Again: Reading non-unicode text files
« Reply #7 on: January 11, 2017, 06:33:40 am »
Since FPC 3.0.0 there is another approach possible: strings with codepage.
Try the following code fragment:

Code: Pascal
  type
    cpstring = type string(28591);   //codepage 28591 is iso-8859-1

  var
    F:Text;
    Buffer:cpstring;
    s:string;

  ...

  readln(F,Buffer);
  s:=Buffer;

For more information about codepage numbers see this link:
https://github.com/ConradIrwin/encoding-codepage/blob/master/README.md
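
Another route that is sometimes suggested, shown here only as a hedged sketch: since FPC 3.0 the System unit offers SetTextCodePage, which tags a text file with a codepage so that ReadLn can convert to the codepage of the target string. The sketch assumes that this conversion takes place when reading into a UTF8String, that a codepage conversion backend is available (the cwstring unit on Unix-like systems), and that the file name comes from the command line. None of this was tested in the thread.

```pascal
program ReadLatin1;

{$mode objfpc}{$H+}

uses
  {$ifdef unix}cwstring,{$endif} // codepage conversion support on Unix-like systems
  SysUtils;

var
  F: Text;
  Line: UTF8String;
  Count: Integer;
begin
  AssignFile(F, ParamStr(1));
  SetTextCodePage(F, 28591);  // declare the file content as ISO 8859-1
  Reset(F);
  Count := 0;
  try
    while not Eof(F) do
    begin
      ReadLn(F, Line);        // assumption: converted to UTF-8 while reading
      if Line <> '' then
        Inc(Count);
    end;
  finally
    CloseFile(F);
  end;
  WriteLn(Count, ' non-empty lines read');
end.
```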

ArminLinder

Re: Again: Reading non-unicode text files
« Reply #8 on: January 11, 2017, 09:01:05 am »
Thanks all, it's good to know there are still some Pascal heroes around :-)

@Bart: I could use other means of reading the file, but in this particular case each line may need some processing. The sample already trims lines and skips empty ones before adding them to the string list, and there is more checking to come in the final version. That's why I wanted to read line by line here.

I guess I could, however, derive my own string list class, override the Add method to get a comparable result, and use the LoadFromFile method.

My main mistake was assuming that readln would already do a conversion depending on the string type I declare for the target variable, but this doesn't seem to be the case.

Code: Pascal
  var StringBuffer:ShortString;
      AnsiStringBuffer:AnsiString;
      F:Text;
  ...

  readln(F,StringBuffer);
  readln(F,AnsiStringBuffer);

  ...

If I did a bytewise hex dump of StringBuffer and AnsiStringBuffer, I would always get all the bytes exactly as they are stored in the file, correct?

Armin.
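
The question about raw bytes can be checked empirically rather than answered from memory. A minimal sketch, assuming the file name is given as the first command-line parameter: it reads the first line into a ShortString and prints each byte in hexadecimal, which can then be compared with the hex editor's view of the file. BytesToHex is a hypothetical helper, and a line longer than 255 characters would be truncated by the ShortString.

```pascal
program HexDumpLine;

{$mode objfpc}{$H+}

uses
  SysUtils;

// Hypothetical helper: returns the bytes of S as space-separated
// two-digit hexadecimal values.
function BytesToHex(const S: ShortString): string;
var
  i: Integer;
begin
  Result := '';
  for i := 1 to Length(S) do
    Result := Result + IntToHex(Ord(S[i]), 2) + ' ';
end;

var
  F: Text;
  Buffer: ShortString;
begin
  AssignFile(F, ParamStr(1));
  Reset(F);
  try
    if not Eof(F) then
    begin
      ReadLn(F, Buffer);           // first line only, without the line ending
      WriteLn(BytesToHex(Buffer)); // compare with the hex editor output
    end;
  finally
    CloseFile(F);
  end;
end.
```

Changing the type of Buffer and the parameter type of BytesToHex to AnsiString allows the same comparison for the second case in the question.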

ArminLinder

Re: Again: Reading non-unicode text files
« Reply #9 on: January 11, 2017, 09:02:56 am »
@Lutz: Thanks ... I was totally unaware of this functionality. I'll look into it; it may be useful in the future.

@J-G: IOResult :-) Back to 1980s programming techniques :-) I thought about this at the beginning, but then decided to go with try blocks; otherwise everyone would realize that I am already over 50 years old :-)

Armin
« Last Edit: January 11, 2017, 09:05:12 am by Nimral »

J-G

Re: Again: Reading non-unicode text files
« Reply #10 on: January 11, 2017, 12:39:07 pm »
Quote from ArminLinder: @J-G: IOResult :-) Back to 1980s programming techniques :-) I thought about this at the beginning, but then decided to go with try blocks; otherwise everyone would realize that I am already over 50 years old :-)
Armin, don't be afraid of letting others know that you have some experience  :)
I have 25 more years of life experience than you  :D  and learn something new every day, but that doesn't mean that the things I learned when I was your age aren't still useful.

ASerge

Re: Again: Reading non-unicode text files
« Reply #11 on: January 11, 2017, 07:52:59 pm »
Quote from ArminLinder: Yuck. Is this really the only way to get a text read loop with exception handling?
It is one of the possible ways. Often a single common try..except is used, in the form
Code: Pascal
  try
  except
    on E: EFOpenError do
      //...
    on E: EReadError do
      //...
    ...
    on E: Exception do
      //...
  end;
Sometimes it is better not to handle exceptions at all, leaving them to an outer block or even to the user.
There are two different approaches: working with exceptions, and working with error codes. As I see it, your function should return False/True, i.e. the second option. In this case it is more convenient to write it in the form
Code: Pascal
  function LoadNamesList(const FileName: string; List: TStringList; out ErrorMessage: string): Boolean;
  ...
  try
    ...
    ErrorMessage := '';
    Result := True;
  except
    on E: Exception do
    begin
      ErrorMessage := E.Message;
      // Do other undo and cleanup operations, for example List.Clear
      Result := False;
    end;
  end;
The calling code then uses it as
Code: Pascal
  if not LoadNamesList(FileName, List, ErrorMessage) then
  begin
    // Report the error message to the user or log it
    // Plan B, or stop and retry
  end;
If you do not need ErrorMessage, drop it entirely and return only False/True.
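
Putting the pieces of the thread together, a complete version of the error-code variant could look like the sketch below. It is an illustration under assumptions, not ASerge's or Armin's final code: it uses ISO_8859_1ToUTF8 from the LConvEncoding unit instead of WinCPToUTF8, switches I/O checking on explicitly so that a failed Reset raises an exception, and takes the file name from the command line.

```pascal
program LoadNamesDemo;

{$mode objfpc}{$H+}{$I+}

uses
  Classes, SysUtils,
  LConvEncoding; // Lazarus LazUtils package, assumed to be on the unit path

function LoadNamesList(const FileName: string; List: TStringList;
  out ErrorMessage: string): Boolean;
var
  F: Text;
  Buffer: AnsiString;
begin
  ErrorMessage := '';
  try
    AssignFile(F, FileName);
    Reset(F);                 // raises an exception if the file cannot be opened
    try
      while not Eof(F) do
      begin
        ReadLn(F, Buffer);
        if Trim(Buffer) <> '' then
          List.Add(ISO_8859_1ToUTF8(Buffer));
      end;
    finally
      CloseFile(F);           // only reached when Reset succeeded
    end;
    Result := True;
  except
    on E: Exception do
    begin
      ErrorMessage := E.Message;
      List.Clear;             // do not hand back a half-filled list
      Result := False;
    end;
  end;
end;

var
  Names: TStringList;
  Msg: string;
begin
  Names := TStringList.Create;
  try
    if LoadNamesList(ParamStr(1), Names, Msg) then
      WriteLn(Names.Count, ' names loaded')
    else
      WriteLn('Could not load names: ', Msg);
  finally
    Names.Free;
  end;
end.
```

A single outer try..except around an inner try..finally covers both open and read errors, which avoids the three nested blocks from the earlier attempt.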

 
