Author Topic: How to use new RTL unicode string support with ANSI (CP1252) file input/output?  (Read 17356 times)

BeniBela

  • Hero Member
  • *****
  • Posts: 905
    • homepage

in Delphi 1<->2 byte string conversions go over ansistring(0),


All conversions of any CP? But that makes no sense whatsoever, it is going to lose characters, unless ansistring(0) is utf8string

Still, you do not need to have ansistring = ansistring(0) = string. You can have ansistring = ansistring(123) and ansistring(0) = string.

Then have CP_SystemACP = 123 as replacement of CP_ACP

And CP_ACP = UTF8 everywhere, and you can even make conversions through ansistring(0) =  ansistring(CP_ACP) without screwing up the string

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4459
  • I like bugs.
But if everyone who needs ansistrings leaves, we will never get proper 1-byte string types.

You can freely work on alternative solutions, nobody tries to prevent you. Please let us know when it works.

There is one important fact that is almost forgotten: LCL applications (including Lazarus itself) continue to work as before with the explicit conversion functions. Then you must NOT add any extra compiler flags.
Why exactly does it work? FPC thinks the strings are in the system codepage, but the LCL stores UTF-8 there and then explicitly converts to other encodings when needed. It is another cheat.
Continuing the old way without changes is a valid solution for existing applications.

However, getting rid of (most) explicit conversions is a big benefit. What's more, it leads to better results by avoiding lossy conversions between Ansi codepages and Unicode.
So yes, the "Better Unicode Support in Lazarus" is indeed better. It is not Delphi compatible but is much closer to that.

It is also good to remember that many people had trouble converting their Delphi applications for Delphi 2009. I also did the conversion but didn't have many problems. Code with lots of streamed data, or code misusing String for binary data, had lots of problems.
Changing default encoding requires changes but let's keep things in perspective. Things are improving all the time.
Mostly Lazarus trunk and FPC 3.2 on Manjaro Linux 64-bit.

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4459
  • I like bugs.
All conversions of any CP? But that makes no sense whatsoever, it is going to lose characters, unless ansistring(0) is utf8string
Still, you do not need to have ansistring = ansistring(0) = string. You can have ansistring = ansistring(123) and ansistring(0) = string.
Then have CP_SystemACP = 123 as replacement of CP_ACP
And CP_ACP = UTF8 everywhere, and you can even make conversions through ansistring(0) =  ansistring(CP_ACP) without screwing up the string

I think your ideas are valid, but they would require changes in the FPC architecture. That may be difficult because the FPC project decided to go for the Delphi-compatible solution after years and years of discussion on the mailing lists. I guess everybody involved is exhausted now. I also understand the importance of Delphi compatibility.

Now it is realistic to use the existing FPC features. It is not that bad.
Mostly Lazarus trunk and FPC 3.2 on Manjaro Linux 64-bit.

marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 11382
  • FPC developer.
in Delphi 1<->2 byte string conversions go over ansistring(0),

All conversions of any CP? But that makes no sense whatsoever, it is going to lose characters, unless ansistring(0) is utf8string

It is backwards compatible, and logical, since the default conversions only work between the ACP and UTF-16 on Windows. UTF-8 came only later as an afterthought and has no real presence on Windows at the API level.


otoien

  • Jr. Member
  • **
  • Posts: 89
Wow, lots of replies here. This kind of topic seems to have a tendency to become rather heated.  ::)

The only remaining issue for "otoien" is how to read/write non-UTF-8 data. It can be solved easily with  WinCPToUTF8() or CP1252ToUTF8() etc..., as I wrote earlier.

This situation is exactly what I hate about the utf8rtl hack. The "ansi" encoding somehow gets lost. Inserting conversions that are not runtime (like cp1252) is a hack, and hopeless if your files are aggregates of complex write() commands. Also, it hardcodes an encoding (1252 or whatever) that was not hardcoded (but locale dependent) before, so it is not a direct substitute.

Not sure I understand what "hopeless if your files are aggregates of complex write() commands" implies. I specifically want to convert from CP1252 because this is the encoding used by the TP55 application, by my own choice with respect to units [for display only it is translated to DOS CP 437/850]. I do not want a colleague with, for instance, a Hebrew codepage to use that codepage to read my own recorded data. One needs to allow for the situation where data are transported internationally. Also, if I supply an option in the new application to write data files in Windows ANSI (non-UTF-8), I think that will also be done in CP1252. Simply put, one does not want the interpretation of the data (in this case the scientific units) to change depending on where they are read. (For the colleague with the non-CP1252 computer, the solution would be to choose the option to write UTF-8 encoded data files in the new application, as the Windows CP1252 ANSI would not display correctly on his computer, only be correctly interpreted.)

[As an example of how badly things can go: the first versions of a commercial data acquisition program we used to monitor biorhythms in animals wrote file time stamps according to UTC instead of local time. Then, on read, they would adjust the time according to the local time offset of the computer where they were read. Seems like a perfectly soundly engineered solution? Well, if data recorded here in the US on a day-active animal were read on a European computer, they would be interpreted as if the animal was night-active! Not what one wants for that kind of work.]


>Please read: http://wiki.freepascal.org/FPC_Unicode_support#Shortstring
>To make it simple: if EnableUTF8RTL is defined -> UTF-8 encoded by default.

Well, I assume that even if the shortstring in the packed record [used to read the binary file] will be considered UTF-8 encoded, when I convert it using CP1252ToUTF8 it will be treated as just a stream of bytes that should be converted correctly.
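For what it's worth, a minimal sketch of that conversion, assuming the shortstring sits in a packed record (the record layout and field names below are invented for illustration; CP1252ToUTF8 comes from the LConvEncoding unit):

```pascal
uses
  LConvEncoding;

type
  // Invented layout standing in for the TP55 binary parameter record
  TLegacyParamRec = packed record
    UnitStr: string[20];  // fixed-length shortstring holding CP1252 bytes
  end;

var
  Rec: TLegacyParamRec;
  s: String;
begin
  // CP1252ToUTF8 treats its argument as raw CP1252 bytes and returns
  // a UTF-8 encoded string, regardless of any declared codepage.
  s := CP1252ToUTF8(Rec.UnitStr);
end.
```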

I must say that the situation in Lazarus 1.4.0/FPC 2.6.4 and earlier, with different encodings of strings in the RTL and the LCL, has been somewhat crazy (but understandable in a historic perspective): one needs to stay aware of where things are called from and keep doing these conversions when treating data internally.

It really seems that the EnableUTF8RTL option is going to make things a lot simpler and cleaner for new applications, with mostly a need to think about possible conversions for different file input and perhaps certain data output. That is a lot easier to isolate.

One worry I have is the writings about future development with UTF-16 (I am personally perfectly happy with UTF-8). I really hope this will not be the only option for Windows applications, but that UTF-8 can be selected with switches for the default string encoding as in the coming version. I absolutely would not want to write tab-delimited data text files in UTF-16 format. Besides, while my application could change internally, long-term consistency in file formats is imperative in my kind of slow development. Having a split between Unix/Mac on one side and Windows on the other with respect to encoding would also really break cross-platform compatibility. If there is a possibility that we are going to lose UTF-8 as an optional string default on Windows, then I think this should be clearly expressed in the wikis etc., so that new code with respect to file input/output can be prepared with this in mind.

I will have to wait with further consideration/testing until I get a trunk version with FPC 3.x installed.
Unless otherwise noted I always use the latest stable version of Lazarus/FPC x86_64-win64-win32/win64

marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 11382
  • FPC developer.

Not sure I understand what "hopeless if your files are aggregates of complex write() commands" implies. I specifically want to convert from CP1252 because this is the encoding used by the TP55 application, by my own choice with respect to units [for display only it is translated to DOS CP 437/850]

(I mean if your file I/O isn't based on writing strings to file, but on writeln statements like writeln(f,value:10:3,'text':4,value2:2,'text2',','); etc., and then LOTS of them. Rearranging them to the writeln(f,conversionfunction(s)); form requires a lot of work and, worse, revalidation.)

Quote
I do not want a colleague with for instance a Hebrew codepage to use that codepage to read my own recorded data.

It sounds like the problem is that you in effect abused textual output (which formally has an encoding and locale assigned) for binary output (namely, a different encoding).

That is no longer possible with FPC 3.0, and it would require you to change the encoding of the textual output to match the one you are using. Try:

Code: [Select]
var
  f: Text;
begin
  AssignFile(f, 'filename.txt');
  Rewrite(f);
  SetTextCodePage(f, 1252);
  // ... do your writes ...
  CloseFile(f);
end;

This might actually work very well with the UTF8 hack, since working with UTF8 won't mutilate encodings, and all chars in cp1252 are supported in utf8.
 

otoien

  • Jr. Member
  • **
  • Posts: 89
This legacy program I wrote the binary parameter files with is a DOS program, so I do not know if the word abuse can be used for strings contained in binary files. I just took specific control of the content. :D And who said writing a mixed record of fixed-length strings, floats and other numbers to a binary file was abuse in the DOS days? ::)

Thanks for the explanation of the writeln conversions, I understand now. I would be inclined to do the conversion before that, and then write the converted string; placing the needed conversion code in a common separate procedure that can be modified if there are Lazarus/FPC changes in the future. For new code this approach should not be a problem.

BTW, There will be no binary files in the new program to store parameters, I am not going into that nightmare again.
Unless otherwise noted I always use the latest stable version of Lazarus/FPC x86_64-win64-win32/win64

wp

  • Hero Member
  • *****
  • Posts: 11853
Quote
legacy recorded data files that are tab-delimited files with a 3 line header use CP1252 ANSI character encoding. Line 2 of those headers are scientific unit strings essential for the calculations, for instance "µL/(h g)" and "W/(kg °C)".
Why don't you post one of these data files here (if the forum software does not accept the file type just pack the file into a zip)? I am sure users will soon present code how to read them in nowaday's FPC and Lazarus.

marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 11382
  • FPC developer.
Note that a

Code: [Select]
type
  CP1252String = type AnsiString(1252);
var
  s: CP1252String;

declares a codepage 1252 string. UTF-8 data assigned to it will be auto-converted. Maybe changing some local strings in your streaming routines will force this conversion.
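A small sketch of how such a codepage-aware string behaves under FPC 3.0 (the type name CP1252String is arbitrary; the literal is just an example from the thread):

```pascal
type
  // arbitrary name for a codepage-aware string type
  CP1252String = type AnsiString(1252);

var
  s: CP1252String;
  u: UTF8String;
begin
  u := 'µL/(h g)';  // UTF-8 encoded source string
  s := u;           // compiler-inserted conversion, UTF-8 -> CP1252
  u := s;           // and back again, CP1252 -> UTF-8
end.
```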

otoien

  • Jr. Member
  • **
  • Posts: 89
Thanks, so there is still a possibility to define a classic US windows ANSI (CP1252) string with EnableUTF8RTL active, good to know.

>"Why don't you post one of these data files here (if the forum software does not accept the file type just pack the file into a zip)? I am sure users will soon present code how to read them in nowaday's FPC and Lazarus."
I do not expect forum members to write my code, I just want to understand the principles of the development tool; the needed code would anyway be too specific, and reading the data files will depend on the type of in-memory data storage (likely dynamic arrays) and the data structures I decide to use. However, for anyone curious about what I do, I have attached a legacy example file. The format supports missing data, and there is also room for additional byte codes that can, for instance, give information about solenoid states. Once past the header, everything is ASCII, so conversions should not be needed. The current DOS code will (a bit simplified) read in one line at a time and then work on the resulting buffered string. It should be easy to convert this incoming string to UTF-8. I have not yet worked on rewriting this code; I have concentrated on reading the binary parameter record, which in the legacy code among other things also contains the strings of the header in an array of fixed-length strings. For those very curious :D, I have published a couple of papers describing some aspects at different stages of the TP55 application's development:
Tøien, Ø. Data acquisition in thermal physiology: measurements of shivering. J. Therm. Biol. 1992; 17(6):357-366.
Tøien, Ø. Automated open flow respirometry in continuous and long-term measurements: design and principles. J. Appl. Physiol. 2013; 114:1094-1107.


One new question though: How can I determine if a line read from a file is ANSI (CP1252) or UTF-8 so that I can apply the conversion where appropriate?
« Last Edit: July 07, 2015, 12:57:42 pm by otoien »
Unless otherwise noted I always use the latest stable version of Lazarus/FPC x86_64-win64-win32/win64

marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 11382
  • FPC developer.
One new question though: How can I determine if a line read from a file is ANSI (CP1252) or UTF-8 so that I can apply the conversion where appropriate?

This is not possible to know for certain. That information is not stored anywhere in the file. The best you can do is some form of statistical analysis (based on occurrences of certain byte sequences, assuming a certain language) and make an educated guess. AFAIK LazUtils has such a routine somewhere.
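The routine referred to is presumably GuessEncoding in the LConvEncoding unit of LazUtils; a minimal sketch of using it together with ConvertEncoding (keeping in mind it is a heuristic and can misjudge short or ambiguous input):

```pascal
uses
  LConvEncoding;

// Convert a line of unknown encoding to UTF-8.
function LineToUTF8(const Line: String): String;
var
  enc: String;
begin
  enc := GuessEncoding(Line);                         // e.g. 'utf8' or 'cp1252'
  Result := ConvertEncoding(Line, enc, EncodingUTF8);
end;
```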

wp

  • Hero Member
  • *****
  • Posts: 11853
Ah, a file, finally!

Here's code that reads your file with Laz 1.4/FPC 2.6.4 and Laz 1.5/FPC 3.1.1 (both with and without EnableUTF8RTL) into a StringGrid: just a form with a string grid and a button, and the following button-click handler:
Code: [Select]
uses
  LConvEncoding;

{ TForm1 }

procedure TForm1.Button1Click(Sender: TObject);
const
  FILENAME = 'A030701a.dat';
var
  rowList, cellList: TStringList;
  encoding: String;
  i, j: Integer;
begin
  rowList := TStringList.Create;
  cellList := TStringList.Create;
  try
    rowList.LoadFromFile(FILENAME);
    encoding := GuessEncoding(rowList.Text);  // check whether file is utf8 or ansi
    rowList.Text := ConvertEncoding(rowList.Text, encoding, EncodingUTF8);
    cellList.Delimiter := ' ';
    cellList.DelimitedText := rowList[2];     // the longest line is row #2
    StringGrid1.RowCount := rowList.Count;
    StringGrid1.ColCount := cellList.Count;
    StringGrid1.FixedCols := 0;
    StringGrid1.FixedRows := 3;
    for i := 0 to rowList.Count-1 do begin
      cellList.DelimitedText := rowList[i];
      for j := 0 to cellList.Count-1 do
        StringGrid1.Cells[j, i] := cellList[j];
    end;
  finally
    cellList.Free;
    rowList.Free;
  end;
end;
« Last Edit: July 07, 2015, 02:57:13 pm by wp »

skalogryz

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 2770
    • havefunsoft.com
Alternatively, SDK functions could be intercepted (hooked) so that the UTF-8 to Ansi conversion happens there (or the call could even be redirected to the W function variant).
However, this might break compatibility in some other places.

otoien

  • Jr. Member
  • **
  • Posts: 89
Thanks wp,
Looks very useful; I see there might be simpler ways to read this than in my old DOS code, and also supply UTF-8 detection. Nice to see so many friendly forum members here willing to help.

Alternatively, SDK functions could be intercepted (hooked) and the conversion of UTF8 to Ansi be happening (or even redirected to W function call).
However, this might break compatibility in some other places.

Can't claim that I understand what you mean here. (SDK functions? W function calls? Sounds like WinAPI calls?) Keeping cross-platform compatibility is essential, and my priority is maintainable, safe, cross-platform code. Performance is no issue in this case, as only the 3-5 line header can contain non-ASCII characters. Looks like I got my answers above anyway; now the testing will have to begin. Again, thanks for all the responses.
Unless otherwise noted I always use the latest stable version of Lazarus/FPC x86_64-win64-win32/win64
