
Author Topic: How to use new RTL unicode string support with ANSI (CP1252) file input/output?  (Read 17355 times)

otoien

  • Jr. Member
  • **
  • Posts: 89
I am looking into rewriting a large application originally coded in TP5.5, and maintained up to this day (!). I am trying to wrap my head around what to do with string types and file formats for the new application with respect to the new UTF-8 compatible FPC. The issue here is that the legacy recorded data files, tab-delimited files with a 3-line header, use CP1252 ANSI character encoding. Line 2 of those headers contains scientific unit strings essential for the calculations, for instance "µL/(h g)" and "W/(kg °C)". I intend to keep the format of the data files, but using UTF-8 encoding either as default or as an option would be desirable. In addition the application needs to read old binary parameter/template files (I have solved this), and my current idea is to store this information in .ini files in the future for more flexibility in format. UTF-8 encoding would be desirable here, but I am not sure if that will be supported in .ini files? (Use of XML instead could be an alternative, but let us take that discussion elsewhere.)
 
With the new FPC RTL unicode support, things seem to get simplified a lot, as there is no need to "pollute" the general business code with UTF-8 and ANSI string types when dealing with LCL and FPC code, one could just use a generic string type for all except file input/output. However it seems that the generic string type (currently ANSI string in FPC) will be redefined as a unicode compatible UTF-8 encoded string. In another thread (http://forum.lazarus.freepascal.org/index.php/topic,28941.msg182075.html#new ) I noted the following:
" UTF8RTL
only 1 type of encoding by default
. string = ansistring = utf8string : UTF8 encoding "

So my worry is, if ANSI strings are redefined, how do I read and write files with CP1252 ANSI strings? Also it seems that a number of the conversion routines between ANSI and UTF-8 will no longer do anything?
I hope for some advice as to the best strategy here. I have read up on unicode in general, and the new FPC (http://wiki.freepascal.org/FPC_Unicode_support) and followed a number of the discussions, but these tend to become so specific that they are hard to completely follow.

I use Win 7, and unfortunately do not have the trunk version to do any testing here (spent days trying to install trunk with FPC 3+ earlier this year with no luck); I hope the Lazarus/FPC standard release version with new FPC unicode support will come in time for when I really need it.
Unless otherwise noted I always use the latest stable version of Lazarus/FPC x86_64-win64-win32/win64

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4459
  • I like bugs.
Read the data into a String or RawByteString and then convert with WinCPToUTF8() or CP1252ToUTF8() as early as possible.
I don't see any big problem in it. The other thread gave an impression that it is very difficult when ChrisF and wp were experimenting.
This all must be documented of course.
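A minimal sketch of that "convert as early as possible" approach might look like the following (assuming the LConvEncoding unit from Lazarus; the file name and procedure name are hypothetical):

```pascal
uses
  Classes, SysUtils, LConvEncoding;

// Load a legacy CP1252 data file and convert it to UTF-8 immediately,
// so the rest of the program only ever sees UTF-8 strings.
procedure LoadLegacyFile(const FileName: string; Lines: TStringList);
var
  i: Integer;
begin
  // The file is read as raw bytes; each line is still CP1252 here.
  Lines.LoadFromFile(FileName);
  // Convert to UTF-8 before any other processing touches the data.
  for i := 0 to Lines.Count - 1 do
    Lines[i] := CP1252ToUTF8(Lines[i]);
end;
```

After this call the strings are safe to pass to the LCL and the rest of the business code without further conversions.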

FPC project has promised an RC1 of 3.0 version soon. Then you can test with Lazarus trunk which is easy to compile. A Lazarus release with the new Unicode features will take a while.
Mostly Lazarus trunk and FPC 3.2 on Manjaro Linux 64-bit.

ChrisF

  • Hero Member
  • *****
  • Posts: 542
[...]  The other thread gave an impression that it is very difficult when ChrisF and wp were experimenting.[...] 

I wouldn't say "difficult", but rather "complicated / not intuitive".

And remember, we were talking mainly about migrating existing code, for which I DO think it will not be as easy.
« Last Edit: July 05, 2015, 03:22:51 am by ChrisF »

otoien

  • Jr. Member
  • **
  • Posts: 89
Thanks both of you for the quick response.
Good to hear there are functions for this. I assume there will be reverse equivalents?
(The reason is that I might want to provide an option to write the data files in CP1252 format. These tab-delimited CP1252 format files work directly when imported into older versions of Excel, which some of us prefer, while those versions do not seem to understand UTF-8.)

Some more testing indicates that presently both TIniFile and TMemIniFile work with UTF-8, at least for writing values; keys stay within ASCII.
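For reference, a small sketch of the TMemIniFile usage just described (section, key and file names here are hypothetical; the value is UTF-8, the key stays within ASCII):

```pascal
uses
  IniFiles;

var
  Ini: TMemIniFile;
begin
  Ini := TMemIniFile.Create('params.ini');
  try
    // ASCII section and key names; the value contains UTF-8 (µ).
    Ini.WriteString('Units', 'Metabolism', 'µL/(h g)');
    // TMemIniFile buffers changes until UpdateFile is called.
    Ini.UpdateFile;
  finally
    Ini.Free;
  end;
end.
```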

Edit: In which units are WinCPToUTF8() and CP1252ToUTF8() located? I discovered that I already have a VM with 1.4 RC1/FPC 3.0 installed with the setup program from the getpascal page; not sure if this version is developed far enough to test this.

Edit2: The last reply from the other thread could indicate that string constants will be CP1252 regardless, and also affect the codepage of a string they are merged with? I thought string constants in the code were supposed to be UTF-8 encoded just like the rest of the code? I have assignments like:

var Unugperml : String;  //in the new version this will have codepage UTF-8 activated with {$DEFINE EnableUTF8RTL}
...
Unugperml:='µg/ml';   // will I need to convert from CP1252 to UTF-8 here?


Edit3: What about encoding of fixed length strings (used when reading legacy files), will that still be CP1252 regardless of the settings?
« Last Edit: July 05, 2015, 08:36:40 am by otoien »
Unless otherwise noted I always use the latest stable version of Lazarus/FPC x86_64-win64-win32/win64

ChrisF

  • Hero Member
  • *****
  • Posts: 542
I assume there will be reverse equivalents?
There are reverse functions for each kind of conversion functions.


Edit: In which units are WinCPToUTF8() and CP1252ToUTF8() located?
WinCPToUTF8 -> uses ... LazUTF8, ...
CPxxxxToUTF8 -> uses ... LConvEncoding, ...
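Putting those units to use for the reverse direction, a hedged sketch of writing UTF-8 strings back out as CP1252 (for the old-Excel-friendly files mentioned earlier), using UTF8ToCP1252 from LConvEncoding; the procedure name is an assumption for illustration:

```pascal
uses
  Classes, SysUtils, LConvEncoding;

// Write the in-memory UTF-8 lines to disk as CP1252.
procedure SaveAsCP1252(const FileName: string; Lines: TStringList);
var
  Raw: TStringList;
  i: Integer;
begin
  Raw := TStringList.Create;
  try
    // Convert each UTF-8 line back to CP1252 just before writing.
    for i := 0 to Lines.Count - 1 do
      Raw.Add(UTF8ToCP1252(Lines[i]));
    // Written as raw single-byte CP1252, no BOM.
    Raw.SaveToFile(FileName);
  finally
    Raw.Free;
  end;
end;
```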



The last reply from the other thread could indicate that string constants will be CP1252 regardless and also affect the codepage of a string it is merged with?
No, not exactly. First of all:
- Ansi doesn't necessarily mean 1252,
- and CP_ACP doesn't necessarily mean Ansi (see SetMultiByteConversionCodePage).

Concerning your question, please read this part of the link you've provided in your first post of this topic:  http://wiki.freepascal.org/FPC_Unicode_support#String_constants

Anyway, IMHO you shouldn't care about the code page of string constants (unless you planned not to use UTF-8 in your source code, which sounds to me like a very bad idea). They are not really important most of the time. The code page of the destination string variable is much more important, and "magic automatic" conversion is provided by the compiler. This is definitely a shortcut, but it's more or less the general idea.

And no, string constants will not "affect the codepage of a string it is merged with".



will I need to convert from CP1252 to UTF-8 here?
Short answer: No.



What about encoding of fixed length strings?
Please read: http://wiki.freepascal.org/FPC_Unicode_support#Shortstring
To make it simple: if EnableUTF8RTL is defined -> UTF-8 encoded by default.
« Last Edit: July 05, 2015, 02:06:13 pm by ChrisF »

taazz

  • Hero Member
  • *****
  • Posts: 5368
Please read: http://wiki.freepascal.org/FPC_Unicode_support#Shortstring
To make it simple: if EnableUTF8RTL is defined -> UTF-8 encoded by default.
Short strings too? Oh hell, this becomes worse with every post.
Good judgement is the result of experience … Experience is the result of bad judgement.

OS : Windows 7 64 bit
Laz: Lazarus 1.4.4 FPC 2.6.4 i386-win32-win32/win64

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4459
  • I like bugs.
ChrisF and taazz, please stop spreading FUD about the new Unicode support feature.
The relevant information can be found here:
  http://wiki.lazarus.freepascal.org/Better_Unicode_Support_in_Lazarus
The page about FPC_Unicode_support is not as useful now because -dEnableUTF8RTL overrides the default encoding. Yes, it can be considered a hack but so what?

The only remaining issue for "otoien" is how to read/write non-UTF-8 data. It can be solved easily with WinCPToUTF8() or CP1252ToUTF8() etc., as I wrote earlier.

If somebody does not want to use -dEnableUTF8RTL, it is perfectly OK but there is no need to confuse people who actually need it.
-dEnableUTF8RTL is clearly the right solution for "otoien", you should respect it.
In future please open a new thread for issues without -dEnableUTF8RTL.

"otoien", you must install FPC 3.x and test with Lazarus trunk.
I myself just installed the latest FPC trunk for Windows with fpcup from here:
  https://github.com/LongDirtyAnimAlf/Reiniero-fpcup
Then I installed FPC only with:
 >fpcup64.exe --fpcURL="trunk" --only="FPCcleanonly,FPCgetonly,FPCbuildonly"
I already have got Lazarus trunk with TortoiseSVN.
Mostly Lazarus trunk and FPC 3.2 on Manjaro Linux 64-bit.

ChrisF

  • Hero Member
  • *****
  • Posts: 542
ChrisF and taazz, please stop spreading FUD about the new Unicode support feature.

OK. Subject is now closed for me.

marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 11382
  • FPC developer.
The only remaining issue for "otoien" is how to read/write non-UTF-8 data. It can be solved easily with  WinCPToUTF8() or CP1252ToUTF8() etc..., as I wrote earlier.

This situation is exactly what I hate about the utf8rtl hack. The "ansi" encoding somehow gets lost. Inserting conversions that are not determined at runtime (like cp1252) is a hack, and hopeless if your files are aggregates of complex write() commands. Also it hardcodes an encoding (1252 or whatever) which was not hardcoded (but locale dependent) before, so it is not a direct substitute.

That, and the fact that it is incompatible with both old AND new (Unicode, 2009+) Delphi.

If you really want to stop FUD, stop downplaying the issues with the utf8 hack.

taazz

  • Hero Member
  • *****
  • Posts: 5368
ChrisF and taazz, please stop spreading FUD about the new Unicode support feature.
The relevant information can be found here:
  http://wiki.lazarus.freepascal.org/Better_Unicode_Support_in_Lazarus
The page about FPC_Unicode_support is not as useful now because -dEnableUTF8RTL overrides the default encoding. Yes, it can be considered a hack but so what?

The only remaining issue for "otoien" is how to read/write non-UTF-8 data. It can be solved easily with  WinCPToUTF8() or CP1252ToUTF8() etc..., as I wrote earlier.
FUD? Where is the FUD? Is the shortstring UTF-8, or does it support the OEM character set of the installed Windows (1252 etc.)? Is the ansistring type UTF-8, and different from whatever the user has chosen as the default ANSI encoding in his Windows installation, yes or no? Simple questions; if they are, it is unacceptable behavior for a library. That's all I'm saying. Where is the fear or the uncertainty in my posts?

Any way it is clear that the road ahead is not one I'd like to walk. So I'm out for now. have fun everyone.
Good judgement is the result of experience … Experience is the result of bad judgement.

OS : Windows 7 64 bit
Laz: Lazarus 1.4.4 FPC 2.6.4 i386-win32-win32/win64

marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 11382
  • FPC developer.
Any way it is clear that the road ahead is not one I'd like to walk. So I'm out for now. have fun everyone.

To be fair: note that Mattias also said it was a transition horse, especially for the 2.x+3.x dual maintenance period and the slow progression of 3.x.

So while I don't like it, and it has serious downsides if you do a lot of work in ACS, it is hopefully NOT the road ahead long term.
« Last Edit: July 05, 2015, 06:40:28 pm by marcov »

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4459
  • I like bugs.
This situation is exactly what I hate about the utf8rtl hack. The "ansi" encoding somehow gets lost. Inserting conversions that are not runtime (like cp1252) are hacks and are hopeless if your files are aggregates of complex write() commands.

Yes, this is the downside. The solution is to encapsulate the read/write code into some "dirty" functions which can be ported to be compatible with Delphi and the future UTF-16 Lazarus solutions.
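That encapsulation idea might look like the sketch below (the function names are hypothetical; WinCPToUTF8/UTF8ToWinCP are in LazUTF8 and follow the current Windows ANSI codepage rather than hardcoding 1252, which also addresses the locale-dependence concern):

```pascal
uses
  SysUtils, LazUTF8;

// All legacy-file I/O goes through these two functions, so the
// conversion lives in one place and can be swapped out later,
// e.g. for a Delphi-compatible or UTF-16 port.
function DecodeLegacy(const Raw: RawByteString): string;
begin
  Result := WinCPToUTF8(Raw);  // system ANSI codepage -> UTF-8
end;

function EncodeLegacy(const S: string): RawByteString;
begin
  Result := UTF8ToWinCP(S);    // UTF-8 -> system ANSI codepage
end;
```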

Quote
Also it hardcodes an encoding (1252 or whatever, which was not hardcoded (but locale dependent) before, so it is not a direct substitute).

I think WinCPToUTF8() and UTF8ToWinCP() functions solve this.

Quote
That and the fact it is compatible with both old AND new (unicode, 2009+) Delphi.

The reading/writing of non-UTF-8 data from/to files/streams is not compatible. Once the data is converted (in some encapsulated functions), the solution is amazingly compatible with Unicode Delphis.


Quote from: taazz
Is the shortstring UTF-8, or does it support the OEM character set of the installed Windows (1252 etc.)? Is the ansistring type UTF-8, and different from whatever the user has chosen as the default ANSI encoding in his Windows installation, yes or no? Simple questions; if they are, it is unacceptable behavior for a library.

Yes, AnsiString is UTF-8 as is clearly explained in the wiki page. ShortString is UTF-8, too, which is logical IMO, although constant assignment to ShortString has some issues as explained here:
  http://wiki.lazarus.freepascal.org/Better_Unicode_Support_in_Lazarus#String_Literals

Quote
Any way it is clear that the road ahead is not one I'd like to walk. So I'm out for now. have fun everyone.

Please think what are the alternatives. FPC 3.x + Lazarus without -dEnableUTF8RTL leads to many problems as noticed by ChrisF, wp and many others.
The Delphi-compatible UTF-16 solution is still years away. This solution is an evolutionary continuation of the UTF-8 solution that the LCL already has. Without it, Lazarus would be doomed either to stay with FPC 2.6.4 and the explicit conversion-function hack for a long time, or to repel still more people with seriously broken Unicode support (ask ChrisF and wp for details).
We truly need a working Unicode solution. Competing languages / IDEs have supported it for ages.

The FUD was mostly about the place you wrote your opinions. This thread is otoien's honest question about Unicode support.
You can start new threads about solving problems in the alternative way of using FPC 3.x without -dEnableUTF8RTL. It is perfectly OK. You can co-operate with ChrisF and wp who seem to have the same goal.
« Last Edit: July 07, 2015, 09:29:32 am by JuhaManninen »
Mostly Lazarus trunk and FPC 3.2 on Manjaro Linux 64-bit.

marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 11382
  • FPC developer.
Yes, this is the downside. The solution is to encapsulate the read/write code into some "dirty" functions which can be ported to be compatible with Delphi and the future UTF-16 Lazarus solutions.

No, I won't mess up my Delphi code with temporary Lazarus hacks. (even if it were an option, which it isn't)

Messing around in complex, long-term stable code to insert manual conversions is hopeless. Within two seconds of running it, the customer finds yet another rarely used codepath that needs modifications.

For people with big exposure (read: need for ANSI encodings), or with quite large codebases of which large parts are maintenance-only, this is simply not an option.

For me it is not really a problem, since the worst codebases that suffer from this don't have to be ported to Lazarus anymore (maintenance-only, old framework), but if I were to look at them and think about converting, I wouldn't sleep tonight.

BeniBela

  • Hero Member
  • *****
  • Posts: 905
    • homepage
I think WinCPToUTF8() and UTF8ToWinCP() functions solve this.

But the entire point of codepage-aware strings in FPC 3 is to remove the need for all these conversion functions.

And that works really well if ansistring remains CP_ACP, without the silly -dEnableUTF8RTL hack.

Why is there no option to change the default codepage of string, without changing the codepage of ansistring? It would solve all the issues without causing further trouble.


Any way it is clear that the road ahead is not one I'd like to walk. So I'm out for now. have fun everyone.
OK. Subject is now closed for me.

But if everyone who needs ansistrings leaves, we will never get proper 1-byte string types.

marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 11382
  • FPC developer.
Why is there no option to change the default codepage of string, without changing the codepage of ansistring? It would solve all the issues, without causing further trouble

Because the 3.x language facilities were designed for Delphi compatibility and not for utf8 usage on Windows. One of the troubles of the Delphi design is that literals and other conversions convert over ansistring(0), the ACS encoding in Delphi.

This is backwards compatible but makes redesigning the LCL on top of utf8string difficult. The utf8 hack fixes that, but sacrifices ACS abilities in turn.






 
