Author Topic: WideString to AnsiString - data loss  (Read 15542 times)

guest60499

  • Guest
Re: WideString to AnsiString - data loss
« Reply #15 on: June 06, 2017, 06:47:50 pm »
From an application programming standpoint the main justification is not needing to differentiate octet streams and UTF-8 data (because most octet streams will end up being treated as Unicode data at some point). If you don't re-encode before passing to the OS you will be doing it before you save to disk or send it over the network.

Nearly all such formats are dynamically encoded anyway (due to BOM presence, an encoding field in the protocol, or (HTTP) metadata annotations). There are precious few raw wire and file formats guaranteed to be UTF-8 (except in Unix-derived software).

I guess I will concede that point, but UTF-8 is by far the most popular choice. Even if the HTTP or other protocol engine supports UTF-16 or UTF-32, the data will most likely be converted to UTF-8 for use in the application due to the handling benefits outlined in the link I gave.

Note the author mainly develops for Windows.

Moreover, that is about a totally separate issue, namely document encoding, which is a totally different issue from API/ABI encoding.

Well - you might be handling the same data, as you move it around, which makes it related. If it's not related then you solve multiple issues at once.

Since it is Windows that is the odd one out, it seems to make the most sense to quarantine its use of UTF-16 as close to its API as you can.

That depends on your situation. It is more than a simple count, since a lot of SME development will have Windows as the majority target. And increasingly, mobile targets are separate codebases, which have supplanted the minimal desktop apps for e.g. OS X.

If you read the sample code it should be pretty clear that re-encoding is essentially the same in FreePascal as it is in C++ (you call a function). It's very likely you will have to change encoding at some point, and there is a best place to do it if you want to remain cross platform.

Well, first, cross platform is not a given, and second, it is a weighted count rather than a straight one. It does not make sense to do a hatchet job on incoming Delphi code in an incompatible way for some naive sense of multiplatform when the likely target is again Delphi.

And doing it at the OS interface (with thousands of Windows calls to abstract) is IMHO very bad modularization. You typically modularize to minimize interactions (read: conversions).

Cross platform isn't a given, which is why you need to design for it. If you handle Unicode data as I've mentioned, you save yourself most of the headache of doing different things on different platforms. Within your program you are mostly able to maintain the concept of operating on strings of bytes for whatever operations you need to do.

Doing it at the OS interface seems pretty reasonable because of the handling benefits outlined in the article I gave. It isn't extremely visible when using Lazarus with the LCL as you aren't typically making WinAPI calls, but it's something to consider.

What made me suggest that this was tangential to what the OP wanted was that he already has UTF-16 data for some reason. However, he may want it as UTF-8 to process it, which is a fairly reasonable course of action.
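As a rough illustration of why re-encoding UTF-16 data to UTF-8 for processing is safe, here is a minimal sketch in Python rather than FreePascal (where the equivalent would be a conversion call such as the utf8encode() mentioned elsewhere in this thread). The sample text is made up; the point is only that the round trip between the two Unicode encodings preserves every character, including ones outside the Basic Multilingual Plane.

```python
# Illustrative only: the sample text is an assumption, and Python stands in
# for the FreePascal conversion call discussed in the thread.
text = "na\u00efve \u20ac \U0001F600"   # accented letter, euro sign, non-BMP emoji

# UTF-16 bytes, as a UTF-16 based API might hand them over.
utf16 = text.encode("utf-16-le")

# Re-encode to UTF-8 for use inside the application.
utf8 = utf16.decode("utf-16-le").encode("utf-8")

# Converting back yields exactly the same UTF-16 bytes: nothing was lost.
round_trip = utf8.decode("utf-8").encode("utf-16-le")
lossless = (round_trip == utf16)
```

Both encodings cover the full Unicode range, so the conversion can happen at whichever boundary suits the program without risking data loss.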

Ondrej Pokorny

  • Full Member
  • ***
  • Posts: 220
Re: WideString to AnsiString - data loss
« Reply #16 on: June 06, 2017, 06:56:39 pm »
The round() function is more like utf8encode(). It is a deliberate, typed conversion, and not done to silence a mere warning that something is wrong.

1.) You say the String() typecast in FPC 3 is an uncertain hack. This implies that using fpc_UnicodeStr_To_AnsiStr function is a hack as well because both calls are equivalent.
2.) You say that using Round() to assign an extended value to integer is a deliberate and correct way to "assign something with a greater range to something with potentially a smaller range". This implies that using fpc_UnicodeStr_To_AnsiStr is also deliberate and correct because it is also a function that "assigns something with a greater range to something with potentially a smaller range".

=> You contradict yourself.

BeniBela

  • Hero Member
  • *****
  • Posts: 948
    • homepage
Re: WideString to AnsiString - data loss
« Reply #17 on: June 06, 2017, 10:12:17 pm »
This is such an incredible mess  >:(

There should have been a compiler switch like $H to make string utf8string. Then there would never be a possible loss when assigning from UnicodeString.
 
There are already switches to make string shortstring, unicodestring, and ansistring. Just add utf8string to the list

guest60499

  • Guest
Re: WideString to AnsiString - data loss
« Reply #18 on: June 07, 2017, 12:11:18 am »
Quote from: BeniBela on June 06, 2017, 10:12:17 pm
This is such an incredible mess  >:(
There should have been a compiler switch like $H to make string utf8string. Then there would never be a possible loss when assigning from UnicodeString.
There are already switches to make string shortstring, unicodestring, and ansistring. Just add utf8string to the list

So can someone help me understand why the OP can't do what he tried to do? Does AnsiString just refuse to store UTF-8 encoded data on principle? If that is the case then a flag to set "String" to "UTF8String" is probably necessary, or better yet the constraints on AnsiString should be relaxed and it should be equivalent to UTF8String.

taazz

  • Hero Member
  • *****
  • Posts: 5368
Re: WideString to AnsiString - data loss
« Reply #19 on: June 07, 2017, 12:31:01 am »
Quote from: guest60499 on June 07, 2017, 12:11:18 am
So can someone help me understand why the OP can't do what he tried to do? Does AnsiString just refuse to store UTF-8 encoded data on principle?

Nope. He didn't like the warning that an automatic conversion from WideString/UnicodeString to String/AnsiString (i.e. a runtime-defined encoding) might end up converting to an encoding that does not support some of the characters in the string. That's it. That is the problem.

Quote from: guest60499 on June 07, 2017, 12:11:18 am
If that is the case then a flag to set "String" to "UTF8String" is probably necessary, or better yet the constraints on AnsiString should be relaxed and it should be equivalent to UTF8String.

Nope, all strings except WideString are UTF-8 in Lazarus > 1.6. If the compiler could count on the string type being UTF-8, there wouldn't be a warning.
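
The kind of loss the warning is about can be sketched as follows. This is Python, not FreePascal, and the sample text, the helper name to_code_page, and the two code pages are assumptions chosen purely for illustration: when the target encoding is only known at run time, some characters may have no representation in it, whereas UTF-8 can hold all of them.

```python
# Illustrative only: the sample text, the helper name and the code pages
# are assumptions, not taken from the thread.
text = "Gr\u00fc\u00dfe, \u041c\u043e\u0441\u043a\u0432\u0430"   # German + Cyrillic

def to_code_page(s, code_page):
    """Convert to a legacy code page; unsupported characters become '?'."""
    return s.encode(code_page, errors="replace").decode(code_page)

as_cp1252 = to_code_page(text, "cp1252")   # Western European: Cyrillic is lost
as_cp1251 = to_code_page(text, "cp1251")   # Cyrillic: the German letters are lost
as_utf8 = text.encode("utf-8").decode("utf-8")   # UTF-8 keeps everything
```

Which characters survive depends entirely on the code page in effect, which is why a compiler that cannot assume UTF-8 has to warn.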
Good judgement is the result of experience … Experience is the result of bad judgement.

OS : Windows 7 64 bit
Laz: Lazarus 1.4.4 FPC 2.6.4 i386-win32-win32/win64

J-G

  • Hero Member
  • *****
  • Posts: 966
Re: WideString to AnsiString - data loss
« Reply #20 on: June 08, 2017, 01:16:43 am »
Wow !! - Sorry, I've been busy re-setting music all day and haven't had a chance to do my normal log-on every 20 minutes or so :)

I had no idea that my simple question would open such a can of worms !

Since the data I deal with is always in English - and therefore has no characters with accents - I've not needed to investigate Unicode UTF8/16/32 so was just intrigued to know why I got a warning that my code (involving a simple assignment of a string variable to what I thought was a string of ASCII characters) was potentially liable to data loss.
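
The intuition that plain English text is safe can be checked with a small sketch. It is written in Python rather than FreePascal, and the helper name, the sample strings, and the list of code pages are all assumptions for illustration: pure ASCII text encodes to identical bytes in every listed encoding, while a single accented letter does not.

```python
# Illustrative only: the helper name, sample strings and code-page list
# are assumptions, not taken from the thread.
CODE_PAGES = ["cp1252", "cp1251", "cp437", "utf-8"]

def survives_everywhere(s):
    """True if s encodes without error to the same bytes in every listed encoding."""
    try:
        encoded = {s.encode(cp) for cp in CODE_PAGES}
    except UnicodeEncodeError:
        return False
    return len(encoded) == 1

plain = survives_everywhere("Allegro moderato, bar 12")   # ASCII only
accented = survives_everywhere("Caf\u00e9")               # one non-ASCII letter
```

The compiler cannot know at compile time that the data will always be ASCII, so it warns about the conversion regardless of what the strings actually contain.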

At least your discussion has given me a better understanding of string handling :)

FPC 3.0.0 - Lazarus 1.6 &
FPC 3.2.2  - Lazarus 2.2.0 
Win 7 Ult 64

 
