
Author Topic: WideString to AnsiString - data loss  (Read 15335 times)

J-G

  • Hero Member
  • *****
  • Posts: 966
WideString to AnsiString - data loss
« on: June 06, 2017, 01:55:02 pm »
Following my foray with reading a directory from a web-page, I would like to resolve a warning that appears at the end of compilation.

I get the warning that 'Implicit string type conversion with potential data loss from "WideString" to "AnsiString"'. The code that gives rise to this is :

Code: Pascal
F := tdomelement(els[x]).getattribute('href');

'F' is declared as 'String', but I have also changed that to 'AnsiString' for testing and get the same warning.

The data that is returned in F is good and can be used in further data manipulation but I would like to know how I can correct what appears to be an 'issue'.
 
FPC 3.0.0 - Lazarus 1.6 &
FPC 3.2.2  - Lazarus 2.2.0 
Win 7 Ult 64

marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 12523
  • FPC developer.
Re: WideString to AnsiString - data loss
« Reply #1 on: June 06, 2017, 02:05:16 pm »
You can't. The problem is that the knowledge to suppress cases where it doesn't go wrong is a runtime issue (the default code page can be changed at runtime/startup).

The only correct way is to keep everything utf16 (widestring, unicodestring), and avoid possibly leaky conversions. (reducing the number of conversions is a good practice anyway)
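A minimal sketch of the "keep everything UTF-16" route: when the receiving variable has the same type as the DOM result, no conversion happens and the warning has nothing to report. The document and element built here are made up purely for illustration; only DOMString, TXMLDocument, CreateElement, SetAttribute and GetAttribute come from FPC's DOM unit.

```pascal
program KeepWide;
{$mode objfpc}{$H+}
uses
  DOM;
var
  Doc: TXMLDocument;
  El: TDOMElement;
  F: DOMString;   // same type that GetAttribute returns, so no conversion
begin
  Doc := TXMLDocument.Create;
  try
    // Hypothetical element standing in for els[x] from the question
    El := Doc.CreateElement('a');
    El.SetAttribute('href', 'index.html');
    Doc.AppendChild(El);

    F := El.GetAttribute('href');   // DOMString to DOMString, no warning
    WriteLn('href has ', Length(F), ' characters');
  finally
    Doc.Free;   // the document owns and frees its nodes
  end;
end.
```

The conversion to an 8-bit string then only has to happen at the point where one is really needed, if at all.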

wp

  • Hero Member
  • *****
  • Posts: 13195
Re: WideString to AnsiString - data loss
« Reply #2 on: June 06, 2017, 02:11:46 pm »
Quote
You can't.

Cast to string?
Code: Text
F := string(tdomelement(els[x]).getattribute('href'));

marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 12523
  • FPC developer.
Re: WideString to AnsiString - data loss
« Reply #3 on: June 06, 2017, 02:21:59 pm »
Quote
You can't.

Cast to string?
Code: Text
F := string(tdomelement(els[x]).getattribute('href'));

That also suppresses the warning if a change in the future changes the situation. I don't like making code modifications (especially if they are not equivalent) to silence warnings. IMHO the solution is worse than the problem.

If you like to minimize warnings, first do it on a more structural level (e.g. make code that belongs together use the same stringtype, iow modularize),  and silence the remaining ones with a system like {%h-} that can be undone mechanically and quickly. (either by compiling on the commandline, or some SEDing replacing all {%h-} with nothing in a copy of the code)


In the past I've used a system with $user in the line before the warning. This made it possible to filter the compiler output in the CI system. Like $warn directives, but they can be turned off easier, by simply not filtering.
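For comparison, a rough sketch of the $warn style mentioned above, scoped with {$push}/{$pop} so that it only covers one statement and is easy to find and remove again. The message number 4104 is an assumption about which number this warning has in your compiler version; compiling with -vq prints the numbers so it can be checked.

```pascal
program WarnScope;
{$mode objfpc}{$H+}
var
  W: WideString;
  F: AnsiString;
begin
  W := 'index.html';

  {$push}
  // Assumption: 4104 is "Implicit string type conversion with potential
  // data loss". Verify the number with "fpc -vq" before relying on it.
  {$warn 4104 off}
  F := W;   // still an implicit conversion; only the message is hidden
  {$pop}

  WriteLn(F);
end.
```

The assignment itself is unchanged, so removing the three directive lines brings the warning straight back.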
« Last Edit: June 06, 2017, 02:23:34 pm by marcov »

Ondrej Pokorny

  • Full Member
  • ***
  • Posts: 220
Re: WideString to AnsiString - data loss
« Reply #4 on: June 06, 2017, 02:48:29 pm »
Casting to string is absolutely OK since FPC 3.0.0. The UnicodeString is converted with current codepage (DefaultSystemCodePage). If you set DefaultSystemCodePage to UTF-8, there won't be any data loss. Otherwise it will be converted to your 8-bit encoding - which is wanted because you cast it to 8-bit string.
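A small sketch of what this looks like in practice, assuming FPC 3.0 or later. SetMultiByteConversionCodePage is the RTL routine that sets DefaultSystemCodePage; the sample text is made up, and the cwstring unit is included on the assumption that a Unix-like target may need a widestring manager for the conversion.

```pascal
program CastDemo;
{$mode objfpc}{$H+}
uses
  {$ifdef unix}cwstring,{$endif} // assumed to be needed on Unix-like targets
  SysUtils;                      // keeps the uses clause non-empty elsewhere
var
  U: UnicodeString;
  A: AnsiString;
begin
  // Make UTF-8 the code page used for UnicodeString <-> AnsiString conversions
  SetMultiByteConversionCodePage(CP_UTF8);

  // Hypothetical sample text containing one non-ASCII character (U+00E9)
  U := UnicodeString('caf') + WideChar($00E9);

  // Explicit cast: converted using DefaultSystemCodePage, no warning emitted
  A := AnsiString(U);

  WriteLn('UTF-16 code units: ', Length(U));
  WriteLn('bytes after cast:  ', Length(A));
  WriteLn('round trip intact: ', UnicodeString(A) = U);
end.
```

Without the SetMultiByteConversionCodePage call the result depends on whatever code page the system defaults to, and that is the case the warning is about.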

marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 12523
  • FPC developer.
Re: WideString to AnsiString - data loss
« Reply #5 on: June 06, 2017, 03:03:59 pm »
Quote
Casting to string is absolutely OK since FPC 3.0.0.

It can hide the warning about data loss, and depending on configuration, data loss can still happen on 3.0; while compiling, the compiler has no information about DefaultSystemCodePage.


J-G

  • Hero Member
  • *****
  • Posts: 966
Re: WideString to AnsiString - data loss
« Reply #6 on: June 06, 2017, 03:13:42 pm »
Thanks all.

Yes, the casting works, i.e. the project compiles without warnings and the resulting program does what I expect.

I don't understand the point marcov makes about 'a change in the future changes the situation' though.

Since the data destined to be held in 'F' is never going to be greater than about 30 characters I've made F a string[30] without problem. Not that it changes the size of the program but it may have a bearing on the amount of memory needed at run-time which may be pertinent.

The character set of the data is never going to be outside the 'normal' ASCII range and always in English, (it is under my control) so marcov's reservations - whilst justified - are (I think) in this case not relevant.

The assurance from Ondrej Pokorny is welcome though (yes, I'm using FPC 3.0).

Ondrej Pokorny

  • Full Member
  • ***
  • Posts: 220
Re: WideString to AnsiString - data loss
« Reply #7 on: June 06, 2017, 03:17:23 pm »
Quote
Casting to string is absolutely OK since FPC 3.0.0.

It can hide the warning about data loss, and depending on configuration, data loss can still happen on 3.0; while compiling, the compiler has no information about DefaultSystemCodePage.

...yes and it is completely OK and valid. How do you want to assign a UnicodeString to AnsiString without dataloss? It's completely clear and fine to do that - the same when you do SingleValue := ExtendedValue (or IntValue := Round(ExtendedValue)).

If you want to assign a floating point value to an integer, you count with data loss as well.

marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 12523
  • FPC developer.
Re: WideString to AnsiString - data loss
« Reply #8 on: June 06, 2017, 03:22:26 pm »
Quote
...yes and it is completely OK and valid. How do you want to assign a UnicodeString to AnsiString without dataloss?

That is not the question. The question is whether you want to get a warning if you do. Some of the possible causes mean you have a bug in your code. Silencing the warning with a typecast is therefore dumb, since when you use the same code in e.g. a service, and don't set utf8, BANG.

Quote
It's completely clear and fine to do that - the same when you do SingleValue := ExtendedValue (or IntValue := Round(ExtendedValue)).

The first is IMHO worth a warning for exactly the same reasons, and the second is a deliberate programming construct. Not an uncertain hack to clean out warnings.

Quote
If you want to assign a floating point value to an integer, you count with data loss as well.

The compiler won't let you. With good reason. For larger to smaller integer there are range checks.
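To make the comparison concrete, here is a short sketch of the two numeric cases. The values are made up, and StrToInt is only used so that the compiler cannot see the out-of-range value at compile time.

```pascal
program RangeDemo;
{$mode objfpc}{$H+}
{$R+}   // enable range checking
uses
  SysUtils;
var
  X: Double;
  I: Integer;
  B: Byte;
begin
  X := 2.7;
  // I := X;        // does not compile: a float is not assignable to an integer
  I := Round(X);    // deliberate, explicit conversion
  WriteLn('Rounded: ', I);

  I := StrToInt('300');   // too large for a Byte
  try
    B := I;               // with {$R+} this is checked at run time
    WriteLn('Stored: ', B);
  except
    on E: ERangeError do
      WriteLn('Range check caught: ', E.Message);
  end;
end.
```

With range checking switched off, the second assignment would silently store a truncated value instead.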

guest60499

  • Guest
Re: WideString to AnsiString - data loss
« Reply #9 on: June 06, 2017, 03:59:07 pm »
Quote
You can't. The problem is that the knowledge to suppress cases where it doesn't go wrong is a runtime issue (the default code page can be changed at runtime/startup).

The only correct way is to keep everything utf16 (widestring, unicodestring), and avoid possibly leaky conversions. (reducing the number of conversions is a good practice anyway)

Maybe tangential to the question, but I think rather important. You might want to look at http://utf8everywhere.org/ for a list of reasons to always use UTF-8, even on Windows - convert before passing to the OS API.

marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 12523
  • FPC developer.
Re: WideString to AnsiString - data loss
« Reply #10 on: June 06, 2017, 04:39:08 pm »
Quote
Maybe tangential to the question, but I think rather important. You might want to look at http://utf8everywhere.org/ for a list of reasons to always use UTF-8, even on Windows - convert before passing to the OS API.

The named reasons for that lie in bad Windows support of C standards. That is not relevant for FPC/Delphi. In general, IMHO the article is not very convincing. Looks like they first decided it should be utf8 (probably *nix origin), and then started piling up vague and loosely connected arguments.

If you are an independent contractor, you can't simply have the case where you keep your majority OS at arm's length.
« Last Edit: June 06, 2017, 04:42:05 pm by marcov »

guest60499

  • Guest
Re: WideString to AnsiString - data loss
« Reply #11 on: June 06, 2017, 05:25:41 pm »
Quote
Maybe tangential to the question, but I think rather important. You might want to look at http://utf8everywhere.org/ for a list of reasons to always use UTF-8, even on Windows - convert before passing to the OS API.

The named reasons for that lie in bad Windows support of C standards. That is not relevant for FPC/Delphi. In general, IMHO the article is not very convincing. Looks like they first decided it should be utf8 (probably *nix origin), and then started piling up vague and loosely connected arguments.

If you are an independent contractor, you can't simply have the case where you keep your majority OS at arm's length.

From an application programming standpoint the main justification is not needing to differentiate octet streams and Unicode data (because most octet streams will end up being treated as Unicode data at some point). If you don't re-encode before passing to the OS you will be doing it before you save to disk or send it over the network. Since it is Windows that is the odd one out, it seems to make the most sense to quarantine its use of UTF-16 as close to its API as you can.

If you read the sample code it should be pretty clear that re-encoding is essentially the same in FreePascal as it is in C++ (you call a function). It's very likely you will have to change encoding at some point, and there is a best place to do it if you want to remain cross platform.

Ondrej Pokorny

  • Full Member
  • ***
  • Posts: 220
Re: WideString to AnsiString - data loss
« Reply #12 on: June 06, 2017, 05:42:16 pm »
Quote
...yes and it is completely OK and valid. How do you want to assign a UnicodeString to AnsiString without dataloss?

That is not the question. The question is whether you want to get a warning if you do. Some of the possible causes mean you have a bug in your code. Silencing the warning with a typecast is therefore dumb, since when you use the same code in e.g. a service, and don't set utf8, BANG.

Marcov, you are wrong.

You have to distinguish between "silencing a warning with an explicit typecast" (myEnum := TMyEnum(-1)) and explicitly converting a UnicodeString to AnsiString with string() in FPC 3.0.0. It's not the same !!!

In case of "myEnum := TMyEnum(-1)" you get an invalid value in myEnum.

In case of MyAnsiString := string(MyUnicodeString) you always get a valid string. FPC 3.0.0 calls fpc_UnicodeStr_To_AnsiStr in this case. So:
Code:
MyAnsiString := string(MyUnicodeString)
is equivalent to
Code:
MyAnsiString := fpc_UnicodeStr_To_AnsiStr(MyUnicodeString, DefaultSystemCodePage)
Both are valid and absolutely safe calls and no hacks whatsoever.

Quote
It's completely clear and fine to do that - the same when you do SingleValue := ExtendedValue (or IntValue := Round(ExtendedValue)).

The first is IMHO worth a warning for exactly the same reasons, and the second is a deliberate programming construct. Not an uncertain hack to clean out warnings.

As I prove above, "MyAnsiString := string(MyUnicodeString)" is not an "uncertain hack to clean out warnings" but a valid shortcut for fpc_UnicodeStr_To_AnsiStr function.

Quote
If you want to assign a floating point value to an integer, you count with data loss as well.

The compiler won't let you. With good reason. For larger to smaller integer there are range checks.

I meant assign a floating point value to an integer with e.g. the Round() function. And the compiler will let you do the direct assignment if you overload the assignment operator for that.

marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 12523
  • FPC developer.
Re: WideString to AnsiString - data loss
« Reply #13 on: June 06, 2017, 06:25:38 pm »
Quote
From an application programming standpoint the main justification is not needing to differentiate octet streams and UTF-8 data (because most octet streams will end up being treated as Unicode data at some point). If you don't re-encode before passing to the OS you will be doing it before you save to disk or send it over the network.

Nearly all such formats are dynamically encoded (due to BOM presence, an encoding field in the protocol or (HTTP) metadata annotations) anyway. There are precious few raw wire and file formats guaranteed UTF8 (except in Unix derived software).

Moreover, that is about a totally separate issue, namely document encoding. Which is a totally different issue from API/ABI encoding.

Quote
Since it is Windows that is the odd one out, it seems to make the most sense to quarantine its use of UTF-16 as close to its API as you can.

That depends on your situation. It is more than a simple count, since a lot of SME development will have windows as majority target. And increasingly mobile targets are separate codebases, which have supplanted the minimal desktop apps for e.g. OS X. 

Quote
If you read the sample code it should be pretty clear that re-encoding is essentially the same in FreePascal as it is in C++ (you call a function). It's very likely you will have to change encoding at some point, and there is a best place to do it if you want to remain cross platform.

Well, first cross platform is not a given, and second it is a weighted count rather than a straight one.  It does not make sense to do a hatchet job on incoming Delphi code in an incompatible way for some naive sense of multiplatform when the likely target is again Delphi.

And doing it at the OS interface (with thousands of Windows calls to abstract) is IMHO very bad modularization. You typically modularize to minimize interactions (read: conversions).


marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 12523
  • FPC developer.
Re: WideString to AnsiString - data loss
« Reply #14 on: June 06, 2017, 06:29:54 pm »
Quote
You have to distinguish between "silencing a warning with an explicit typecast" (myEnum := TMyEnum(-1)) and explicitly converting a UnicodeString to AnsiString with string() in FPC 3.0.0. It's not the same !!!

In case of "myEnum := TMyEnum(-1)" you get an invalid value in myEnum.

In case of MyAnsiString := string(MyUnicodeString) you always get a valid string.

You assign something with a greater range to something with potentially a smaller range. Just like extended to integer (which is not allowed) or higher integer to lower integer (which is guarded by rangechecks).

I do know how the FPC unicode implementation works, I worked on it, and converted most of the Windows RTL to it.

Quote
I meant assign a floating point value to an integer with e.g. the Round() function. And the compiler will let you do the direct assignment if you overload the assignment operator for that.

The round() function is more something like utf8encode(). It is a deliberate, typed, conversion, and not done to silence a mere warning that something is wrong.
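As an illustration of such a deliberate, typed conversion, here is a minimal sketch using UTF8Encode and UTF8Decode from the System unit. The sample text is made up; the point is only that the target encoding is named in the code rather than taken from the default code page.

```pascal
program Utf8Demo;
{$mode objfpc}{$H+}
var
  U, Back: UnicodeString;
  S: UTF8String;
begin
  // Hypothetical sample text with one non-ASCII character (U+00E9)
  U := UnicodeString('caf') + WideChar($00E9);

  S := UTF8Encode(U);      // explicit UTF-16 to UTF-8 conversion
  Back := UTF8Decode(S);   // and back again

  WriteLn('bytes in UTF-8:    ', Length(S));
  WriteLn('round trip intact: ', Back = U);
end.
```

Because the result is a UTF8String, the encoding stays attached to the type instead of depending on a runtime setting.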

 
