
Author Topic: Issues with new strings of FPC trunk  (Read 27128 times)

stocki

  • Full Member
  • ***
  • Posts: 144

ChrisF

  • Hero Member
  • *****
  • Posts: 542
Re: Issues with new strings of FPC trunk
« Reply #46 on: July 04, 2015, 07:53:04 pm »
Quote
[...] I modified your example from above and, using Lazarus' character map, I added some Japanese characters which definitely are not on my 1252 code page. But these characters are displayed correctly. How is that? [...]

Got it...

Preliminary note (just in case): any Windows API call in the LCL is usually coded this way when text is involved, so that either the A (ANSI) or the W (wide) version of the API is used.

Code:

{$ifdef WindowsUnicodeSupport}
  if UnicodeEnabledOS then
    (Use W API)
  else
    (Use A API)
{$else}
  (use A API)
{$endif}

In practice, the W API set is almost always used: DisableWindowsUnicodeSupport is not defined, and the Windows platform is >= VER_PLATFORM_WIN32_NT.

When the W API set is used, text coming from the LCL is converted from UTF-8 to UTF-16 before it is passed to the relevant W function. The conversion is done by a "pure LCL" function called UTF8ToUTF16.

The real code for this function is in fact in the ConvertUTF8ToUTF16 function (in the LazUTF8 unit).

Looking at that code, the function simply doesn't care about the code page or anything similar. It takes the raw input bytes and ALWAYS processes them as UTF-8 text (i.e. whatever the code page value of the container is).

Therefore, if you put valid UTF-8 data in any 1-byte string type, the conversion will always succeed, and so the data sent to the W function is correct.

Furthermore, no data is lost when converting from UTF-8 to UTF-16.
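
To illustrate the principle only (this is not the LCL code): a minimal sketch, assuming FPC 3.0+, that uses the RTL function UTF8Decode as a stand-in for UTF8ToUTF16. The helpers MislabeledUtf8 and BytesAsUtf8ToUtf16 are hypothetical names made up for this example. The string is deliberately labeled with code page 1252, yet its bytes are still decoded as UTF-8.

```pascal
program RawUtf8Demo;
{$mode objfpc}{$H+}

// Hypothetical helper: builds the UTF-8 bytes of U+3042 (E3 81 82) and then
// labels them as code page 1252 without converting them.
function MislabeledUtf8: RawByteString;
begin
  SetLength(Result, 3);
  Result[1] := #$E3;
  Result[2] := #$81;
  Result[3] := #$82;
  SetCodePage(Result, 1252, False);
end;

// Treats the bytes of S as UTF-8, whatever code page S is labeled with.
// Assumption: UTF8Decode stands in here for the LCL's UTF8ToUTF16.
function BytesAsUtf8ToUtf16(const S: RawByteString): UnicodeString;
begin
  Result := UTF8Decode(S);
end;

var
  Raw: RawByteString;
  Wide: UnicodeString;
begin
  Raw := MislabeledUtf8;
  Wide := BytesAsUtf8ToUtf16(Raw);
  WriteLn('Code page label:   ', StringCodePage(Raw));
  WriteLn('UTF-16 code units: ', Length(Wide));
  WriteLn('First code unit:   $', HexStr(Ord(Wide[1]), 4));
end.
```

The label says 1252, but the decoder never looks at it, which is why Japanese text stored as UTF-8 bytes can still reach a W function intact.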

« Last Edit: July 04, 2015, 08:19:00 pm by ChrisF »

rtusrghsdfhsfdhsdfhsfdhs

  • Full Member
  • ***
  • Posts: 162
Re: Issues with new strings of FPC trunk
« Reply #47 on: July 05, 2015, 12:27:47 am »
So there is a conversion going on inside the LCL between UTF-8 and UTF-16? How often do these conversions happen?

taazz

  • Hero Member
  • *****
  • Posts: 5368
Re: Issues with new strings of FPC trunk
« Reply #48 on: July 05, 2015, 01:41:17 am »
Quote
So there is a conversion going on inside the LCL between UTF-8 and UTF-16? How often do these conversions happen?

Every time you talk to any Windows API.
Good judgement is the result of experience … Experience is the result of bad judgement.

OS : Windows 7 64 bit
Laz: Lazarus 1.4.4 FPC 2.6.4 i386-win32-win32/win64

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4467
  • I like bugs.
Re: Issues with new strings of FPC trunk
« Reply #49 on: July 05, 2015, 02:06:12 am »
Japanese and any other character constants have worked so far because the source file is UTF-8 but has no BOM, so FPC copies the bytes as-is.
It is explained here:
  http://wiki.freepascal.org/LCL_Unicode_Support#UTF8_and_source_files_-_the_missing_BOM
In the new system -FcUTF8 or {$codepage utf8} is needed, and the source file still must be UTF-8.

Quote
So there is a conversion going on inside the LCL between UTF-8 and UTF-16? How often do these conversions happen?

Only when you call a Windows API or library code that uses WideString or UnicodeString.
Does anybody have measurements of how much time the conversions take? I have noticed that they are fast, but I don't have exact numbers. I think the slow-down from conversions in Windows programs is very marginal.
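
For anyone who wants numbers, a rough timing sketch could look like the following. It is only a starting point under several assumptions: it uses the RTL functions UTF8Decode and UTF8Encode as a stand-in for the LCL conversion functions, the helper RoundTrip and the iteration count are made up for this example, and GetTickCount64 is coarse, so the result is indicative at best.

```pascal
program ConvTiming;
{$mode objfpc}{$H+}

uses
  SysUtils;

const
  Iterations = 1000000; // arbitrary value; tune it for your machine

// Hypothetical helper: one UTF-8 -> UTF-16 -> UTF-8 round trip, roughly what
// happens when text is sent to a W API and a text result comes back.
function RoundTrip(const S: RawByteString): RawByteString;
begin
  Result := UTF8Encode(UTF8Decode(S));
end;

var
  Sample, Back: RawByteString;
  I: Integer;
  StartTick: QWord;
begin
  Sample := 'A short caption, as typically passed to a Windows text API';
  Back := '';
  StartTick := GetTickCount64;
  for I := 1 to Iterations do
    Back := RoundTrip(Sample);
  WriteLn(Iterations, ' round trips took ', GetTickCount64 - StartTick, ' ms');
  WriteLn('Length of last result: ', Length(Back));
end.
```

Comparing the time per round trip with the cost of the API call itself (drawing text, setting a caption) would show whether the conversion matters in practice.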

Quote
The real code for this function is in fact in the ConvertUTF8ToUTF16 function (in the LazUTF8 unit).

Looking at that code, the function simply doesn't care about the code page or anything similar. It takes the raw input bytes and ALWAYS processes them as UTF-8 text (i.e. whatever the code page value of the container is).

ConvertUTF8ToUTF16 is what LCL + FPC 2.6.x uses now. I don't think it will be needed any more in the new system (but I am not sure).

How should we fix the various conversion functions, such as:
  function UTF8ToCP1250(const s: string): string;
If I set the string encoding with:
Code:
SetCodePage(RawByteString(Result), xxx, False);
then what is "xxx"?
I could fix them tomorrow (if I figure out how); patches are welcome, too.
Mostly Lazarus trunk and FPC 3.2 on Manjaro Linux 64-bit.

rtusrghsdfhsfdhsdfhsfdhs

  • Full Member
  • ***
  • Posts: 162
Re: Issues with new strings of FPC trunk
« Reply #50 on: July 05, 2015, 02:11:33 am »
Quote
Every time you talk to any Windows API.

Wow, well, there is one performance killer right there. I thought the LCL was UTF-16? If you have a lot of WinAPI calls, it's going to add up.
« Last Edit: July 05, 2015, 02:14:28 am by Fiji »

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4467
  • I like bugs.
Re: Issues with new strings of FPC trunk
« Reply #51 on: July 05, 2015, 02:30:35 am »
Quote
Wow, well, there is one performance killer right there. I thought the LCL was UTF-16? If you have a lot of WinAPI calls, it's going to add up.

Please do some measurements and let us know how big the performance hit is. I don't think it is bad at all.

BTW, the LCL has used UTF-8 for ~15 years, and Windows applications have always suffered from the same conversions. Nothing new there.
« Last Edit: July 05, 2015, 02:35:19 am by JuhaManninen »

taazz

  • Hero Member
  • *****
  • Posts: 5368
Re: Issues with new strings of FPC trunk
« Reply #52 on: July 05, 2015, 02:36:56 am »
Quote
BTW, the LCL has used UTF-8 for ~15 years, and Windows applications have always suffered from the same conversions. Nothing new there.

Was it? Wasn't it treated as the ANSI code page, with no conversion to ANSI?

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4467
  • I like bugs.
Re: Issues with new strings of FPC trunk
« Reply #53 on: July 05, 2015, 02:46:40 am »
Quote
Was it? Wasn't it treated as the ANSI code page, with no conversion to ANSI?

I am not sure of the time scale, but as far as I know the LCL has always been UTF-8 only.
UTF-8 data is kept in AnsiString and converted with explicit conversion functions whenever needed. It is clumsy and hackish.
The new system may also be considered a hack, but it is a much better hack.

taazz

  • Hero Member
  • *****
  • Posts: 5368
Re: Issues with new strings of FPC trunk
« Reply #54 on: July 05, 2015, 02:48:23 am »
Quote
I am not sure of the time scale, but as far as I know the LCL has always been UTF-8 only.
UTF-8 data is kept in AnsiString and converted with explicit conversion functions whenever needed.

That would mean that if my ANSI code page does not support Unicode, then I can't TextOut Unicode, doesn't it?

ChrisF

  • Hero Member
  • *****
  • Posts: 542
Re: Issues with new strings of FPC trunk
« Reply #55 on: July 05, 2015, 03:03:27 am »
Quote
ConvertUTF8ToUTF16 is what LCL + FPC 2.6.x uses now. I don't think it will be needed any more in the new system (but I am not sure).

I was just surprised to find "native" conversion functions (ConvertUTF8ToUTF16 and ConvertUTF16ToUTF8). Until now, I thought it worked like UTF8ToSys for the ANSI API set: based on a Free Pascal function (Utf8ToAnsi for UTF8ToSys), or now using the "magic automatic" conversion that comes with Free Pascal 3.0+.


Quote
Then what is "xxx"?

1250

But is it really THE solution?

I mean, this only changes the dynamic code page value. One could (falsely) interpret this as having a string with 1250 as its code page, which is only temporarily true.

I have already given a sample of this in this post: http://forum.lazarus.freepascal.org/index.php/topic,28941.msg182057.html#msg182057 (last part of the post, code and result).

Maybe some reflection before making any changes would be worthwhile, especially a few other opinions, no? (This is only my own opinion.)
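
To make "only changing the dynamic code page value" concrete, here is a minimal sketch, assuming FPC 3.0+. The helper Relabel is a hypothetical name made up for this example and is not part of the LCL. With the last parameter of SetCodePage set to False, the bytes are left as they are and only the label stored with the string changes.

```pascal
program RelabelDemo;
{$mode objfpc}{$H+}

// Hypothetical helper: returns a copy of S whose dynamic code page is set
// to CP. The bytes themselves are left untouched (Convert = False).
function Relabel(const S: RawByteString; CP: TSystemCodePage): RawByteString;
begin
  Result := S;
  SetCodePage(Result, CP, False);
end;

var
  Original, Labeled: RawByteString;
begin
  Original := 'plain ASCII sample';
  Labeled := Relabel(Original, 1250);
  WriteLn('Code page before: ', StringCodePage(Original));
  WriteLn('Code page after:  ', StringCodePage(Labeled));
  WriteLn('Same length:      ', Length(Labeled) = Length(Original));
end.
```

Whether the label survives later operations on the string is a separate question, and that is the part that worries me.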



Quote
So there is a conversion going on inside the LCL between UTF-8 and UTF-16? How often do these conversions happen?
Every time you talk to any Windows API.

May I amend your answer a little, if you don't mind:
[Every time you talk to any Windows API,] when text is involved in the API (parameters or result; PChar, PWideChar, ...).

And I'm not trying to say that this represents just a small, or a big, part of all the API calls in the LCL, because I simply don't know how much.

I don't even have an estimate: some tests to get average values for the overhead would be quite interesting (I'd like to see them, if anyone has some...).



Quote
I am not sure of the time scale, but as far as I know the LCL has always been UTF-8 only.

Same here. I've always seen these conversions (UTF8<->Ansi API and/or UTF8<->Wide API) in the LCL code, at least for the last few years.

(edited)
« Last Edit: July 05, 2015, 03:25:03 am by ChrisF »

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4467
  • I like bugs.
Re: Issues with new strings of FPC trunk
« Reply #56 on: July 05, 2015, 03:12:21 am »
Quote
That would mean that if my ANSI code page does not support Unicode, then I can't TextOut Unicode, doesn't it?

Yes, conversion from Unicode to an ANSI code page is lossy. The new system will be better in this regard, too.

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4467
  • I like bugs.
Re: Issues with new strings of FPC trunk
« Reply #57 on: July 05, 2015, 03:35:16 am »
Quote
1250

Right, TSystemCodePage = Word. It is an unsigned number.

Quote
But is it really THE solution?
I mean, this only changes the dynamic code page value. One could (falsely) interpret this as having a string with 1250 as its code page, which is only temporarily true.
I have already given a sample of this in this post: http://forum.lazarus.freepascal.org/index.php/topic,28941.msg182057.html#msg182057 (last part of the post, code and result).
Maybe some reflection before making any changes would be worthwhile, especially a few other opinions, no? (This is only my own opinion.)

Your problem happened because you don't use -FcUTF8. You must use it in any case, even if you don't use -dEnableUTF8RTL to map type String to UTF-8.
To me it sounds right to set the code page to what the string data actually is. Why should it be something else? "Dynamic" and "temporary" are almost the same thing. The encoding can also change dynamically when the data changes.
But let's wait for comments. I must learn more again ...

ChrisF

  • Hero Member
  • *****
  • Posts: 542
Re: Issues with new strings of FPC trunk
« Reply #58 on: July 05, 2015, 04:01:33 am »
Quote
[...] Your problem happened because you don't use -FcUTF8. You must use it in any case, even if you don't use -dEnableUTF8RTL to map type String to UTF-8. [...]

Are we talking about the same thing? (I must confess that my link was probably very imprecise.)

Here is what I meant with my link:

Code:
procedure TForm1.Button1Click(Sender: TObject);
var
  ws: WideString;
  s: string;
begin
  ws := 'test';
  s := UTF8Encode(ws);   // s now carries the dynamic code page CP_UTF8
  ShowMessage('CP=' + IntToStr(StringCodePage(s)));
  s := s + ' finished';  // the concatenation result gets the default code page again
  ShowMessage('CP=' + IntToStr(StringCodePage(s)));
end;

The first value is CP_UTF8 = 65001, but the second one is back to 1252 (if 1252 is your active Windows code page).

These results are probably not surprising to you, as you've been playing with all this new stuff for a while now. But I'm sorry to say that was not the case for me: I was surprised.
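
For comparison, a minimal sketch of the same two steps with the variable declared as UTF8String instead of string, assuming FPC 3.0+. The helper AppendAsUtf8 is a hypothetical name made up for this example. My understanding is that the concatenation result takes the declared code page of the destination, so the CP_UTF8 label should survive here, but please treat that as an expectation to check rather than a guarantee.

```pascal
program KeepUtf8Label;
{$mode objfpc}{$H+}

// Hypothetical helper: same steps as the Button1Click example, but the
// working variable (Result) is declared as UTF8String instead of string.
function AppendAsUtf8(const W: WideString): UTF8String;
begin
  Result := UTF8Encode(W);
  Result := Result + ' finished';
end;

begin
  // Prints the dynamic code page of the concatenated string.
  WriteLn('CP=', StringCodePage(AppendAsUtf8('test')));
end.
```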


Quote
[...] But let's wait for comments. [...]

Agreed.

rtusrghsdfhsfdhsdfhsfdhs

  • Full Member
  • ***
  • Posts: 162
Re: Issues with new strings of FPC trunk
« Reply #59 on: July 05, 2015, 12:42:02 pm »
I'd be willing to port the LCL to UTF-16, as well as WIN32SDK.

 
