
Author Topic: using the CODEPAGE correctly?  (Read 29004 times)

helio

  • New Member
  • *
  • Posts: 29
using the CODEPAGE correctly?
« on: June 21, 2016, 01:28:43 pm »
 using the CODEPAGE correctly?

Code: Pascal
program Project1;
{$mode objfpc}{$H+}
{$Codepage UTF8}

var
  S: string;
  I1, I2: ShortInt;
begin
  S := 'CÇ';
  I1 := Length(S);
  I2 := Length('CÇ');
  WriteLn(I1);
  WriteLn(I2);
end.


       

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4631
  • I like bugs.
Re: using the CODEPAGE correctly?
« Reply #1 on: June 21, 2016, 02:05:23 pm »
Don't use {$Codepage UTF8}. It is explained (somewhat) here:
  http://wiki.freepascal.org/Better_Unicode_Support_in_Lazarus#String_Literals
Mostly Lazarus trunk and FPC 3.2 on Manjaro Linux 64-bit.

helio

  • New Member
  • *
  • Posts: 29
Re: using the CODEPAGE correctly?
« Reply #2 on: June 21, 2016, 02:28:23 pm »
@JuhaManninen:

Could you give an example of use? I am very confused!!!

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4631
  • I like bugs.
Re: using the CODEPAGE correctly?
« Reply #3 on: June 21, 2016, 02:53:38 pm »
Quote from: helio
Could you give an example of use? I am very confused!!!

As I wrote, don't use {$Codepage UTF8}. Remove it. Then everything works.

helio

  • New Member
  • *
  • Posts: 29
Re: using the CODEPAGE correctly?
« Reply #4 on: June 21, 2016, 03:03:35 pm »

OK! Without the directive both outputs are 3. Shouldn't the correct value be 2?  %)

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4631
  • I like bugs.
Re: using the CODEPAGE correctly?
« Reply #5 on: June 21, 2016, 03:14:10 pm »
Quote from: helio
OK! Without the directive both outputs are 3. Shouldn't the correct value be 2?  %)

No, 3 is correct.
UTF8Length(S) would return 2.
Good luck learning Unicode. :)
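
To make the difference between the two counts concrete, here is a minimal sketch. CountCodePoints is a hypothetical helper written for this post, not the LazUTF8 implementation of UTF8Length, and it assumes the string holds valid UTF-8. The bytes are spelled out with #$ escapes so the result does not depend on how the source file is saved.

```pascal
program CountDemo;
{$mode objfpc}{$H+}

// Hypothetical helper: counts UTF-8 codepoints by skipping
// continuation bytes (bit pattern 10xxxxxx). Assumes valid UTF-8.
function CountCodePoints(const S: AnsiString): Integer;
var
  I: Integer;
begin
  Result := 0;
  for I := 1 to Length(S) do
    if (Ord(S[I]) and $C0) <> $80 then
      Inc(Result);
end;

var
  S: AnsiString;
begin
  // 'C' followed by the two UTF-8 bytes of 'Ç' (C3 87), written as
  // escapes so the source file encoding does not matter.
  S := 'C'#$C3#$87;
  WriteLn(Length(S));          // 3 bytes (code units)
  WriteLn(CountCodePoints(S)); // 2 codepoints
end.
```

Length counts the byte-sized elements, while the helper counts only the bytes that start a codepoint.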

Bart

  • Hero Member
  • *****
  • Posts: 5610
    • Bart en Mariska's Webstek
Re: using the CODEPAGE correctly?
« Reply #6 on: June 21, 2016, 04:30:07 pm »
IIRC:
S is a string and you assign 'CÇ' to it. The source code is encoded in UTF-8, where 'CÇ' is 3 bytes long, so Length returns 3 (three byte-sized elements).
In Length('CÇ'), the 'CÇ' is treated as a constant by the compiler and evaluated as a UnicodeString, so Length will be 2 (two word-sized elements).

If you specify a Windows codepage and save the source with that encoding (in the Lazarus IDE you can do that via the context menu of the editor), then Length(S) would be 2 as well, since the string would be in a 1-byte encoding.
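
A small sketch of the element sizes, assuming FPC 3.x where UTF8Encode returns a RawByteString. The text is built from codepoint numbers instead of a literal, so no compile-time conversion of a constant is involved.

```pascal
program ElementSizes;
{$mode objfpc}{$H+}

var
  U: UnicodeString;   // UTF-16, 2-byte elements
  A: RawByteString;   // 1-byte elements
begin
  // Build 'CÇ' from the codepoints U+0043 and U+00C7.
  SetLength(U, 2);
  U[1] := WideChar($0043);
  U[2] := WideChar($00C7);

  // Same text as UTF-8 bytes: 43 C3 87.
  A := UTF8Encode(U);

  WriteLn(Length(U), ' elements of ', SizeOf(WideChar), ' bytes');
  WriteLn(Length(A), ' elements of ', SizeOf(AnsiChar), ' byte');
end.
```

This prints "2 elements of 2 bytes" and then "3 elements of 1 byte": the same text, counted in different element types.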

Bart

tetrastes

  • Hero Member
  • *****
  • Posts: 694
Re: using the CODEPAGE correctly?
« Reply #7 on: June 21, 2016, 05:07:07 pm »
Quote from: Bart
In Length('CÇ'), the CÇ is treated as a constant by the compiler and evaluated as UnicodeString, and then Length will be 2 (2 word size elements).

But why is Length('CÇ') 3 without {$Codepage UTF8}?

rvk

  • Hero Member
  • *****
  • Posts: 6885
Re: using the CODEPAGE correctly?
« Reply #8 on: June 21, 2016, 05:11:11 pm »
Quote from: tetrastes
But why is Length('CÇ') 3 without {$Codepage UTF8}?

Because 'CÇ' is then evaluated as a string (which is a UTF-8 string).
And because a string consists of one-byte elements (even a UTF-8 string), you get 3 elements.

If it were unicode (widestring), as it is in Delphi, the string would consist of widechars (2 bytes per element).
So in Delphi the result would be 2 (two word-sized widechar elements).

You need to think about WHAT kind of elements the string is made of. Length returns the number of those elements.

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4631
  • I like bugs.
Re: using the CODEPAGE correctly?
« Reply #9 on: June 21, 2016, 05:57:37 pm »
Quote from: tetrastes
But why is Length('CÇ') 3 without {$Codepage UTF8}?

Because then the data is interpreted correctly. Yes, it is counter-intuitive. Please read the wiki page I linked.
The fundamental reason is that our new UTF-8 mode switches the encoding at run-time, yet constants are evaluated at compile-time.
With {$Codepage UTF8} the compiler does a wrong conversion and indeed uses UnicodeString as a temporary medium.

Quote from: rvk
If it was unicode (widestring) as it is in Delphi, ...

UTF-8 is also Unicode. Unfortunately Delphi named their string type as UnicodeString instead of a proper UTF16String. It causes lots of confusion.

Everybody please test my latest code attachment in "Encoding agnostic functions for codepoints + an iterator" thread.
It is über-cool and gives more insight into Unicode, codepoints and codeunits. :)

Bart

  • Hero Member
  • *****
  • Posts: 5610
    • Bart en Mariska's Webstek
Re: using the CODEPAGE correctly?
« Reply #10 on: June 21, 2016, 07:12:26 pm »
Quote from: tetrastes
But why is Length('CÇ') 3 without {$Codepage UTF8}?

Because the file encoding is still UTF-8 (unless you changed it), but the compiler thinks the codepage is DefaultSystemCodePage (probably 1252). It sees the bytes 43 C3 87 and treats them as a 3-character string in a 1-byte-per-character (cp1252) encoding.

If you tell the compiler the codepage is UTF8 it will treat string constants as UTF16 (it converts them from UTF8 to UTF16).

I took your example, saved the file in cp1252 encoding (1 byte = 1 char) and removed the {$codepage UTF8}.
It then correctly outputs 2 for both lengths (the 'CÇ' is stored as the 2 bytes 43 C7 in the file in this case).

Code:
43 C7, I1=2
43 C7, I2=2
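
If you want to check which bytes a string really holds, a dump like the following can help. This is only a sketch: HexBytes is a hypothetical helper name, and the two inputs are the byte sequences mentioned in this thread, written as escapes so the source file encoding plays no role.

```pascal
program HexDumpDemo;
{$mode objfpc}{$H+}

uses
  SysUtils;  // IntToHex

// Hypothetical helper: returns the bytes of S as space-separated hex.
function HexBytes(const S: RawByteString): string;
var
  I: Integer;
begin
  Result := '';
  for I := 1 to Length(S) do
  begin
    if I > 1 then
      Result := Result + ' ';
    Result := Result + IntToHex(Ord(S[I]), 2);
  end;
end;

begin
  WriteLn(HexBytes('C'#$C3#$87)); // UTF-8 bytes of 'CÇ':  43 C3 87
  WriteLn(HexBytes('C'#$C7));     // cp1252 bytes of 'CÇ': 43 C7
end.
```

In a real program you would pass the string variable you are unsure about and compare the dump with the encoding you expect.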

Bart

BeniBela

  • Hero Member
  • *****
  • Posts: 947
    • homepage
Re: using the CODEPAGE correctly?
« Reply #11 on: June 21, 2016, 07:31:25 pm »

Quote from: Bart
If you tell the compiler the codepage is UTF8 it will treat string constants as UTF16

That sounds like a crazy compiler

rvk

  • Hero Member
  • *****
  • Posts: 6885
Re: using the CODEPAGE correctly?
« Reply #12 on: June 21, 2016, 07:33:08 pm »
Quote from: rvk
If it was unicode (widestring) as it is in Delphi, ...
Quote from: JuhaManninen
UTF-8 is also Unicode. Unfortunately Delphi named their string type as UnicodeString instead of a proper UTF16String. It causes lots of confusion.

Yeah... that was kind of what I meant. (That's why I put "(widestring)" after unicode, but that was the incorrect term.)

Quote from: JuhaManninen
Unfortunately Delphi named their string type as UnicodeString instead of a proper UTF16String. It causes lots of confusion.

FPC also has a type unicodestring. So was it wrong for FPC to also define one???

Unicode (unicodestring) can be a UTF-32, UTF-16 or UTF-8 string. It's up to the language (or setting) to determine which it really is. So there is no wrong or right here. (Otherwise FPC would also be wrong to define a unicodestring.)

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4631
  • I like bugs.
Re: using the CODEPAGE correctly?
« Reply #13 on: June 21, 2016, 08:39:27 pm »

Quote from: Bart
If you tell the compiler the codepage is UTF8 it will treat string constants as UTF16
Quote from: BeniBela
That sounds like a crazy compiler

:)
Bart's sentence was a little inaccurate.
The compiler treats string constants as UTF-8 if you tell it so. However our UTF-8 "hack" sets the default encoding only later at run-time. At compile time the default string encoding is still the system's codepage, for example cp1252. Thus the compiler converts UTF-8 -> cp1252 when assigning a constant. Later our "hack" changes the default string encoding and the wrongly converted cp1252 data is treated as UTF-8 -> error!

Remember however that the potential problem is only with constants. Variables have dynamic encoding info and assignment goes right always.
Summa summarum:
 Do not define {$Codepage UTF8} and assign constants only to an AnsiString. Then everything works like magic.

Please also look at my LazUnicode unit and test program, attached in another thread.
You can see that I assign a constant to a temporary AnsiString also when {$ModeSwitch UnicodeStrings} is defined and String type maps to UnicodeString.
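
A rough sketch of what "dynamic encoding info" means: each AnsiString carries a code page tag at run time, which you can read with StringCodePage and change with SetCodePage (both from the FPC System unit, FPC 3.0 or later assumed). This only demonstrates the tag on a variable; it does not reproduce what the Lazarus UTF-8 setup does internally.

```pascal
program CodePageTag;
{$mode objfpc}{$H+}

var
  S: RawByteString;
begin
  // UTF-8 bytes of 'CÇ' written as escapes. There is no {$Codepage}
  // directive, so the compiler stores them unconverted.
  S := 'C'#$C3#$87;
  WriteLn('before: code page ', StringCodePage(S), ', length ', Length(S));

  // Relabel the bytes as UTF-8 without converting them (Convert = False).
  SetCodePage(S, CP_UTF8, False);
  WriteLn('after:  code page ', StringCodePage(S), ', length ', Length(S));
end.
```

After the call the tag is 65001 (CP_UTF8) and the length is still 3, because only the label changed and the bytes were left alone. The value on the "before" line depends on how the compiler tagged the constant.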

Quote from: rvk
FPC also has a type unicodestring. So was it wrong for FPC to also define one???

No, Delphi compatibility is important, they had to define it.
There is no problem if everybody knows the facts. However Unicode is confusing and complicated enough even without extra complications. Take the word "character", what does it mean in Unicode? At least the encodings should be named without ambiguity.

Quote
Unicode (unicodestring) can be a UTF-32, UTF-16 or UTF-8 string.

This kind of proves the confusion. Does "unicodestring" mean Delphi's UTF-16 type or a string of Unicode in general?

rvk

  • Hero Member
  • *****
  • Posts: 6885
Re: using the CODEPAGE correctly?
« Reply #14 on: June 21, 2016, 08:48:47 pm »
Quote from: rvk
Unicode (unicodestring) can be a UTF-32, UTF-16 or UTF-8 string.
Quote from: JuhaManninen
This kind of proves the confusion. Does "unicodestring" mean the Delphi's UTF-16 type or a string of Unicode in general?

Yeah... maybe. I should have said "unicode string" or "unicode-string" and not "unicodestring" :). Unicode (or a string with unicode) itself can be UTF-32, UTF-16 or UTF-8. Unicodestring is usually defined as UTF-16 or UCS-2. Is there actually a language which uses UTF-8 for a type named Unicodestring?

 
