Topic: PChar and UTF-8. Weird stuff. (FPC bug)

beria

  • Jr. Member
  • **
  • Posts: 70
PChar and UTF-8. Weird stuff. (FPC bug)
« on: August 21, 2022, 09:21:08 am »
Code: Pascal
{$codepage utf8}
program project1;

var
  ar: PChar = 'Ёжик';
begin
  Writeln(strlen(ar));  // Error
  Writeln(Length(ar));  // Error
  ar := 'Ёжик';
  Writeln(strlen(ar));  // Ok
  Writeln(Length(ar));  // Ok
  Readln;
end.

Output:
4
4
8
8

Code: Pascal
program project1;

var
  ar: PChar = 'Ёжик';
begin
  Writeln(strlen(ar));  // Ok
  Writeln(Length(ar));  // Ok
  ar := 'Ёжик';
  Writeln(strlen(ar));  // Ok
  Writeln(Length(ar));  // Ok
  Readln;
end.

Output:
8
8
8
8

Why does the RECOMMENDED compile switch {$codepage utf8} in some cases fail to work with UTF-8 strings in the PChar container?
« Last Edit: August 21, 2022, 07:18:23 pm by beria »

MarkMLl

  • Hero Member
  • *****
  • Posts: 6686
Re: PChar and UTF-8. Weird stuff.
« Reply #1 on: August 21, 2022, 09:32:49 am »
Don't mix C-style strings (PChar etc.) and Pascal-style string handling functions. It's asking for trouble.

MarkMLl
MT+86 & Turbo Pascal v1 on CCP/M-86, multitasking with LAN & graphics in 128Kb.
Pet hate: people who boast about the size and sophistication of their computer.
GitHub repositories: https://github.com/MarkMLl?tab=repositories

beria

  • Jr. Member
  • **
  • Posts: 70
Re: PChar and UTF-8. Weird stuff.
« Reply #2 on: August 21, 2022, 09:36:05 am »
Quote from: MarkMLl
> Don't mix C-style strings (PChar etc.) and Pascal-style string handling functions. It's asking for trouble.

 Writeln(strlen(ar));  //Error  ;D
strlen is a function specifically for the PChar type.

MarkMLl

  • Hero Member
  • *****
  • Posts: 6686
Re: PChar and UTF-8. Weird stuff.
« Reply #3 on: August 21, 2022, 10:00:31 am »
Quote from: beria
> Writeln(strlen(ar));  //Error  ;D
> strlen -  function specifically for the PChar type

Yes, but ar is a pointer, and you're relying on compiler magic to assign a Pascal-style string containing UTF-8 encoded characters to it.

You're then demonstrating that it's not a happy combination :-)

Also see https://forum.lazarus.freepascal.org/index.php/topic,60324.msg450814.html#msg450814

MarkMLl
« Last Edit: August 21, 2022, 10:08:27 am by MarkMLl »

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4467
  • I like bugs.
Re: PChar and UTF-8. Weird stuff.
« Reply #4 on: August 21, 2022, 12:08:14 pm »
Quote from: beria
> Why does the RECOMMENDED compile switch {$codepage utf8} in some cases fail to work with UTF-8 strings in the PChar container?

It is explained here:
 https://wiki.freepascal.org/Unicode_Support_in_Lazarus
Read especially the "String Literals" part. {$codepage utf8} is not recommended.
Yes, it is counter-intuitive. The fundamental reason is that the Lazarus UTF-8 solution is a hack from FPC's point of view. However, the hack works amazingly well when you always assign string literals to a variable of type "String". After that you can iterate over it using a PChar without problems.
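
A minimal sketch of that pattern, not taken from the wiki: the literal goes into a String variable first and is only then walked through a PChar. It assumes the source file is saved as UTF-8 without a BOM and that {$codepage utf8} is deliberately absent; the program name and variable names are made up for illustration.

```pascal
program LiteralToString;  // hypothetical program name
{$mode objfpc}{$H+}
// Assumption: no {$codepage utf8} and no -FcUTF8; source saved as UTF-8 without BOM.

var
  s: String;
  p: PChar;
begin
  s := 'Ёжик';        // literal assigned to a String variable first
  p := PChar(s);      // then viewed through a PChar
  while p^ <> #0 do
  begin
    Write(HexStr(Ord(p^), 2), ' ');  // dump each byte in hex
    Inc(p);
  end;
  Writeln;
  Writeln(Length(s)); // length in bytes, not in characters
end.
```

Under those assumptions the compiler is expected to copy the literal's bytes unchanged, which matches the second listing in the first post reporting 8 for the same literal without the directive.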
« Last Edit: August 21, 2022, 12:10:33 pm by JuhaManninen »
Mostly Lazarus trunk and FPC 3.2 on Manjaro Linux 64-bit.

Martin_fr

  • Administrator
  • Hero Member
  • *
  • Posts: 9866
  • Debugger - SynEdit - and more
    • wiki
Re: PChar and UTF-8. Weird stuff.
« Reply #5 on: August 21, 2022, 12:25:06 pm »
With fpc 3.3.1 it's 1,1,8,8 => IMHO a bug in fpc.

tetrastes

  • Sr. Member
  • ****
  • Posts: 481
Re: PChar and UTF-8. Weird stuff.
« Reply #6 on: August 21, 2022, 03:59:20 pm »
Quote from: JuhaManninen
> Quote from: beria
> > Why does the RECOMMENDED compile switch {$codepage utf8} in some cases fail to work with UTF-8 strings in the PChar container?
> It is explained here:
>  https://wiki.freepascal.org/Unicode_Support_in_Lazarus
> Read especially the "String Literals" part. {$codepage utf8} is not recommended.

I would add: read especially the "What happens when I use $codepage utf8?" part. It explains why this switch is especially not recommended for OSes with a UTF-8 system codepage.

By the way, on linux-x86_64 with fpc 3.2.2 I get (with $codepage utf8):
0
0
8
8

And if I add
Code: Pascal
uses cwstring;
I get:
5
5
8
8

And of course if you add
Code: Pascal
{$codepage utf8}
...
var
  ar: PChar = 'Ёжик';
begin
  writeln(string(ar));
...
you will not get Ёжик in the output.
« Last Edit: August 21, 2022, 04:15:11 pm by tetrastes »

beria

  • Jr. Member
  • **
  • Posts: 70
Re: PChar and UTF-8. Weird stuff.
« Reply #7 on: August 21, 2022, 04:29:21 pm »

Quote from: JuhaManninen
> Read especially the "String Literals" part. {$codepage utf8} is not recommended.

With {$codepage utf8} or the compiler switch -FcUTF8, PChar does not work in principle. This I have already understood and accepted, although it complicates console output. But "https://wiki.freepascal.org/Unicode_Support_in_Lazarus/ru" goes on to say that "WideString/UnicodeString/UTF8String only work with {$codepage utf8} / -FcUTF8."
For what it's worth, I use UTF8String to build network queries, and WideString is generally the only string type used to bind to Autocad and MSOffice (Win, Mac) scripting languages .....



MarkMLl

  • Hero Member
  • *****
  • Posts: 6686
Re: PChar and UTF-8. Weird stuff.
« Reply #8 on: August 21, 2022, 04:41:48 pm »
Quote from: beria
> WideString is generally the only string type used to bind to Autocad and MSOffice (Win,Mac) scripting languages .....

...which is obviously a significant consideration. But my understanding is that the correct doctrine is that since the base language is Pascal, you're safer using Pascal-style strings and only converting to a PChar when calling an API or library that requires it.

I admit to being no great lover of this UTF-8 stuff, but there are really two issues here: Pascal vs C-type strings (AnsiString, String, WideString), and character representation (UTF-8 vs a fixed-width Unicode encoding).
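
To make the representation issue concrete, here is a small sketch, assuming nothing beyond the System unit. Each of the four Cyrillic letters in 'Ёжик' takes two bytes in UTF-8, so a byte-oriented length is 8 while the number of code points is 4. The helper Utf8CodePointCount is hypothetical (not an RTL function), and the string is written as explicit bytes so the result does not depend on the source file's encoding.

```pascal
program CountCodePoints;  // hypothetical program name
{$mode objfpc}{$H+}

// Hypothetical helper: counts UTF-8 code points in a zero-terminated string
// by skipping continuation bytes, which all have the bit pattern 10xxxxxx.
function Utf8CodePointCount(p: PChar): SizeInt;
begin
  Result := 0;
  if p = nil then
    Exit;
  while p^ <> #0 do
  begin
    if (Ord(p^) and $C0) <> $80 then
      Inc(Result);  // lead byte or ASCII byte starts a new code point
    Inc(p);
  end;
end;

var
  s: String;
begin
  // UTF-8 bytes of 'Ёжик': U+0401, U+0436, U+0438, U+043A
  s := #$D0#$81#$D0#$B6#$D0#$B8#$D0#$BA;
  Writeln(Length(s));                     // 8 (bytes)
  Writeln(Utf8CodePointCount(PChar(s)));  // 4 (code points)
end.
```

The helper does no validation; it only illustrates why strlen and Length report 8 for correctly stored UTF-8 data.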

MarkMLl

PascalDragon

  • Hero Member
  • *****
  • Posts: 5476
  • Compiler Developer
Re: PChar and UTF-8. Weird stuff.
« Reply #9 on: August 21, 2022, 04:43:51 pm »
Quote from: MarkMLl
> Don't mix C-style strings (PChar etc.) and Pascal-style string handling functions. It's asking for trouble.

There are no Pascal-style string handling functions involved in this example (Length exists in a variant for both and a string constant is not a Pascal-style string).

Quote from: JuhaManninen
> Quote from: beria
> > Why does the RECOMMENDED compile switch {$codepage utf8} in some cases fail to work with UTF-8 strings in the PChar container?
> It is explained here:
>  https://wiki.freepascal.org/Unicode_Support_in_Lazarus
> Read especially the "String Literals" part. {$codepage utf8} is not recommended.
> Yes, it is counter-intuitive. The fundamental reason is that the Lazarus UTF-8 solution is a hack in FPC's point of view. However the hack works amazingly well when you assign string literals always to a variable of type "String". After that you can iterate it using a PChar without problems.

The example in question does not use any LCL units, thus what is suggested for Lazarus is not relevant here.

Quote from: beria
> Why does the RECOMMENDED compile switch {$codepage utf8} in some cases fail to work with UTF-8 strings in the PChar container?

It's a bug in the handling of the constant string: unlike the code for expressions, the code that handles the constant value does not correctly handle the case where the string constant is a Unicode string constant. Please report it as a bug.
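
Until that is fixed, the outputs posted in this thread suggest one possible way around it: drop the initializer and assign the literal in a statement, so it goes through the expression path. The sketch below is just the first listing rearranged that way; it is an inference from the reported results, not a confirmed fix.

```pascal
{$codepage utf8}
program project1;

var
  ar: PChar;            // no initializer on the declaration
begin
  ar := 'Ёжик';         // assignment is handled as an expression
  Writeln(strlen(ar));  // the thread reports 8 for this form
  Writeln(Length(ar));  // the thread reports 8 for this form
end.
```

The first post (and the later results with fpc 3.2.2 on Linux and fpc 3.3.1) all show 8 and 8 for the assignment form, while only the initialized declaration produced the wrong values.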

beria

  • Jr. Member
  • **
  • Posts: 70
Re: PChar and UTF-8. Weird stuff.
« Reply #10 on: August 21, 2022, 07:16:35 pm »

 
