Author Topic: For..in..do  (Read 1554 times)

Ed78z

  • Jr. Member
  • **
  • Posts: 66
For..in..do
« on: September 16, 2025, 03:07:24 am »
I can't compile this tiny program. What's the reason?

Code: Pascal
program Test;

{$mode objfpc}{$H+}

var
  MyString: String;
  ch: UCS4Char;
begin
  MyString := 'Test €!';
  for ch in MyString do
    Writeln('Character: ', ch, ', Ordinal Value: ', Ord(ch));
  Readln;
end.


project1.lpr(10,7) Error: Incompatible types: got "Char" expected "UCS4Char"


Lazarus 4.2 x64 Windows 11 x64

Remy Lebeau

  • Hero Member
  • *****
  • Posts: 1566
    • Lebeau Software
Re: For..in..do
« Reply #1 on: September 16, 2025, 07:03:38 am »
A String contains Char elements, not UCS4Char elements. You need to change your ch variable to Char.

If you want to use UCS4Char, you will have to use UCS4String instead of String.
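
A minimal sketch of what a Char loop variable actually sees, assuming the string holds UTF-8. The program name is made up, and the euro sign is spelled out as its three UTF-8 bytes so the result does not depend on the source file encoding or on a {$codepage} directive.

```pascal
program CharLoopSketch;

{$mode objfpc}{$H+}

var
  MyString: String;
  ch: Char;
begin
  // 'Test €!' with the euro sign (U+20AC) written as its three UTF-8 bytes
  MyString := 'Test '#$E2#$82#$AC'!';
  // Char is a single byte here, so the loop yields one code unit per pass
  for ch in MyString do
    Write(Ord(ch), ' ');
  Writeln;
  Writeln('Length in bytes: ', Length(MyString));
end.
```

The loop compiles, but it visits nine bytes for seven visible characters, because the euro sign occupies three bytes in UTF-8.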
« Last Edit: September 16, 2025, 07:07:56 am by Remy Lebeau »
Remy Lebeau
Lebeau Software - Owner, Developer
Internet Direct (Indy) - Admin, Developer (Support forum)

Khrys

  • Sr. Member
  • ****
  • Posts: 342
Re: For..in..do
« Reply #2 on: September 16, 2025, 07:25:53 am »
Since each AnsiString has its own codepage (which may not be CP_UTF8), no Unicode code point enumerator is defined for AnsiString.
You might want to take a look at the LazUTF8 and LazUnicode units from the LazUtils package (distributed with Lazarus).

(A code point enumerator would be a nice addition to the RTL, though... just my 0.02 €)
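
As a rough sketch of the LazUTF8 route: the loop below walks a string code point by code point. It assumes a Lazarus version whose LazUTF8 unit provides UTF8CodepointToUnicode, that the LazUtils package is added to the project, and that the string really contains UTF-8. The program and variable names are only illustrative.

```pascal
program Utf8WalkSketch;

{$mode objfpc}{$H+}

uses
  LazUTF8; // part of the LazUtils package

var
  MyString: String;
  i: Integer;
  CodepointLen: Integer;
  CodePoint: Cardinal;
begin
  // Assumption: the data is UTF-8; the euro sign is given as raw bytes
  MyString := 'Test '#$E2#$82#$AC'!';
  i := 1;
  while i <= Length(MyString) do
  begin
    // Decode the code point starting at byte i and get its length in bytes
    CodePoint := UTF8CodepointToUnicode(@MyString[i], CodepointLen);
    if CodepointLen < 1 then
      CodepointLen := 1; // defensive: always move forward
    Writeln('Code point: ', CodePoint, ' (', CodepointLen, ' byte(s))');
    Inc(i, CodepointLen);
  end;
end.
```

This keeps the data in an ordinary String and only decodes on the fly, so no conversion to another string type is needed.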

Ed78z

  • Jr. Member
  • **
  • Posts: 66
Re: For..in..do
« Reply #3 on: September 16, 2025, 07:39:22 am »
Quote
A String contains Char elements, not UCS4Char elements. You need to change your ch variable to Char.
If you want to use UCS4Char, you will have to use UCS4String instead of String.

No, Char is not the solution; it will break on multi-byte characters.

The only option I found, after some trial and error, is:
Code: Pascal
{$codepage utf8}

program Test;

{$mode objfpc}{$H+}

var
  MyString: UCS4String;
  ch: UCS4Char;
begin
  // Convert to UTF-32 so that each array element is one code point
  MyString := UnicodeStringToUCS4String('Test €!');
  for ch in MyString do
  begin
    // Skip the terminating 0 that the conversion appends to the array
    if ch = 0 then continue;
    // Note: the UnicodeChar cast only round-trips code points up to U+FFFF
    Writeln('Character: ', UnicodeChar(ch), ', Code Point: ', ch);
  end;
  Readln;
end.


Thank you for your response
« Last Edit: September 16, 2025, 07:44:42 am by Ed78z »

Ed78z

  • Jr. Member
  • **
  • Posts: 66
Re: For..in..do
« Reply #4 on: September 16, 2025, 07:42:21 am »
Quote
Since each AnsiString has its own codepage (which may not be CP_UTF8), no Unicode code point enumerator is defined for AnsiString.
You might want to take a look at the LazUTF8 and LazUnicode units from the LazUtils package (distributed with Lazarus).
(A code point enumerator would be a nice addition to the RTL, though... just my 0.02 €)

Thank you for your comment.
So, I guess after FPC 3.0 the compiler directive {$H+} means UnicodeString, not AnsiString.

Khrys

  • Sr. Member
  • ****
  • Posts: 342
Re: For..in..do
« Reply #5 on: September 16, 2025, 08:28:09 am »
Quote
So, I guess after FPC 3.0 the compiler directive {$H+} means UnicodeString, not AnsiString.

No, {$H+} just defines type String = AnsiString. Without this directive, type String = ShortString, which you typically don't want. It has nothing to do with Unicode.

What I meant was that AnsiString (aka String when {$H+} is active) may contain UTF-8 data, but it doesn't have to: each string instance has a codepage field in its metadata (right next to its reference count & length).
This just means that, unlike in Rust for example, strings in FPC aren't guaranteed to use UTF-8 under the hood, hence it isn't sound to assume that every AnsiString contains Unicode code points.
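
To see the per-string codepage field for yourself, something like the following sketch should do. It only uses System unit routines (StringCodePage, DefaultSystemCodePage, CP_UTF8); the program and variable names are made up, and the printed numbers depend on the platform and on whether LazUTF8 is in the uses clause, so none are shown here.

```pascal
program CodePageSketch;

{$mode objfpc}{$H+}

var
  Plain: AnsiString;
  Utf8: UTF8String;
  Raw: RawByteString;
begin
  Plain := 'abc';
  Utf8 := 'abc';
  Raw := Utf8; // RawByteString keeps whatever code page the source carries

  // StringCodePage reads the code page stored in each string's header
  Writeln('AnsiString:            ', StringCodePage(Plain));
  Writeln('UTF8String:            ', StringCodePage(Utf8));
  Writeln('RawByteString:         ', StringCodePage(Raw));
  Writeln('DefaultSystemCodePage: ', DefaultSystemCodePage);
  Writeln('CP_UTF8 constant:      ', CP_UTF8);
end.
```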

Ed78z

  • Jr. Member
  • **
  • Posts: 66
Re: For..in..do
« Reply #6 on: September 16, 2025, 08:36:49 am »
Quote
No, {$H+} just defines type String = AnsiString. Without this directive, type String = ShortString, which you typically don't want. It has nothing to do with Unicode.
What I meant was that AnsiString (aka String when {$H+} is active) may contain UTF-8 data, but it doesn't have to: each string instance has a codepage field in its metadata (right next to its reference count & length).
This just means that, unlike in Rust for example, strings in FPC aren't guaranteed to use UTF-8 under the hood, hence it isn't sound to assume that every AnsiString contains Unicode code points.

I thought that after FPC 3.0, String = UnicodeString...
In Modern Free Pascal (3.0+):
{$H+}: String = UnicodeString (UTF-16) by default
{$H-}: String = ShortString (255 chars max)

In Older Free Pascal (pre-3.0):
{$H+}: String = AnsiString
{$H-}: String = ShortString


My idea was to access each character without dealing with surrogate pairs...

cdbc

  • Hero Member
  • *****
  • Posts: 2466
    • http://www.cdbc.dk
Re: For..in..do
« Reply #7 on: September 16, 2025, 09:46:31 am »
Hi
You need to use the enumerator from 'LazUnicode':
Code: Pascal
program Test;
{$mode objfpc}{$H+}
uses LazUnicode; // provides the UTF-8 enumerator
var
  MyString: String;
  ch: String;
begin
  MyString := 'Test €!';
  for ch in MyString do // uses the enumerator from LazUnicode
  begin
    if ch = #0 then continue; // ch is now a string
    Writeln('Character: ', ch, ', Code Point: ', ch); // both arguments print the character itself
  end;
  Readln;
end.
Regards Benny
If it ain't broke, don't fix it ;)
PCLinuxOS(rolling release) 64bit -> KDE6 -> FPC 3.2.2 -> Lazarus 4.0 up until Jan 2025 from then on it's both above &: KDE6/QT6 -> FPC 3.3.1 -> Lazarus 4.99

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4631
  • I like bugs.
Re: For..in..do
« Reply #8 on: September 16, 2025, 10:33:09 am »
Quote
project1.lpr(10,7) Error: Incompatible types: got "Char" expected "UCS4Char"

The original error message looks like a compiler bug. The types should be the other way around.
I can reproduce that with FPC 3.2.2. Maybe it is fixed in a later version. Could somebody please test that?
Mostly Lazarus trunk and FPC 3.2 on Manjaro Linux 64-bit.

dbannon

  • Hero Member
  • *****
  • Posts: 3558
    • tomboy-ng, a rewrite of the classic Tomboy
Re: For..in..do
« Reply #9 on: September 16, 2025, 11:17:59 am »
FPC main (3.3.1) does the same thing. But are you sure it's an error? I'd see the 'for' pulling the first lump out of the string, getting an AnsiChar and finding it cannot copy that into the UCS4Char. And that's what it says?

Davo
Lazarus 3, Linux (and reluctantly Win10/11, OSX Monterey)
My Project - https://github.com/tomboy-notes/tomboy-ng and my github - https://github.com/davidbannon

Warfley

  • Hero Member
  • *****
  • Posts: 2021
Re: For..in..do
« Reply #10 on: September 16, 2025, 12:11:27 pm »
You could use my iterator library, it has an iterator for utf-8:
Code: Pascal
procedure UTF8Test;
const
  TestString = '€$£';
var
  c: String;
begin
  Write('Testing iterating over "', TestString, '":');
  for c in IterateUTF8(TestString) do
    Write(' ', c);
  WriteLn;
end;

cdbc

  • Hero Member
  • *****
  • Posts: 2466
    • http://www.cdbc.dk
Re: For..in..do
« Reply #11 on: September 16, 2025, 12:56:20 pm »
Hi
@Khrys was spot on:
Quote
You might want to take a look at the  LazUTF8  and  LazUnicode  units from the  LazUtils  package (distributed with Lazarus).
... as I showed in reply #7. Remember to include the 'LazUtils' package.
Regards Benny

Remy Lebeau

  • Hero Member
  • *****
  • Posts: 1566
    • Lebeau Software
Re: For..in..do
« Reply #12 on: September 16, 2025, 05:24:37 pm »
Quote
I thought after fpc3.0, string=UnicodeString...

No, String is still AnsiString by default in {$H+} mode. If you want String to be UnicodeString then you need to use {$ModeSwitch UnicodeStrings} or {$Mode DelphiUnicode}.
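
A quick way to check which type String maps to is to look at the element size. This is only a sketch with a made-up program name; it relies on the {$ModeSwitch UnicodeStrings} directive named above and on SizeOf, nothing else.

```pascal
program StringTypeSketch;

{$mode objfpc}{$H+}
{$modeswitch unicodestrings} // remove this line to compare with the default

var
  S: String;
begin
  S := 'abc';
  // The element size tells which string type String maps to:
  // 1 byte per element for AnsiString, 2 for UnicodeString (UTF-16)
  Writeln('SizeOf(Char): ', SizeOf(Char));
  Writeln('SizeOf(S[1]): ', SizeOf(S[1]));
  Writeln('Length(S):    ', Length(S));
end.
```

Compiling it once with and once without the modeswitch line shows that {$H+} alone does not switch String to UTF-16.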

Quote
My idea was accessing each character without dealing with Surrogate pairs...

UnicodeString is UTF-16, thus uses surrogates.  Even UTF-8 uses multiple codeunits per codepoint. Only UTF-32/UCS-4 is 1:1 between codeunits and codepoints, but UCS4String is not a native string type, it is just a dynamic array.
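
For illustration, this is roughly what handling surrogates by hand looks like with a UnicodeString. It is a sketch with made-up names, using the standard UTF-16 decoding arithmetic; the test string is built from explicit code units so the source file encoding does not matter.

```pascal
program SurrogateSketch;

{$mode objfpc}{$H+}

var
  U: UnicodeString;
  i: Integer;
  HighUnit, LowUnit, CodePoint: Cardinal;
begin
  // 'A' followed by U+1F600, written as its surrogate pair D83D DE00
  SetLength(U, 3);
  U[1] := 'A';
  U[2] := WideChar($D83D);
  U[3] := WideChar($DE00);

  i := 1;
  while i <= Length(U) do
  begin
    HighUnit := Ord(U[i]);
    CodePoint := HighUnit;
    // A high surrogate followed by a low surrogate encodes one code point
    if (HighUnit >= $D800) and (HighUnit <= $DBFF) and (i < Length(U)) then
    begin
      LowUnit := Ord(U[i + 1]);
      if (LowUnit >= $DC00) and (LowUnit <= $DFFF) then
      begin
        CodePoint := $10000 + ((HighUnit - $D800) shl 10) + (LowUnit - $DC00);
        Inc(i); // skip the low surrogate as well
      end;
    end;
    Writeln('Code point: ', CodePoint);
    Inc(i);
  end;
end.
```

Three UTF-16 code units go in and two code points come out (65 and 128512, i.e. U+1F600), which is exactly the bookkeeping a UCS4String lets you avoid.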
« Last Edit: September 16, 2025, 05:26:55 pm by Remy Lebeau »

PascalDragon

  • Hero Member
  • *****
  • Posts: 6191
  • Compiler Developer
Re: For..in..do
« Reply #13 on: September 16, 2025, 10:36:38 pm »
Quote
Only UTF-32/UCS-4 is 1:1 between codeunits and codepoints, but UCS4String is not a native string type, it is just a dynamic array.

It's close enough. A codepage is not required, thus the reference count of the dynamic array is sufficient. One difference is the slightly different copy-on-write semantics, because changing an element in a dynamic array will not make it unique, unlike for a managed string type; another is that there won't be an implicit NUL at the end.
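
The difference in write behaviour can be seen with a small sketch like this one (names are illustrative only). It assigns a string and a UCS4String to a second variable and then changes the first element of each copy.

```pascal
program CopyOnWriteSketch;

{$mode objfpc}{$H+}

var
  S1, S2: String;
  A1, A2: UCS4String;
begin
  // Managed string: writing to an element first makes the target unique
  S1 := 'abc';
  S2 := S1;
  S2[1] := 'X';
  Writeln(S1, ' ', S2);       // S1 is untouched

  // Dynamic array: assignment shares the data, element writes do not copy
  SetLength(A1, 2);
  A1[0] := Ord('a');
  A1[1] := 0;                 // explicit terminator, as UCS4String expects
  A2 := A1;
  A2[0] := Ord('X');
  Writeln(A1[0], ' ', A2[0]); // both show the new value
end.
```

The first line prints "abc Xbc", the second prints "88 88": the write through A2 is visible through A1 as well. If an independent copy of the array is wanted, Copy(A1) or SetLength on the target would be the usual way to get one.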

Remy Lebeau

  • Hero Member
  • *****
  • Posts: 1566
    • Lebeau Software
Re: For..in..do
« Reply #14 on: September 17, 2025, 01:56:24 am »
Quote
One difference is slightly different copy-on-write semantic, because changing an element in a dynamic array will not make it unique unlike for a managed string type

IOW, dynamic arrays are not copy-on-write at all.

Quote
another is that there won't be an implicit NUL at the end.

True, there will be an explicit one instead, which is included in the array's length.
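
A small sketch of that explicit terminator, using UnicodeStringToUCS4String from the System unit (program and variable names are made up). Since the terminator counts towards Length, a loop can simply stop one element early instead of testing each element for 0.

```pascal
program Ucs4LengthSketch;

{$mode objfpc}{$H+}

var
  U: UnicodeString;
  U4: UCS4String;
  i: Integer;
begin
  U := 'abc';
  U4 := UnicodeStringToUCS4String(U);
  // The array holds one extra element: the explicit terminating 0
  Writeln('Length(U):  ', Length(U));
  Writeln('Length(U4): ', Length(U4));
  // Stop one element early instead of testing every element for 0
  for i := 0 to Length(U4) - 2 do
    Writeln('Code point: ', U4[i]);
end.
```

For 'abc' the two lengths are 3 and 4.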