
Author Topic: problem with certain character recognition  (Read 2880 times)

lupo

  • Newbie
  • Posts: 1
problem with certain character recognition
« on: June 27, 2025, 05:06:12 pm »
The code doesn't recognize the '°' character. In the debugger's watch window, the value of alpha shows as '?' instead of '°'.


unit Unit1;

{$CODEPAGE UTF8}
{$mode objfpc}{$H+}

interface

uses
  Classes, SysUtils, Forms, Controls, Graphics, Dialogs;

type
  TForm1 = class(TForm)
  private

  public

  end;

var
  Form1: TForm1;

implementation

{$R *.lfm}

var
  alpha: Char;
begin
  alpha := '°';
  if alpha = '°' then
    WriteLn('It works!');
end.

end.
                                                                       

Gigatron

  • Sr. Member
  • ****
  • Posts: 421
  • Amiga Rulez !!
    • Gigatron Shader Network Demo
Re: problem with certain character recognition
« Reply #1 on: June 27, 2025, 06:02:31 pm »
Hi

Is this working for you? Or did I misunderstand the problem?

Code: Pascal
var
  alpha: String;
begin
  alpha := '°';
  if alpha = '°' then
    WriteLn('It works!');

  WriteLn(UTF8ToString('°°°°°°°'));
  ReadLn;
end.
« Last Edit: June 27, 2025, 06:14:02 pm by Gigatron »
Coding faster than Light !

Martin_fr

  • Administrator
  • Hero Member
  • *
  • Posts: 12398
  • Debugger - SynEdit - and more
    • wiki
Re: problem with certain character recognition
« Reply #2 on: June 27, 2025, 06:19:15 pm »
The Pascal type "char" represents an "ansi-char", that is a character that fits into one single byte.

In a UTF-8 source file, '°' (U+00B0) does not fit into a single byte. https://www.compart.com/en/unicode/U+00B0

Its encodings are:
UTF-8: 2 bytes (code units), 0xC2 0xB0
UTF-16: 1 word (code unit), 0x00B0

In this case you can get away with using UTF-16 and the Pascal type "widechar".

But Unicode also has characters that never fit into a single char or widechar, and not even into a single "32-bit char", even if you were using UTF-32.

Some characters are built from combining code points, and such sequences can be of any length.
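
The byte counts above can be checked with a small console program. This is only a sketch and not part of the original posts: the program name, the DumpBytes helper and the variable names are made up for illustration. The bytes are written as #$.. escapes so that the result does not depend on how the source file itself is saved.

```pascal
program DegreeUnits;
{$mode objfpc}{$H+}
uses
  SysUtils;

// Hypothetical helper: print every byte (UTF-8 code unit) of s in hex.
procedure DumpBytes(const caption: string; const s: RawByteString);
var
  i: Integer;
begin
  Write(caption, ': ', Length(s), ' byte(s):');
  for i := 1 to Length(s) do
    Write(' 0x', IntToHex(Ord(s[i]), 2));
  WriteLn;
end;

var
  wide: WideChar;
begin
  // U+00B0 DEGREE SIGN encoded as UTF-8: two code units.
  DumpBytes('degree sign', #$C2#$B0);
  // 'e' followed by U+0301 COMBINING ACUTE ACCENT: one visible
  // character, two code points, three UTF-8 code units.
  DumpBytes('e + combining acute', 'e'#$CC#$81);

  // In UTF-16 the degree sign is a single 16-bit code unit.
  wide := WideChar($00B0);
  WriteLn('UTF-16 code unit: 0x', IntToHex(Ord(wide), 4));
  WriteLn('SizeOf(Char) = ', SizeOf(Char),
          ', SizeOf(WideChar) = ', SizeOf(WideChar));
end.
```

In {$mode objfpc} the type Char is one byte wide, which is why a two-byte UTF-8 sequence cannot be stored in it, while WideChar has room for the single UTF-16 code unit 0x00B0.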

gues1

  • Guest
Re: problem with certain character recognition
« Reply #3 on: June 27, 2025, 06:44:24 pm »
Uhm...
Why doesn't Lazarus (or the compiler) show a warning (or better, an error) for the original code?
The character is processed by the compiler at compile time, so it knows that this is not a single-byte char.

It is as if you put two chars in the literal: the compiler would say that this is a WIDESTRING instead of a char (and it knows it is a WIDESTRING, not an ansistring).

Bye

Thaddy

  • Hero Member
  • *****
  • Posts: 19247
  • Glad to be alive.
Re: problem with certain character recognition
« Reply #4 on: June 28, 2025, 05:38:40 pm »
That is programmer error, not compiler error.
You do not even know string types (e.g. widestring vs unicodestring vs UTF8string vs shortstring vs ansistring)
« Last Edit: June 28, 2025, 06:42:04 pm by Thaddy »
objects are fine constructs. You can even initialize them with constructors.

Weiss

  • Full Member
  • ***
  • Posts: 241
Re: problem with certain character recognition
« Reply #5 on: July 07, 2025, 03:21:43 am »
This is the beginners forum, Thaddy. We are here because we do not know.

440bx

  • Hero Member
  • *****
  • Posts: 6528
Re: problem with certain character recognition
« Reply #6 on: July 07, 2025, 03:59:50 am »
Actually, the OP might have stumbled upon a compiler bug.

If you comment out the {$CODEPAGE UTF8} line, the compiler emits an error showing that it knows that the character does NOT fit in a single byte (type char).

{$CODEPAGE UTF8} does not change the fact that a "char" can only accommodate a single byte (as shown above, the compiler knows that), therefore it should still emit an error.


FPC v3.2.2 and Lazarus v4.0rc3 on Windows 7 SP1 64bit.

Thaddy

  • Hero Member
  • *****
  • Posts: 19247
  • Glad to be alive.
Re: problem with certain character recognition
« Reply #7 on: July 07, 2025, 06:28:13 am »
It is a combination.
Simplest example that basically changed 1 thing: char --> utf8char.
Code: Pascal
{$ifdef fpc}{$mode objfpc}{$endif}{$codepage utf8}
var
  alpha: Utf8Char;
begin
  alpha := '°';
  if alpha = '°' then
    WriteLn('It works!');
  readln;
end.
Assumes your terminal/console is in utf8 mode.
In a GUI app, take the original code and just change char to utf8char.

« Last Edit: July 07, 2025, 06:35:42 am by Thaddy »

LV

  • Sr. Member
  • ****
  • Posts: 427
Re: problem with certain character recognition
« Reply #8 on: July 07, 2025, 11:32:10 am »
Is this working for you? Or did I misunderstand the problem?

Code: Pascal
unit Unit1;

{$CODEPAGE UTF8}
{$mode objfpc}{$H+}

interface

uses
  Classes, SysUtils, Forms, Controls, Graphics, Dialogs;

type
  TForm1 = class(TForm)
  private

  public

  end;

var
  Form1: TForm1;

implementation

{$R *.lfm}

var
  alpha: string;
begin
  alpha :=
    '∀    ∃    ∄    ∆    ∈' + #13#10 +
    '∋    ∇    ∉    ∌    ∊' + #13#10 +
    '∍    ∑    √    ∝    ∏' + #13#10 +
    '∞    ∣    ≦    ≧    ⊃' + #13#10 +
    '∩    ∪   ⊂    °     ±' + #13#10 +
    '∙    ✕    ∅    ∆    ∇' + #13#10 +
    '∧    ∨    ∫     ∠   ∟' + #13#10 +
    '≃    ℃    ∞    — '     + #13#10 +
    '⌈    ⌉    ⌊     ⌋';
  WriteLn('WriteLn(alpha):');
  WriteLn(alpha);
  WriteLn('WriteLn(UTF8ToString(alpha)):');
  WriteLn(UTF8ToString(alpha));

  alpha := '°';
  if alpha = '°' then
    WriteLn('It works!');
end.


Thaddy

  • Hero Member
  • *****
  • Posts: 19247
  • Glad to be alive.
Re: problem with certain character recognition
« Reply #9 on: July 07, 2025, 11:44:17 am »
You forgot to set the console output to UTF8. (As I wrote).
And you also need a terminal font that supports utf8

LV

  • Sr. Member
  • ****
  • Posts: 427
Re: problem with certain character recognition
« Reply #10 on: July 07, 2025, 12:07:57 pm »
Quote from: Thaddy
You forgot to set the console output to UTF8. (As I wrote).
And you also need a terminal font that supports utf8

Yes, but in a Lazarus graphical application, which is what I use regularly, this is handled automatically (even without the {$CODEPAGE UTF8} directive).

Code: Pascal
procedure TForm1.Button1Click(Sender: TObject);
var
  alpha: string;
begin
  alpha :=
    'Label1' + #13#10 +
    '∀    ∃    ∄    ∆    ∈' + #13#10 +
    '∋    ∇    ∉    ∌    ∊' + #13#10 +
    '∍    ∑    √    ∝    ∏' + #13#10 +
    '∞    ∣    ≦    ≧    ⊃' + #13#10 +
    '∩    ∪   ⊂    °     ±' + #13#10 +
    '∙    ✕    ∅    ∆    ∇' + #13#10 +
    '∧    ∨    ∫     ∠   ∟' + #13#10 +
    '≃    ℃    ∞    — ' + #13#10 +
    '⌈    ⌉    ⌊     ⌋';
  Label1.Caption:=alpha;
end;

tetrastes

  • Hero Member
  • *****
  • Posts: 766
Re: problem with certain character recognition
« Reply #11 on: July 07, 2025, 02:05:35 pm »
Quote from: 440bx
If you comment out the {$CODEPAGE UTF8} line, the compiler emits an error showing that it knows that the character does NOT fit in a single byte (type char).

This is because the source is in UTF-8, where ° is indeed stored as 2 bytes, but without the directive the compiler treats the source as being in the system code page, so it sees 2 characters. This is how it looks in CP1252:
Code: Pascal
  alpha := 'Â°';
  if alpha = 'Â°' then

Convert the source to a single-byte code page, and there is no error.

Quote from: 440bx
{$CODEPAGE UTF8} does not change the fact that a "char" can only accommodate a single byte (as shown above, the compiler knows that), therefore it should still emit an error.

With the directive, the compiler knows that the source is UTF-8. Moreover, in this case it interprets '°' as a WideChar constant:
Code: ASM
# [7] alpha := '..';
        movl    $176,%eax
        movl    %eax,%ecx
        call    fpc_uchar_to_char
        movb    %al,U_$P$PROGRAM_$$_ALPHA(%rip)
# [8] if alpha = '..' then
        movzbl  U_$P$PROGRAM_$$_ALPHA(%rip),%ecx
        call    fpc_char_to_uchar
        cmpw    $176,%ax
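
Since the literal is already a WideChar constant under {$CODEPAGE UTF8}, one way to avoid the lossy fpc_uchar_to_char conversion seen in the listing is to declare the variable as WideChar, as Martin_fr suggested earlier in the thread. The following is a minimal sketch, not a tested fix for the original unit: it assumes the file is saved as UTF-8, and the program name is made up.

```pascal
program DegreeWide;
{$mode objfpc}{$H+}
{$codepage utf8}  // assumption: this source file is saved as UTF-8

var
  alpha: WideChar;  // 16-bit code unit, wide enough for U+00B0
begin
  // With the directive above, the compiler reads '°' as a WideChar
  // constant, so no conversion to a one-byte Char is needed here.
  alpha := '°';
  if alpha = '°' then
    WriteLn('It works!');
  // 176 is the same value ($176 in AT&T syntax) that appears in the
  // assembler listing.
  WriteLn('Ord(alpha) = ', Ord(alpha));
end.
```

This only helps for characters that fit into one UTF-16 code unit. For anything else, a string type is the safer choice.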

tetrastes

  • Hero Member
  • *****
  • Posts: 766
Re: problem with certain character recognition
« Reply #12 on: July 07, 2025, 06:50:58 pm »
Quote from: Thaddy
It is a combination.
Simplest example that basically changed 1 thing: char --> utf8char.
Code: Pascal
{$ifdef fpc}{$mode objfpc}{$endif}{$codepage utf8}
var
  alpha: Utf8Char;
begin
  alpha := '°';
  if alpha = '°' then
    WriteLn('It works!');
  readln;
end.
Assumes your terminal/console is in utf8 mode.
In a GUI app, take the original code and just change char to utf8char.

???
Code: Pascal
type
  AnsiChar = Char;
  UTF8Char = AnsiChar;

cdbc

  • Hero Member
  • *****
  • Posts: 2808
    • http://www.cdbc.dk
Re: problem with certain character recognition
« Reply #13 on: July 07, 2025, 08:23:13 pm »
Hi
Hmmm... I would have expected a Utf8Char to be defined as a 'string', since it can be 1..4 bytes long, which also makes it harder to index into...
Regards Benny
If it ain't broke, don't fix it ;)
PCLinuxOS(rolling release) 64bit -> KDE6/QT6 -> FPC Release -> Lazarus Release &  FPC Main -> Lazarus Main

tetrastes

  • Hero Member
  • *****
  • Posts: 766
Re: problem with certain character recognition
« Reply #14 on: July 07, 2025, 09:22:02 pm »
[Utf8/Wide/UCS4]Char represents a code unit, not a code point.
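
To see what "code unit" means in terms of storage, the sizes of the three types can be printed. This is a small sketch that relies on the declarations in FPC's System unit; the program name is made up for illustration.

```pascal
program CodeUnitSizes;
{$mode objfpc}{$H+}
begin
  // Each type holds exactly one code unit of its encoding.
  WriteLn('SizeOf(UTF8Char) = ', SizeOf(UTF8Char));  // UTF-8 unit: 1 byte
  WriteLn('SizeOf(WideChar) = ', SizeOf(WideChar));  // UTF-16 unit: 2 bytes
  WriteLn('SizeOf(UCS4Char) = ', SizeOf(UCS4Char));  // UTF-32 unit: 4 bytes
end.
```

A single UTF8Char is therefore always one byte, so a code point that needs 2 to 4 bytes in UTF-8 has to be stored in a string rather than in one UTF8Char variable.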

 
