Topic: I need to know if someone can explain s:='Sér'; ShowMessage(s.Length.ToString)

sergio.garcia

Shouldn't the result of the programming code below be 3?
var
  s:string;
begin
  s:='Sér';
  ShowMessage(s.Length.ToString);
end;

Warfley

The length of a String is not its length in (printed) characters, but its length in bytes. And in UTF-8 one character can be represented by up to 6 bytes (even though currently only up to 4 bytes are used, but it may be extended afterwards). E.g. '$' is 1 byte long, '£' is 2 bytes long and '€' is 3 bytes long. Generally, only the 7-bit ASCII characters (i.e. the unaccented "American" alphabet, digits and basic punctuation) take a single byte in UTF-8.
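To make the byte counts above concrete, here is a minimal console sketch. It assumes Free Pascal in objfpc mode; the program and variable names are made up for illustration, and the UTF-8 bytes are spelled out explicitly so the result does not depend on how the source file is saved.

```pascal
program ByteLengths;
{$mode objfpc}{$H+}

var
  Dollar, Pound, Euro, Ser: RawByteString;
begin
  // UTF-8 bytes written out by hand (an assumption made for this sketch,
  // so the source file encoding does not matter).
  Dollar := '$';               // U+0024
  Pound  := #$C2#$A3;          // U+00A3, '£'
  Euro   := #$E2#$82#$AC;      // U+20AC, '€'
  Ser    := 'S'#$C3#$A9'r';    // 'Sér' with é stored as the single code point U+00E9

  // Length counts bytes, not printed characters.
  WriteLn('$   : ', Length(Dollar));  // 1
  WriteLn('£   : ', Length(Pound));   // 2
  WriteLn('€   : ', Length(Euro));    // 3
  WriteLn('Sér : ', Length(Ser));     // 4
end.
```

With é stored as two bytes, 'Sér' comes out as 4 bytes, which is why the code in the question does not show 3.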

And even with the UTF-8 character length, the count does not always match what is printed. Take é: there is a single Unicode character that represents it, but it can also be written as e followed by a combining acute accent (´), i.e. two characters that are printed as one.

So as a general rule of thumb when dealing with Unicode: you can't tell from the length of a string how many characters will be printed.
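The point about é can be sketched as follows. This is only an illustration: CountCodePoints is a hypothetical helper written for this example (not a library function), it assumes well-formed UTF-8, and the byte sequences are written out by hand.

```pascal
program CombiningDemo;
{$mode objfpc}{$H+}

// Hypothetical helper: counts UTF-8 code points by skipping continuation
// bytes (those of the form 10xxxxxx). Assumes the input is well-formed UTF-8.
function CountCodePoints(const S: RawByteString): Integer;
var
  I: Integer;
begin
  Result := 0;
  for I := 1 to Length(S) do
    if (Ord(S[I]) and $C0) <> $80 then
      Inc(Result);
end;

var
  Precomposed, Combined: RawByteString;
begin
  Precomposed := #$C3#$A9;   // U+00E9: é as a single code point
  Combined := 'e'#$CC#$81;   // U+0065 followed by U+0301 (combining acute accent)

  // Both strings are printed as one é, yet neither count is the same.
  WriteLn(Length(Precomposed), ' bytes, ', CountCodePoints(Precomposed), ' code point(s)'); // 2 bytes, 1
  WriteLn(Length(Combined), ' bytes, ', CountCodePoints(Combined), ' code point(s)');       // 3 bytes, 2
end.
```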
« Last Edit: September 16, 2023, 11:34:05 pm by Warfley »

paweld

Therefore, if you know that the string may contain non-ASCII characters, use the UTF8Length function.
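Code: Pascal
uses
  SysUtils, Dialogs, // ToString helper and ShowMessage
  LazUTF8;           // UTF8Length

var
  s: String;
begin
  s := 'Sér';
  // UTF8Length counts UTF-8 code points rather than bytes,
  // so this shows 3 when é is stored as a single code point.
  ShowMessage(UTF8Length(s).ToString);
end;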
Best regards / Pozdrawiam
paweld

Remy Lebeau

Quote from: Warfley
And in UTF-8 one character can be represented by up to 6 bytes (even though currently only up to 4 bytes are used, but it may be extended afterwards).

The original UTF-8 spec allowed for up to 6 bytes (technically, it could be unlimited!), but RFC 3629 restricts UTF-8 to 4 bytes max for compatibility with UTF-16, which physically can't exceed codepoint U+10FFFF.  I don't see them ever extending UTF-8 beyond that, unless Unicode itself eventually grows beyond U+10FFFF and consequently has to deprecate UTF-16.
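As a rough illustration of the 4-byte ceiling, the sketch below maps a code point to the number of bytes UTF-8 uses for it under RFC 3629. Utf8ByteCount is a hypothetical helper made up for this example; it does not reject surrogates and simply returns 0 for anything above U+10FFFF.

```pascal
program Utf8Width;
{$mode objfpc}{$H+}

// Hypothetical helper: number of bytes RFC 3629 UTF-8 needs for one code point.
// Returns 0 for values above U+10FFFF, which UTF-16 cannot represent either.
// Surrogate code points are not checked in this sketch.
function Utf8ByteCount(CodePoint: Cardinal): Integer;
begin
  if CodePoint <= $7F then
    Result := 1
  else if CodePoint <= $7FF then
    Result := 2
  else if CodePoint <= $FFFF then
    Result := 3
  else if CodePoint <= $10FFFF then
    Result := 4
  else
    Result := 0;
end;

begin
  WriteLn(Utf8ByteCount($24));      // '$'  -> 1
  WriteLn(Utf8ByteCount($A3));      // '£'  -> 2
  WriteLn(Utf8ByteCount($20AC));    // '€'  -> 3
  WriteLn(Utf8ByteCount($10FFFF));  // highest code point -> 4
  WriteLn(Utf8ByteCount($110000));  // out of range -> 0
end.
```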

 
