### Author Topic: I need to know if someone can explain s:='Sér'; ShowMessage(s.Length.ToString)

#### sergio.garcia

• Newbie
• Posts: 2
##### I need to know if someone can explain s:='Sér'; ShowMessage(s.Length.ToString)
« on: September 16, 2023, 11:23:14 pm »
Shouldn't the result of the code below be 3?

```pascal
var
  s: string;
begin
  s := 'Sér';
  ShowMessage(s.Length.ToString);
end;
```

#### Warfley

• Hero Member
• Posts: 1469
« Reply #1 on: September 16, 2023, 11:28:46 pm »
The length of a String is not its length in (printed) characters, but its length in bytes. And in UTF-8 one character can be represented by up to 6 bytes (even though currently only up to 4 bytes are used, but it may be extended afterwards). E.g. '$' is 1 byte long, '£' is 2 bytes and '€' is 3 bytes long. In general, only the 7-bit ASCII characters take 1 byte in UTF-8 (i.e. only the "American" alphabet). In Lazarus, where String holds UTF-8, 'Sér' is therefore 4 bytes long, because é takes 2 bytes.
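To make the byte counts above concrete, here is a minimal sketch as a plain Free Pascal console program (an assumption on my side, the original question uses a Lazarus form and ShowMessage). The UTF-8 bytes are written as explicit escapes so the result does not depend on how the source file is saved.

```pascal
program ByteLengthDemo;
{$mode objfpc}{$H+}

var
  Dollar, Pound, Euro, Ser: string;
begin
  // UTF-8 bytes are spelled out explicitly, so the source file encoding
  // cannot change what ends up in the strings.
  Dollar := '$';              // U+0024 -> 24
  Pound  := #$C2#$A3;         // U+00A3 -> C2 A3
  Euro   := #$E2#$82#$AC;     // U+20AC -> E2 82 AC
  Ser    := 'S'#$C3#$A9'r';   // 'Sér': é is U+00E9 -> C3 A9

  // Length reports bytes, not printed characters.
  WriteLn(Length(Dollar));    // 1
  WriteLn(Length(Pound));     // 2
  WriteLn(Length(Euro));      // 3
  WriteLn(Length(Ser));       // 4
end.
```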

And even with the UTF-8 character (code point) count, not every code point is a printed character of its own. Take é: there is a single Unicode character for it, but it can also be written as e followed by the combining acute accent (´).
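The two spellings of é can be compared with a small sketch. CountCodePoints is a hypothetical helper written just for this example (it is not a library function), and it assumes its input is valid UTF-8.

```pascal
program CombiningDemo;
{$mode objfpc}{$H+}

// Hypothetical helper: counts UTF-8 code points by skipping continuation
// bytes (bit pattern 10xxxxxx). Assumes S contains valid UTF-8.
function CountCodePoints(const S: string): Integer;
var
  i: Integer;
begin
  Result := 0;
  for i := 1 to Length(S) do
    if (Ord(S[i]) and $C0) <> $80 then
      Inc(Result);
end;

var
  Precomposed, Decomposed: string;
begin
  Precomposed := #$C3#$A9;      // U+00E9, é as one code point
  Decomposed  := 'e'#$CC#$81;   // U+0065 followed by U+0301 (combining acute)

  // Prints byte length, then code point count.
  WriteLn(Length(Precomposed), ' ', CountCodePoints(Precomposed)); // 2 1
  WriteLn(Length(Decomposed), ' ', CountCodePoints(Decomposed));   // 3 2
end.
```

Both strings are normally displayed as one é, yet they differ in byte length and in code point count.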

So, as a general rule of thumb when dealing with Unicode: you can't tell from the length of a string how many characters will be printed.
« Last Edit: September 16, 2023, 11:34:05 pm by Warfley »

#### paweld

• Hero Member
• Posts: 851
« Reply #2 on: September 17, 2023, 12:45:47 am »
Therefore, if you know that the string may contain non-ASCII characters, use the UTF8Length function.
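```pascal
uses
  LazUTF8;  // provides UTF8Length

var
  s: String;
begin
  s := 'Sér';
  // UTF8Length counts code points rather than bytes, so this shows 3
  // (assuming é is stored as the single code point U+00E9).
  ShowMessage(UTF8Length(s).ToString);
end;
```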
Best regards / Pozdrawiam
paweld

#### Remy Lebeau

• Hero Member
• Posts: 1283
« Reply #3 on: September 18, 2023, 10:35:00 pm »
> And in UTF-8 one character can be represented by up to 6 bytes (even though currently only up to 4 bytes are used, but it may be extended afterwards).

The original UTF-8 spec allowed for up to 6 bytes (technically, it could be unlimited!), but RFC 3629 restricts UTF-8 to 4 bytes max for compatibility with UTF-16, which physically can't exceed codepoint U+10FFFF.  I don't see them ever extending UTF-8 beyond that, unless Unicode itself eventually grows beyond U+10FFFF and consequently has to deprecate UTF-16.
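
As a rough illustration of the 4-byte limit, the sketch below maps a code point to the number of bytes RFC 3629 UTF-8 needs for it. Utf8ByteCount is a hypothetical helper made up for this example, and for brevity it does not reject the surrogate range U+D800..U+DFFF.

```pascal
program Utf8SizeDemo;
{$mode objfpc}{$H+}

// Hypothetical helper: bytes needed to encode CodePoint in UTF-8 as
// restricted by RFC 3629, or 0 if the value lies above U+10FFFF.
// Note: surrogates (U+D800..U+DFFF) are not filtered out in this sketch.
function Utf8ByteCount(CodePoint: Cardinal): Integer;
begin
  if CodePoint <= $7F then
    Result := 1
  else if CodePoint <= $7FF then
    Result := 2
  else if CodePoint <= $FFFF then
    Result := 3
  else if CodePoint <= $10FFFF then
    Result := 4
  else
    Result := 0;
end;

begin
  WriteLn(Utf8ByteCount($24));      // 1  ('$')
  WriteLn(Utf8ByteCount($A3));      // 2  ('£')
  WriteLn(Utf8ByteCount($20AC));    // 3  ('€')
  WriteLn(Utf8ByteCount($10FFFF));  // 4  (highest code point)
  WriteLn(Utf8ByteCount($110000));  // 0  (outside Unicode)
end.
```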
Remy Lebeau
Lebeau Software - Owner, Developer
Internet Direct (Indy) - Admin, Developer (Support forum)