#### yus

« on: July 03, 2020, 05:34:18 pm »
Hello! I have this code.

Code: Pascal
uses
  ... LazUTF8;

procedure TForm1.Button2Click(Sender: TObject);
var
  ar: array [0..19] of byte;
  u8str: UTF8String;
  err: integer;
begin
  //cebae1bdb9cf83cebcceb5eda080656469746564
  ar[0] := $ce;
  ar[1] := $ba;
  ar[2] := $e1;
  ar[3] := $bd;
  ar[4] := $b9;
  ar[5] := $cf;
  ar[6] := $83;
  ar[7] := $ce;
  ar[8] := $bc;
  ar[9] := $ce;
  ar[10] := $b5;
  ar[11] := $ed;
  ar[12] := $a0;
  ar[13] := $80;
  ar[14] := $65;
  ar[15] := $64;
  ar[16] := $69;
  ar[17] := $74;
  ar[18] := $65;
  ar[19] := $64;
  SetLength(u8str, 20);
  move(ar[0], u8str[1], 20); // invalid UTF-8 string

  err := FindInvalidUTF8Codepoint(@u8str[1], 20);
  // err = -1, but this is an invalid UTF-8 string.

  u8str := 'HELLO';  // valid UTF-8 string
  err := FindInvalidUTF8Codepoint(@u8str[1], 5);

  u8str := '';
end;

Why does FindInvalidUTF8Codepoint return -1 for the first string?
What am I doing wrong?

Lazarus
Version #: 2.0.8
Date: 2020-14-11
FPC Version: 3.0.4
SVN Revision: 62944
x86_64-win64-win32/win64
« Last Edit: July 03, 2020, 06:19:34 pm by yus »

#### winni

« Reply #1 on: July 03, 2020, 08:30:29 pm »
Hi

Code: Pascal
u8str := 'HELLO';  // valid UTF-8 string
// pass the address of the first character, not of the string variable itself
err := FindInvalidUTF8Codepoint(@u8str[1], 5);
showMessage(IntToStr(err)+' '+u8str);

Winni

#### yus

« Reply #2 on: July 03, 2020, 09:20:33 pm »
Quote from: winni on July 03, 2020, 08:30:29 pm
> Hi
>
> Code: Pascal
> u8str := 'HELLO';  // valid UTF-8 string
> err := FindInvalidUTF8Codepoint(@u8str[1], 5);
> showMessage(IntToStr(err)+' '+u8str);
>
> Winni

HELLO is a valid UTF-8 string, so -1 is the right result for it.

My question is why the result is also -1 for the invalid UTF-8 string:
Code: Pascal
  SetLength(u8str, 20);
  move(ar[0], u8str[1], 20); // invalid UTF-8 string

  err := FindInvalidUTF8Codepoint(@u8str[1], 20);
  showMessage(IntToStr(err)+' '+u8str);

#### winni

« Reply #3 on: July 03, 2020, 11:00:05 pm »
Hi!

You move 20 bytes behind ar[0] to the string.
This is illegal.
But there is a good chance that you move the first 20 bytes of your array.

So the string starts with $CE $BA.

This is a legal UTF-8 character, namely the Greek small letter kappa: κ
There is a good chance that the other bytes also form legal UTF-8 characters.

If you move only one byte to your string, you get an illegal UTF-8 string:

Code: Pascal
move(ar[0], u8str[1], 1);

Winni

« Last Edit: July 03, 2020, 11:02:16 pm by winni »
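
For what it's worth, every multi-byte sequence in the 20 bytes from the first post is structurally well-formed (a valid lead byte followed by the right number of continuation bytes), which may be why a structure-only check reports -1. The one problem is $ED $A0 $80 at byte offset 11: it decodes to U+D800, a surrogate, and the UTF-8 definition in RFC 3629 does not allow surrogates. Below is a minimal sketch of a stricter check. `FindInvalidUtf8Strict` is a hypothetical helper written for this thread, not a LazUTF8 routine, and whether `FindInvalidUTF8Codepoint` in Lazarus 2.0.8 skips the surrogate test is an assumption that has not been verified against its source here.

```pascal
program StrictUtf8Check;
{$mode objfpc}{$H+}

// Hypothetical helper (not part of LazUTF8): returns the 0-based byte index
// of the first ill-formed UTF-8 sequence in p[0..Count-1], or -1 if none.
function FindInvalidUtf8Strict(p: PChar; Count: SizeInt): SizeInt;
var
  i, k, Len: SizeInt;
  b: Byte;
  cp, MinCp: Cardinal;
begin
  i := 0;
  while i < Count do
  begin
    b := Ord(p[i]);
    // Classify the lead byte: sequence length, payload bits, and the
    // smallest code point that legitimately needs this length.
    if b < $80 then
      begin Len := 1; cp := b; MinCp := 0; end
    else if (b and $E0) = $C0 then
      begin Len := 2; cp := b and $1F; MinCp := $80; end
    else if (b and $F0) = $E0 then
      begin Len := 3; cp := b and $0F; MinCp := $800; end
    else if (b and $F8) = $F0 then
      begin Len := 4; cp := b and $07; MinCp := $10000; end
    else
      Exit(i); // stray continuation byte or invalid lead byte

    if i + Len > Count then
      Exit(i); // sequence is cut off by the end of the buffer

    for k := 1 to Len - 1 do
    begin
      if (Ord(p[i + k]) and $C0) <> $80 then
        Exit(i); // expected a continuation byte (10xxxxxx)
      cp := (cp shl 6) or (Cardinal(Ord(p[i + k])) and $3F);
    end;

    // Reject overlong forms, values above U+10FFFF, and surrogates.
    if (cp < MinCp) or (cp > $10FFFF) or
       ((cp >= $D800) and (cp <= $DFFF)) then
      Exit(i);

    Inc(i, Len);
  end;
  Result := -1;
end;

const
  // The same 20 bytes as in the first post.
  Data: array[0..19] of Byte = (
    $CE, $BA, $E1, $BD, $B9, $CF, $83, $CE, $BC, $CE, $B5,
    $ED, $A0, $80, $65, $64, $69, $74, $65, $64);
begin
  // prints 11: the offset of $ED $A0 $80 (U+D800, a surrogate)
  WriteLn(FindInvalidUtf8Strict(PChar(@Data[0]), Length(Data)));
end.
```

With only the first byte ($CE) and a Count of 1, the same helper returns 0, because the two-byte sequence is cut off. That matches the one-byte Move case from the last reply.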