
Author Topic: Question about FindInvalidUTF8Codepoint.  (Read 983 times)

yus

  • Jr. Member
  • **
  • Posts: 57
Question about FindInvalidUTF8Codepoint.
« on: July 03, 2020, 05:34:18 pm »
Hello! I have this code.

Code: Pascal
  uses
    ... LazUTF8;

  procedure TForm1.Button2Click(Sender: TObject);
  var
    ar: array [0..19] of byte;
    u8str: UTF8String;
    err: integer;
  begin
    //cebae1bdb9cf83cebcceb5eda080656469746564
    ar[0] := $ce;
    ar[1] := $ba;
    ar[2] := $e1;
    ar[3] := $bd;
    ar[4] := $b9;
    ar[5] := $cf;
    ar[6] := $83;
    ar[7] := $ce;
    ar[8] := $bc;
    ar[9] := $ce;
    ar[10] := $b5;
    ar[11] := $ed;
    ar[12] := $a0;
    ar[13] := $80;
    ar[14] := $65;
    ar[15] := $64;
    ar[16] := $69;
    ar[17] := $74;
    ar[18] := $65;
    ar[19] := $64;
    SetLength(u8str, 20);
    move(ar[0], u8str[1], 20); // invalid UTF-8 string

    err := FindInvalidUTF8Codepoint(@u8str[1], 20);
    // err = -1, but this is an invalid UTF-8 string.

    u8str := 'HELLO';  // valid UTF-8 string
    err := FindInvalidUTF8Codepoint(@u8str[1], 5);

    u8str := '';
  end;

Why does FindInvalidUTF8Codepoint return -1?
What am I doing wrong?

Lazarus
Version #: 2.0.8
Date: 2020-14-11
FPC Version: 3.0.4
SVN Revision: 62944
x86_64-win64-win32/win64
« Last Edit: July 03, 2020, 06:19:34 pm by yus »

winni

  • Hero Member
  • *****
  • Posts: 3197
Re: Question about FindInvalidUTF8Codepoint.
« Reply #1 on: July 03, 2020, 08:30:29 pm »
Hi

Code: Pascal
  u8str := 'HELLO';  // valid UTF-8 string
  // Pass the address of the first character; @u8str would be the address of the string variable itself.
  err := FindInvalidUTF8Codepoint(@u8str[1], 5);
  showMessage(IntToStr(err)+' '+u8str);

Winni

yus

  • Jr. Member
  • **
  • Posts: 57
Re: Question about FindInvalidUTF8Codepoint.
« Reply #2 on: July 03, 2020, 09:20:33 pm »
Quote from: winni on July 03, 2020, 08:30:29 pm
Hi

Code: Pascal
  u8str := 'HELLO';  // valid UTF-8 string
  err := FindInvalidUTF8Codepoint(@u8str[1], 5);
  showMessage(IntToStr(err)+' '+u8str);

Winni

HELLO is a valid UTF-8 string, so -1 is the right result for HELLO.

Why is the result -1 for the invalid UTF-8 string?
Code: Pascal
  SetLength(u8str, 20);
  move(ar[0], u8str[1], 20); // invalid UTF-8 string

  err := FindInvalidUTF8Codepoint(@u8str[1], 20);
  showMessage(IntToStr(err)+' '+u8str);

winni

  • Hero Member
  • *****
  • Posts: 3197
Re: Question about FindInvalidUTF8Codepoint.
« Reply #3 on: July 03, 2020, 11:00:05 pm »
Hi!

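You move the 20 bytes starting at ar[0] to the string.
The array holds exactly 20 bytes and the string has length 20, so the copy itself is fine.

So the string starts with $CE $BA.

This is a legal UTF-8 character, namely the small Greek kappa: κ
The other bytes also build well-formed sequences: $E1 $BD $B9, $CF $83, $CE $BC, $CE $B5, $ED $A0 $80 and then the plain ASCII text "edited".
The only problematic one is $ED $A0 $80. It has the shape of a legal 3-byte sequence, but it decodes to U+D800, which is a surrogate and not allowed in UTF-8.
Since you get -1, it looks like the function checks the byte structure but does not reject encoded surrogates.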
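
If you want to see what the 20 bytes actually decode to, here is a small sketch in plain Free Pascal (no LazUTF8 needed). DecodeAt is a hypothetical helper written only for this post: it trusts the lead byte for the length, does no validation at all, and handles only 1 to 3 byte sequences because the data contains nothing longer.

```pascal
program DecodeBytes;
{$mode objfpc}{$H+}
uses
  SysUtils;

const
  // The 20 bytes from the first post.
  Bytes: array[0..19] of Byte = (
    $CE, $BA, $E1, $BD, $B9, $CF, $83, $CE, $BC, $CE,
    $B5, $ED, $A0, $80, $65, $64, $69, $74, $65, $64);

// Hypothetical helper: decodes the sequence starting at index i.
// No validation; the lead byte alone decides the length (1..3 bytes).
function DecodeAt(i: Integer; out Len: Integer): LongInt;
var
  b: LongInt;
begin
  b := Bytes[i];
  if b < $80 then
  begin
    Len := 1;
    Result := b;
  end
  else if b < $E0 then
  begin
    Len := 2;
    Result := ((b and $1F) shl 6) or (Bytes[i + 1] and $3F);
  end
  else
  begin
    Len := 3;
    Result := ((b and $0F) shl 12)
           or ((Bytes[i + 1] and $3F) shl 6)
           or (Bytes[i + 2] and $3F);
  end;
end;

var
  i, Len: Integer;
  cp: LongInt;
begin
  i := 0;
  while i <= High(Bytes) do
  begin
    cp := DecodeAt(i, Len);
    Write('offset ', i, ': U+', IntToHex(cp, 4));
    if (cp >= $D800) and (cp <= $DFFF) then
      Write('  <- surrogate, not allowed in UTF-8');
    WriteLn;
    Inc(i, Len);
  end;
end.
```

The sequence at offset 11 ($ED $A0 $80) decodes to U+D800 and is the only one that gets flagged. Everything before it is ordinary Greek text and everything after it is ASCII.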

Move only one byte to your string and you get an illegal UTF-8 string:

Code: Pascal
  move(ar[0], u8str[1], 1);
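
The -1 reported in the first post for a string containing $ED $A0 $80 suggests that, at least in this Lazarus version, encoded surrogates (U+D800..U+DFFF) are not rejected. If that matters for your data, an extra pass can catch them. The sketch below is only an illustration: FindUTF8Surrogate is a hypothetical helper, not part of LazUTF8, and it assumes the buffer is otherwise well-formed UTF-8, so that $ED can only appear as the lead byte of a 3-byte sequence.

```pascal
program SurrogateCheck;
{$mode objfpc}{$H+}

// Hypothetical helper, not part of LazUTF8.
// Returns the byte offset of the first encoded surrogate
// ($ED followed by $A0..$BF), or -1 if there is none.
// Assumes the buffer is otherwise well-formed UTF-8.
function FindUTF8Surrogate(p: PChar; Count: PtrInt): PtrInt;
var
  i: PtrInt;
begin
  i := 0;
  while i < Count - 1 do
  begin
    if (Ord(p[i]) = $ED)
       and (Ord(p[i + 1]) >= $A0) and (Ord(p[i + 1]) <= $BF) then
      Exit(i);
    Inc(i);
  end;
  Result := -1;
end;

const
  // The 20 bytes from the first post.
  Data: array[0..19] of Byte = (
    $CE, $BA, $E1, $BD, $B9, $CF, $83, $CE, $BC, $CE,
    $B5, $ED, $A0, $80, $65, $64, $69, $74, $65, $64);
var
  s: UTF8String;
begin
  SetLength(s, Length(Data));
  Move(Data[0], s[1], Length(Data));
  WriteLn(FindUTF8Surrogate(PChar(s), Length(s)));  // 11, the offset of $ED $A0 $80

  s := 'HELLO';
  WriteLn(FindUTF8Surrogate(PChar(s), Length(s)));  // -1, no surrogate
end.
```

In the button handler you would call it only after FindInvalidUTF8Codepoint has returned -1, and treat the string as invalid if either check reports an offset.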

Winni

« Last Edit: July 03, 2020, 11:02:16 pm by winni »

 
