Topic: pos or utf8pos return incorrect value

DeSoLaToR

  • Newbie
  • Posts: 1
pos or utf8pos return incorrect value
« on: May 17, 2019, 05:02:09 pm »
Hello!
I have an issue when I try to write a Russian-to-English transliteration function.
The code works fine in Delphi 10.

Let's look at the code:
Code: Pascal
uses
  Classes, SysUtils, FileUtil, Forms, Controls, Graphics, Dialogs, StdCtrls, ClipBrd, LCLProc;
Code: Pascal
function Translit(s: string): string;
const
  rus: string = 'абвгдеёжзийклмнопрстуфхцчшщьыъэюя';
  lat: array[1..33] of string = ('a', 'b', 'v', 'g', 'd', 'e', 'yo', 'zh', 'z', 'i', 'y', 'k',
    'l', 'm', 'n', 'o', 'p', 'r', 's', 't', 'u', 'f', 'h', 'ts', 'ch', 'sh', 'sch', '', 'y', '', 'e', 'yu', 'ya');
var
  p, i, l, r, r2, l2: integer;
  rp, rp2: string;
begin
  s := widelowercase(s);
  Result := '';
  l := Length(s);
  for i := 1 to l do
  begin
    p := Pos(s[i], rus);
    if p < 1 then Result := Result + s[i] else Result := Result + lat[p];
  end;
end;

For example:
Length returns the size in bytes, which differs from the character count I need (35 bytes, 18 chars); UTF8Length is needed instead.
For the input 'викторов александр', Pos returns: 1 6 1 20 1 24 13 40 1 32 13 36 1 32 1 6 0 1 2 1 26 1 12 1 24 13 38 1 2 1 30 1 10 13 36
but the correct values are: 3 10 12 20 16 18 16 3 1 13 6 12 19 1 15 5 18

Example two, using utf8:
Code: Pascal
uses
  Classes, SysUtils, FileUtil, Forms, Controls, Graphics, Dialogs, StdCtrls, ClipBrd, LCLProc, lazutf8;
Code: Pascal
function Translit(s: string): string;
const
  rus: string = 'абвгдеёжзийклмнопрстуфхцчшщьыъэюя';
  lat: array[1..33] of string = ('a', 'b', 'v', 'g', 'd', 'e', 'yo', 'zh', 'z', 'i', 'y', 'k',
    'l', 'm', 'n', 'o', 'p', 'r', 's', 't', 'u', 'f', 'h', 'ts', 'ch', 'sh', 'sch', '', 'y', '', 'e', 'yu', 'ya');
var
  p, i, l, r, r2, l2: integer;
  rp, rp2: string;
begin
  s := widelowercase(s);
  Result := '';
  l := utf8Length(s);
  for i := 1 to l do
  begin
    p := utf8Pos(s[i], rus);
    if p < 1 then Result := Result + s[i] else Result := Result + lat[p];
  end;
end;

With the UTF-8 functions:
UTF8Length returns the correct value, 18 chars.
For 'викторов александр', UTF8Pos returns: 1 4 1 11 1 13 7 21 1 17 7 19 1 17 1 4 0 1
which is still not correct (expected: 3 10 12 20 16 18 16 3 1 13 6 12 19 1 15 5 18).

I repeat: in Delphi, all of this code works fine.
My setup:
Win10 x64, Laz 1.6.4
Where did I go wrong? Please help.

wp

  • Hero Member
  • *****
  • Posts: 6337
Re: pos or utf8pos return incorrect value
« Reply #1 on: May 17, 2019, 05:50:22 pm »
I don't know exactly how you got these numbers. But when I modify your code as shown below, I get output that looks correct to me:

Code: Pascal
uses
  LazUTF8, LazUnicode;

function Translit(s: String): String;
const
  rus: string = 'абвгдеёжзийклмнопрстуфхцчшщьыъэюя';
  lat: array[1..33] of string = ('a', 'b', 'v', 'g', 'd', 'e', 'yo', 'zh', 'z',
    'i', 'y', 'k', 'l', 'm', 'n', 'o', 'p', 'r', 's', 't', 'u', 'f', 'h', 'ts',
    'ch', 'sh', 'sch', '', 'y', '', 'e', 'yu', 'ya');
var
  ch: string;  // IMPORTANT: must not be "char"
  p: Integer;
begin
  Result := '';
  // UTF8LowerCase (LazUTF8) also lowercases Cyrillic; Lowercase handles ASCII letters only
  s := UTF8LowerCase(s);
  for ch in s do begin
    p := UTF8Pos(ch, rus);
    if p < 1 then Result := Result + ch else Result := Result + lat[p];
  end;
end;

procedure TForm1.Button1Click(Sender: TObject);
begin
  ShowMessage(Translit('викторов александр'));
end;

The point is that the input string is UTF-8 encoded, so what we perceive as a "character" consists of 1 to 4 bytes. Therefore your code p := utf8Pos(s[i], rus) is wrong: s[i] steps through the string byte by byte, not character by character as you expect.
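
To make the byte-versus-character point concrete, here is a minimal sketch (not part of the original code). It assumes the LazUtils package is on the unit path so that LazUTF8 can be used from a console program; the program name is made up for illustration. The strings are written as explicit UTF-8 bytes so the result does not depend on the encoding of the source file.

```pascal
program ByteVsCodepoint;
{$mode objfpc}{$H+}
uses
  LazUTF8;
var
  rus, s: string;
begin
  rus := #$D0#$B0#$D0#$B1#$D0#$B2;  // 'абв' as UTF-8 bytes (2 bytes per letter)
  s   := #$D0#$B2;                  // 'в' as UTF-8 bytes

  WriteLn(Length(s));        // 2: Length counts bytes
  WriteLn(UTF8Length(s));    // 1: UTF8Length counts codepoints
  WriteLn(Pos(s[1], rus));   // 1: lead byte $D0 matches the first byte of 'а'
  WriteLn(Pos(s[2], rus));   // 6: byte $B2 is the last byte of 'в'
  WriteLn(UTF8Pos(s, rus));  // 3: the whole codepoint is the third letter
end.
```

The pair 1 and 6 is exactly how the Pos output quoted in the first post begins: each Cyrillic letter was looked up as two separate bytes, so it produced two meaningless positions instead of one.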

Unit LazUnicode provides a handy enumerator that lets you step through the string by character: declare a "character" variable ch of type "string", not "char", because it can consist of up to 4 bytes. Then use for ch in s do... to iterate through the string.
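
If the enumerator is not an option in your installation, the same loop can probably be written with LazUTF8 alone. The following is only a sketch under that assumption: the name TranslitByCopy is invented here, and the source file is assumed to be saved as UTF-8, as Lazarus does by default. UTF8Copy has to scan from the start of the string on every call, so this variant is slower on long inputs but fine for names and short phrases.

```pascal
program TranslitDemo;
{$mode objfpc}{$H+}
uses
  LazUTF8;

// Hypothetical variant of Translit that steps by codepoint using UTF8Copy
function TranslitByCopy(const s: string): string;
const
  rus: string = 'абвгдеёжзийклмнопрстуфхцчшщьыъэюя';
  lat: array[1..33] of string = ('a', 'b', 'v', 'g', 'd', 'e', 'yo', 'zh', 'z',
    'i', 'y', 'k', 'l', 'm', 'n', 'o', 'p', 'r', 's', 't', 'u', 'f', 'h', 'ts',
    'ch', 'sh', 'sch', '', 'y', '', 'e', 'yu', 'ya');
var
  src, ch: string;
  i, p: Integer;
begin
  Result := '';
  src := UTF8LowerCase(s);           // lowercases Cyrillic as well as ASCII
  for i := 1 to UTF8Length(src) do   // i counts codepoints, not bytes
  begin
    ch := UTF8Copy(src, i, 1);       // the i-th codepoint, 1 to 4 bytes long
    p := UTF8Pos(ch, rus);           // codepoint index in the alphabet, 0 if absent
    if p < 1 then
      Result := Result + ch          // not a Russian letter: copy unchanged
    else
      Result := Result + lat[p];
  end;
end;

begin
  WriteLn(TranslitByCopy('Викторов Александр'));
end.
```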

In Delphi, strings are encoded as UTF-16, i.e. each codepoint consists of 1 or 2 16-bit words. In principle this leads to the same problem, but I guess the 2nd word is not needed for any of the Russian characters.
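
The UTF-16 situation can be checked from Free Pascal as well. This is a rough sketch that uses FPC's UnicodeString as a stand-in for Delphi's string type, which is an assumption about equivalence rather than a test run in Delphi itself. UTF8Decode comes from the System unit; the program name is made up.

```pascal
program Utf16Units;
{$mode objfpc}{$H+}
var
  cyr, outsideBmp: UnicodeString;
begin
  // 'в' (U+0432), given as UTF-8 bytes and converted to UTF-16
  cyr := UTF8Decode(#$D0#$B2);
  // U+1F600, a codepoint outside the Basic Multilingual Plane
  outsideBmp := UTF8Decode(#$F0#$9F#$98#$80);

  // Length of a UnicodeString counts 16-bit code units, not codepoints
  WriteLn('UTF-16 code units for U+0432:  ', Length(cyr));
  WriteLn('UTF-16 code units for U+1F600: ', Length(outsideBmp));
end.
```

The Cyrillic letter should occupy a single 16-bit word, which is why indexing with s[i] appears to work for Russian text in a UTF-16 string. A codepoint outside the Basic Multilingual Plane needs a surrogate pair in UTF-16, so the second length is expected to be larger than one.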
Lazarus trunk / fpc 3.0.4 / all 32-bit on Win-10

Martin_fr

  • Administrator
  • Hero Member
  • *
  • Posts: 5699
    • wiki
Re: pos or utf8pos return incorrect value
« Reply #2 on: May 17, 2019, 06:13:48 pm »
Just to underline the last point made by wp:
Quote
In Delphi, strings are encoded as UTF16, i.e. consist of 1 or 2 words per codepoint.

In Delphi this may work because you are lucky. The code "s[1]" is still wrong, but with the Russian chars you use the error will never manifest, because each of those chars (not verified, but likely) is a single word in UTF-16.

If you used another language, then it (s[1]) would fail with UTF-16 too.
Even some European chars like "ä" can take 2 words in UTF-16 (even in UTF-32). They usually don't, but they can.

This is not bound to any particular form of UTF-n. UTF is just an encoding for Unicode, and some chars are of variable length (google "combining codepoints").
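
As an illustration of the combining-codepoint case, here is a small sketch (again assuming LazUTF8 from the LazUtils package is available; the program name is invented). It writes "ä" in two ways: as the single precomposed codepoint U+00E4, and as the letter "a" followed by the combining diaeresis U+0308. Both render the same, but they differ in length and content.

```pascal
program CombiningDemo;
{$mode objfpc}{$H+}
uses
  LazUTF8;
var
  precomposed, decomposed: string;
begin
  precomposed := #$C3#$A4;      // U+00E4 as UTF-8 bytes
  decomposed  := 'a'#$CC#$88;   // U+0061 followed by U+0308 as UTF-8 bytes

  WriteLn(UTF8Length(precomposed));   // 1 codepoint
  WriteLn(UTF8Length(decomposed));    // 2 codepoints for the same visible letter
  WriteLn(precomposed = decomposed);  // FALSE: the byte sequences differ
end.
```

So even stepping by codepoint does not always equal stepping by what the user sees as one letter; a lookup table like rus would only match the form it was written in.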