pos or utf8pos return incorrect value

DeSoLaToR

Newbie
Posts: 1


« on: May 17, 2019, 05:02:09 pm »

Hello!
I have an issue when I try to write a Russian-to-English transliteration function.
The code works fine in Delphi 10.

Let's look at the code:

Code: Pascal

uses
  Classes, SysUtils, FileUtil, Forms, Controls, Graphics, Dialogs, StdCtrls, ClipBrd, LCLProc;

function Translit(s: string): string;
const
  rus: string = 'абвгдеёжзийклмнопрстуфхцчшщьыъэюя';
  lat: array[1..33] of string = ('a', 'b', 'v', 'g', 'd', 'e', 'yo', 'zh', 'z', 'i', 'y', 'k',
    'l', 'm', 'n', 'o', 'p', 'r', 's', 't', 'u', 'f', 'h', 'ts', 'ch', 'sh', 'sch', '', 'y', '', 'e', 'yu', 'ya');
var
  p, i, l, r, r2, l2: integer;
  rp, rp2: string;
begin
  s := WideLowerCase(s);
  Result := '';
  l := Length(s);
  for i := 1 to l do
  begin
    p := Pos(s[i], rus);
    if p < 1 then Result := Result + s[i] else Result := Result + lat[p];
  end;
end;

For example:
Length returns the byte count, which differs from the character count I need (35 bytes vs. 18 characters), so UTF8Length is needed instead.
For 'викторов александр', Pos returns: 1 6 1 20 1 24 13 40 1 32 13 36 1 32 1 6 0 1 2 1 26 1 12 1 24 13 38 1 2 1 30 1 10 13 36
but the correct values are: 3 10 12 20 16 18 16 3 1 13 6 12 19 1 15 5 18

Example two, using the UTF-8 functions:

Code: Pascal

uses
  Classes, SysUtils, FileUtil, Forms, Controls, Graphics, Dialogs, StdCtrls, ClipBrd, LCLProc, lazutf8;

function Translit(s: string): string;
const
  rus: string = 'абвгдеёжзийклмнопрстуфхцчшщьыъэюя';
  lat: array[1..33] of string = ('a', 'b', 'v', 'g', 'd', 'e', 'yo', 'zh', 'z', 'i', 'y', 'k',
    'l', 'm', 'n', 'o', 'p', 'r', 's', 't', 'u', 'f', 'h', 'ts', 'ch', 'sh', 'sch', '', 'y', '', 'e', 'yu', 'ya');
var
  p, i, l, r, r2, l2: integer;
  rp, rp2: string;
begin
  s := WideLowerCase(s);
  Result := '';
  l := UTF8Length(s);
  for i := 1 to l do
  begin
    p := UTF8Pos(s[i], rus);
    if p < 1 then Result := Result + s[i] else Result := Result + lat[p];
  end;
end;

With the UTF-8 functions:
UTF8Length returns the correct value, 18 characters.
For 'викторов александр', UTF8Pos returns: 1 4 1 11 1 13 7 21 1 17 7 19 1 17 1 4 0 1
which is still not correct (expected: 3 10 12 20 16 18 16 3 1 13 6 12 19 1 15 5 18).

I repeat: in Delphi, all of this code works fine.
My setup:
Windows 10 x64, Lazarus 1.6.4
Where did I go wrong? Please help.


wp

Hero Member
Posts: 11910


« Reply #1 on: May 17, 2019, 05:50:22 pm »

I don't know exactly what you are doing to get these numbers, but when I modify your code as shown below I get output that looks correct to me:

Code: Pascal

uses
  LazUTF8, LazUnicode;
 
function Translit(s: String): String;
const
  rus: string = 'абвгдеёжзийклмнопрстуфхцчшщьыъэюя';
  lat: array[1..33] of string = ('a', 'b', 'v', 'g', 'd', 'e', 'yo', 'zh', 'z',
    'i', 'y', 'k', 'l', 'm', 'n', 'o', 'p', 'r', 's', 't', 'u', 'f', 'h', 'ts',
    'ch', 'sh', 'sch', '', 'y', '', 'e', 'yu', 'ya');
var
  ch: string;  // IMPORTANT: must not be "char"
  p: Integer;
begin
  Result := '';
  s := UTF8LowerCase(s);  // SysUtils.LowerCase only converts ASCII letters
  for ch in s do begin
    p := UTF8Pos(ch, rus);
    if p < 1 then Result := Result + ch else Result := Result + lat[p];
  end;
end;
 
procedure TForm1.Button1Click(Sender: TObject);
begin
  ShowMessage(Translit('викторов александр'));
end; 

The point is that the input string is UTF-8-encoded, so what we perceive as a "character" consists of 1 to 4 bytes. Therefore your code p := utf8Pos(s[i], rus) is wrong: s[i] steps through the string byte by byte, not character by character as you expect.
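To make the byte-versus-character difference concrete, here is a minimal sketch. It assumes a console program that has the LazUtils package (unit LazUTF8) on its unit path. The program name is made up for illustration, and the strings are written as explicit UTF-8 byte sequences so that the result does not depend on the source file encoding.

```pascal
program ByteVsCodepoint;  // hypothetical demo program, not from the thread
{$mode objfpc}{$H+}

uses
  LazUTF8;  // assumed to be available via the LazUtils package

var
  s, rus: string;
begin
  // UTF-8 bytes written out explicitly so the demo does not depend on the
  // encoding of the source file.
  s   := #$D0#$B2#$D0#$B8;          // 'ви'  (2 characters, 4 bytes)
  rus := #$D0#$B0#$D0#$B1#$D0#$B2;  // 'абв' (3 characters, 6 bytes)

  WriteLn(Length(s));      // 4: Length counts bytes
  WriteLn(UTF8Length(s));  // 2: UTF8Length counts codepoints

  // s[2] is only the second byte of 'в', so Pos reports a byte offset in rus.
  WriteLn(Pos(s[2], rus));  // 6

  // UTF8Copy extracts a whole codepoint; UTF8Pos then returns its
  // character index in rus.
  WriteLn(UTF8Pos(UTF8Copy(s, 1, 1), rus));  // 3
end.
```

The 6 printed by Pos matches the second number in the Pos output of the original post: it is the byte offset of the trailing byte of 'в', not a character index.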

In unit LazUnicode there is a handy enumerator that helps you step through the string character by character. Declare a "character" variable ch of type "string", not "char", because it can consist of up to 4 bytes. Then use for ch in s do... to iterate through the string.

In Delphi, strings are encoded as UTF-16, i.e. each codepoint consists of 1 or 2 16-bit words. Essentially this leads to the same problem, but I guess the second word is never needed for Russian characters.


Martin_fr

Administrator
Hero Member
Posts: 9857
Debugger - SynEdit - and more


« Reply #2 on: May 17, 2019, 06:13:48 pm »

Just to underline wp's last point:

Quote

In Delphi, strings are encoded as UTF16, i.e. consist of 1 or 2 words per codepoint.

In Delphi this may work because you are lucky. The code "s[1]" is still wrong, but with the Russian chars you use the error never manifests, because each of those chars (not verified, but likely) is a single word in UTF-16.

If you used another language, then s[1] could fail with UTF-16 too.
Even some European chars like "ä" can take 2 words in UTF-16 (and even in UTF-32). They usually don't, but they can.

This is not bound to any particular UTF-n. UTF is just an encoding of Unicode, and some chars are of variable length regardless of the encoding (google "combining codepoints").
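As a rough illustration of the combining-codepoint case, the sketch below compares the two ways "ä" can be stored. It assumes unit LazUTF8 from the LazUtils package is available; the program and variable names are made up for this example, and the strings are again given as explicit UTF-8 bytes.

```pascal
program CombiningDemo;  // hypothetical demo program, not from the thread
{$mode objfpc}{$H+}

uses
  LazUTF8;  // assumed to be available via the LazUtils package

var
  precomposed, decomposed: string;
begin
  precomposed := #$C3#$A4;     // U+00E4: "ä" as a single codepoint
  decomposed  := 'a'#$CC#$88;  // "a" followed by U+0308 COMBINING DIAERESIS

  // Both strings display as "ä", but they differ in bytes and in codepoints.
  WriteLn(Length(precomposed), ' ', UTF8Length(precomposed));  // 2 1
  WriteLn(Length(decomposed), ' ', UTF8Length(decomposed));    // 3 2
end.
```

In the decomposed form even a codepoint-by-codepoint loop sees two items for one visible character, which is why no fixed-size indexing scheme is safe in general.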


MakcuM

Newbie
Posts: 1


« Reply #3 on: February 03, 2022, 06:37:25 pm »

Please try this code:

Code: Pascal

function UTF8Translit(s: String): String;
const
  rus : string =  'абвгдеёжзийклмнопрстуфхцчшщьыъэюя';
  rusC: string =  'АБВГДЕЁЖЗИЙКЛМНОПРСТУФХЦЧШЩЬЫЪЭЮЯ';
  lat : array[1..33] of string = ('a', 'b', 'v', 'g', 'd', 'e', 'yo', 'zh', 'z',
    'i', 'y', 'k', 'l', 'm', 'n', 'o', 'p', 'r', 's', 't', 'u', 'f', 'h', 'ts',
    'ch', 'sh', 'sch', '', 'y', '', 'e', 'yu', 'ya');
var
  ch: string;  // IMPORTANT: must not be "char"
   p: Integer;
  pC: Integer;
   i: integer;
begin
  Result := '';
  for i:=1 to UTF8Length(s) do
    begin
      ch:= NthCodePoint(s,i);
      p := UTF8Pos(ch, rus);
      pC:= UTF8Pos(ch, rusC);
      if p < 1 then
        begin
          if PC < 1 then
            Result := Result + ch
          else
            if length(lat[pC])>0 then
              begin
                Result := Result + UpperCase(lat[pC][1]);
                if length(lat[pC])>1 then
                  for p:=2 to length(lat[pC])do
                    Result := Result + lat[pC][p];
              end;
        end
      else
        Result := Result + lat[p];
 
    end;
end;