Author Topic: [SOLVED] Replacing a non-ascii chars in UTF8 string  (Read 3032 times)

alpine

« on: January 15, 2022, 11:45:46 am »
Hi,

Has anyone figured out an easy way to replace non-ASCII chars in a UTF8 string using a translation table?

The problem is that when a user enters a license plate number into an edit box, it can look the same as the real number but be written in Cyrillic characters. The Cyrillic alphabet has many letters that look like Latin ones. When such a wrong number gets into the database, it can't be found easily.

The obvious way would be to replace all similar-looking chars with Latin ones before processing, but I can't find a simple way of doing that.
« Last Edit: January 15, 2022, 02:16:29 pm by y.ivanov »
"I'm sorry Dave, I'm afraid I can't do that."
—HAL 9000

winni

« Reply #1 on: January 15, 2022, 12:20:36 pm »
Hi!

The UTF8 chars are grouped.

You should get the value of each UTF8 char and check whether it is a member of the Latin groups or the Cyrillic groups.

The basic groups are:

The Latin groups: U+0000 .. U+024F (Basic Latin through Latin Extended-B)
The Cyrillic groups: U+0400 .. U+052F (Cyrillic and Cyrillic Supplement)

There are some more.

Have a look at https://www.utf8-chartable.de/unicode-utf8-table.pl

Winni
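
A rough sketch of such a range check in Free Pascal, in case it helps. ContainsCyrillic is a made-up name, and the sketch assumes the text is already a UnicodeString (a Lazarus UTF8 string could be converted with UTF8Decode first) and that the source file is saved as UTF-8. The range is the Cyrillic one given above.

```pascal
program CyrillicCheck;
{$mode objfpc}{$H+}
{$codepage utf8}  // assumption: this file is saved as UTF-8

// Hypothetical helper: True if S contains a code point from the
// Cyrillic or Cyrillic Supplement blocks (U+0400 .. U+052F).
// All of these code points fit into a single UTF-16 unit.
function ContainsCyrillic(const S: UnicodeString): Boolean;
var
  i: SizeInt;
begin
  Result := False;
  for i := 1 to Length(S) do
    if (Ord(S[i]) >= $0400) and (Ord(S[i]) <= $052F) then
      Exit(True);
end;

begin
  WriteLn(ContainsCyrillic('CA1234BX'));  // Latin letters only
  WriteLn(ContainsCyrillic('СА1234ВХ'));  // Cyrillic look-alikes
end.
```

This only detects the problem; it does not say which Latin letter a given Cyrillic one should become.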

Thaddy

« Reply #2 on: January 15, 2022, 12:35:44 pm »
It would be best if the database field were defined as UTF8. Then it will always store correctly. This is also how most database string fields are defined nowadays.
Object Pascal programmers should get rid of their "component fetish" especially with the non-visuals.

SymbolicFrank

« Reply #3 on: January 15, 2022, 12:50:52 pm »
Quote from: Thaddy on January 15, 2022, 12:35:44 pm
It would be best if the database field were defined as UTF8. Then it will always store correctly. This is also how most database string fields are defined nowadays.

Well, yes, but only if you type the same char when searching for the value. The storage isn't the problem.

alpine

« Reply #4 on: January 15, 2022, 12:53:25 pm »
Quote from: Thaddy on January 15, 2022, 12:35:44 pm
It would be best if the database field were defined as UTF8. Then it will always store correctly. This is also how most database string fields are defined nowadays.

It is defined as such.

The trouble is elsewhere: when users enter data, the keyboard is usually switched to Cyrillic, and because the letters look alike it is not obvious which alphabet was used. The next time, the keyboard may be switched to Latin, and the same plate number will look exactly the same but be written with Latin letters.

FYI: АВЕКМНОРСТУХ look the same in Cyrillic and Latin, but of course they have different codes.

A UTF8 field can't resolve that issue; the chars must be replaced with the help of a translation table:
 АВЕКМНОРСТУХ -> ABEKMHOPCTYX
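
To make the idea concrete, here is a minimal sketch of such a table as two parallel strings. It is only an illustration: CyrToLatin is a made-up name, it works on UnicodeString (a UTF8 string would need UTF8Decode before and UTF8Encode after), and it assumes the source file is saved as UTF-8 so that {$codepage utf8} applies.

```pascal
program PlateTranslit;
{$mode objfpc}{$H+}
{$codepage utf8}  // assumption: this file is saved as UTF-8

const
  // Same position in both strings = one look-alike pair.
  CyrLetters: UnicodeString = 'АВЕКМНОРСТУХ';
  LatLetters: UnicodeString = 'ABEKMHOPCTYX';

// Hypothetical helper: replaces each Cyrillic look-alike with its Latin
// twin and leaves every other character untouched.
function CyrToLatin(const S: UnicodeString): UnicodeString;
var
  i, p: SizeInt;
begin
  Result := S;
  for i := 1 to Length(Result) do
  begin
    p := Pos(Result[i], CyrLetters);  // 0 when the char is not in the table
    if p > 0 then
      Result[i] := LatLetters[p];
  end;
end;

var
  Plate: UnicodeString;
begin
  Plate := CyrToLatin('СА1234ВХ');  // typed with a Cyrillic keyboard layout
  WriteLn(Plate = 'CA1234BX');       // compare with the Latin spelling
end.
```

Running the same function on the search term as well as on the stored value would make both sides comparable regardless of the keyboard layout.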

SymbolicFrank

« Reply #5 on: January 15, 2022, 01:07:47 pm »
Letter E

Note the letters that look like 'H' and such.

I just checked and the Cyrillic "E" isn't even in the picture. There are far too many E's to fit the box.
« Last Edit: January 15, 2022, 01:10:44 pm by SymbolicFrank »

wp

« Reply #6 on: January 15, 2022, 01:25:22 pm »
Using the string iterator of unit LazUnicode it is easy to run through the UTF8 codepoints of an input string and replace them as needed.

Code: Pascal
uses
  LazUnicode;  // enables for-in over the UTF8 characters of a String

// Replaces Cyrillic letters that look like Latin ones with the Latin letter.
// All other characters are copied unchanged.
function CheckAndReplace(const AText: String): String;
var
  cyrillicCh, latinCh: String;
begin
  Result := '';
  for cyrillicCh in AText do
  begin
    // Left side: Cyrillic letter, right side: its Latin look-alike.
    case cyrillicCh of
      'А': latinCh := 'A';
      'В': latinCh := 'B';
      'Е': latinCh := 'E';
      'е': latinCh := 'e';
      'І': latinCh := 'I';
      'Ѕ': latinCh := 'S';
      'К': latinCh := 'K';
      'О': latinCh := 'O';
      'Р': latinCh := 'P';
      'Т': latinCh := 'T';
      'С': latinCh := 'C';
      'М': latinCh := 'M';
      'Н': latinCh := 'H';
      'У': latinCh := 'Y';
      'Х': latinCh := 'X';
      'а': latinCh := 'a';
      'о': latinCh := 'o';
      'р': latinCh := 'p';
      'с': latinCh := 'c';
      // add more...
      else latinCh := cyrillicCh;
    end;
    Result := Result + latinCh;
  end;
end;
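
Since the else branch passes unmapped characters through unchanged, it may be worth checking afterwards whether anything non-ASCII is left in the result. A minimal sketch of such a check follows; IsPureAscii is a made-up name, and it relies only on the fact that UTF8 encodes every non-ASCII code point with bytes of $80 and above.

```pascal
program AsciiCheck;
{$mode objfpc}{$H+}

// Hypothetical helper: True if S contains only 7-bit ASCII bytes.
// In UTF8 every non-ASCII code point uses bytes >= $80, so a plain
// byte scan is enough; no decoding is needed.
function IsPureAscii(const S: RawByteString): Boolean;
var
  i: SizeInt;
begin
  Result := True;
  for i := 1 to Length(S) do
    if Ord(S[i]) >= $80 then
      Exit(False);
end;

begin
  WriteLn(IsPureAscii('CA1234BX'));
  // #$D0#$A1 are the UTF8 bytes of the Cyrillic letter U+0421
  WriteLn(IsPureAscii(#$D0#$A1'A1234BX'));
end.
```

If the check fails after CheckAndReplace, the input contains a character that is still missing from the case statement.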

alpine

« Reply #7 on: January 15, 2022, 02:13:20 pm »
Quote from: wp on January 15, 2022, 01:25:22 pm
Using the string iterator of unit LazUnicode it is easy to run through the UTF8 codepoints of an input string and replace them as needed.

That's exactly what I meant. Thanks a lot!

 
