
Author Topic: Lazarus and specific utf8/unicode characters  (Read 18017 times)

sasa

  • Guest
Lazarus and specific utf8/unicode characters
« on: May 13, 2011, 09:25:50 am »
Latest SVN Lazarus, FPC 2.4.2, Windows 7 Starter 32-bit.
Task: a converter from Latin to Cyrillic.
No special fonts or charsets are used (tried the default font and Arial; charset default and UNICODE).

A rather strange problem, and possibly a bug. The conversion tables are inserted into the .pas files as constant string arrays. Conversion works correctly until certain special characters are involved, e.g. "đ" to "ђ" or "š" to "ш". The comparison that should find "đ" fails.

I assumed the .pas file saved as UTF-8 might be the difference, but assigning these characters to Edit.Text or Caption shows the correct result. I also assume that the string type fully supports UTF-8 on Windows.

This is a very simple piece of code that demonstrates the problem:
Code: [Select]
procedure TForm1.Button5Click(Sender: TObject);

const
  Cyr : array [1..4] of string = ('а','б','ђ','ш');
  Lat : array [1..4] of string = ('a','b','đ','š');

var
  i: integer;
  s: string;
begin
   s:='abđš';

  for i:=1 to 4 do
  begin
    if  s[i] = Lat[i] then
      caption:=caption+ Cyr[i]
  end;

end;


The caption contains only "аб"; the correct result should be "абђш".

Assigning the value directly to the caption gives the correct result:
Code: [Select]
caption:= 'абђш';
or
caption:= 'abđš';

Can anyone confirm?

zeljko

  • Hero Member
  • *****
  • Posts: 1764
    • http://wiki.lazarus.freepascal.org/User:Zeljan
Re: Lazarus and specific utf8/unicode characters
« Reply #1 on: May 13, 2011, 09:51:17 am »
That works ok on linux afaik. Can you try UTF8Decode(Cyr) and see if it works under windows ?

theo

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 1932
Re: Lazarus and specific utf8/unicode characters
« Reply #2 on: May 13, 2011, 10:04:48 am »
No, this can't work. You have UTF-8 in your code editor.

Code: [Select]
  s:='abđš';

  for i:=1 to 4 do
  begin
    if  s[i] =


Length(s) is not 4 but 6, I guess, because it counts bytes (UTF8Length(s) is what returns 4). The whole idea of indexing with s[i] won't work with UTF-8.
Use UTF8Decode (as Zeljko says) and WideString.
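
The byte count versus character count difference can be checked with a small console sketch. It assumes plain Free Pascal with no LCL units; CountCodepoints is a hypothetical helper written for this illustration, not a Lazarus function, and the string is spelled as raw UTF-8 bytes so the result does not depend on how the source file is saved.

```pascal
program ByteVsCodepoint;
{$mode objfpc}{$H+}

// Hypothetical helper: counts UTF-8 code points by skipping
// continuation bytes, which all have the bit pattern 10xxxxxx.
function CountCodepoints(const S: string): Integer;
var
  I: Integer;
begin
  Result := 0;
  for I := 1 to Length(S) do
    if (Ord(S[I]) and $C0) <> $80 then
      Inc(Result);
end;

var
  S: string;
begin
  // 'abđš' as raw UTF-8 bytes: đ = C4 91, š = C5 A1
  S := 'ab' + #$C4#$91 + #$C5#$A1;
  WriteLn('Length (bytes): ', Length(S));
  WriteLn('Code points:    ', CountCodepoints(S));
end.
```

For this input it should report 6 bytes and 4 code points, which is why a loop from 1 to 4 over s[i] never reaches the bytes of "š".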

sasa

  • Guest
Re: Lazarus and specific utf8/unicode characters
« Reply #3 on: May 13, 2011, 10:52:19 am »
This is implicit typecasting nonsense. A string should be UTF-8 from the start and its logical length should be 4, as on Linux. "Code once and compile everywhere" fails here. I'm aware that earlier Lazarus versions for Windows forced the user to compile the IDE with a special compiler directive to use Unicode at all, but the latest versions are UTF-8 compliant. Or at least they should be by default.

The following tweak works:
Code: [Select]
procedure TForm1.Button5Click(Sender: TObject);

const
  Cyr2 : array [1..4] of string = ('а','б','ђ','ш');
  Lat2 : array [1..4] of string = ('a','b','đ','š');

var
  i: integer;
  s: string;
begin
  s:=UTF8Decode('abđš');

  for i:=1 to 4 do
  begin
    if s[i] = UTF8Decode(Lat2[i]) then
      caption:=caption+ (Cyr2[i])
  end;
end;

theo

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 1932
Re: Lazarus and specific utf8/unicode characters
« Reply #4 on: May 13, 2011, 11:18:44 am »
"s" should be WideString in this case. UTF8Decode returns a WideString.


sasa

  • Guest
Re: Lazarus and specific utf8/unicode characters
« Reply #5 on: May 13, 2011, 11:30:21 am »
As mentioned, this is an implicit typecasting complication, which should not be an issue at all.

If string is UTF-8 by default, any manipulation should assume UTF-8 unless there is an explicit typecast. In that case, using WideString here should not have to be forced at all.

Takeda

  • Full Member
  • ***
  • Posts: 157
Re: Lazarus and specific utf8/unicode characters
« Reply #6 on: May 14, 2011, 03:28:43 am »
Quote from: sasa
As mentioned, this is an implicit typecasting complication, which should not be an issue at all.

If string is UTF-8 by default, any manipulation should assume UTF-8 unless there is an explicit typecast. In that case, using WideString here should not have to be forced at all.

Since you want to add support for Unicode characters, you have to care about the length of the data too, because I think the length of the data type is very important. I got the same error as you a few months ago, but I solved it. :)

As far as I know, the length of a "utf8string" is greater than that of a "string", so I think it is suitable for holding a standard-length "string" too.

regards,
takeda.
Call me Takeda coz that's my true name.
Pascal coding using Lazarus => "Be native in any where.."

ƪ(˘⌣˘)┐ ƪ(˘⌣˘)ʃ ┌(˘⌣˘)ʃ

sasa

  • Guest
Re: Lazarus and specific utf8/unicode characters
« Reply #7 on: May 14, 2011, 07:54:16 am »
Well, Linux is a UTF-8 based OS, Windows is not (ANSI, WideString or UCS).

The problem we face here is simple: a UTF-8 character has a physically variable length (1 to 6 bytes in the original design, at most 4 in the current standard), while a WideChar is fixed at 2 bytes. The first cannot be indexed directly, the second can.

There are two solutions to the problem of indexing a UTF-8 string:

1. The compiler always starts from the beginning and recalculates the physical position of the character at the desired index. It is slow, but at least logically correct.

2. Leave it as is: the index gives the physical position in the string array, which is not logically correct but saves a lot of overhead.

Zeljko, I checked on Linux today and the original code fails there too. So option 2 is what is used in both the Linux and the Windows version.
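
Option 1 can be sketched in plain Free Pascal as a function that walks the string from the start on every call. UTF8CharAt is a hypothetical name used only for this illustration (it is not a compiler or LCL routine), and the input is assumed to be well-formed UTF-8.

```pascal
program Utf8Index;
{$mode objfpc}{$H+}

// Hypothetical helper: returns the Index-th code point (1-based) of the
// UTF-8 string S as a string, or '' if Index is out of range.
// Every call scans from the first byte, so the cost grows with Index.
function UTF8CharAt(const S: string; Index: Integer): string;
var
  P, Stop, Cur: Integer;
begin
  Result := '';
  P := 1;
  Cur := 0;
  while P <= Length(S) do
  begin
    // The current code point ends where the continuation bytes (10xxxxxx) end.
    Stop := P + 1;
    while (Stop <= Length(S)) and ((Ord(S[Stop]) and $C0) = $80) do
      Inc(Stop);
    Inc(Cur);
    if Cur = Index then
      Exit(Copy(S, P, Stop - P));
    P := Stop;
  end;
end;

var
  S: string;
begin
  // 'abđš' as raw UTF-8 bytes: đ = C4 91, š = C5 A1
  S := 'ab' + #$C4#$91 + #$C5#$A1;
  WriteLn(UTF8CharAt(S, 3) = #$C4#$91);  // third character is the two-byte "đ"
  WriteLn(Length(UTF8CharAt(S, 4)));     // byte length of the fourth character
end.
```

Calling such a function inside a loop over all indexes makes the loop quadratic in the string length, which is the overhead option 2 avoids.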

felipemdc

  • Administrator
  • Hero Member
  • *
  • Posts: 3538
Re: Lazarus and specific utf8/unicode characters
« Reply #8 on: May 18, 2011, 03:57:14 pm »
Quote from: sasa
Well, Linux is a UTF-8 based OS, Windows is not (ANSI, WideString or UCS).

With Lazarus all OSes are UTF-8 based, because it hides the differences between them. Your original code is wrong: you cannot access s[ i ] and expect a UTF-8 char out of it. It returns one byte, not one UTF-8 char.
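
As a rough illustration of that point, the original loop could be rewritten to step through the string one UTF-8 sequence at a time instead of one byte at a time. This is only a sketch in plain Free Pascal, not the thread author's final code: SeqLen and LatToCyr are hypothetical helpers, the tables are written as raw UTF-8 bytes to stay independent of the source file encoding, and characters missing from the table are copied through unchanged (an assumption about the desired behaviour).

```pascal
program Translit;
{$mode objfpc}{$H+}

const
  // 'a', 'b', 'đ' (C4 91), 'š' (C5 A1)
  Lat: array[1..4] of string = ('a', 'b', #$C4#$91, #$C5#$A1);
  // 'а' (D0 B0), 'б' (D0 B1), 'ђ' (D1 92), 'ш' (D1 88)
  Cyr: array[1..4] of string = (#$D0#$B0, #$D0#$B1, #$D1#$92, #$D1#$88);

// Hypothetical helper: byte length of the UTF-8 sequence that starts
// with lead byte B. Invalid lead bytes are treated as length 1.
function SeqLen(B: Byte): Integer;
begin
  if B < $80 then Result := 1
  else if (B and $E0) = $C0 then Result := 2
  else if (B and $F0) = $E0 then Result := 3
  else if (B and $F8) = $F0 then Result := 4
  else Result := 1;
end;

// Hypothetical helper: replaces every character found in Lat with the
// entry at the same position in Cyr; other characters are kept as-is.
function LatToCyr(const S: string): string;
var
  P, N, K: Integer;
  Ch: string;
  Found: Boolean;
begin
  Result := '';
  P := 1;
  while P <= Length(S) do
  begin
    N := SeqLen(Ord(S[P]));
    Ch := Copy(S, P, N);       // one whole character, 1 to 4 bytes
    Found := False;
    for K := Low(Lat) to High(Lat) do
      if Ch = Lat[K] then
      begin
        Result := Result + Cyr[K];
        Found := True;
        Break;
      end;
    if not Found then
      Result := Result + Ch;
    Inc(P, N);                 // advance by bytes, not by 1
  end;
end;

begin
  // Compares the conversion of 'abđš' with the bytes of 'абђш'.
  WriteLn(LatToCyr('ab' + #$C4#$91 + #$C5#$A1) = #$D0#$B0#$D0#$B1#$D1#$92#$D1#$88);
end.
```

In an LCL application the same loop body could assign the result to Caption; the string comparison works because both sides hold whole UTF-8 sequences rather than single bytes.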

typo

  • Hero Member
  • *****
  • Posts: 3051
Re: Lazarus and specific utf8/unicode characters
« Reply #9 on: May 18, 2011, 04:28:56 pm »
@sasa

The recommended way to deal with UTF8 chars is:

Code: [Select]
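var
  Unicode, Charlen :integer;
  P :PChar;
begin
  // Point at the first byte of the UTF-8 string
  P := PChar(YourString);
  repeat
    // Decode one character; Charlen receives its length in bytes
    Unicode := UTF8CharacterToUnicode(P, Charlen);

   // your code

    // Advance by the byte length of the character just decoded
    Inc(P, Charlen);
  until (Charlen = 0) or (Unicode = 0);
end;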

Zoran

  • Hero Member
  • *****
  • Posts: 1949
    • http://wiki.lazarus.freepascal.org/User:Zoran
Re: Lazarus and specific utf8/unicode characters
« Reply #10 on: May 18, 2011, 09:26:53 pm »
Theo's utf8tools package has a TUTF8Scanner class, which is what you need. :)
See: http://wiki.lazarus.freepascal.org/Theodp

sasa

  • Guest
Re: Lazarus and specific utf8/unicode characters
« Reply #11 on: May 19, 2011, 08:33:44 am »
Thanks for all the suggestions, but the point here is not the technical question of how to solve the problem; it is the inconsistency of the string type and its semantics, as noted in my previous post. If the string type is UTF-8, it is logical to apply UTF-8 rules to all characters and strings, not different rules in different situations (ASCII rules for indexing; AFAICS some basic string functions assume no UTF-8; inconsistent implicit and explicit typecasting; etc.).

 
