Topic: Character access in WideStrings, Unicode etc.  (Read 14979 times)

idog

  • Full Member
  • ***
  • Posts: 121
    • www.idogendel.com (Hebrew)
Character access in WideStrings, Unicode etc.
« on: March 16, 2010, 06:50:24 am »
Hi,

It should be simple, but despite all my searching I can't seem to find a clear answer...
I'm trying to write a small function that converts a string, one character at a time, according to two other strings serving as keys/values.

For example, the string "AAC" should become "113" using "ABCD" as keys and "1234" as values.

In the good old days I'd write something like this (simplified):

Code: [Select]
function conv(const src, keys, values : string) : string;
var
  j : integer;

begin
  Result := src;
  for j := 1 to Length(src) do
   Result[j] := values[Pos(src[j], keys)];
end;
 

Now, with all the unicode and widestrings, this simply doesn't work for strings with, in my case, Hebrew characters. The [] reference gives access only to 8-bit Chars.

How can I rewrite the above function to accommodate all kinds of strings/characters? thanks!
« Last Edit: March 16, 2010, 06:53:08 am by idog »

skalogryz

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 2770
    • havefunsoft.com
Re: Character access in WideStrings, Unicode etc.
« Reply #1 on: March 16, 2010, 07:14:23 am »
Quote from: idog
How can I rewrite the above function to accommodate all kinds of strings/characters? thanks!

Here is a WideString conversion, but make sure that src, keys and values don't contain surrogate pairs (characters outside the Basic Multilingual Plane).
Code: [Select]
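function convwide(const src, keys, values: WideString): WideString;
var
  j, p : Integer;
begin
  Result := src;
  for j := 1 to Length(src) do begin
    // Pos returns 0 when src[j] is not in keys; such characters stay unchanged
    p := Pos(src[j], keys);
    if (p > 0) and (p <= Length(values)) then Result[j] := values[p];
  end;
end;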

This one should work on ANY utf8 encoded string

// NOTE: the function has been fixed, after Idog's test
Code: [Select]
// should work on ANY utf8 encoded string
// UTF8Length, UTF8Pos and UTF8Copy come from the LCLProc unit
// (LazUTF8 in later Lazarus versions)
function convutf8(const src, keys, values: string): string;
var
  i : Integer;
  p : Integer;
begin
  Result:='';
  // note: UTF8Length is an expensive call; the for loop evaluates it only once
  for i:=1 to UTF8Length(src) do begin
    p:=Utf8Pos( Utf8Copy(src, i, 1), keys);
    // characters of src that are not found in keys are skipped
    if p>0 then Result:=Result+Utf8Copy(values, p, 1);
  end;
end;

I'm attaching a sample that uses both conversions. Cyrillic characters are used by default; just replace them with Hebrew characters.

// the attached sample can be found in later posts
« Last Edit: March 16, 2010, 08:53:37 am by skalogryz »

idog

Re: Character access in WideStrings, Unicode etc.
« Reply #2 on: March 16, 2010, 08:13:30 am »
Wow, thanks for the effort! But there are still some unresolved issues.

Your program works well in "Wide" mode, but when I use the very same function in my code it doesn't work (the result is added into a TMemo.Lines, and I just see "?" characters). I'm sending the "Keys" and "Values" as constants, e.g.

Code: [Select]
Memo1.Lines.Add(convwide('YNET', 'YNET', 'טמקא'));

As for the utf8 conversion, in your program it works only in one direction - when src and Keys are Hebrew and Values is English. If they are the other way around, using the same strings as in the code above, I get "ט?מ" in the result (first two characters of the expected four-character result, with a "?" between them). See attached screenshot.

skalogryz

Re: Character access in WideStrings, Unicode etc.
« Reply #3 on: March 16, 2010, 08:34:15 am »
Quote
Your program works well in "Wide" mode, but when I use the very same function in my code it doesn't work (the result is added into a TMemo.Lines, and I just see "?" characters). I'm sending the "Keys" and "Values" as constants, e.g.

When you use a direct ANSI-to-wide conversion, the RTL comes into play and treats the ANSI string as being in the local encoding, while the LCL accepts strings in UTF-8. So you need to take care of converting wide strings to UTF-8 first:
Code: [Select]
Memo1.Lines.Add(UTF8Encode(convwide('YNET', 'YNET', 'טמקא')));

Quote
As for the utf8 conversion, in your program it works only in one direction - when src and Keys are Hebrew and Values is English.
Sorry. I've fixed the function text above (UTF8Copy should be used instead of Copy). I'm resending the source code.
« Last Edit: March 16, 2010, 08:57:47 am by skalogryz »

idog

Re: Character access in WideStrings, Unicode etc.
« Reply #4 on: March 16, 2010, 09:02:08 am »
Ok, now both work (using Utf8Decode on the string parameters). Great!
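For reference, here is a minimal sketch of what "using Utf8Decode on the string parameters" could look like when wrapped in a single function. The name ConvViaWide and the test data are made up for illustration and are not taken from the attached sample. It assumes plain FPC (only the System unit) and input without surrogate pairs; the Hebrew letters are written as UTF-8 byte escapes so the example does not depend on the encoding of the source file.

```pascal
program ConvViaWideDemo;
{$mode objfpc}{$H+}

// Hypothetical helper: accepts UTF-8 strings, does the per-character
// lookup on WideStrings, and returns UTF-8 again.
function ConvViaWide(const src, keys, values: AnsiString): AnsiString;
var
  w, k, v: WideString;
  j: Integer;
  p: SizeInt;
begin
  w := UTF8Decode(src);
  k := UTF8Decode(keys);
  v := UTF8Decode(values);
  for j := 1 to Length(w) do
  begin
    p := Pos(w[j], k);
    // leave characters that are not in keys unchanged
    if (p > 0) and (p <= Length(v)) then
      w[j] := v[p];
  end;
  Result := UTF8Encode(w);
end;

var
  hebrew, r: AnsiString;
begin
  // UTF-8 bytes of alef, bet, gimel, dalet (U+05D0..U+05D3)
  hebrew := #$D7#$90#$D7#$91#$D7#$92#$D7#$93;
  r := ConvViaWide('AAC', 'ABCD', hebrew);
  // three Hebrew letters, two bytes each in UTF-8
  WriteLn(Length(r));  // 6
end.
```

The result can be passed directly to LCL controls such as Memo1.Lines.Add, since it is already UTF-8.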

Only question now is.. when and how to use these techniques! :) - I mean, should I always use WideString instead of String in multilanguage applications? Should I convert to/from utf8 every time I send/receive strings to LCL components? At this point I feel like I'm Voodoo programming... is there a concise guide for modern string usage?

And, more important, can I no longer trust the old "[]" to get and set single characters in strings?!  :o

theo

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 1933
Re: Character access in WideStrings, Unicode etc.
« Reply #5 on: March 16, 2010, 09:04:52 am »
Lazarus works with UTF-8.
For some helper stuff: http://wiki.lazarus.freepascal.org/Theodp

idog

Re: Character access in WideStrings, Unicode etc.
« Reply #6 on: March 16, 2010, 09:22:41 am »
Quote from: theo
Lazarus works with UTF-8.
For some helper stuff: http://wiki.lazarus.freepascal.org/Theodp

Mind boggling  :) I'll have to take a deeper look later. But I did notice you used simple indexing in your examples (e.g. s[i]). My question is, what does this expression return? Is it always an 8-bit Char type? Or does it depend on the string type? Is the Copy function the only way to safely isolate characters - and how can I change individual characters within an existing string?

(Sorry for the frequent editing...)
« Last Edit: March 16, 2010, 09:25:53 am by idog »

skalogryz

Re: Character access in WideStrings, Unicode etc.
« Reply #7 on: March 16, 2010, 09:36:53 am »
Quote
Only question now is.. when and how to use these techniques! :) - I mean, should I always use WideString instead of String in multilanguage applications?
It's up to you. You can also use UTF-8 for multilanguage applications.

Quote
Should I convert to/from utf8 every time I send/receive strings to LCL components? At this point I feel like I'm Voodoo programming...
Yes, you should. You can also try to override the RTL widestring manager, so that every ANSI<->wide conversion automatically becomes a UTF-8<->wide conversion, but this might affect non-LCL functions. So I strongly discourage you from doing this.

To avoid converting between UTF-8 and wide all the time, you can use UTF-8 for string processing (so no WideStrings at all). But that might be hard and require many changes to existing code.

Quote
And, more important, can I no longer trust the old "[]" to get and set single characters in strings?!  :o
You CAN'T, if you're working with UTF-8 or any other multibyte encoding.
You can, if you're working with WideStrings, but only when no surrogate pairs are used. And I don't know of any application that supports them.
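To see why "[]" is unreliable on UTF-8 strings, a small console sketch may help. It assumes plain FPC with only the System unit, and the program name is made up. The Hebrew letters are given as raw UTF-8 bytes so that the file encoding does not matter.

```pascal
program ByteVsCharDemo;
{$mode objfpc}{$H+}
var
  s: AnsiString;
  w: WideString;
begin
  // UTF-8 bytes of two Hebrew letters: alef (D7 90) and bet (D7 91)
  s := #$D7#$90#$D7#$91;
  w := UTF8Decode(s);

  WriteLn(Length(s));         // 4: Length counts bytes, not letters
  WriteLn(Ord(s[1]) = $D7);   // TRUE: s[1] is only the first half of alef
  WriteLn(Length(w));         // 2: one UTF-16 code unit per letter here
  WriteLn(Ord(w[1]) = $05D0); // TRUE: w[1] is the whole letter alef
end.
```

Writing a single byte into s[j] would therefore corrupt a two-byte letter, which is why the UTF-8 version builds its result with Utf8Copy instead.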


One more reminder: while the LCL uses UTF-8 as its string encoding, the RTL still uses the current system locale, which might NOT be UTF-8. So you might really need to do some Voodoo programming, like:
Code: [Select]
var
  s : WideString;
  fs: TFileStream;
begin
  if not OpenDialog1.Execute then Exit;

  // opendialog is LCL, so it uses utf8
  s:=UTF8Decode(OpenDialog1.FileName);

  // TFileStream is RTL, so native RTL conversion should be used
  // don't use utf8encode() for it
  fs:=TFileStream.Create(s, fmOpenRead);
  try
    // Memo is LCL again, so convert a string to utf8
    Memo1.Lines.Add('opening file: ' + UTF8Encode(s));
  finally
    fs.Free;
  end;
end;
« Last Edit: March 16, 2010, 09:39:55 am by skalogryz »

theo

Re: Character access in WideStrings, Unicode etc.
« Reply #8 on: March 16, 2010, 10:03:35 am »
Quote
My question is, what does this expression return? is it always an 8-bit Char type?

No, it's not possible to represent all Unicode code points in the Char range.
It returns a UTF8String (default), OR you can get a 32-bit character.
See the definition:
Code: [Select]
    property UTF8Chars[Index: Integer]: UTF8String read GetUTF8Char write PutUTF8Char; default;
    property UCS4Chars[Index: Integer]: UCS4Char read GetUCS4Char write PutUCS4Char;

If you scan the string like

Code: [Select]
repeat
AUCS4Char:=s.Next;
until s.Done;
then it returns a UCS4Char (32-bit).

idog

Re: Character access in WideStrings, Unicode etc.
« Reply #9 on: March 16, 2010, 08:02:24 pm »
Theo: Sorry, I didn't make myself clear - I meant the brackets as used by the FPC native string handling, not indexed properties. I mean, if I define:
Code: [Select]
var ws : WideString;
...
x := ws[1];
What is x? A simple Char? a WideChar? Does @ws[2] - @ws[1] = 2?

I need to research all these string types and uses in depth, I feel I got seriously left behind :) BTW, there's an Embarcadero webinar coming up on legacy code and Unicode compatibility, and they also give away a whitepaper on the topic. Maybe it could help a little:
http://www.embarcadero.com/rad-in-action/migration-upgrade-center

In the meanwhile, here's a more practical question: For dictionaries and large word/string collections, one would traditionally build hash tables for fast access, and would use the first char as key:

Code: [Select]
type
  THashTable = Array [Char] of TSomeInnerDataStructure;

I actually used "Array [Char]" plenty of times in the past, both for English and Hebrew. Obviously, this won't work with utf8. So except for an intermediate conversion table (e.g. Hash index<=>UCS4Char), what can be done?

Thanks again for all the info,

skalogryz

Re: Character access in WideStrings, Unicode etc.
« Reply #10 on: March 16, 2010, 08:39:37 pm »
Code: [Select]
var ws : WideString;
...
x := ws[1];
What is x? A simple Char? a WideChar? Does @ws[2] - @ws[1] = 2?
1) What is x? A WideChar.
2) A simple Char? No.
3) A WideChar? Yes.
4) Yes. @ws[2]-@ws[1] = sizeof(widechar) = sizeof(word) = 2 :)
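Answers 1) and 4) can be checked with a short console program. This is only a sketch: it assumes plain FPC, the program name is invented, and the addresses are cast to PtrUInt so that the difference is measured in bytes.

```pascal
program WideIndexDemo;
{$mode objfpc}{$H+}
var
  utf8: AnsiString;
  ws: WideString;
  x: WideChar;
begin
  // two Hebrew letters (alef, bet), given as UTF-8 bytes and decoded
  utf8 := #$D7#$90#$D7#$91;
  ws := UTF8Decode(utf8);

  x := ws[1];                                  // indexing yields a WideChar
  WriteLn(SizeOf(x));                          // 2
  WriteLn(Ord(x) = $05D0);                     // TRUE
  WriteLn(PtrUInt(@ws[2]) - PtrUInt(@ws[1]));  // 2 bytes between elements
end.
```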
