Character access in WideStrings, Unicode etc.

idog:
Hi,

It should be simple, but despite all my searching I can't seem to find a clear answer...
I'm trying to write a small function that converts a string, one character at a time, according to two other strings serving as keys/values.

For example, the string "AAC" should become "113" using "ABCD" as keys and "1234" as values.

In the good old days I'd write something like this (simplified):


--- Code: ---function conv(const src, keys, values : string) : string;
var
  j : integer;

begin
  Result := src;
  for j := 1 to Length(src) do
   Result[j] := values[Pos(src[j], keys)];
end;

--- End code ---
 

Now, with Unicode and WideStrings, this simply doesn't work for strings containing, in my case, Hebrew characters. The [] operator gives access only to 8-bit Chars.

How can I rewrite the above function to accommodate all kinds of strings/characters? thanks!

skalogryz:

--- Quote from: idog link=topic=8910.msg43189#msg43189 ---How can I rewrite the above function to accommodate all kinds of strings/characters? thanks!

--- End quote ---

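Here is a WideString conversion, but make sure that src, keys and values don't contain surrogate pairs (characters outside the Basic Multilingual Plane), since each [] index addresses a single 16-bit code unit.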

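--- Code: ---function convwide(const src, keys, values: WideString): WideString;
var
  j : Integer;
  p : Integer;
begin
  Result := src;
  for j := 1 to Length(src) do begin
    p := Pos(src[j], keys);
    // characters that are not in keys stay unchanged (values[0] would be invalid)
    if p > 0 then Result[j] := values[p];
  end;
end;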

--- End code ---

This one should work on ANY UTF-8 encoded string:

// NOTE: the function has been fixed, after Idog's test

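--- Code: ---// should work on ANY utf8 encoded string
// UTF8Length, Utf8Pos and Utf8Copy come from the LCL (unit LCLProc; LazUTF8 in newer Lazarus versions)
function convutf8(const src, keys, values: string): string;
var
  i : Integer;
  p : Integer;
begin
  Result:='';
  // note: UTF8Length is an expensive call, it's better to call it only once!
  // (a for-loop bound is evaluated just once)
  for i:=1 to UTF8Length(src) do begin
    p:=Utf8Pos( Utf8Copy(src, i, 1), keys);
    // characters that are not found in keys are dropped from the result
    if p>0 then Result:=Result+Utf8Copy(values, p, 1);
  end;
end;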

--- End code ---

I'm attaching a sample that uses both conversions. Cyrillic characters are used by default; just replace them with Hebrew characters.

// the attached sample can be found in later posts

idog:
Wow, thanks for the effort! But there are still some unresolved issues.

Your program works well in "Wide" mode, but when I use the very same function in my code it doesn't work (the result is added into a TMemo.Lines, and I just see "?" characters). I'm sending the "Keys" and "Values" as constants, e.g.


--- Code: ---Memo1.Lines.Add(convwide('YNET', 'YNET', 'טמקא'));

--- End code ---

As for the utf8 conversion, in your program it works only in one direction - when src and Keys are Hebrew and Values is English. If they are the other way around, using the same strings as in the code above, I get "ט?מ" in the result (first two characters of the expected four-character result, with a "?" between them). See attached screenshot.

skalogryz:

--- Quote from: idog on March 16, 2010, 08:13:30 am ---Your program works well in "Wide" mode, but when I use the very same function in my code it doesn't work (the result is added into a TMemo.Lines, and I just see "?" characters). I'm sending the "Keys" and "Values" as constants, e.g.

--- End quote ---

When you use a direct ANSI-to-wide conversion, the RTL comes into play and treats the ANSI string as being in the local (system) encoding, while the LCL expects strings in UTF-8. So you need to take care of converting wide strings to UTF-8 first:

--- Code: ---Memo1.Lines.Add(utf8Encode(convwide('YNET', 'YNET', 'טמקא')));

--- End code ---


--- Quote ---As for the utf8 conversion, in your program it works only in one direction - when src and Keys are Hebrew and Values is English.

--- End quote ---
Sorry. I've fixed the function text above (UTF8Copy should be used instead of Copy). I'm resending the source code.

idog:
Ok, now both work (using Utf8Decode on the string parameters). Great!
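
For reference, a minimal sketch of that combination, assuming the source file is saved as UTF-8 (so the string literals reach the function as UTF-8 bytes) and that convwide from the earlier post is in scope. The wrapper name ConvViaWide is made up for illustration; UTF8Decode and UTF8Encode are the standard RTL routines.

--- Code: ---// hypothetical wrapper: UTF-8 in, UTF-8 out, per-character work done on WideStrings
function ConvViaWide(const src, keys, values: string): string;
begin
  // decode each UTF-8 argument to UTF-16, convert, then encode the result back to UTF-8 for the LCL
  Result := UTF8Encode(convwide(UTF8Decode(src), UTF8Decode(keys), UTF8Decode(values)));
end;

--- End code ---

With this, the call becomes Memo1.Lines.Add(ConvViaWide('YNET', 'YNET', 'טמקא')); the caveat about surrogate characters still applies.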

The only question now is when and how to use these techniques! :) I mean, should I always use WideString instead of String in multilanguage applications? Should I convert to/from UTF-8 every time I send strings to or receive strings from LCL components? At this point I feel like I'm doing voodoo programming... is there a concise guide to modern string usage?

And, more important, can I no longer trust the old "[]" to get and set single characters in strings?!  :o
