
Author Topic: String functions. conversion from Delphi  (Read 25714 times)

mm7

  • Full Member
  • ***
  • Posts: 193
  • PDP-11 RSX Pascal, Turbo Pascal, Delphi, Lazarus
Re: String functions. conversion from Delphi
« Reply #15 on: March 29, 2015, 07:05:55 pm »
Juha, there are many places where I should change Length to UTF8Length.
And, as you said, UTF8Length should produce the same result for an ASCII string as Length does. So what is the point of continuing to use ASCII Length? (Except speed, but in my case correctness is the priority.)

Apart from that, I've found weird behaviour in the compiler.
There are two built-in Length functions: one for String, another for dynamic arrays. They have different signatures, and the compiler recognizes the type of the parameter and uses one or the other. I override one of them; for the other, FPC should just use the built-in.
Why does it not? Is it a bug?

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4467
  • I like bugs.
Re: String functions. conversion from Delphi
« Reply #16 on: March 29, 2015, 07:38:12 pm »
Quote
Juha, there are many places where I should change Length to UTF8Length.
And, as you said, UTF8Length should produce the same result for an ASCII string as Length does. So what is the point of continuing to use ASCII Length? (Except speed, but in my case correctness is the priority.)

Because it is the "right" thing to do with UTF-8. Read this:
  http://wiki.freepascal.org/UTF8_strings_and_characters
again carefully, search the internet for more information, and maybe you will get the idea.

Your comment "But for my case the correctness is priority" gives the impression that the ascii functions would give wrong results sometimes. They don't! That is the beauty of UTF-8.
Your questions prove that you did not understand the fundamental idea yet.

Could you please copy a code snippet where you think that UTF8 functions are needed. I can look at it.
Mostly Lazarus trunk and FPC 3.2 on Manjaro Linux 64-bit.

mm7

« Reply #17 on: March 30, 2015, 05:56:21 pm »
Quote
Your comment "But for my case the correctness is priority" gives the impression that the ascii functions would give wrong results sometimes. They don't!

Of course they do.

Quote
Could you please copy a code snippet where you think that UTF8 functions are needed. I can look at it.

Well (I am amazed that I have to illustrate such an obvious thing)
Code: [Select]
// fpc -Fu/usr/share/lazarus/1.4RC2/components/lazutils/lib/x86_64-linux/ ASCIIvsUTF8.pas
program ASCIIvsUTF8;
{$mode delphi}
uses SysUtils, LazUTF8;

var S:String = 'Русское слово';
 SA,SU:String;
begin

 writeln('get length');
 writeln(S,':',Length(S));
 writeln(S,':',UTF8Length(S));

 writeln('cut to the length 10');
 SA:=S; SU:=S;
 if Length(S) > 10 then SA:=Copy(S,1,10);
 if UTF8Length(S) > 10 then SU:=UTF8Copy(S,1,10);
 writeln(SA+'|');
 writeln(SU+'|');

 writeln('extend to the length 20');
 SA:=S; SU:=S;
 if Length(S) < 20 then SA:=S+StringOfChar(' ',20-Length(S));
 if UTF8Length(S) < 20 then SU:=S+StringOfChar(' ',20-UTF8Length(S));
 writeln(SA+'|');
 writeln(SU+'|');

 writeln('Uppercase');
 SA:=Uppercase(S);
 SU:=UTF8Uppercase(S);
 writeln(SA);
 writeln(SU);

 writeln('Lowercase');
 SA:=Lowercase(S);
 SU:=UTF8Lowercase(S);
 writeln(SA);
 writeln(SU);
end.
Output
Code: [Select]
get length
Русское слово:25
Русское слово:13
cut to the length 10
Русск|
Русское сл|
extend to the length 20
Русское слово|
Русское слово       |
Uppercase
Русское слово
РУССКОЕ СЛОВО
Lowercase
Русское слово
русское слово

You can see that the UTF8 variant of the processing is always correct, but the ASCII one is always wrong here.
In addition, you can get all kinds of artifacts if the last UTF-8 char is partially cut by Copy, as I described at the beginning of the thread. GTK2 widgets become very upset and do all kinds of funny things when they meet malformed UTF-8 strings.

« Last Edit: March 30, 2015, 05:59:21 pm by mm7 »

Martin_fr

  • Administrator
  • Hero Member
  • *
  • Posts: 9857
  • Debugger - SynEdit - and more
    • wiki
Re: String functions. conversion from Delphi
« Reply #18 on: March 30, 2015, 06:43:40 pm »
Quote
Code: [Select]
 if Length(S) < 20 then SA:=S+StringOfChar(' ',20-Length(S));
 if UTF8Length(S) < 20 then SU:=S+StringOfChar(' ',20-UTF8Length(S));
You can see that the UTF8 variant of the processing is always correct, but the ASCII one is always wrong here.
In addition, you can get all kinds of artifacts if the last UTF-8 char is partially cut by Copy, as I described at the beginning of the thread. GTK2 widgets become very upset and do all kinds of funny things when they meet malformed UTF-8 strings.

And the UTF-8 version will also get several cases wrong. (And so does UTF-16 in Delphi, afaik.)

Letters like äöüâáô... can be one or two Unicode codepoints. And UTF8Length (as does Length in Delphi) returns the number of codepoints, not the number of chars.

There are even some accented letters that are always 2 codepoints.

In Unicode (UTF-8 and UTF-16), some of those letters have an individual codepoint, but not all of them.
All of them can be represented as a base char (e.g. "a") + a combining mark (e.g. "`").
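
As a rough illustration of the point above (not code from this thread), the sketch below builds the same visible letter "ä" in both forms and compares the byte count from Length with the codepoint count from UTF8Length. The byte escapes are the standard UTF-8 encodings of U+00E4 and of "a" followed by U+0308; the program name is made up, and it assumes the LazUTF8 unit is on the unit path (for example via -Fu pointing at lazutils).

```pascal
program CodepointsVsChars;
{$mode objfpc}{$H+}
uses
  LazUTF8; // from the Lazarus LazUtils package (assumed to be on the unit path)

const
  // U+00E4 LATIN SMALL LETTER A WITH DIAERESIS, UTF-8 bytes C3 A4
  Precomposed = #$C3#$A4;
  // 'a' followed by U+0308 COMBINING DIAERESIS, UTF-8 bytes 61 CC 88
  Decomposed = 'a'#$CC#$88;

var
  P, D: string;
begin
  P := Precomposed;
  D := Decomposed;
  // Length counts bytes, UTF8Length counts codepoints
  WriteLn('precomposed: bytes=', Length(P), ' codepoints=', UTF8Length(P));
  WriteLn('decomposed:  bytes=', Length(D), ' codepoints=', UTF8Length(D));
end.
```

Under those assumptions this should report 2 bytes and 1 codepoint for the precomposed form, and 3 bytes and 2 codepoints for the decomposed one, although both render as a single letter.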

Furthermore, UTF-8 text can contain non-printable control chars. It is of course up to you whether you want to add space for them or not. But if you do not account for them, visual alignment will fail. Some of them are
- zero width space
- soft hyphen http://en.wikipedia.org/wiki/Soft_hyphen
- RTL/LTR marker
...

And last but not least, there are surrogate pairs (2 codepoints for one char), though they are only for rather uncommon chars.


Mind you, the same issues exist when using length in Delphi (at least to the best of my knowledge)
« Last Edit: March 30, 2015, 06:46:28 pm by Martin_fr »

mm7

« Reply #19 on: March 30, 2015, 10:07:25 pm »

Quote
And the utf8 version will also get several cases wrong. (And so does utf16 in Delphi, afaik)

Letters like äöüâáô... can be one or 2 utf codepoints. And utf8length (as does length in Delphi) returns the number of codepoints, but not the number of chars.
...

I think that if these cases are processed wrongly, it is a bug, and UTF8Length should be fixed to take these combining characters into account (to skip them). They have specific codepoints, so it is just a matter of agreement among developers and of fixing UTF8Length (and the other functions).
Non-printables should be counted, traditionally, I think.
I do not understand the need for multi-codepoint surrogates when UTF-8 already has a huge codespace (which is even artificially limited to 4 bytes). Presumably such surrogates can appear after conversion to UTF-16 and back?
They definitely have to be normalized to single-codepoint UTF-8, I think.
« Last Edit: March 30, 2015, 10:09:05 pm by mm7 »

Martin_fr

« Reply #20 on: March 30, 2015, 11:02:47 pm »
Surrogate pairs exist in UTF-16 too. (All of the cases I described exist in UTF-16 too.)

As for the way UTF8Length (and the others) work: there are use cases for both behaviours.

UTF8Length mimics what Delphi does in Length.
You should also remember that Length used to be char = byte, even for multibyte encodings (yes, long before UTF there were ANSI codepages with multibyte encodings). So in a way it stays true to that too. (Even if that was probably not true to the original intent.)

What you would like is a Utf8CharCount(string). That would need to follow the (quite complex) Unicode standard.
And since that is extended every now and then, it should not be hardcoded, but should call the appropriate libraries on the platform.
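
A minimal sketch of the simplest slice of that idea, assuming well-formed UTF-8 input: count codepoints but skip those in the Combining Diacritical Marks block (U+0300..U+036F). CountBaseChars is a hypothetical name, not a LazUTF8 function, and the single hardcoded range is exactly the kind of shortcut that an implementation following the full Unicode standard could not take.

```pascal
program CountBaseCharsDemo;
{$mode objfpc}{$H+}

// Hypothetical helper, not part of LazUTF8. Assumes S is well-formed UTF-8.
// Counts codepoints, skipping only U+0300..U+036F (Combining Diacritical Marks).
function CountBaseChars(const S: string): Integer;
var
  i, k, Len: Integer;
  B: Byte;
  CP: Cardinal;
begin
  Result := 0;
  i := 1;
  while i <= Length(S) do
  begin
    B := Ord(S[i]);
    // The lead byte tells how many bytes the sequence has
    if B < $80 then begin Len := 1; CP := B; end
    else if (B and $E0) = $C0 then begin Len := 2; CP := B and $1F; end
    else if (B and $F0) = $E0 then begin Len := 3; CP := B and $0F; end
    else if (B and $F8) = $F0 then begin Len := 4; CP := B and $07; end
    else begin Len := 1; CP := B; end; // stray byte: count it as one unit
    // Fold in the continuation bytes, never reading past the end of S
    for k := 1 to Len - 1 do
      if i + k <= Length(S) then
        CP := (CP shl 6) or Cardinal(Ord(S[i + k]) and $3F);
    if (CP < $0300) or (CP > $036F) then
      Inc(Result);
    Inc(i, Len);
  end;
end;

begin
  WriteLn(CountBaseChars(#$C3#$A4));     // precomposed U+00E4: prints 1
  WriteLn(CountBaseChars('a'#$CC#$88));  // 'a' + U+0308: prints 1
  WriteLn(CountBaseChars('abc'));        // prints 3
end.
```

Zero-width characters, other combining blocks and full-width characters are not handled here, which is why delegating to a platform library is the more realistic route.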

-------------
Another note, that may not apply to your case. (And only if you care about East Asian text)

If this is about visual alignment when used with a monospaced font:

Even with a monospaced font, two strings with the same number of visible chars are not necessarily of the same visual length (see "half width" and "full width").

JuhaManninen

« Reply #21 on: March 30, 2015, 11:05:42 pm »
Quote
I think that if these cases are processed wrongly, it is a bug, and UTF8Length should be fixed to take these combining characters into account (to skip them). They have specific codepoints, so it is just a matter of agreement among developers and of fixing UTF8Length (and the other functions).
Non-printables should be counted, traditionally, I think.
I do not understand the need for multi-codepoint surrogates when UTF-8 already has a huge codespace (which is even artificially limited to 4 bytes). Presumably such surrogates can appear after conversion to UTF-16 and back?

UTF8Length counts codepoints.
Multi-codepoint decomposed characters are a feature of the Unicode character definition, not a feature of any encoding. It means even UTF-32 does not solve this problem.
Why are they defined as they are? I don't know, you must ask the people who created the Unicode standard.

Quote
Definitely they have to be normalized to single codepoint UTF8. I think.

Normalizing is a big task because there are so many combinations.
Anyway, let's forget the decomposed characters for a moment. They will always be a problem regardless of the encoding or the LazUTF8 functions you use.

Earlier I referred only to cases where one Unicode character occupies only one codepoint.
Your example again copies a fixed width of 10 chars. Then you obviously must use UTF8Length and UTF8Copy.
How often do you REALLY need such code? Not very often I guess, or else you have some weird code.
Typical code works with strings obtained from user input, files or DBs, finds sub-string positions and acts accordingly. A good example is the SplitInHalf function from the wiki page.
It takes in 2 Unicode strings and returns 2 Unicode strings, but does not use any UTF-8 specific functions, yet it always works correctly.

Code: [Select]
function SplitInHalf(Txt, Separator: string; out Half1, Half2: string): Boolean;
var
  i: Integer;
begin
  i := Pos(Separator, Txt);
  Result := i > 0;
  if Result then
  begin
    Half1 := Copy(Txt, 1, i-1);
    Half2 := Copy(Txt, i+Length(Separator), Length(Txt));
  end;
end;

I bet the majority of your code falls into this category, too. Why would you always chop fixed length pieces from your strings? I don't believe you.
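
Why this byte-based code is safe can be sketched as follows. In UTF-8, lead bytes and continuation bytes come from disjoint ranges, so a well-formed separator can only match at a codepoint boundary of a well-formed text; Pos therefore never returns an index in the middle of a character. The demo reuses SplitInHalf as posted; the program name, variable names and sample text are illustrative, and it assumes the source file is saved as UTF-8 without a BOM and without a {$codepage} directive.

```pascal
program SplitDemo;
{$mode objfpc}{$H+}

// SplitInHalf as posted above: only byte-based Pos, Copy and Length.
function SplitInHalf(Txt, Separator: string; out Half1, Half2: string): Boolean;
var
  i: Integer;
begin
  i := Pos(Separator, Txt);
  Result := i > 0;
  if Result then
  begin
    Half1 := Copy(Txt, 1, i-1);
    Half2 := Copy(Txt, i+Length(Separator), Length(Txt));
  end;
end;

var
  Head, Tail: string;
begin
  // Both the text and the separator contain multi-byte characters
  if SplitInHalf('Русское → слово', ' → ', Head, Tail) then
  begin
    WriteLn(Head = 'Русское'); // byte-wise comparison of the first half
    WriteLn(Tail = 'слово');   // byte-wise comparison of the second half
  end;
end.
```

Under those assumptions both comparisons should print TRUE, because neither half is ever cut inside a multi-byte sequence.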

Another story indeed are the decomposed Unicode characters. UTF8Pos(), UTF8Copy() or UTF8Length() cannot handle them gracefully. LCL does not have special functions for them.
« Last Edit: April 04, 2015, 04:58:50 pm by JuhaManninen »

mm7

« Reply #22 on: March 31, 2015, 04:49:51 pm »
Quote
...
I bet the majority of your code falls into this category, too. Why would you always chop fixed length pieces from your strings? I don't believe you.

Juha, yes and no. As usual in life. :)

Yes, one part of the code does parsing that uses "find-and-split", and there the ASCII functions work well. But the UTF8 ones would work equally well, except maybe for performance.

Another part of the code does formatted output of text tables: the old-style ones, with '-', '+' and '|' "frame" characters. (Although these tables are output to graphical windows, and a more modern way of formatting such as HTML could be used, it is still a kind of "requirement". I am porting the program from Delphi to Lazarus, and the target for the time being is maximal similarity with the existing one. Otherwise users will say it is all wrong. (joke)  :) )

So here I need either to cut or to pad values to a required size.
That is why.

If you are interested, do ship/vessel modeling, and are a Linux x86_64 GTK2 user, you can download it from here http://sourceforge.net/projects/freeship-plus-in-lazarus/
I.e. you can model your kayak (or a tanker  :D ) and estimate its hydrostatic and hydrodynamic features.

Quote
Another story indeed are the surrogate pairs. UTF8Pos(), UTF8Copy() or UTF8Length() cannot handle them gracefully. LCL does not have a special functions for surrogate pairs. Delphi has but very few people use them.

Is there a fundamental problem with adding this functionality to LazUTF8, either to the existing UTF8*() functions or as new ones? Or is it just "not done yet" and anybody's help is welcome?

Martin_fr

« Reply #23 on: March 31, 2015, 05:17:20 pm »
Quote
Another part of code does formatted output of text tables. Those old style ones, with '-', '+' and '|' "frame" characters.

In that case you probably only deal with Latin chars? Because, as I said, East Asian scripts have "full width" chars that are (in a monospaced font, and on console output) twice the width, yet they are one char, and usually also one codepoint.

Quote
Is there a fundamental problem to add this functionality to LazUTF8, either into existing UTF8*() or to new ones? Or it is just "not done yet" and anybody's help is welcome?

I do not know from Juha's comment if Delphi's function is for surrogate pairs, combining codepoints, or both.

There are 2 considerations.

1) The work must be done by somebody. I guess if it is done (with sufficient quality) then it can be added.

2) Hardcoding it into the LCL is not the best solution, but still acceptable I guess. (Or maybe, if the below does not apply, then it is all fine.)

Afaik the Unicode specs are updated regularly. No idea whether either of the 2 classes ever gets additions. I would guess that surrogate pairs are unlikely to be extended. No idea about combining.
Depending on this, hardcoding means compiled apps need to be rebuilt with a newer LCL if it happens.

If instead an external library was called (afaik any OS should provide this nowadays) then the update would be within the library.

Overall, this (2) may be a minor point... So it's down to (1): a person.

---------------------------
You can find a range of combining codepoints (or at least what they were a few years back) in SynEdit. It does not include surrogate pairs.

components\synedit\synedittextbuffer.pp
TSynEditStringList.LogicPosIsCombining

JuhaManninen

« Reply #24 on: March 31, 2015, 07:58:23 pm »
Quote
I do not know from Juha's comment if Delphi's function is for surrogate pairs, combining codepoints, or both.

Delphi has at least :
  http://docwiki.embarcadero.com/Libraries/XE7/en/System.Character.IsHighSurrogate
  http://docwiki.embarcadero.com/Libraries/XE7/en/System.Character.IsLowSurrogate
[Edit] They are only for UTF-16 surrogate pairs.

The UTF-8 functions in LazUTF8 work with codepoints by definition. They will not be changed.
New functions for decomposed characters will be added if somebody makes them. So, using the old phrase: Patches are welcome.

Quote
You find a range of combining (or at least what they were a few years back) in SynEdit. It does not include surrogate pairs.
components\synedit\synedittextbuffer.pp
TSynEditStringList.LogicPosIsCombining

I am not sure what "combining" means here. Is it about glyphs?
« Last Edit: April 04, 2015, 05:43:23 pm by JuhaManninen »

Martin_fr

« Reply #25 on: March 31, 2015, 08:18:49 pm »
Combining codepoints are codepoints that add to the codepoint before them.


e.g "a" + <accent>

JuhaManninen

« Reply #26 on: March 31, 2015, 10:14:39 pm »
Quote
Combining codepoints are codepoints that add to the codepoint before them.

e.g "a" + <accent>

Ok, so it is the same surrogate pair concept then.
My answer was based on confusion with terms. :(
« Last Edit: April 04, 2015, 05:03:34 pm by JuhaManninen »

jarto

  • Full Member
  • ***
  • Posts: 106
Re: String functions. conversion from Delphi
« Reply #27 on: April 04, 2015, 12:09:55 pm »
Having read all this discussion, I took a look at Utf8Length.

Utf8Length basically runs through the string, counting UTF-8 characters with UTF8CharacterLength. UTF8CharacterLength inspects 1-4 bytes starting at the given position and returns the length of the UTF-8 character.

However, there are no checks in UTF8CharacterLength against faulty data. If you call Utf8Length with anything but a valid utf8 string, UTF8CharacterLength may read past the end of the string and cause a range check error.  :o
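
One defensive option, sketched here under stated assumptions, is for the caller to check the byte structure before counting. LooksLikeUTF8 is a hypothetical helper, not a LazUTF8 function; it only verifies the lead/continuation structure and that no sequence runs past the end of the string, and it does not reject overlong encodings or surrogate codepoints.

```pascal
program CheckBeforeCounting;
{$mode objfpc}{$H+}

// Hypothetical helper, not part of LazUTF8: structural check only.
function LooksLikeUTF8(const S: string): Boolean;
var
  i, k, Len: Integer;
  B: Byte;
begin
  Result := False;
  i := 1;
  while i <= Length(S) do
  begin
    B := Ord(S[i]);
    if B < $80 then Len := 1
    else if (B and $E0) = $C0 then Len := 2
    else if (B and $F0) = $E0 then Len := 3
    else if (B and $F8) = $F0 then Len := 4
    else Exit;                             // stray continuation or invalid lead byte
    if i + Len - 1 > Length(S) then Exit;  // sequence is truncated at the end
    for k := 1 to Len - 1 do
      if (Ord(S[i + k]) and $C0) <> $80 then Exit; // not a continuation byte
    Inc(i, Len);
  end;
  Result := True;
end;

var
  Whole, Cut: string;
begin
  Whole := #$D0#$A0#$D1#$83;      // U+0420 U+0443 as UTF-8 bytes
  Cut := Copy(Whole, 1, 3);       // byte-based Copy splits the second character
  WriteLn(LooksLikeUTF8(Whole));  // prints TRUE
  WriteLn(LooksLikeUTF8(Cut));    // prints FALSE
end.
```

A check like this costs an extra pass over the string, so whether it belongs in the caller or inside the library function is a design choice.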

Should I file a bug report?

BeniBela

  • Hero Member
  • *****
  • Posts: 906
    • homepage
Re: String functions. conversion from Delphi
« Reply #28 on: April 04, 2015, 12:49:30 pm »

Quote
How often you REALLY need such code? Not very often I guess, or then you have some weird code.


I use it all the time.

And it was common enough that writeln got the special : : syntax for it. Btw, did anyone fix that to work at the displayed-character level for Unicode?

(And it is dangerous on Android, because it crashes if you call a Java function with an invalid UTF-8 string.)




Quote
Ok, so it is the same surrogate pair concept then.

Except you can have arbitrary many accents. ZA̡͊͠͝LGΌ ISͮ̂҉̯͈͕̹̘̱ TO͇̹̺ͅƝ̴ȳ̳ TH̘Ë͖́̉ ͠P̯͍̭O̚​N̐Y̡ H̸̡̪̯ͨ͊̽̅̾̎Ȩ̬̩̾͛ͪ̈́̀́͘ ̶̧̨̱̹̭̯ͧ̾ͬC̷̙̲̝͖ͭ̏ͥͮ͟Oͮ͏̮̪̝͍M̲̖͊̒ͪͩͬ̚̚͜Ȇ̴̟̟͙̞ͩ͌͝S̨̥̫͎̭ͯ̿̔̀ͅ

And they exist.

There are no surrogate pairs in UTF-8!

 

JuhaManninen

« Reply #29 on: April 04, 2015, 02:58:52 pm »
Quote
However, there are no checks in UTF8CharacterLength against faulty data. If you call Utf8Length with anything but a valid utf8 string, UTF8CharacterLength may read past the end of the string and cause a range check error.  :o

Should I file a bug report?

Yes Jarto, please report. It is clearly a bug.