Author Topic: A simple sane question to end insanity! TStringList in unicode mode  (Read 24102 times)

EganSolo

  • Sr. Member
  • ****
  • Posts: 290
Re: A simple sane question to end insanity! TStringList in unicode mode
« Reply #15 on: August 15, 2017, 09:37:58 am »
My apologies, JuhaManninen, I did read the page with the link you gave me but I was not able to understand it. It must be me, so let me take a second read and see what it says exactly.

Not trying to be difficult.  :) It's just that this is not a straightforward topic.

EganSolo

  • Sr. Member
  • ****
  • Posts: 290
Re: A simple sane question to end insanity! TStringList in unicode mode
« Reply #16 on: August 15, 2017, 10:46:53 am »
So, I've read these pages again and I hope I understood this right.
Here's a sample program I wrote, based on what I thought I read in these pages...
Code: Pascal
program Project1;
Uses LazUTF8, Classes;
Var SL : TStringList;
begin
  SL := TStringList.Create;

  With SL do
  begin
    Sorted        := True;
    CaseSensitive := True;
    Add('Garçon'  );
    Add('Èternuer');
    Add('éternuer');
    Add('B');
    Add('a');
    //Next line does not compile
    //Add(StringOfChar('ö', 5));
  end;
  Writeln('1: ' , SL[0]);
  Writeln('2: ' , SL[1]);
  Writeln('3: ' , SL[2]);
  Writeln('4: ' , SL[3]);
  Writeln('5: ' , SL[4]);
  readln;
  SL.Free;
end.

When I run this code, I get the following output:
1: a
2: B
3: éternuer
4: Eternuer
5: Garçon

Three things to note:
  • Is 'a' supposed to be less than 'B'? In the ASCII table, 'a' is greater than 'B'. Am I forgetting to switch to the right code page?
  • The accent on the uppercase E is missing... it should be É but I get back E.
  • StringOfChar seems to be hardcoded to AnsiChar and does not seem to switch to Unicode. In fact, if you uncomment the one commented line of code, you'll get a compiler error.


Now, when I run the code I wrote with my drop-in replacement for TStringList, I get the following (correct) output:
1: Garçon
2: Éternuer
3: B
4: éternuer
5: ööööö
6: a

Perhaps I'm still not understanding what the pages you referenced really meant to do, so, please help correct this sample program.



Thaddy

  • Hero Member
  • *****
  • Posts: 14373
  • Sensorship about opinions does not belong here.
Re: A simple sane question to end insanity! TStringList in unicode mode
« Reply #17 on: August 15, 2017, 10:58:42 am »
I filed a bug against the string manager, because as far as I know that is what causes it. Plain code works on every platform except Windows (which has a string manager installed by default), unless you install a string manager. The problem is that the string manager makes the compare case-insensitive, which causes all code that relies on a case-sensitive compare to fail.
Mantis 0032271
[edit]
The Mantis moderators were kind enough to link 32270 to 32271 already. It is obviously being looked into.
« Last Edit: August 15, 2017, 11:03:20 am by Thaddy »
Object Pascal programmers should get rid of their "component fetish" especially with the non-visuals.

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4468
  • I like bugs.
Re: A simple sane question to end insanity! TStringList in unicode mode
« Reply #18 on: August 15, 2017, 06:09:47 pm »
Quote
Is 'a' supposed to be less than 'B'? In the ascii table, 'a' is greater than 'B'. Am I forgetting to switch to the right code page?
Code pages are history, but Unicode sorting rules depend on the locale. The rules are complex, and even the uppercase form of a letter can differ between locales.
It may also be a bug in the string manager, as Thaddy wrote. I don't know enough to say right now.

Quote
The accent on the upper E is missing... it should be É but I get back E.
That is unexpected. What could cause it? I will test when I boot Windows.

Quote
StringOfChar seems to be hardcoded to ansiChar and it does not seem to switch to unicode. In fact, if you uncomment the one commented line of code, you'll get a compiler error.
That is expected. 'ö' takes 2 bytes in UTF-8, so it does not fit in the single Char that StringOfChar accepts.
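
A minimal sketch of one way around this, assuming the usual Lazarus setup where string is a UTF-8 encoded AnsiString. RepeatStr is a hypothetical helper, not an RTL or LazUTF8 function: it repeats a whole string rather than a single Char, so a multi-byte character stays intact. The 'ö' is written as its two UTF-8 bytes so the example does not depend on how the source file is saved.

```pascal
program RepeatUtf8Demo;
{$mode objfpc}{$H+}

// Hypothetical helper (not an RTL function): returns S repeated Count times.
function RepeatStr(const S: string; Count: Integer): string;
var
  i: Integer;
begin
  Result := '';
  for i := 1 to Count do
    Result := Result + S;
end;

var
  Five: string;
begin
  // 'ö' (U+00F6) encoded as UTF-8 is the two bytes $C3 $B6.
  Five := RepeatStr(#$C3#$B6, 5);
  // Length counts bytes here, not characters: 5 characters of 2 bytes each.
  Writeln('Bytes: ', Length(Five));
end.
```

With this helper, the commented-out line in the sample could become something like Add(RepeatStr('ö', 5)), provided the source file is saved as UTF-8.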

Quote
Perhaps I'm still not understanding what the pages you referenced really meant to do, so, please help correct this sample program.
Your code looks good. Sorting Unicode, however, is close to black magic. Did you take locale info into account in your own replacement for TStringList?
Mostly Lazarus trunk and FPC 3.2 on Manjaro Linux 64-bit.

engkin

  • Hero Member
  • *****
  • Posts: 3112
Re: A simple sane question to end insanity! TStringList in unicode mode
« Reply #19 on: August 15, 2017, 09:43:13 pm »
Quote
Here's a sample program I wrote, based on what I thought I read in these pages...
When I run this code, I get the following output:
1: a
2: B


I suspect that this output was produced before you added the LazUTF8 unit. LazUTF8 uses UTF8CompareStr:
Code: Pascal
function UTF8CompareStr(S1: PChar; Count1: SizeInt; S2: PChar; Count2: SizeInt
  ): PtrInt;
var
  Count: SizeInt;
begin
  Result := 0;
  if Count1>Count2 then
    Count:=Count2
  else
    Count:=Count1;
  Result := CompareMemRange(Pointer(S1),Pointer(S2), Count); // Note: CompareMemRange can handle nil if Count=0
  if Result<>0 then exit;
  if Count1>Count2 then
    Result:=1
  else if Count1<Count2 then
    Result:=-1
  else
    Result:=0;
end;
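
To see what a byte-wise compare like the one above does to sort order, here is a small sketch. It does not call UTF8CompareStr itself; it assumes SysUtils.CompareStr as a stand-in, since that also compares raw bytes, and plugs it into TStringList.CustomSort. CompareBytes is a hypothetical callback name. Only ASCII items are used so the result does not depend on the source file encoding.

```pascal
program ByteWiseSortDemo;
{$mode objfpc}{$H+}
uses
  Classes, SysUtils;

// Hypothetical callback: orders two list items by their raw bytes.
function CompareBytes(List: TStringList; Index1, Index2: Integer): Integer;
begin
  Result := CompareStr(List[Index1], List[Index2]);
end;

var
  SL: TStringList;
  i: Integer;
begin
  SL := TStringList.Create;
  try
    SL.Add('a');
    SL.Add('Garcon');
    SL.Add('B');
    // The list is not Sorted, so CustomSort is allowed to reorder it.
    SL.CustomSort(@CompareBytes);
    // Byte order: 'B' ($42) < 'G' ($47) < 'a' ($61).
    for i := 0 to SL.Count - 1 do
      Writeln(i + 1, ': ', SL[i]);
  finally
    SL.Free;
  end;
end.
```

In this ordering every uppercase ASCII letter precedes every lowercase one, and any UTF-8 accented letter (whose first byte is $C3 for é, È and É) lands after all ASCII text. That matches a result where 'B' comes before 'a', not one where 'a' comes first.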

Thaddy

  • Hero Member
  • *****
  • Posts: 14373
  • Sensorship about opinions does not belong here.
Re: A simple sane question to end insanity! TStringList in unicode mode
« Reply #20 on: August 15, 2017, 10:07:59 pm »
He has a point. It fails. You did not check!  >:(
Here's the reduced code:
Code: Pascal
//Good:
program testansicomparegood;
{$mode delphi}
uses sysutils;
begin
  writeln('Aa vs aA ',AnsiCompareStr('Aa','aA'),' should be negative');
  writeln('ab vs aa ',AnsiCompareStr('ab','aa'),' should be positive');
  writeln('aa vs aa ', AnsiCompareStr('aa','aa'),' should be zero');
end.

//Bad:
program testansicomparebad;
{$mode delphi}
uses cwstring, sysutils; // string manager
begin
  writeln('Aa vs aA ',AnsiCompareStr('Aa','aA'),' should be negative');
  writeln('ab vs aa ',AnsiCompareStr('ab','aa'),' should be positive');
  writeln('aa vs aa ', AnsiCompareStr('aa','aa'),' should be zero');
end.
Now, shut up. <very angry!, not grumpy> >:D >:D The bug is everywhere and it is a big one.
On Windows it always fails; on Linux (any) it fails after installing a string manager.
« Last Edit: August 15, 2017, 10:25:43 pm by Thaddy »

EganSolo

  • Sr. Member
  • ****
  • Posts: 290
Re: A simple sane question to end insanity! TStringList in unicode mode
« Reply #21 on: August 15, 2017, 10:13:12 pm »
In my case, I've done the following:
1. {$modeswitch unicodeStrings} : If I don't include this switch, I get a raft of warning messages
2. {$codepage utf-8}
3. My case-sensitive method relies on this bit of code inside a loop
Code: Pascal
    V1 := Ord(S1[i]);
    V2 := Ord(S2[i]);
    If V1 = V2 then continue
    else begin
       Result := V1 - V2;
       exit;
    end;

4. For the case-insensitive method, the code is slightly modified as follows
Code: Pascal
    V1 := Ord(TCharacter.ToLower(S1[i]));
    V2 := Ord(TCharacter.ToLower(S2[i]));
    If V1 = V2 then continue
    else begin
       Result := V1 - V2;
       exit;
    end;

Thus far, Ord and the ToLower implementation in the Character unit seem to work. However, I have not tested this without the two compiler directives, and I don't claim that my implementations are optimal. If there's interest, I can post more of that code.
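
For readers who want to try the idea, here is a self-contained sketch of how the case-sensitive fragment could sit inside a complete function. The function name CompareCodeUnits, the loop bounds and the length tie-break are assumptions; the post only shows the loop body. It compares UTF-16 code units one by one, which is a plain ordinal order and not a locale-aware collation.

```pascal
program CodeUnitCompareDemo;
{$mode objfpc}
{$modeswitch unicodestrings}

// Hypothetical wrapper around the loop body shown above.
// Returns <0, 0 or >0, comparing UTF-16 code units by ordinal value.
function CompareCodeUnits(const S1, S2: UnicodeString): Integer;
var
  i, MinLen: SizeInt;
begin
  // Only the shared prefix can be compared element by element.
  MinLen := Length(S1);
  if Length(S2) < MinLen then
    MinLen := Length(S2);
  for i := 1 to MinLen do
    if S1[i] <> S2[i] then
      Exit(Ord(S1[i]) - Ord(S2[i]));
  // Assumption: when one string is a prefix of the other, the shorter sorts first.
  Result := Length(S1) - Length(S2);
end;

begin
  Writeln(CompareCodeUnits('B', 'a'));       // negative: 'B' ($42) precedes 'a' ($61)
  Writeln(CompareCodeUnits('abc', 'ab'));    // positive: longer string sorts later
  Writeln(CompareCodeUnits('same', 'same')); // zero
end.
```

A case-insensitive variant would map both characters through TCharacter.ToLower from the Character unit before taking Ord, as in item 4.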

Thaddy

  • Hero Member
  • *****
  • Posts: 14373
  • Sensorship about opinions does not belong here.
Re: A simple sane question to end insanity! TStringList in unicode mode
« Reply #22 on: August 15, 2017, 10:22:48 pm »
The only thing that matters is that you revealed a big bug. Don't try to work around it for now.
KUDOS for reporting it. It does not help to try different encodings. It is a bug.
« Last Edit: August 15, 2017, 10:24:53 pm by Thaddy »

engkin

  • Hero Member
  • *****
  • Posts: 3112
Re: A simple sane question to end insanity! TStringList in unicode mode
« Reply #23 on: August 15, 2017, 10:25:14 pm »
Quote
The accent on the upper E is missing... it should be É but I get back E.

Change your console font to a TrueType font, and prepare it to accept UTF-8 encoding, something along these lines:
Code: Pascal
  // Needs the Windows unit in the uses clause; oldCP is assumed to be declared as UINT.
  oldCP := GetConsoleOutputCP();
  SetConsoleOutputCP(CP_UTF8);
  SetTextCodePage(Output, CP_UTF8);
  Writeln('1: ' , SL[0]);
  Writeln('2: ' , SL[1]);
  Writeln('3: ' , SL[2]);
  Writeln('4: ' , SL[3]);
  Writeln('5: ' , SL[4]);
  // Restore the console code page that was active before.
  SetConsoleOutputCP(oldCP);

Thaddy

  • Hero Member
  • *****
  • Posts: 14373
  • Sensorship about opinions does not belong here.
Re: A simple sane question to end insanity! TStringList in unicode mode
« Reply #24 on: August 15, 2017, 10:27:39 pm »
Quote
Change your console font to a true type font, and prepare it to accept UTF8 encoding, something along these lines:
Nonsense. It is a bug.
(I used to give a similar answer, though, but in this case it is a big bug, as my simplified code demonstrates.)
It is a big bug because it is low-level and affects sort order, and hence any classes that rely on it also fail.
« Last Edit: August 15, 2017, 10:30:30 pm by Thaddy »

engkin

  • Hero Member
  • *****
  • Posts: 3112
Re: A simple sane question to end insanity! TStringList in unicode mode
« Reply #25 on: August 15, 2017, 10:36:47 pm »
Quote
Change your console font to a true type font, and prepare it to accept UTF8 encoding, something along these lines:
Nonsense. It is a bug.
(I used to give a similar answer, though, but in this case it is a big bug, as my simplified code demonstrates)
It is a big bug because it is low-level and affects sort order.....And hence any classes that rely on it also fail.

I am not talking about the bug (or feature). It is about printing É in his console.

Thaddy

  • Hero Member
  • *****
  • Posts: 14373
  • Sensorship about opinions does not belong here.
Re: A simple sane question to end insanity! TStringList in unicode mode
« Reply #26 on: August 15, 2017, 10:44:09 pm »
Quote
I am not talking about the bug (or feature). It is about printing É in his console.
I am aware of that. In this case stick to the subject...

engkin

  • Hero Member
  • *****
  • Posts: 3112
Re: A simple sane question to end insanity! TStringList in unicode mode
« Reply #27 on: August 15, 2017, 10:47:48 pm »
Quote
I am not talking about the bug (or feature). It is about printing É in his console.
I am aware of that. In this case stick to the subject...

Don't be silly, he brought that up.

EganSolo

  • Sr. Member
  • ****
  • Posts: 290
Re: A simple sane question to end insanity! TStringList in unicode mode
« Reply #28 on: August 15, 2017, 11:01:08 pm »
Hi engkin,

To Thaddy's point, the code I wrote to handle Unicode displays the text properly on the console with no font change whatsoever.

Thaddy,

Thanks for the advice. Well received. I was simply replying to a question that had been asked earlier about how I was managing this. Perfectly happy to wait for a fix.
« Last Edit: August 15, 2017, 11:03:15 pm by EganSolo »

Thaddy

  • Hero Member
  • *****
  • Posts: 14373
  • Sensorship about opinions does not belong here.
Re: A simple sane question to end insanity! TStringList in unicode mode
« Reply #29 on: August 15, 2017, 11:01:08 pm »
Quote
I am not talking about the bug (or feature). It is about printing É in his console.
I am aware of that. In this case stick to the subject...

Don't be silly, he brought that up.
Yes, but you and I understand the bug. He doesn't. So shout "bug" instead of adding confusion.
At least I trust you to have some understanding of what he has found. Try it, and see that it fails.
Dead giveaway: it is NOT the high-level compare stuff.

 
