Author Topic: Sorting Middle Eastern and other Language Strings  (Read 13775 times)

Avishai

  • Hero Member
  • *****
  • Posts: 1021
Sorting Middle Eastern and other Language Strings
« on: September 30, 2011, 01:00:02 pm »
Lazarus/FPC has a problem sorting Middle Eastern and other language strings: TComboBox, TListBox, TStringGrid and TStringList each sort differently.  I have a QuickSort routine that works for Hebrew, but I do not know if it works for other Middle Eastern languages.  I do not know the Arabic alphabet :(  If someone could check it for other Middle Eastern and other languages and let me know, I will send it to FPC as a patch if it works.  I checked sorting English first and it was OK.

Also it needs to be written better.  Too hard to read (I did not write it).

Use 2 TListBox
      1 TButton
   ListBox1.Sorted:= False (add letters of your Alphabet)
   ListBox2.Sorted:= False (empty)
   Button1  (OnClick should sort Alphabet and show in ListBox2)

Quote
procedure QuickSort(var A: TStringList);

  procedure Sort(L, R: Integer);
  var
    I, J: Integer;
    Y, X:string;
  begin
    I:= L; J:= R; X:= A[(L+R) DIV 2];
    repeat
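      // Skip elements that are already on the correct side of the pivot X
      while StrIComp(pchar(A[I]),pchar(X))<0 do inc(I);
      while StrIComp(pchar(X),pchar(A[J]))<0 do dec(J);
      if I <= J then
      begin
        // Swap the out-of-place pair
        Y:= A[I]; A[I]:= A[J]; A[J]:= Y;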
        inc(I); dec(J);
      end;
    until I > J;
    if L < J then Sort(L,J);
    if I < R then Sort(I,R);
  end;

begin
  // An empty list has no pivot element to read, so only sort when needed
  if A.Count > 1 then
    Sort(0,A.Count-1);
end;

procedure TForm1.Button1Click(Sender: TObject);
var
  StrLst: TStringList;
begin
  StrLst:= TStringList.Create;
    StrLst.CommaText:= ListBox1.Items.CommaText;
    QuickSort(StrLst);
    ListBox2.Items.CommaText:= StrLst.CommaText;
  FreeAndNil(StrLst);
end;
« Last Edit: October 01, 2011, 08:46:28 pm by Avishai »
Lazarus Trunk / fpc 2.6.2 / Win32

Avishai

Re: Sorting Middle Eastern and other Language Strings
« Reply #1 on: October 01, 2011, 09:16:28 pm »
I have attached a test program using the QuickSort code so that you can check sorting for your languages.

volvo877

  • Newbie
  • Posts: 3
Re: Sorting Middle Eastern and other Language Strings
« Reply #2 on: October 02, 2011, 09:46:59 am »
Why should you use QuickSort, if you have CustomSort? Plain and simple:
Code:
function mySort(List: TStringList; Index1, Index2: Integer): Integer;
begin
   Result := stricomp( PChar(List[Index1]), PChar(List[Index2]) );
end;

procedure TForm1.Button1Click(Sender: TObject);
var
   StrLst: TStringList;
begin
   StrLst := TStringList.Create;
   StrLst.Assign(ListBox1.Items);
   StrLst.CustomSort(@mySort);
   ListBox2.Items.Assign(StrLst);
   FreeAndNil(StrLst);
end;
Working for all languages (checked with Russian, Hebrew and Arabic)...

Avishai

Re: Sorting Middle Eastern and other Language Strings
« Reply #3 on: October 02, 2011, 10:09:53 am »
Thanks volvo877

The point is that FPC TStringList.Sort uses QuickSort but the code does not work for all languages.  If it worked for all languages then no extra code would be needed. But you are right, your code works for all the languages that I know to check (just 2).  I am interested to see if it works for Spanish.  I was told that there are some different rules for Spanish.  Something about letter groups, but I didn't understand what he was saying since I don't know Spanish (I wish I did).  But I think maybe he wasn't serious.

[Added]
It should be that all that is needed for any control that has Lists of strings is to set the Sorted property to true, no matter what language is used.  There should not be a need for workarounds like this.
« Last Edit: October 02, 2011, 10:32:04 am by Avishai »

Ocye

  • Hero Member
  • *****
  • Posts: 518
    • Scrabble3D
Re: Sorting Middle Eastern and other Language Strings
« Reply #4 on: October 02, 2011, 10:57:02 am »
Quote
I was told that there are some different rules for Spanish.  Something about letter groups, but I didn't understand what he was saying since I don't know Spanish (I wish I did).

Spanish uses the digraphs LL, RR and CH. Because Unicode doesn't encode these as single characters, you have to apply your own code.
Lazarus 1.7 (SVN) FPC 3.0.0

Avishai

Re: Sorting Middle Eastern and other Language Strings
« Reply #5 on: October 02, 2011, 11:04:44 am »
Thanks Ocye.  So it is true what he told me.  He didn't explain it as clearly as you.  That must make sorting lists fun  ;)  I wonder how many other languages have cases like this.  I had no idea that it was such a big deal to sort strings.  Maybe my dream of one sort routine for all languages is not realistic.

Bandbaz

  • New Member
  • *
  • Posts: 40
Re: Sorting Middle Eastern and other Language Strings
« Reply #6 on: October 02, 2011, 01:53:55 pm »
Quote
I have attached a test program using the QuickSort code so that you can check sorting for your languages.

Free Pascal Compiler version 2.4.2-0 [2010/11/20] for x86_64
Target OS: Linux for x86-64
QuickSort.lpr(16,30) Error: Identifier not found "RequireDerivedFormResource"
QuickSort.lpr(22) Fatal: There were 1 errors compiling module, stopping

Bandbaz

Re: Sorting Middle Eastern and other Language Strings
« Reply #7 on: October 02, 2011, 02:06:15 pm »
ok .. I've just ignored that line and it works.

there are some problems with Farsi (Persian)
how can I help?
is it enough if I write the sorted letters here for you?

Avishai

Re: Sorting Middle Eastern and other Language Strings
« Reply #8 on: October 02, 2011, 02:21:43 pm »
Thank you for the feedback Bandbaz.  I was hoping someone would test for Farsi :)  It seems that sorting in some languages is not so simple.  I tried volvo877's code and it worked for English and Hebrew, and he also tested it for Russian and Arabic.

Bandbaz

Re: Sorting Middle Eastern and other Language Strings
« Reply #9 on: October 02, 2011, 02:47:44 pm »
Arabic and Farsi use almost the same characters, with only some small differences (for example, there are 4 more characters in Farsi, and one or two are different on the computer!).

I added all the Farsi characters to ListBox1 in your program.
When I click the sort button, the only problem is that those extra and different characters are sorted at the end of the list instead of in their proper places.
So it's working for Arabic, but not for Farsi.
There are also 2 mistakes in the sorting, but I'm not sure whether they are mistakes or another difference between Farsi and Arabic.
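
A possible explanation for the extra letters landing at the end, sketched under the assumption that StrIComp and CompareStr compare byte by byte (apart from ASCII case folding in StrIComp): the Farsi letter PEH has a higher code point than every letter of the basic Arabic alphabet, so in UTF-8 it compares greater than all of them. The constant names below are made up for this demo, and the UTF-8 bytes are written out by hand so the result does not depend on the source file encoding.

```pascal
program FarsiByteOrderDemo;
{$mode objfpc}{$H+}

uses
  SysUtils;

const
  // Hypothetical names; each constant holds the UTF-8 bytes of one letter.
  Beh = #$D8#$A8;  // U+0628 ARABIC LETTER BEH
  Teh = #$D8#$AA;  // U+062A ARABIC LETTER TEH
  Yeh = #$D9#$8A;  // U+064A ARABIC LETTER YEH (last letter of the Arabic alphabet)
  Peh = #$D9#$BE;  // U+067E ARABIC LETTER PEH (Farsi; alphabetically between BEH and TEH)

begin
  // CompareStr is a plain byte comparison, so for UTF-8 it follows code point order.
  WriteLn('Beh < Peh: ', CompareStr(Beh, Peh) < 0);
  WriteLn('Peh < Teh: ', CompareStr(Peh, Teh) < 0);
  WriteLn('Yeh < Peh: ', CompareStr(Yeh, Peh) < 0);
end.
```

This should print TRUE, FALSE, TRUE: PEH does not sort before TEH as the Farsi alphabet expects, and it even sorts after YEH, which matches the "extra characters at the end of the list" result described above.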

BTW, if I can help on this, tell me  ;)

Bandbaz

Re: Sorting Middle Eastern and other Language Strings
« Reply #10 on: October 02, 2011, 03:00:10 pm »
same with volvo877's code

Avishai

Re: Sorting Middle Eastern and other Language Strings
« Reply #11 on: October 02, 2011, 03:02:43 pm »
Thank you Bandbaz.  For now it looks like I need to gather more information and then try to figure out a way to use that information to at least improve sorting for different languages.  Any help will be appreciated.

When I saw that TListBox, TComboBox, TStringGrid and TStringList use different sorting routines and each one gives different results, it seemed to me that the best approach might be to fix TStringList and then have each control use TStringList.Sort so that they all give the same result.

Zoran

  • Hero Member
  • *****
  • Posts: 1980
    • http://wiki.lazarus.freepascal.org/User:Zoran
Re: Sorting Middle Eastern and other Language Strings
« Reply #12 on: October 02, 2011, 03:04:39 pm »
Quote
I wonder how many other languages have cases like this.  I had no idea that it was such a big deal to sort strings.  Maybe my dream of one sort routine for all languages is not realistic.

The Croatian, Serbian and Bosnian Latin alphabet has three digraphs which are considered separate letters (LJ, NJ and DŽ) and have their own places in the alphabet. For example, the words LEPTIR, LOPTA, LJUBAV should be sorted in that order (LJUBAV should come below the word LOPTA, because the letter (digraph) "LJ" comes after the whole of "L" is listed). A simple Latin sort would put LJUBAV between LEPTIR and LOPTA, which is wrong in these languages.
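
A minimal sketch of how such a digraph rule could be plugged into TStringList.CustomSort. It is a simplification, not a full collation: it only handles ASCII letters plus the digraphs LJ and NJ (DŽ and the accented letters are left out), and it treats every L+J or N+J pair as a digraph. NextLetterRank, DigraphCompare and ListCompare are hypothetical names invented for this example.

```pascal
program DigraphSortDemo;
{$mode objfpc}{$H+}

uses
  Classes, SysUtils;

// Reads one "letter" of the upper-cased string S at position P, advances P
// past it and returns its rank. LJ and NJ rank directly after L and N.
function NextLetterRank(const S: string; var P: Integer): Integer;
begin
  Result := Ord(S[P]) * 2;
  if (P < Length(S)) and (S[P + 1] = 'J') and (S[P] in ['L', 'N']) then
  begin
    Inc(Result);  // the digraph sorts just after its first letter
    Inc(P);       // consume the 'J' as part of the digraph
  end;
  Inc(P);
end;

function DigraphCompare(const A, B: string): Integer;
var
  UA, UB: string;
  PA, PB, RA, RB: Integer;
begin
  UA := UpperCase(A);
  UB := UpperCase(B);
  PA := 1;
  PB := 1;
  while (PA <= Length(UA)) and (PB <= Length(UB)) do
  begin
    RA := NextLetterRank(UA, PA);
    RB := NextLetterRank(UB, PB);
    if RA <> RB then
      Exit(RA - RB);
  end;
  // One string is a prefix of the other: the one with letters left sorts last.
  Result := Ord(PA <= Length(UA)) - Ord(PB <= Length(UB));
end;

// Callback in the shape TStringList.CustomSort expects.
function ListCompare(List: TStringList; Index1, Index2: Integer): Integer;
begin
  Result := DigraphCompare(List[Index1], List[Index2]);
end;

var
  Words: TStringList;
  I: Integer;
begin
  Words := TStringList.Create;
  try
    Words.Add('LOPTA');
    Words.Add('LJUBAV');
    Words.Add('LEPTIR');
    Words.CustomSort(@ListCompare);
    for I := 0 to Words.Count - 1 do
      WriteLn(Words[I]);
  finally
    Words.Free;
  end;
end.
```

With the three words from the example this should print LEPTIR, LOPTA, LJUBAV. A real implementation would also need the remaining letters of the alphabet and a way to tell a true digraph from two adjacent letters.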
Swan, ZX Spectrum emulator https://github.com/zoran-vucenovic/swan

Avishai

Re: Sorting Middle Eastern and other Language Strings
« Reply #13 on: October 02, 2011, 03:13:17 pm »
Thanks Zoran.  Wow.  It's looking like TStringList and others almost need to have 'property Language' added in the published section just to do a simple (not so simple) sort. :) 

typo

  • Hero Member
  • *****
  • Posts: 3051
Re: Sorting Middle Eastern and other Language Strings
« Reply #14 on: October 02, 2011, 04:51:39 pm »
Most Latin languages also have accented characters, which are sorted wrongly.

The main problem seems to be sorting speed for large lists, which is good with the standard QuickSort.  So users of Latin languages come to recognise a kind of "computer sorting" that is different from the usual order, because the programmer is not able to find a feasible way to sort large lists that is better than the standard QuickSort.
« Last Edit: October 02, 2011, 05:11:58 pm by typo »

 
