Lazarus

Free Pascal => General => Topic started by: BillMcEnaney on May 19, 2022, 07:07:24 pm

Title: Removing duplicates from an array
Post by: BillMcEnaney on May 19, 2022, 07:07:24 pm
Hi, everyone,

Here's my Pascal function to remove duplicates from a list stored in an array. Please forgive me for any FPC-related mistakes because I'm used to ANSI Standard Pascal.

Here are the type and constant declarations for the program.

const
   Capacity = 80;

type
   Index_Range = 0..Capacity;
   Element_Tupe = integer;
   List = array[Index_Range] of Element_Type;

Code: [Select]
function Remove_Duplicates(Table: List; Table_Length: Index_Range): List;
   var
      Result: List;
      Here, There: Index_Range;
      Duplicates: set of Element_Type;
   begin
      There := 1;
      for Here ::= 1 to Table_Length do
         if Table[Here] not in Duplicates
         then
            begin
               include(Duplicates, Table[Here]);
               Result[There] := Table[Here];
               There := succ(There)
            end;
         if Duplicates = []
            then Result := Table;
        Remove_Duplicates := Result
   end;

The compiler made my laptop complain about the if-statement until I changed Element_Type to 0..5. So I wonder whether the set of integers between 1 and MaxInt is too big to represent with a Pascal set. Am I right about that?

I could overwrite duplicates in the original list. But I use the set and the result list to speed up the function.
Title: Re: Removing duplicates from an array
Post by: howardpc on May 19, 2022, 11:38:45 pm
So I wonder whether the set of integers between 1 and MaxInt is too big to represent with a Pascal set. Am I right about that?
You wonder correctly. FPC sets are limited to Byte range ordinals (0..255).
There are generics libraries that offer sets with greater ranges and more functionalities (e.g. a set Count property) than the built-in FPC set implementation.
For instance see here (https://github.com/avk959/LGenerics) .
Title: Re: Removing duplicates from an array
Post by: BobDog on May 20, 2022, 07:17:14 pm

Is this it in fp maybe?
Code: Pascal  [Select][+][-]
  1.  
  2. const
  3.    Capacity = 255;
  4.  
  5. type
  6.    Index_Range = 0..Capacity;
  7.    Element_Type = integer;
  8.    List = array[Index_Range] of Element_Type;
  9.    aod= array of Element_Type;
  10.    
  11.    function Remove_Duplicates(Table: List; Table_Length: Index_Range):aod;
  12.    var
  13.       Result:array of Element_type;//
  14.      
  15.       Here, There: Index_Range;
  16.       Duplicates:set of index_range;
  17.    begin
  18.     setlength(result,capacity);
  19.     Duplicates:=[0];
  20.       There := 1;
  21.       for Here := 1 to Table_Length do
  22.          if (Table[Here] in Duplicates) =false then
  23.          
  24.             begin
  25.                include(Duplicates, Table[Here]);
  26.                Result[There] := Table[Here];
  27.                There := succ(There)
  28.             end;
  29.          if Duplicates = []
  30.             then Result := Table;
  31.             setlength(result,There);
  32.         Remove_Duplicates := Result
  33.    end;
  34.    
  35.    
  36.   function range(mymin:int32;mymax:int32):int32;
  37.    begin
  38.    range:=trunc(int((Random*(Mymax-mymin+1)))+MyMin);
  39.   end;
  40.  
  41.    
  42.    
  43.    var
  44.    i:integer;
  45.    table:List;
  46.    answer:aod;
  47.    begin
  48.    for i:=0 to high(table) do table[i]:=range(100,110);
  49.    writeln('Before:');
  50.    for i:=0 to high(table) do write(table[i],' ');
  51.    answer:=Remove_Duplicates(table,high(table));
  52.    writeln;
  53.    writeln('After:');
  54.    for i:=1 to high(answer) do write(answer[i],' ');
  55.    writeln;
  56.    writeln('Press return to end');
  57.    readln;
  58.    end.
  59.  
  60.  

And my attempt, keeping the order and using dynamic arrays and template (generics), but not sets.

Code: Pascal  [Select][+][-]
  1.  
  2. generic function CleanUp<T>(a:T):T;
  3. var
  4.  n1,n2,flag:int32;
  5.  count:int32=1;
  6.  ret:T;
  7. begin
  8.   setlength(ret,high(a));
  9.   ret[0]:=a[0];
  10.   For n1 :=1 To high(a) do
  11.   begin
  12.    flag:=0;
  13.    For n2 :=0 to  n1-1 do
  14.     begin
  15.     If (a[n1]=a[n2]) Then
  16.      begin
  17.       flag:=1;
  18.       break;
  19.       end;
  20.      end;
  21.  If (flag=0) Then
  22.     begin
  23.      ret[count]:=a[n1];
  24.      count:=count+1;
  25.     end;
  26.  end;
  27.  setlength(ret,count);
  28.  exit(ret);
  29. end;
  30.  
  31.  function range(mymin:int32;mymax:int32):int32;
  32.    begin
  33.    range:=trunc(int((Random*(Mymax-mymin+1)))+MyMin);
  34.   end;
  35.  
  36.  
  37. type aoi=array of int32;
  38. type aos=array of char;
  39. var ans:aoi=nil;
  40. var ret:aos=nil;
  41. var j:int32;
  42. var i:aoi;
  43. var s:aos;
  44.  
  45. begin
  46. setlength(i,255);
  47. for j:=0 to high(i) do i[j]:=range(100,110);
  48. writeln('Before:');
  49. for j:=low(i) to high(i) do write(i[j],' ');
  50. writeln;
  51. writeln('After:');
  52. ans:=specialize CleanUp<aoi>(i);
  53. for j:=low(ans) to high(ans) do write(ans[j],' ');
  54. writeln;
  55. writeln;
  56. writeln;
  57. setlength(s,50);
  58. for j:=0 to high(s) do if range(1,10)>5 then s[j]:=chr(range(65,90)) else s[j]:=chr(range(97,122));//s[j]:=range
  59. writeln('Before:');
  60. for j:=low(s) to high(s) do write(s[j],' ');
  61. writeln;
  62. writeln('After:');;
  63. ret:=specialize CleanUp<aos>(s);
  64. for j:=low(ret) to high(ret) do write(ret[j],' ');
  65. writeln;
  66. writeln('Press return to end');
  67. readln;
  68.  
  69. end.
  70.  
  71.  
  72.  
Title: Re: Removing duplicates from an array
Post by: 440bx on May 20, 2022, 07:57:27 pm
Just a suggestion, while sets in FPC are limited to 256 elements, bitpacked arrays of boolean are not.

That would allow creating a bitmap of duplicates to be used to remove them in a subsequent pass.

Title: Re: Removing duplicates from an array
Post by: avk on May 23, 2022, 12:26:19 pm
The bitmap for the above range (1..MaxInt) must be at least 228 bytes in size and must be initialized.
It doesn't look very practical, does it?
Title: Re: Removing duplicates from an array
Post by: Warfley on May 23, 2022, 12:50:38 pm
Using (bit)sets is not a real alternative for large value ranges. The alternative by BobDog is inplace, but has complexity of O(n²) and is therefore quite slow.
Therefore I would suggest as an alternative, to use a dynamic set datastructure, either a hashset, which has an average lookup complexity of O(1) or a sorted set (e.g. RBT) which has a lookup time of O(logn), but the advantage of being easiely pre allocatable.

E.g.
Code: Pascal  [Select][+][-]
  1. uses
  2.   Generics.Collections;
  3.  
  4. type
  5.   TIntArray = array of Integer;
  6.   TIntSet = specialize THashSet<Integer>;
  7.  
  8. function RemoveDublicates(Data: TIntArray): TIntArray;
  9. var
  10.   Visited: TIntSet;
  11.   WriteHead: SizeInt;
  12.   Elem: Integer;
  13. begin
  14.   SetLength(Result, Length(Data));
  15.   WriteHead := 0;
  16.   Visited := TIntSet.Create;
  17.   try
  18.     for Elem in Data do
  19.       if not Visited.Contains(Elem) then
  20.       begin
  21.         Result[WriteHead] := Elem;
  22.         Inc(WriteHead);
  23.         Visited.Add(Elem);
  24.       end;
  25.   finally
  26.     Visited.Free;
  27.   end;
  28.   SetLength(Result, WriteHead);
  29. end;
  30.  
  31. var
  32.   Data: Array of Integer = (5,2,1,3,5,2,3,4,1,2);
  33.   Elem: Integer;
  34. begin
  35.   Data := RemoveDublicates(Data);
  36.   for Elem in Data do
  37.     WriteLn(Elem);
  38.   ReadLn;
  39. end.
With a hash set the average runtime complexity of this function should be O(n), and therefore quite efficient, while not blowing up the memory consumption like a bitset would
Title: Re: Removing duplicates from an array
Post by: 440bx on May 23, 2022, 01:18:17 pm
The bitmap for the above range (1..MaxInt) must be at least 228 bytes in size and must be initialized.
It doesn't look very practical, does it?
You got that right, I didn't say it was a small set  :D  but even a 32 bit process can afford the (approx) 260MB it will take.

I think the really critical part is the number of elements in the array.  For a large number of elements, 2 or 3 hundred thousand+, it might be reasonable to temporarily have a bitmap that large.  Honestly, I don't know where the cut-off is but, for smaller counts, I have no doubt there are smaller and faster ways.

Title: Re: Removing duplicates from an array
Post by: BobDog on May 23, 2022, 05:36:22 pm
Using (bit)sets is not a real alternative for large value ranges. The alternative by BobDog is inplace, but has complexity of O(n²) and is therefore quite slow.
Therefore I would suggest as an alternative, to use a dynamic set datastructure, either a hashset, which has an average lookup complexity of O(1) or a sorted set (e.g. RBT) which has a lookup time of O(logn), but the advantage of being easiely pre allocatable.

E.g.
Code: Pascal  [Select][+][-]
  1. uses
  2.   Generics.Collections;
  3.  
  4. type
  5.   TIntArray = array of Integer;
  6.   TIntSet = specialize THashSet<Integer>;
  7.  
  8. function RemoveDublicates(Data: TIntArray): TIntArray;
  9. var
  10.   Visited: TIntSet;
  11.   WriteHead: SizeInt;
  12.   Elem: Integer;
  13. begin
  14.   SetLength(Result, Length(Data));
  15.   WriteHead := 0;
  16.   Visited := TIntSet.Create;
  17.   try
  18.     for Elem in Data do
  19.       if not Visited.Contains(Elem) then
  20.       begin
  21.         Result[WriteHead] := Elem;
  22.         Inc(WriteHead);
  23.         Visited.Add(Elem);
  24.       end;
  25.   finally
  26.     Visited.Free;
  27.   end;
  28.   SetLength(Result, WriteHead);
  29. end;
  30.  
  31. var
  32.   Data: Array of Integer = (5,2,1,3,5,2,3,4,1,2);
  33.   Elem: Integer;
  34. begin
  35.   Data := RemoveDublicates(Data);
  36.   for Elem in Data do
  37.     WriteLn(Elem);
  38.   ReadLn;
  39. end.
With a hash set the average runtime complexity of this function should be O(n), and therefore quite efficient, while not blowing up the memory consumption like a bitset would

Here are the results for 10 million elements.

BobDog
1250 milliseconds
1055 1059 1072 1085 1060 1086 1042 1062 1065 1038 1044 1030 1090 1005 1097 1027 1048 1079 1082 1053 1057 1039 1093 1084 1007 1034 1008 1002 1037 1096 1078 1014 1087 1098 1047 1080 1046 1052 1068 1011 1064 1058 1054 1095 1076 1010 1041 1026 1018 1074 1021 1013 1001 1032 1015 1061 1022 1091 1045 1036 1070 1006 1067 1017 1075 1031 1003 1099 1016 1100 1025 1024 1066 1019 1009 1088 1051 1028 1012 1029 1069 1092 1000 1040 1083 1071 1043 1077 1089 1073 1050 1023 1035 1081 1049 1020 1094 1063 1004 1056 1033



Warfley
219 milliseconds
1055 1059 1072 1085 1060 1086 1042 1062 1065 1038 1044 1030 1090 1005 1097 1027 1048 1079 1082 1053 1057 1039 1093 1084 1007 1034 1008 1002 1037 1096 1078 1014 1087 1098 1047 1080 1046 1052 1068 1011 1064 1058 1054 1095 1076 1010 1041 1026 1018 1074 1021 1013 1001 1032 1015 1061 1022 1091 1045 1036 1070 1006 1067 1017 1075 1031 1003 1099 1016 1100 1025 1024 1066 1019 1009 1088 1051 1028 1012 1029 1069 1092 1000 1040 1083 1071 1043 1077 1089 1073 1050 1023 1035 1081 1049 1020 1094 1063 1004 1056 1033

array of length 10000000 with range(1000,1100) each element




Title: Re: Removing duplicates from an array
Post by: avk on May 24, 2022, 03:46:10 pm
...
I think the really critical part is the number of elements in the array.  For a large number of elements, 2 or 3 hundred thousand+, it might be reasonable to temporarily have a bitmap that large.  Honestly, I don't know where the cut-off is but, for smaller counts, I have no doubt there are smaller and faster ways.

It is possible to try to roughly estimate the effect of range size and array length on performance.
I am used LGenerics on a Win64 machine:
Code: Pascal  [Select][+][-]
  1. program proj;
  2.  
  3. {$mode objfpc}{$h+}
  4. {$modeswitch advancedrecords}
  5. {$maxstacksize $40000000}
  6.  
  7. uses
  8.   SysUtils, Math, lgUtils, lgHashSet, lgHash, lgTreeSet;
  9.  
  10. var
  11.   ArrayLen: Integer = 10000;
  12.   RepCount: Integer = 1000;
  13.  
  14. const
  15.   TryCount = 5;
  16.  
  17. type
  18.   THelper = record
  19.     class function HashCode(aValue: DWord): SizeInt; static; inline;
  20.     class function Equal(L, R: DWord): Boolean; static; inline;
  21.   end;
  22.  
  23.   generic TRangeTest<const UpBound: Integer> = record
  24.   const
  25.     UPPER = UpBound or Ord(UpBound = 0);
  26.  
  27.   type
  28.     TElType           = 1..UPPER;
  29.     TElArray          = array of TElType;
  30.     TRemoveDuplicates = function(const a: array of TElType): TElArray;
  31.  
  32.     TTestFun = record
  33.       Fun: TRemoveDuplicates;
  34.       DisplayText: string;
  35.     end;
  36.  
  37.     TBitSet      = specialize TGSet<TElType>;
  38.     THashSetType = specialize TGLiteChainHashSet<TElType, THelper>;
  39.     THashSet     = THashSetType.TSet;
  40.     TTreeSet     = specialize TGLiteTreeSet<TElType, DWord>;
  41.  
  42.     class function  BitSet(const a: array of TElType): TElArray; static;
  43.     class function  HashSet(const a: array of TElType): TElArray; static;
  44.     class function  TreeSet(const a: array of TElType): TElArray; static;
  45.     class procedure Execute; static;
  46.   end;
  47.  
  48. class function THelper.HashCode(aValue: DWord): SizeInt;
  49. begin
  50.   Result := JdkHash(aValue);
  51. end;
  52.  
  53. class function THelper.Equal(L, R: DWord): Boolean;
  54. begin
  55.   Result := L = R;
  56. end;
  57.  
  58. class function TRangeTest.BitSet(const a: array of TElType): TElArray;
  59. var
  60.   s: TBitSet;
  61.   I, J: SizeInt;
  62. begin
  63.   SetLength(Result, Length(a));
  64.   J := 0;
  65.   for I := 0 to High(a) do
  66.     begin
  67.       if a[I] in s then continue;
  68.       s.Include(a[I]);
  69.       Result[J] := a[I];
  70.       Inc(J);
  71.     end;
  72.   SetLength(Result, J);
  73. end;
  74.  
  75. {$warn 5089 off}
  76. class function TRangeTest.HashSet(const a: array of TElType): TElArray;
  77. var
  78.   s: THashSet;
  79.   I, J: SizeInt;
  80. begin
  81.   s.EnsureCapacity(Min(Ord(High(TElType)) - Ord(Low(TElType)) + 1, Length(a)));
  82.   SetLength(Result, Length(a));
  83.   J := 0;
  84.   for I := 0 to High(a) do
  85.     if s.Add(a[I]) then
  86.       begin
  87.         Result[J] := a[I];
  88.         Inc(J);
  89.       end;
  90.   SetLength(Result, J);
  91. end;
  92.  
  93. class function TRangeTest.TreeSet(const a: array of TElType): TElArray;
  94. var
  95.   s: TTreeSet;
  96.   I, J: SizeInt;
  97. begin
  98.   s.EnsureCapacity(Min(Ord(High(TElType)) - Ord(Low(TElType)) + 1, Length(a)));
  99.   SetLength(Result, Length(a));
  100.   J := 0;
  101.   for I := 0 to High(a) do
  102.     if s.Add(a[I]) then
  103.       begin
  104.         Result[J] := a[I];
  105.         Inc(J);
  106.       end;
  107.   SetLength(Result, J);
  108. end;
  109.  
  110. class procedure TRangeTest.Execute;
  111. var
  112.   a, u: TElArray;
  113.   CurrFun: TTestFun;
  114.   I, J: SizeInt;
  115.   BestScore, Score: QWord;
  116. const
  117.   TestFuns: array of TTestFun = (
  118.     (Fun: @BitSet;  DisplayText: '    BitSet:  '),
  119.     (Fun: @HashSet; DisplayText: '    HashSet: '),
  120.     (Fun: @TreeSet; DisplayText: '    TreeSet: ')
  121.   );
  122. begin
  123.   WriteLn('  Subrange: 1..', UPPER);
  124.   SetLength(a, ArrayLen);
  125.   for I := 0 to High(a) do
  126.     a[I] := Random(UPPER) + 1;
  127.  
  128.   for CurrFun in TestFuns do
  129.     begin
  130.       BestScore := High(QWord);
  131.       for I := 1 to TryCount do
  132.         begin
  133.           Score := GetTickCount64;
  134.           for J := 1 to RepCount do
  135.             u := CurrFun.Fun(a);
  136.           Score := GetTickCount64 - Score;
  137.           if Score < BestScore then
  138.             BestScore := Score;
  139.         end;
  140.       WriteLn(CurrFun.DisplayText, BestScore);
  141.     end;
  142. end;
  143.  
  144. procedure RunTest;
  145. begin
  146.   WriteLn('Array length: ', ArrayLen);
  147.   specialize TRangeTest<1024>.Execute;
  148.   specialize TRangeTest<65536>.Execute;
  149.   specialize TRangeTest<4194304>.Execute;
  150.   specialize TRangeTest<268435456>.Execute;
  151.   specialize TRangeTest<2147483647>.Execute;
  152.   WriteLn;
  153. end;
  154.  
  155. begin
  156.   RunTest;
  157.   ArrayLen := 50000; RepCount := 200;
  158.   RunTest;
  159.   ArrayLen := 200000; RepCount := 50;
  160.   RunTest;
  161.   ArrayLen := 1000000; RepCount := 10;
  162.   RunTest;
  163.   ArrayLen := 5000000; RepCount := 2;
  164.   RunTest;
  165.   ReadLn;
  166. end.
  167.  

And I got something like this:
Code: Text  [Select][+][-]
  1. Array length: 10000
  2.   Subrange: 1..1024
  3.     BitSet:  0
  4.     HashSet: 78
  5.     TreeSet: 655
  6.   Subrange: 1..65536
  7.     BitSet:  15
  8.     HashSet: 171
  9.     TreeSet: 1342
  10.   Subrange: 1..4194304
  11.     BitSet:  124
  12.     HashSet: 156
  13.     TreeSet: 1388
  14.   Subrange: 1..268435456
  15.     BitSet:  10484
  16.     HashSet: 156
  17.     TreeSet: 1372
  18.   Subrange: 1..2147483647
  19.     BitSet:  80138
  20.     HashSet: 171
  21.     TreeSet: 1373
  22.  
  23. Array length: 50000
  24.   Subrange: 1..1024
  25.     BitSet:  0
  26.     HashSet: 78
  27.     TreeSet: 608
  28.   Subrange: 1..65536
  29.     BitSet:  46
  30.     HashSet: 343
  31.     TreeSet: 1684
  32.   Subrange: 1..4194304
  33.     BitSet:  47
  34.     HashSet: 374
  35.     TreeSet: 1919
  36.   Subrange: 1..268435456
  37.     BitSet:  2231
  38.     HashSet: 374
  39.     TreeSet: 1918
  40.   Subrange: 1..2147483647
  41.     BitSet:  16177
  42.     HashSet: 374
  43.     TreeSet: 1918
  44.  
  45. Array length: 200000
  46.   Subrange: 1..1024
  47.     BitSet:  0
  48.     HashSet: 78
  49.     TreeSet: 608
  50.   Subrange: 1..65536
  51.     BitSet:  31
  52.     HashSet: 203
  53.     TreeSet: 1638
  54.   Subrange: 1..4194304
  55.     BitSet:  31
  56.     HashSet: 592
  57.     TreeSet: 2777
  58.   Subrange: 1..268435456
  59.     BitSet:  718
  60.     HashSet: 592
  61.     TreeSet: 2823
  62.   Subrange: 1..2147483647
  63.     BitSet:  4227
  64.     HashSet: 592
  65.     TreeSet: 2808
  66.  
  67. Array length: 1000000
  68.   Subrange: 1..1024
  69.     BitSet:  0
  70.     HashSet: 78
  71.     TreeSet: 608
  72.   Subrange: 1..65536
  73.     BitSet:  15
  74.     HashSet: 141
  75.     TreeSet: 1622
  76.   Subrange: 1..4194304
  77.     BitSet:  47
  78.     HashSet: 795
  79.     TreeSet: 4836
  80.   Subrange: 1..268435456
  81.     BitSet:  312
  82.     HashSet: 826
  83.     TreeSet: 5070
  84.   Subrange: 1..2147483647
  85.     BitSet:  1046
  86.     HashSet: 826
  87.     TreeSet: 5070
  88.  
  89. Array length: 5000000
  90.   Subrange: 1..1024
  91.     BitSet:  0
  92.     HashSet: 78
  93.     TreeSet: 593
  94.   Subrange: 1..65536
  95.     BitSet:  15
  96.     HashSet: 140
  97.     TreeSet: 1591
  98.   Subrange: 1..4194304
  99.     BitSet:  62
  100.     HashSet: 717
  101.     TreeSet: 6489
  102.   Subrange: 1..268435456
  103.     BitSet:  234
  104.     HashSet: 889
  105.     TreeSet: 7628
  106.   Subrange: 1..2147483647
  107.     BitSet:  405
  108.     HashSet: 889
  109.     TreeSet: 7613  
  110.  
Title: Re: Removing duplicates from an array
Post by: 440bx on May 24, 2022, 05:08:22 pm
It is possible to try to roughly estimate the effect of range size and array length on performance.
Thank you @avk.

From the results you posted I'd estimate the threshold to be in the vicinity of 2 million elements.  That's quite a bit more than I than initially thought.

That really shows the value of testing off-the-cuff estimates :)

Title: Re: Removing duplicates from an array
Post by: Thaddy on May 29, 2022, 08:14:14 am
Note OP's capacity is 81, not 80...
Title: Re: Removing duplicates from an array
Post by: circular on May 29, 2022, 08:59:14 am
Another approach is to sort the array. Then it is easy to remove the duplicates with a loop, by having a second index that progress only when a new value is found and where it is copied to. This can be done inplace. In the end set the length of the array of the number of distinct items found.
Title: Re: Removing duplicates from an array
Post by: Thaddy on May 29, 2022, 01:31:20 pm
Another approach is using generics.collections and TArray<T>. And indeed it needs to sort, which it can.
That should be a two liner with the correct TDuplicates for TList<T>
Not tested (yet) but it should work.

The magic is in combining the array and the list.
Title: Re: Removing duplicates from an array
Post by: avk on May 29, 2022, 02:45:53 pm
Another approach is to sort the array. Then it is easy to remove the duplicates with a loop, by having a second index that progress only when a new value is found and where it is copied to. This can be done inplace. In the end set the length of the array of the number of distinct items found.

Of course, in LGenerics the T(some)ArrayHelper.SelectDistinct() method does exactly that, but this method does not preserve the order of the elements in the array.
Title: Re: Removing duplicates from an array
Post by: BobDog on May 30, 2022, 03:13:33 pm
Another approach is using generics.collections and TArray<T>. And indeed it needs to sort, which it can.
That should be a two liner with the correct TDuplicates for TList<T>
Not tested (yet) but it should work.

The magic is in combining the array and the list.
Anyway, it is too slow to sort the elements in a large array.
I tested with quicksort, the sort time taken is about the total task time taken in my previous method, then you have to remove the duplicates, easy enough in a sorted array.
I didn't bother making a remove procedure, it wasn't worth it.


Title: Re: Removing duplicates from an array
Post by: avk on May 30, 2022, 05:59:15 pm
...
Anyway, it is too slow to sort the elements in a large array.
I tested with quicksort, the sort time taken is about the total task time taken in my previous method, then you have to remove the duplicates, easy enough in a sorted array.
I didn't bother making a remove procedure, it wasn't worth it.

It seems your conclusion is too hasty.
I tried running TGOrdinalArrayHelper.SelectDistinct (denoted Sorting) in the benchmark from my post #8 and got the following timings:
Code: Text  [Select][+][-]
  1. Array length: 10000
  2.   Subrange: 1..1024
  3.     Sorting: 47
  4.   Subrange: 1..65536
  5.     Sorting: 78
  6.   Subrange: 1..4194304
  7.     Sorting: 94
  8.   Subrange: 1..268435456
  9.     Sorting: 109
  10.   Subrange: 1..2147483647
  11.     Sorting: 109
  12.  
  13. Array length: 50000
  14.   Subrange: 1..1024
  15.     Sorting: 47
  16.   Subrange: 1..65536
  17.     Sorting: 109
  18.   Subrange: 1..4194304
  19.     Sorting: 93
  20.   Subrange: 1..268435456
  21.     Sorting: 109
  22.   Subrange: 1..2147483647
  23.     Sorting: 109
  24.  
  25. Array length: 200000
  26.   Subrange: 1..1024
  27.     Sorting: 47
  28.   Subrange: 1..65536
  29.     Sorting: 156
  30.   Subrange: 1..4194304
  31.     Sorting: 110
  32.   Subrange: 1..268435456
  33.     Sorting: 125
  34.   Subrange: 1..2147483647
  35.     Sorting: 125
  36.  
  37. Array length: 1000000
  38.   Subrange: 1..1024
  39.     Sorting: 46
  40.   Subrange: 1..65536
  41.     Sorting: 93
  42.   Subrange: 1..4194304
  43.     Sorting: 140
  44.   Subrange: 1..268435456
  45.     Sorting: 140
  46.   Subrange: 1..2147483647
  47.     Sorting: 140
  48.  
  49. Array length: 5000000
  50.   Subrange: 1..1024
  51.     Sorting: 46
  52.   Subrange: 1..65536
  53.     Sorting: 78
  54.   Subrange: 1..4194304
  55.     Sorting: 171
  56.   Subrange: 1..268435456
  57.     Sorting: 140
  58.   Subrange: 1..2147483647
  59.     Sorting: 140
  60.  
Title: Re: Removing duplicates from an array
Post by: 440bx on May 31, 2022, 12:09:59 am
@avk,

There is one thing I don't understand in the results you posted, it is: the (I presume) time taken for 5 million elements is the same as the time taken for 1 million elements.  I understand sorting time isn't linear but, it should take longer to sort 5 million elements than 1 million.

Am I misinterpreting the results you posted ?
Title: Re: Removing duplicates from an array
Post by: avk on May 31, 2022, 06:50:36 am
If you look inside the benchmark code, you can see that the same number of elements (10M) is processed each time for all array lengths.
BTW in this case (TGOrdinalArrayHelper) the sorting time should be exactly linear.
Title: Re: Removing duplicates from an array
Post by: SymbolicFrank on May 31, 2022, 08:45:36 am
If you want to keep the order as it is, it's probably still simpler to first create an index, then sort and remove the duplicates and last to use the index to put it back in the original order.
Title: Re: Removing duplicates from an array
Post by: BobDog on May 31, 2022, 01:20:57 pm

Here is my contribution remove duplicates by sorting.
Code: Pascal  [Select][+][-]
  1.  
  2. uses
  3. sysutils;
  4.  
  5.  //{$rangeChecks on}
  6.  
  7.  
  8. generic Procedure QuickSort<T>(Var arr:array of T;Start:Int32;Finish:Int32);
  9. Var
  10.   i,j: Int64;
  11.   x,temp:T;
  12. Begin
  13.   i := Start ;
  14.   j := finish;
  15.   x := arr[(Start+finish) Div 2];
  16.   While I <= J Do
  17.     Begin
  18.       While arr[I] < x Do
  19.         i:=i+ 1;
  20.       While arr[J] > x Do
  21.         j:=j- 1;
  22.       If I<=J Then
  23.         Begin
  24.           temp := arr[j];
  25.           arr[j] := arr[i];
  26.           arr[i] := temp;
  27.           I:= i+1;
  28.           J:= j- 1;
  29.         End;
  30.     End;
  31.   If J > Start Then specialize QuickSort<T>(arr,Start,J);
  32.   If I < Finish Then specialize QuickSort<T>(arr,I,Finish);
  33. End;
  34.  
  35. generic function RemoveDuplicates<T,Q>(arr:T):T;
  36. var i,count:int32;
  37. var ret:T;
  38. begin
  39. specialize quicksort<Q>(arr,0,high(arr));
  40. count:=0;
  41. setlength (ret,length(arr));
  42. ret[0]:=arr[0];
  43. for i:=0 to high(arr)-1 do
  44.   begin
  45.    if arr[i]<>arr[i+1] then
  46.    begin
  47.     count:=count+1;
  48.     ret[count]:=arr[i+1];
  49.    end;
  50. end;
  51. setlength(ret,count+1);
  52. exit(ret);
  53. end;
  54.  
  55. function range(mymin:int32;mymax:int32):int32;
  56.    begin
  57.    range:=trunc(int((Random*(Mymax-mymin+1)))+MyMin);
  58.   end;
  59.  
  60. type aoi=array of int32;
  61. var
  62. ans,i:aoi;
  63. j:int32;
  64. t:int64;
  65.  
  66.  
  67. begin
  68. ans:=nil;
  69. i:=nil;
  70. setlength(i,5000000);
  71. for j:=0 to high(i) do i[j]:=range(1,1024);
  72. writeln('Please wait');
  73. t:=gettickcount64;
  74. ans:=specialize RemoveDuplicates<aoi,int32>(i);
  75. writeln(' Time taken = ',gettickcount64-t);
  76. for j:=low(ans) to high(ans) do write(ans[j],' ');
  77. writeln;
  78. writeln('Press return to end . . .');
  79. readln;
  80. end.
  81.  

It is easy to implement this way, but for indexing and returning to proper positions I'll have to think about.
Title: Re: Removing duplicates from an array
Post by: SymbolicFrank on May 31, 2022, 01:36:20 pm
Alternatively, you can sort the index. It amounts to the same thing.
Title: Re: Removing duplicates from an array
Post by: avk on June 05, 2022, 12:15:24 pm

Here is my contribution remove duplicates by sorting.
  ...
It is easy to implement this way, but for indexing and returning to proper positions I'll have to think about.

Maybe the simplest way:

It is clear that in order for the performance of this method to compete with the  hashset, both sorts must be very fast.
Title: Re: Removing duplicates from an array
Post by: jamie on June 05, 2022, 02:52:08 pm
you realize of course one could simply use a HASH table which allows for faster interactions.

Just sort the localized hashes.
Title: Re: Removing duplicates from an array
Post by: Thaddy on June 05, 2022, 03:57:22 pm
you realize of course one could simply use a HASH table which allows for faster interactions.

Just sort the localized hashes.
Sorting hashes is utter nonsense. Ignore that.
Why?
Because hashes have no natural order, just a unique value, which means the sort has no meaning at all.
There is no relationship between the two, except that hashes are already sorted by nature..
Unique should be taken with a pinch of salt: accidents will happen. Elvis Costello is right.
Sorting on hash and then using it is O(n) so futile compared to real sorts.

https://www.youtube.com/watch?v=Ae_bdGihxy8
And:
https://en.wikipedia.org/wiki/Hashed_array_tree
Title: Re: Removing duplicates from an array
Post by: avk on June 05, 2022, 05:00:03 pm
you realize of course one could simply use a HASH table which allows for faster interactions.
...

Are you one of those who only reads the last post in a thread?

...
Just sort the localized hashes.

Funny approach, how would you implement it?
Title: Re: Removing duplicates from an array
Post by: jamie on June 05, 2022, 07:37:37 pm
Some of you are funny, have little common sense and yet you want to bite the hand that feeds you.

 Bad planning starts an app off in the wrong direction.

 Have fun sobbing in your sorrows while others like myself have no issues generating blazing fast execution code.

 what a sad state of affairs.


Title: Re: Removing duplicates from an array
Post by: avk on June 05, 2022, 07:44:25 pm
Amen.
Title: Re: Removing duplicates from an array
Post by: Thaddy on June 05, 2022, 08:32:50 pm
what a sad state of affairs.
Because you can't read? You got at least two proper answers.
And Amen.
Title: Re: Removing duplicates from an array
Post by: jamie on June 06, 2022, 01:41:51 pm
 Of all people @Thaddy, you should be last one stepping up to the plate.

 King of bloat and when some bloat does not work as desired add some more bloat on top of that bloat.


 The vicious cycle continues.

 Good job @Thaddy. Hope you get some more blinded sided followers to add to your crusade.

 There are some that are simply unemployable, meanwhile I need to get back to work and make some real headway.

   Happy Bloating.
Title: Re: Removing duplicates from an array
Post by: avk on June 06, 2022, 02:35:27 pm
It's a shame that the topic ceases to be constructive due to the efforts of participants who have contributed nothing but empty ranting to it.
Title: Re: Removing duplicates from an array
Post by: Thaddy on June 06, 2022, 09:01:54 pm
I am never bloating. You can't read. And where appropiate I always provide a compilable example.
IOW you are fired! in the unlikely case you ever worked for me. I know a rat compared to a rabbit.

Shut this.
Title: Re: Removing duplicates from an array
Post by: SymbolicFrank on June 07, 2022, 11:41:15 am
Turning them into indexed hashes and sorting those sounds good to me. Depending on how you sort, comparing them is often the thing that takes the most time and with hashes you only have to read the data once.
Then again, if the entries are long sentences, finding the first difference might be faster. I can think of edge cases either way.
Title: Re: Removing duplicates from an array
Post by: Joanna on June 16, 2022, 03:27:49 am
I understand that knowing how to remove duplicates from an array has academic merit but in real life programming wouldn’t it be better to work on the problem of duplicate values getting into the array to begin with? It seems like no matter what you do it’s going to take a lot of extra time and effort to undo the duplicates in array.
Has there ever been considered an Function indexof(Avalue) to detect duplicates of an array or maybe a flag to not allow duplicates? If seen this for various other things so it seems like it should be possible to implement them for arrays.
Title: Re: Removing duplicates from an array
Post by: Ten_Mile_Hike on August 01, 2022, 12:17:06 am
Code: Text  [Select][+][-]
  1. Wow,
  2.  Hash tables, Log n ( time analysis), collections, helpers, generics, sorting procedures...
  3. You guys are way over my level of ability. Nevertheless; here is how I would detect and
  4. eliminate duplicates in an array and YES; my way is probably inefficient in both memory
  5. and processor utilization, but it is just my way.
  6.  
  7. =============pseudo code=================
  8. var
  9.   Array_dups[0..9] of integer=[8,6,9,11,6,22,3,8,6];
  10.   Array_Fixed[0..9999] of integer; {Declare *expected* max range}
  11.   x:integer;
  12.  
  13. Begin
  14.    Initialize_All_Elements_to_Zero (array_Fixed);
  15.    for x:=0 to High(Array_Dups) Do
  16.      Array_Fixed[Array_Dups[x]]
  17.            :=  Array_Fixed[Array_Dups[x]]+1;
  18.  
  19. //////////////////////////////////////////////////////////////////////////////
  20.  At this point the elements of Array_fixed indicate the following
  21.          all elements  =0 did not occur in Array_Dups
  22.          all elements  =1 occurred only once in Array_Dups
  23.          all elements  >1 had duplicates in Array_Dups
  24.  
  25. It is now trivial to assign the values of Array_Fixed back into
  26. Array_Dups or manipulate or assign them in any manner that you choose
  27.  
  28.  
TinyPortal © 2005-2018