Removing duplicates from an array


Warfley:
Using (bit)sets is not a real alternative for large value ranges. The alternative by BobDog is in-place, but has O(n²) complexity and is therefore quite slow.
Therefore I would suggest a dynamic set data structure as an alternative: either a hash set, which has an average lookup complexity of O(1), or a sorted set (e.g. an RBT), which has a lookup time of O(log n) but the advantage of being easily pre-allocatable.

E.g.

--- Code: Pascal ---
uses
  Generics.Collections;

type
  TIntArray = array of Integer;
  TIntSet = specialize THashSet<Integer>;

function RemoveDuplicates(Data: TIntArray): TIntArray;
var
  Visited: TIntSet;
  WriteHead: SizeInt;
  Elem: Integer;
begin
  SetLength(Result, Length(Data));
  WriteHead := 0;
  Visited := TIntSet.Create;
  try
    for Elem in Data do
      if not Visited.Contains(Elem) then
      begin
        Result[WriteHead] := Elem;
        Inc(WriteHead);
        Visited.Add(Elem);
      end;
  finally
    Visited.Free;
  end;
  SetLength(Result, WriteHead);
end;

var
  Data: array of Integer = (5, 2, 1, 3, 5, 2, 3, 4, 1, 2);
  Elem: Integer;
begin
  Data := RemoveDuplicates(Data);
  for Elem in Data do
    WriteLn(Elem);
  ReadLn;
end.
--- End code ---

With a hash set, the average runtime complexity of this function should be O(n), making it quite efficient while not blowing up memory consumption the way a bitset would.
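For contrast, here is a minimal sketch of the other classic route: sort a copy, then compact adjacent duplicates. It runs in O(n log n) without any auxiliary set, but unlike the hash set version it does not preserve the original element order. The function name is just for illustration; the sort uses the TArrayHelper from Generics.Collections.

--- Code: Pascal ---
uses
  Generics.Collections;

type
  TIntArray = array of Integer;

// Sketch: dedup by sorting a copy, then compacting runs of equal values
// in place. O(n log n) time; element order is NOT preserved.
function RemoveDuplicatesBySort(const Data: TIntArray): TIntArray;
var
  I, WriteHead: SizeInt;
begin
  Result := Copy(Data);
  if Length(Result) < 2 then
    Exit;
  specialize TArrayHelper<Integer>.Sort(Result);
  WriteHead := 1;  // the first element is always kept
  for I := 1 to High(Result) do
    if Result[I] <> Result[WriteHead - 1] then
    begin
      Result[WriteHead] := Result[I];
      Inc(WriteHead);
    end;
  SetLength(Result, WriteHead);
end;
--- End code ---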

440bx:

--- Quote from: avk on May 23, 2022, 12:26:19 pm ---The bitmap for the above range (1..MaxInt) must be at least 2^28 bytes in size and must be initialized.
It doesn't look very practical, does it?

--- End quote ---
You got that right, I didn't say it was a small set  :D  but even a 32-bit process can afford the (approx.) 260 MB it will take.

I think the really critical part is the number of elements in the array.  For a large number of elements (say 200,000 to 300,000 or more), it might be reasonable to temporarily allocate a bitmap that large.  Honestly, I don't know where the cut-off is, but for smaller counts I have no doubt there are smaller and faster ways.
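To make that concrete, here is a minimal sketch of the bitmap idea, assuming all values are non-negative (0..MaxInt); the function name and the manual bit arithmetic are illustrative, not taken from anyone's posted code:

--- Code: Pascal ---
program BitmapDedup;
{$mode objfpc}{$H+}

type
  TIntArray = array of Integer;

// Sketch: one bit per possible value in 0..MaxInt, i.e. 2^31 bits
// = 2^28 bytes = 256 MiB, allocated up front. SetLength on a dynamic
// array zero-initializes it, so every bit starts out clear.
function RemoveDuplicatesBitmap(const Data: TIntArray): TIntArray;
var
  Seen: array of LongWord;
  Elem: Integer;
  Mask: LongWord;
  WriteHead: SizeInt;
begin
  SetLength(Seen, 1 shl 26);  // 2^26 LongWords * 32 bits = 2^31 bits
  SetLength(Result, Length(Data));
  WriteHead := 0;
  for Elem in Data do
  begin
    Mask := LongWord(1) shl (Elem and 31);
    if (Seen[Elem shr 5] and Mask) = 0 then
    begin
      Seen[Elem shr 5] := Seen[Elem shr 5] or Mask;  // mark value as seen
      Result[WriteHead] := Elem;
      Inc(WriteHead);
    end;
  end;
  SetLength(Result, WriteHead);
end;

var
  Data: TIntArray = (5, 2, 1, 3, 5, 2, 3, 4, 1, 2);
  Elem: Integer;
begin
  for Elem in RemoveDuplicatesBitmap(Data) do
    WriteLn(Elem);
end.
--- End code ---

The up-front 2^28-byte allocation is exactly the ~260 MB cost discussed above; the dedup loop itself is a single O(n) pass.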

BobDog:

--- Quote from: Warfley on May 23, 2022, 12:50:38 pm ---Using (bit)sets is not a real alternative for large value ranges. The alternative by BobDog is in-place, but has O(n²) complexity and is therefore quite slow.
...

--- End quote ---

Here are the results for 10 million elements.

BobDog
1250 milliseconds
1055 1059 1072 1085 1060 1086 1042 1062 1065 1038 1044 1030 1090 1005 1097 1027 1048 1079 1082 1053 1057 1039 1093 1084 1007 1034 1008 1002 1037 1096 1078 1014 1087 1098 1047 1080 1046 1052 1068 1011 1064 1058 1054 1095 1076 1010 1041 1026 1018 1074 1021 1013 1001 1032 1015 1061 1022 1091 1045 1036 1070 1006 1067 1017 1075 1031 1003 1099 1016 1100 1025 1024 1066 1019 1009 1088 1051 1028 1012 1029 1069 1092 1000 1040 1083 1071 1043 1077 1089 1073 1050 1023 1035 1081 1049 1020 1094 1063 1004 1056 1033



Warfley
219 milliseconds
1055 1059 1072 1085 1060 1086 1042 1062 1065 1038 1044 1030 1090 1005 1097 1027 1048 1079 1082 1053 1057 1039 1093 1084 1007 1034 1008 1002 1037 1096 1078 1014 1087 1098 1047 1080 1046 1052 1068 1011 1064 1058 1054 1095 1076 1010 1041 1026 1018 1074 1021 1013 1001 1032 1015 1061 1022 1091 1045 1036 1070 1006 1067 1017 1075 1031 1003 1099 1016 1100 1025 1024 1066 1019 1009 1088 1051 1028 1012 1029 1069 1092 1000 1040 1083 1071 1043 1077 1089 1073 1050 1023 1035 1081 1049 1020 1094 1063 1004 1056 1033

Test array: length 10000000, each element a random value in the range 1000..1100.




avk:

--- Quote from: 440bx on May 23, 2022, 01:18:17 pm ---...
I think the really critical part is the number of elements in the array.  For a large number of elements (say 200,000 to 300,000 or more), it might be reasonable to temporarily allocate a bitmap that large.  Honestly, I don't know where the cut-off is, but for smaller counts I have no doubt there are smaller and faster ways.

--- End quote ---

It is possible to roughly estimate the effect of range size and array length on performance.
I used LGenerics on a Win64 machine:

--- Code: Pascal ---
program proj;

{$mode objfpc}{$h+}
{$modeswitch advancedrecords}
{$maxstacksize $40000000}

uses
  SysUtils, Math, lgUtils, lgHashSet, lgHash, lgTreeSet;

var
  ArrayLen: Integer = 10000;
  RepCount: Integer = 1000;

const
  TryCount = 5;

type
  THelper = record
    class function HashCode(aValue: DWord): SizeInt; static; inline;
    class function Equal(L, R: DWord): Boolean; static; inline;
  end;

  generic TRangeTest<const UpBound: Integer> = record
  const
    UPPER = UpBound or Ord(UpBound = 0);
  type
    TElType           = 1..UPPER;
    TElArray          = array of TElType;
    TRemoveDuplicates = function(const a: array of TElType): TElArray;

    TTestFun = record
      Fun: TRemoveDuplicates;
      DisplayText: string;
    end;

    TBitSet      = specialize TGSet<TElType>;
    THashSetType = specialize TGLiteChainHashSet<TElType, THelper>;
    THashSet     = THashSetType.TSet;
    TTreeSet     = specialize TGLiteTreeSet<TElType, DWord>;

    class function  BitSet(const a: array of TElType): TElArray; static;
    class function  HashSet(const a: array of TElType): TElArray; static;
    class function  TreeSet(const a: array of TElType): TElArray; static;
    class procedure Execute; static;
  end;

class function THelper.HashCode(aValue: DWord): SizeInt;
begin
  Result := JdkHash(aValue);
end;

class function THelper.Equal(L, R: DWord): Boolean;
begin
  Result := L = R;
end;

class function TRangeTest.BitSet(const a: array of TElType): TElArray;
var
  s: TBitSet;
  I, J: SizeInt;
begin
  SetLength(Result, Length(a));
  J := 0;
  for I := 0 to High(a) do
    begin
      if a[I] in s then continue;
      s.Include(a[I]);
      Result[J] := a[I];
      Inc(J);
    end;
  SetLength(Result, J);
end;

{$warn 5089 off}
class function TRangeTest.HashSet(const a: array of TElType): TElArray;
var
  s: THashSet;
  I, J: SizeInt;
begin
  s.EnsureCapacity(Min(Ord(High(TElType)) - Ord(Low(TElType)) + 1, Length(a)));
  SetLength(Result, Length(a));
  J := 0;
  for I := 0 to High(a) do
    if s.Add(a[I]) then
      begin
        Result[J] := a[I];
        Inc(J);
      end;
  SetLength(Result, J);
end;

class function TRangeTest.TreeSet(const a: array of TElType): TElArray;
var
  s: TTreeSet;
  I, J: SizeInt;
begin
  s.EnsureCapacity(Min(Ord(High(TElType)) - Ord(Low(TElType)) + 1, Length(a)));
  SetLength(Result, Length(a));
  J := 0;
  for I := 0 to High(a) do
    if s.Add(a[I]) then
      begin
        Result[J] := a[I];
        Inc(J);
      end;
  SetLength(Result, J);
end;

class procedure TRangeTest.Execute;
var
  a, u: TElArray;
  CurrFun: TTestFun;
  I, J: SizeInt;
  BestScore, Score: QWord;
const
  TestFuns: array of TTestFun = (
    (Fun: @BitSet;  DisplayText: '    BitSet:  '),
    (Fun: @HashSet; DisplayText: '    HashSet: '),
    (Fun: @TreeSet; DisplayText: '    TreeSet: ')
  );
begin
  WriteLn('  Subrange: 1..', UPPER);
  SetLength(a, ArrayLen);
  for I := 0 to High(a) do
    a[I] := Random(UPPER) + 1;

  for CurrFun in TestFuns do
    begin
      BestScore := High(QWord);
      for I := 1 to TryCount do
        begin
          Score := GetTickCount64;
          for J := 1 to RepCount do
            u := CurrFun.Fun(a);
          Score := GetTickCount64 - Score;
          if Score < BestScore then
            BestScore := Score;
        end;
      WriteLn(CurrFun.DisplayText, BestScore);
    end;
end;

procedure RunTest;
begin
  WriteLn('Array length: ', ArrayLen);
  specialize TRangeTest<1024>.Execute;
  specialize TRangeTest<65536>.Execute;
  specialize TRangeTest<4194304>.Execute;
  specialize TRangeTest<268435456>.Execute;
  specialize TRangeTest<2147483647>.Execute;
  WriteLn;
end;

begin
  RunTest;
  ArrayLen := 50000; RepCount := 200;
  RunTest;
  ArrayLen := 200000; RepCount := 50;
  RunTest;
  ArrayLen := 1000000; RepCount := 10;
  RunTest;
  ArrayLen := 5000000; RepCount := 2;
  RunTest;
  ReadLn;
end.
--- End code ---
And I got something like this:

--- Code: Text ---
Array length: 10000
  Subrange: 1..1024
    BitSet:  0
    HashSet: 78
    TreeSet: 655
  Subrange: 1..65536
    BitSet:  15
    HashSet: 171
    TreeSet: 1342
  Subrange: 1..4194304
    BitSet:  124
    HashSet: 156
    TreeSet: 1388
  Subrange: 1..268435456
    BitSet:  10484
    HashSet: 156
    TreeSet: 1372
  Subrange: 1..2147483647
    BitSet:  80138
    HashSet: 171
    TreeSet: 1373

Array length: 50000
  Subrange: 1..1024
    BitSet:  0
    HashSet: 78
    TreeSet: 608
  Subrange: 1..65536
    BitSet:  46
    HashSet: 343
    TreeSet: 1684
  Subrange: 1..4194304
    BitSet:  47
    HashSet: 374
    TreeSet: 1919
  Subrange: 1..268435456
    BitSet:  2231
    HashSet: 374
    TreeSet: 1918
  Subrange: 1..2147483647
    BitSet:  16177
    HashSet: 374
    TreeSet: 1918

Array length: 200000
  Subrange: 1..1024
    BitSet:  0
    HashSet: 78
    TreeSet: 608
  Subrange: 1..65536
    BitSet:  31
    HashSet: 203
    TreeSet: 1638
  Subrange: 1..4194304
    BitSet:  31
    HashSet: 592
    TreeSet: 2777
  Subrange: 1..268435456
    BitSet:  718
    HashSet: 592
    TreeSet: 2823
  Subrange: 1..2147483647
    BitSet:  4227
    HashSet: 592
    TreeSet: 2808

Array length: 1000000
  Subrange: 1..1024
    BitSet:  0
    HashSet: 78
    TreeSet: 608
  Subrange: 1..65536
    BitSet:  15
    HashSet: 141
    TreeSet: 1622
  Subrange: 1..4194304
    BitSet:  47
    HashSet: 795
    TreeSet: 4836
  Subrange: 1..268435456
    BitSet:  312
    HashSet: 826
    TreeSet: 5070
  Subrange: 1..2147483647
    BitSet:  1046
    HashSet: 826
    TreeSet: 5070

Array length: 5000000
  Subrange: 1..1024
    BitSet:  0
    HashSet: 78
    TreeSet: 593
  Subrange: 1..65536
    BitSet:  15
    HashSet: 140
    TreeSet: 1591
  Subrange: 1..4194304
    BitSet:  62
    HashSet: 717
    TreeSet: 6489
  Subrange: 1..268435456
    BitSet:  234
    HashSet: 889
    TreeSet: 7628
  Subrange: 1..2147483647
    BitSet:  405
    HashSet: 889
    TreeSet: 7613
--- End code ---

440bx:

--- Quote from: avk on May 24, 2022, 03:46:10 pm ---It is possible to roughly estimate the effect of range size and array length on performance.

--- End quote ---
Thank you @avk.

From the results you posted, I'd estimate the threshold to be in the vicinity of 2 million elements.  That's quite a bit more than I initially thought.

That really shows the value of testing off-the-cuff estimates :)
