Recent

Author Topic: How to remove duplicate entries from very large text files  (Read 19053 times)

Gizmo

  • Hero Member
  • *****
  • Posts: 719
Re: How to remove duplicate entries from very large text files
« Reply #30 on: September 16, 2016, 04:09:35 pm »
Fungus

Thanks...to clarify, I'm not referring to hashes here as such now - just using an integer list. I'm thinking that if we can work out a way of computing an integer value for a string that is reversable, then once it has been found in the input file once, it won't be added again. Lets pretend 'ABC' equalled 589654, for arguments sake, then next line read the program does, if it happens to also be 'ABC', it will compute '589654' for it, look it up in the list and know it's already been found and thus not add it again.

I guess I could always just use the MD5 or SHA-1 digests, and store the binary digests in the list but then there's the reversing issue - I'd still need the originating value, given that hashes are non-reversable (thus the use of HashLists in the first place). Collisions are an irritating news issue really. Engineering a hash collision is one thing, but over true data it is not at all common, and, for SHA-1, you're still looking at 1 in 1.46 trillion trillion trillion trillion to one odds of one random line of data having the same SHA-1 hash as another.
« Last Edit: September 16, 2016, 04:11:41 pm by Gizmo »
Lazarus 2.0.4 and fpc 3.0.4 - Linux Mint 19 LTS, Windows 10 64 and Mac OSX Catlina
Useful Page to remember : http://wiki.freepascal.org/Cross_compiling#From_Linux_x64_to_Linux_i386

rvk

  • Hero Member
  • *****
  • Posts: 4407
Re: How to remove duplicate entries from very large text files
« Reply #31 on: September 16, 2016, 04:18:46 pm »
The issue remaining will still be that a hash of 64bit cannot be considered 100% unique and therefore you cannot be sure that all duplicates are removed and you might even end up with the removal of words that are not duplicated.
Such number would be essentially the same as a hash.

I would still use a hash-number with this method. But I wouldn't create a separate index file. I would store the position with the hash-number (multiple same hash-numbers can exists so you would duplicate them because the position-numbers would be unique).

When you search that hash-list you would need to iterate all duplicate numbers and use a second filestream (of the same file) to reread the word in question to see if you're really dealing with a duplicate. That way you have the index in-memory (as pointers to the original file) and don't need to build a separate file for that.

Martin_fr

  • Administrator
  • Hero Member
  • *
  • Posts: 6721
  • Debugger - SynEdit - and more
    • wiki
Re: How to remove duplicate entries from very large text files
« Reply #32 on: September 16, 2016, 04:21:27 pm »
I'm thinking that if we can work out a way of computing an integer value for a string that is reversable, then once it has been found in the input file once, it won't be added again. Lets pretend 'ABC' equalled 589654, for arguments sake, then next line read the program does, if it happens to also be 'ABC', it will compute '589654' for it, look it up in the list and know it's already been found and thus not add it again.

1) if 'ABC', it will compute '589654', there may be other words, that also compute to '589654' . This can not be avoided (unless all words come from a known list, and never ever can there be any other word).
But you can store all words that compute the same, under that number. (separate with a #00 byte)

2) This is what a hash table is. This is what TFPHashList does.
You just need a version that doesnt force shortstrings.


rvk

  • Hero Member
  • *****
  • Posts: 4407
Re: How to remove duplicate entries from very large text files
« Reply #33 on: September 16, 2016, 04:22:35 pm »
I'm thinking that if we can work out a way of computing an integer value for a string that is reversable, then once it has been found in the input file once, it won't be added again. Lets pretend 'ABC' equalled 589654, for arguments sake, then next line read the program does, if it happens to also be 'ABC', it will compute '589654' for it, look it up in the list and know it's already been found and thus not add it again.
That would be brilliant if you can put it into an Int64. But how large are your biggest possible words???
No way you can fit a 20 letter word in an Int64.

For 36 letter/numbers, which is not even all letters, unicode?? you would need 7 bits.
(count them... 1,2,4,8,16,32,64. That's 7 bits.)
That's about 9 letters per 64bit. So your words can't exceed 9 characters. Do you have bigger words ??????
« Last Edit: September 16, 2016, 04:24:29 pm by rvk »

Gizmo

  • Hero Member
  • *****
  • Posts: 719
Re: How to remove duplicate entries from very large text files
« Reply #34 on: September 16, 2016, 04:58:35 pm »
Use NameOfIndex.

Edit:
Set the Capacity at the beginning.
Use FindIndexOf instead of Find.

As a side note, consider using something like BufStream to increase the speed of reading.

Engkin...nice and clear suggestion as always Engkin, thanks. But sadly, still no joy.

If I try using

Code: [Select]

if HashList.FindIndexOf(printableword)) = -1 then
begin
  //Add the values to the list etc
end;

the memory usage is still the same...it jumps up by 150Mb per second. And if I set HashList.Capacity to MaxInt or 2147483647, it almost immediately crashes reporting that the max capacity has been reached. So I take it back out and the program does start the deduplication for a few seconds before absorbing 1.3Gb RAM and then it crashes.

Any other ideas?
Lazarus 2.0.4 and fpc 3.0.4 - Linux Mint 19 LTS, Windows 10 64 and Mac OSX Catlina
Useful Page to remember : http://wiki.freepascal.org/Cross_compiling#From_Linux_x64_to_Linux_i386

rvk

  • Hero Member
  • *****
  • Posts: 4407
Re: How to remove duplicate entries from very large text files
« Reply #35 on: September 16, 2016, 05:13:15 pm »
Not tested and just something I just typed... (ugh...)
It could probably be made much more efficient with a generic record TList.
Also I have not sorted the TList so you'll need to iterate through the whole list.
Making it sorted would be much faster.

But hey... it's just a little concept:
(how does this do for big files??)

Code: Pascal  [Select][+][-]
  1. type
  2.   PMyHash = ^TMyHash;
  3.   TMyHash = record
  4.     HashValue: int64;
  5.     Position: int64;
  6.   end;
  7.  
  8.   TMyHashList = class(TList)
  9.   private
  10.     function Get(Index: integer): PMyHash;
  11.   public
  12.     destructor Destroy; override;
  13.     function Add(Value: PMyHash): integer;
  14.     property Items[Index: integer]: PMyHash read Get; default;
  15.   end;
  16.  
  17. function TMyHashList.Add(Value: PMyHash): integer;
  18. begin
  19.   Result := inherited Add(Value);
  20. end;
  21.  
  22. destructor TMyHashList.Destroy;
  23. var
  24.   i: integer;
  25. begin
  26.   for i := 0 to Count - 1 do
  27.     FreeMem(Items[i]);
  28.   inherited;
  29. end;
  30.  
  31. function TMyHashList.Get(Index: integer): PMyHash;
  32. begin
  33.   Result := PMyHash(inherited Get(Index));
  34. end;
  35.  
  36. function ReadLine(const Stream: TStream; var Line: string): boolean;
  37. var
  38.   RawLine: string;
  39.   ch: AnsiChar;
  40. begin
  41.   Result := False;
  42.   ch := #0;
  43.   while (Stream.Read(ch, 1) = 1) and (ch <> #13) and (ch <> #10) do
  44.   begin
  45.     Result := True;
  46.     RawLine := RawLine + ch;
  47.   end;
  48.   Line := RawLine;
  49.   if (ch = #13) then
  50.   begin
  51.     Result := True;
  52.     if (Stream.Read(ch, 1) = 1) and (ch <> #10) then
  53.       Stream.Seek(-1, soCurrent); // unread it if not LF character.
  54.   end;
  55. end;
  56.  
  57. procedure WriteLine(const Stream: TStream; Line: string);
  58. begin
  59.   Stream.Write(Line[1], Length(Line));
  60.   Stream.Write(LineEnding, Length(LineEnding));
  61. end;
  62.  
  63. function FPHash(const s: shortstring): longword;
  64. var
  65.   p, pmax: PChar;
  66. begin
  67. {$push}
  68. {$Q-}
  69.   Result := 0;
  70.   p := @s[1];
  71.   pmax := @s[length(s) + 1];
  72.   while (p < pmax) do
  73.   begin
  74.     Result := longword(longint(Result shl 5) - longint(Result)) xor longword(P^);
  75.     Inc(p);
  76.   end;
  77. {$pop}
  78. end;
  79.  
  80. procedure Deduplicate(BigTextFile: string);
  81. var
  82.   FOut, FIn, Idx: TFileStream;
  83.   Hash: TMyHashList;
  84.   RHash: TMyHash;
  85.   MyHash: PMyHash;
  86.   Line, Check: string;
  87.   Hs: integer;
  88.   AddValue: boolean;
  89. begin
  90.   FOut := TFileStream.Create(BigTextFile + '-DeDuplicated', fmCreate);
  91.   FIn := TFileStream.Create(BigTextFile, fmOpenRead or fmShareDenyWrite);
  92.   Idx := TFileStream.Create(BigTextFile, fmOpenRead or fmShareDenyWrite); // for reread
  93.   Hash := TMyHashList.Create;
  94.   try
  95.     RHash.Position := FIn.Position;
  96.     while ReadLine(FIn, Line) do
  97.     begin
  98.       RHash.HashValue := FPHash(Line);
  99.       AddValue := True;
  100.       for Hs := 0 to Hash.Count - 1 do
  101.       begin
  102.         if Hash.Items[Hs]^.HashValue = RHash.HashValue then
  103.         begin
  104.           Idx.Position := Hash.Items[Hs]^.Position;
  105.           if ReadLine(Idx, Check) then AddValue := (Line <> Check);
  106.           if not AddValue then break;
  107.         end;
  108.       end;
  109.       if AddValue then
  110.       begin
  111.         GetMem(MyHash, SizeOf(TMyHash));
  112.         MyHash^.HashValue := RHash.HashValue;
  113.         MyHash^.Position := RHash.Position;
  114.         Hash.Add(MyHash);
  115.         WriteLine(FOut, Line);
  116.       end;
  117.       RHash.Position := FIn.Position;
  118.     end;
  119.   finally
  120.     FOut.Free;
  121.     FIn.Free;
  122.     Idx.Free;
  123.     Hash.Free;
  124.   end;
  125. end;
« Last Edit: September 16, 2016, 05:45:58 pm by rvk »

Martin_fr

  • Administrator
  • Hero Member
  • *
  • Posts: 6721
  • Debugger - SynEdit - and more
    • wiki
Re: How to remove duplicate entries from very large text files
« Reply #36 on: September 16, 2016, 05:57:59 pm »
a bit more details on the idea from my last post

calchash(string): integer; // use the hash function from tfphashlist.

in dummy code

hope I didnt miss any error.

memory usage will be constant, all extra goes into files

Code: Pascal  [Select][+][-]
  1.   table_size = 100000000; // 100 million = 800 MB
  2.   table: array of int64;
  3.   setlength(table, table_size);
  4.  
  5.   foreach word_in_input do begin
  6.      h := calchash_word mod table_size;
  7.      known := table[h];
  8.  
  9.      if (p and $8000000000000000) = 0 then p := p - 1
  10.      // table was initialized with 0, so we store all pos to output incremented by 1
  11.  
  12.    
  13.      if known = -1 then begin
  14.        p = output.pos; // current pos in output
  15.        output.writeln(word);
  16.        table[h] := (p+1) and $7ffffffffffffff;  // reserve the top bit for other purpose / 63 bit are enough for file pos
  17.        continue; // next word
  18.     end;
  19.  
  20.      dup := false;
  21.      known_orig := known;
  22.  
  23.      // top bit indicates that there is more than one word for this value
  24.      while (not dup) and ((known & $8000000000000000) <> 0) do begin
  25.          extra_file.seek(known and $7ffffffffffffff, from_start); // from start
  26.          known_prev :=    extra_file.read_int64();
  27.          word_pos :=    extra_file.read_int64();
  28.          output_2nd_handle.seek(word_pos);
  29.          // read word from output
  30.          dup := word = word_from_output    
  31.          known := known_prev;
  32.      end;
  33.  
  34.      if not dup then begin
  35.         output_2nd_handle.seek(known, from_start);
  36.         // read word from output
  37.         dup := word = word_from_output    
  38.         output_2nd_handle.seek(-1, from_end); // to eol
  39.      end;
  40.  
  41.      if dup then continue;  // next word
  42.  
  43.      // write info to extra file for hashes with many words
  44.        p = output.pos; // current pos in output
  45.        output.writeln(word);
  46.        x := extra_file.pos;
  47.        extra_file.seek(-1, from_end); // go to eol // apend
  48.        extra_file.write_int64(known_orig);
  49.        extra_file.write_int64(p);
  50.        table[h] := x or $8000000000000000;  // pos in xtra file
  51.  
  52.   end
  53.  

Fungus

  • Sr. Member
  • ****
  • Posts: 352
Re: How to remove duplicate entries from very large text files
« Reply #37 on: September 16, 2016, 07:49:46 pm »
Fair enough, suggestions are rolling in.. So, here is my suggestion:

Code: Pascal  [Select][+][-]
  1. Program DeDupAndSort;
  2.  
  3. {$H+}
  4.  
  5. Uses Classes, SysUtils;
  6.  
  7. Const
  8.   CRLF : String = #13#10;
  9.  
  10. Type
  11.   PInt64 = ^Int64;
  12.  
  13.   //Helper class to read and write lines
  14.   TStringIO = Class Helper For TFileStream
  15.     Function ReadLine: RawByteString;
  16.     Procedure WriteLine(Const S: String);
  17.   End;
  18.  
  19. Function TStringIO.ReadLine: RawByteString;
  20. Var C: Char;
  21. Begin
  22.   //Helper class ReadLine
  23.   Result:= '';
  24.   While Position < Size Do Begin
  25.     Read(C, SizeOf(C));
  26.     If C = #10 Then Exit
  27.     Else If C <> #13 Then Result:= Result + C;
  28.   End;
  29. End;
  30.  
  31. Procedure TStringIO.WriteLine(Const S: String);
  32. Var L: Integer;
  33. Begin
  34.   //Helper class WriteLine
  35.   L:= Length(S);
  36.   If L > 0 Then Write(S[1], L);
  37.   Write(CRLF[1], Length(CRLF));
  38. End;
  39.  
  40. //Global variables
  41. Var
  42.    InFile, OutFile: TFileStream;
  43.    OutIndex: TList;
  44.    FileName, CurWord: String;
  45.    InsertAt: Integer;
  46.    Tick, MemUsed: Int64;
  47.  
  48. Procedure MakeRandomFile;
  49. Const cLetters : String = 'abcdefghijklmnopqrstuvxyzABCDEFGHIJKLMNOPQRSTUVXYZ';
  50.       cWordCount = 1000000;
  51.       cMinWordLen = 5;
  52.       cMaxWordLen = 30;
  53. Var W: String;
  54.     I, WL, WI: Integer;
  55. Begin
  56.   //Create a file of random words where every 10th word is duplicated
  57.   Randomize;
  58.   With TFileStream.Create(FileName, fmCreate) Do Try
  59.     For I:= 1 To cWordCount Do Begin
  60.       If I Mod 10 <> 0 Then Begin
  61.         W:= '';
  62.         WL:= Random(cMaxWordLen - cMinWordLen) + cMinWordLen;
  63.         For WI:= 1 To WL Do W:= W + cLetters[Random(Length(cLetters)-1)+1];
  64.       End;
  65.       WriteLine(W);
  66.     End;
  67.   Finally
  68.     Free;
  69.   End;
  70. End;
  71.  
  72. Function Lookup(Const S: String): Boolean;
  73. Var First, Last, Idx, Cmp: Integer;
  74.     LCase, CurStr: String;
  75. Begin
  76.   //Lookup "S" using the sorted index
  77.   Result:= False;
  78.   If OutIndex.Count = 0 Then Begin
  79.     //Empty, adjust insert position
  80.     InsertAt:= 0;
  81.     Exit;
  82.   End;
  83.  
  84.   //Lower case for case insensitivity
  85.   LCase:= LowerCase(S);
  86.  
  87.   //Adjust first and last element
  88.   First:= 0;
  89.   Last:= OutIndex.Count - 1;
  90.  
  91.   //This is a basic binary search routine, removing half of the entries
  92.   //for each lookup. This is extremely efficient for large data amounts
  93.   //but it requires the values to be sorted
  94.   While First <= Last Do Begin
  95.  
  96.     //Find index to test (middle of the search range)
  97.     Idx:= ((Last - First) Div 2) + First;
  98.  
  99.     //Load the element as lowercase from OutFile
  100.     OutFile.Position:= PInt64(OutIndex[Idx])^;
  101.     CurStr:= LowerCase(OutFile.ReadLine);
  102.  
  103.     //Compare elements
  104.     Cmp:= CompareStr(LCase, CurStr);
  105.  
  106.     //Handle comparison:
  107.     If Cmp < 0 Then Last:= Idx - 1 //If less, we cut away upper half
  108.     Else If Cmp > 0 Then First:= Idx + 1 //If greater, we cut away the lower half
  109.     Else Begin
  110.       //We have a match / duplicate
  111.       Result:= True;
  112.       Exit;
  113.     End;
  114.  
  115.   End;
  116.  
  117.   //Adjust the insert position of the tested string
  118.   If Cmp > 0 Then Inc(Idx);
  119.   InsertAt:= Idx;
  120.  
  121. End;
  122.  
  123. Procedure Insert(Const S: String);
  124. Var I: PInt64;
  125. Begin
  126.   //Insert a string to OutFIle and its index
  127.   New(I);
  128.   I^:= OutFile.Size;
  129.   OutFile.Position:= I^;
  130.   OutFile.WriteLine(S);
  131.   OutIndex.Insert(InsertAt, I);
  132. End;
  133.  
  134. Procedure ExportSorted;
  135. Var I: Integer;
  136. Begin
  137.   //Export a sorted version of the de-duplicated file
  138.   With TFileStream.Create(FileName + '_sort', fmCreate) Do Try
  139.     For I:= 0 To OutIndex.Count - 1 Do Begin
  140.       OutFile.Position:= PInt64(OutIndex[I])^;
  141.       WriteLine(OutFile.ReadLine);
  142.     End;
  143.   Finally
  144.     Free;
  145.   End;
  146. End;
  147.  
  148. Procedure FreeOutIndex;
  149. Var I: Integer;
  150. Begin
  151.   //Release the index
  152.   For I:= 0 To OutIndex.Count - 1 Do Dispose(PInt64(OutIndex[I]));
  153.   OutIndex.Free;
  154. End;
  155.  
  156. Begin
  157.  
  158.   //Create a file name
  159.   FileName:= ExtractFilePath(ParamStr(0)) + 'bigtextfile.txt';
  160.  
  161.   //If the file does not exists, generate it
  162.   If Not FileExists(FileName) Then MakeRandomFile;
  163.  
  164.   //Create input file, output file and index
  165.   InFile:= TFileSTream.Create(FileName, fmOpenRead);
  166.   OutFile:= TFileStream.Create(FileName + '_ddup', fmCreate);
  167.   OutIndex:= TList.Create;
  168.  
  169.   Try
  170.  
  171.     //Store memory used
  172.     MemUsed:= GetHeapStatus.TotalAllocated;
  173.  
  174.     //Store tick
  175.     Tick:= GetTickCount64;
  176.  
  177.     While InFile.Position < InFile.Size Do Begin
  178.  
  179.       //Read current word
  180.       CurWord:= InFile.ReadLine;
  181.  
  182.       //Lookup word and insert if unique
  183.       If Not Lookup(CurWord) Then Insert(CurWord)
  184.       Else WriteLn('Removing duplicate: ' + CurWord);
  185.  
  186.     End;
  187.  
  188.     //Show memory used
  189.     MemUsed:= GetHeapStatus.TotalAllocated - MemUsed;
  190.     WriteLn(Format('Total memory used is %d bytes', [MemUsed]));
  191.  
  192.     //Show time used
  193.     Tick:= GetTickCount64 - Tick;
  194.     WriteLn(Format('Duplicates removed in %.3f seconds', [Tick/1000.0]));
  195.  
  196.     //Export a sorted list of unique words and show speed
  197.     Tick:= GetTickCount64;
  198.     ExportSorted;
  199.     Tick:= GetTickCount64 - Tick;
  200.     WriteLn(Format('Export of sorted file completed in %.3f seconds', [Tick/1000.0]));
  201.  
  202.   Finally
  203.  
  204.     //Release da grease!
  205.     FreeOutIndex;
  206.     OutFile.Free;
  207.     InFile.Free;
  208.  
  209.   End;
  210.  
  211.   //Yep!!
  212.   WriteLn('Done');
  213.  
  214. End.

It will generate a file just shy of 20MB and remove the duplicates and output both an unsorted and a sorted version of the result, on my system this takes 304.284 seconds. Since no hashes or CRC's are used the output is 100% accurate.

Gizmo

  • Hero Member
  • *****
  • Posts: 719
Re: How to remove duplicate entries from very large text files
« Reply #38 on: September 16, 2016, 09:59:17 pm »
Just a very quick reply to say thanks so much you three for the awesome suggestions which I will try out on Sunday night (no time between now and then). It amazes me how you guys quickly come up with something like it's second nature when it's taken me a week of battles. It's really very kind of you. I will report back on the outcome.
Lazarus 2.0.4 and fpc 3.0.4 - Linux Mint 19 LTS, Windows 10 64 and Mac OSX Catlina
Useful Page to remember : http://wiki.freepascal.org/Cross_compiling#From_Linux_x64_to_Linux_i386

BeniBela

  • Hero Member
  • *****
  • Posts: 763
    • homepage
Re: How to remove duplicate entries from very large text files
« Reply #39 on: September 16, 2016, 11:18:59 pm »
You could speed that up with a Bloom filter

It is a probabilistic hash structure that detects almost all unique lines. Then you only need to check the remaining possible duplicates.

shobits1

  • Sr. Member
  • ****
  • Posts: 278
  • .
Re: How to remove duplicate entries from very large text files
« Reply #40 on: September 17, 2016, 02:17:18 am »
I have this idea but don't know if it is efficient.
how about creating a dictionary(or array) for all the words in the text file. then translate the text file using indexes from dictionary this will reduce the memory usage. now sort the items on the new file check for duplicate and remove them from original text file.

illustration:
original text (size: 118 bytes)
Lorem ipsum dolor sit amet
consectetur adipiscing elit.
Maecenas egestas purus mauris, vitae
consectetur adipiscing elit.

the dict will hold
1:Lorem 2:ipsum 3:dolor 4:sit 5:amet 6:consectetur 7:adipiscing 8:elit 9:. 10:Maecenas 11:egestas 12:purus 13:mauris 14:,  15:vitae

new file will be (key:value) or array (size: 35 bytes)
1:$0102030405
2:$06070809
3:$0A0B0C0D0E0F
4:$06070809

sort new file by value
1:$0102030405
2:$06070809
4:$06070809
3:$0A0B0C0D0E0F

delete all unique value will result
4:$06070809

read and write original text skipping remaining items will result.
1:$0102030405
2:$06070809
3:$0A0B0C0D0E0F

one thing thought, this will need to read the text file many times but it will save memory a lot when the text includes unicode characters.

what do you think.  %)
« Last Edit: September 17, 2016, 02:19:02 am by shobits1 »

Martin_fr

  • Administrator
  • Hero Member
  • *
  • Posts: 6721
  • Debugger - SynEdit - and more
    • wiki
Re: How to remove duplicate entries from very large text files
« Reply #41 on: September 17, 2016, 03:00:58 am »
I have this idea but don't know if it is efficient.
how about creating a dictionary(or array) for all the words in the text file. then translate the text file using indexes from dictionary this will reduce the memory usage. now sort the items on the new file check for duplicate and remove them from original text file.

1) Building the dictionary may be costly, as you must check each (sub-)word that you add, if it is already present (for that you must either keep the words in memory, or read the file over and over and over.)

2) The dictionary only saves space, if
  a) there are duplicate (sub-)words
  b) the words are longer than the resulting number (they will be if your numbers a of variable byte len, no padding)

3) You may still run out of memory

4) most important: If you are going to sort, you may sort the original list? what is the benefit of translating it first?

shobits1

  • Sr. Member
  • ****
  • Posts: 278
  • .
Re: How to remove duplicate entries from very large text files
« Reply #42 on: September 17, 2016, 08:05:55 am »
1- very True, but I'm hoping the words counts won't be big tens of thousands.
2.a- in file with gigabyte of string it bound to be many many duplicated words and if case-sensitiveness doesn't matter there'll be more.

2.b- no padding needed... and in most cases there'll be no lose in space at least for words with 4 chars when words count exceed 2^16 (if the file is unicode there will be huge memory reduction).

3- everything has limit.

4- already proposed by molly and the first thing that comes to my mind, but OP didn't take interest in it, maybe the file should not be sorted. also translating the file would result in binary file so small that makes using conventional sort method possible.

 

TinyPortal © 2005-2018