
Author Topic: How to remove duplicate entries from very large text files  (Read 24400 times)

marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 11453
  • FPC developer.
Re: [SOLVED] How to remove duplicate entries from very large text files
« Reply #15 on: September 12, 2016, 04:52:05 pm »
compile using 64-bit ? :-)

Gizmo

  • Hero Member
  • *****
  • Posts: 831
Re: [SOLVED] How to remove duplicate entries from very large text files
« Reply #16 on: September 13, 2016, 04:09:55 pm »
Gents

> compile using 64-bit ? :-)

I have a couple of packages that seem to be "uncompilable" with 64-bit compilation enabled. I have the cross-compiler add-on installed and have tried switching the compiler options to Win64 and x64, but the CSVDocument package fails, along with something else. That may still be the solution, though, if I can work out how to get it to compile.

I noticed that in HL.Create there's eventually a call to SetHashCapacity that contains the line:

Code: [Select]
ReallocMem(FHashTable, FHashCapacity * sizeof(integer));

Does that not impose a 32-bit limit there? If it were Int64 instead, maybe that would be better. I tried changing it and recompiling with no change (same problem), but then I'm not sure whether the unit itself was recompiled with the new setting (are all units recompiled when you press F9?).

How can I use Lazarus to tell me exactly how much memory the hash list itself is using, or what limit it has hit? I have tried outputting HL.Capacity, which gives a count of the entries when the source data doesn't cause an out-of-memory crash. It seems to be the ingestion of data into the hash list that causes the memory error (it never completes that step), so I don't think it's the querying of the list. I have attached some of the debug data that gets thrown up; do any of you know what it means? I've never understood that output (I know it's assembly jump instructions and so on, but I don't know what it actually tells me).

« Last Edit: September 13, 2016, 04:45:09 pm by Gizmo »

Fungus

  • Sr. Member
  • ****
  • Posts: 353
Re: How to remove duplicate entries from very large text files
« Reply #17 on: September 13, 2016, 07:28:14 pm »
If FHashCapacity is a signed 32-bit value (like Integer), then the size of the table is limited to 2^31 entries (2,147,483,648), since bit 32 is used for the sign. If FHashCapacity is unsigned (e.g. Cardinal), the limit doubles. But if your code is restricted to 32-bit compilation, it will not be able to address more than just under 4 GB of memory anyway (I don't know exactly how FPC manages memory, though).

You should consider doing the trick of segmenting your large files into smaller chunks, as described above. It may not be a clean solution, but it would most likely steer you clear of these problems and give you an opportunity to sort the entries for faster searching as well. It might even make hash tables unnecessary, since the segmented word lists may fit into memory :-)
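Just to illustrate the splitting step, something along these lines could work (a rough, untested sketch; the chunk count and the file names are made up). By splitting on a hash of the line modulo the chunk count, identical lines always land in the same chunk file, so each chunk can be de-duplicated in memory on its own and the results simply concatenated:

Code: Pascal  [Select]
program SplitIntoChunks;
{$mode objfpc}{$H+}{$Q-}
uses
  SysUtils;
const
  ChunkCount = 16;   // arbitrary; choose it so one chunk comfortably fits in RAM
var
  InF: TextFile;
  Chunks: array[0..ChunkCount - 1] of TextFile;
  Line: string;
  i: Integer;
  h: Cardinal;
begin
  // Open one output file per chunk
  for i := 0 to ChunkCount - 1 do
  begin
    AssignFile(Chunks[i], 'chunk' + IntToStr(i) + '.txt');
    Rewrite(Chunks[i]);
  end;
  AssignFile(InF, 'biglist.txt');   // made-up input file name
  Reset(InF);
  while not EOF(InF) do
  begin
    ReadLn(InF, Line);
    // Cheap string hash; only used to decide which chunk the line belongs to,
    // so identical lines always end up in the same chunk file.
    h := 0;
    for i := 1 to Length(Line) do
      h := h * 31 + Ord(Line[i]);
    WriteLn(Chunks[h mod ChunkCount], Line);
  end;
  CloseFile(InF);
  for i := 0 to ChunkCount - 1 do
    CloseFile(Chunks[i]);
end.

De-duplicating each chunk file separately and appending the results then gives the same output as de-duplicating the whole file in one go.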

Gizmo

  • Hero Member
  • *****
  • Posts: 831
Re: How to remove duplicate entries from very large text files
« Reply #18 on: September 14, 2016, 01:24:39 pm »
Well, interestingly, I have conducted some further tests.

I copied my program to one of my dedicated workstations (instead of using a VM) that has dozens of GB of RAM free (94 GB altogether!). I then ran my program over a 2 GB source file, which generated a 150 MB list file, and deduplication completed OK on this system, using at most 1.2 GB of RAM for the deduplication via hash lists. So that was fairly encouraging: at least it works with reasonably sized input on machines with more than a couple of GB of RAM free (whereas my VM only has just over 1 GB free).

So I then tried it on the same PC with an 8 GB source file. Generating the initial output worked very efficiently thanks to the use of byte arrays (which rvk suggested to me ages ago); it uses a mere 18 MB all the way through. Then comes the de-duplication, which, as you can see, climbed to just over 1.3 GB of RAM before it crashed with an out-of-memory error. Yet the PC has about 40 GB free of its 94 GB (see resources screenshot).

Now, if the crash were happening nearer the 2 GB or 4 GB limits, I'd assume it was some kind of 32-bit limitation of one of the hash list data types, as you point out, Fungus. But 1.3 GB is a strange limit.

So, long story short: with source data of 2 GB or 3 GB it works OK and can deduplicate the output. With anything bigger, the initial output is fine, but deduplication fails with a memory error as soon as 1.3 to 1.4 GB of RAM is occupied.

I wonder whether more memory-efficient hash lists exist for Free Pascal aside from TFPHashList.
« Last Edit: September 14, 2016, 01:30:27 pm by Gizmo »

Fungus

  • Sr. Member
  • ****
  • Posts: 353
Re: How to remove duplicate entries from very large text files
« Reply #19 on: September 14, 2016, 01:40:36 pm »
In my experience you cannot fully rely on the readings in the process list. Please try opening the Resource Monitor at the bottom of the Performance tab; there you can see detailed information about memory usage for individual processes. You must also be aware that the system itself may limit the amount of memory available to a program. So even if there is 40 GB of free memory, a single process may not be able to acquire all of it, since that could cause a cascade of out-of-memory failures in other programs and system processes, leading to a system that can no longer function, i.e. a crash.

http://www.freepascal.org/~micha/fpc/faq.html#general-heap

EDIT: You may want to check how much heap memory is available inside your program and see whether that is consistent with the point at which you receive the out-of-memory message. Please note that the value of "TotalAddrSpace" may change as memory usage grows: http://www.freepascal.org/docs-html/rtl/system/theapstatus.html
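As a sketch (untested), something like this could be called just before the de-duplication loop and again inside it every so often; in a GUI program you would send the output to a log file or memo rather than WriteLn:

Code: Pascal  [Select]
// Dump the heap counters from the System unit (see THeapStatus in the docs above).
procedure LogHeap(const Where: string);
var
  hs: THeapStatus;
begin
  hs := GetHeapStatus;
  WriteLn(Where,
    ': TotalAddrSpace=', hs.TotalAddrSpace div 1024, ' KB',
    ' TotalAllocated=', hs.TotalAllocated div 1024, ' KB',
    ' TotalFree=', hs.TotalFree div 1024, ' KB');
end;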
« Last Edit: September 14, 2016, 02:18:42 pm by Fungus »

Gizmo

  • Hero Member
  • *****
  • Posts: 831
Re: How to remove duplicate entries from very large text files
« Reply #20 on: September 16, 2016, 11:53:52 am »
Gents

Can any of you see memory inefficiencies in this code? Is there a better way I can write it to prevent this RAM hogging during the HL.Find cycle? Once I get this done, we can consider the matter solved! But currently it's still unsolved, because unless the source data is small enough for the hash list to operate within the 1 GB or so of free RAM, the solution doesn't really work.

Code: [Select]
...
var
 TmpForDeDuplicating : TextFile;
 fs : TFileStream;
 HL : TFPHashList;
 // TFPHashlist uses SHORTSTRINGS!!! Max length 255.
 ptr : ^shortstring;
 printableword, s : shortstring;
 i : integer;
...
begin
...
 try
    HL := TFPHashList.Create;
    try
      FS := TFileStream.Create(BigTextFile+'-DeDuplicated', fmCreate);
      try
        AssignFile(TmpForDeDuplicating, BigTextFile);
        Reset(TmpForDeDuplicating);

        while not EOF(TmpForDeDuplicating) do
          begin
            // Read each line of the output file and see if the word is
            // in the hash list already, and add it if it isn't
            readln(TmpForDeDuplicating, printableword);
           
            // Check length does not exceed the shortstring max of 255.
            // If it does, discard it and move on. Otherwise, add it to the list
            // only if it is not already in the list (thus de-duplicating).
            if Length(printableword) >= 255 then continue;
            if HL.Find(printableword) = nil then
              begin
                new(ptr);
                ptr^ := printableword;
                HL.Add(ptr^, ptr);  // *** RAM usage increases in this part by about 150Mb p\s until system crashes, unless source data small enough for program to succeed within 1Gb of RAM usage ***
              end;
          end;

        // Now output the unique words to a new wordlist file
        for i := 0 to HL.Count -1 do
          begin
            ptr := HL[i];
            s := ptr^ + #13#10;
            FS.Write(s[1], Length(s));
            dispose(ptr);  // *** RAM usage rapidly gets deallocated here, on those occasions where the loop above manages to complete.
          end;
      finally
        // Free temp output file
        CloseFile(TmpForDeDuplicating);
        lblCompletionNotification.Caption:= 'De-duplication of source file complete!';
      end;
    finally
      FS.Free;
    end;
  finally
    HL.Free;
  end;
...

Then the original file containing duplicates is deleted and the de-duplicated file is renamed to the original filename.

« Last Edit: September 16, 2016, 12:22:51 pm by Gizmo »

Fungus

  • Sr. Member
  • ****
  • Posts: 353
Re: How to remove duplicate entries from very large text files
« Reply #21 on: September 16, 2016, 12:07:14 pm »
Quote from: Gizmo
Can any of you see memory inefficiencies in this code? Is there a better way I can write it to prevent this RAM hogging during the HL.Find cycle?

Code: Pascal  [Select]
...
            new(ptr);
            ptr^ := printableword;
            HL.Add(ptr^, ptr); // Here you are storing both printableword as a shortstring and a pshortstring
...



With the code above you are using twice the memory required to check for duplicates. That is not very efficient when memory usage is of concern.
« Last Edit: September 16, 2016, 12:18:24 pm by Fungus »

Gizmo

  • Hero Member
  • *****
  • Posts: 831
Re: How to remove duplicate entries from very large text files
« Reply #22 on: September 16, 2016, 12:24:07 pm »
Quote
you are using twice the memory required to check for duplicates. That is not very efficient

Can you think of a better way of doing it, by any chance?

engkin

  • Hero Member
  • *****
  • Posts: 3112
Re: How to remove duplicate entries from very large text files
« Reply #23 on: September 16, 2016, 12:30:12 pm »
Use NameOfIndex.

Edit:
Set the Capacity at the beginning.
Use FindIndexOf instead of Find.

As a side note, consider using something like BufStream to increase the speed of reading.
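Roughly like this, keeping the rest of your code unchanged (an untested sketch; the capacity value is just a placeholder for whatever upper bound you expect, and nil is stored as the item because all lookups go through FindIndexOf):

Code: Pascal  [Select]
  HL.Capacity := 10000000;            // reserve space up front instead of growing repeatedly

  while not EOF(TmpForDeDuplicating) do
  begin
    readln(TmpForDeDuplicating, printableword);
    if Length(printableword) >= 255 then continue;
    // FindIndexOf searches on the stored name, so no pshortstring allocation is needed
    if HL.FindIndexOf(printableword) = -1 then
      HL.Add(printableword, nil);     // the list stores the key itself
  end;

  for i := 0 to HL.Count - 1 do
  begin
    s := HL.NameOfIndex(i) + #13#10;  // read the stored key back out
    FS.Write(s[1], Length(s));
  end;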
« Last Edit: September 16, 2016, 12:48:46 pm by engkin »

rvk

  • Hero Member
  • *****
  • Posts: 6163
Re: How to remove duplicate entries from very large text files
« Reply #24 on: September 16, 2016, 12:38:12 pm »
I don't think TFPHashList is the way to go if you already run into memory trouble with it on very large files.

TFPHashList saves a shortstring for every hash value and I don't see a way around that.
How many words do these 3 GB+ files contain, anyway?

It might be better (though less efficient speed-wise) to save the hash in a table together with an index/position of the word in the original file. You would need to use a TStream to read the file in that case. When you create a hash for a new word, you check whether it is already in the (sorted) list. If not, add it to the list with its position within the file. If it hits a duplicate hash, read (all) the duplicates and check whether the word at that position in the file is the same. If it is the same, go back to the correct position and read on. If there is no real duplicate, add it to the (sorted) list, after which you also go back to the correct position.

With this method you only keep the hashes and positions in a sorted list (which is much more memory-efficient), but when hitting a duplicate hash you need to re-read that word from the file to make sure it really is a duplicate.

The other, already mentioned, method would be a small database with index.
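To sketch the idea (untested; the field sizes are only a suggestion), each entry in the sorted list would only need a hash and a file position, roughly 12 bytes instead of a 256-byte shortstring:

Code: Pascal  [Select]
type
  // One entry per word seen so far: its hash plus the byte offset of the line
  // in the source TStream, so the word can be re-read and compared whenever
  // two different words happen to share the same hash.
  THashEntry = record
    Hash: Cardinal;    // or a 64-bit hash to make collisions rarer
    FilePos: Int64;
  end;
  THashEntryArray = array of THashEntry;

// Binary search in an array kept sorted by Hash. Returns the index of the
// first entry with that hash, or -1 if the hash is not present yet.
function FindHash(const A: THashEntryArray; Count: Integer; H: Cardinal): Integer;
var
  Lo, Hi, Mid: Integer;
begin
  Result := -1;
  Lo := 0;
  Hi := Count - 1;
  while Lo <= Hi do
  begin
    Mid := (Lo + Hi) div 2;
    if A[Mid].Hash < H then
      Lo := Mid + 1
    else if A[Mid].Hash > H then
      Hi := Mid - 1
    else
    begin
      Result := Mid;
      Hi := Mid - 1;   // keep looking left: there may be several entries with this hash
    end;
  end;
end;

On a duplicate hash you would then Seek the TStream to each candidate's FilePos, read the word back and compare it, as described above.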

Martin_fr

  • Administrator
  • Hero Member
  • *
  • Posts: 9870
  • Debugger - SynEdit - and more
    • wiki
Re: How to remove duplicate entries from very large text files
« Reply #25 on: September 16, 2016, 12:48:21 pm »
Lists in fpc (including some internals in TFPHashList) grow by reallocation.

That is, if you have a list of size (actually capacity) X (e.g. 1 GB) and you need more, then growing usually means allocating X+n (e.g. 1.2 GB) in addition to the existing 1 GB, copying the data, and freeing the old memory.
So temporarily you need twice the memory.

But that isn't really the problem, unless you know the number of words in advance and have the memory to hold them all; in that case you can set the capacity of such lists by hand while the list is still empty.

-----------
Using shortstrings is not good either.
http://www.freepascal.org/docs-html/prog/progsu162.html
Code: Pascal  [Select]
A shortstring occupies as many bytes as its maximum length plus one.
AFAIK a shortstring (with no explicit max length) is 255 chars; that means 256 bytes for each word.

You can hack this, but that is always high risk.

Better to find or write a hash list/table that does not need shortstrings.

-----------
But in the end, you need to ask, what is the biggest file you might get.

If it can fit into memory, and you expect that it will keep that maximum size in future, then you can keep the current approach.

But if you need to parse a file with 20 GB of data, you must assume the worst case: that each word is unique (no duplicates at all). So you must store 20 GB of words in your hash. Add to that some overhead for organizing the storage (you may have more than a billion different words, so that could be significant overhead).

Even if a 64-bit system could handle this, it will do a lot of disk swapping and be slow.
Of course, if you use disk storage yourself, it will take time too, but you may be able to organize it more efficiently.

------------
An easy way is to use a database (using bulk insert statements).
Databases can do unique indexes, so they have all you need.

------------
If you do it yourself, you need to investigate how to build efficient indexes on disk.

You could keep a fixed-size hash table in memory, but store the bucket data in a file.

hashtable: array of int64
setlength(hashtable, 50_percent_of_mem_avail); // only once at start of your app
// real mem, no swap.

hashtable[integer_hash_value] := int64_pos_of_bucketData_in_indexfile

The bucketdata must be able to hold several entries for conflicts.
So the data in the file could be
- 8 bytes int 64: nil or previous bucketdata-pos in file
- 2 bytes word: length of string
- textdata

If you add an entry to your hash table, you append the data to the end of the file (after checking that it isn't a duplicate).
If there was previous data, the new entry points back to that previous data.


If you need to check whether words are in the index file, you can collect them and check them in batches. You can sort each batch by the bucketdata position in the file; that way you reduce the amount of seeking you need to do in the index file.
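As a rough sketch of the append step (untested; IndexFile, the -1 "empty slot" marker and the AWord parameter are my assumptions, and the in-memory table would be initialised to -1 at startup):

Code: Pascal  [Select]
uses
  Classes;

// Append one word to the on-disk bucket chain of its hash slot and update the
// in-memory table. Record layout in the index file, as described above:
//   8 bytes : position of the previous record in the same bucket (-1 = none)
//   2 bytes : length of the word
//   N bytes : the word itself
procedure AppendToBucket(IndexFile: TFileStream; var HashTable: array of Int64;
  Slot: Integer; const AWord: string);
var
  PrevPos, NewPos: Int64;
  Len: Word;
begin
  PrevPos := HashTable[Slot];            // -1 if this bucket was empty so far
  NewPos := IndexFile.Seek(0, soEnd);    // records are only ever appended
  IndexFile.WriteBuffer(PrevPos, SizeOf(PrevPos));
  Len := Length(AWord);
  IndexFile.WriteBuffer(Len, SizeOf(Len));
  if Len > 0 then
    IndexFile.WriteBuffer(AWord[1], Len);
  HashTable[Slot] := NewPos;             // the slot now points at the newest record
end;

Looking a word up then means following the chain of previous positions from HashTable[Slot], comparing the stored strings until a match or -1 is reached.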

Gizmo

  • Hero Member
  • *****
  • Posts: 831
Re: How to remove duplicate entries from very large text files
« Reply #26 on: September 16, 2016, 01:44:56 pm »
Urgh...I feel blue :-(

This all seems like terribly hard work; I didn't think it would be this difficult. So much work for an additional feature, for which there are other tools available that can do it anyway. The logical side of me is thinking "Life's too short; crack on with something else", but the programmer side of me is thinking that if I give up now, all this time will have been wasted.

Thanks for your suggestions rvk, engkin, Fungus, and Martin... I'll re-read them all a few times and think about how to proceed. I just wanted to make sure there was nothing obviously stupid I was doing wrong in my code segment. But by the looks of it, hash tables just don't scale to this volume of data. I thought there might be a way of ensuring fewer bytes were used, or that perhaps I could plump for a different hash type than whatever TFPHashList uses by default, e.g. if it were using SHA256, maybe I could have knocked it down to MD5 or even a basic CRC. But AFAIK you can't do that.

« Last Edit: September 16, 2016, 01:46:58 pm by Gizmo »

Fungus

  • Sr. Member
  • ****
  • Posts: 353
Re: How to remove duplicate entries from very large text files
« Reply #27 on: September 16, 2016, 02:18:54 pm »
How many words / lines would you estimate are in the largest of your files?

Gizmo

  • Hero Member
  • *****
  • Posts: 831
Re: How to remove duplicate entries from very large text files
« Reply #28 on: September 16, 2016, 03:43:46 pm »
Tens of thousands in some, hundreds of thousands in others, and low millions in others.

I have thought (potentially) of another way.

Assume input data in a text file as follows :

ABC
The cat and the dog
Mary had a little lamb
mywebsite.com
...
and so on

Rather than using a hash list, what about TListInt64 from the FGL unit, holding some kind of integer representation of each string, with TListInt64.IndexOf used to check whether a value has already been added? If IndexOf returns -1, it is not in the list, so use TListInt64.Add to add it, for each line of the source file.

The only snag is that you can't convert a string like 'ABC' to a regular integer (except using StrToInt for values that happen to be digits, like '1234'). But perhaps there is another way around this? Is there a way to store an integer representation of 'ABC' or 'Mary had a little lamb'? If I can do that, it might just work.

e.g.
Code: [Select]
uses
  ... FGL;

var
  IntList : TListInt64;

begin
  ... Initiate handles to input files and output streams etc. ...
  IntList := TListInt64.Create;
  while not EOF(InputFile) do
  begin
    readln(InputFile, printableword);
    if IntList.IndexOf(StrToInt(printableword)) = -1 then   // Or however you convert a non-digit string to an integer?
      IntList.Add(StrToInt(printableword));                  // Or however you convert a non-digit string to an integer?
  end;
  ...

and then write it out to a file with

Code: [Select]
  for i := 0 to IntList.Count - 1 do
  begin
    s := IntToStr(IntList.Items[i]);  // Or however you convert the integer value of a string back to a string
    FileStream.Write(s[1], Length(s));
  end;

Anyway, my point (if I can get around the problem of StrToInt only working for the characters 0-9) is that the outcome would be a 64-bit integer list in memory containing an integer value for each unique string in the source file (because only those not found in the list would be added to it, just as with the hash lists). Upon completion, a reverse conversion from integer to string could be done for each item in the list and the result written to a new file, which would be the de-duplicated output.

Probably all much slower, and possibly not even doable, but I think it would get around the memory issue. So, assuming people think this could work, is there a way to store an integer representation of a non-numerical string like 'Peter' or 'Hello World'? 'Hello World', for example, is 0x48 (72 decimal), 0x65 (101 decimal), 0x6C (108 decimal), 0x6C (108 decimal), 0x6F (111 decimal), 0x20 (32 decimal), 0x57 (87 decimal), 0x6F (111 decimal), 0x72 (114 decimal), 0x6C (108 decimal), 0x64 (100 decimal).
« Last Edit: September 16, 2016, 04:04:20 pm by Gizmo »

Fungus

  • Sr. Member
  • ****
  • Posts: 353
Re: How to remove duplicate entries from very large text files
« Reply #29 on: September 16, 2016, 04:04:02 pm »
Even with millions of lines/words it should really not be that big a challenge to make something work. Your idea of creating a list of 64-bit hashes/CRCs with TListInt64 could be a possibility. The remaining issue is that a 64-bit hash cannot be considered 100% unique, so you cannot be sure that all duplicates are removed, and you might even end up removing words that are not duplicates at all.

You can see a list of hash functions here and find something 64 bits in size: https://en.wikipedia.org/wiki/List_of_hash_functions. I've also seen someone take an MD5 digest and simply drop half of it to get a 64-bit hash; I'm not sure I'd recommend that, though.
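For what it's worth, a 64-bit FNV-1a hash is only a few lines of Pascal (untested sketch, using the standard FNV constants); the result fits in one Int64/QWord slot of the list, and as said you would still have to compare the actual words whenever two hashes collide:

Code: Pascal  [Select]
{$Q-}  // the multiplication is meant to wrap around
function FNV1a64(const S: string): QWord;
var
  i: Integer;
begin
  Result := QWord($CBF29CE484222325);             // FNV-1a 64-bit offset basis
  for i := 1 to Length(S) do
  begin
    Result := Result xor Ord(S[i]);
    Result := Result * QWord($00000100000001B3);  // FNV-1a 64-bit prime
  end;
end;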

I would still recommend the method of creating a sorted index of the output file - this may not perform that well, but it would be 100% accurate.

 
