
Author Topic: How to remove duplicate entries from very large text files  (Read 24396 times)

Gizmo

  • Hero Member
  • *****
  • Posts: 831
How to remove duplicate entries from very large text files
« on: August 25, 2016, 09:01:03 pm »
I'm trying to work out how to tackle a problem.

I have a situation where very large text files (word lists, essentially) need to be read and the duplicates removed.

For small files of several MB, a TStringList could be used, but I'm not sure how to tackle this with large files of sometimes several GB. If the whole file cannot be loaded into memory, how does the program know what may lie ahead in the file? e.g. if the word "Peter" is found on line 5 but also on line 1,000,000,000, how can my program know about the entry on that billionth line when it reads the first one on line 5?

I'm guessing a TFileStream will be needed, but I'm still unsure how to tackle the problem efficiently without reading the whole file from start to end for every line in the file. At the moment, that's the only way I can think of: i.e. read a line from the stream, then check every other line to see if it equals the line being read; if not, leave it, and if it does, delete it. But that seems very inefficient to me.

I don't necessarily expect code, as I realise I haven't given you much to go on yet. I'm just after ideas from those with experience.

Ta
« Last Edit: September 13, 2016, 04:10:05 pm by Gizmo »

Phil

  • Hero Member
  • *****
  • Posts: 2737
Re: How to remove duplicate entries from very large text files
« Reply #1 on: August 25, 2016, 09:15:44 pm »
e.g. if the word "Peter" is found on line 5 but also on line 1,000,000,000, how can my program know about the entry on that billionth line when it reads the first one on line 5?

It can't, unless you've indexed the file, e.g., put it in a database.


molly

  • Hero Member
  • *****
  • Posts: 2330
Re: How to remove duplicate entries from very large text files
« Reply #2 on: August 25, 2016, 09:18:50 pm »
If sorting the list/file is acceptable, you could opt for an external merge sort.
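
To make that concrete, here is a rough sketch of an external merge sort that also drops duplicates during the merge. The chunk size and the 'words.txt' / 'chunk_%d.txt' file names are placeholder assumptions, and everything is compared byte-wise so that the chunk sort and the merge agree on the ordering.

Code: Pascal
program ExternalDedupSort;
{$mode objfpc}{$H+}
uses SysUtils, Classes;

const
  MaxLinesPerChunk = 1000000;   // assumption: tune to the memory you have available

// Byte-wise comparison, used for both the chunk sort and the merge so they agree.
function OrdinalCompare(List: TStringList; Index1, Index2: Integer): Integer;
begin
  Result := CompareStr(List[Index1], List[Index2]);
end;

// Phase 1: read the big file in chunks that fit in memory, sort each chunk, save it.
function SplitIntoSortedChunks(const InFile: string): Integer;
var
  Chunk: TStringList;
  Src: TextFile;
  Line: string;
  ChunkIdx: Integer;

  procedure FlushChunk;
  begin
    if Chunk.Count = 0 then Exit;
    Chunk.CustomSort(@OrdinalCompare);
    Chunk.SaveToFile(Format('chunk_%d.txt', [ChunkIdx]));
    Inc(ChunkIdx);
    Chunk.Clear;
  end;

begin
  ChunkIdx := 0;
  Chunk := TStringList.Create;
  try
    AssignFile(Src, InFile);
    Reset(Src);
    while not Eof(Src) do
    begin
      ReadLn(Src, Line);
      Chunk.Add(Line);
      if Chunk.Count >= MaxLinesPerChunk then
        FlushChunk;
    end;
    FlushChunk;
    CloseFile(Src);
  finally
    Chunk.Free;
  end;
  Result := ChunkIdx;
end;

// Phase 2: merge the sorted chunks; equal lines come out back to back, so writing
// a line only when it differs from the previously written one removes duplicates.
procedure MergeChunks(ChunkCount: Integer; const OutFile: string);
var
  Srcs: array of TextFile;
  Cur: array of string;
  Done: array of Boolean;
  Dst: TextFile;
  i, Best: Integer;
  LastWritten: string;
  HaveLast: Boolean;
begin
  SetLength(Srcs, ChunkCount);
  SetLength(Cur, ChunkCount);
  SetLength(Done, ChunkCount);
  for i := 0 to ChunkCount - 1 do
  begin
    AssignFile(Srcs[i], Format('chunk_%d.txt', [i]));
    Reset(Srcs[i]);
    Done[i] := Eof(Srcs[i]);
    if not Done[i] then ReadLn(Srcs[i], Cur[i]);
  end;
  AssignFile(Dst, OutFile);
  Rewrite(Dst);
  HaveLast := False;
  repeat
    Best := -1;   // pick the smallest current line over all chunks
    for i := 0 to ChunkCount - 1 do
      if (not Done[i]) and ((Best = -1) or (CompareStr(Cur[i], Cur[Best]) < 0)) then
        Best := i;
    if Best = -1 then Break;
    if (not HaveLast) or (Cur[Best] <> LastWritten) then
    begin
      WriteLn(Dst, Cur[Best]);
      LastWritten := Cur[Best];
      HaveLast := True;
    end;
    Done[Best] := Eof(Srcs[Best]);
    if not Done[Best] then ReadLn(Srcs[Best], Cur[Best]);
  until False;
  CloseFile(Dst);
  for i := 0 to ChunkCount - 1 do CloseFile(Srcs[i]);
end;

begin
  MergeChunks(SplitIntoSortedChunks('words.txt'), 'words_unique.txt');
end.

Each chunk only has to fit in memory on its own, and the merge keeps just one current line per chunk, which is why this scales to files much larger than RAM.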

Martin_fr

  • Administrator
  • Hero Member
  • *
  • Posts: 9867
  • Debugger - SynEdit - and more
    • wiki
Re: How to remove duplicate entries from very large text files
« Reply #3 on: September 10, 2016, 02:12:08 pm »
You create a hash-table with all the words.
https://en.wikipedia.org/wiki/Hash_table

For an example, see SynEdit:
unit SynPluginSyncroEdit;
class TSynPluginSyncroEditWordsHash



JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4468
  • I like bugs.
Re: How to remove duplicate entries from very large text files
« Reply #4 on: September 10, 2016, 02:46:02 pm »
You create a hash-table with all the words.
Yes, a hash table is the way to go. This is pseudo-code, not using a real hash class:
Code: Pascal
var Seen: TSomeHashListType;
...
  if not Seen[MyText] then
  begin
    // ... Do your thing with the first MyText instance ...
    Seen.Add(MyText);
  end;
Unfortunately the container classes are spread around the FPC and Lazarus libs, but there are many choices.

Note: the hash table does not store all lines in memory, only the unique ones. Thus 1,000,000,000 lines is no problem if most of them are duplicates.
A hash table has constant lookup time, O(1), so it shines with large amounts of data.
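
For a concrete (and deliberately tiny) illustration of that pseudo-code, the sketch below uses TDictionary from the Generics.Collections unit as the "Seen" set. It assumes an FPC recent enough to ship that unit; the hard-coded word list is only there to keep the demo self-contained, and any other hash-based container can fill the same role.

Code: Pascal
program SeenDemo;
{$mode objfpc}{$H+}
uses
  Generics.Collections;   // ships with recent FPC; other hash containers work too

type
  TSeenMap = specialize TDictionary<string, Boolean>;   // used as a set: only the keys matter

const
  Words: array[0..3] of string = ('Peter', 'Paul', 'Peter', 'Mary');

var
  Seen: TSeenMap;
  W: string;
begin
  Seen := TSeenMap.Create;
  try
    for W in Words do
      if not Seen.ContainsKey(W) then
      begin
        WriteLn(W);          // do your thing with the first instance only
        Seen.Add(W, True);   // remember it so later duplicates are skipped
      end;
  finally
    Seen.Free;
  end;
end.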
« Last Edit: September 10, 2016, 02:50:18 pm by JuhaManninen »
Mostly Lazarus trunk and FPC 3.2 on Manjaro Linux 64-bit.

Fungus

  • Sr. Member
  • ****
  • Posts: 353
Re: How to remove duplicate entries from very large text files
« Reply #5 on: September 10, 2016, 03:02:46 pm »
If the hash value is larger (in bytes) than the average word length, a hash table will not work. SHA-1 is 20 bytes, so if the average word length is less than 20 characters, a SHA-1 hash table would use more memory than the text file itself.

Martin_fr

  • Administrator
  • Hero Member
  • *
  • Posts: 9867
  • Debugger - SynEdit - and more
    • wiki
Re: How to remove duplicate entries from very large text files
« Reply #6 on: September 10, 2016, 03:13:21 pm »
If the hash value is larger (in bytes) than the average word length, a hash table will not work. SHA-1 is 20 bytes, so if the average word length is less than 20 characters, a SHA-1 hash table would use more memory than the text file itself.
You wouldn't use SHA-1 for a hash table. You would need 2^160 buckets for it.

Hash values would usually be integers in a limited range: maybe 100,000, maybe 5,000,000, maybe more or less... it depends on the data and the memory available.
buckets: array of BucketData;
SetLength(buckets, 100000);

Real-life hash tables also do not have perfect hashes, so you get duplicates, but that is no problem.
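
A minimal sketch of such a bucket table, just to illustrate the idea (this is not the SynEdit class; the bucket count and the FNV-1a hash are arbitrary choices): each bucket simply collects every word that happens to land on the same hash value, so a collision only costs a short linear scan within one bucket.

Code: Pascal
program BucketHashDemo;
{$mode objfpc}{$H+}
{$Q-}{$R-}   // let the hash arithmetic wrap around instead of raising errors

const
  BucketCount = 100000;   // chosen to match the data size and available memory

type
  TBucket = array of string;   // all words that happen to share one hash value

var
  Buckets: array of TBucket;

// Any cheap hash will do; this is FNV-1a reduced to the bucket range.
function HashOf(const S: string): Cardinal;
var
  i: Integer;
begin
  Result := 2166136261;
  for i := 1 to Length(S) do
    Result := (Result xor Ord(S[i])) * 16777619;
  Result := Result mod BucketCount;
end;

// True if S was stored before; otherwise remember it and return False.
function SeenBefore(const S: string): Boolean;
var
  H: Cardinal;
  i, n: Integer;
begin
  H := HashOf(S);
  for i := 0 to High(Buckets[H]) do
    if Buckets[H][i] = S then
      Exit(True);               // genuine duplicate word
  // New word (or a hash collision with a different word): extend this bucket.
  n := Length(Buckets[H]);
  SetLength(Buckets[H], n + 1);
  Buckets[H][n] := S;
  Result := False;
end;

begin
  SetLength(Buckets, BucketCount);
  WriteLn(SeenBefore('Peter'));   // FALSE: first time seen
  WriteLn(SeenBefore('Peter'));   // TRUE:  duplicate
end.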

Martin_fr

  • Administrator
  • Hero Member
  • *
  • Posts: 9867
  • Debugger - SynEdit - and more
    • wiki
Re: How to remove duplicate entries from very large text files
« Reply #7 on: September 10, 2016, 03:19:44 pm »
If you do have (or expect to have) more words than fit into memory, then you need to store your hash table on the filesystem.

Since that is slow, you may have to employ additional tricks, such as dividing the data and the hash table into smaller parts and loading/checking all combinations of the parts.

Gizmo

  • Hero Member
  • *****
  • Posts: 831
Re: How to remove duplicate entries from very large text files
« Reply #8 on: September 10, 2016, 03:29:08 pm »
I wish I'd read this before I came back to post the solution!

But yes guys, hash tables were the way to go. A colleague of mine pointed me in the right direction and I can now read in the text files using hash tables and pointers, adding only values that are not already known to the hash table, and then I use a filestream to output the contents of the hash list. For files of a few GB it seems to work OK; I haven't tried it on huge files yet. The only snag is dealing with Unicode text files, because I can't use ReadLn to read each line, but I have raised that issue in a separate thread.

But yes, I have gone for TFPHashList from unit 'Contnrs' (http://www.freepascal.org/docs-html/fcl/contnrs/tfphashlist.html)
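
A minimal sketch of that kind of loop with TFPHashList (not the actual program described above: the file names are placeholders, and it writes with WriteLn rather than a TFileStream, for brevity). One caveat worth knowing: TFPHashList keys are shortstrings, so keys longer than 255 characters are silently truncated.

Code: Pascal
program DedupWithTFPHashList;
{$mode objfpc}{$H+}
uses
  Contnrs;

var
  Seen: TFPHashList;
  Src, Dst: TextFile;
  Line: string;
begin
  Seen := TFPHashList.Create;
  try
    AssignFile(Src, 'words.txt');          // placeholder input name
    AssignFile(Dst, 'words_unique.txt');   // placeholder output name
    Reset(Src);
    Rewrite(Dst);
    while not Eof(Src) do
    begin
      ReadLn(Src, Line);
      // TFPHashList keys are shortstrings, so lines over 255 chars are truncated
      if Seen.FindIndexOf(Line) < 0 then
      begin
        Seen.Add(Line, Pointer(1));   // dummy data pointer; only the key matters here
        WriteLn(Dst, Line);
      end;
    end;
    CloseFile(Src);
    CloseFile(Dst);
  finally
    Seen.Free;
  end;
end.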

Thanks for the help guys
« Last Edit: September 10, 2016, 03:32:31 pm by Gizmo »

Fungus

  • Sr. Member
  • ****
  • Posts: 353
Re: How to remove duplicate entries from very large text files
« Reply #9 on: September 10, 2016, 03:30:21 pm »
Real-life hash tables also do not have perfect hashes, so you get duplicates, but that is no problem.

Why not if the purpose of the operation is to remove duplicates?

If I were to solve this issue, I would create a list of 32-bit values with file positions for each word. I would then use a binary search algorithm to reduce the file accesses required to look up each word while copying from one file to another. If there are 5 million words in a file, this approach would use ~20MB of memory and the performance would be quite reasonable :)

Martin_fr

  • Administrator
  • Hero Member
  • *
  • Posts: 9867
  • Debugger - SynEdit - and more
    • wiki
Re: How to remove duplicate entries from very large text files
« Reply #10 on: September 10, 2016, 03:50:06 pm »
Real-life hash tables also do not have perfect hashes, so you get duplicates, but that is no problem.

Why not if the purpose of the operation is to remove duplicates?
Duplicate hash values, not duplicate words.
Two different words can have the same hash value (just as two different text documents can have the same SHA-1 hash).

Please read the wiki page I linked.

Quote
If I were to solve this issue, I would create a list of 32-bit values with file positions for each word. I would then use a binary search algorithm

A binary search is O(log n); a (perfect) hash is O(1).

Of course there is no perfect hash, so depending on the collision handling a hash may have a worst case of O(log n) too. But in reality, in nearly 100% of all cases it will be close to O(1).

See https://en.wikipedia.org/wiki/Big_O_notation for the O() notation.

shobits1

  • Sr. Member
  • ****
  • Posts: 271
  • .
Re: How to remove duplicate entries from very large text files
« Reply #11 on: September 10, 2016, 06:24:49 pm »
hash tables were the way to go. A colleague of mine pointed me in the right direction and I can now read in the text files using hash tables and pointers, adding only values that are not already known to the hash table
If you use hashes, you need to check for collisions or you will lose (risk losing) data, so be careful.

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4468
  • I like bugs.
Re: How to remove duplicate entries from very large text files
« Reply #12 on: September 10, 2016, 09:43:24 pm »
if you use hashes you need to check for collisions or you will lose (risk losing) data, so be careful.

Nonsense. Just use any of the existing hash list classes and you are good.
Mostly Lazarus trunk and FPC 3.2 on Manjaro Linux 64-bit.

Gizmo

  • Hero Member
  • *****
  • Posts: 831
Re: [SOLVED] How to remove duplicate entries from very large text files
« Reply #13 on: September 12, 2016, 03:55:08 pm »
Guys... I am struggling with memory issues, it seems.

The program works fine if the source text file is, say, 100MB or so. But in one test I gave it a 1GB text file and, sure enough, I got the "Out of memory" error.

I'm not sure if there's any way around that other than the elaborate tricks Martin talked of?

Fungus

  • Sr. Member
  • ****
  • Posts: 353
Re: [SOLVED] How to remove duplicate entries from very large text files
« Reply #14 on: September 12, 2016, 04:51:59 pm »
Have you checked how much memory your program uses?

A trick you could use is to split your file into alphabetic chunks: one file for words starting with "a", one for words starting with "b", and so on, with each of these files having its own hash table. You can then read the huge source file line by line and check/update the corresponding alphabetical file. This would, of course, cause a lot of file I/O, which will slow down the duplicate removal, but it would consume less memory. After you're done removing duplicates you can then merge all the small chunks into one big file again.
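
One way to sketch that idea (a variation, not a literal implementation: the split is done in a first pass here, and the file names are placeholder assumptions) is shown below. Only one partition's unique words are ever held in memory at a time, which is what keeps the memory use bounded.

Code: Pascal
program DedupByPartition;
{$mode objfpc}{$H+}
uses
  SysUtils, Contnrs;

// Pass 1: distribute the lines of the huge file over 27 partition files by first letter.
procedure SplitByFirstLetter(const InFile: string);
var
  Src: TextFile;
  Parts: array[0..26] of TextFile;   // 'a'..'z' plus one catch-all partition
  Line: string;
  i, Idx: Integer;
begin
  for i := 0 to 26 do
  begin
    AssignFile(Parts[i], Format('part_%d.txt', [i]));
    Rewrite(Parts[i]);
  end;
  AssignFile(Src, InFile);
  Reset(Src);
  while not Eof(Src) do
  begin
    ReadLn(Src, Line);
    if (Line <> '') and (UpCase(Line[1]) in ['A'..'Z']) then
      Idx := Ord(UpCase(Line[1])) - Ord('A')
    else
      Idx := 26;
    WriteLn(Parts[Idx], Line);
  end;
  CloseFile(Src);
  for i := 0 to 26 do
    CloseFile(Parts[i]);
end;

// Pass 2: deduplicate one partition with its own in-memory hash list,
// appending the unique lines to the final output file.
procedure DedupPartition(const PartFile: string; var Dst: TextFile);
var
  Src: TextFile;
  Seen: TFPHashList;
  Line: string;
begin
  Seen := TFPHashList.Create;
  try
    AssignFile(Src, PartFile);
    Reset(Src);
    while not Eof(Src) do
    begin
      ReadLn(Src, Line);
      if Seen.FindIndexOf(Line) < 0 then
      begin
        Seen.Add(Line, Pointer(1));
        WriteLn(Dst, Line);
      end;
    end;
    CloseFile(Src);
  finally
    Seen.Free;   // only this partition's unique words were ever in memory
  end;
end;

var
  Dst: TextFile;
  i: Integer;
begin
  SplitByFirstLetter('words.txt');         // placeholder input name
  AssignFile(Dst, 'words_unique.txt');     // placeholder output name
  Rewrite(Dst);
  for i := 0 to 26 do
    DedupPartition(Format('part_%d.txt', [i]), Dst);
  CloseFile(Dst);
end.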

 
