
Author Topic: How to remove duplicate entries from very large text files  (Read 24396 times)

Gizmo

  • Hero Member
  • *****
  • Posts: 831
How to remove duplicate entries from very large text files
« on: August 25, 2016, 09:01:03 pm »
I'm trying to work out how to tackle a problem.

I have a situation where very large text files (word lists, essentially) need to be read and the duplicates removed.

For small files of several MB, a TStringList could be used, but I'm not sure how to tackle this with large files of sometimes several GB. If the whole file cannot be loaded into memory, how does the program know what may lie ahead in the file? e.g. if the word "Peter" is found on line 5 but also on line 1,000,000,000, how can my program know about the entry on that billionth line when it reads the first one on line 5?

I'm guessing a TFileStream will be needed, but I'm still unsure how to tackle the problem efficiently without reading the whole file from start to end for every line in the file. At the moment, that's the only way I can think of: i.e. read a line from the stream, then check every other line to see if it equals the line being read; if not, leave it, and if it does, delete it. But that seems very inefficient to me.

I don't necessarily expect code, as I realise I haven't given you much to go on yet. I'm just after ideas from those with experience.

Ta
« Last Edit: September 13, 2016, 04:10:05 pm by Gizmo »

Phil

  • Hero Member
  • *****
  • Posts: 2737
Re: How to remove duplicate entries from very large text files
« Reply #1 on: August 25, 2016, 09:15:44 pm »
e.g. if the word "Peter" is found on line 5 but also on line 1,000,000,000, how can my program know about the entry on that billionth line when it reads the first one on line 5?

It can't, unless you've indexed the file, e.g., put it in a database.


molly

  • Hero Member
  • *****
  • Posts: 2330
Re: How to remove duplicate entries from very large text files
« Reply #2 on: August 25, 2016, 09:18:50 pm »
If sorting the list/file is acceptable, you could opt for an external merge sort.
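
To make that concrete, here is a rough sketch of an external merge sort that also drops duplicates during the merge. The chunk size and the 'words.txt' / 'chunk_%d.txt' file names are placeholder assumptions, and everything is compared byte-wise so that the chunk sort and the merge agree on the ordering.

Code: Pascal
program ExternalDedupSort;
{$mode objfpc}{$H+}
uses SysUtils, Classes;

const
  MaxLinesPerChunk = 1000000;   // assumption: tune to the memory you have available

// Byte-wise comparison, used for both the chunk sort and the merge so they agree.
function OrdinalCompare(List: TStringList; Index1, Index2: Integer): Integer;
begin
  Result := CompareStr(List[Index1], List[Index2]);
end;

// Phase 1: read the big file in chunks that fit in memory, sort each chunk, save it.
function SplitIntoSortedChunks(const InFile: string): Integer;
var
  Chunk: TStringList;
  Src: TextFile;
  Line: string;
  ChunkIdx: Integer;

  procedure FlushChunk;
  begin
    if Chunk.Count = 0 then Exit;
    Chunk.CustomSort(@OrdinalCompare);
    Chunk.SaveToFile(Format('chunk_%d.txt', [ChunkIdx]));
    Inc(ChunkIdx);
    Chunk.Clear;
  end;

begin
  ChunkIdx := 0;
  Chunk := TStringList.Create;
  try
    AssignFile(Src, InFile);
    Reset(Src);
    while not Eof(Src) do
    begin
      ReadLn(Src, Line);
      Chunk.Add(Line);
      if Chunk.Count >= MaxLinesPerChunk then
        FlushChunk;
    end;
    FlushChunk;
    CloseFile(Src);
  finally
    Chunk.Free;
  end;
  Result := ChunkIdx;
end;

// Phase 2: merge the sorted chunks; equal lines come out back to back, so writing
// a line only when it differs from the previously written one removes duplicates.
procedure MergeChunks(ChunkCount: Integer; const OutFile: string);
var
  Srcs: array of TextFile;
  Cur: array of string;
  Done: array of Boolean;
  Dst: TextFile;
  i, Best: Integer;
  LastWritten: string;
  HaveLast: Boolean;
begin
  SetLength(Srcs, ChunkCount);
  SetLength(Cur, ChunkCount);
  SetLength(Done, ChunkCount);
  for i := 0 to ChunkCount - 1 do
  begin
    AssignFile(Srcs[i], Format('chunk_%d.txt', [i]));
    Reset(Srcs[i]);
    Done[i] := Eof(Srcs[i]);
    if not Done[i] then ReadLn(Srcs[i], Cur[i]);
  end;
  AssignFile(Dst, OutFile);
  Rewrite(Dst);
  HaveLast := False;
  repeat
    Best := -1;   // pick the smallest current line over all chunks
    for i := 0 to ChunkCount - 1 do
      if (not Done[i]) and ((Best = -1) or (CompareStr(Cur[i], Cur[Best]) < 0)) then
        Best := i;
    if Best = -1 then Break;
    if (not HaveLast) or (Cur[Best] <> LastWritten) then
    begin
      WriteLn(Dst, Cur[Best]);
      LastWritten := Cur[Best];
      HaveLast := True;
    end;
    Done[Best] := Eof(Srcs[Best]);
    if not Done[Best] then ReadLn(Srcs[Best], Cur[Best]);
  until False;
  CloseFile(Dst);
  for i := 0 to ChunkCount - 1 do CloseFile(Srcs[i]);
end;

begin
  MergeChunks(SplitIntoSortedChunks('words.txt'), 'words_unique.txt');
end.

Each chunk only has to fit in memory on its own, and the merge keeps just one current line per chunk, which is why this scales to files much larger than RAM.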

Martin_fr

  • Administrator
  • Hero Member
  • *
  • Posts: 9867
  • Debugger - SynEdit - and more
    • wiki
Re: How to remove duplicate entries from very large text files
« Reply #3 on: September 10, 2016, 02:12:08 pm »
You create a hash-table with all the words.
https://en.wikipedia.org/wiki/Hash_table

For an example, see SynEdit:
unit SynPluginSyncroEdit;
class TSynPluginSyncroEditWordsHash



JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4468
  • I like bugs.
Re: How to remove duplicate entries from very large text files
« Reply #4 on: September 10, 2016, 02:46:02 pm »
You create a hash-table with all the words.
Yes, a hash table is the way to go. This is pseudo-code, not using a real hash class:
Code: Pascal
var Seen: TSomeHashListType;
...
  if not Seen[MyText] then
  begin
    // ... Do your thing with the first MyText instance ...
    Seen.Add(MyText);
  end;
Unfortunately the container classes are spread around the FPC and Lazarus libs, but there are many choices.

Note: the hash table does not store all lines in memory, only the unique ones. Thus 1,000,000,000 lines is no problem if most of them are duplicates.
A hash table has constant lookup time, O(1), so it shines with large amounts of data.
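
For a concrete (and deliberately tiny) illustration of that pseudo-code, the sketch below uses TDictionary from the Generics.Collections unit as the "Seen" set. It assumes an FPC recent enough to ship that unit; the hard-coded word list is only there to keep the demo self-contained, and any other hash-based container can fill the same role.

Code: Pascal
program SeenDemo;
{$mode objfpc}{$H+}
uses
  Generics.Collections;   // ships with recent FPC; other hash containers work too

type
  TSeenMap = specialize TDictionary<string, Boolean>;   // used as a set: only the keys matter

const
  Words: array[0..3] of string = ('Peter', 'Paul', 'Peter', 'Mary');

var
  Seen: TSeenMap;
  W: string;
begin
  Seen := TSeenMap.Create;
  try
    for W in Words do
      if not Seen.ContainsKey(W) then
      begin
        WriteLn(W);          // do your thing with the first instance only
        Seen.Add(W, True);   // remember it so later duplicates are skipped
      end;
  finally
    Seen.Free;
  end;
end.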
« Last Edit: September 10, 2016, 02:50:18 pm by JuhaManninen »
Mostly Lazarus trunk and FPC 3.2 on Manjaro Linux 64-bit.

Fungus

  • Sr. Member
  • ****
  • Posts: 353
Re: How to remove duplicate entries from very large text files
« Reply #5 on: September 10, 2016, 03:02:46 pm »
If the hash value is larger (in bytes) than the average word length, a hash table will not work. SHA-1 is 20 bytes, so if the average word length is less than 20 characters, a SHA-1 hash table would use more memory than the text file itself.

Martin_fr

  • Administrator
  • Hero Member
  • *
  • Posts: 9867
  • Debugger - SynEdit - and more
    • wiki
Re: How to remove duplicate entries from very large text files
« Reply #6 on: September 10, 2016, 03:13:21 pm »
If the hash value is larger (in bytes) than the average word length, a hash table will not work. SHA-1 is 20 bytes, so if the average word length is less than 20 characters, a SHA-1 hash table would use more memory than the text file itself.
You wouldn't use SHA-1 for a hash table. You would need 2^160 buckets for it.

Hash values would usually be integers in a limited range: maybe 100,000, maybe 5,000,000, maybe more or less... it depends on the data and the memory available.
buckets: array of BucketData;
SetLength(buckets, 100000);

Real-life hash tables also do not have perfect hashes, so you get duplicates, but that is no problem.
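
A minimal sketch of such a bucket table, just to illustrate the idea (this is not the SynEdit class; the bucket count and the FNV-1a hash are arbitrary choices): each bucket simply collects every word that happens to land on the same hash value, so a collision only costs a short linear scan within one bucket.

Code: Pascal
program BucketHashDemo;
{$mode objfpc}{$H+}
{$Q-}{$R-}   // let the hash arithmetic wrap around instead of raising errors

const
  BucketCount = 100000;   // chosen to match the data size and available memory

type
  TBucket = array of string;   // all words that happen to share one hash value

var
  Buckets: array of TBucket;

// Any cheap hash will do; this is FNV-1a reduced to the bucket range.
function HashOf(const S: string): Cardinal;
var
  i: Integer;
begin
  Result := 2166136261;
  for i := 1 to Length(S) do
    Result := (Result xor Ord(S[i])) * 16777619;
  Result := Result mod BucketCount;
end;

// True if S was stored before; otherwise remember it and return False.
function SeenBefore(const S: string): Boolean;
var
  H: Cardinal;
  i, n: Integer;
begin
  H := HashOf(S);
  for i := 0 to High(Buckets[H]) do
    if Buckets[H][i] = S then
      Exit(True);               // genuine duplicate word
  // New word (or a hash collision with a different word): extend this bucket.
  n := Length(Buckets[H]);
  SetLength(Buckets[H], n + 1);
  Buckets[H][n] := S;
  Result := False;
end;

begin
  SetLength(Buckets, BucketCount);
  WriteLn(SeenBefore('Peter'));   // FALSE: first time seen
  WriteLn(SeenBefore('Peter'));   // TRUE:  duplicate
end.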

Martin_fr

  • Administrator
  • Hero Member
  • *
  • Posts: 9867
  • Debugger - SynEdit - and more
    • wiki
Re: How to remove duplicate entries from very large text files
« Reply #7 on: September 10, 2016, 03:19:44 pm »
If you do have (or expect to have) more words than fit into memory, then you need to store your hash table on the filesystem.

Since that is slow, you may have to employ additional tricks, such as dividing the data and the hash table into smaller parts and loading/checking all combinations of the parts.

Gizmo

  • Hero Member
  • *****
  • Posts: 831
Re: How to remove duplicate entries from very large text files
« Reply #8 on: September 10, 2016, 03:29:08 pm »
I wish I'd read this before I came back to post the solution!

But yes guys, hash tables were the way to go. A colleague of mine pointed me in the right direction and I can now read in the text files using hash tables and pointers, adding only values that are not already known to the hash table, and then I use a filestream to output the contents of the hash list. For files of a few GB it seems to work OK; I haven't tried it on huge files yet. The only snag is dealing with Unicode text files, because I can't use ReadLn to read each line, but I have raised that issue in a separate thread.

But yes, I have gone for TFPHashList from unit 'Contnrs' (http://www.freepascal.org/docs-html/fcl/contnrs/tfphashlist.html)
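
A minimal sketch of that kind of loop with TFPHashList (not the actual program described above: the file names are placeholders, and it writes with WriteLn rather than a TFileStream, for brevity). One caveat worth knowing: TFPHashList keys are shortstrings, so keys longer than 255 characters are silently truncated.

Code: Pascal
program DedupWithTFPHashList;
{$mode objfpc}{$H+}
uses
  Contnrs;

var
  Seen: TFPHashList;
  Src, Dst: TextFile;
  Line: string;
begin
  Seen := TFPHashList.Create;
  try
    AssignFile(Src, 'words.txt');          // placeholder input name
    AssignFile(Dst, 'words_unique.txt');   // placeholder output name
    Reset(Src);
    Rewrite(Dst);
    while not Eof(Src) do
    begin
      ReadLn(Src, Line);
      // TFPHashList keys are shortstrings, so lines over 255 chars are truncated
      if Seen.FindIndexOf(Line) < 0 then
      begin
        Seen.Add(Line, Pointer(1));   // dummy data pointer; only the key matters here
        WriteLn(Dst, Line);
      end;
    end;
    CloseFile(Src);
    CloseFile(Dst);
  finally
    Seen.Free;
  end;
end.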

Thanks for the help guys
« Last Edit: September 10, 2016, 03:32:31 pm by Gizmo »

Fungus

  • Sr. Member
  • ****
  • Posts: 353
Re: How to remove duplicate entries from very large text files
« Reply #9 on: September 10, 2016, 03:30:21 pm »
Real-life hash tables also do not have perfect hashes, so you get duplicates, but that is no problem.

Why not if the purpose of the operation is to remove duplicates?

If I were to solve this issue, I would create a list of 32-bit values with file positions for each word. I would then use a binary search algorithm to reduce the file accesses required to look up each word while copying from one file to another. If there are 5 million words in a file, this approach would use ~20MB of memory and the performance would be quite reasonable :)

Martin_fr

  • Administrator
  • Hero Member
  • *
  • Posts: 9867
  • Debugger - SynEdit - and more
    • wiki
Re: How to remove duplicate entries from very large text files
« Reply #10 on: September 10, 2016, 03:50:06 pm »
Real-life hash tables also do not have perfect hashes, so you get duplicates, but that is no problem.

Why not if the purpose of the operation is to remove duplicates?
Duplicate hash values, not duplicate words.
Two different words can have the same hash value (just as two different text documents can have the same SHA-1 hash).

Please read the wiki page I linked.

Quote
If I were to solve this issue, I would create a list of 32-bit values with file positions for each word. I would then use a binary search algorithm

A binary search is O(log n); a (perfect) hash is O(1).

Of course there is no perfect hash, so depending on the collision handling a hash may have a worst case of O(log n) too. But in reality, in nearly 100% of all cases it will be close to O(1).

See https://en.wikipedia.org/wiki/Big_O_notation for the O() notation.

shobits1

  • Sr. Member
  • ****
  • Posts: 271
  • .
Re: How to remove duplicate entries from very large text files
« Reply #11 on: September 10, 2016, 06:24:49 pm »
hash tables were the way to go. A colleague of mine pointed me in the right direction and I can now read in the text files using hash tables and pointers, adding only values that are not already known to the hash table
If you use hashes, you need to check for collisions or you will lose (risk losing) data, so be careful.

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4468
  • I like bugs.
Re: How to remove duplicate entries from very large text files
« Reply #12 on: September 10, 2016, 09:43:24 pm »
if you use hashes you need to check for collisions or you will lose (risk losing) data, so be careful.

Nonsense. Just use any of the existing hash list classes and you are good.
Mostly Lazarus trunk and FPC 3.2 on Manjaro Linux 64-bit.

Gizmo

  • Hero Member
  • *****
  • Posts: 831
Re: [SOLVED] How to remove duplicate entries from very large text files
« Reply #13 on: September 12, 2016, 03:55:08 pm »
Guys... I am struggling with memory issues, it seems.

The program works fine if the source text file is, say, 100MB or so. But in one test I gave it a 1GB text file and, sure enough, I got the "Out of memory" error.

I'm not sure if there's any way around that other than the elaborate tricks Martin talked of?

Fungus

  • Sr. Member
  • ****
  • Posts: 353
Re: [SOLVED] How to remove duplicate entries from very large text files
« Reply #14 on: September 12, 2016, 04:51:59 pm »
Have you checked how much memory your program uses?

A trick you could use is to split your file into alphabetic chunks: one file for words starting with "a", one for words starting with "b", and so on, with each of these files having its own hash table. You can then read the huge source file line by line and check/update the corresponding alphabetical file. This would, of course, cause a lot of file I/O, which will slow down the duplicate removal, but it would consume less memory. After you're done removing duplicates you can then merge all the small chunks into one big file again.
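
One way to sketch that idea (a variation, not a literal implementation: the split is done in a first pass here, and the file names are placeholder assumptions) is shown below. Only one partition's unique words are ever held in memory at a time, which is what keeps the memory use bounded.

Code: Pascal
program DedupByPartition;
{$mode objfpc}{$H+}
uses
  SysUtils, Contnrs;

// Pass 1: distribute the lines of the huge file over 27 partition files by first letter.
procedure SplitByFirstLetter(const InFile: string);
var
  Src: TextFile;
  Parts: array[0..26] of TextFile;   // 'a'..'z' plus one catch-all partition
  Line: string;
  i, Idx: Integer;
begin
  for i := 0 to 26 do
  begin
    AssignFile(Parts[i], Format('part_%d.txt', [i]));
    Rewrite(Parts[i]);
  end;
  AssignFile(Src, InFile);
  Reset(Src);
  while not Eof(Src) do
  begin
    ReadLn(Src, Line);
    if (Line <> '') and (UpCase(Line[1]) in ['A'..'Z']) then
      Idx := Ord(UpCase(Line[1])) - Ord('A')
    else
      Idx := 26;
    WriteLn(Parts[Idx], Line);
  end;
  CloseFile(Src);
  for i := 0 to 26 do
    CloseFile(Parts[i]);
end;

// Pass 2: deduplicate one partition with its own in-memory hash list,
// appending the unique lines to the final output file.
procedure DedupPartition(const PartFile: string; var Dst: TextFile);
var
  Src: TextFile;
  Seen: TFPHashList;
  Line: string;
begin
  Seen := TFPHashList.Create;
  try
    AssignFile(Src, PartFile);
    Reset(Src);
    while not Eof(Src) do
    begin
      ReadLn(Src, Line);
      if Seen.FindIndexOf(Line) < 0 then
      begin
        Seen.Add(Line, Pointer(1));
        WriteLn(Dst, Line);
      end;
    end;
    CloseFile(Src);
  finally
    Seen.Free;   // only this partition's unique words were ever in memory
  end;
end;

var
  Dst: TextFile;
  i: Integer;
begin
  SplitByFirstLetter('words.txt');         // placeholder input name
  AssignFile(Dst, 'words_unique.txt');     // placeholder output name
  Rewrite(Dst);
  for i := 0 to 26 do
    DedupPartition(Format('part_%d.txt', [i]), Dst);
  CloseFile(Dst);
end.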

 
