Lists in FPC (including some internals in TFPHashList) grow by reallocation.
That is, if you have a list of size (actually capacity) X (e.g. 1 GB) and you need more, growing usually means: allocate X+n (e.g. 1.2 GB) in addition to the existing 1 GB, copy the data, then free the old memory.
So temporarily you need twice the memory.
But that isn't really avoidable, unless you know the number of words in advance, and you have the memory to hold them all.
In that case you can set the capacity of such lists by hand, while the list is still empty.
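For example (a minimal sketch using TStringList, which exposes a Capacity property; the count of 1000000 is just a made-up expected word count):

```pascal
uses Classes;

var
  Words: TStringList;
begin
  Words := TStringList.Create;
  try
    // Reserve room for the expected number of entries up front,
    // so the list never has to reallocate-and-copy while filling.
    Words.Capacity := 1000000;  // adjust to your expected input
    // ... Words.Add(...) in a loop ...
  finally
    Words.Free;
  end;
end.
```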
-----------
Using shortstrings is not good either.
http://www.freepascal.org/docs-html/prog/progsu162.html
"A shortstring occupies as many bytes as its maximum length plus one."
Afaik a shortstring with no explicit max length holds up to 255 chars, which means 256 bytes for each word.
You can hack around this, but that is always high risk.
Better to find or write a hash-list/table that does not need shortstrings.
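You can see the difference in a small test: a shortstring variable always occupies its full maximum length plus one byte, while an ansistring variable is just a pointer to heap data sized to the actual content.

```pascal
var
  S: ShortString;   // no explicit max length => max 255 chars
  A: AnsiString;
begin
  A := 'hello';
  // A ShortString variable always occupies MaxLength+1 bytes,
  // no matter how short the string stored in it is.
  WriteLn(SizeOf(S));   // 256
  // An AnsiString variable is just a pointer; the text itself
  // is heap-allocated and sized to the content (plus a header).
  WriteLn(SizeOf(A));   // pointer size: 4 or 8 depending on platform
  WriteLn(Length(A));   // 5
end.
```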
-----------
But in the end you need to ask: what is the biggest file you might get?
If it can fit into memory, and you expect it to keep that maximum size in the future, then you can keep the current approach.
But if you need to parse a file with 20 GB of data, you must assume the worst case: that each word is unique (no duplicates at all). So you must store 20 GB of words in your hash. Add to that some overhead for organizing the storage (you may have more than a billion different words, so that overhead could be significant).
Even if a 64-bit system could handle this, it would do a lot of disk swapping, and be slow.
Of course, if you use disk storage yourself, that takes time too, but you may be able to organize it more efficiently.
------------
An easy way is to use a database (with bulk insert statements).
Databases can do unique indexes, so they have all you need.
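A rough sketch with FPC's sqldb units and SQLite (the file and table names are made up; SQLite's "INSERT OR IGNORE" skips rows that would violate the unique key, and wrapping each batch in one transaction gives you the bulk insert):

```pascal
uses sqldb, sqlite3conn;

var
  Conn: TSQLite3Connection;
  Tx: TSQLTransaction;
  Q: TSQLQuery;
begin
  Conn := TSQLite3Connection.Create(nil);
  Tx := TSQLTransaction.Create(Conn);
  Conn.Transaction := Tx;
  Conn.DatabaseName := 'words.db';  // made-up file name
  Conn.Open;
  // PRIMARY KEY gives you the unique index.
  Conn.ExecuteDirect('CREATE TABLE IF NOT EXISTS words (w TEXT PRIMARY KEY)');
  Tx.Commit;

  Q := TSQLQuery.Create(nil);
  try
    Q.Database := Conn;
    Q.SQL.Text := 'INSERT OR IGNORE INTO words (w) VALUES (:w)';
    Tx.StartTransaction;  // one transaction per batch = bulk insert
    // for each word in the batch:
    Q.Params.ParamByName('w').AsString := 'example';
    Q.ExecSQL;
    Tx.Commit;
  finally
    Q.Free;
    Conn.Free;
  end;
end.
```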
------------
If you do it yourself, you need to investigate how to build efficient indexes on disk.
You could keep a fixed-size hash table in memory, but store the bucket data in a file.

hashtable: array of int64
setlength(hashtable, 50_percent_of_mem_avail); // only once at start of your app
// real mem, no swap.
hashtable[integer_hash_value] := int64_pos_of_bucketdata_in_indexfile
The bucket data must be able to hold several entries, for hash conflicts.
So each entry in the file could be:
- 8 bytes int64: 0 (nil) or the file position of the previous bucket-data entry
- 2 bytes word: length of the string
- the text data
If you add an entry to your hash table, you append the data to the end of the file (after checking that it isn't a duplicate).
If there was previous data for that bucket, the new entry points back to it.
If you need to check whether words are in the index file, you can collect them and check them in batches. You can sort each batch by bucket-data position in the file; that way you reduce the amount of seeking you need to do in the index file.
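The append step above could look like this (a sketch only; the names and the FNV-1a hash are my own choices, and the in-memory hashtable array from earlier is assumed):

```pascal
uses Classes, SysUtils;

type
  TBucketHeader = packed record
    PrevPos: Int64; // 0 = first entry in this bucket, else file pos of previous entry
    Len: Word;      // byte length of the word; the text follows immediately
  end;

// Simple FNV-1a hash, just a placeholder for whatever hash you use.
function SimpleHash(const S: AnsiString): Cardinal;
var
  i: Integer;
begin
  Result := 2166136261;
  for i := 1 to Length(S) do
    Result := (Result xor Ord(S[i])) * 16777619;
end;

// Append a word to the index file and link it into its bucket chain.
// HashTable[] holds the file position of the newest entry of each bucket
// (0 = empty bucket; the file is assumed to start with a small signature,
// so 0 is never a valid entry position).
// The caller must already have checked that AWord is not a duplicate.
procedure AppendWord(IndexFile: TFileStream; var HashTable: array of Int64;
  const AWord: AnsiString);
var
  H: Cardinal;
  Hdr: TBucketHeader;
  NewPos: Int64;
begin
  H := SimpleHash(AWord) mod Cardinal(Length(HashTable));
  NewPos := IndexFile.Seek(0, soEnd);  // append at the end of the file
  Hdr.PrevPos := HashTable[H];         // back-link to the previous entry, 0 if none
  Hdr.Len := Length(AWord);
  IndexFile.WriteBuffer(Hdr, SizeOf(Hdr));
  if Length(AWord) > 0 then
    IndexFile.WriteBuffer(AWord[1], Length(AWord));
  HashTable[H] := NewPos;              // the bucket now points at the newest entry
end;
```

A lookup then follows the PrevPos chain backwards from HashTable[H] until it hits 0, comparing the stored text at each step.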