Hi gli,
Why does it have to be files?
Can you not use an embedded SQLite 3 database?
It takes away all the pain of filtering and parsing those files.
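Just as a sketch of what I mean, using the sqldb units that ship with FPC/Lazarus (the file, table and column names here are invented):

uses SysUtils, SQLDB, SQLite3Conn;

var
  Conn: TSQLite3Connection;
  Trans: TSQLTransaction;
  Query: TSQLQuery;
begin
  Conn := TSQLite3Connection.Create(nil);
  Trans := TSQLTransaction.Create(nil);
  Query := TSQLQuery.Create(nil);
  try
    Conn.DatabaseName := 'measurements.db';   { one embedded file instead of millions of small ones }
    Conn.Transaction := Trans;
    Query.Database := Conn;
    Query.Transaction := Trans;
    Conn.Open;
    { the filtering you now do by hand on files becomes a plain WHERE clause }
    Query.SQL.Text := 'SELECT name, value FROM readings WHERE day = :day';
    Query.Params.ParamByName('day').AsDate := EncodeDate(2020, 1, 1);
    Query.Open;
    while not Query.EOF do
    begin
      { ... use Query.FieldByName('value').AsString here ... }
      Query.Next;
    end;
  finally
    Query.Free;
    Trans.Free;
    Conn.Free;
  end;
end.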
Just my 2c.
Cheers,
Gus
TIniFile (https://freepascal.org/docs-html/current/fcl/inifiles/tinifile.html) will probably be the simplest, and if not the fastest, it should be fast enough even for 5000 files.
With such a large number of files your bottleneck will always be IO and you can't avoid it no matter what, since you'll always have to read each file and parse it; the parsing, at least, TIniFile can do for you.
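Something like this, say (a rough sketch; the file, section and key names are placeholders):

uses IniFiles;

var
  Ini: TIniFile;
  Value: string;
begin
  Ini := TIniFile.Create('data0001.ini');   { one of the ~5000 files }
  try
    { every "key=value" line in section [Main] is one ReadString away }
    Value := Ini.ReadString('Main', 'SomeKey', '');
  finally
    Ini.Free;
  end;
end.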
TIniFile (https://freepascal.org/docs-html/current/fcl/inifiles/tinifile.html) will probably be the simplest, and if not the fastest, it should be fast enough even for 5000 files.
Currently, I have a SQLite database (with date information only) that helps me filter the files. This means that I don't need to open them all and filter them out. Only the necessary files will be opened.
Complexity is not an obstacle. I would just like to do that as quickly as possible.
Hi Lucamar,
TIniFile (https://freepascal.org/docs-html/current/fcl/inifiles/tinifile.html) will probably be the simplest, and if not the fastest, it should be fast enough even for 5000 files.
The OP says that he's reading 5,000 (5K) files after filtering them down from 50,000,000 (50M). TIniFile is good in a pinch, but with 50M sections and no indexes it's gonna hurt a lot.
Hence my proposing the simplest SQL engine possible, to migrate from a set of files to something just short of a full-blown database server.
Hey, even FireBird would be an improvement, right?
Cheers,
Gus
Currently, I have a SQLite database (with date information only) that helps me filter the files. This means that I don't need to open them all and filter them out. Only the necessary files will be opened.
Complexity is not an obstacle. I would just like to do that as quickly as possible.
You talked of having to read some 5000 files out of several million; no matter how you do it, that is your bottleneck: a lot of files (~5000) to read and parse. That's what I meant, and in those circumstances you might as well use TIniFile to simplify your code, since it will not add significantly to the time spent, even less so if each file is really as small as you said.
Or you could use TStringList: it will swallow each file in a jiffy and has methods and properties to deal with "key=value" pairs like yours.
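For instance (a sketch; the file name is invented and the files are assumed to be plain "key=value" lines):

uses Classes;

var
  Data: TStringList;
  i: Integer;
begin
  Data := TStringList.Create;
  try
    Data.LoadFromFile('data0001.txt');   { reads the whole file in one go }
    for i := 0 to Data.Count - 1 do
      WriteLn(Data.Names[i], ' = ', Data.ValueFromIndex[i]);
    { or look up a single key directly: Data.Values['SomeKey'] }
  finally
    Data.Free;
  end;
end.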
The question, really, is that your program will probably spend more time in the file system (seeking each file, reading it, etc.) than doing whatever needs to be done with the data.
Until recently, this data was saved in a database (MySQL).
I'm still thinking about using a TSDB (time-series database).
Currently, I have a SQLite database (with date information only) that helps me filter the files. This means that I don't need to open them all and filter them out. Only the necessary files will be opened.
Hey gli,
Another thing I'm worried about is whether you'll hit the limit on files per folder.
I would have to check for NTFS and the Linux extN filesystems, but I think there is a limit. I think I stumbled on a post about this once, but I'm not sure.
Cheers,
Gus
Another option would be to pack all those small files into one container file with an index header, something like:
| MAGIC NUMBER (32 bit) | Version Number (32 bit) | Header Length (64 bit) |
| Filename 1 (24 byte shortstring) | Offset (64 bit) |
| Filename 2 (24 byte shortstring) | Offset (64 bit) |
| Filename 3 (24 byte shortstring) | Offset (64 bit) |
| ... |
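In Free Pascal terms the header could be read with something like this (just a sketch; the record layout, file name and the assumption that "Header Length" means the total header size in bytes are all my own guesses at the idea):

uses Classes, SysUtils;

type
  THeaderEntry = packed record
    FileName: string[23];   { 24 bytes: length byte + up to 23 characters }
    Offset: Int64;          { where this entry's content starts in the container }
  end;

var
  fs: TFileStream;
  Magic, Version: LongWord;
  HeaderLen: Int64;
  Entry: THeaderEntry;
begin
  fs := TFileStream.Create('container.dat', fmOpenRead or fmShareDenyWrite);
  try
    fs.ReadBuffer(Magic, SizeOf(Magic));          { 32-bit magic number }
    fs.ReadBuffer(Version, SizeOf(Version));      { 32-bit version number }
    fs.ReadBuffer(HeaderLen, SizeOf(HeaderLen));  { 64-bit header length }
    while fs.Position < HeaderLen do
    begin
      fs.ReadBuffer(Entry, SizeOf(Entry));
      { later: seek to Entry.Offset to read that "file" without touching the filesystem }
    end;
  finally
    fs.Free;
  end;
end.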
Where Offset is the position of that file's content within the container file.
50 million records in a half-competent database is absolutely nothing. I can't speak for MSSQL, but in the case of PostgreSQL you can ask it how it's tackling a query and hence identify anywhere it's locating one or more matching rows by doing a sequential scan rather than using an index.
My suggestion would be to revisit the database situation, adding indexes etc. as necessary since the database maintainers will have addressed all issues of file count etc.
MarkMLl
[…] What is the fastest way to read these files?
The “best” would be not to read so many files. Ideally, you wouldn’t read any files.
[…] Or some other way ...
[…] Currently, I have a SQLite database (with date information only) that helps me filter the files. This means that I don't need to open them all and filter them out. Only the necessary files will be opened. […]
You could store such information via path components, by using a directory tree that looks like:
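For example, one directory level per date component (names purely illustrative):

data/
  2023/
    07/
      15/
        0001.dat
        0002.dat

That way the date filter decides which directories get visited at all, before a single file is opened.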
[…]
Well, (concerning spinning disks) one can set up a striping RAID. That’s not that expensive (nowadays).
Unless you throw very expensive hardware at it, you'll never overcome the sheer amount of IO involved.
A slow database is typically an unoptimized database.
I've just checked a 2.7 million row table in a PostgreSQL database running live (i.e. it will be heavily cached), and getting the minimum or maximum value of a specified column takes roughly a second. Asking for an arbitrary day's worth of data (I used 2010-01-01) gets a response which is effectively instantaneous.
Nice!
My queries are not very advanced. Basically, I search for records within a range of dates, plus a few records prior to the first date of the filter. As I mentioned before, when using a DBMS the biggest problem was updating/deleting data.
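In sqldb terms that pattern is just two queries, along these lines (a sketch; the database file, table and column names are invented):

uses SysUtils, SQLDB, SQLite3Conn;

var
  Conn: TSQLite3Connection;
  Trans: TSQLTransaction;
  Query: TSQLQuery;
begin
  Conn := TSQLite3Connection.Create(nil);
  Trans := TSQLTransaction.Create(nil);
  Query := TSQLQuery.Create(nil);
  try
    Conn.DatabaseName := 'filter.db';   { hypothetical database file }
    Conn.Transaction := Trans;
    Query.Database := Conn;
    Query.Transaction := Trans;
    Conn.Open;

    { an index on the date column keeps both range scans cheap }
    Query.SQL.Text := 'CREATE INDEX IF NOT EXISTS idx_readings_day ON readings(day)';
    Query.ExecSQL;
    Trans.Commit;

    { records within the period }
    Query.SQL.Text := 'SELECT * FROM readings WHERE day BETWEEN :d1 AND :d2 ORDER BY day';
    Query.Params.ParamByName('d1').AsDate := EncodeDate(2023, 1, 1);
    Query.Params.ParamByName('d2').AsDate := EncodeDate(2023, 1, 31);
    Query.Open;
    { ... process ... }
    Query.Close;

    { a few records just before the first date of the filter }
    Query.SQL.Text := 'SELECT * FROM readings WHERE day < :d1 ORDER BY day DESC LIMIT 10';
    Query.Params.ParamByName('d1').AsDate := EncodeDate(2023, 1, 1);
    Query.Open;
    { ... process ... }
  finally
    Query.Free;
    Trans.Free;
    Conn.Free;
  end;
end.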