Lazarus

Programming => General => Topic started by: gii on March 08, 2021, 01:53:32 pm

Title: Best reading method
Post by: gii on March 08, 2021, 01:53:32 pm
I have a large number of files (~50 million, each of ~66 bytes).

Important details:

1. A filter is applied before reading, so not all the files are read; each request reads ~5,000 files.

2. The files have a standard structure:

Something like:
H = 1200
O = 1947.47
L = 1950.87
C = 1200
V = 78965656.99

Only the values change from file to file.

3. If necessary, I can change the way I write these files.

What is the fastest way to read these files?

Typed files
BufStream
Or some other way ...
Title: Re: Best reading method
Post by: Gustavo 'Gus' Carreno on March 08, 2021, 02:09:03 pm
Hi gii,

Why does it have to be files?

Can you not use an embedded SQLite 3.0 database?
It takes all the pain out of filtering and parsing those files.

Just my 2c.

Cheers,
Gus
Title: Re: Best reading method
Post by: lucamar on March 08, 2021, 02:10:58 pm
TIniFile (https://freepascal.org/docs-html/current/fcl/inifiles/tinifile.html) will probably be the simplest and, if not the fastest, it should be fast enough even for 5000 files.

With such a large number of files your bottleneck will always be IO and you can't avoid it no matter what, since you'll always have to read each file and parse it, the latter of which TIniFile can do for you.
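Just to give an idea, a minimal sketch (it assumes you add a section header, say [candle], to each file, since INI access wants a section; the section and key names are only illustrative):

Code: Pascal
uses
  IniFiles;

type
  TCandle = record
    H, O, L, C, V: Double;
  end;

function ReadCandle(const AFileName: string): TCandle;
var
  Ini: TIniFile;
begin
  Ini := TIniFile.Create(AFileName);
  try
    // ReadFloat(Section, Ident, Default) - the [candle] section is an assumption
    Result.H := Ini.ReadFloat('candle', 'H', 0);
    Result.O := Ini.ReadFloat('candle', 'O', 0);
    Result.L := Ini.ReadFloat('candle', 'L', 0);
    Result.C := Ini.ReadFloat('candle', 'C', 0);
    Result.V := Ini.ReadFloat('candle', 'V', 0);
  finally
    Ini.Free;
  end;
end;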
Title: Re: Best reading method
Post by: gii on March 08, 2021, 02:26:21 pm
Hi gii,

Why does it have to be files?

Can you not use an embedded SQLite 3.0 database?
It takes all the pain out of filtering and parsing those files.

Just my 2c.

Cheers,
Gus

Until recently, this data was saved in a database (MySQL).

However, as the database grew, it became very slow.

My first idea was to use MongoDB. I had two problems.

1. Limit of 16KB per document.
2. High RAM consumption.

Now, I am trying to save this data directly to disk.

I am still thinking about using a TSDB (a time-series database).

The expectation is that, in a short time, I will have 552 million records.

TIniFile (https://freepascal.org/docs-html/current/fcl/inifiles/tinifile.html) will probably be the simplest and, if not the fastest, it should be fast enough even for 5000 files.

With such a large number of files your bottleneck will always be IO and you can't avoid it no matter what, since you'll always have to read each file and parse it, the latter of which TIniFile can do for you.

Currently, I have a SQLite database (with date information only) that helps me filter the files. This means that I don't need to open them all and filter them out. Only the necessary files will be opened.

Complexity is not an obstacle. I would just like to do that as quickly as possible.
Title: Re: Best reading method
Post by: Gustavo 'Gus' Carreno on March 08, 2021, 02:27:27 pm
Hi Lucamar,

TIniFile (https://freepascal.org/docs-html/current/fcl/inifiles/tinifile.html) will probably be the simplest and, if not the fastest, it should be fast enough even for 5000 files.

The OP says that he's reading 5,000 (5K) files after filtering down from 50,000,000 (50M). TIniFile is good in a pinch, but with 50M sections and no indexes it's gonna hurt a lot.
Hence my proposing the simplest SQL engine possible, to migrate from a set of files to something just short of a full-blown database server.
Hey, even Firebird would be an improvement, right?

Cheers,
Gus
Title: Re: Best reading method
Post by: lucamar on March 08, 2021, 02:33:55 pm
Currently, I have a SQLite database (with date information only) that helps me filter the files. This means that I don't need to open them all and filter them out. Only the necessary files will be opened.

Complexity is not an obstacle. I would just like to do that as quickly as possible.

You talked of having to read some 5000 files out of several millions; no matter what and how you do it, that is your bottleneck: a lot of files (~5000) to read and parse. That's what I meant, and in those circumstances you might as well use TIniFile to simplify your code since it will not add significantly to the time spent, much less if each file is really as small as you said.

Or you could use TStringList: it will swallow each file in a jiffy and has methods and properties to deal with "key=value" pairs like yours.

The question, really, is that your program will probably spend more time in the file system (seeking each file, reading it, etc.) than doing whatever needs to be done with the data.
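For instance, something along these lines (the Trim calls and the '.' decimal separator are only there to cope with the "H = 1200" layout you showed; the file name is hypothetical):

Code: Pascal
uses
  Classes, SysUtils;

type
  TCandle = record
    H, O, L, C, V: Double;
  end;

function ReadCandleTxt(const AFileName: string): TCandle;
var
  SL: TStringList;
  FS: TFormatSettings;
  i: Integer;
  Key: string;
  Num: Double;
begin
  FS := DefaultFormatSettings;
  FS.DecimalSeparator := '.';          // the sample files use '.' as decimal separator
  FillChar(Result, SizeOf(Result), 0);
  SL := TStringList.Create;
  try
    SL.LoadFromFile(AFileName);        // e.g. 'somecandle.txt' (hypothetical)
    for i := 0 to SL.Count - 1 do
    begin
      // Trim both parts because the lines look like "H = 1200" (spaces around '=')
      Key := Trim(SL.Names[i]);
      Num := StrToFloatDef(Trim(SL.ValueFromIndex[i]), 0, FS);
      case Key of
        'H': Result.H := Num;
        'O': Result.O := Num;
        'L': Result.L := Num;
        'C': Result.C := Num;
        'V': Result.V := Num;
      end;
    end;
  finally
    SL.Free;
  end;
end;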
Title: Re: Best reading method
Post by: gii on March 08, 2021, 02:45:20 pm
Hi Lucamar,

TIniFile (https://freepascal.org/docs-html/current/fcl/inifiles/tinifile.html) will probably be the simplest and, if not the fastest, it should be fast enough even for 5000 files.

The OP says that he's reading 5,000 (5K) files after filtering down from 50,000,000 (50M). TIniFile is good in a pinch, but with 50M sections and no indexes it's gonna hurt a lot.
Hence my proposing the simplest SQL engine possible, to migrate from a set of files to something just short of a full-blown database server.
Hey, even Firebird would be an improvement, right?

Cheers,
Gus

I confess that, from past experience, I am very wary of continuing to use a typical database.

Currently, I have a SQLite database (with date information only) that helps me filter the files. This means that I don't need to open them all and filter them out. Only the necessary files will be opened.

Complexity is not an obstacle. I would just like to do that as quickly as possible.

You talked of having to read some 5000 files out of several millions; no matter what and how you do it, that is your bottleneck: a lot of files (~5000) to read and parse. That's what I meant, and in those circumstances you might as well use TIniFile to simplify your code since it will not add significantly to the time spent, much less if each file is really as small as you said.

Or you could use TStringList: it will swallow each file in a jiffy and has methods and properties to deal with "key=value" pairs like yours.

The question, really, is that your program will probably spend more time in the file system (seeking each file, reading it, etc.) than doing whatever needs to be done with the data.

I will run benchmarks and post the results later.
Title: Re: Best reading method
Post by: Gustavo 'Gus' Carreno on March 08, 2021, 02:48:51 pm
Until recently, this data was saved in a database (MySQL).

I am still thinking about using a TSDB (a time-series database).

Currently, I have a SQLite database (with date information only) that helps me filter the files. This means that I don't need to open them all and filter them out. Only the necessary files will be opened.

First of all, let's address the HUMONGOUS elephant in the room: you have entries in the tens of millions, growing into the hundreds of millions.
Unless you throw very expensive hardware at it, you'll never overcome the sheer amount of IO involved.

Then you said you downgraded from a MySQL db. Can you post your schema so we can see if the columns and indexes were optimized?

Lastly, it looks like you are filtering by date, so I'm guessing you want to make a graph or the like with a chunk of your data between dates.
Can I suggest having a table like this:
Code: SQL
CREATE TABLE `data_by_dates` (
  `date_time` DATETIME,
  `data_index` INT,
  PRIMARY KEY (`date_time`)
);

That data_index points to the table that contains the data mentioned in your files, but that table has no DateTime field, just the numerical data.
That way you are still dealing with 500,000,000 (500M) entries, but with the smallest footprint possible.

Again, just my 2c.

Cheers,
Gus
Title: Re: Best reading method
Post by: Gustavo 'Gus' Carreno on March 08, 2021, 02:57:16 pm
Hey gii,

Another thing I'm worried about is whether you'll hit the limit of files per folder.
I would have to check for NTFS and the Linux extN filesystems, but I think there is a limit. I think I stumbled on a post about this once, but I'm not sure.

Cheers,
Gus
Title: Re: Best reading method
Post by: gii on March 08, 2021, 03:43:42 pm
Hey gii,

Another thing I'm worried about is whether you'll hit the limit of files per folder.
I would have to check for NTFS and the Linux extN filesystems, but I think there is a limit. I think I stumbled on a post about this once, but I'm not sure.

Cheers,
Gus

I organized the files so that it wouldn't be a problem.

--

I am saving stock market candles.

There are ~25,000 instruments, each with different chart timeframes (daily, 30 minutes, etc.).

Analyzing the whole problem, I think the best solution is to create several SQLite databases, split per instrument / timeframe.

For example:

Apple and IBM.

Split into 4 databases:

APPLE_DAILY.db3
APPLE_30MIN.db3

IBM_DAILY.db3
IBM_30MIN.db3
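For each of these, roughly what I have in mind is something like the following (the schema and the table/column names are still tentative, just to show the access pattern):

Code: Pascal
uses
  SysUtils, sqldb, sqlite3conn;

// Tentative: open one instrument/timeframe database and fetch a date range.
procedure QueryCandles(const ADbFile: string; AFrom, ATo: TDateTime);
var
  Conn: TSQLite3Connection;
  Trans: TSQLTransaction;
  Query: TSQLQuery;
begin
  Conn := TSQLite3Connection.Create(nil);
  Trans := TSQLTransaction.Create(nil);
  Query := TSQLQuery.Create(nil);
  try
    Conn.DatabaseName := ADbFile;        // e.g. 'APPLE_DAILY.db3'
    Conn.Transaction := Trans;
    Query.DataBase := Conn;
    Query.Transaction := Trans;
    Query.SQL.Text :=
      'SELECT date_time, o, h, l, c, v FROM candles ' +
      'WHERE date_time BETWEEN :from_dt AND :to_dt ORDER BY date_time';
    Query.Params.ParamByName('from_dt').AsDateTime := AFrom;
    Query.Params.ParamByName('to_dt').AsDateTime := ATo;
    Conn.Open;
    Query.Open;
    while not Query.EOF do
    begin
      // ... use Query.FieldByName('c').AsFloat etc.
      Query.Next;
    end;
  finally
    Query.Free;
    Trans.Free;
    Conn.Free;
  end;
end;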
Title: Re: Best reading method
Post by: Warfley on March 08, 2021, 03:58:06 pm
The biggest performance issue you have is that you have so many files. Reading thousands of small files is much slower than reading one big file, so that is probably something to keep in mind. It should also be noted that the Linux ext3 and ext4 filesystems get really slow when they have to deal with that many files in a single directory.

Then, if you want maximum performance, you should probably use a binary file format, e.g. a file of record or a TFileStream reading records. Note, though, that this can massively impact maintainability when you later change the format, and it requires much more portability code.
And if you are really fancy, you map the file into memory using mmap.

So this is how I would do it if performance is the main concern: a single file, consisting of a header and the content. Header format:
Code: Text
| MAGIC NUMBER (32 bit) | Version Number (32 bit) | Header Length (64 bit) |
| Filename 1 (24 byte shortstring)                | Offset (64 bit)        |
| Filename 2 (24 byte shortstring)                | Offset (64 bit)        |
| Filename 3 (24 byte shortstring)                | Offset (64 bit)        |
|                                ...                                       |
Where Offset is the offset of the content of that file within this file.
So when reading the file, you first check the magic number and version number. If both match a supported value, you read the header length, then read all the header entries from that point up to the header length and build a dictionary mapping filenames to offsets. (The 24-byte shortstring is just an arbitrary length; you can choose any length you want, but too long is just a waste of space.)

If you then want to read a file, all you need to do is look up the offset and then load the contents of that file at that offset into a record. If you map the file into memory using mmap, this is pretty much just a single pointer operation.
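To make that concrete, here is a rough sketch with a plain TFileStream (the field sizes, the magic value and the OHLCV payload record are only assumptions; a real reader would load the whole header into a dictionary once instead of scanning it per lookup, and mmap would replace the ReadBuffer calls):

Code: Pascal
uses
  Classes, SysUtils;

const
  MAGIC   = $43414E44;       // arbitrary example value
  VERSION = 1;

type
  THeaderEntry = packed record
    FileName: string[23];    // 24 bytes total: length byte + up to 23 chars
    Offset: QWord;           // 64-bit offset of the "file" content
  end;

  TCandleData = packed record
    H, O, L, C, V: Double;   // assumed payload, 40 bytes
  end;

function ReadEntry(FS: TFileStream; const AName: string; out Data: TCandleData): Boolean;
var
  Magic, Ver: LongWord;
  HeaderLen: QWord;
  Entry: THeaderEntry;
begin
  Result := False;
  FS.Position := 0;
  Magic := FS.ReadDWord;
  Ver := FS.ReadDWord;
  if (Magic <> MAGIC) or (Ver <> VERSION) then
    Exit;
  HeaderLen := FS.ReadQWord;
  // Scan the header entries; a real implementation would do this once
  // and keep filename -> offset in a hash map.
  while FS.Position < HeaderLen do
  begin
    FS.ReadBuffer(Entry, SizeOf(Entry));
    if Entry.FileName = AName then
    begin
      FS.Position := Entry.Offset;
      FS.ReadBuffer(Data, SizeOf(Data));
      Exit(True);
    end;
  end;
end;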
Title: Re: Best reading method
Post by: Gustavo 'Gus' Carreno on March 08, 2021, 04:47:12 pm
Hi gli,

Let me rant a bit about what I'm understanding about your predicament.

From what I can gather your specs are:

So:

What Warfley is proposing is what I was about to propose about 10 minutes ago, and solves Step 2 quite well.

What made me rethink this is the fact that, in order to have the fastest Step 1, you would need a balanced binary tree, you know, like the ones MySQL implements for its indexes.
Doing that yourself is gonna be a real pain in the arse and (I don't believe I'm about to say this) why re-invent the wheel here?

So, even if you implement a blazing fast solution for Step 2, you're still treading water in Step 1.

Now, you're proposing splitting your data into smaller chunks.
Well, that's absolutely the best way to go, and I would advise doing it with a system that can solve Step 1 with balanced binary trees (AVL trees, if I'm not mistaken).

My proposal is the following:

Using this method, you are slicing the data into many smaller chunks, just like a binary tree does when searching.
You'll still have the same amount of data in total, but now it's a bit more manageable for any database engine.
And you want that under a piece of software that has been tested for this type of use.
Anything else is just going back to methods that were discarded in favour of databases.

Cheers,
Gus
Title: Re: Best reading method
Post by: MarkMLl on March 08, 2021, 05:34:52 pm
50 million records in a half-competent database is absolutely nothing. I can't speak for MSSQL, but in the case of PostgreSQL you can ask it how it's tackling a query and hence identify anywhere it's locating one or more matching rows by doing a sequential scan rather than using an index.

My suggestion would be to revisit the database situation, adding indexes etc. as necessary since the database maintainers will have addressed all issues of file count etc.

MarkMLl
Title: Re: Best reading method
Post by: ASBzone on March 08, 2021, 06:30:54 pm
As others have said, pursuing this from a file system perspective is going to be burdensome, and won't be any faster than a database.

A slow database is typically an unoptimized database, so the issue is far more likely to be the organization of the data and the overall architecture of the solution you are building.

If you're open to solving the overall problem, it will help to let folks know what the files are about and what you're trying to do with them.  You'll get great advice.

Otherwise, you've gotten some advice for how to handle what you are doing today, but expect that there will still be performance issues -- lots of them, with that growth projection.
Title: Re: Best reading method
Post by: ASBzone on March 08, 2021, 06:32:01 pm
50 million records in a half-competent database is absolutely nothing. I can't speak for MSSQL, but in the case of PostgreSQL you can ask it how it's tackling a query and hence identify anywhere it's locating one or more matching rows by doing a sequential scan rather than using an index.

My suggestion would be to revisit the database situation, adding indexes etc. as necessary since the database maintainers will have addressed all issues of file count etc.

MarkMLl

I can speak for MSSQL, and I concur with all that has been said here.  Architecture and structure is key.
Title: Re: Best reading method
Post by: Kays on March 08, 2021, 06:57:12 pm
[…]What is the fastest way to read these files?
[…]
Or some other way ...
The “best” would be not to read so many files. Ideally, you wouldn’t read any files.

The “fastest” way will be optimized for your anticipated access patterns, there’s no panacea for that. The general-purpose solution that’s already been mentioned several times is using a DBMS.

Personally, I’d prefer “typed files (https://wiki.freepascal.org/typed_files)” using some record structure, just to keep it simple, you know. If you’re using spinning disks, squeeze as many records into one sector as you can (traditionally that has been 512 bytes). [On SSDs it’s primarily about conserving space, not speed.]
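Just to illustrate (the record layout and file name are made up for the example):

Code: Pascal
program readone;
{$mode objfpc}{$H+}

type
  TCandleRec = packed record
    H, O, L, C, V: Double;   // 5 * 8 = 40 bytes, so ~12 records fit in a 512-byte sector
  end;

var
  F: file of TCandleRec;
  Rec: TCandleRec;
begin
  AssignFile(F, 'XAG_30MIN.dat');  // hypothetical file name
  Reset(F);
  try
    Seek(F, 42);                   // jump straight to record #42: no parsing needed
    Read(F, Rec);
    WriteLn('Close: ', Rec.C:0:2);
  finally
    CloseFile(F);
  end;
end.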

[…] Currently, I have a SQLite database (with date information only) that helps me filter the files. This means that I don't need to open them all and filter them out. Only the necessary files will be opened. […]
You could store such information via path components, by using a directory tree that looks like:
Code: Text
2021             [year]
  ├─ 02          [month]
  │  ├─ 27       [day]
  │  │  └─ XAG   [silver]
  │  │     ├─ 13 [a file for silver prices from 13:00Z to 13:59Z]
  │  │     ├─ 14
  │  │     └─ 15
  │  └─ 28
  └─ 03
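Picking the right file then becomes mere string formatting, something like (illustrative only; separators and naming are of course up to you):

Code: Pascal
uses
  SysUtils;

// Build the relative path for an instrument and a point in time,
// following the tree above, e.g. '2021/02/27/XAG/13'.
function CandlePath(const ASymbol: string; AWhen: TDateTime): string;
begin
  Result := FormatDateTime('yyyy"/"mm"/"dd"/"', AWhen) + ASymbol +
            FormatDateTime('"/"hh', AWhen);
end;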

[…]
Unless you throw very expensive hardware at it, you'll never overcome the sheer amount of IO involved.
Well, (concerning spinning disks) one can set up a striping RAID. That’s not that expensive (nowadays).
Title: Re: Best reading method
Post by: MarkMLl on March 08, 2021, 07:35:14 pm
A slow database is typically an unoptimized database

Or one where the queries can't be satisfied using an index. I've seen that; I think now-relative dates were involved, and I fixed it by defining part of the query as a function whose result was invariant and so could be cached.

MarkMLl
Title: Re: Best reading method
Post by: balazsszekely on March 08, 2021, 08:33:42 pm
Luckily I have a few monsters in my possession, so I ran a few tests. A full traverse of an indexed table (Firebird database, ~43,000,000 records) took approximately 5 minutes. The CPU usage never went above 50%, but memory consumption rapidly rose to 16 GB.

If I filter the table by some criterion, ~5000 records between two dates for example, the indexed read is almost instantaneous. So a database is a good choice in my opinion.
Title: Re: Best reading method
Post by: MarkMLl on March 08, 2021, 09:01:40 pm
I've just checked a 2.7 million row table in a PostgreSQL database running live (i.e. will be heavily cached), and getting the minimum or maximum value of a specified column takes roughly a second. Asking for an arbitrary day's worth of data (I used 2010-01-01) gets a response which is effectively instantaneous.

MarkMLl
Title: Re: Best reading method
Post by: Warfley on March 08, 2021, 09:19:56 pm
In general, a DBMS is always a pretty solid solution. Relational databases like MySQL/MariaDB, Postgres or MSSQL are great at optimizing general (and very complex) queries, indexing and interlinking data. That said, while being the best general-purpose solution, a DBMS is usually not a great specific solution. So if you have a lot of homogeneous data that you always access the same way, a handcrafted solution will probably be faster.

A database is pretty often much easier to create and maintain than a handwritten solution, so the question pretty much boils down to: is the database fast enough? If it is, using a database is in such situations almost always preferable.

That said, I have already been in the situation of rewriting a whole program that had a custom format to use SQLite, in order to allow more general queries, only to find that it was simply too slow to be usable. I reverted to my own data structures, which, while fast enough, mean that many features I could have built with SQLite are simply not possible, because my data structures are too static for them (or, to put it another way, they would require so much more effort that they are pretty much not worth it).
Sometimes the only way to find out is to try it and fail. But if the DBMS works out well enough, I highly recommend it over creating your own data format; it saves you a lot of stress.
Title: Re: Best reading method
Post by: balazsszekely on March 08, 2021, 09:34:18 pm
@MarkMLI

Quote
I've just checked a 2.7 million row table in a PostgreSQL database running live (i.e. will be heavily cached), and getting the minimum or maximum value of a specified column takes roughly a second. Asking for an arbitrary day's worth of data (I used 2010-01-01) gets a response which is effectively instantaneous.
Nice!

Just to be clear, I got 5 minutes from Lazarus, running this:
Code: Pascal
var
  Time: QWord;
  RecCnt: Int64;
begin
  RecCnt := 0;
  Time := GetTickCount64;
  qTest.DisableControls;
  try
    qTest.First;
    while not qTest.EOF do
    begin
      RecCnt := RecCnt + 1;
      qTest.Next;
    end;
  finally
    qTest.EnableControls;
  end;
  Time := GetTickCount64 - Time;
  ShowMessage('Record count: ' + IntToStr(RecCnt) + sLineBreak +
              'Time: ' + IntToStr(Time div 1000) + ' sec');
end;

The purpose of the above code was not record counting, I was curious how long it takes to traverse the whole table. The operations you described are almost instantaneous at my side too.
Title: Re: Best reading method
Post by: MarkMLl on March 08, 2021, 10:15:17 pm
@GetMem Historical note. Back in about 2000 I needed a relational database which (a) had proper transaction support, (b) didn't sound too tacky when we needed to discuss it with mainframe-using corporates, and (c) didn't require that we mortgage our soul or tell the supplier too much. MySQL failed (a) and (b); the winner was the SOLID database ("uses Bonsai-tree technology"), but it was suddenly bought by IBM and its licensing changed. At that point I moved us to PostgreSQL, which was just emerging; it was still semi-research, with e.g. support for tertiary (tape) storage implemented by a student who defected to IBM.

IBM themselves gave me bad vibes, they were doing things like assuring me that anything I wrote on a PC could of course be ported to one of their larger systems provided that I used RPG.

Postgres has been ticking over for us since then (I can't comment on our customers but assure you that they touch your life), and with the exception of some breaking changes relating to date formats in the v7 era, upgrades and backup/restores have been painless. Scratch the surface and there's a lot of clever stuff in there, e.g. a way to partition a table so that chunks can be moved onto slower storage.

The examples I used were with Postgres's command-line program; I'd rather not fiddle with anything more custom on a live system. However, I must say that those min() and max() timings impressed me, since the field I used isn't indexed; GOK how the server did it...


xxxxxxxxxx=> explain verbose select max(analogue001) from readings;
QUERY PLAN
----------
 Aggregate  (cost=103624.40..103624.41 rows=1 width=8)
   Output: max(analogue001)
   ->  Seq Scan on public.readings  (cost=0.00..96765.52 rows=2743552 width=8)
...


Deep magic as far as I'm concerned. Anybody who reads The Literature can write a compiler, and there are around 1000 programmers capable of writing an efficient linker, but a decent relational database takes real skill.

MarkMLl
Title: Re: Best reading method
Post by: gii on March 09, 2021, 01:53:21 am
There are many answers to quote, but I read them all.

I have always preferred to use a DBMS. Indexing the tables helped me a lot in filtering data. However, deleting and changing some records took too long. My database was corrupted twice, because of my impatience.

When I decided to save the data directly to disk, I was very satisfied with the performance (reads took about 0.2 seconds on a hard disk). However, I would like to know whether there is a better way to read this data; hence this topic.

My queries are not very advanced. Basically I search for records within a range of dates, in addition to looking at some records prior to the first date of the filter. As I mentioned before, using a DBMS, the biggest problem was when updating / deleting data.

At the moment, splitting the data into several SQLite databases is working very well.

Someone suggested having an auxiliary table whose primary key would be a date, with another field pointing to the "true record". Rest assured that I thought about that possibility a lot; however, at the moment I think it is not necessary.

Despite having ~25,000 databases, my tables have a maximum of 300,000 records. When I run a query, I look at only one instrument / DB (although it would be much more comfortable to have a single database, either with one table and millions of records, or with 25,000 tables of about 200,000 records each).

I honestly do not know if the solution adopted by me can be seen as "good programming practice". I would be very happy to read more opinions about it.
Title: Re: Best reading method
Post by: MarkMLl on March 09, 2021, 08:57:38 am
My queries are not very advanced. Basically I search for records within a range of dates, in addition to looking at some records prior to the first date of the filter. As I mentioned before, using a DBMS, the biggest problem was when updating / deleting data.

I presume you're aware that it's common to temporarily drop indexes before doing a bulk addition/deletion. Apart from that it should be possible to do a lot of that sort of thing as background batch operations, with appropriate choice of isolation level.
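Schematically, something like this (the index, table and column names are invented for the example, the exact DDL varies per engine, and Query/Trans are assumed to be wired up in the usual sqldb way):

Code: Pascal
uses
  SysUtils, sqldb;

// Sketch only: drop the index, do the bulk work, rebuild the index once at the end.
procedure BulkDeleteOldCandles(Query: TSQLQuery; Trans: TSQLTransaction; ACutoff: TDateTime);
begin
  // 1. Drop the index so the bulk delete doesn't maintain it row by row
  Query.SQL.Text := 'DROP INDEX idx_candles_date';
  Query.ExecSQL;

  // 2. The bulk operation itself
  Query.SQL.Text := 'DELETE FROM candles WHERE date_time < :cutoff';
  Query.Params.ParamByName('cutoff').AsDateTime := ACutoff;
  Query.ExecSQL;

  // 3. Rebuild the index once, at the end
  Query.SQL.Text := 'CREATE INDEX idx_candles_date ON candles (date_time)';
  Query.ExecSQL;

  Trans.Commit;
end;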

MarkMLl