
Author Topic: Best reading method  (Read 4172 times)

gii

  • Jr. Member
  • **
  • Posts: 52
Best reading method
« on: March 08, 2021, 01:53:32 pm »
I have several files (~50 million, each with ~66 bytes).

Important details:

1. There is a filter before reading all the files. With each request, ~5,000 files will be read.

2. The files have a standard structure:

Something like:
H = 1200
O = 1947.47
L = 1950.87
C = 1200
V = 78965656.99

Only the values change from file to file.

3. If necessary, I can change the way I write these files.

What is the fastest way to read these files?

Typed files
BufStream
Or some other way ...
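For reference, the "typed file" route mentioned above could look something like this minimal sketch. The filename is hypothetical, and it assumes each of the five values is stored as a Double, giving fixed-size 40-byte records:

```pascal
program ReadCandle;
{$mode objfpc}

type
  // Field names taken from the sample above; one fixed-size record per file
  TCandle = packed record
    H, O, L, C, V: Double;   // 5 x 8 bytes = 40 bytes
  end;

var
  F: file of TCandle;
  Candle: TCandle;
begin
  AssignFile(F, 'candle.dat');   // hypothetical filename
  Reset(F);
  try
    Read(F, Candle);
    WriteLn('O = ', Candle.O:0:2);
  finally
    CloseFile(F);
  end;
end.
```

A binary record like this avoids text parsing entirely, at the cost of the files no longer being human-readable.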

Gustavo 'Gus' Carreno

  • Sr. Member
  • ****
  • Posts: 460
  • Professional amateur ;-P
Re: Best reading method
« Reply #1 on: March 08, 2021, 02:09:03 pm »
Hi gii,

Why does it have to be files?

Can you not use an embedded SQLite 3 database?
It would take away all the pain of filtering and parsing those files.

Just my 2c.

Cheers,
Gus
Lazarus 2.1.0(trunk) FPC 3.3.1(trunk) Ubuntu 20.10 64b Dark Theme
Lazarus 2.0.12(stable) FPC 3.2.0(stable) Ubuntu 20.10 64b Dark Theme
http://github.com/gcarreno

lucamar

  • Hero Member
  • *****
  • Posts: 3790
Re: Best reading method
« Reply #2 on: March 08, 2021, 02:10:58 pm »
TIniFile will probably be the simplest, and if not the fastest it should be fast enough even for 5000 files.

With such a large number of files your bottleneck will always be IO and you can't avoid it no matter what, since you'll always have to read each file and parse it, which is the part TIniFile can do for you.
Turbo Pascal 3 CP/M - Amstrad PCW 8256 (512 KB !!!) :P
Lazarus/FPC 2.0.8/3.0.4 & 2.0.12/3.2.0 - 32/64 bits on:
(K|L|X)Ubuntu 12..18, Windows XP, 7, 10 and various DOSes.

gii

  • Jr. Member
  • **
  • Posts: 52
Re: Best reading method
« Reply #3 on: March 08, 2021, 02:26:21 pm »
Hi gii,

Why does it have to be files?

Can you not use an embedded SQLite 3 database?
It would take away all the pain of filtering and parsing those files.

Just my 2c.

Cheers,
Gus

Until recently, this data was saved in a database (MySQL).

However, as the database grew, it became very slow.

My first idea was to use MongoDB. I had two problems.

1. The limit of 16 MB per document.
2. High RAM consumption.

Now, I am trying to save this data directly to disk.

I still think about using a TSDB.

The prospect is that, in a short time, I will have 552 million records.

TIniFile will probably be the simplest, and if not the fastest it should be fast enough even for 5000 files.

With such a large number of files your bottleneck will always be IO and you can't avoid it no matter what, since you'll always have to read each file and parse it, which is the part TIniFile can do for you.

Currently, I have a SQLite database (with date information only) that helps me filter the files. This means that I don't need to open them all and filter them out. Only the necessary files will be opened.

Complexity is not an obstacle. I would just like to do that as quickly as possible.

Gustavo 'Gus' Carreno

  • Sr. Member
  • ****
  • Posts: 460
  • Professional amateur ;-P
Re: Best reading method
« Reply #4 on: March 08, 2021, 02:27:27 pm »
Hi Lucamar,

TIniFile will probably be the simplest, and if not the fastest it should be fast enough even for 5000 files.

The OP says that he's reading 5,000 (5K) files after filtering down from 50,000,000 (50M). TIniFile is good in a pinch, but with 50M sections and no indexes it's gonna hurt a lot.
Hence me proposing the simplest SQL engine possible, to migrate from a set of files to something just short of a full-blown database server.
Hey, even Firebird would be an improvement, right?

Cheers,
Gus

lucamar

  • Hero Member
  • *****
  • Posts: 3790
Re: Best reading method
« Reply #5 on: March 08, 2021, 02:33:55 pm »
Currently, I have a SQLite database (with date information only) that helps me filter the files. This means that I don't need to open them all and filter them out. Only the necessary files will be opened.

Complexity is not an obstacle. I would just like to do that as quickly as possible.

You talked of having to read some 5000 files out of several millions; no matter what and how you do it, that is your bottleneck: a lot of files (~5000) to read and parse. That's what I meant, and in those circumstances you might as well use TIniFile to simplify your code since it will not add significantly to the time spent, much less if each file is really as small as you said.

Or you could use TStringList: it will swallow each file in a jiffy and has methods and properties to deal with "key=value" pairs like yours.

The question, really, is that your program will probably spend more time in the file system (seeking each file, reading it, etc.) than doing whatever needs to be done with the data.
« Last Edit: March 08, 2021, 02:39:22 pm by lucamar »
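As a rough sketch of that TStringList approach: the filename below is hypothetical, and StrToFloat assumes the decimal separator in the files matches the current locale settings. Since the sample files pad keys and values with spaces ("O = 1947.47"), both sides need a Trim:

```pascal
program ParseCandle;
{$mode objfpc}
uses Classes, SysUtils;

var
  SL: TStringList;
  i: Integer;
  Key: string;
  Value: Double;
begin
  SL := TStringList.Create;
  try
    SL.LoadFromFile('candle.txt');   // hypothetical filename
    for i := 0 to SL.Count - 1 do
    begin
      // Names/ValueFromIndex split each "key=value" line;
      // Trim strips the padding spaces around '='
      Key := Trim(SL.Names[i]);
      Value := StrToFloat(Trim(SL.ValueFromIndex[i]));
      WriteLn(Key, ' -> ', Value:0:2);
    end;
  finally
    SL.Free;
  end;
end.
```

Reusing one TStringList instance across all ~5000 files avoids repeated allocation, though as noted above the file-system overhead will still dominate.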

gii

  • Jr. Member
  • **
  • Posts: 52
Re: Best reading method
« Reply #6 on: March 08, 2021, 02:45:20 pm »
Hi Lucamar,

TIniFile will probably be the simplest, and if not the fastest it should be fast enough even for 5000 files.

The OP says that he's reading 5,000 (5K) files after filtering down from 50,000,000 (50M). TIniFile is good in a pinch, but with 50M sections and no indexes it's gonna hurt a lot.
Hence me proposing the simplest SQL engine possible, to migrate from a set of files to something just short of a full-blown database server.
Hey, even Firebird would be an improvement, right?

Cheers,
Gus

I confess that, from past experience, I am very afraid of continuing to use a typical database.

Currently, I have a SQLite database (with date information only) that helps me filter the files. This means that I don't need to open them all and filter them out. Only the necessary files will be opened.

Complexity is not an obstacle. I would just like to do that as quickly as possible.

You talked of having to read some 5000 files out of several millions; no matter what and how you do it, that is your bottleneck: a lot of files (~5000) to read and parse. That's what I meant, and in those circumstances you might as well use TIniFile to simplify your code since it will not add significantly to the time spent, much less if each file is really as small as you said.

Or you could use TStringList: it will swallow each file in a jiffy and has methods and properties to deal with "key=value" pairs like yours.

The question, really, is that your program will probably spend more time in the file system (seeking each file, reading it, etc.) than doing whatever needs to be done with the data.

I will run benchmarks and post the results later.

Gustavo 'Gus' Carreno

  • Sr. Member
  • ****
  • Posts: 460
  • Professional amateur ;-P
Re: Best reading method
« Reply #7 on: March 08, 2021, 02:48:51 pm »
Until recently, this data was saved in a database (MySQL).

I still think about using a TSDB.

Currently, I have a SQLite database (with date information only) that helps me filter the files. This means that I don't need to open them all and filter them out. Only the necessary files will be opened.

First of all, let's address the HUMONGOUS elephant in the room: you have entries in the tens of millions, growing into the hundreds of millions.
Unless you throw very expensive hardware at it, you'll never overcome the sheer amount of IO involved.

Then you said you downgraded from a MySQL db. Can you publish your schema, so we can see whether you optimized the columns and the indexes?

Lastly, it looks like you are filtering by date, so I'm guessing you want to make a graph or the like with a chunk of your data between dates.
Can I suggest having a table like this:
Code: SQL
CREATE TABLE `data_by_dates` (
  `date_time` DATETIME,
  `data_index` INT,
  PRIMARY KEY (`date_time`)
);

That data_index points to the table that contains the data from your files, but with no DateTime field, just the numerical data.
That way you are still dealing with ~500,000,000 (500M) entries, but with the smallest footprint possible.

Again, just my 2c.

Cheers,
Gus
« Last Edit: March 08, 2021, 02:53:36 pm by gcarreno »

Gustavo 'Gus' Carreno

  • Sr. Member
  • ****
  • Posts: 460
  • Professional amateur ;-P
Re: Best reading method
« Reply #8 on: March 08, 2021, 02:57:16 pm »
Hey gii,

Another thing I'm worried about is whether you'll hit the limit of files per folder.
I would have to investigate for NTFS and the Linux extN filesystems, but I think there is a limit. I think I stumbled on a post about this once, but I'm not sure.

Cheers,
Gus

gii

  • Jr. Member
  • **
  • Posts: 52
Re: Best reading method
« Reply #9 on: March 08, 2021, 03:43:42 pm »
Hey gii,

Another thing I'm worried about is whether you'll hit the limit of files per folder.
I would have to investigate for NTFS and the Linux extN filesystems, but I think there is a limit. I think I stumbled on a post about this once, but I'm not sure.

Cheers,
Gus

I organized the files so that it wouldn't be a problem.

--

I am saving stock market candles.

There are ~25,000 instruments, each with a different chart timeframe (daily, 30 minutes, etc.).

Analyzing the whole problem, I think the best solution is to create several SQLite databases, split per instrument / timeframe.

For example:

Apple and IBM.

Split into 4 databases.

APPLE_DAILY.db3
APPLE_30MIN.db3

IBM_DAILY.db3
IBM_30MIN.db3
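Querying one of those per-instrument databases from FPC could look like the sketch below, using the standard sqldb units. The database filename follows the naming above; the `candles` table and its column names are assumptions:

```pascal
program QuerySplitDB;
{$mode objfpc}
uses SysUtils, sqldb, sqlite3conn;

var
  Conn: TSQLite3Connection;
  Trans: TSQLTransaction;
  Query: TSQLQuery;
begin
  Conn  := TSQLite3Connection.Create(nil);
  Trans := TSQLTransaction.Create(nil);
  Query := TSQLQuery.Create(nil);
  try
    // One small database per instrument/timeframe keeps each file,
    // and its indexes, small
    Conn.DatabaseName := 'APPLE_DAILY.db3';
    Conn.Transaction  := Trans;
    Query.DataBase    := Conn;
    Conn.Open;
    // Hypothetical schema: candles(date_time, o, h, l, c, v)
    Query.SQL.Text :=
      'SELECT * FROM candles WHERE date_time BETWEEN :a AND :b';
    Query.Params.ParamByName('a').AsDateTime := EncodeDate(2021, 1, 1);
    Query.Params.ParamByName('b').AsDateTime := EncodeDate(2021, 3, 8);
    Query.Open;
    while not Query.EOF do
    begin
      WriteLn(Query.FieldByName('c').AsFloat:0:2);
      Query.Next;
    end;
  finally
    Query.Free;
    Trans.Free;
    Conn.Free;
  end;
end.
```

With an index (or primary key) on date_time, the per-database date filter replaces the external SQLite "filter" database mentioned earlier.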

Warfley

  • Sr. Member
  • ****
  • Posts: 435
Re: Best reading method
« Reply #10 on: March 08, 2021, 03:58:06 pm »
The biggest performance issue you have is that you have so many files. Reading thousands of small files is much slower than reading one big file, so that's something to keep in mind. It should also be noted that Linux ext3 and ext4 filesystems get really slow when they have to deal with that many files in a single directory.

Then, if you want maximum performance, you should probably use a binary file format, e.g. a file of record, or a TFileStream reading records. Note, however, that this can massively impact maintainability when you later change the format, and it requires much more portability code.
And if you are really fancy, you can map the file into memory using mmap.

So this is how I would do it if performance is the main concern: a single file, consisting of a header and the content. Header format:
Code:
| MAGIC NUMBER (32 bit) | Version Number (32 bit) | Header Length (64 bit) |
| Filename 1 (24 byte shortstring)                | Offset (64 bit)        |
| Filename 2 (24 byte shortstring)                | Offset (64 bit)        |
| Filename 3 (24 byte shortstring)                | Offset (64 bit)        |
|                                ...                                       |
where Offset is the offset of that file's content within this file.
So when reading the file, you first check the magic number and the version number. If both match a supported value, you read the header length, then read all the header information from that point up to the header length, building up a dictionary mapping filenames to offsets. (The 24-byte shortstring is just an arbitrary length; you can choose any length you want, but too long will just waste space.)

If you then want to read a file, all you need to do is look up its offset and then load the contents at that position into a record. If you map your file into memory using mmap, this is pretty much just a single pointer operation.
« Last Edit: March 08, 2021, 04:01:13 pm by Warfley »
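A sketch of reading such a packed file with TFileStream follows. The magic value, the pack filename, and the candle record layout are placeholders, and the header-entry loop is abbreviated (it would normally fill a filename-to-offset map):

```pascal
program ReadPacked;
{$mode objfpc}
uses Classes, SysUtils;

const
  EXPECTED_MAGIC   = $43414E44;  // hypothetical magic number
  EXPECTED_VERSION = 1;

type
  THeaderEntry = packed record
    FileName: string[23];        // 24-byte shortstring, as suggested above
    Offset: QWord;               // 64-bit position of this entry's data
  end;
  TCandle = packed record        // field names taken from the OP's files
    H, O, L, C, V: Double;
  end;

var
  FS: TFileStream;
  Magic, Version: LongWord;
  HeaderLen: QWord;
  Entry: THeaderEntry;
  Candle: TCandle;
begin
  FS := TFileStream.Create('candles.pack', fmOpenRead or fmShareDenyWrite);
  try
    Magic   := FS.ReadDWord;
    Version := FS.ReadDWord;
    if (Magic <> EXPECTED_MAGIC) or (Version <> EXPECTED_VERSION) then
      raise Exception.Create('Unsupported file format');
    HeaderLen := FS.ReadQWord;
    // Read header entries until HeaderLen bytes are consumed
    while FS.Position < HeaderLen do
    begin
      FS.ReadBuffer(Entry, SizeOf(Entry));
      // ... store Entry.FileName -> Entry.Offset in a map ...
    end;
    // To read one "file": seek to its offset and read the record
    FS.Seek(Int64(Entry.Offset), soBeginning);
    FS.ReadBuffer(Candle, SizeOf(Candle));
    WriteLn('O = ', Candle.O:0:2);
  finally
    FS.Free;
  end;
end.
```

Note that packed records with fixed-width fields keep the on-disk layout predictable, but endianness and alignment still need care if the file is shared across platforms.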

Gustavo 'Gus' Carreno

  • Sr. Member
  • ****
  • Posts: 460
  • Professional amateur ;-P
Re: Best reading method
« Reply #11 on: March 08, 2021, 04:47:12 pm »
Hi gii,

Let me rant a bit about what I'm understanding about your predicament.

From what I can gather your specs are:
  • Upon a determined date/time interval
  • Select a chunk of data
  • From stock market companies

So:
  • Step 1: Filter universe data by date
  • Step 2: Retrieve a subset of the universe data

What Warfley is proposing is what I was about to propose about 10 minutes ago, and solves Step 2 quite well.

What made me rethink this is the fact that, in order to have the fastest Step 1, you would need a balanced binary tree, you know, like the ones MySQL implements for its indexes.
Doing that yourself is gonna be a real pain in the arse and (I don't believe I'm about to say this) why re-invent the wheel here?

So, even if you implement a blazing-fast solution for Step 2, you're still treading water in Step 1.

Now, you're proposing splitting your data into smaller chunks.
Well, that's absolutely the best way to go, and I would advise doing it with a system that can solve Step 1 with balanced binary trees (AVL trees, if I'm not mistaken).

My proposal is the following:
  • For all the companies, create a table with the name of the company (Apple) and a table-name prefix (apple_)
  • For each company, create a set of tables, one per graph time: apple_daily, apple_30min, etc., with only 2 columns: date/time and a pointer into the points-value table, where the primary key is the date/time column
  • For each graph time, create a table to hold the points-value data: apple_daily_data, apple_30min_data, containing an index that goes into apple_daily and the other data points

Using this method, you are slicing the data into many smaller chunks, just like a binary tree does when searching.
You'll then have the same amount of data in total, but now a bit more manageable for any database engine.
And you want that under a piece of software that has been tested for this type of use.
Anything you do yourself would just be going back to methods that were discarded in favour of databases.

Cheers,
Gus

MarkMLl

  • Hero Member
  • *****
  • Posts: 2427
Re: Best reading method
« Reply #12 on: March 08, 2021, 05:34:52 pm »
50 million records in a half-competent database is absolutely nothing. I can't speak for MSSQL, but in the case of PostgreSQL you can ask it how it's tackling a query, and hence identify anywhere it's locating one or more matching rows by doing a sequential scan rather than using an index.

My suggestion would be to revisit the database situation, adding indexes etc. as necessary since the database maintainers will have addressed all issues of file count etc.

MarkMLl
Turbo Pascal v1 on CCP/M-86, multitasking with LAN and graphics in 128Kb.
Pet hate: people who boast about the size and sophistication of their computer.
GitHub repositories: https://github.com/MarkMLl?tab=repositories

ASBzone

  • Hero Member
  • *****
  • Posts: 614
  • Automation leads to relaxation...
    • Free Console Utilities for Windows (and a few for Linux) from BrainWaveCC
Re: Best reading method
« Reply #13 on: March 08, 2021, 06:30:54 pm »
As others have said, pursuing this from a file system perspective is going to be burdensome, and won't be any faster than a database.

A slow database is typically an unoptimized database, so the issue is far more likely to be the organization of the data and the overall architecture of the solution you are building.

If you're open to solving the overall problem, it will help to let folks know what the files are about and what you're trying to do with them.  You'll get great advice.

Otherwise, you've gotten some advice for how to handle what you are doing today, but expect that there will still be performance issues -- lots of them, with that growth projection.
-ASB: https://www.BrainWaveCC.com/

Lazarus v2.0.13 r64843 / FPC v3.2.1-r49055 (via FpcUpDeluxe) -- Windows 64-bit install w/Win32 and Linux/Arm cross-compiles
Primary System: Windows 10 Pro x64, Version 2009 (Build 19042)
Other Systems: Windows 10 Pro x64, Version 2009 (Build 19042) or greater

ASBzone

  • Hero Member
  • *****
  • Posts: 614
  • Automation leads to relaxation...
    • Free Console Utilities for Windows (and a few for Linux) from BrainWaveCC
Re: Best reading method
« Reply #14 on: March 08, 2021, 06:32:01 pm »
50 million records in a half-competent database is absolutely nothing. I can't speak for MSSQL, but in the case of PostgreSQL you can ask it how it's tackling a query, and hence identify anywhere it's locating one or more matching rows by doing a sequential scan rather than using an index.

My suggestion would be to revisit the database situation, adding indexes etc. as necessary since the database maintainers will have addressed all issues of file count etc.

MarkMLl

I can speak for MSSQL, and I concur with all that has been said here. Architecture and structure are key.

 
