
Author Topic: Best reading method  (Read 4133 times)

Kays

  • Sr. Member
  • ****
  • Posts: 299
  • Whasup!?
    • KaiBurghardt.de
Re: Best reading method
« Reply #15 on: March 08, 2021, 06:57:12 pm »
[…]What is the fastest way to read these files?
[…]
Or some other way ...
The “best” would be not to read so many files. Ideally, you wouldn’t read any files.

The “fastest” way will be optimized for your anticipated access patterns, there’s no panacea for that. The general-purpose solution that’s already been mentioned several times is using a DBMS.

Personally, I’d prefer “typed files” using some record structure, just to keep it simple. If you’re using spinning disks, squeeze as many records into one sector as you can (traditionally a sector has been 512 bytes). [On SSDs it’s primarily about conserving space, not speed.]
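To sketch what I mean (record layout and file name are made up, obviously — size yours so a whole number of records fits in a sector):

```pascal
program TypedFileDemo;
{$mode objfpc}

uses
  SysUtils;

type
  // Hypothetical layout: 16 bytes per record, so exactly 32 records
  // fit in a traditional 512-byte sector.
  TPriceRec = packed record
    Stamp: Double; // a TDateTime value: days since 1899-12-30
    Price: Double;
  end;

var
  F: file of TPriceRec;
  Rec: TPriceRec;
begin
  AssignFile(F, 'xag.dat');
  Rewrite(F);
  Rec.Stamp := Now;
  Rec.Price := 25.17;
  Write(F, Rec);
  // Fixed-size records give you random access by index, no parsing:
  Seek(F, 0);
  Read(F, Rec);
  CloseFile(F);
  WriteLn('price: ', Rec.Price:0:2);
end.
```

The point is that `Seek(F, N)` jumps straight to record N, so “find record for hour H” becomes arithmetic instead of a scan.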

[…] Currently, I have a SQLite database (with date information only) that helps me filter the files. This means that I don't need to open them all and filter them out. Only the necessary files will be opened. […]
You could store such information via path components, by using a directory tree that looks like:
Code: Text  [Select][+][-]
  1. 2021             [year]
  2.   ├─ 02          [month]
  3.   │  ├─ 27       [day]
  4.   │  │  └─ XAG   [silver]
  5.   │  │     ├─ 13 [a file for silver prices from 13:00Z to 13:59Z]
  6.   │  │     ├─ 14
  7.   │  │     └─ 15
  8.   │  └─ 28
  9.   └─ 03
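Mapping a timestamp onto such a tree is a one-liner; something like this (helper name and `.dat` suffix are my invention). Note the quoted slashes — a bare `/` in `FormatDateTime` would emit the locale’s date separator:

```pascal
uses
  SysUtils;

// Hypothetical helper: maps (symbol, timestamp) to a file in the tree above,
// e.g. data/2021/02/27/XAG/13.dat for prices from 13:00Z to 13:59Z.
function PricePath(const Base, Symbol: string; When: TDateTime): string;
begin
  Result := Base + '/' +
            FormatDateTime('yyyy"/"mm"/"dd', When) + '/' +
            Symbol + '/' +
            FormatDateTime('hh', When) + '.dat';
end;
```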

[…]
Unless you throw very expensive hardware at it, you'll never overcome the sheer amount of IO involved.
Well, (concerning spinning disks) one can set up a striping RAID. That’s not that expensive (nowadays).
Yours Sincerely
Kai Burghardt

MarkMLl

  • Hero Member
  • *****
  • Posts: 2400
Re: Best reading method
« Reply #16 on: March 08, 2021, 07:35:14 pm »
A slow database is typically an unoptimized database

Or one where the queries can't be satisfied using an index. I've seen that; I think now-relative dates were involved, and I fixed it by defining part of the query as a function whose result was invariant and so could be cached.

MarkMLl
Turbo Pascal v1 on CCP/M-86, multitasking with LAN and graphics in 128Kb.
Pet hate: people who boast about the size and sophistication of their computer.
GitHub repositories: https://github.com/MarkMLl?tab=repositories

GetMem

  • Hero Member
  • *****
  • Posts: 3984
Re: Best reading method
« Reply #17 on: March 08, 2021, 08:33:42 pm »
Luckily I have a few monsters in my possession, so I ran a few tests. A full traverse of an indexed table (Firebird database, ~43,000,000 records) took approximately 5 minutes. The CPU usage never went above 50%; memory consumption, however, rapidly rose to 16 GB.

If I filter the table by some criteria, ~5000 records between two dates for example, the indexed read is almost instantaneous. So a database is a good choice in my opinion.
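A date-range filter like that maps directly onto a parameterized query, which is what lets the server use the index. A sketch with SQLdb (the query component, table and column names are made up):

```pascal
// Assumes qPrices: TSQLQuery is already hooked to a connection/transaction;
// 'prices' and 'stamp' are hypothetical names.
qPrices.SQL.Text :=
  'SELECT * FROM prices WHERE stamp BETWEEN :d1 AND :d2 ORDER BY stamp';
qPrices.Params.ParamByName('d1').AsDateTime := EncodeDate(2021, 3, 1);
qPrices.Params.ParamByName('d2').AsDateTime := EncodeDate(2021, 3, 8);
qPrices.Open;
```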

MarkMLl

  • Hero Member
  • *****
  • Posts: 2400
Re: Best reading method
« Reply #18 on: March 08, 2021, 09:01:40 pm »
I've just checked a 2.7 million row table in a PostgreSQL database running live (i.e. will be heavily cached), and getting the minimum or maximum value of a specified column takes roughly a second. Asking for an arbitrary day's worth of data (I used 2010-01-01) gets a response which is effectively instantaneous.

MarkMLl
Turbo Pascal v1 on CCP/M-86, multitasking with LAN and graphics in 128Kb.
Pet hate: people who boast about the size and sophistication of their computer.
GitHub repositories: https://github.com/MarkMLl?tab=repositories

Warfley

  • Sr. Member
  • ****
  • Posts: 426
Re: Best reading method
« Reply #19 on: March 08, 2021, 09:19:56 pm »
In general, a DBMS is always a pretty solid solution. Relational databases like MySQL/MariaDB, Postgres or MSSQL are great at optimizing general (and very complex) queries, indexing and interlinking data. That said, while being the best general-purpose solution, a DBMS is usually not a great specific solution. So if you have a lot of homogeneous data that you always access the same way, a handcrafted solution will probably be faster.

A database is pretty often much easier to set up and maintain than a handwritten solution, so the question pretty much boils down to: is the database fast enough? If so, using a database is in such situations pretty much always preferable.

That said, I was once in the situation of rewriting a whole program from a custom format to SQLite, in order to allow more general queries, only to find that it was simply too slow to be usable. So I reverted to my own data structures, which, while fast enough, are too static for many of the features I could have built with SQLite (or, to put it another way, those features would require so much more effort that they are pretty much not worth it).
Sometimes the only way to find out is to try it and fail. But if the DBMS works out well enough, I can highly recommend it over creating your own data format; it saves you a lot of stress.

GetMem

  • Hero Member
  • *****
  • Posts: 3984
Re: Best reading method
« Reply #20 on: March 08, 2021, 09:34:18 pm »
@MarkMLl

Quote
I've just checked a 2.7 million row table in a PostgreSQL database running live (i.e. will be heavily cached), and getting the minimum or maximum value of a specified column takes roughly a second. Asking for an arbitrary day's worth of data (I used 2010-01-01) gets a response which is effectively instantaneous.
Nice!

Just to be clear, I got 5 minutes from Lazarus, running this:
Code: Pascal  [Select][+][-]
  1. var
  2.   Time: QWord;
  3.   RecCnt: Int64;
  4. begin
  5.   RecCnt := 0;
  6.   Time := GetTickCount64;
  7.   qTest.DisableControls;
  8.   try
  9.     qTest.First;
  10.     while not qTest.EOF do
  11.     begin
  12.       RecCnt := RecCnt + 1;
  13.       qTest.Next;
  14.     end;
  15.   finally
  16.     qTest.EnableControls;
  17.   end;
  18.   Time := GetTickCount64 - Time;
  19.   ShowMessage('Record count: ' + IntToStr(RecCnt) + sLineBreak +
  20.               'Time: ' + IntToStr(Time div 1000) + ' sec');
  21. end;

The purpose of the above code was not record counting, I was curious how long it takes to traverse the whole table. The operations you described are almost instantaneous at my side too.

MarkMLl

  • Hero Member
  • *****
  • Posts: 2400
Re: Best reading method
« Reply #21 on: March 08, 2021, 10:15:17 pm »
@GetMem Historical note. Back in about 2000 I needed a relational database which (a) had proper transaction support, (b) didn't sound too tacky when we needed to discuss it with mainframe-using corporates, and (c) didn't require that we mortgage our soul or tell the supplier too much. MySQL failed (a) and (b), the winner was the SOLID database ("uses Bonsai-tree technology") but it was suddenly bought by IBM and its licensing changed. At that point I moved us to PostgreSQL which was just emerging, it was still semi-research with e.g. support for tertiary (tape) storage implemented by a student who defected to IBM.

IBM themselves gave me bad vibes, they were doing things like assuring me that anything I wrote on a PC could of course be ported to one of their larger systems provided that I used RPG.

Postgres has been ticking over for us since then (I can't comment on our customers, but I assure you that they touch your life), and with the exception of some breaking changes relating to date formats in the v7 era, upgrades and backup/restores have been painless. Scratch the surface and there's a lot of clever stuff in there, e.g. a way to partition a table so that chunks can be moved onto slower storage.

The examples I used were with Postgres's command-line client; I'd rather not fiddle with anything more custom on a live system. However, I must say that those min() and max() timings impressed me, since the field I used isn't indexed. GOK how the server did it...


xxxxxxxxxx=> explain verbose select max(analogue001) from readings;
QUERY PLAN
----------
 Aggregate  (cost=103624.40..103624.41 rows=1 width=8)
   Output: max(analogue001)
   ->  Seq Scan on public.readings  (cost=0.00..96765.52 rows=2743552 width=8)
...


Deep magic as far as I'm concerned. Anybody who reads The Literature can write a compiler, there are around 1000 programmers capable of writing an efficient linker, but a decent relational database takes real skill.

MarkMLl
Turbo Pascal v1 on CCP/M-86, multitasking with LAN and graphics in 128Kb.
Pet hate: people who boast about the size and sophistication of their computer.
GitHub repositories: https://github.com/MarkMLl?tab=repositories

gii

  • Jr. Member
  • **
  • Posts: 52
Re: Best reading method
« Reply #22 on: March 09, 2021, 01:53:21 am »
There are too many answers to quote individually, but I read them all.

I have always preferred to use a DBMS. Indexing the tables helped me a lot in filtering data. However, deleting and changing some records took too long, and my database was corrupted twice because of my impatience.

When I decided to save the data directly to disk, I was very satisfied with the performance (readings were done in about 0.2 seconds on a hard disk). However, I would like to know if there was a better way to read this data, that's why the topic.

My queries are not very advanced. Basically I search for records within a period of dates, in addition to looking at some records prior to the first date of the filter. As I mentioned before, using a DBMS, the biggest problem was when updating / deleting data.

At the moment, splitting the data into several SQLite databases is working very well.

Someone suggested having an auxiliary table whose primary key would be a date, with another field pointing to the "true record". I did think about this idea a lot, but at the moment I don't think it is necessary.

Despite having ~25,000 databases, my tables have a maximum of 300,000 records. When I run a query, I look at only one instrument/DB (although it would be much more comfortable to have a single database, either with 1 table and millions of records, or with 25,000 tables of about 200,000 records each).

I honestly do not know if the solution adopted by me can be seen as "good programming practice". I would be very happy to read more opinions about it.

MarkMLl

  • Hero Member
  • *****
  • Posts: 2400
Re: Best reading method
« Reply #23 on: March 09, 2021, 08:57:38 am »
My queries are not very advanced. Basically I search for records within a period of dates, in addition to looking at some records prior to the first date of the filter. As I mentioned before, using a DBMS, the biggest problem was when updating / deleting data.

I presume you're aware that it's common to temporarily drop indexes before doing a bulk addition/deletion. Apart from that, it should be possible to do a lot of that sort of thing as background batch operations, with an appropriate choice of isolation level.
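With SQLdb that could look like the following sketch (connection, transaction, index and table names are all hypothetical, and note that not every DBMS runs DDL transactionally — MySQL, for instance, autocommits it):

```pascal
// Drop the index, do the bulk delete, then rebuild — wrapped in one
// transaction where the backend supports transactional DDL.
Trans.StartTransaction;
try
  Conn.ExecuteDirect('DROP INDEX idx_prices_stamp');
  Conn.ExecuteDirect('DELETE FROM prices WHERE stamp < ''2020-01-01''');
  Conn.ExecuteDirect('CREATE INDEX idx_prices_stamp ON prices (stamp)');
  Trans.Commit;
except
  Trans.Rollback;
  raise;
end;
```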

MarkMLl
Turbo Pascal v1 on CCP/M-86, multitasking with LAN and graphics in 128Kb.
Pet hate: people who boast about the size and sophistication of their computer.
GitHub repositories: https://github.com/MarkMLl?tab=repositories

 
