
Author Topic: speeding up big data projects  (Read 3030 times)

daringly

  • Jr. Member
  • **
  • Posts: 73
speeding up big data projects
« on: September 21, 2022, 03:58:43 pm »
I have a dataset that includes 3000 files and about 2 billion datapoints. There is more data than I can keep in memory.

Each datapoint has a "user"; there are roughly 1m users -- 1m bins. I have to sort the 2 billion datapoints, putting them into 1m bins and then aggregate that data.

An ugly way to do it is to create a file for each of the 1m users. It will take a while. Is there a more elegant way to do this?

KodeZwerg

  • Hero Member
  • *****
  • Posts: 2269
  • Fifty shades of code.
    • Delphi & FreePascal
Re: speeding up big data projects
« Reply #1 on: September 21, 2022, 04:08:51 pm »
If the data comes from a DBMS, tweak it to reduce the dataflow and only keep the visible parts in memory (read in chunks).
Or explain more about what you are doing, how you are doing it, and why you are doing it...

Thaddy

  • Hero Member
  • *****
  • Posts: 18729
  • To Europe: simply sell USA bonds: dollar collapses
Re: speeding up big data projects
« Reply #2 on: September 21, 2022, 04:16:47 pm »
Such things should be done in the dbms itself with SQL.
For example Oracle, MySQL/MariaDB, MS SQL Server or PostgreSQL are usually much faster than anything you can do locally in code.
Just run the queries and return the results locally.
If it is code that comes back many times, write stored procedures for it.

I don't suppose you rely on *.csv for such data? Then export it to, or leave it in, a proper dbms.
« Last Edit: September 21, 2022, 04:20:29 pm by Thaddy »
If Europe sells their USA bonds the USD will collapse. Europe can afford that given average state debts. The USA can't afford that. Just some advice...

daringly

  • Jr. Member
  • **
  • Posts: 73
Re: speeding up big data projects
« Reply #3 on: September 21, 2022, 04:18:47 pm »
Each line of a .csv file is a datapoint with about 10 fields including a user. Each .csv file is sorted by user, then by chronology. Most users will appear in multiple .csv files.

In the past, I'd read each datapoint, reduce the "user" string to a longint (and translate other lengthier fields into integers or longints). As the number of users increased, the lookup time for each datapoint increased significantly. Doing it all in memory worked when my dataset was smaller, but it took several days and was unstable. Sometimes it would simply crash.

My idea now is to do it in chunks. Read one .csv file; instead of a lookup for each user, I'd simply have a file named after each user. I'd translate the .csv row into a record, and write that record to the end of that user's individual datafile.

When I want to aggregate the data (e.g., turn a user file into a single summary), I simply go through the roughly 1 million user files, and add them up however I like. I'll end up running that process several times as I try different analytics approaches. The reading of the .csvs and translating into user files, I hope to run only once.
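That per-user file scheme can be sketched with a classic `file of record`. This is a minimal sketch, not the OP's actual code: the TDataPoint layout is an assumed stand-in for the roughly 10 CSV fields, and the directory and naming scheme are illustrative.

```pascal
program UserFileSketch;
{$mode objfpc}
uses
  SysUtils;

type
  // Assumed stand-in for the ~10 CSV fields; not the OP's real layout.
  TDataPoint = packed record
    Timestamp: Int64;
    Value: Double;
  end;

procedure AppendToUserFile(const UserID: string; const DP: TDataPoint);
var
  F: file of TDataPoint;
  Name: string;
begin
  Name := 'users' + DirectorySeparator + UserID + '.dat';
  AssignFile(F, Name);
  if FileExists(Name) then
    Reset(F)              // open the existing file for read/write
  else
    Rewrite(F);           // create it on first sight of this user
  Seek(F, FileSize(F));   // position after the last record...
  Write(F, DP);           // ...and append one fixed-size record
  CloseFile(F);
end;

var
  DP: TDataPoint;
begin
  if not DirectoryExists('users') then
    CreateDir('users');
  DP.Timestamp := 1663773523;
  DP.Value := 42.0;
  AppendToUserFile('user000001', DP);
end.
```

Note that opening and closing a file per datapoint would be painfully slow at 2 billion rows; since each input .csv is already sorted by user, keeping the current user's file open until the user changes, or buffering records and flushing in batches, would cut the open/close overhead dramatically.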

I've run things like this in SQL before, but things go more smoothly on my local desktop (which I don't have to share, and where I don't have to worry about screwing other people up if I hog all the resources).

Thaddy

  • Hero Member
  • *****
  • Posts: 18729
  • To Europe: simply sell USA bonds: dollar collapses
Re: speeding up big data projects
« Reply #4 on: September 21, 2022, 04:21:35 pm »
CSV is a bad idea for such applications. Read my previous answer.
If Europe sells their USA bonds the USD will collapse. Europe can affort that given average state debts. The USA can't affort that. Just an advice...

Arioch

  • Sr. Member
  • ****
  • Posts: 421
Re: speeding up big data projects
« Reply #5 on: September 21, 2022, 05:28:06 pm »
Quote
I have to sort the 2 billion datapoints

that is called an "external sort", usually built on a merge sort

https://en.wikipedia.org/wiki/Merge_sort

conceptually it is a loop like this

1. copy 1 big file into 2 half-files, with some simple transformations.
2. copy 2 half files into 1 big file,  with some simple transformations.
3. check if the file became sorted
4. if it did not - repeat the loop
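The core of each iteration is the merge step: two sorted run files are combined into one sorted file. A minimal sketch follows; the Int64 keys and the file names are illustrative — real records would carry the whole datapoint, not just the key.

```pascal
program MergePassSketch;

{ Merge two sorted run files into one sorted output file. }
procedure MergeRuns(const AName, BName, OutName: string);
var
  A, B, O: file of Int64;
  KA, KB: Int64;
  HasA, HasB: Boolean;

  procedure ReadNext(var F: file of Int64; var K: Int64; var Has: Boolean);
  begin
    Has := not Eof(F);
    if Has then
      Read(F, K);
  end;

begin
  AssignFile(A, AName);   Reset(A);
  AssignFile(B, BName);   Reset(B);
  AssignFile(O, OutName); Rewrite(O);
  ReadNext(A, KA, HasA);
  ReadNext(B, KB, HasB);
  while HasA or HasB do
    if HasA and ((not HasB) or (KA <= KB)) then
    begin
      Write(O, KA);           // the smaller key goes out first
      ReadNext(A, KA, HasA);
    end
    else
    begin
      Write(O, KB);
      ReadNext(B, KB, HasB);
    end;
  CloseFile(A);
  CloseFile(B);
  CloseFile(O);
end;

begin
  // Assumes run_a.dat and run_b.dat exist and are already sorted.
  MergeRuns('run_a.dat', 'run_b.dat', 'merged.dat');
end.
```

Because both inputs are read strictly front-to-back and the output is written strictly front-to-back, all I/O stays sequential, which is exactly what disks are fastest at.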

It is not hard to combine those stages into actually one procedure. But tedious and requires attention to details.

however, the concept itself, split into small, easy-to-understand, non-intertwined functions, is as I outline above

Usually disk I/O speed is the bottleneck. In the scheme above you do 3 full readings and 2 full writings on every loop iteration. By fusing those steps you would have one reading and one writing per iteration, making it about twice as fast as the simplest implementation. You can start with the simplest one to see if it fits your needs in general; then, if you need to, you can make it twice as fast by writing one more complex, fused function instead of 3 simplistic ones. But maybe you would not need to.

Frankly, merge sort is taught in every school/course/textbook on computing. It is much simpler than QuickSort and is an equally basic idea in all programming.

-------

Now, as far as possible you would do better to use binary files with fixed-size elements.
Even DBF could be better than CSV (I am not sure the DBF format would handle 2B rows, though).

There are special databases exactly for simple-structured but huge-volume data.

--------

There is a possible approach: store the data and pointers to it separately, using something like Tokyo Cabinet (or its competitors or successors), https://dbmx.net/tkrzw/#overview

In such an approach you would only sort the "pointers" without copying the data itself.

It might be a good or a bad idea. Given that you cannot hold a lot of data in RAM:
- you would eliminate most of the streaming writes; you would only write "pointers", not dataframes. In the best case you would be able to sort the pointers entirely in memory and then write them to disk only once
- at the same time, you would do an equal amount of reads, but those would change from sequential reads to random reads (google: butterfly HDD test)

SSD-like disks have no penalty for random reads but are worn out by writes (google: chia cryptocurrency); for those this looks like a good optimization.
HDD-like disks are not worn out by writes, but random access is much slower and wears out their head-positioning units, so there it would be a bad idea.
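The pointer idea boils down to sorting a small in-memory array of (key, byte-offset) pairs while the big data file stays untouched. A sketch under assumptions: the record layout is invented, and the insertion sort is for clarity only — at real sizes you would use a QuickSort or similar.

```pascal
program PointerSortSketch;

type
  TPointerEntry = record
    UserID: LongInt;  // the sort key
    Offset: Int64;    // byte offset of the full record in the data file
  end;

{ Insertion sort for clarity only; use a faster sort at real sizes. }
procedure SortByUser(var A: array of TPointerEntry);
var
  i, j: Integer;
  T: TPointerEntry;
begin
  for i := 1 to High(A) do
  begin
    T := A[i];
    j := i - 1;
    while (j >= 0) and (A[j].UserID > T.UserID) do
    begin
      A[j + 1] := A[j];  // shift larger keys one slot right
      Dec(j);
    end;
    A[j + 1] := T;
  end;
end;

var
  E: array[0..2] of TPointerEntry;
begin
  E[0].UserID := 7;  E[0].Offset := 2048;
  E[1].UserID := 3;  E[1].Offset := 0;
  E[2].UserID := 5;  E[2].Offset := 1024;
  SortByUser(E);
  WriteLn(E[0].UserID, ' ', E[1].UserID, ' ', E[2].UserID);  // 3 5 7
end.
```

After sorting, reading the records back in key order means seeking to each stored offset, which is exactly the sequential-to-random trade-off described above.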

---

Whatever approach you choose, you may want to look into this: https://github.com/d-mozulyov/CachedBuffers

Read the sources to learn ideas on strategies for reading and writing big amounts of data.

The same goes for the links mentioned in this discussion: https://stackoverflow.com/questions/5639531/buffered-files-for-faster-disk-access

Martin_fr

  • Administrator
  • Hero Member
  • *
  • Posts: 12121
  • Debugger - SynEdit - and more
    • wiki
Re: speeding up big data projects
« Reply #6 on: September 21, 2022, 05:35:37 pm »
This really calls for a database. A database can handle much larger data, and can probably get you the data in a matter of seconds.

As the number of users increased, the lookup time for each datapoint increased significantly.

If it must be done without a database, familiarize yourself with how hashes work, because used correctly they keep the look-up time nearly constant (big O(1)).
Cache misses and, in the worst case, swap-file usage could impact that constant time...
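A minimal FPC sketch of such a hash lookup, mapping each user name to a compact integer ID; the dictionary type is from Generics.Collections, and all names are illustrative.

```pascal
program HashLookupSketch;
{$mode delphi}
uses
  Generics.Collections;

var
  UserIDs: TDictionary<string, LongInt>;
  NextID: LongInt = 0;

{ Return the compact ID for a user, assigning a new one on first sight.
  TDictionary gives near-constant-time lookup regardless of user count. }
function LookupUser(const UserName: string): LongInt;
begin
  if not UserIDs.TryGetValue(UserName, Result) then
  begin
    Result := NextID;
    Inc(NextID);
    UserIDs.Add(UserName, Result);
  end;
end;

begin
  UserIDs := TDictionary<string, LongInt>.Create;
  WriteLn(LookupUser('alice'));  // 0
  WriteLn(LookupUser('bob'));    // 1
  WriteLn(LookupUser('alice'));  // 0 again; no linear scan involved
  UserIDs.Free;
end.
```

This is the piece that would have fixed the OP's "lookup time increased significantly" problem: with a linear scan the cost grows with the user count, with a hash map it does not.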

If you must create files, then maybe (though I am writing this without knowing your data, which makes it a long shot) consider "binning" it, i.e. reducing the number of files by putting chunks of data into the same file (bin, aka container).
If you do, the important point is to group in such a way that items that will be accessed "together" (e.g. shortly after each other) end up in the same bin. Of course, how to do that requires some knowledge of the data.
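One way to pick a bin is to hash the user name into a fixed number of container files; since the OP aggregates per user, all of a user's records landing in the same bin matches the access pattern. The 1024-bin count and the FNV-1a hash here are illustrative choices, not a recommendation.

```pascal
program BinNameSketch;
{$mode objfpc}
uses
  SysUtils;

const
  NumBins = 1024;  // illustrative; tune to your filesystem

{ FNV-1a hash of the user name, reduced to a container-file name. }
function BinFileName(const UserName: string): string;
var
  i: Integer;
  H: Cardinal;
begin
  H := 2166136261;                             // FNV-1a offset basis
  for i := 1 to Length(UserName) do
    H := (H xor Ord(UserName[i])) * 16777619;  // FNV prime; wraps mod 2^32
  Result := Format('bin_%.4d.dat', [H mod NumBins]);
end;

begin
  WriteLn(BinFileName('alice'));
  WriteLn(BinFileName('alice'));  // same bin both times, deterministically
  WriteLn(BinFileName('bob'));
end.
```

Each bin file would then hold records for roughly 1000 users instead of one, cutting a million files down to a thousand while keeping each user's data findable.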


But again, a database can already do all that.

Arioch

  • Sr. Member
  • ****
  • Posts: 421
Re: speeding up big data projects
« Reply #7 on: September 21, 2022, 05:55:49 pm »
This really calls for a database

Thing is, not every database is an RDBMS/SQL database, which is what is usually meant by default. And that is even before telling apart OLAP and OLTP.

And apart from storage and the access interface, there may be other needs, like predictable write times.

And then maybe being responsible for data damage: detection of it at the very least, if not recovery. Random bit flips, HDD failures and what not.

Frankly, being responsible for 2B data frames just calls for drafting requirements for storage back-ends covering the whole cycle of how that data is used by humans.

And then selecting formats/libraries/hardware.

SQL solution, even as simple as SQLite or Firebird, might indeed be the optimal solution here. Or non-SQL database. Or even custom "file of record".

I think Firebird 2.5 and newer, especially with a relaxed config, would handle 2B rows. But AFAIR it would not detect bit flips (unless you implement a page-level crypto plugin). It also, being a lazy-eval engine, would not give timing guarantees on writing or even reading. It was even in the FAQ that if you implement telemetry writing from hardware sensors, you have to write raw binary files first, and only then put sizeable complete chunks asynchronously into the DB.

Quote
familiarize yourself with how hashes work.

...or select and use a stable library that does it for you. But such a library is a database :-)
« Last Edit: September 21, 2022, 05:58:20 pm by Arioch »

marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 12645
  • FPC developer.
Re: speeding up big data projects
« Reply #8 on: September 21, 2022, 05:59:55 pm »
I have a dataset that includes 3000 files and about 2 billion datapoints. There is more data than I can keep in memory.

Partition your data along its major action and use multiple machines with plenty of memory. It is the big data way.

MarkMLl

  • Hero Member
  • *****
  • Posts: 8533
Re: speeding up big data projects
« Reply #9 on: September 21, 2022, 06:04:16 pm »
CSV is a bad idea for such applications. Read my previous answer.

Seconded. Use an offline import to get the content of the files into one or more appropriate tables, then tell the database to build the indices.

I can't speak for other databases, but PostgreSQL has various /explain/ facilities which allow you to examine how the query planner is handling the job, and to make sure that it's not attempting sequential scans etc. where they're not appropriate.

MarkMLl
MT+86 & Turbo Pascal v1 on CCP/M-86, multitasking with LAN & graphics in 128Kb.
Logitech, TopSpeed & FTL Modula-2 on bare metal (Z80, '286 protected mode).
Pet hate: people who boast about the size and sophistication of their computer.
GitHub repositories: https://github.com/MarkMLl?tab=repositories

Arioch

  • Sr. Member
  • ****
  • Posts: 421
Re: speeding up big data projects
« Reply #10 on: September 21, 2022, 07:32:38 pm »
If he works with "tables", then conversion to SQL would be hard.

He would probably start with opening TTable / TSimpleDataset / etc. on top of an SQL query and doing his usual ISAM .Insert/.Edit/.Post

I know i did

It takes "reprogramming" human brains, not program code, to start doing SQL.

So it may be a better course to choose a non-SQL, non-relational database instead.

Find something similar to Tokyo Cabinet or Google Bigtable. Those are NOT recommendations of a tool, but hints for research into what kind of library might be needed. Discussions of T.C. compared with its successors and competitors will bring up more products to try.

MarkMLl

  • Hero Member
  • *****
  • Posts: 8533
Re: speeding up big data projects
« Reply #11 on: September 21, 2022, 08:19:57 pm »
He would probably start with opening TTable / TSimpleDataset / etc. on top of an SQL query and doing his usual ISAM .Insert/.Edit/.Post

I know i did

And I'm sorry, but that's the wrong way to do it: particularly with a non-trivial amount of data you need to use database-specific offline import facilities since, apart from anything else, they don't attempt to build/maintain indexes, which makes operations an order of magnitude faster.

After that it might even be possible to do the job using SQL or whatever language is supported by the database backend.

MarkMLl

Arioch

  • Sr. Member
  • ****
  • Posts: 421
Re: speeding up big data projects
« Reply #12 on: September 21, 2022, 08:28:51 pm »
...or you need libraries designed to provide an ISAM-like interface to large data

relational databases might be a good long-term goal for him, or maybe do not fit his problem

thinking about data not as column/row grids but as abstract mathematical sets is a serious rewiring; it is not done in 2-3 months

and given the 2B of data already accumulated, I do not believe in incremental learning while working with the real data

in this case, then, I would research non-relational databases first, even if in 2-3 years he would indeed migrate to SQL in a slow, controlled manner, without the strong pressure of an overhanging problem

MarkMLl

  • Hero Member
  • *****
  • Posts: 8533
Re: speeding up big data projects
« Reply #13 on: September 21, 2022, 09:00:15 pm »
...or you need libraries designed to provide an ISAM-like interface to large data

It's very likely that buried in some corner of a forgotten FTP archive is a Perl script to do exactly the job he wants.

Quote
thinking about data not as column/row grids but as abstract mathematical sets is a serious rewiring; it is not done in 2-3 months

Frankly, I think you're being pessimistic. This isn't the place to get into a discussion of SQL application (I can strongly recommend the PostgreSQL mailing lists for that sort of thing), but this is basically (a) an offline import of the 3000 files into a table, complete with a column saying which original file each row came from, and (b) queries on behalf of each user as required. If necessary the list of users could be extracted into a table by a select with the unique modifier and then per-user extraction onto other media could be done... excuse a bit of handwaving here but I'm without much of my reference material.

A couple of billion rows is really not very much these days: I've got roughly that much of (a particular type of) customer data and an offline backup or restore takes of the order of 15 minutes. And once it's online I trust Postgres to manage memory and other resources more efficiently than naive scripting could.

As an aside, I'd observe that some of the more advanced SQL stuff strongly reminds me of APL: and while I've not read the oldest papers my understanding is that Codd's precursor languages similarly used various special symbols. However I /don't/ think OP's requirement is particularly arcane.

MarkMLl
« Last Edit: September 21, 2022, 09:25:21 pm by MarkMLl »

Arioch

  • Sr. Member
  • ****
  • Posts: 421
Re: speeding up big data projects
« Reply #14 on: September 21, 2022, 09:16:22 pm »
(b) queries on behalf of each user as required.

like select * from tablename
TheDailyWTF has plenty of stories

Quote
by a select with the unique modifier

oh, you mean an external sort in %temp% - that would take time... on every such query

Quote
I've got roughly that much of (a particular type of) customer data

you mean that your schema only consists of "create table" statements, and those having no "constraint", right?

As for the mere data, I said above that with a reasonable DB design even Firebird with a non-default config would easily manage it, and perhaps even SQLite.
But...

You suppose a reasonable design from a person with zero exposure to SQL concepts. Won't happen.

Oh, and how long would Postgres keep reasonable speed on 2B rows of data that are being regularly changed (TTable.Edit & TTable.Post) after a transaction was opened (automatically, by TTable.Open)?

No, you would not break the program by doing crazy things like TTable.Close, which makes all the DB-aware controls go blank. Users do not need blank controls. They need data. The program should be functional, so we have

Code: Pascal
procedure TMainForm.FormCreate(....);
begin
  MainTable.Open;
end;

procedure TMainForm.FormDestroy(....);
begin
  MainTable.Close;
end;
