Recent

Author Topic: distinct values from very large table  (Read 1235 times)

Joanna

  • Hero Member
  • *****
  • Posts: 1000
distinct values from very large table
« on: September 05, 2024, 04:18:34 am »
Hi everyone
I have an sqlite  table with over 73 million rows that has 8 numerical columns. I want to gather the distinct values from the 8 columns and put them into a combobox.

The process of getting distinct values for even one column seems to be very slow and consumes a lot of cpu cycles.

Is there any other way to do this ?
« Last Edit: September 05, 2024, 08:40:31 am by Joanna »
✨ 🙋🏻‍♀️ More Pascal enthusiasts are needed on IRC .. https://libera.chat/guides/ IRC.LIBERA.CHAT  Ports [6667 plaintext ] or [6697 secure] channel #fpc  #pascal Please private Message me if you have any questions or need assistance. 💁🏻‍♀️

cdbc

  • Hero Member
  • *****
  • Posts: 1499
    • http://www.cdbc.dk
Re: distinct values from very large table
« Reply #1 on: September 05, 2024, 06:53:12 am »
Hi
I would say SQL, but the material you're providing is a bit on the skinny side  :D
Regards Benny
If it ain't broke, don't fix it ;)
PCLinuxOS(rolling release) 64bit -> KDE5 -> FPC 3.2.2 -> Lazarus 2.2.6 up until Jan 2024 from then on it's: KDE5/QT5 -> FPC 3.3.1 -> Lazarus 3.0

Joanna

  • Hero Member
  • *****
  • Posts: 1000
Re: distinct values from very large table
« Reply #2 on: September 05, 2024, 07:14:24 am »
Maybe I'm missing an index for this?
« Last Edit: September 05, 2024, 07:16:44 am by Joanna »
✨ 🙋🏻‍♀️ More Pascal enthusiasts are needed on IRC .. https://libera.chat/guides/ IRC.LIBERA.CHAT  Ports [6667 plaintext ] or [6697 secure] channel #fpc  #pascal Please private Message me if you have any questions or need assistance. 💁🏻‍♀️

cdbc

  • Hero Member
  • *****
  • Posts: 1499
    • http://www.cdbc.dk
Re: distinct values from very large table
« Reply #3 on: September 05, 2024, 07:48:31 am »
Hi
Hmmm, that depends... Does this table of yours come with chairs?
...And what is it made of? ...old apple-crates?!?
More info, please  ;D
Regards Benny
If it ain't broke, don't fix it ;)
PCLinuxOS(rolling release) 64bit -> KDE5 -> FPC 3.2.2 -> Lazarus 2.2.6 up until Jan 2024 from then on it's: KDE5/QT5 -> FPC 3.3.1 -> Lazarus 3.0

Zvoni

  • Hero Member
  • *****
  • Posts: 2641
Re: distinct values from very large table
« Reply #4 on: September 05, 2024, 08:18:03 am »
As Benny said: SQL is (usually) the way to go, though deduplicating is always expensive

To get a better idea:
of those 73 Million rows, per each column of those 8 columns: What's the ratio between distinct values and the count of rows?
e.g. of 73 million Values in Column 4, there are 20 distinct values and so on.

I have an idea which might just work

OH, AND WHICH DBMS?
One System to rule them all, One Code to find them,
One IDE to bring them all, and to the Framework bind them,
in the Land of Redmond, where the Windows lie
---------------------------------------------------------------------
Code is like a joke: If you have to explain it, it's bad

Joanna

  • Hero Member
  • *****
  • Posts: 1000
Re: distinct values from very large table
« Reply #5 on: September 05, 2024, 08:37:44 am »
To get a better idea:
of those 73 Million rows, per each column of those 8 columns: What's the ratio between distinct values and the count of rows?
e.g. of 73 million Values in Column 4, there are 20 distinct values and so on.

I have an idea which might just work

OH, AND WHICH DBMS?

I’m using sqlite3 and the values in the table are mostly in the range of -6 to 7 and 102 to 105 so far as I know. I suppose I was wanting to be thorough by getting all distinct values from table to see what is in it
✨ 🙋🏻‍♀️ More Pascal enthusiasts are needed on IRC .. https://libera.chat/guides/ IRC.LIBERA.CHAT  Ports [6667 plaintext ] or [6697 secure] channel #fpc  #pascal Please private Message me if you have any questions or need assistance. 💁🏻‍♀️

Zvoni

  • Hero Member
  • *****
  • Posts: 2641
Re: distinct values from very large table
« Reply #6 on: September 05, 2024, 08:56:11 am »
To get a better idea:
of those 73 Million rows, per each column of those 8 columns: What's the ratio between distinct values and the count of rows?
e.g. of 73 million Values in Column 4, there are 20 distinct values and so on.

I have an idea which might just work

OH, AND WHICH DBMS?

I’m using sqlite3 and the values in the table are mostly in the range of -6 to 7 and 102 to 105 so far as I know. I suppose I was wanting to be thorough by getting all distinct values from table to see what is in it

OK, next....
Do you want the Distinct Values per each Column or across all 8 Columns?
Distinct per each Column: You need 8 separate Queries
Distinct across all columns: You need a UNION Select

This is the "regular" way. Since we're talking 73M rows, Performance will probably take a massive hit.
So depending on your scenario i might have a workaround for future performance
« Last Edit: September 05, 2024, 08:58:48 am by Zvoni »
One System to rule them all, One Code to find them,
One IDE to bring them all, and to the Framework bind them,
in the Land of Redmond, where the Windows lie
---------------------------------------------------------------------
Code is like a joke: If you have to explain it, it's bad

Joanna

  • Hero Member
  • *****
  • Posts: 1000
Re: distinct values from very large table
« Reply #7 on: September 05, 2024, 04:04:45 pm »
I want distinct values across all 8 columns. I have already tried using union on the columns and it was a disaster. It stalls program and maxes out my cpu. Someone in chat also gave me a cte whereby the distinct values of coikmns wertr selected first and then fed to the part that does unions. Too slow.
« Last Edit: September 05, 2024, 09:37:49 pm by Joanna »
✨ 🙋🏻‍♀️ More Pascal enthusiasts are needed on IRC .. https://libera.chat/guides/ IRC.LIBERA.CHAT  Ports [6667 plaintext ] or [6697 secure] channel #fpc  #pascal Please private Message me if you have any questions or need assistance. 💁🏻‍♀️

Zvoni

  • Hero Member
  • *****
  • Posts: 2641
Re: distinct values from very large table
« Reply #8 on: September 05, 2024, 04:29:51 pm »
I want distinct values across all 8 columns. I have already tried using 7nion on the columns and it was a disaster. It stalls program and maxes out my cpu. Someone in chat also gave me a cte whereby the distinct values of coikmns wertr selected first and then fed to the part that does unions. Too slow.
Thought so.

Question: Are you using any Triggers in your SQLite-Database?
Are there any Detail-records in other tables (Primary/ForeignKey) relying on this "73M-Rows"-Table?

because i have a ready "Proof-of-Concept" for my idea.
It involves (17?) triggers for your "73M"-Rows-Table, keeping track of the distinct values in a separate table.

That said: You can't avoid populating that "Distinct" table at least once (which should be fairly easy).

If my PoC works out, you would just have to grab those "distinct" values from that separate table, instead of doing an 8-way UNION-Select on your Monster-table

Proof of Concept (Only 3 Columns instead of 8 )
tbl_joanna_master (this is your "73M-Rows"-Table)
Code: SQL  [Select][+][-]
  1. CREATE TABLE "tbl_joanna_master" (
  2.         "ID"    INTEGER,
  3.         "Col1"  INTEGER,
  4.         "Col2"  INTEGER,
  5.         "Col3"  INTEGER,
  6.         PRIMARY KEY("ID")
  7. )

the "distinct values"-table
Code: SQL  [Select][+][-]
  1. CREATE TABLE "tbl_joanna_distinct" (
  2.         "DistValue"     INTEGER NOT NULL UNIQUE,
  3.         "DistCount"     INTEGER DEFAULT 1
  4. )

The 7 triggers involved in this PoC (Note Column and Tablenames)
Code: SQL  [Select][+][-]
  1. CREATE TRIGGER trg_joanna_master_AI_Col1 AFTER INSERT ON tbl_joanna_master
  2. FOR EACH ROW
  3. WHEN NEW.Col1 IS NOT NULL
  4. BEGIN
  5.         INSERT INTO tbl_joanna_distinct(DistValue) VALUES(NEW.Col1) ON CONFLICT(DistValue) DO UPDATE SET DistCount=DistCount+1 WHERE DistValue=NEW.Col1;       
  6. END;
  7.  
  8. CREATE TRIGGER trg_joanna_master_AI_Col2 AFTER INSERT ON tbl_joanna_master
  9. FOR EACH ROW
  10. WHEN NEW.Col2 IS NOT NULL
  11. BEGIN
  12.         INSERT INTO tbl_joanna_distinct(DistValue) VALUES(NEW.Col2) ON CONFLICT(DistValue) DO UPDATE SET DistCount=DistCount+1 WHERE DistValue=NEW.Col2;       
  13. END;
  14.  
  15. CREATE TRIGGER trg_joanna_master_AI_Col3 AFTER INSERT ON tbl_joanna_master
  16. FOR EACH ROW
  17. WHEN NEW.Col3 IS NOT NULL
  18. BEGIN
  19.         INSERT INTO tbl_joanna_distinct(DistValue) VALUES(NEW.Col3) ON CONFLICT(DistValue) DO UPDATE SET DistCount=DistCount+1 WHERE DistValue=NEW.Col3;       
  20. END;
  21.  
  22. CREATE TRIGGER trg_joanna_master_AU_Col1 AFTER UPDATE ON tbl_joanna_master
  23. FOR EACH ROW
  24. WHEN OLD.Col1<>NEW.Col1
  25. BEGIN
  26.         -- Reduce Count by One
  27.         UPDATE tbl_joanna_distinct
  28.                 SET DistCount=DistCount-1
  29.                 WHERE DistValue=OLD.Col1;
  30.         -- Delete if Count reaches 0   
  31.         DELETE FROM tbl_joanna_distinct WHERE DistValue=OLD.Col1 AND DistCount<=0;
  32.         -- INSERT New.Col1 if it doesn't exist, otherwise Count up by 1
  33.         INSERT INTO tbl_joanna_distinct(DistValue) VALUES(NEW.Col1) ON CONFLICT(DistValue) DO UPDATE SET DistCount=DistCount+1 WHERE DistValue=NEW.Col1;
  34. END;
  35.  
  36. CREATE TRIGGER trg_joanna_master_AU_Col2 AFTER UPDATE ON tbl_joanna_master
  37. FOR EACH ROW
  38. WHEN OLD.Col2<>NEW.Col2
  39. BEGIN
  40.         -- Reduce Count by One
  41.         UPDATE tbl_joanna_distinct
  42.                 SET DistCount=DistCount-1
  43.                 WHERE DistValue=OLD.Col2;
  44.         -- Delete if Count reaches 0   
  45.         DELETE FROM tbl_joanna_distinct WHERE DistValue=OLD.Col2 AND DistCount<=0;
  46.         -- INSERT New.Col1 if it doesn't exist, otherwise Count up by 1
  47.         INSERT INTO tbl_joanna_distinct(DistValue) VALUES(NEW.Col2) ON CONFLICT(DistValue) DO UPDATE SET DistCount=DistCount+1 WHERE DistValue=NEW.Col2;
  48. END;
  49.  
  50. CREATE TRIGGER trg_joanna_master_AU_Col3 AFTER UPDATE ON tbl_joanna_master
  51. FOR EACH ROW
  52. WHEN OLD.Col3<>NEW.Col3
  53. BEGIN
  54.         -- Reduce Count by One
  55.         UPDATE tbl_joanna_distinct
  56.                 SET DistCount=DistCount-1
  57.                 WHERE DistValue=OLD.Col3;
  58.         -- Delete if Count reaches 0   
  59.         DELETE FROM tbl_joanna_distinct WHERE DistValue=OLD.Col3 AND DistCount<=0;
  60.         -- INSERT New.Col1 if it doesn't exist, otherwise Count up by 1
  61.         INSERT INTO tbl_joanna_distinct(DistValue) VALUES(NEW.Col3) ON CONFLICT(DistValue) DO UPDATE SET DistCount=DistCount+1 WHERE DistValue=NEW.Col3;
  62. END;
  63.  
  64. CREATE TRIGGER trg_joanna_master_BD BEFORE DELETE ON tbl_joanna_master
  65. FOR EACH ROW
  66. BEGIN
  67.         -- Reduce Count by One
  68.         UPDATE tbl_joanna_distinct SET DistCount=DistCount-1 WHERE DistValue=OLD.Col1;
  69.         UPDATE tbl_joanna_distinct SET DistCount=DistCount-1 WHERE DistValue=OLD.Col2;
  70.         UPDATE tbl_joanna_distinct SET DistCount=DistCount-1 WHERE DistValue=OLD.Col3;
  71.         -- Delete if Count reaches 0   
  72.         DELETE FROM tbl_joanna_distinct WHERE DistValue=OLD.Col1 AND DistCount<=0;
  73.         DELETE FROM tbl_joanna_distinct WHERE DistValue=OLD.Col2 AND DistCount<=0;
  74.         DELETE FROM tbl_joanna_distinct WHERE DistValue=OLD.Col3 AND DistCount<=0;
  75. END;

Run this INSERT, and you will see, how it the "distinct"-table counts up. Change/Delete Values in your Master, and see how the count changes in "distinct"
Code: SQL  [Select][+][-]
  1. INSERT INTO tbl_joanna_master(Col1,Col2,Col3) VALUES(1,2,3),(4,5,6),(1,3,6);
« Last Edit: September 05, 2024, 04:37:07 pm by Zvoni »
One System to rule them all, One Code to find them,
One IDE to bring them all, and to the Framework bind them,
in the Land of Redmond, where the Windows lie
---------------------------------------------------------------------
Code is like a joke: If you have to explain it, it's bad

Joanna

  • Hero Member
  • *****
  • Posts: 1000
Re: distinct values from very large table
« Reply #9 on: September 06, 2024, 01:01:17 am »
Thanks Zvoni, my sql skills are not so great I’ve never used triggers before.
The table for distinct also stores the occurrences of each distinct value and that is something I’d like to know.

I will add this to my schema. I do have another table that references the big table but not the columns I want unique values for.

One thing I wonder about though is how much will the triggers slow down things when I recreate the table and make new data?
« Last Edit: September 06, 2024, 01:05:27 am by Joanna »
✨ 🙋🏻‍♀️ More Pascal enthusiasts are needed on IRC .. https://libera.chat/guides/ IRC.LIBERA.CHAT  Ports [6667 plaintext ] or [6697 secure] channel #fpc  #pascal Please private Message me if you have any questions or need assistance. 💁🏻‍♀️

egsuh

  • Hero Member
  • *****
  • Posts: 1440
Re: distinct values from very large table
« Reply #10 on: September 06, 2024, 04:59:37 am »
I believe Zvoni's approach is the right solution, but I'd like to suggest a simpler solution.

1) First, create tables that contain only distinct values for each 8 fields (and number of rows of each distinct value for current state if necessary). That would be similar to :

Code: SQL  [Select][+][-]
  1.     CREATE TABLE "tbl_joanna_field1_distinct" (
  2.             "DistValue"     INTEGER NOT NULL UNIQUE,
  3.             "DistCount"     INTEGER DEFAULT 1
  4.     )
  5.  
  6.     CREATE TABLE "tbl_joanna_field2_distinct" (
  7.             "DistValue"     INTEGER NOT NULL UNIQUE,
  8.             "DistCount"     INTEGER DEFAULT 1
  9.     )
  10.     // one TABLE FOR each FIELD
  11.  

2) And define triggers for your own table.

Code: SQL  [Select][+][-]
  1. CREATE TRIGGER trg_joanna_master_AI_Col1 AFTER INSERT ON tbl_joanna_master
  2. AS
  3. BEGIN
  4.      IF  (NEW.field1 IS NOT NULL) THEN  BEGIN
  5.          UPDATE tbl_joanna_field1_distinct SET  distcount = distcount + 1 WHERE DistValue=NEW.field1;
  6.          IF (rec_count <= 0) THEN
  7.              INSERT INTO tbl_joanna_field1_distinct (nenw.field1, 1);
  8.     END;
  9.          
  10.      IF  (NEW.field2 IS NOT NULL) THEN  BEGIN
  11.          UPDATE tbl_joanna_field2_distinct SET  distcount = distcount + 1 WHERE DistValue=NEW.field2;
  12.          IF (rec_count <= 0) THEN
  13.              INSERT INTO tbl_joanna_field2_distinct (nenw.field2, 1);
  14.     END;
  15.     -- and so on for other fields
  16. END;
  17.  
  18. CREATE TRIGGER trg_joanna_master_AI_Col1 AFTER DELETE ON tbl_joanna_master
  19. AS
  20. BEGIN
  21.      IF  (OLD.field1 IS NOT NULL) THEN  BEGIN
  22.          UPDATE tbl_joanna_field1_distinct SET  distcount = distcount - 1 WHERE DistValue=OLD.field1;
  23.     END;
  24.          
  25.      IF  (OLD.field2 IS NOT NULL) THEN  BEGIN
  26.          UPDATE tbl_joanna_field2_distinct SET  distcount = distcount - 1 WHERE DistValue=OLD.field2;
  27.     END;
  28.     -- and so on for other fields
  29. END;
  30.  
  31. CREATE TRIGGER trg_joanna_master_AI_Col1 AFTER UPDATE ON tbl_joanna_master
  32. AS
  33. BEGIN
  34.      IF  (OLD.field1 <> NEW.field1) THEN  BEGIN
  35.          -- procedure for insert trigger for new.field1
  36.          -- procedure for delete trigger for old.fied1
  37.     END;
  38.     -- and so on for other fields
  39. END;
  40.  
  41.  

Not sure syntaxes are right.

Zvoni

  • Hero Member
  • *****
  • Posts: 2641
Re: distinct values from very large table
« Reply #11 on: September 06, 2024, 07:59:52 am »
Not sure syntaxes are right.
They are not. SQLite doesn't support "IF-THEN-ELSE" within a Trigger-Body.
You can't even do a "SELECT CASE WHEN XXXX THEN INSERT INTO blablbalba"
Believe me, i tried.
It's the reason, why i had to make a trigger per Column for AFTER UPDATE.

@Joanna: Question for those 8 Columns: Are NULLs allowed in those 8 Columns?
Because if NOT (there always must be an Integer-Value), then the AFTER-INSERT-Triggers can be reduced to one Trigger

Thanks Zvoni, my sql skills are not so great I’ve never used triggers before.
Think of Triggers like "Events" you're used to from Lazarus
an AFTER INSERT-trigger would be something like an "OnAfterInsert"-Event you might know from Lazarus


Quote
The table for distinct also stores the occurrences of each distinct value and that is something I’d like to know.
To store the count is the only way i've found to keep track, WHEN to delete a former "distinct" value.
You could argue, that it's not necessary to delete that value from "Distinct", when you keep the count = 0.
Then to Grab "What distinct" Values are currently in my 73M rows?"
it would be a simple "SELECT DistValue FROM tbl_joanna_distinct WHERE DistCount>0 ORDER BY DistValue"
(which you should do anyway that way)

Quote
I will add this to my schema. I do have another table that references the big table but not the columns I want unique values for.
My question, if you have another table, which references the big table, doesn't have to do if it references the values.
It has to do with, that for the first creation of the distinct table, you would have to drop any and all ForeignKey-constraints between them, because you would have to do a dump of the big table, drop/empty out the big table, create the triggers and distinct table, then reimport the big table (to get the Values and counts starting in "Distinct")
This would be a one-time thing

Quote
One thing I wonder about though is how much will the triggers slow down things when I recreate the table and make new data?
The first creation (see above how to) will take time, since in a nutshell you execute 73M x 8 SQL-Instructions instead of just 73M
BUT AFTER THAT (running process), it's the difference between 1 Statement and max. 24 (depending what you do, INSERT ( 8 Statements), UPDATE (24), DELETE ( 8 ) Statements, which nowadays is pretty much not noticable
« Last Edit: September 06, 2024, 10:43:29 am by Zvoni »
One System to rule them all, One Code to find them,
One IDE to bring them all, and to the Framework bind them,
in the Land of Redmond, where the Windows lie
---------------------------------------------------------------------
Code is like a joke: If you have to explain it, it's bad

egsuh

  • Hero Member
  • *****
  • Posts: 1440
Re: distinct values from very large table
« Reply #12 on: September 06, 2024, 08:31:46 am »
Quote
They are not. SQLite doesn't support "IF-THEN-ELSE" within a Trigger-Body.
You can't even do a "SELECT CASE WHEN XXXX THEN INSERT INTO blablbalba"
Believe me, i tried.
It's the reason, why i had to make a trigger per Column for AFTER UPDATE.

I see. My grammar is of Firebird. I never doubt you. Just thinking of algorithm or approaches --- not specific syntax.

Joanna

  • Hero Member
  • *****
  • Posts: 1000
Re: distinct values from very large table
« Reply #13 on: September 06, 2024, 10:05:12 am »
It’s a shame that sql doesn’t support passing different tables as a parameter to a procedure. It seems painfully redundant compared to pascal. I guess that is the big difference between compiled and scripting languages?

Zvoni the colum values in this particular table are not allowed to be null. Default is 0. Also I probably should have mentioned that this table is a lookup table which is not supposed to be modified by users.

I’m glad you gave code for update and deleting though, I will save it in case I ever need it.

If I want to populate the distinct table from existing table as a test , how would I do it? I’m guessing something along the lines of
Insert into tbl_distinct select (columns I want) from bigtable?
« Last Edit: September 06, 2024, 10:10:47 am by Joanna »
✨ 🙋🏻‍♀️ More Pascal enthusiasts are needed on IRC .. https://libera.chat/guides/ IRC.LIBERA.CHAT  Ports [6667 plaintext ] or [6697 secure] channel #fpc  #pascal Please private Message me if you have any questions or need assistance. 💁🏻‍♀️

egsuh

  • Hero Member
  • *****
  • Posts: 1440
Re: distinct values from very large table
« Reply #14 on: September 06, 2024, 10:13:25 am »
  insert into distinct_table select distinct field1 from master_table;

would work.

 

TinyPortal © 2005-2018