Recent

Author Topic: combine/merge/consolidate multiple rows into one record  (Read 287 times)

tatamata

  • Hero Member
  • *****
  • Posts: 734
    • ZMSQL - SQL enhanced in-memory database
combine/merge/consolidate multiple rows into one record
« on: September 27, 2020, 01:24:05 pm »
Hello,
Is anybody aware of a Lazarus code snippet or a free piece of software that can be used for post-hoc consolidation of a dataset, in a way that multiple rows sharing same unique value in one column (partially duplicate records) can be comined/merged/consolidated into one row?
To explain use case, imagine  a dataset containing contacts (unique_customer_id, first name, last name, adress, e_mail, phone, etc.). Now, we want to merge rows with same unique_customer_id...
« Last Edit: September 27, 2020, 01:50:03 pm by tatamata »

howardpc

  • Hero Member
  • *****
  • Posts: 3553
Re: combine/merge/consolidate multiple rows into one record
« Reply #1 on: September 27, 2020, 02:24:56 pm »
The difficulty in coding this is not merging record A with record B where for every non-identical field either A or B has a blank field, but the case of merging A and B when two non-blank fields differ.
For those cases, you need an algorithm or rule, unique to the database for each field, to decide whether A's field or B's field is the one to preserve (or if some other course of action should be taken).
Often in such cases no one really knows which is the 'right' field.
Even if B's last user edit date is later, the data may be incorrect.
Only a local, knowledgeable person can determine those sorts of judgments in most cases. Or you have to go with a heuristic, and accept that the data is compromised or unreliable.
An important question to ask is "How did two different records with the same unique customer_ID arise in the first place?"

tatamata

  • Hero Member
  • *****
  • Posts: 734
    • ZMSQL - SQL enhanced in-memory database
Re: combine/merge/consolidate multiple rows into one record
« Reply #2 on: September 27, 2020, 03:14:35 pm »
The difficulty in coding this is not merging record A with record B where for every non-identical field either A or B has a blank field, but the case of merging A and B when two non-blank fields differ.
For those cases, you need an algorithm or rule, unique to the database for each field, to decide whether A's field or B's field is the one to preserve (or if some other course of action should be taken).
Often in such cases no one really knows which is the 'right' field.
Even if B's last user edit date is later, the data may be incorrect.
Only a local, knowledgeable person can determine those sorts of judgments in most cases. Or you have to go with a heuristic, and accept that the data is compromised or unreliable.
An important question to ask is "How did two different records with the same unique customer_ID arise in the first place?"
Sure, I am aware of those problems you are pointing to and these are not trivial. Obviously, for each column a rule has to be defined, what "merging" actually means. In SQL, e.g., it might be concatenation, unique words, unique lexems, greatest string, least string, max, min, etc...
But, for now I am looking for either already developed Lazarus piece of code (operationg with dataset  or spreadsheet), or a free stand-alone generic software for such data operations, which is not SQL-based...
For example, I was looking into OpenRefine software and, suprisingly, it seems it does not have such option, at least not an obvious one...
« Last Edit: September 27, 2020, 03:16:21 pm by tatamata »

wittbo

  • Full Member
  • ***
  • Posts: 139
Re: combine/merge/consolidate multiple rows into one record
« Reply #3 on: September 27, 2020, 09:52:15 pm »
First of all, I don't believe, that you will find any solution for your idea:  the database is the instance, that sends all data records, that match the original SQL statement, to the TQuery/TDataset component. A TDBgrid is the visualisation tool only. If a database provides 100 records, the TDBGrid will always show 100 datasets. I don't believe, that there could any frontend logic, that compresses 100 records into 40 datasets for example.

And -if I understand you correctly- you don't know the merging criteria exactly. I believe your task is to clean up an inconsistent customer database, and it may be that in one case the name is correct and the email incorrect, but in the next the email is correct and the name incorrect.

One can only proceed systematically in such a case; define the fields that must always be identical, count the records that have a match with these criteria and correct the other fields:
select ID,
         NAME,
         count(*) as number
from  customers
group by id, name
having count(*) > 1;


You could write a software, who creates this SQL-statement at runtime, changes the query and reopens it again. So you can try a stepwise approach to condense your customer database.
« Last Edit: September 27, 2020, 09:57:17 pm by wittbo »
-wittbo-
MBAir with MacOS 10.14.6 / Lazarus 2.0.10
iMac with MacOS 10.13.6 / Lazarus 2.0.2

 

TinyPortal © 2005-2018