
Author Topic: What intermediate dataformat should I use to analyse a CSV file?  (Read 1176 times)

Bart

  • Hero Member
  • *****
  • Posts: 5290
    • Bart en Mariska's Webstek
What intermediate dataformat should I use to analyse a CSV file?
« on: February 15, 2020, 03:33:17 pm »
Hi,

I have data in the form of a CSV file.
The file has data like:
Age, Total days of admission, Diagnosis, Barthel index, Destination after discharge, ...

Now I would like to do some statistics on the data, for example: calculate the percentage of patients admitted with diagnosis "CVA"; calculate the percentage with destination="Home" among patients with diagnosis="CVA"; and calculate the mean Barthel index of patients with diagnosis="CVA" and destination="Home" versus destination<>"Home".

A complication is that not all fields are known for all patients (the field is left blank in that case).

Parsing the CSV file is not a real problem.

My main question is: what kind of internal data structure should I use, so that I can easily "query" it for questions like the above?

And before anyone asks: No, I don't have access to a statistical analysis program (SPSS or the like), nor do I have any database engine installed on the machine in question.
(The data are exported from an Excel file; I don't even have access to MS Access. I work in a nursing home and there are simply no funds for that.)

The dataset is currently limited to about 160 patients, so I could do this all by hand, but the same exercise will have to be done next year (on approximately the same amount of data).
And of course, since I like programming, it will be a nice exercise for me.

Bart

jamie

  • Hero Member
  • *****
  • Posts: 6131
Re: What intermediate dataformat should I use to analyse a CSV file?
« Reply #1 on: February 15, 2020, 03:47:15 pm »
There is a TCSVDataSet on the Data access tab.

Drop a TDataSource on the form and some sort of data-aware control (TDBxxx) to view the data.

You do need to know your field names so you can define the fields for the dataset.

« Last Edit: February 15, 2020, 03:49:09 pm by jamie »
The only true wisdom is knowing you know nothing

fred

  • Full Member
  • ***
  • Posts: 201
Re: What intermediate dataformat should I use to analyse a CSV file?
« Reply #2 on: February 15, 2020, 03:52:18 pm »
My first thought would be a Sqlite database, ZMSQL or JCSV (Jans CSV Components) and run select count etc.

Bart

  • Hero Member
  • *****
  • Posts: 5290
    • Bart en Mariska's Webstek
Re: What intermediate dataformat should I use to analyse a CSV file?
« Reply #3 on: February 15, 2020, 04:13:00 pm »
Quote from: fred on February 15, 2020, 03:52:18 pm
    My first thought would be a Sqlite database, ZMSQL or JCSV (Jans CSV Components) and run select count etc.

No such software is available on the machine in question, nor will it be installed on my request.

Bart

wp

  • Hero Member
  • *****
  • Posts: 11923
Re: What intermediate dataformat should I use to analyse a CSV file?
« Reply #4 on: February 15, 2020, 04:30:47 pm »
Quote from: Bart on February 15, 2020, 04:13:00 pm
    Quote from: fred on February 15, 2020, 03:52:18 pm
        My first thought would be a Sqlite database, ZMSQL or JCSV (Jans CSV Components) and run select count etc.

    No such software is available on the machine in question, nor will it be installed on my request.

    Bart

This is not "software" in the sense of external programs: ZMSQL and JCSV are just Lazarus packages that get compiled into your exe. Sqlite is the exception; it is a DLL to be distributed along with your exe in its folder. If you are allowed to add an exe to the user's computer then you should also be allowed to add sqlite3.dll.

I agree that SQL would be the easiest way to calculate the percentages under some conditions. Otherwise you must put the data into sorted arrays and count yourself - but this should not be too difficult either.
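
For illustration, the "count yourself" route could look roughly like the sketch below. It is not code from anyone in this thread. The record layout, the convention of using -1 and '' for blank fields, and the sample rows are all assumptions made up for the example. It uses a plain (unsorted) array, which is enough for a few hundred rows.

```pascal
program WardStats;
{$mode objfpc}{$H+}

uses
  SysUtils;

const
  Unknown = -1;  // assumed marker for a blank numeric field

type
  // One CSV row, reduced to the columns needed for the example.
  TPatient = record
    Diagnosis: string;
    Destination: string;  // '' when the field was blank
    Barthel: Integer;     // Unknown when the field was blank
  end;

const
  // Made-up rows, only to exercise the counting logic.
  Patients: array[0..4] of TPatient = (
    (Diagnosis: 'CVA';          Destination: 'Home';         Barthel: 15),
    (Diagnosis: 'CVA';          Destination: 'Nursing home'; Barthel: 6),
    (Diagnosis: 'CVA';          Destination: '';             Barthel: Unknown),
    (Diagnosis: 'Hip fracture'; Destination: 'Home';         Barthel: 12),
    (Diagnosis: 'CVA';          Destination: 'Home';         Barthel: 17)
  );

var
  P: TPatient;
  NCva, NCvaDestKnown, NCvaHome, NBarthel, SumBarthel: Integer;
begin
  NCva := 0;
  NCvaDestKnown := 0;
  NCvaHome := 0;
  NBarthel := 0;
  SumBarthel := 0;

  for P in Patients do
  begin
    if P.Diagnosis <> 'CVA' then
      Continue;
    Inc(NCva);

    // Blank destinations are left out of the denominator.
    if P.Destination <> '' then
    begin
      Inc(NCvaDestKnown);
      if P.Destination = 'Home' then
        Inc(NCvaHome);
    end;

    // Mean Barthel index only over CVA patients sent home with a known score.
    if (P.Destination = 'Home') and (P.Barthel <> Unknown) then
    begin
      Inc(NBarthel);
      Inc(SumBarthel, P.Barthel);
    end;
  end;

  WriteLn(Format('CVA: %.1f%% of all patients',
    [100.0 * NCva / Length(Patients)]));
  if NCvaDestKnown > 0 then
    WriteLn(Format('Home: %.1f%% of CVA patients with a known destination',
      [100.0 * NCvaHome / NCvaDestKnown]));
  if NBarthel > 0 then
    WriteLn(Format('Mean Barthel index (CVA, Home): %.1f',
      [SumBarthel / NBarthel]));
end.
```

The main design decision is what to do with blank fields. Here they are excluded from the denominator of each statistic, so every percentage is taken over the patients for whom that field is known; whether that is the right choice depends on how the figures will be reported.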

fred

  • Full Member
  • ***
  • Posts: 201
Re: What intermediate dataformat should I use to analyse a CSV file?
« Reply #5 on: February 15, 2020, 04:37:09 pm »
Since the data is only about 160 rows, even loading it into a StringGrid and running some for loops would be fast enough.

winni

  • Hero Member
  • *****
  • Posts: 3197
Re: What intermediate dataformat should I use to analyse a CSV file?
« Reply #6 on: February 15, 2020, 04:48:50 pm »
fred was faster ...

Hi!

I would use a StringGrid with only 160 rows.

To make it reusable, write some hardcoded functions that can be used again next year.

Winni

jamie

  • Hero Member
  • *****
  • Posts: 6131
Re: What intermediate dataformat should I use to analyse a CSV file?
« Reply #7 on: February 15, 2020, 07:38:56 pm »
Since this is nothing more than comma text, you can load the complete source into a TStringList.

Each item would be one record.

To split a record into its fields you can use Split, or another TStringList: AnotherStringList.CommaText := TheMainStringList.Strings[?];

From that point on, the secondary TStringList will have the fields broken down, one per item.
The only true wisdom is knowing you know nothing
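
A minimal sketch of the second-TStringList idea follows. One point to watch: CommaText also treats spaces as separators for unquoted values, so a field such as "Nursing home" would be split in two. The sketch therefore uses DelimitedText with StrictDelimiter set instead. The sample row is made up, and the handling of leading or trailing blank fields should be checked against the FPC version in use.

```pascal
program SplitCsvLine;
{$mode objfpc}{$H+}

uses
  Classes, SysUtils;

var
  Fields: TStringList;
  i: Integer;
begin
  Fields := TStringList.Create;
  try
    Fields.Delimiter := ',';
    Fields.QuoteChar := '"';
    // Without StrictDelimiter, spaces also separate fields, which would
    // break a value such as "Nursing home".
    Fields.StrictDelimiter := True;

    // Made-up row: age, days admitted, diagnosis, Barthel (blank), destination.
    Fields.DelimitedText := '82,34,CVA,,Nursing home';

    for i := 0 to Fields.Count - 1 do
      if Fields[i] = '' then
        WriteLn(i, ': (blank)')
      else
        WriteLn(i, ': ', Fields[i]);
  finally
    Fields.Free;
  end;
end.
```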

Bart

  • Hero Member
  • *****
  • Posts: 5290
    • Bart en Mariska's Webstek
Re: What intermediate dataformat should I use to analyse a CSV file?
« Reply #8 on: February 15, 2020, 10:18:04 pm »
Parsing the CSV is not the problem.

@All: thanks for the suggestions so far.

Bart

af0815

  • Hero Member
  • *****
  • Posts: 1291
Re: What intermediate dataformat should I use to analyse a CSV file?
« Reply #9 on: February 15, 2020, 10:24:56 pm »
The first questions for me are always:

How many records do I have today? (You wrote 160.)
How many will I have in 10 years? ( :-) )

How many data fields will I have, and how much memory can I spend?

If the data size in 10 years is no problem for today's memory, use collections if you do not want an external DB.
Collections can be sorted; if you design the compare function right, it works more than one level deep. Make a collection with dummy data (at the 10-year size) and test the speed of sorting and counting.

regards
Andreas

Bart

  • Hero Member
  • *****
  • Posts: 5290
    • Bart en Mariska's Webstek
Re: What intermediate dataformat should I use to analyse a CSV file?
« Reply #10 on: February 15, 2020, 10:37:32 pm »
Quote from: af0815 on February 15, 2020, 10:24:56 pm
    How many will I have in 10 years? ( :-) )

Well, approx. 1600 then.
If I still do then, what I do now.
It's only data from my own ward.

It's not meant for scientific publication.

In the (distant) future we may gather more data, when we move to a more decent electronic patient dossier.
If we then want to do more substantial work on that dataset, we'll seek collaboration with one of the universities, and data processing will be done in an appropriate application (one not written by me).
But that's just distant dreams (or nightmares??).

Could you give a small example on how you would use collections for such data?
I never used that before.

Bart

winni

  • Hero Member
  • *****
  • Posts: 3197
Re: What intermediate dataformat should I use to analyse a CSV file?
« Reply #11 on: February 15, 2020, 10:42:56 pm »
Hi!

I have a StringGrid with 75,000 rows and 14 columns.

Sequential searching takes <1 second. Sorting is lightning fast, whatever they do internally.

So you are unlikely to run into time-critical situations.

The data on the HD is a simple CSV.

Winni
« Last Edit: February 15, 2020, 10:44:29 pm by winni »

af0815

  • Hero Member
  • *****
  • Posts: 1291
Re: What intermediate dataformat should I use to analyse a CSV file?
« Reply #12 on: February 15, 2020, 10:47:58 pm »
Quote from: Bart on February 15, 2020, 10:37:32 pm
    Could you give a small example on how you would use collections for such data?
    I never used that before.

A good starting point is https://wiki.freepascal.org/TCollection

edit:
Something about sorting is here https://forum.lazarus.freepascal.org/index.php?topic=38905.0
« Last Edit: February 15, 2020, 10:56:52 pm by af0815 »
regards
Andreas
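
As a rough idea of what the collection approach might look like, here is a small sketch. It is not taken from the linked pages. The class names TPatientItem and TPatients, the '*' wildcard in CountWhere, the -1 marker for a blank Barthel index, and the sample rows are all hypothetical. It relies on TCollection.Sort taking a compare function, as declared in the FPC Classes unit.

```pascal
program PatientCollection;
{$mode objfpc}{$H+}

uses
  Classes, SysUtils;

type
  // Hypothetical item class; one instance per CSV row.
  TPatientItem = class(TCollectionItem)
  public
    Age: Integer;
    Diagnosis: string;
    Destination: string;  // '' = blank in the CSV
    Barthel: Integer;     // -1 = blank in the CSV (assumed convention)
  end;

  TPatients = class(TCollection)
  public
    constructor Create;
    function AddPatient(AAge: Integer; const ADiag, ADest: string;
      ABarthel: Integer): TPatientItem;
    function CountWhere(const ADiag, ADest: string): Integer;
  end;

constructor TPatients.Create;
begin
  inherited Create(TPatientItem);
end;

function TPatients.AddPatient(AAge: Integer; const ADiag, ADest: string;
  ABarthel: Integer): TPatientItem;
begin
  Result := TPatientItem(Add);
  Result.Age := AAge;
  Result.Diagnosis := ADiag;
  Result.Destination := ADest;
  Result.Barthel := ABarthel;
end;

// Counts items matching both values; pass '*' to accept any value.
function TPatients.CountWhere(const ADiag, ADest: string): Integer;
var
  i: Integer;
  P: TPatientItem;
begin
  Result := 0;
  for i := 0 to Count - 1 do
  begin
    P := TPatientItem(Items[i]);
    if ((ADiag = '*') or (P.Diagnosis = ADiag)) and
       ((ADest = '*') or (P.Destination = ADest)) then
      Inc(Result);
  end;
end;

// Two-level compare: by diagnosis first, then by age.
function ByDiagnosisThenAge(Item1, Item2: TCollectionItem): Integer;
begin
  Result := CompareStr(TPatientItem(Item1).Diagnosis,
                       TPatientItem(Item2).Diagnosis);
  if Result = 0 then
    Result := TPatientItem(Item1).Age - TPatientItem(Item2).Age;
end;

var
  Patients: TPatients;
  i: Integer;
begin
  Patients := TPatients.Create;
  try
    // Made-up rows, only to exercise the code.
    Patients.AddPatient(82, 'CVA', 'Home', 15);
    Patients.AddPatient(77, 'CVA', '', -1);
    Patients.AddPatient(90, 'Hip fracture', 'Home', 12);
    Patients.AddPatient(71, 'CVA', 'Nursing home', 6);

    WriteLn('CVA: ', Patients.CountWhere('CVA', '*'));
    WriteLn('CVA and Home: ', Patients.CountWhere('CVA', 'Home'));

    Patients.Sort(@ByDiagnosisThenAge);
    for i := 0 to Patients.Count - 1 do
      WriteLn(TPatientItem(Patients.Items[i]).Diagnosis, ' ',
              TPatientItem(Patients.Items[i]).Age);
  finally
    Patients.Free;  // the collection frees its items
  end;
end.
```

Compared with a bare array of records, the collection owns its items and can be sorted on several keys with one compare function, at the cost of a little more boilerplate.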

 
