Recent

Author Topic: Help with TStringlist duplicate names with different values  (Read 3134 times)

OH1KH

  • Jr. Member
  • **
  • Posts: 63
Help with TStringlist duplicate names with different values
« on: November 19, 2019, 06:37:36 pm »
Hi!

I have a text file from where I extract  name=value pairs.
How ever this file can contain duplicate names having different values on which the last (towards the end of file) is the current name=value pair that should be used.

So far I have managed to read text file line by line, dig out name and value from every line and add them to TStringlist.

This takes some seconds, that is reasonable time for 1 500 000 name=value pairs.

But then comes the hard part. I have to find duplicates and save only the last one. I have not found a fast way for it.
Reading text file from end to beginning and adding string list with  stringlist.Duplicates:=dupIgnore setting does not work as the value part may differ and so the stringlist.Add( 'name=value) may not show out as duplicate and the gets added even if there is already a name like that.

Is there a clever and fast  way to do this? 

if we do not count mysql database usage, or external Linux tools. Maybe usage of TMemIniFile ?

 I have found only stupid and slow ways.


--
Saku

Bart

  • Hero Member
  • *****
  • Posts: 5275
    • Bart en Mariska's Webstek
Re: Help with TStringlist duplicate names with different values
« Reply #1 on: November 19, 2019, 06:51:54 pm »
Lazy solution: reverse the stringlist then Values[SomeKey] will return the now first occurence of SomeKey (the last one in the original).
Not very nice with > a million strings...

You might be able to use PosEx on the Text, until you find no further occurence of 'ThisKey='.

Maybe TVistualStringTree can do this more efficiently?

Bart

winni

  • Hero Member
  • *****
  • Posts: 3197
Re: Help with TStringlist duplicate names with different values
« Reply #2 on: November 19, 2019, 06:55:07 pm »
Hi!

Stringlist with 1.5 mio items - take time ....

After readling the list you do a

Code: Pascal  [Select][+][-]
  1. screen.cursor := crHourglass:
  2. MyStringList.sort;
  3. screen.cursor := crDefault;

You can make fresh coffee inbetween, but you see from the cursor if sort is ready.

Then you go from top to botton and compare the names:

Code: Pascal  [Select][+][-]
  1. For i := MyStringList.Count - 1 downto 1 do
  2.   begin
  3.   // get the names for i and i-1
  4.  if nameI = nameI1 then MyStringList.delete (i-1)
  5.  end;
  6.  
  7. MyStringList.savetoFile (....

This does not take the value in account. Is this a problem?

Winni

avk

  • Hero Member
  • *****
  • Posts: 752
Re: Help with TStringlist duplicate names with different values
« Reply #3 on: November 19, 2019, 07:04:39 pm »
Using hashmap seems to be more preferable in this case.

devEric69

  • Hero Member
  • *****
  • Posts: 648
Re: Help with TStringlist duplicate names with different values
« Reply #4 on: November 19, 2019, 07:31:30 pm »
For information, there is the TStringHash class that could be useful too (" It implements a bucket list for Name=Value pairs, where Value is an integer. This enables quick lookup of values based on a name. [snip] the property TStringHash.AddReplacesExisting indicates if Add should replace existing values or not ").
use: Linux 64 bits (Ubuntu 20.04 LTS).
Lazarus version: 2.0.4 (svn revision: 62502M) compiled with fpc 3.0.4 - fpDebug \ Dwarf3.

lucamar

  • Hero Member
  • *****
  • Posts: 4219
Re: Help with TStringlist duplicate names with different values
« Reply #5 on: November 19, 2019, 08:27:06 pm »
You could also use IndexOfName to find similar strings. Basically something like:

Code: Pascal  [Select][+][-]
  1. function GetLastValue(const AName: String; AList: TStringList): String
  2. var
  3.   i: Integer;
  4. begin
  5.   Result := '';
  6.   repeat
  7.     i := AList.IndexOfName(AName);
  8.     if i >= 0 then begin
  9.       Result := AList.Values[i];
  10.       AList.Delete(i);
  11.     end;
  12.   until (i < 0);
  13. end;
Turbo Pascal 3 CP/M - Amstrad PCW 8256 (512 KB !!!) :P
Lazarus/FPC 2.0.8/3.0.4 & 2.0.12/3.2.0 - 32/64 bits on:
(K|L|X)Ubuntu 12..18, Windows XP, 7, 10 and various DOSes.

wp

  • Hero Member
  • *****
  • Posts: 11857
Re: Help with TStringlist duplicate names with different values
« Reply #6 on: November 19, 2019, 11:12:53 pm »
Here is a complete procedure which reads the data file line by line into a sorted StringList. It checks whether the name part of the string (string before '=') is already contained. If not, the current entry is deleted. Then the entire line is added to the stringlist. This way only the latest occurence of a name=value pair is kept. In total the entire procedure is rather short.

I tested only 100,000 entries and found that the routine runs for about half a minute - did not want to wait for 1.5 millions.

Certainly, any more sophisticated list class will be faster because TStringList does only linear search.

Code: Pascal  [Select][+][-]
  1. procedure TForm1.ReadDataFile(AFileName: String);
  2. var
  3.   List: TStringList;
  4.   F: TextFile;
  5.   s, sn: String;
  6.   idx: Integer;
  7.   t: TDateTime;
  8. begin
  9.   List := TStringList.Create;
  10.   try
  11.     List.Sorted := true;
  12.  
  13.     t := Now;
  14.     AssignFile(F, AFileName);
  15.     Reset(F);
  16.     while not EoF(F) do
  17.     begin
  18.       ReadLn(F, s);
  19.       sn := copy(s, 1, pos('=', s)-1);
  20.       idx := List.IndexOfName(sn);
  21.       if idx > -1 then
  22.         List.Delete(idx);
  23.       List.Add(s);
  24.     end;
  25.     CloseFile(F);
  26.     Caption := FormatDateTime('n:ss.zzz', Now-t);
  27.   finally
  28.     List.Free;
  29.   end;
  30. end;  

jamie

  • Hero Member
  • *****
  • Posts: 6090
Re: Help with TStringlist duplicate names with different values
« Reply #7 on: November 19, 2019, 11:50:17 pm »
Here is my master piece, it finds the last one in the list and returns the value..

I suppose it could be modified to also return the index but things get slow when you start dealing with large list so here it is.
Code: Pascal  [Select][+][-]
  1. function GetLastValue(const S: string; AList: string): string;
  2. var
  3.   LastOneFound, CurrentOnefound: integer;
  4. begin
  5.   Result := '';
  6.   lastOneFound := 0;
  7.   CurrentOneFound := 1;
  8.   repeat
  9.     CurrentOnefound := PosEx(S, AList, CurrentOnefound);
  10.     if (CurrentOneFound <> 0) and (AList[CurrentOneFound + Length(S)] = '=') then
  11.       if (CurrentOneFOund = 1) or (Alist[CurrentOneFound - 1] < #32) then
  12.         LastOneFound := CurrentOneFound;
  13.     if CurrentOneFound <> 0 then
  14.       while AList[CurrentOneFound] >= #32 do
  15.         Inc(CurrentOneFound);
  16.   until CurrentOneFound = 0;
  17.   if lastOneFound <> 0 then
  18.   begin
  19.     Inc(LastOneFound, Length(S) + 1);
  20.     while not (Alist[LastOneFound] in [#0, #13, #10]) do
  21.     begin
  22.       Result := Result + Alist[LastOneFound];
  23.       Inc(LastOneFound);
  24.     end;
  25.   end;
  26. end;
  27.  
  28. procedure TForm1.Button1Click(Sender: TObject);
  29. begin
  30.   Caption := GetLastValue('pie', Memo1.Lines.Text);
  31. end;                                                
  32.  

That does a direct search on the string body and moves along the lines.
I believe this one to be about as fast as you will get it for searching for the last entry and getting the value of it.
 At least for pascal code that is  :D

 A little tinkering maybe required but it seems to work for a list I tried.
The only true wisdom is knowing you know nothing

avk

  • Hero Member
  • *****
  • Posts: 752
Re: Help with TStringlist duplicate names with different values
« Reply #8 on: November 20, 2019, 06:58:39 am »
@jamie, what about the performance of your method on data of 1,500,000 lines?

jamie

  • Hero Member
  • *****
  • Posts: 6090
Re: Help with TStringlist duplicate names with different values
« Reply #9 on: November 20, 2019, 10:55:12 pm »
I don't know, I have no idea. That's a lot of lines.

However, a reverse search will work just fine since it's expected to have the value at the end I guess.
The only true wisdom is knowing you know nothing

winni

  • Hero Member
  • *****
  • Posts: 3197
Re: Help with TStringlist duplicate names with different values
« Reply #10 on: November 20, 2019, 11:13:38 pm »
@OH1KH

Please tell us

* what kind of type  is the "value" - integer, string or???
* does the "value" is the decision which one of the duplicates must be deleted. Like: the highest must not be deleted.

Before we don't have this information we all stumble in the dark.

Winni

wp

  • Hero Member
  • *****
  • Posts: 11857
Re: Help with TStringlist duplicate names with different values
« Reply #11 on: November 20, 2019, 11:29:01 pm »
which one of the duplicates must be deleted. Like: the highest must not be deleted.
Didn't he say that? "I have to find duplicates and save only the last one. "

jamie

  • Hero Member
  • *****
  • Posts: 6090
Re: Help with TStringlist duplicate names with different values
« Reply #12 on: November 20, 2019, 11:44:26 pm »
Ok, I have part of the request...

if you have a known Key name and only want the last one in the list... then taking a know name you can do this.
Code: Pascal  [Select][+][-]
  1. Function ReverseGetLastValue(Const S,AL:String):String;
  2. Var
  3.   LineStart:Integer;
  4.   Found :Boolean = false;
  5.   EqualPos :Integer;
  6. Begin
  7.   Result := '';
  8.   LineStart := Length(AL);
  9.   Repeat
  10.    { Get off the line endings }
  11.    While (LineStart > 1) and (AL[LineStart] < #32) DO dec(LineStart);
  12.    {Move to Start of Line }
  13.    While (LineStart > 1)And(AL[LineStart-1]>=#32) do Dec(LineStart);
  14.    EqualPos := LineStart;
  15.    {Move To the = }
  16.    While (AL[EqualPos]<>'=')and(AL[EqualPos]>#32) do Inc(EqualPos);
  17.    If (AL[EqualPos]='=')and(Copy(AL,LineStart, EqualPos-Linestart) = S) then
  18.      Begin
  19.       Inc(EqualPos);
  20.       Found := True;
  21.       While AL[EQualPos] >=#32  Do
  22.         begin
  23.           Result := Result+AL[EqualPos];
  24.           Inc(EqualPos);
  25.         End;
  26.      End Else Dec(LineStart);
  27.   Until (Linestart = 0) or (Found);
  28. End;
  29. procedure TForm1.Button1Click(Sender: TObject);
  30. begin
  31.   Caption := ReverseGetLastValue('noodlehead', Memo1.Lines.Text);
  32. end;                                                                
  33.  

 So once a Key is found in the list of interest, you can then search the list for the last one doing it this way..
 Save the key in another list so you don't repeat it again. Using a TStringList can be done for those after the fact..
The only true wisdom is knowing you know nothing

winni

  • Hero Member
  • *****
  • Posts: 3197
Re: Help with TStringlist duplicate names with different values
« Reply #13 on: November 20, 2019, 11:53:56 pm »
@wp

What does he mean with  "the last"??? The highest value??  Are the values sorted.?? Are they of interest???

For all:

Creating 1.5 mio random names: 11.8 sec  (55 MB)
Loading  into StringList: 0.3 sec
Sorting : 47.6 sec

Ryzen 5, good old harddisk

So if this is a one time action the 47 seconds should not be a problem.

Winni

jamie

  • Hero Member
  • *****
  • Posts: 6090
Re: Help with TStringlist duplicate names with different values
« Reply #14 on: November 21, 2019, 12:04:13 am »
This looks like a logging file, write out keynames and their values over time..

 Which means this, one most likely already knows the key names that will be in there to start with and taking that into account, you can scan backwards from the list getting the last one that was entered.

  However, if the key names are not known, then one will need to scan the list starting at the beginning, read a key in , check to see if its in key list yet that is being built, if so then don't bother and move on. If it's not yet in the list then add it to this side list and then call the function like laid out to get the last one on the list..

  I am getting a feeling this is a home work assignment so this means the key names are not known to start with. If that being the case a forward read is needed to ensure you have all the unknown keys before you decide on duplicates.

 Yup, looks like home work!

 I am done.
The only true wisdom is knowing you know nothing

 

TinyPortal © 2005-2018