Recent

Author Topic: Proposal: Associative String Arrays  (Read 4864 times)

Warfley

  • Hero Member
  • *****
  • Posts: 1872
Proposal: Associative String Arrays
« on: May 11, 2023, 04:24:06 pm »
So this is probably the thing that most bugs me when doing a Pascal project, that I don't have any easy to use string dictionaries at hand. I'd argue that the probably most useful datastructure aside from the base types and arrays are probably a simple dictionary for String based Key-Value pairs.

There are some types available, e.g. TDictionary and THashMap from Generics.Collections or TFPGMap from fgl, but they don't really work (at least out of the box) with string keys, as they use the pointer to that string as key rather than the string itself.
There are alternatives, like TStringList, and there are some oldschool Hashmap types that map to raw pointers, but they are very cumbersome and inconvinient to use. And that is a staple and a very essential datatype in pretty much all other languages.

So my idea was, that this could be added on the language level as simpe associative array:
Code: Pascal  [Select][+][-]
  1. var
  2.   map: Array[String] of Integer;
  3. begin
  4.   map['foo'] := 42;
  5.   if 'foo' in map then
  6.     WriteLn(map['foo']);
  7. end;

On a syntactic level this would not be that dissimilar from the existing associative arrays for enums:
Code: Pascal  [Select][+][-]
  1. type
  2.   TTest = (test1, test2);
  3. var
  4.   map: Array[TTest] of Integer;

Of course the underlying technology must be differently, as the above will just create an Array with test1..test2, but from a high level view it would not be that different, and be very useful.

There is already a lot of compiler magic and special treatment for strings, e.g. they work in case-of, and overall are a pretty magical datatype, so adding some more magic would not be that disruptive, and in return I can't stress enough how useful this would be (especially if it's lazy copy and reference counted like normal dynamic arrays).

marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 12142
  • FPC developer.
Re: Proposal: Associative String Arrays
« Reply #1 on: May 11, 2023, 04:35:57 pm »
What is wrong with

Code: Pascal  [Select][+][-]
  1.  
  2. {$mode delphi}
  3. uses generics.collections;
  4.  
  5. type  TStrInt = TDictionary<string,integer>;
  6.  
  7.     var
  8.       map : TStrInt;
  9.     begin
  10.       map:=tstrint.create;
  11.       map.addorsetvalue('foo',42);
  12.       if map.containskey('foo') then
  13.         WriteLn(map['foo']);
  14.     end.
  15.  

And no, afaik generic.collections has special cuckoo2 hash for string content
« Last Edit: May 11, 2023, 04:41:28 pm by marcov »

Warfley

  • Hero Member
  • *****
  • Posts: 1872
Re: Proposal: Associative String Arrays
« Reply #2 on: May 11, 2023, 05:09:47 pm »
Interesting, because in a program where I've used Generics.Collections and I had a bunch of duplicates in the map. Then I must investigate a bit on why that was. Because this has triggered a search for me for other dictionary components that are string capable, and I ended up with some ancient Hashmap implementation. Which is why I created that thread, because using this old Pointer based Datatype made life much harder than it needs to be.

I still think that associative arrays on language level can be useful (one of the things I really like about python), but if generics.Collections works, then there is at least something usable
« Last Edit: May 11, 2023, 05:12:20 pm by Warfley »

marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 12142
  • FPC developer.
Re: Proposal: Associative String Arrays
« Reply #3 on: May 11, 2023, 05:49:28 pm »
Interesting, because in a program where I've used Generics.Collections and I had a bunch of duplicates in the map.

Note the addorsetvalue vs simply .add.

Quote
I still think that associative arrays on language level can be useful (one of the things I really like about python), but if generics.Collections works, then there is at least something usable

It is a typical scripting feature, where everything is a list or map. It is the proverbial hammer in the "if the only tool you have is a hammer" phrase, but fine for quick and dirty batch operations.

I'm not convinced of the need for it a language feature in a non scripting language. Basically it is only curb-appeal for newbs.

Warfley

  • Hero Member
  • *****
  • Posts: 1872
Re: Proposal: Associative String Arrays
« Reply #4 on: May 11, 2023, 09:51:58 pm »
It's not just scripting languages, whenever you want to associate data with an identifier that is useful, e.g. when parsing files and wanting to cache that parsing result, or when associating data records with names of the subjects of that record, or when you want to associate connections with usernames in a server (e.g. for a chat).

Just as an example, I wrote a parser which has automatas to parse structures, and for debugging purposes I want to be able to lookup which automata is associated with what symbol

I have not had a single project in the past 5 years where I did not need a dictionary mapping a base type (mostly string) anti a more complex object. It's the natural counterpart to having some form of ID or name in your data field. Whenever you have a data type that has a name or ID, you probably also want some way to look it up.
« Last Edit: May 11, 2023, 09:53:37 pm by Warfley »

marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 12142
  • FPC developer.
Re: Proposal: Associative String Arrays
« Reply #5 on: May 11, 2023, 09:53:35 pm »
I was talking about the need for in-language specifically, not the use of maps in general.


Warfley

  • Hero Member
  • *****
  • Posts: 1872
Re: Proposal: Associative String Arrays
« Reply #6 on: May 11, 2023, 09:55:34 pm »
Data types that you use often should be language supported. There is technically no need for strings or arrays to be supported on a language level, yet they are because they are that often used that this level of ease is warranted. I argue that this list should be extended by associative array/dictionaries. Hell pascal has sets, and I use a dict much more often than a set

marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 12142
  • FPC developer.
Re: Proposal: Associative String Arrays
« Reply #7 on: May 11, 2023, 10:03:20 pm »
I think I use sets more. Great for statemachines and the like.

PascalDragon

  • Hero Member
  • *****
  • Posts: 5935
  • Compiler Developer
Re: Proposal: Associative String Arrays
« Reply #8 on: May 11, 2023, 10:05:46 pm »
So my idea was, that this could be added on the language level as simpe associative array:
Code: Pascal  [Select][+][-]
  1. var
  2.   map: Array[String] of Integer;
  3. begin
  4.   map['foo'] := 42;
  5.   if 'foo' in map then
  6.     WriteLn(map['foo']);
  7. end;

On a syntactic level this would not be that dissimilar from the existing associative arrays for enums:
Code: Pascal  [Select][+][-]
  1. type
  2.   TTest = (test1, test2);
  3. var
  4.   map: Array[TTest] of Integer;

Of course the underlying technology must be differently, as the above will just create an Array with test1..test2, but from a high level view it would not be that different, and be very useful.

Classes like Generics.Collections.TDictionary<,> and FGL.TFPGMap<,> already fullfill this purpose and work today and it's much easier to change the implementation than with a compiler-magic type. So, no, I'm against something like this. There are much more important things that need to be implemented that can't be easily done with existing functionality.

Warfley

  • Hero Member
  • *****
  • Posts: 1872
Re: Proposal: Associative String Arrays
« Reply #9 on: May 11, 2023, 10:15:31 pm »
I think I use sets more. Great for statemachines and the like.
I also like them, but, similarly to the dict, I would like to see them be able to be used with more types like strings or doubles or so, with the compiler internally deciding when to use a hash set for large types and a bit set for small types. Like sets are amazing for ASCII parsing automatas, but for UTF codepoints they are too small

Stefan Glienke

  • New Member
  • *
  • Posts: 24
Re: Proposal: Associative String Arrays
« Reply #10 on: May 12, 2023, 01:02:37 am »
afaik generic.collections has special cuckoo2 hash for string content
By default it uses crc32 or xxhash32 (see initialization of Generics.Hashes)

440bx

  • Hero Member
  • *****
  • Posts: 5137
Re: Proposal: Associative String Arrays
« Reply #11 on: May 12, 2023, 02:01:01 am »
Classes like Generics.Collections.TDictionary<,> and FGL.TFPGMap<,> already fullfill this purpose and work today and it's much easier to change the implementation than with a compiler-magic type. So, no, I'm against something like this. There are much more important things that need to be implemented that can't be easily done with existing functionality.
I have to concur.

When it comes to the implementation, critical considerations are, among others, the number of elements the array will host and the hashing algorithm used. 

In particular, a hashing algorithm that produces good results in one case can easily produce dismal ones in other cases and there is no way for a compiler to "divine" which algorithm it should use to ensure good performance.  The programmer is the one who should make the choice based on the nature of the data, not the compiler.

Interpreters can get away with this kind of inefficiencies because no one can expect an interpreter to rival the speeds obtained by a compiler (at least in most cases.)

I completely agree that hash tables, which is all an associative array is, should be implemented externally.
(FPC v3.0.4 and Lazarus 1.8.2) or (FPC v3.2.2 and Lazarus v3.2) on Windows 7 SP1 64bit.

Warfley

  • Hero Member
  • *****
  • Posts: 1872
Re: Proposal: Associative String Arrays
« Reply #12 on: May 12, 2023, 10:44:03 am »
In particular, a hashing algorithm that produces good results in one case can easily produce dismal ones in other cases and there is no way for a compiler to "divine" which algorithm it should use to ensure good performance.  The programmer is the one who should make the choice based on the nature of the data, not the compiler.

Interpreters can get away with this kind of inefficiencies because no one can expect an interpreter to rival the speeds obtained by a compiler (at least in most cases.)

I completely agree that hash tables, which is all an associative array is, should be implemented externally.

One does not exclude the other. Having a solution that is easy to use and works "good enough" most of the time is still an improvement over something that is more complex to use but configurable for every Situation.

Take for example generics.collection maps, they use a runtime comparer using runtime polymorphism to handle the data. That's kinda inefficient, to always do VMT lookups for a simple comparison. A compiletime comparer like the generic parameter of the ghashmaps THashmap does not have any indirection and can even be inclined, making it much more efficient. But using gHashMap is much more complex then TDictionary, I often use TDictionary, as long as it's fast enough. If I run into problems I can change the hasmap implementation afterwards.

Similarly looking at the already existing dynamic arrays and strings, their lazy copy mechanism and reference counting is sub optimal in many cases where you don't pass the array around. Also insertion can be quite slow, due to linear growth. So if you need high performance it can be better to use TList instead (I even had at one point wrote my own array implementation with raw pointers due to performance). This does not mean that arrays are useless as a an inbuilt language construct, just that in some cases you may need to manually choose a more complex alternative.

If we would follow such a philosophy consequently, we would end up with something like C where you always have to do everything yourself from scratch, and ever program begins with you writing your own array implementation (I've spent uncountable hours implementing arrays manually in C for different projects, it's always the same for every new program I write), because c does not provide anything higher than raw memory management and basic control flow
« Last Edit: May 12, 2023, 10:47:04 am by Warfley »

BeniBela

  • Hero Member
  • *****
  • Posts: 923
    • homepage
Re: Proposal: Associative String Arrays
« Reply #13 on: May 13, 2023, 01:04:29 am »
Take for example generics.collection maps, they use a runtime comparer using runtime polymorphism to handle the data. That's kinda inefficient, to always do VMT lookups for a simple comparison. A compiletime comparer like the generic parameter of the ghashmaps THashmap does not have any indirection and can even be inclined, making it much more efficient. But using gHashMap is much more complex then TDictionary, I often use TDictionary, as long as it's fast enough. If I run into problems I can change the hasmap implementation afterwards.

Similarly looking at the already existing dynamic arrays and strings, their lazy copy mechanism and reference counting is sub optimal in many cases where you don't pass the array around. Also insertion can be quite slow, due to linear growth. So if you need high performance it can be better to use TList instead (I even had at one point wrote my own array implementation with raw pointers due to performance). This does not mean that arrays are useless as a an inbuilt language construct, just that in some cases you may need to manually choose a more complex alternative.


I write my own arrays and hashmaps in Pascal, too.

I was just debugging my hashmap a few minute agos, after trying to change the stored types, and having forgotten how to change them.

The problem is that FPC does dynamic dispatch even for build in types like strings. Even if it knows the type at compile time, all memory management goes through functions that use RTTI to determine the type at runtime: https://gitlab.com/freepascal.org/fpc/source/-/merge_requests/383

So I have a triple leveled hashmap. First a fully generic hash map. Then I specialize it with pointer, because pointer avoid that memory management.  Then I put a generic hash map over it, that cast everything to pointer and then puts it in the specialized map.

Like when I have a string, it is faster to keep the string in a pointer variable and call fpc_AnsiStr_Decr_Ref on it to free it, rather than relying on the automated freeing of fpc

jcmontherock

  • Sr. Member
  • ****
  • Posts: 290
Re: Proposal: Associative String Arrays
« Reply #14 on: May 13, 2023, 11:20:26 am »
In that case I personally use TStringList with names-values.
Windows 11 UTF8-64 - Lazarus 4.0RC2-64 - FPC 3.2.2

 

TinyPortal © 2005-2018