Recent

Author Topic: How can one optimise inserting an item into an array, anywhere but the end?  (Read 3589 times)

MarkMLl

  • Hero Member
  • *****
  • Posts: 6676
I was afraid someone would say that, and quite frankly, I think I know it in my bones, but hey, if you don't ask you keep ignorant :)

Well, there's B-Trees, B+-Trees and Bonsai Trees... and there was something about those that made IBM buy the company that invented them.

There's definitely hybrid approaches you could use. For example, if you knew that your overall array would never have more than 64K elephants you could implement it as 256 dynamic arrays each of up to 256 elements, each of which had an index, and an otter class wrapper to handle indexed access.

The absolutely crucial thing is to look at how often there is a mid-array insertion, and how far it is from the end, relative to the number of reads.

MarkMLl
MT+86 & Turbo Pascal v1 on CCP/M-86, multitasking with LAN & graphics in 128Kb.
Pet hate: people who boast about the size and sophistication of their computer.
GitHub repositories: https://github.com/MarkMLl?tab=repositories

Gustavo 'Gus' Carreno

  • Hero Member
  • *****
  • Posts: 1114
  • Professional amateur ;-P
Hey Mark,

Well, there's B-Trees, B+-Trees and Bonsai Trees... and there was something about those that made IBM buy the company that invented them.

Eheheh!! So now, it begs the question: Does FPC ship with an implementation of a binary tree kinda container?

There's definitely hybrid approaches you could use. For example, if you knew that your overall array would never have more than 64K elephants you could implement it as 256 dynamic arrays each of up to 256 elements, each of which had an index, and an otter class wrapper to handle indexed access.

You know what? Marco also mentioned arrays of arrays and it only clicked with this explanation of yours!! Thanks Mark, this helped a bit.

The absolutely crucial thing is to look at how often there is a mid-array insertion, and how far it is from the end, relative to the number of reads.

To be quite honest, a lot of the usage will see the array containing only around 1 to ~50 items. Yeah, again stuff from the coder I mentioned in the "non-blocking" post :(
But I like to have it future proof, where there could be items in the hundreds, thousands even...

I like my containers future proof and optimised. And maybe I'm suffering from early optimization sickness and I should just use a bloody array for the time being, and then when the load gets a bit too much for the array, then bite the bullet and just do the bloody binary tree !!

I dunno!!!!

Cheers,
Gus
Lazarus 3.99(main) FPC 3.3.1(main) Ubuntu 23.10 64b Dark Theme
Lazarus 3.0.0(stable) FPC 3.2.2(stable) Ubuntu 23.10 64b Dark Theme
http://github.com/gcarreno

MarkMLl

  • Hero Member
  • *****
  • Posts: 6676
To be quite honest, a lot of the usage will see the array containing only around 1 to ~50 items. Yeah, again stuff from the coder I mentioned in the "non-blocking" post :(
But I like to have it future proof, where there could be items in the hundreds, thousands even...

For something that small, have a maximal-sized array of integers as an index and an unsorted dynamic array. But be aware that extending a dynamic array can be slow, so do it in chunks.

MarkMLl
MT+86 & Turbo Pascal v1 on CCP/M-86, multitasking with LAN & graphics in 128Kb.
Pet hate: people who boast about the size and sophistication of their computer.
GitHub repositories: https://github.com/MarkMLl?tab=repositories

Gustavo 'Gus' Carreno

  • Hero Member
  • *****
  • Posts: 1114
  • Professional amateur ;-P
Hey Mark,

For something that small, have a maximal-sized array of integers as an index and an unsorted dynamic array. But be aware that extending a dynamic array can be slow, so do it in chunks.

During the past few days I've been imagining ways of pulling it off and one of them was exactly that. Thanks for validating some of my brain wanderings :)

And you've now also gave me the answer to why there is a variable called capacity attached to dynamic arrays. We have to SetLengh() in chunks because it's costly!!

Again, many thanks for your share of knowledge !!

Cheers,
Gus
« Last Edit: August 13, 2022, 06:49:52 am by Gustavo 'Gus' Carreno »
Lazarus 3.99(main) FPC 3.3.1(main) Ubuntu 23.10 64b Dark Theme
Lazarus 3.0.0(stable) FPC 3.2.2(stable) Ubuntu 23.10 64b Dark Theme
http://github.com/gcarreno

440bx

  • Hero Member
  • *****
  • Posts: 3946
And you've now also gave me the answer to why there is a variable called capacity attached to dynamic arrays. We have to SetLengh() in chunks because it's costly!!
Is this something that will be Windows only or does it have to be multiplatform ?
(FPC v3.0.4 and Lazarus 1.8.2) or (FPC v3.2.2 and Lazarus v3.2) on Windows 7 SP1 64bit.

AlexTP

  • Hero Member
  • *****
  • Posts: 2387
    • UVviewsoft
Gustavo,
TList (and generics like TFPGList) are based on array internally. And yes, inserting to the middle - makes slow array moving - like Move( ) that you seen in this thread. So it is not optimal. What is optimal from the point of view of insertion - is linked list.

Zvoni

  • Hero Member
  • *****
  • Posts: 2319
Staying with the original question (Inserting into an Array):
In Visual Basic 6 (Blast the Dot CRAP) i did it as follows

(Don't start with Array-Index beginning at 0, this is to visualize it)
You have a Source-Array with Length 100 (1..100)
Now you want to insert at Position 42
Create a Destination Array with Length 101 (1..101)
Copy the Memory starting at Pos 1 in Source-Array with Length 41*MemberSize (Member can be 2 Bytes, 4 Bytes, 8 Bytes, A "hidden" Pointer--> Strings, Records etc.) to DestArray[1]
Assign Array-Member[42] the new value
Copy the Memory starting at Pos 42 in Source-Array to Dest-Array[43] with Length 59*MemberSize
Discard/Destroy/Erase Source-Array

No idea about performance and what not.
In VBA it worked without problems, and i never noticed any performance glitches, though maybe i just didn't have enough data in my arrays
*shrug*
One System to rule them all, One Code to find them,
One IDE to bring them all, and to the Framework bind them,
in the Land of Redmond, where the Windows lie
---------------------------------------------------------------------
Code is like a joke: If you have to explain it, it's bad

MarkMLl

  • Hero Member
  • *****
  • Posts: 6676
Copy the Memory starting at Pos 1 in Source-Array with Length 41*MemberSize (Member can be 2 Bytes, 4 Bytes, 8 Bytes, A "hidden" Pointer--> Strings, Records etc.) to DestArray[1]
Assign Array-Member[42] the new value
Copy the Memory starting at Pos 42 in Source-Array to Dest-Array[43] with Length 59*MemberSize
Discard/Destroy/Erase Source-Array

If access assumed that the array was actually split into two fragments A and B (Hell, I wasn't the one who brought BASIC into it :-) with A containing 100 elements and B empty, then inserting element 42 would involve moving the top fragment into B, trimming A and then appending to A. An insertion of an element >42 would now involve appending the unaffected elements from B to A, and so on.

However the management and performance burden would be substantially worse than using e.g. index+store, or "doing it properly" with a linked list or tree.

MarkMLl
MT+86 & Turbo Pascal v1 on CCP/M-86, multitasking with LAN & graphics in 128Kb.
Pet hate: people who boast about the size and sophistication of their computer.
GitHub repositories: https://github.com/MarkMLl?tab=repositories

Zvoni

  • Hero Member
  • *****
  • Posts: 2319

However the management and performance burden would be substantially worse than using e.g. index+store, or "doing it properly" with a linked list or tree.

MarkMLl
Mark, don't get me wrong: I actually agree with you, i just addressed the original question, and my answer was, how i did/do it in VBA
My answer describes more the "algorithm"/technique i use(d)
One System to rule them all, One Code to find them,
One IDE to bring them all, and to the Framework bind them,
in the Land of Redmond, where the Windows lie
---------------------------------------------------------------------
Code is like a joke: If you have to explain it, it's bad

PascalDragon

  • Hero Member
  • *****
  • Posts: 5446
  • Compiler Developer
And for small amounts contained in the array, it will make sense.
But at what point is this approach gonna be inefficient?

So my original idea to optimise it is to have a low level call that would shift a whole block of mem, one byte to the right.

But then more questions appear:
  • Is this possible?
  • Is it the best solution?
  • When does the array pass the boundary of a page, and there need to be 2 mem copy operations?

For small arrays you can just as well use Insert and call it a day as it tries to be as performant as possible for what it can work with. Not to mention that it also deals with the nitty, gritty details of managed types.

For larger arrays you will likely need additional or different data structures (e.g. the mentioned binary trees).

Also as long as you aren't developing a database system that needs to be sure where each page is (in memory or on disk) then you don't need to deal with pages aside from "is it valid?".

Gustavo 'Gus' Carreno

  • Hero Member
  • *****
  • Posts: 1114
  • Professional amateur ;-P
Hey 440bx,

Is this something that will be Windows only or does it have to be multiplatform ?

The intent, at least with me, is to always aim at cross-platform.

Cheers,
Gus
Lazarus 3.99(main) FPC 3.3.1(main) Ubuntu 23.10 64b Dark Theme
Lazarus 3.0.0(stable) FPC 3.2.2(stable) Ubuntu 23.10 64b Dark Theme
http://github.com/gcarreno

Gustavo 'Gus' Carreno

  • Hero Member
  • *****
  • Posts: 1114
  • Professional amateur ;-P
Hey Alex,

TList (and generics like TFPGList) are based on array internally. And yes, inserting to the middle - makes slow array moving - like Move( ) that you seen in this thread. So it is not optimal. What is optimal from the point of view of insertion - is linked list.

Ok, many thanks ALex, for confirming that a lot of the list classes are implemented over an array :)

And I do agree that for insertion, the linked list is the best, since it's O(1). The problem is locating the place to insert, and that can be O(N), which is the worst :(

Still, you gave me an assurance that I wasn't saying bullshit about the lists being based in arrays and I cannot thank you enough for that!!!

Cheers,
Gus
Lazarus 3.99(main) FPC 3.3.1(main) Ubuntu 23.10 64b Dark Theme
Lazarus 3.0.0(stable) FPC 3.2.2(stable) Ubuntu 23.10 64b Dark Theme
http://github.com/gcarreno

Gustavo 'Gus' Carreno

  • Hero Member
  • *****
  • Posts: 1114
  • Professional amateur ;-P
Hey PascalDragon,

For small arrays you can just as well use Insert and call it a day as it tries to be as performant as possible for what it can work with. Not to mention that it also deals with the nitty, gritty details of managed types.

Okidokes, many thanks for more complete tidbits surrounding arrays and the Insert in particular !!!

For larger arrays you will likely need additional or different data structures (e.g. the mentioned binary trees).

Hummm, confirms my suspitions, thanks!
Does Free Pascal come with a balanced binary tree implementation somewhere in it's bowels? <--- EDIT

Also as long as you aren't developing a database system that needs to be sure where each page is (in memory or on disk) then you don't need to deal with pages aside from "is it valid?".

Thanks, again, since this is rather good to know!!

As usual, my dear PascalDragon, a trove of information that is useful and presented in a clear and succinct way. Kudos mate!!

Cheers,
Gus
« Last Edit: August 13, 2022, 07:20:08 am by Gustavo 'Gus' Carreno »
Lazarus 3.99(main) FPC 3.3.1(main) Ubuntu 23.10 64b Dark Theme
Lazarus 3.0.0(stable) FPC 3.2.2(stable) Ubuntu 23.10 64b Dark Theme
http://github.com/gcarreno

jollytall

  • Sr. Member
  • ****
  • Posts: 306
A bit off, but still worth considering.
Where does your data to insert come from, and when do you need to use it?
Typically, either you have the data upfront, before actually using it (e.g. on disk) or the data slowly comes while also being processed (e.g. user input, sensors, network).
In the first case, maybe first load all data without inserting to the right place and sort the array once before use.
In the second case, unless talking about very large data sets, the input is slow enough to do whatever insert method you want.

BrunoK

  • Sr. Member
  • ****
  • Posts: 452
  • Retired programmer
In FPC 3.2.2, for example
\packages\fcl-base\src\avl_tree.pp
Explanations :
https://en.wikipedia.org/wiki/AVL_tree

 

TinyPortal © 2005-2018