checking if a dividend / divisor pair entering a division is the same as last time division was activated
The reason for that is to enable existing code to work without changes... in my experience code like this is quite common:
d:= (x div y);
m:= (x mod y);
And I wanted that code run as fast as possible and not have to be rewritten by the user. The overhead will be small.
I can understand that, but some mantras simply don't translate well from short to long integers. In the special case above it is not even a good idea for 'standard' integers - imho it is actually a layer 8 problem. First, there is a DivMod function in FPC (albeit only for UInt32) which delivers quotient and remainder. Even if that was not available the following is faster (and people doing arbitrary precision stuff are aware of this)
d:= (x div y);
m:= x - y*d;
When you calculate the quotient for the 'Div' operator you automatically calculate the remainder too - you're just not exposing it to the user. If you look at how I addressed this, then you'll find that I also made a DivMod function available to the user. The 'Div' and 'Mod' operators then use it and flush one of the two down the drain. Looking for repeated patterns in core routines is just making the average case slow and should be avoided.
I ran some tests comparing your code to mine. Add, subtract & multiply all take approximately the same time (my code is about 25% slower but I'm happy with that).
Not sure what you measured, but on my laptop I get the following info from speedtest5 for v4.8.5
time elapsed is 0,34 seconds for 100 iterations of divide with 16384 bit integers
time elapsed is 0,37 seconds for 100000 iterations of multiply with 16384 bit integers
time elapsed is 1,14 seconds for 100000 iterations of add with 16384 bit integers
time elapsed is 1,09 seconds for 100000 iterations of subtract with 16384 bit integers
The 1,14 seconds for 100000 iterations translate to ~26.000 clock cycles for a 16384 bit integer add. Doing the same sized addition with my bench for ArbInt I get ~1.700 clock cycles. Figures for sub are comparable - mul is worse.
I did some tests of dynamic arrays vs new/free of pointer to array. It looks to me like dynamic arrays are faster, and they get more faster as the size of the array increases (if that makes sense). The main reason for this is that new of pointer to array requires the memory to be initialised, whereas a dynamic array has its memory zero'd automagically. (There might be a trick to fix this that I'm not aware of). So I don't think my dynamic arrays are a problem.
If you look at my implementation you'll nowhere find any initialisation (zeroing) of the memory I allocate and free via pointers. Not neccessary, as I always know how many limbs in the associated memory for the data is valid. There are the 'Limit' and 'Used' figures - 'Limit' states how many limbs can actually be used via the memory pointer and 'Used' states how many of these actually contribute to the number being represented by the ArbInt. So every time you re-size your dynamic array the implied initialisation of the array contributes negative to the speed.
As you point out, my division routine is my biggest bottleneck. Can you explain in a nutshell why your division runs so much quicker, please?
I looked at your code, and the only thing I could see is a reference to doing each calculation loop on 2 or 3 limbs/words instead just one (as I am doing). Is that the essence of what you are doing?
I'll try
- I always work with full limbs vs your routine working on half limbs <= more than twice the time because you also have twice as many memory reads & writes
- I divide by multiplication with an inverse (per limb step) vs your routine using 'Div' (per half limb step) which is much slower
- I have way less correction steps 1 in 2^63 vs 1 in 2^16 in your code <= every correction induces another full addition of divisor to (running) remainder
- I update the top three limbs of the (running) remainder separate which shortens the update of the rest via costly SubMulMag1 by two limb <= this is especially efficient for very short divisions. However, for an m/n division the main loop in DivModMagBase runs for m-n iterations.
I am considering attempting to steal your code and translate it line-by-line to work with my data structures. But am a bit uneasy about this because of the difficulties I had translating the standard Knuth algorithm from C. Also, would you object if I did this?
Thanks.
Feel free.