@praetor, @MarkMLI
So here we go ;-) Attached is what I can donate:
- an implementation of unsigned div and mod for 128-by-64-bit and 64-by-64-bit operands
- a function-based interface, laid out so that it can easily be enhanced with assembler
- one demo implementation for Core i 2xxx (Nehalem)
- a unit test
- a small benchmark utility that measures in clock cycles on x86-64
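
For anyone who wants a feel for what the 128-by-64 routine has to do before opening the attachment, here is a minimal sketch. It is written in C purely for illustration and is not the attached Pascal code: the names `divmod64_t`, `divmod_128_by_64` and `divmod_64_by_64` are made up, and the plain shift-and-subtract loop is an assumption chosen for clarity, not for speed. It takes the dividend as a high and a low 64-bit half and requires `hi < d`, so the quotient fits in 64 bits (the same restriction the x86-64 div instruction has).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result type; the names are illustrative only. */
typedef struct {
    uint64_t quot;
    uint64_t rem;
} divmod64_t;

/* 64-by-64 case: the compiler's native operators are enough. */
static divmod64_t divmod_64_by_64(uint64_t n, uint64_t d)
{
    assert(d != 0);
    return (divmod64_t){ n / d, n % d };
}

/* 128-by-64 case: divide (hi:lo) by d with bitwise shift-and-subtract.
   Precondition: d != 0 and hi < d, so the quotient fits in 64 bits. */
static divmod64_t divmod_128_by_64(uint64_t hi, uint64_t lo, uint64_t d)
{
    assert(d != 0 && hi < d);
    uint64_t rem = hi;   /* running remainder, always < d between steps */
    uint64_t quot = 0;
    for (int i = 63; i >= 0; --i) {
        uint64_t carry = rem >> 63;            /* bit about to be shifted out */
        rem = (rem << 1) | ((lo >> i) & 1);    /* bring down next dividend bit */
        quot <<= 1;
        if (carry || rem >= d) {
            rem -= d;    /* wraps modulo 2^64, which is correct when carry is set */
            quot |= 1;
        }
    }
    return (divmod64_t){ quot, rem };
}
```

Calling `divmod_128_by_64(1, 0, 3)` divides 2^64 by 3. A loop like this runs 64 iterations, so a serious implementation would work on larger digits instead of single bits.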
What you don't get (sorry):
- full signed integer support <= but that should be rather trivial to implement
- an implementation optimised for 32 bit systems <= but I can help generate that too
- testing & benchmarking under Solaris (SPARC) <= I simply don't have access, so you may need to fiddle with the unit test & bench utility
On my Nehalem system the DivMod 128 by 64 takes ~90 clock cycles (pure Pascal), while the asm div instruction takes ~100! Of course this will vary with the core used.
If you have any questions or feedback, please let me know.
Regards,
MathMan