That's a nice idea Phil - thanks.
Admittedly I didn't think about writing a 1000x1000 matrix multiply function, although what you describe sounds like calculating a determinant (whose naïve implementation runs in O(n³), with the theoretical limit at O(n²)). Doing a dot product between two 1000-component vectors, by contrast, is a fairly trivial operation that I don't think you can do any faster than O(n), although FMA operations will help a lot here. Granted, manipulating such matrices can be useful in some linear programming problems. It's something I'll keep on the list though!
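To put the O(n) point in concrete terms, here is a minimal sketch in C (used purely for illustration, not one of the actual routines). The function name is made up, and whether the multiply-add in the loop becomes a single FMA instruction is an assumption that depends on the compiler, its flags and the target CPU.

```c
#include <assert.h>
#include <stddef.h>

/* Dot product of two n-component vectors.
   Every element is visited exactly once, so the running time is O(n). */
double dot_product(const double *a, const double *b, size_t n)
{
    assert(a != NULL && b != NULL);

    double acc = 0.0;
    for (size_t i = 0; i < n; ++i) {
        /* One multiply and one add per element. A compiler that targets
           FMA-capable hardware may fuse these into one instruction. */
        acc += a[i] * b[i];
    }
    return acc;
}
```

For two 1000-component vectors that is 1000 multiply-adds, so any speed-up comes from doing several of them per instruction rather than from a cleverer algorithm.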
A while ago I wrote a bunch of vector routines using everything from Pascal to raw 386 opcodes (with x87 floating-point operations), then SSE, AVX and FMA, selecting the best one that the hardware supports. I also did a lot of benchmarking during development to ensure each version actually IS faster than the one for the previous instruction set! Unfortunately, the only usable computer I have currently is an old laptop that only goes up to AVX, i.e. no FMA, AVX2 or BMI2.
Of course, the drawback to having versions for all the different instruction sets, with the best one selected at run-time, is a larger binary size and some overhead on initialisation. A lot of it can be cut out on x86-64, though, because 64-bit Windows will not install if the processor doesn't at least support SSE2.
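The run-time selection described above could be sketched roughly like this in C. All the names (`dot_baseline`, `dot_wide`, `vec_init`, `vec_dot`) are hypothetical, and `dot_wide` is only a plain-C stand-in for where a hand-written AVX routine would go. The CPU check assumes GCC or Clang on x86 and uses their `__builtin_cpu_supports` built-in; on anything else the sketch simply falls back to the baseline.

```c
#include <stddef.h>

/* Hypothetical example: every name here is made up for illustration. */
typedef float (*dot_fn)(const float *a, const float *b, size_t n);

/* Baseline version that runs on any CPU. */
static float dot_baseline(const float *a, const float *b, size_t n)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; ++i)
        acc += a[i] * b[i];
    return acc;
}

/* Stand-in for a hand-tuned AVX routine: plain C with four partial sums,
   so the sketch stays portable. A real version would use AVX code here. */
static float dot_wide(const float *a, const float *b, size_t n)
{
    float acc[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    size_t i = 0;
    for (; i + 4 <= n; i += 4)
        for (size_t k = 0; k < 4; ++k)
            acc[k] += a[i + k] * b[i + k];
    for (; i < n; ++i)          /* leftover elements */
        acc[0] += a[i] * b[i];
    return (acc[0] + acc[1]) + (acc[2] + acc[3]);
}

/* Returns non-zero if the CPU reports AVX (GCC/Clang on x86 only). */
static int cpu_has_avx(void)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    return __builtin_cpu_supports("avx") != 0;
#else
    return 0;                   /* unknown platform: assume no AVX */
#endif
}

static dot_fn dot_impl = NULL;

/* The one-off initialisation cost: pick the implementation once. */
void vec_init(void)
{
    dot_impl = cpu_has_avx() ? dot_wide : dot_baseline;
}

/* Public entry point: every call goes through the chosen pointer. */
float vec_dot(const float *a, const float *b, size_t n)
{
    if (dot_impl == NULL)
        vec_init();
    return dot_impl(a, b, n);
}
```

Both costs show up here: each extra instruction set adds another function to the binary, and the detection has to run once before the first call.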
Anyway, enough rambling!
So far I have a number of vector and matrix routines to port over that are designed for game engines, so hopefully those will be welcome, plus routines for modular arithmetic that are mostly for my own use but might see some use elsewhere, who knows!
To hrayon: what I'm currently building is less about big integer operations and more about parallel computation on lots of small integers simultaneously, so the two complement rather than compete with each other. Still, I'll be taking a look!
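As a rough idea of what "lots of small integers simultaneously" can look like, here is a portable C sketch that adds eight 8-bit values packed into one 64-bit word. This is the classic SWAR trick rather than SSE/AVX code, and it is not taken from the library itself; the function name `add_u8x8` is made up for the example.

```c
#include <stdint.h>

/* Adds eight unsigned 8-bit lanes packed into 64-bit words.
   Each lane wraps around modulo 256 and never carries into its neighbour. */
uint64_t add_u8x8(uint64_t x, uint64_t y)
{
    const uint64_t high = 0x8080808080808080ULL;  /* top bit of every lane */

    /* Add the low 7 bits of each lane; any carry lands in that lane's
       own top bit, so it cannot spill into the next lane. */
    uint64_t low = (x & ~high) + (y & ~high);

    /* Fold the original top bits back in with XOR (addition without carry). */
    return low ^ ((x ^ y) & high);
}
```

A real SIMD instruction does the same job on wider registers in one step, but the principle of many independent small additions per operation is the same.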
Regarding big integers, I mostly use GMP. Unfortunately, it's written in C (and assembler), so I have to communicate with it through a DLL. Having this ported to Pascal would be wonderful though, because I can see some speed increases due to less overhead. That said, I haven't managed to get GMP to compile on 64-bit Windows yet, only 32-bit.