Thanks engkin - looks like an intersting place to be and I'll definitely need some help.
Welcome to Lazarus/FPC, MathMan. I hope you don't mind me asking a few quick questions, but I can't help it, I'm curious.
I am using a Win 7/64 environment on a laptop with a core 430M microprocessor.
What speed is your CPU?
What version of Lazarus/FPC are you using?
I implemented profiling via direct read of the processor time-stamp-counter for all Long arithmetic routines i have implemented so far.
What variable type did you use to hold the processor time-stamp-counter values and how did you get these values?
Then i dissected the time spend in Karatsuba and found that for the average case the schoolbook multiplications / long adds & subs that are called below the break-off make up 90+% of the total run-time of Karatsuba. But in the case of 100 or 99 digit numbers it falls down to <10%!!!
What was the total run-time in the case of 100 digit numbers using Karatsuba and schoolbook algorithms?
From your reply to Hinst:
For all input lengths the run-time behaves as I expect with the exception of input length 100 and 99.
This odd behavior is confirmed using the processor time-stamp-counter only, right?
How far did you go with the input length, 128?
OK, before I answer this (at least as good as i can from memory) let me state that I installed Lazarus 1.2.4 with FPC 2.6.4 and the bug has disappeared! Before I used the Lazarus Version that came with FPC 2.6.0 (1.2.0 iirc). What I am suspecting now - as it would fit to all my observations - is some kind of compiler bug with respect to Register handling as I am heavily mixing Pascal and Assembler.
So here are the answers to your questions
CPU - Intel i5 430M @2.27 GHz
Processor time stamp counter is hold in a QWORD. I retrieve it via inline Assembler like follows
asm
rdtsc; // Read time-stamp-counter from core
shl RDX, 32; // Convert to 64 bit
or RAX, RDX;
end ['RAX', 'RDX'];
Total time for Karatsuba roughly 66,000,000,000 - of that the schoolbook mul used roughly 4.000.000.000 / add and sub 250.000.000 each.
All was confirmed by the run-time Counter, correct. But as you can see the total run time in the 100 digit case is 25 seconds (instead of the expected 2.5 to 3 seconds) and that was definitely visible on the PC too.
Send me an e-mail - jens[.dot.]nurmann[.at.]web[.dot.]de - and I'll send you the sources if you like to take a look. But as the issue has vanished it may be of academic interest only. I'd love to have some feedback though if you see something I should do different (but pls. keep in mind that the program is in very early stages)!
Regards,
Jens