Recent

Author Topic: [SOLVED] Tricky profiling issue  (Read 7856 times)

MathMan

  • Sr. Member
  • ****
  • Posts: 325
[SOLVED] Tricky profiling issue
« on: July 27, 2014, 08:39:17 pm »
Hi,

I just started developing a arbitrary precision arithmetic package in Free Pascal. Currently I am using a Win 7/64 environment on a laptop with a core 430M microprocessor. Beside the schoolbook multiplication I have just implemented a Karatsuba multiplication algorithm that is currently providing some headache.

After I initially estimated the break-off Point in the recursion of the Karatsuba algorithm at 64 digits (each digit is a 64bit unsigned integer) I did some testing to drill this down exact. On my laptop it is around 48 digits (measured via a non recursive variant of Karatsuba). So i set the break-off to 50 in the recursive variant and started benchmarking schoolbook vs Karatsuba. With the exception of mults for numbers with 100 or 99 digits everything worked as expected. But in these two cases the run time for Karatsuba goes way-way up.

I have spend considerable time the last two days to get to the bottom of this. I implemented profiling via direct read of the processor time-stamp-counter for all Long arithmetic routines i have implemented so far. Then i dissected the time spend in Karatsuba and found that for the average case the schoolbook multiplications / long adds & subs that are called below the break-off make up 90+% of the total run-time of Karatsuba. But in the case of 100 or 99 digit numbers it falls down to <10%!!! I excluded the getmem and freemem calls as potential time wasters too - and with that there really is nothing left that could hog the function. I also checked the assembler output for break-off 64 and 50 just to make sure the Compiler is not doing something funny - all is looking good as far as I can see.

I am not new to programming but fairly new to Free Pascal (2 weeks) and it looks like I do not have gathered enough profiling / debugging experience so far to drill this down. Any help from more experienced people would be highly appreciated. Posting snippets here would not be of much use I think but I am happy to share all files and whatever you ask for if you contact me via e-mail.

Regards,
Jens
« Last Edit: July 29, 2014, 12:32:28 pm by MathMan »

howardpc

  • Hero Member
  • *****
  • Posts: 4144
Re: Tricky profiling issue
« Reply #1 on: July 27, 2014, 09:17:43 pm »
If you work on a Linux distribution you can use the gprof tool.
The debugging page of the Project Options dialog offers one-click enabling of code generation for gprof.

MathMan

  • Sr. Member
  • ****
  • Posts: 325
Re: Tricky profiling issue
« Reply #2 on: July 27, 2014, 09:25:56 pm »
Ok, yes - I saw that when I looked for profiling Support. Unfortunately, as said, I am working under Win 7/64 and so far have shied away from installing a complete Cygwin etc. to get to use gprof without understanding if this can really provide more info than what I currently can gather.

If nothing different comes up I'll have a look into this.

Regards,
Jens

howardpc

  • Hero Member
  • *****
  • Posts: 4144
Re: Tricky profiling issue
« Reply #3 on: July 27, 2014, 10:34:49 pm »
Sorry, missed that you were on Windows.
You could try Eric Grange's profiler, see here:
http://www.delphitools.info/samplingprofiler/

There is also Darius Blaszyk's profiler, the sources of which are at
http://svn.freepascal.org/svn/fpcprojects/fpprofiler/trunk
« Last Edit: July 27, 2014, 10:43:01 pm by howardpc »

MathMan

  • Sr. Member
  • ****
  • Posts: 325
Re: Tricky profiling issue
« Reply #4 on: July 28, 2014, 10:49:54 am »
Thanks for the links to the profiling software.

Yesterday night I installed Eric Granges sampling profiler. Unfortunately I seem to be missing something here - even though I compiled with debug information included the profiler tells me there is none in my application. In addition it seems to be unable to find the sources even though I activated the "Include applications directory" in the find sources section. In consequence the profiler refuses to work.

"Another round for reading manuals then" I thought but the included help-file will not show any content (with exeption of the chapter headlines) :-( Quick scan on the MS websites revealed that I am using the latest Version of HelpViewer for CHM files. Oh well - I think I am just a tad bit frustrated now.

Not your fault anyway but I am not getting ahead anywhere.

Regards,

Jens

BigChimp

  • Hero Member
  • *****
  • Posts: 5740
  • Add to the wiki - it's free ;)
    • FPCUp, PaperTiger scanning and other open source projects
Re: Tricky profiling issue
« Reply #5 on: July 28, 2014, 11:21:18 am »
"Another round for reading manuals then" I thought but the included help-file will not show any content (with exeption of the chapter headlines) :-( Quick scan on the MS websites revealed that I am using the latest Version of HelpViewer for CHM files. Oh well - I think I am just a tad bit frustrated now.
You could give it a go with the lhelp.exe CHM viewer which is included with Lazarus... or built if you press F1 in code for context-sensitive help...
Want quicker answers to your questions? Read http://wiki.lazarus.freepascal.org/Lazarus_Faq#What_is_the_correct_way_to_ask_questions_in_the_forum.3F

Open source including papertiger OCR/PDF scanning:
https://bitbucket.org/reiniero

Lazarus trunk+FPC trunk x86, Windows x64 unless otherwise specified

hinst

  • Sr. Member
  • ****
  • Posts: 303
Re: Tricky profiling issue
« Reply #6 on: July 28, 2014, 12:41:43 pm »
your post looks confusing to me... I read it and what I see is "When I increatse input size my program runs significantly slower but I can't figure why"; so... what kind of help could you possibly expect? figure it out yourself, it's up to you srsly....

you can try my half-a$$ed profiling library though: http://forum.lazarus.freepascal.org/index.php/topic,25253.0.html not sure if it will get you what you want; don't forget to read the page which explains how to use it btw

PS: it works not like that sampling profiler, so with it you will most likely not run into issues you described
« Last Edit: July 28, 2014, 12:44:13 pm by hinst »
Too late to escape fate

MathMan

  • Sr. Member
  • ****
  • Posts: 325
Re: Tricky profiling issue
« Reply #7 on: July 28, 2014, 02:25:32 pm »
Hi hinst,

your post looks confusing to me... I read it and what I see is "When I increatse input size my program runs significantly slower but I can't figure why"; so... what kind of help could you possibly expect? figure it out yourself, it's up to you srsly....

...

Sorry if I was unclear in my Initial post. The issue is as follows.

For all input lengths the run-time behaves as I expect with the exception of input length 100 and 99. In these cases (when I set the break-off value to 50 or below in the recursive Karatsuba algorithm) the average run time is 13 times lower than expected (and the minimum  run-time still 2.5 times). In all other cases the algorithm behaves completely sane and for other break-off values there seem to be no corresponding artifacts (for example I expected a similar artifact for input length 128 and 127 with a break-off value of 64 - but it isn't there). Hope this explains it better.

Regards,
Jens

Martin_fr

  • Administrator
  • Hero Member
  • *
  • Posts: 9857
  • Debugger - SynEdit - and more
    • wiki
Re: Tricky profiling issue
« Reply #8 on: July 28, 2014, 03:10:13 pm »
I would  encourage you to install a virtual machine with linux, and use valgrind ( module: cachegrind). Then there is a viewer: kcachegrind.

That is the best you can get in terms of analysing where your time goes.

I am not sure, this may even get you info about cpu cache loading time / cache misses.


Depending on the amount of memory used by your app, also consider:
- disk swapping.
  even if bigger sets run fine. In bigger sets, data might be distributed differently.
  (check the process list in windows)
- cpu cache hits/misses
  same issue. Only much harder to notice/measures

If your code accesses data, that is widely spread to many locations in memory,  you may have a lot of cache misses. That can impact performance.


engkin

  • Hero Member
  • *****
  • Posts: 3112
Re: Tricky profiling issue
« Reply #9 on: July 28, 2014, 06:59:59 pm »
Welcome to Lazarus/FPC, MathMan. I hope you don't mind me asking a few quick questions, but I can't help it, I'm curious.

I am using a Win 7/64 environment on a laptop with a core 430M microprocessor.
What speed is your CPU?
What version of Lazarus/FPC are you using?

I implemented profiling via direct read of the processor time-stamp-counter for all Long arithmetic routines i have implemented so far.
What variable type did you use to hold the processor time-stamp-counter values and how did you get these values?

Then i dissected the time spend in Karatsuba and found that for the average case the schoolbook multiplications / long adds & subs that are called below the break-off make up 90+% of the total run-time of Karatsuba. But in the case of 100 or 99 digit numbers it falls down to <10%!!!
What was the total run-time in the case of 100 digit numbers using Karatsuba and schoolbook algorithms?

From your reply to Hinst:
For all input lengths the run-time behaves as I expect with the exception of input length 100 and 99.
This odd behavior is confirmed using the processor time-stamp-counter only, right?
How far did you go with the input length, 128?

MathMan

  • Sr. Member
  • ****
  • Posts: 325
Re: Tricky profiling issue
« Reply #10 on: July 28, 2014, 11:58:11 pm »
Hi Martin,

I would  encourage you to install a virtual machine with linux, and use valgrind ( module: cachegrind). Then there is a viewer: kcachegrind.

That is the best you can get in terms of analysing where your time goes.

I am not sure, this may even get you info about cpu cache loading time / cache misses.

I expected (and somewhat feared) that this would be brought up - and to be honest I do see the point that developing under Linux might be the wiser choice. For me personally the option would either be to install a dual boot or setup another computer for Linux solely as I do not like messing with virtual machines. However I'll give either option serious consideration in the upcoming weeks.

Depending on the amount of memory used by your app, also consider:
- disk swapping.
  even if bigger sets run fine. In bigger sets, data might be distributed differently.
  (check the process list in windows)
- cpu cache hits/misses
  same issue. Only much harder to notice/measures

If your code accesses data, that is widely spread to many locations in memory,  you may have a lot of cache misses. That can impact performance.

There are some new developments - see my response to engkin ...

MathMan

  • Sr. Member
  • ****
  • Posts: 325
Re: Tricky profiling issue
« Reply #11 on: July 29, 2014, 12:26:44 am »
Thanks engkin - looks like an intersting place to be and I'll definitely need some help.

Welcome to Lazarus/FPC, MathMan. I hope you don't mind me asking a few quick questions, but I can't help it, I'm curious.

I am using a Win 7/64 environment on a laptop with a core 430M microprocessor.
What speed is your CPU?
What version of Lazarus/FPC are you using?

I implemented profiling via direct read of the processor time-stamp-counter for all Long arithmetic routines i have implemented so far.
What variable type did you use to hold the processor time-stamp-counter values and how did you get these values?

Then i dissected the time spend in Karatsuba and found that for the average case the schoolbook multiplications / long adds & subs that are called below the break-off make up 90+% of the total run-time of Karatsuba. But in the case of 100 or 99 digit numbers it falls down to <10%!!!
What was the total run-time in the case of 100 digit numbers using Karatsuba and schoolbook algorithms?

From your reply to Hinst:
For all input lengths the run-time behaves as I expect with the exception of input length 100 and 99.
This odd behavior is confirmed using the processor time-stamp-counter only, right?
How far did you go with the input length, 128?

OK, before I answer this (at least as good as i can from memory) let me state that I installed Lazarus 1.2.4 with FPC 2.6.4 and the bug has disappeared! Before I used the Lazarus Version that came with FPC 2.6.0 (1.2.0 iirc). What I am suspecting now - as it would fit to all my observations - is some kind of compiler bug with respect to Register handling as I am heavily mixing Pascal and Assembler.

So here are the answers to your questions

CPU - Intel i5 430M @2.27 GHz

Processor time stamp counter is hold in a QWORD. I retrieve it via inline Assembler like follows

  asm
    rdtsc; // Read time-stamp-counter from core
    shl RDX, 32; // Convert to 64 bit
    or RAX, RDX;
  end ['RAX', 'RDX'];

Total time for Karatsuba roughly 66,000,000,000 - of that the schoolbook mul used roughly 4.000.000.000 / add and sub 250.000.000 each.

All was confirmed by the run-time Counter, correct. But as you can see the total run time in the 100 digit case is 25 seconds (instead of the expected 2.5 to 3 seconds) and that was definitely visible on the PC too.

Send me an e-mail - jens[.dot.]nurmann[.at.]web[.dot.]de - and I'll send you the sources if you like to take a look. But as the issue has vanished it may be of academic interest only. I'd love to have some feedback though if you see something I should do different (but pls. keep in mind that the program is in very early stages)!

Regards,
Jens


engkin

  • Hero Member
  • *****
  • Posts: 3112
Re: [SOLVED] Tricky profiling issue
« Reply #12 on: July 30, 2014, 05:53:24 am »
Thanks for the answers, Jens.

Quote
I installed Lazarus 1.2.4 with FPC 2.6.4 and the bug has disappeared!
It seems that you managed to get rid of the problem. Well done!

Quote
I'd love to have some feedback though if you see something I should do different
Comparing myself with others, I have bad programming habits and my basic assembly coding is limited to i386 (32bit instructions). Unless you have some reason, and it seems you do, I would suggest that you share your code here. Everyone who answered you here is better than me, and there are others as well.

 

TinyPortal © 2005-2018