This is the first time I publicly present the included optional CPU branch removal.
1/4 Foreword for Pascal programmers who don't know what CPU branches are about
Many processors are able to work on more than one instruction at a time. This means that two (or more) lines of Pascal code can be computed in parallel (almost at the same time), improving code execution speed. However, once we hit an "if", a "case" or a loop ("repeat", "for"...), the situation becomes complex. The CPU cannot be 100% sure which code comes next, so it may stall (wait) or try to guess. Apparently it is better to try guessing than waiting. When the processor fails to predict whether the code in the "then" or the "else" block (the branches) is the one to run, it usually pays with a noticeable (local) decrease of code execution speed. Depending on the combination of CPU model and fpc version (as there is hope), the decrease can be small or huge.
Anyway, at the moment, branch misses are bottlenecks for new CPU models, no matter the language or compiler used. Older CPU models might not have appropriate optimizations within fpc, so they might hit the speed penalty too. All of these CPUs share this bottleneck: the presence of conditional jumps ("if", "case"... statements).
2/4 Drawbacks of branch removal
Previously uploaded modified code (files) already had some branch removals.
I'M NOT SURE THE FOLLOWING EXAMPLE IS THE BEST ONE!!!
For example:
"If (sourcewidth=0)and(sourceheight=0)and(destinationwidth=0)and(destinationheight=0) then copyimageoptimized;".
Can be changed to something like:
"If ((sourcewidth or sourceheight)=0)and((destinationwidth or destinationheight)=0) then copyimageoptimized;"//TWO BRANCHES
or
"If (sourcewidth or sourceheight or destinationwidth or destinationheight)=0 then copyimageoptimized;"//ONE BRANCH
or
"{$B+} If (sourcewidth=0)and(sourceheight=0)and(destinationwidth=0)and(destinationheight=0) then copyimageoptimized; {$B-}"//ONE BRANCH
The problem with the above alternatives is that those "or" instructions are not free. They are operations that take time.
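Before weighing that cost, the alternatives can at least be checked against each other. The following is a minimal sketch, not code from the uploaded files: the function names are hypothetical, and LongInt is an assumed type for the four dimensions. It only confirms that the rewritten conditions give the same answer as the original one.

```pascal
program AllZeroVariants;
{$mode objfpc}

// All function names here are hypothetical; the post only shows the "if" lines.

// Original form: four comparisons joined with short-circuit "and".
function AllZeroShortCircuit(sw, sh, dw, dh: LongInt): Boolean;
begin
  Result := (sw = 0) and (sh = 0) and (dw = 0) and (dh = 0);
end;

// Two comparisons: a bitwise "or" is zero only if both operands are zero.
function AllZeroTwoTests(sw, sh, dw, dh: LongInt): Boolean;
begin
  Result := ((sw or sh) = 0) and ((dw or dh) = 0);
end;

// One comparison on the bitwise "or" of all four values.
function AllZeroOneTest(sw, sh, dw, dh: LongInt): Boolean;
begin
  Result := (sw or sh or dw or dh) = 0;
end;

// Same text as the original, compiled with complete boolean evaluation.
{$B+}
function AllZeroCompleteEval(sw, sh, dw, dh: LongInt): Boolean;
begin
  Result := (sw = 0) and (sh = 0) and (dw = 0) and (dh = 0);
end;
{$B-}

var
  a, b, c, d: LongInt;
  Expected: Boolean;
  Mismatches: LongInt;
begin
  Mismatches := 0;
  // Try every combination of -1, 0 and 1 for the four values.
  for a := -1 to 1 do
    for b := -1 to 1 do
      for c := -1 to 1 do
        for d := -1 to 1 do
        begin
          Expected := AllZeroShortCircuit(a, b, c, d);
          if (AllZeroTwoTests(a, b, c, d) <> Expected) or
             (AllZeroOneTest(a, b, c, d) <> Expected) or
             (AllZeroCompleteEval(a, b, c, d) <> Expected) then
            Inc(Mismatches);
        end;
  WriteLn('mismatches: ', Mismatches);
end.
```

The check says nothing about speed, and how many conditional jumps each variant really produces depends on the fpc version and optimization level. Each variant still pays for its extra operations.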
This means that the situation is like the following text(JUST AS EXAMPLE):
Original code(with branches):
"if {condition} then {dothen} else {doelse};"
will take 10 milliseconds if the CPU guesses the condition correctly;
will take 20 milliseconds if the CPU guesses the condition wrong;
Modified code(without branches):
will take 13 milliseconds all the time(there is no "if" statement).
Comparing the original and the modified code: if you think the branch-less code (including the time consumed by additional operations like the "or" in the above example) will statistically perform better, you may activate the option. Regarding code execution speed there is no universally safe solution... and there won't be.
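Since the numbers above are only an example, the honest way to decide is to measure on your own CPU. Here is a rough timing sketch under stated assumptions: the counting task, the function names and the constants are all made up for illustration, and the compiler is free to turn either loop into something else (so the "branchless" version is not guaranteed to be free of conditional jumps). GetTickCount64 comes from the SysUtils unit.

```pascal
program BranchTiming;
{$mode objfpc}

uses
  SysUtils; // GetTickCount64

const
  N = 1000000;     // number of random bytes (arbitrary choice)
  Rounds = 100;    // repetitions, to get a measurable time
  Threshold = 128; // half of the random bytes are expected to pass

// Hypothetical example task: count values at or above Threshold using "if".
function CountWithBranch(const Values: array of Byte): Int64;
var
  i: LongInt;
begin
  Result := 0;
  for i := 0 to High(Values) do
    if Values[i] >= Threshold then
      Inc(Result);
end;

// Same task without "if": Ord(False) = 0 and Ord(True) = 1.
function CountBranchless(const Values: array of Byte): Int64;
var
  i: LongInt;
begin
  Result := 0;
  for i := 0 to High(Values) do
    Inc(Result, Ord(Values[i] >= Threshold));
end;

var
  Data: array of Byte;
  i, r: LongInt;
  T0: QWord;
  A, B: Int64;
begin
  RandSeed := 12345; // fixed seed so runs are comparable
  SetLength(Data, N);
  for i := 0 to High(Data) do
    Data[i] := Random(256); // random data is hard to predict

  A := 0;
  T0 := GetTickCount64;
  for r := 1 to Rounds do
    A := A + CountWithBranch(Data);
  WriteLn('with branch: ', GetTickCount64 - T0, ' ms');

  B := 0;
  T0 := GetTickCount64;
  for r := 1 to Rounds do
    B := B + CountBranchless(Data);
  WriteLn('branchless:  ', GetTickCount64 - T0, ' ms');

  // Both versions must agree, otherwise the comparison is meaningless.
  if A <> B then
    Halt(1);
end.
```

Results will differ between CPU models, fpc versions and optimization settings, so treat any single measurement with care.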
3/4 Who might benefit from branch removal
New CPU models might benefit because they are fast at many things except recovering from branch misses.
Some old CPU models might benefit too, because they predict poorly and fpc might not have proper optimizations for them. For example, at the moment, I expect the AMD K8 (probably K10 too) series to run Lazarus/fpc built binaries slowly. For these CPUs, apparently, regarding branches ("then" or "else" code blocks), even if the CPU predicts the branch correctly, some code might need to be aligned (something that fpc might not do properly at the moment). This means that the simple existence of an "if", a "case" or any loop within the Pascal code can become a bottleneck, regardless of branch prediction. Also, apparently a series of more than three consecutive conditional jumps is highly likely to confuse the branch predictor.
Regarding software, I've never used CodeTyphon, but this "branch" (distribution) probably has the potential to benefit more than "vanilla" Lazarus because it makes cross-compiling to other targets easier.
4/4 How to enable branch removal
Use "-dBRANCHREMOVAL"(without quotes).
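For readers who have not used compiler defines before: "-d" tells fpc to define a symbol, and the code can test that symbol with {$IFDEF}. The sketch below only shows the general mechanism. The function name is hypothetical and the way the uploaded files actually use the symbol is an assumption; only the name BRANCHREMOVAL comes from this post.

```pascal
program BranchRemovalSwitch;
{$mode objfpc}

// Hypothetical function: picks one of two equivalent checks at compile time.
function IsEmptyCopy(sw, sh, dw, dh: LongInt): Boolean;
begin
{$IFDEF BRANCHREMOVAL}
  // Compiled only when the symbol is defined (fpc -dBRANCHREMOVAL ...).
  Result := (sw or sh or dw or dh) = 0;
{$ELSE}
  // Default build: the original short-circuit comparisons.
  Result := (sw = 0) and (sh = 0) and (dw = 0) and (dh = 0);
{$ENDIF}
end;

begin
{$IFDEF BRANCHREMOVAL}
  WriteLn('branch removal: on');
{$ELSE}
  WriteLn('branch removal: off');
{$ENDIF}
  WriteLn(IsEmptyCopy(0, 0, 0, 0));
end.
```

Compiling this file once with "fpc -dBRANCHREMOVAL" and once without should show which variant ended up in the binary.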
THE TEXT PRESENTED ABOVE IS JUST AN INTRODUCTION! IF YOU'RE LOOKING FOR MORE (OR MORE ACCURATE) INFORMATION REGARDING CPU BRANCHES, DO NOT RELY ON THIS FORUM THREAD.