Author Topic: Graphic changes  (Read 19708 times)

lagprogramming

  • Sr. Member
  • ****
  • Posts: 406
Re: Graphic changes
« Reply #30 on: November 17, 2014, 11:03:15 am »
@taaz
Yes but,
function TGtk2WidgetSet.LineTo(DC: HDC; X, Y: Integer): Boolean;
http://msdn.microsoft.com/en-us/library/windows/desktop/dd145029%28v=vs.85%29.aspx
or
function TGtk2WidgetSet.MoveToEx(DC: HDC; X, Y: Integer; OldPoint: PPoint): Boolean;
http://msdn.microsoft.com/en-us/library/windows/desktop/dd145069%28v=vs.85%29.aspx
Appears to be the same situation.
Here "bool" was changed to "Boolean". If that has been done for LineTo and MoveToEx, why don't we do the same for LPtoDP and DPtoLP? The size difference might slow down things that are already slow. Fortunately, the function result isn't used much within the LCL, so we can't talk about a drastic speed decrease for this reason.
« Last Edit: November 17, 2014, 11:17:46 am by lagprogramming »

taazz

  • Hero Member
  • *****
  • Posts: 5368
Re: Graphic changes
« Reply #31 on: November 17, 2014, 11:33:27 am »
Quote from: lagprogramming on November 17, 2014, 11:03:15 am
@taaz
Yes but,
function TGtk2WidgetSet.LineTo(DC: HDC; X, Y: Integer): Boolean;
http://msdn.microsoft.com/en-us/library/windows/desktop/dd145029%28v=vs.85%29.aspx
or
function TGtk2WidgetSet.MoveToEx(DC: HDC; X, Y: Integer; OldPoint: PPoint): Boolean;
http://msdn.microsoft.com/en-us/library/windows/desktop/dd145069%28v=vs.85%29.aspx
Appears to be the same situation.
Here "bool" was changed to "Boolean". If that has been done for LineTo and MoveToEx, why don't we do the same for LPtoDP and DPtoLP? The size difference might slow down things that are already slow. Fortunately, the function result isn't used much within the LCL, so we can't talk about a drastic speed decrease for this reason.

Get over the size change; it is not your bottleneck. Even if you define the result as Boolean, the size change will still happen: the prototype is already declared as bool, so the value has to be converted to Boolean to be assigned as the result of the function anyway. In short, the conversion always happens if you assign the result to a Boolean variable. If you use it directly you might avoid the conversion, but I haven't tested that either. I doubt that you will get any meaningful speed improvement, though.
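
A minimal sketch of the conversion being discussed, assuming Free Pascal in objfpc mode. ApiStyleCall and WrappedCall are hypothetical names used only for illustration; they are not LCL routines. LongBool stands in for the 32-bit WinAPI BOOL, and the assignment inside the wrapper is where the narrowing to the 1-byte Boolean takes place.

```pascal
program BoolSizes;
{$mode objfpc}

// Hypothetical stand-in for a WinAPI-style routine returning a 32-bit BOOL.
function ApiStyleCall: LongBool;
begin
  Result := True;
end;

// Hypothetical wrapper exposing a Pascal Boolean, in the style of the
// widget-set LineTo/MoveToEx declarations quoted above.
function WrappedCall: Boolean;
begin
  // Implicit LongBool -> Boolean conversion happens on this assignment.
  Result := ApiStyleCall;
end;

begin
  WriteLn('SizeOf(LongBool) = ', SizeOf(LongBool));
  WriteLn('SizeOf(Boolean)  = ', SizeOf(Boolean));
  if WrappedCall then
    WriteLn('WrappedCall returned True');
end.
```

Whether that single conversion is measurable in a drawing loop is exactly what would need profiling; the sketch only shows where it sits.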

Good judgement is the result of experience … Experience is the result of bad judgement.

OS : Windows 7 64 bit
Laz: Lazarus 1.4.4 FPC 2.6.4 i386-win32-win32/win64

lagprogramming

  • Sr. Member
  • ****
  • Posts: 406
Re: Graphic changes
« Reply #32 on: November 18, 2014, 03:21:49 pm »
   1) For gtk2, I've manually inlined the "LPtoDP" and "DPtoLP" functions where they were called (LineTo and Rectangle are the most important ones, I think). By doing so I've removed some useless CPU instructions, especially when the last parameter of the functions was "1".
   2) Now there is an additional thing I'd like to test: the influence of the "DebugLn" statements. I've seen that many of them are not conditionally compiled ("{$ifdef.....}"). Even when "DebugLn" is not used, there may be a "jmp" instruction to skip over the code. Also, by building a file without "DebugLn" I expect a smaller memory footprint, which might have a surprising effect on the CPU cache and so improve code execution speed. I forget where, but I've also seen local variables declared only to be used together with "DebugLn", and those variables are not conditionally compiled either. If the presence of these "DebugLn" calls drags down code execution speed, I'd like to know whether the "DEBUG" define symbol should be used to conditionally compile them. I'd like to use a single symbol because "DebugLn" is found in many files and I don't want to change the code all over again a second time. If this change affects binary file stripping, do you know how?

taazz

  • Hero Member
  • *****
  • Posts: 5368
Re: Graphic changes
« Reply #33 on: November 18, 2014, 08:46:47 pm »
Quote from: lagprogramming on November 18, 2014, 03:21:49 pm
   1) For gtk2, I've manually inlined the "LPtoDP" and "DPtoLP" functions where they were called (LineTo and Rectangle are the most important ones, I think). By doing so I've removed some useless CPU instructions, especially when the last parameter of the functions was "1".

Let me know what the speed gain is in those places. Is it worth the trouble, or is it something along the lines of 0.0001 seconds every 10 million executions?

Quote from: lagprogramming on November 18, 2014, 03:21:49 pm
   2) Now there is an additional thing I'd like to test: the influence of the "DebugLn" statements. I've seen that many of them are not conditionally compiled ("{$ifdef.....}"). Even when "DebugLn" is not used, there may be a "jmp" instruction to skip over the code. Also, by building a file without "DebugLn" I expect a smaller memory footprint, which might have a surprising effect on the CPU cache and so improve code execution speed. I forget where, but I've also seen local variables declared only to be used together with "DebugLn", and those variables are not conditionally compiled either. If the presence of these "DebugLn" calls drags down code execution speed, I'd like to know whether the "DEBUG" define symbol should be used to conditionally compile them. I'd like to use a single symbol because "DebugLn" is found in many files and I don't want to change the code all over again a second time. If this change affects binary file stripping, do you know how?

There was an enhancement a few years back that allowed DebugLn to be used as needed and replaced with empty functions when it was not needed. I was under the impression that the compiler would remove all empty functions from the final compiled code and eliminate the calls to them. If not, it's something that should be considered for inclusion in the future. Or was it eliminating only empty inlined functions this way? Sorry, I can't remember exactly.

In general I'm against adding debugging functionality this way, but the lack of a good debugger in Lazarus makes it a must-have. In my opinion, debug code should be deleted from any released code before it is shared with the public. As for the DEBUG define: no, it should not be used. The code should use a define specific to the feature being debugged (dbg_KeyboardEvents) or, if the debug output is generic, a define that is used only by the project in question, e.g. LAZ_DEBUG, to avoid enabling it by mistake.
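
A rough sketch of what a project-specific debug define could look like, assuming Free Pascal. DebugLog and SumTo are hypothetical names made up for this example (DebugLog stands in for DebugLn so the program builds without LazUtils); LAZ_DEBUG is the symbol suggested above and would be switched on with -dLAZ_DEBUG on the fpc command line.

```pascal
program DebugDefineDemo;
{$mode objfpc}{$H+}

uses
  SysUtils;

{$IFDEF LAZ_DEBUG}
// Hypothetical stand-in for DebugLn; compiled only when LAZ_DEBUG is defined.
procedure DebugLog(const Msg: string);
begin
  WriteLn(StdErr, Msg);
end;
{$ENDIF}

// Adds the integers 1..N, optionally logging every step.
function SumTo(N: Integer): Integer;
var
  I: Integer;
begin
  Result := 0;
  for I := 1 to N do
  begin
    Result := Result + I;
    {$IFDEF LAZ_DEBUG}
    // Without the define, neither the call nor the string building
    // for its argument is compiled into the binary.
    DebugLog('partial sum after step ' + IntToStr(I) + ' = ' + IntToStr(Result));
    {$ENDIF}
  end;
end;

begin
  WriteLn(SumTo(10));
end.
```

Wrapping each call site is more typing than swapping the logger unit, but it also removes the argument expressions, which an empty replacement procedure does not guarantee.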

howardpc

  • Hero Member
  • *****
  • Posts: 4144
Re: Graphic changes
« Reply #34 on: November 18, 2014, 11:48:52 pm »
Quote from: taazz on November 18, 2014, 08:46:47 pm
There was an enhancement a few years back that allowed DebugLn to be used as needed and replaced with empty functions when it was not needed. I was under the impression that the compiler would remove all empty functions from the final compiled code and eliminate the calls to them. If not, it's something that should be considered for inclusion in the future. Or was it eliminating only empty inlined functions this way? Sorry, I can't remember exactly.

Yes, if you replace "uses lazlogger;" with "uses lazloggerdummy;" all DebugLn() calls are replaced by empty procedures (I presume this works along the lines of the way Assert() can be removed).
Under what optimization settings or other conditions (if indeed there are any) the compiler might then eliminate the empty DebugLn routines, I do not know.

lagprogramming

  • Sr. Member
  • ****
  • Posts: 406
Re: Graphic changes
« Reply #35 on: November 19, 2014, 12:06:29 am »
   Using a recently updated FPC trunk with level 3 optimizations:

1) The Pascal procedure:
Code:
procedure z;
var localvariable: integer;
begin
  for localvariable := 0 to 100 do ;
end;

The assembler code produced:

Code:
.section .text.n_unit1_$$_z
.balign 16,0x90
.globl UNIT1_$$_Z
.type UNIT1_$$_Z,@function
UNIT1_$$_Z:
.Lc1:
.Ll1:
# [unit1.pas]
# [34] begin
nop
# Var localvariable located in register eax
# Var localvariable located in register eax
.Ll2:
# [35] for localvariable:=0 to 100 do;
movl $0,%eax
subl $1,%eax
.balign 8,0x90
.Lj7:
addl $1,%eax
cmpl $100,%eax
jl .Lj7
.Ll3:
# [36] end;
nop
ret
.Lc2:
.Lt1:
.Le0:
.size UNIT1_$$_Z, .Le0 - UNIT1_$$_Z

   Empty loops are not removed, which in certain circumstances might adversely affect CPU register allocation.

2) What follows was a surprise to me.
The idea of this part was to show that even with an empty procedure, as long as we pass function results as parameters to it, assembler code is still produced. For example, "DebugLn(floattostr(sin(x)))" would always compute "floattostr(sin(x))" because, apparently, the compiler doesn't know whether a function result is used or not. In my view, the function should be executed even if its result is not used (except in "if...short circuit...then" situations).
I thought that empty procedures/routines don't produce code and are not called. Looking at the assembler code, it appears I was wrong, at least when function results are passed as parameters.
The Pascal code:
Code:
function simplefunction: string;
begin
  result := 'SimpleFunctionResult';
end;

procedure emptyprocedure(parameter: string);
begin
  // intentionally empty
end;

procedure procedurecall;
begin
  emptyprocedure(simplefunction);
end;

Assembler code follows:

Code:
.section .text.n_unit1_$$_simplefunction$$ansistring
.balign 16,0x90
.globl UNIT1_$$_SIMPLEFUNCTION$$ANSISTRING
.type UNIT1_$$_SIMPLEFUNCTION$$ANSISTRING,@function
UNIT1_$$_SIMPLEFUNCTION$$ANSISTRING:
.Lc3:
.Ll5:
# [40] begin
nop
leaq -8(%rsp),%rsp
.Lc5:
# Var $result located in register rax
# PeepHole Optimization,MovMov2Mov1
movq %rdi,%rax
.Ll6:
# [41] result:='SimpleFunctionResult';
movq $_$UNIT1$_Ld1,%rsi
call fpc_ansistr_assign
.Ll7:
# [42] end;
leaq 8(%rsp),%rsp
nop
ret
.Lc4:
.Lt2:
.Le1:
.size UNIT1_$$_SIMPLEFUNCTION$$ANSISTRING, .Le1 - UNIT1_$$_SIMPLEFUNCTION$$ANSISTRING
.Ll8:

.section .text.n_unit1_$$_emptyprocedure$ansistring
.balign 16,0x90
.globl UNIT1_$$_EMPTYPROCEDURE$ANSISTRING
.type UNIT1_$$_EMPTYPROCEDURE$ANSISTRING,@function
UNIT1_$$_EMPTYPROCEDURE$ANSISTRING:
.Lc6:
.Ll9:
# [45] begin
nop
leaq -8(%rsp),%rsp
.Lc8:
# Var parameter located at rsp+0, size=OS_64
# PeepHole Optimization,MovMov2Mov1
movq %rdi,(%rsp)
call fpc_ansistr_incr_ref
.Ll10:
# [47] end;
movq %rsp,%rdi
call fpc_ansistr_decr_ref
leaq 8(%rsp),%rsp
nop
ret
.Lc7:
.Lt3:
.Le2:
.size UNIT1_$$_EMPTYPROCEDURE$ANSISTRING, .Le2 - UNIT1_$$_EMPTYPROCEDURE$ANSISTRING
.Ll11:

.section .text.n_unit1_$$_procedurecall
.balign 16,0x90
.globl UNIT1_$$_PROCEDURECALL
.type UNIT1_$$_PROCEDURECALL,@function
UNIT1_$$_PROCEDURECALL:
.Lc9:
# Temps allocated between rsp+0 and rsp+104
.Ll12:
# [50] begin
nop
leaq -104(%rsp),%rsp
.Lc11:
.Ll13:
movq $0,96(%rsp)
movq %rsp,%rdx
leaq 24(%rsp),%rsi
movl $1,%edi
call FPC_PUSHEXCEPTADDR
movq %rax,%rdi
call FPC_SETJMP
movq %rax,88(%rsp)
testl %eax,%eax
jne .Lj18
.Ll14:
# [51] emptyprocedure(simplefunction);
leaq 96(%rsp),%rax
movq %rax,%rdi
call UNIT1_$$_SIMPLEFUNCTION$$ANSISTRING
movq 96(%rsp),%rdi
call UNIT1_$$_EMPTYPROCEDURE$ANSISTRING
.Lj18:
.Ll15:
call FPC_POPADDRSTACK
.Ll16:
# [52] end;
leaq 96(%rsp),%rdi
call fpc_ansistr_decr_ref
.Ll17:
movq 88(%rsp),%rax
testq %rax,%rax
je .Lj19
call FPC_RERAISE
.Lj19:
.Ll18:
leaq 104(%rsp),%rsp
nop
ret
.Lc10:
.Lt4:
.Le3:
.size UNIT1_$$_PROCEDURECALL, .Le3 - UNIT1_$$_PROCEDURECALL
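
The same effect can be observed without reading assembler. The following is a small variation of the code above, assuming Free Pascal in objfpc mode; the CallCount counter is an addition made for this sketch and is not part of the original test. Because the function has a visible side effect, the count shows that the argument is evaluated even though the procedure receiving it does nothing.

```pascal
program EmptyCallDemo;
{$mode objfpc}{$H+}

var
  CallCount: Integer = 0;  // added for this sketch: makes the evaluation observable

// Same function as above, plus a side effect that records each call.
function SimpleFunction: string;
begin
  Inc(CallCount);
  Result := 'SimpleFunctionResult';
end;

// Does nothing with its argument, like a dummy DebugLn replacement.
procedure EmptyProcedure(Parameter: string);
begin
  // intentionally empty
end;

begin
  EmptyProcedure(SimpleFunction);
  WriteLn('SimpleFunction ran ', CallCount, ' time(s)');
end.
```

The program reports one call: the argument expression is computed before the empty procedure is entered, which is why empty logger procedures alone do not remove the cost of building debug strings.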

3) Adding up the previous two points, we end up with:
{CODE THAT I CAN'T FIND NOW THAT I NEED IT, ALTHOUGH I'VE SEEN THE SITUATION MORE THAN ONCE}
   It was something like:
Code:
for counter := first to last do  // counter is declared only for this loop
  if condition then
    debugln('text' + function_call_result + 'text.');

lagprogramming

  • Sr. Member
  • ****
  • Posts: 406
Re: Graphic changes
« Reply #36 on: December 01, 2014, 10:44:45 pm »
This is the first time I'm publicly presenting the optional CPU branch removal that is now included.

1/4 Foreword for Pascal programmers who don't know what CPU branches are about

   Many processors are able to execute more than one instruction at a time. This means that two (or more) lines of Pascal code can be computed in parallel (almost at the same time), improving code execution speed. However, once we hit an "if", "case" or loop ("repeat", "for"...) the situation becomes complex. The CPU is not 100% sure which path will be taken; it may stall (wait) or try to guess. Apparently it is better to guess than to wait :). When the processor fails to predict whether the "then" or the "else" block (branch) is the one to execute, it usually pays with a noticeable local decrease in code execution speed. Depending on the combination of CPU model and FPC version (as there is hope), the decrease can be small or huge.
   Anyway, at the moment, branch misses are bottlenecks for new CPU models, no matter the language or compiler used. Older CPU models might not have appropriate optimizations within FPC, so they might hit the speed penalty too. All of these CPUs share this bottleneck: the presence of conditional jumps ("if", "case"... statements).

2/4 Drawbacks of branch removal
   The previously uploaded modified files already had some branch removals.
   I'M NOT SURE THE FOLLOWING EXAMPLE IS THE BEST ONE!!!
   For example:
   "If (sourcewidth=0)and(sourceheight=0)and(destinationwidth=0)and(destinationheight=0) then copyimageoptimized;".
   Can be changed to something like:
   "If ((sourcewidth or sourceheight)=0)and((destinationwidth or destinationheight)=0) then copyimageoptimized;"//TWO BRANCHES
   or
   "If (sourcewidth or sourceheight or destinationwidth or destinationheight)=0 then copyimageoptimized;"//ONE BRANCH
   or
   "{$B+} If (sourcewidth=0)and(sourceheight=0)and(destinationwidth=0)and(destinationheight=0) then copyimageoptimized; {$B-}"//ONE BRANCH
   The problem with the above alternatives is that those "or" instructions are not free. They are operations that take time.
   This means that the situation looks like the following (JUST AN EXAMPLE):
   Original code (with branches):
   "if {condition} then {dothen} else {doelse};"
   will take 10 milliseconds if the CPU guesses the condition;
   will take 20 milliseconds if the CPU misses the condition;

   Modified code (without branches):
   will take 13 milliseconds all the time (there is no "if" statement).

   If you consider that the branch-less code (including the time consumed by additional operations like the "or" in the above example) will statistically perform better than the original, you may activate the option. With the example numbers above, the branch-less code wins on average only when the CPU misses more than 30% of the time (10 + 10p > 13 means p > 0.3). Regarding code execution speed there is no universally safe solution... and there won't be.
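
Before trading the short-circuit test for the OR-ed one, it helps to confirm that both give the same answer. The sketch below assumes Free Pascal in objfpc mode; AllZeroBranchy and AllZeroOred are hypothetical helper names, not routines from the modified LCL files. It says nothing about speed, only about equivalence.

```pascal
program BranchFreeCheck;
{$mode objfpc}

// Short-circuit form: the compiler may emit up to four conditional jumps.
function AllZeroBranchy(SrcW, SrcH, DstW, DstH: Integer): Boolean;
begin
  Result := (SrcW = 0) and (SrcH = 0) and (DstW = 0) and (DstH = 0);
end;

// Bitwise form: OR the four values together and compare once.
// The OR of integers is zero only when every operand is zero.
function AllZeroOred(SrcW, SrcH, DstW, DstH: Integer): Boolean;
begin
  Result := (SrcW or SrcH or DstW or DstH) = 0;
end;

var
  A, B, C, D: Integer;
begin
  // Exhaustively compare both forms over a small range, negatives included.
  for A := -1 to 1 do
    for B := -1 to 1 do
      for C := -1 to 1 do
        for D := -1 to 1 do
          if AllZeroBranchy(A, B, C, D) <> AllZeroOred(A, B, C, D) then
          begin
            WriteLn('Mismatch at ', A, ' ', B, ' ', C, ' ', D);
            Halt(1);
          end;
  WriteLn('Both forms agree on all tested inputs');
end.
```

Which form is faster still has to be measured on the target CPU, since the extra "or" operations are paid on every call.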

3/4 Who might benefit from branch removal
   New CPU models might benefit because they are fast at many things except recovering from branch misses.
   Some old CPU models might benefit too, because they predict poorly and FPC might not have proper optimizations for them. For example, at the moment, I expect the AMD K8 (probably K10 too) series to run Lazarus/FPC built binaries slowly. For these CPUs, apparently, even if the CPU predicts the branch ("then" or "else" code block) correctly, some code might need to be aligned (something that FPC might not do properly at the moment :) ). This means that the mere existence of an "if", "case" or any loop within the Pascal code can become a bottleneck, regardless of branch prediction. Also, apparently, a series of more than three consecutive conditional jumps is highly likely to confuse the branch predictor.
   Regarding software, I've never used CodeTyphon, but this "branch" (distribution) probably has the potential to benefit more than "vanilla" Lazarus because of its easier cross-compiling target.

4/4 How to enable branch removal
   Use "-dBRANCHREMOVAL"(without quotes).
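
For readers unfamiliar with -d: it defines a conditional symbol for the whole compilation, which the source can test with {$IFDEF}. The following is only a guess at the general shape, assuming Free Pascal; how the modified files actually use BRANCHREMOVAL is not shown in this thread, and AllZero is a hypothetical function.

```pascal
program BranchRemovalSwitch;
{$mode objfpc}

// Hypothetical routine with two interchangeable bodies selected at compile time.
function AllZero(A, B: Integer): Boolean;
begin
  {$IFDEF BRANCHREMOVAL}
  Result := (A or B) = 0;          // single comparison
  {$ELSE}
  Result := (A = 0) and (B = 0);   // short-circuit evaluation
  {$ENDIF}
end;

begin
  {$IFDEF BRANCHREMOVAL}
  WriteLn('built with -dBRANCHREMOVAL');
  {$ELSE}
  WriteLn('built without -dBRANCHREMOVAL');
  {$ENDIF}
  WriteLn(AllZero(0, 0), ' ', AllZero(0, 4));
end.
```

Compiling once with "fpc -dBRANCHREMOVAL" and once without gives two binaries that can be timed against each other.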

   THE TEXT PRESENTED ABOVE IS JUST AN INTRODUCTION! IF YOU'RE LOOKING FOR MORE (OR MORE ACCURATE) INFORMATION REGARDING CPU BRANCHES, TRY NOT TO USE THIS FORUM THREAD.

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4474
  • I like bugs.
Re: Graphic changes
« Reply #37 on: December 02, 2014, 07:49:09 am »
Quote from: lagprogramming on December 01, 2014, 10:44:45 pm
   THE TEXT PRESENTED ABOVE IS JUST AN INTRODUCTION! IF YOU'RE LOOKING FOR MORE (OR MORE ACCURATE) INFORMATION REGARDING CPU BRANCHES, TRY NOT TO USE THIS FORUM THREAD.

Huh!
Your writings have less and less connection with reality.
Please use a personal blog somewhere else instead of polluting the Lazarus forum.
Mostly Lazarus trunk and FPC 3.2 on Manjaro Linux 64-bit.
