Author Topic: FPC 3.1.1. vs 2.6.4  (Read 26296 times)

marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 11452
  • FPC developer.
Re: FPC 3.1.1. vs 2.6.4
« Reply #15 on: March 05, 2015, 02:14:30 pm »
Do the same benchmark, but comment out the writeln. Maybe it is simply the unicode changes there that are different, not the generated code.

Leledumbo

  • Hero Member
  • *****
  • Posts: 8757
  • Programming + Glam Metal + Tae Kwon Do = Me
Re: FPC 3.1.1. vs 2.6.4
« Reply #16 on: March 05, 2015, 02:19:38 pm »
Quote
In the example the difference is an integer; how do you get this as a single? Maybe you're on Linux.
I retested more than 10 times: always 2.6.4 <= 1450 ms / 3.1.1 >= ~1600 ms.
Try increasing ixmax and iymax.
Yes, I am. FPC has its own timing facility, so why use the Windows one?
Code: [Select]
program mandelbrot;
uses SysUtils, DateUtils;
const
   ixmax = 2500;
   iymax = 2000;
   cxmin = -2.5;
   cxmax =  1.5;
   cymin = -2.0;
   cymax =  2.0;
   maxcolorcomponentvalue = 255;
   maxiteration = 200;
   escaperadius = 2;
 
type
   colortype = record
      red   : byte;
      green : byte;
      blue  : byte;
   end;
 
var
   ix, iy      : integer;
   cx, cy      : real;
   pixelwidth  : real = (cxmax - cxmin) / ixmax;
   pixelheight : real = (cymax - cymin) / iymax;
   filename    : string = 'new1.ppm';
   comment     : string = '# ';
   outfile     : textfile;
   color       : colortype;
   zx, zy      : real;
   zx2, zy2    : real;
   iteration   : integer;
   er2         : real = (escaperadius * escaperadius);
   tm : TDateTime;
begin
   tm := Now;
   {$I-}
   assign(outfile, filename);
   rewrite(outfile);
   if ioresult <> 0 then
   begin
      writeln(stderr, 'unable to open output file: ', filename);
      exit;
   end;
 
   writeln(outfile, 'P6');
   writeln(outfile, ' ', comment);
   writeln(outfile, ' ', ixmax);
   writeln(outfile, ' ', iymax);
   writeln(outfile, ' ', maxcolorcomponentvalue);
 
   for iy := 1 to iymax do
   begin
      cy := cymin + (iy - 1)*pixelheight;
      if abs(cy) < pixelheight / 2 then cy := 0.0;
      for ix := 1 to ixmax do
      begin
         cx := cxmin + (ix - 1)*pixelwidth;
         zx := 0.0;
         zy := 0.0;
         zx2 := zx*zx;
         zy2 := zy*zy;
         iteration := 0;
         while (iteration < maxiteration) and (zx2 + zy2 < er2) do
         begin
            zy := 2*zx*zy + cy;
            zx := zx2 - zy2 + cx;
            zx2 := zx*zx;
            zy2 := zy*zy;
            iteration := iteration + 1;
         end;
         if iteration = maxiteration then
         begin
            color.red   := 0;
            color.green := 0;
            color.blue  := 0;
         end
         else
         begin
            color.red   := 255;
            color.green := 255;
            color.blue  := 255;
         end;
         write(outfile, chr(color.red), chr(color.green), chr(color.blue));
      end;
   end;
 
   close(outfile);
   writeln(MilliSecondSpan(tm,Now):1:8,' ms');
end.

Leledumbo

  • Hero Member
  • *****
  • Posts: 8757
  • Programming + Glam Metal + Tae Kwon Do = Me
Re: FPC 3.1.1. vs 2.6.4
« Reply #17 on: March 05, 2015, 02:25:04 pm »
Quote
Do the same benchmark, but comment out the writeln. Maybe it is simply the unicode changes there that are different, not the generated code.
Either I did it wrong or something else is going on, but without write[ln] the result now becomes:
Code: [Select]
2.6.4 - 3.1.1
1534.00017880 ms - 742.99976695 ms
1517.99996383 ms - 743.99993755 ms
1401.00012068 ms - 717.99990255 ms
1508.99968576 ms - 691.99986756 ms
1513.99991009 ms - 684.99993067 ms
1427.99969763 ms - 710.99996567 ms
1505.00026066 ms - 719.99961510 ms
1391.00030065 ms - 698.00026249 ms
1388.99995945 ms - 730.99960573 ms
1519.00013443 ms - 742.00022500 ms
Now the difference is consistent :)
EDIT: looks like I made some mistakes in the previous benchmark (with write[ln]). The correct result should be:
Code: [Select]
2.6.4 - 3.1.1
1874.00034629 ms - 1562.99946830 ms
1826.00033004 ms - 1597.00023942 ms
2867.00001452 ms - 2565.00004325 ms
1858.00013132 ms - 1613.99999633 ms
3324.99956712 ms - 3702.99993083 ms
1835.00060812 ms - 1531.00029565 ms
2891.99987892 ms - 2737.00046819 ms
1861.00001447 ms - 1552.99964827 ms
2882.00005889 ms - 2677.99983267 ms
1837.99986262 ms - 1587.99996134 ms
The difference is also more consistent.
« Last Edit: March 05, 2015, 02:29:00 pm by Leledumbo »

FPK

  • Moderator
  • Full Member
  • *****
  • Posts: 118
Re: FPC 3.1.1. vs 2.6.4
« Reply #18 on: March 05, 2015, 07:37:04 pm »
This proves exactly the point: Unicode in 3.x is an issue, but we cannot do anything about it.

On modern CPUs, -Cfsse2, -Cfsse3 or -Cfavx also help a lot with 3.x. -Cfsse2 and -Cfsse3 work with 2.6.x too but are less effective.

airpas

  • Full Member
  • ***
  • Posts: 179
Re: FPC 3.1.1. vs 2.6.4
« Reply #19 on: March 05, 2015, 07:58:17 pm »
Quote
This proves exactly the point: Unicode in 3.x is an issue, but we cannot do anything about it.

On modern CPUs, -Cfsse2, -Cfsse3 or -Cfavx also help a lot with 3.x. -Cfsse2 and -Cfsse3 work with 2.6.x too but are less effective.
Thanks for the info. I hope to see a document about optimization strategies for Free Pascal in the future, like the one that exists for C++: http://www.agner.org/optimize/

Nitorami

  • Sr. Member
  • ****
  • Posts: 496
Re: FPC 3.1.1. vs 2.6.4
« Reply #20 on: March 05, 2015, 08:26:11 pm »
@airpas: I read Agner yesterday. I understand approximately 0.1% of it, but it seems obvious that generating the fastest possible code is almost impossible. For instance, unrolling a loop may be a benefit, but if the code no longer fits into the cache, this will prove detrimental. The various CPU optimisations partly seem to battle each other, so the fastest code can only be found by trial and error, and it will be CPU specific.

I'd like to also contribute a few surprising benchmarks. I am NOT using UNICODE, and I do not use writeln other than on program termination to display the result. I am using the sysutils timer.

The timings are from a simple mandelbrot program. I can't be bothered to specify exactly which settings I use for debug / normal mode; suffice it to say that the usual checks (Range, Stack, Overflow) are ON for debug mode and optimisations are OFF. In normal mode it's the other way round: checks off, O1, O2, O3 ON. The values are the average of three runs each. If the runs differ by more than the timer resolution, something is wrong and I fix it first.

"sqr" means we replace Zi*Zi and Zr*Zr in the mandelbrot code with the sqr function: sqr(Zi), sqr(Zr).
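
For readers who want to reproduce the comparison, here is a minimal sketch of the two variants. It is not Nitorami's program (which was not posted); the function names, the test point and the constants are assumptions chosen for illustration. The {$FPUTYPE SSE2} directive is the one named in the results, guarded so it only applies on 32-bit x86 targets.

```pascal
program sqrdemo;
{$mode objfpc}
{$ifdef CPUI386}
  {$FPUTYPE SSE2}  // directive from the post; remove it to compare against x87 code
{$endif}

const
  maxiteration = 200;  // assumed value, borrowed from the program earlier in the thread
  er2 = 4.0;           // escape radius squared

// Variant A: squares written as explicit multiplications.
function IterMul(cr, ci: double): longint;
var
  zr, zi, zr2, zi2: double;
begin
  zr := 0.0; zi := 0.0; zr2 := 0.0; zi2 := 0.0;
  Result := 0;
  while (Result < maxiteration) and (zr2 + zi2 < er2) do
  begin
    zi := 2*zr*zi + ci;
    zr := zr2 - zi2 + cr;
    zr2 := zr*zr;
    zi2 := zi*zi;
    Inc(Result);
  end;
end;

// Variant B: identical loop, but the squares use the built-in sqr function.
function IterSqr(cr, ci: double): longint;
var
  zr, zi, zr2, zi2: double;
begin
  zr := 0.0; zi := 0.0; zr2 := 0.0; zi2 := 0.0;
  Result := 0;
  while (Result < maxiteration) and (zr2 + zi2 < er2) do
  begin
    zi := 2*zr*zi + ci;
    zr := zr2 - zi2 + cr;
    zr2 := sqr(zr);
    zi2 := sqr(zi);
    Inc(Result);
  end;
end;

begin
  // Both variants should report the same iteration count for the same point.
  writeln(IterMul(-0.5, 0.0), ' ', IterSqr(-0.5, 0.0));
end.
```

To get timings comparable to the ones listed, each variant would have to be called over a full grid of points and timed separately; a single call like this only checks that the two forms agree.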

First round
FPC 2.6.4:
normal mode: 1.75 sec
normal mode + {$FPUTYPE SSE2}: 1.25 sec -> BEST result
normal mode + sqr: 1.73 sec
normal mode + {$FPUTYPE SSE2} + sqr: 4.81 sec  -> WORST result by far. sqr seems to be a problem for SSE2.

FPC 3.1.1
normal mode: 3.19 sec -> very bad and in fact slower than 2.6.4 in debug mode !!! But:
debug mode: 1.82 sec -> still a bit slower than 2.6.4 but faster than normal mode !
normal mode + {$FPUTYPE SSE2}: 1.45 sec
normal mode + sqr: 3.21 sec
normal mode + {$FPUTYPE SSE2} + sqr: 1.45 sec  -> surprise, sqr is no problem for SSE2 in 3.1.1

Second round, after doing something else, running a few other programs and coming back:

FPC 2.6.4:
normal mode: 1.59 sec
normal mode + {$FPUTYPE SSE2}: 1.11 sec -> BEST result
normal mode + sqr: 1.56 sec
normal mode + {$FPUTYPE SSE2} + sqr: 4.38 sec  -> WORST result by far

FPC 3.1.1
normal mode: 2.88 sec
debug mode: 1.61 sec
normal mode + {$FPUTYPE SSE2}: 1.25 sec
normal mode + sqr: 2.88 sec
normal mode + {$FPUTYPE SSE2} + sqr: 1.24 sec

So the machine has become faster in the meantime. Well, maybe the virus scanner did something I did not notice, but at least the change for the better is consistent.
Now there are a few surprises, such as SSE2 faring rather poorly with sqr in 2.6.4, and 3.1.1 possibly generating faster code in debug mode than in normal mode... would anyone have expected that?

I should say I find microbenchmarks rather pointless. Measuring the speed of a whole application normally delivers more stable results (@FPK: I cannot submit my whole code of course). Microbenchmarks may yield wildly changing values, which is much less the case for a full application.

But coming back to the thread's topic: again, I find that 3.1.1 is a bit slower than 2.6.4, at least on this machine (AMD Athlon 3500+, single core, win32).


FPK

  • Moderator
  • Full Member
  • *****
  • Posts: 118
Re: FPC 3.1.1. vs 2.6.4
« Reply #21 on: March 05, 2015, 09:03:10 pm »
I guess the differences on the Athlon are simply caused by alignment issues. Older CPUs were very sensitive to this.

Quote
But coming back to the thread's topic: again, I find that 3.1.1 is a bit slower than 2.6.4, at least on this machine (AMD Athlon 3500+, single core, win32).
As said, this is not unlikely for certain applications. Things mentioned above, like the Unicode changes, cause additional bloat, and we can only try to improve this if users tell us exactly what the problem is.

lagprogramming

  • Sr. Member
  • ****
  • Posts: 406
Re: FPC 3.1.1. vs 2.6.4
« Reply #22 on: March 06, 2015, 02:20:10 pm »
   @Nitorami
   I think you should double-check your preliminary conclusions by reading bug report 26089:
http://bugs.freepascal.org/view.php?id=26089
   Many code execution speed measurements are completely useless for the combination of FPC and your CPU series. The funny thing is that you haven't noticed that the very same function/procedure may run slower or faster if it's moved within the Pascal code. For example, take a function and nest it within a calling function. Even if you call the nested function only once (with a loop inside the function), it may run a lot slower (or faster) than when it is not nested. In fact, you may be surprised to see that the same program runs faster or slower after you restart the computer :)). All of this happens with the same FPC version.
   If you don't want to follow Florian's suggestion to restart that bug report, then you have the following choices:
   - insert "nop"s within the assembler code;
   - carefully remove some CPU branches, whose mere existence might decrease speed due to improper code alignment;
   - use another compiler.

   All the best!

Leledumbo

  • Hero Member
  • *****
  • Posts: 8757
  • Programming + Glam Metal + Tae Kwon Do = Me
Re: FPC 3.1.1. vs 2.6.4
« Reply #23 on: March 06, 2015, 06:51:30 pm »
Quote
This proves exactly the point: Unicode in 3.x is an issue, but we cannot do anything about it.
How far has Unicode been integrated into the language? I mean when we just plainly declare a variable of type String (with {$H+}), does it automatically map to unicodestring? What if one just wants to have plain ansistring or shortstring? Will it be faster?

Jonas Maebe

  • Hero Member
  • *****
  • Posts: 1059
Re: FPC 3.1.1. vs 2.6.4
« Reply #24 on: March 06, 2015, 07:24:15 pm »
Quote
I mean when we just plainly declare a variable of type String (with {$H+}), does it automatically map to unicodestring?
Only if {$modeswitch unicodestrings} is active, which by default is only the case in {$mode delphiunicode}.
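
A small sketch of how one could check this on a given compiler, assuming an FPC version that supports the modeswitch (3.x or the 3.1.1 trunk discussed here). The program name and the test string are made up for illustration.

```pascal
program strkind;
{$mode objfpc}{$H+}
{$modeswitch unicodestrings}  // delete this line to get the plain ansistring mapping

var
  s: string;
begin
  s := 'abc';
  // The element size reveals what "string" maps to:
  // a two-byte element means unicodestring, a one-byte element means ansistring.
  writeln('bytes per character: ', SizeOf(s[1]));
end.
```

With the modeswitch active this should report two bytes per character, and one byte once the modeswitch line is removed.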

Quote
What if one just wants to have plain ansistring or shortstring? Will it be faster?
I think the main slowdown comes from the fact that the RTL now is correctly aware of the fact that on Windows the ansi codepage (i.e., the codepage of default ansistrings and shortstrings, aka CP_ACP) is different from the console codepage (the code page that must be used by writeln, aka CP_OEM). As a result, on Windows every single write(ln) of any string is going to result in a code page conversion with new FPC versions, while with old FPC versions that was not the case (which probably also meant that some characters could be output wrongly).
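
To see which code pages are involved on a particular system, something like the following could be used. This is a sketch that assumes the System unit of FPC 3.x, where DefaultSystemCodePage and GetTextCodePage are available; it only reports the values and does not change any conversion behaviour.

```pascal
program cpcheck;
{$mode objfpc}{$H+}

begin
  // Code page assumed for default ansistrings and shortstrings.
  writeln('default system code page: ', DefaultSystemCodePage);
  // Code page the RTL associates with the standard output text file.
  writeln('output text code page:    ', GetTextCodePage(Output));
end.
```

If the two numbers differ, every string written through write(ln) has to be converted from the first code page to the second, which is the extra work described above.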

I don't think console I/O will be (noticeably) slower on non-Windows platforms going from 2.6.4 to 3.1.1, even with a widestring manager installed.

FPK

  • Moderator
  • Full Member
  • *****
  • Posts: 118
Re: FPC 3.1.1. vs 2.6.4
« Reply #25 on: March 06, 2015, 07:26:20 pm »
Quote
This proves exactly the point: Unicode in 3.x is an issue, but we cannot do anything about it.
How far has Unicode been integrated into the language? I mean when we just plainly declare a variable of type String (with {$H+}), does it automatically map to unicodestring? What if one just wants to have plain ansistring or shortstring? Will it be faster?

No, not really. The helper routines contain checks for code pages, the text and file variables now take Unicode chars, etc. This simply makes a program slower.

Leledumbo

  • Hero Member
  • *****
  • Posts: 8757
  • Programming + Glam Metal + Tae Kwon Do = Me
Re: FPC 3.1.1. vs 2.6.4
« Reply #26 on: March 07, 2015, 05:32:06 am »
Quote
I don't think console I/O will be (noticeably) slower on non-Windows platforms going from 2.6.4 to 3.1.1, even with a widestring manager installed.
The test above is on a Linux x64 machine. 3.1.1 is still faster than 2.6.4 even with I/O, but the gap is rather small. When no I/O is used, the gap is big.
Quote
No, not really. The helper routines contain checks for code pages, the text and file variables take now unicode chars etc. This makes a program simply slower.
So there's nothing we can do about it for now? Since the bottleneck is string operations, will buffering the string and reducing I/O calls still help?

FPK

  • Moderator
  • Full Member
  • *****
  • Posts: 118
Re: FPC 3.1.1. vs 2.6.4
« Reply #27 on: March 07, 2015, 10:20:37 am »
Quote
So there's nothing we can do about it for now? Since the bottleneck is string operations, will buffering the string and reducing I/O calls still help?

Buffering also requires string operations. If you want to move the topic forward, use kcachegrind and investigate where the time goes. Maybe there is a real bottleneck.
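
For reference, the kind of buffering asked about could look like the sketch below, which gives a text file a larger buffer through the standard SetTextBuf routine. The file name, buffer size and loop are assumptions for illustration. Whether this actually helps with 3.1.1 is an open question that only profiling can answer, since each Write still goes through the same RTL helper routines.

```pascal
program bufdemo;
{$mode objfpc}

var
  outfile: Text;
  buf: array[0..65535] of Byte;  // 64 KiB buffer; the size is an arbitrary choice
  i: longint;
begin
  Assign(outfile, 'buffered.txt');       // hypothetical output file name
  SetTextBuf(outfile, buf, SizeOf(buf)); // install the buffer before any I/O happens
  Rewrite(outfile);
  // Many small writes now reach the operating system in large chunks.
  for i := 1 to 100000 do
    Write(outfile, Chr(65 + i mod 26));
  Close(outfile);                        // flushes whatever is still buffered
end.
```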

Jonas Maebe

  • Hero Member
  • *****
  • Posts: 1059
Re: FPC 3.1.1. vs 2.6.4
« Reply #28 on: March 07, 2015, 06:26:12 pm »
I think it has nothing to do at all with the speed of writeln, and it's again a case of micro benchmarking gone awry. Here were my initial results with ppc386 on OS X with plain -O2 (with the "time" command, so background activity is eliminated):

Code: [Select]
2.6.4, with writeln
  user 0m5.572s
  sys  0m0.293s

2.6.4, no writeln
  user 0m4.648s
  sys  0m0.005s

3.1.1, with writeln
  user 0m2.243s
  sys  0m0.266s

3.1.1, no writeln
  user 0m2.038s
  sys  0m0.004s

So for me, writeln actually seems to have less overhead in 3.1.1 than in 2.6.4. Then I figured to test the overhead with cwstring included, and it seemed to get really weird:

Code: [Select]
2.6.4, with writeln, with cwstring
  user 0m4.216s
  sys  0m0.269s

2.6.4, no writeln, with cwstring
  user 0m4.020s
  sys  0m0.006s

3.1.1, with writeln, with cwstring
  user 0m1.427s
  sys  0m0.254s

3.1.1, no writeln, with cwstring
  user 0m1.275s
  sys  0m0.004s

Look at that: including cwstring makes the program a lot faster both on 2.6.4 and 3.1.1, even when there's no input/output at all! If you see something like that, you can be virtually certain it's a case of memory alignment.

And indeed: if I move all of the code of the program into a subroutine (so all variables become local variables) and then play with the maximum alignment for local variables (all cases without writeln, with/without cwstring stays the same):

Code: [Select]
3.1.1, -Oalocalmax=4
  user 0m1.233s
  sys  0m0.004s

3.1.1, -Oalocalmax=8 (and anything else > 4, doesn't change generated code compared to 8)
  user 0m3.843s
  sys  0m0.014s

So as soon as the maximum alignment for locals is increased above 4, you get a huge increase in run time here. So it's clear that there's a cache effect playing somewhere, because forcing the alignment to 4 bytes means that several doubles are now only aligned at 4 bytes (which in theory should reduce performance). In the original code, all variables were global variables and hence including another unit will affect their alignment/placement too.
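
The restructuring described here can be sketched as follows. This is not the code that was actually measured; the grid size, step values and names are assumptions, and file output is left out. The point is only that all floating-point state is local, so the -Oalocalmax setting quoted above decides how it is aligned.

```pascal
program mandel_local;
{$mode objfpc}
// Sketch: build once with -O2 -Oalocalmax=4 and once with -O2 -Oalocalmax=8,
// then compare run times with the "time" command.

const
  maxiteration = 200;
  er2 = 4.0;

// Every double in here is a local variable (or parameter), not a global.
function CountIterations(cx, cy: double): longint;
var
  zx, zy, zx2, zy2: double;
begin
  zx := 0.0; zy := 0.0; zx2 := 0.0; zy2 := 0.0;
  Result := 0;
  while (Result < maxiteration) and (zx2 + zy2 < er2) do
  begin
    zy := 2*zx*zy + cy;
    zx := zx2 - zy2 + cx;
    zx2 := zx*zx;
    zy2 := zy*zy;
    Inc(Result);
  end;
end;

// The former main program body, moved into a subroutine.
procedure Run;
var
  ix, iy, total: longint;
begin
  total := 0;
  for iy := 0 to 199 do
    for ix := 0 to 249 do
      total := total + CountIterations(-2.5 + ix*0.016, -2.0 + iy*0.02);
  writeln('total iterations: ', total);
end;

begin
  Run;
end.
```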

Now, all of the above is without -Cfsse2. If I add -Cfsse2, then the speed is the same with 4 and 8 byte alignment. Reason: all values are kept in SSE2 registers, so the stack alignment is irrelevant.

Now, if you add writeln, then the impact will again become potentially bigger because then the SSE values will have to be spilled to the stack and hence cache effects come back into play. Another thing that may be relevant is that if FPC is able to better optimise the register-based code in 3.1.1 than in 2.6.4 (or is better able to put global variables into registers), then logically if values need to be spilled the performance degradation will be relatively larger in 3.1.1 than in 2.6.4, but that could just be because 2.6.4 generated worse code to start with.

And now I've spent way more time on this already than I ever planned to.
« Last Edit: March 07, 2015, 06:27:47 pm by Jonas Maebe »

BeniBela

  • Hero Member
  • *****
  • Posts: 906
    • homepage
Re: FPC 3.1.1. vs 2.6.4
« Reply #29 on: March 07, 2015, 11:30:16 pm »

Quote
I think the main slowdown comes from the fact that the RTL now is correctly aware of the fact that on Windows the ansi codepage (i.e., the codepage of default ansistrings and shortstrings, aka CP_ACP) is different from the console codepage (the code page that must be used by writeln, aka CP_OEM). As a result, on Windows every single write(ln) of any string is going to result in a code page conversion with new FPC versions, while with old FPC versions that was not the case (which probably also meant that some characters could be output wrongly).
Is there an option to change that? E.g., if you want to output UTF-8 when the output is redirected to a file?