Recent

Author Topic: [SOLVED] Pascal performance on polynomial benchmark slower than expected  (Read 18120 times)

botster

  • New Member
  • *
  • Posts: 18
Hi All,

Being a Linux user, I have been working with Gambas (BASIC) which employs a visual GUI designer as Lazarus does. It is a nice app, but it does not do cross-platform compiling. So, I have recently took another look at Lazarus.

I became curious how a Free Pascal compiled program would perform against a Gambas 'just in time' compiled program. I thought it would blow Gambas out of the water, so to speak. It did not.

Please understand that I am not trying to be a troll. I just think there is something in my system that makes an FPC compiled program not perform as it should. And, I am hoping someone might be able to help me find what that is.

Here's what I did. I took the Gambas polynomial benchmark program from http://gambaswiki.org/wiki/doc/benchmark/polynom and converted it to Pascal, compiled it with `fpc polynom.pas`, and then timed its execution with the Linux 'time' command (`time ./polynom`).

The time information of the output was:
real    0m38.576s
user    0m20.965s
sys     0m0.063s

For the Gambas program, executed with `time gbs3 -f -c polynom.gambas`, the time info was:
real    0m17.093s
user    0m9.604s
sys     0m0.066s

Over twice as fast as the pre-compiled Pascal program. The '-f' option invokes the Just-In-Time compiler, and the '-c' option ignores the compile cache to force a compile. So, the time for the Gambas program includes compile time.

Now, what makes me think there is something wrong on my system is that another user on the Gambas user list (using FPC 2.6.2-8 [2014/01/22] for x86_64 -- older than mine) reported times of 5.376s and 4.172s for Pascal and Gambas, respectively -- showing Pascal to be only marginally slower; not two times slower.

Yes, I have a slow system:
Intel(R) Pentium(R) 4 CPU 2.40GHz, 1G RAM
Mageia 3 (Linux), Kernel 3.10.54, KDE4 Desktop
Free Pascal Compiler version 2.6.4 [2014/03/07] for i386
Gambas 3.5.4

Here's the Pascal program:
Code: [Select]
program Polynom;
{$mode objfpc}

var
  z : integer;

function DoIt(x : double) : double;
  var Mu : double = 10.0;
  var Pu, Su : double;
  var I, J, N : integer;
  var aPoly : array [0..99] of double;

begin
  N := 500000;
  Pu := 0;

  For I := 0 To N-1 do
  begin
    For J := 0 To 99 do
    begin
      Mu :=  (Mu + 2.0) / 2.0;
      aPoly[J] := Mu;
    end;
    Su := 0.0;
    For J := 0 To 99 do
    begin
      Su := X * Su + aPoly[J];
    end;
    Pu := Pu + Su;
  end;

  DoIt := Pu;
end;

Begin

For z := 1 To 10 do
begin
writeln( DoIt(0.2) );
end;

End.

I could not attach my "fpc.cfg" file. So I've included it here:
Code: [Select]
#
# Config file generated by fpcmkcfg on 27-9-14 - 03:58:23
# Example fpc.cfg for Free Pascal Compiler
#

# ----------------------
# Defines (preprocessor)
# ----------------------

#
# nested #IFNDEF, #IFDEF, #ENDIF, #ELSE, #DEFINE, #UNDEF are allowed
#
# -d is the same as #DEFINE
# -u is the same as #UNDEF
#

#
# Some examples (for switches see below, and the -? helppages)
#
# Try compiling with the -dRELEASE or -dDEBUG on the commandline
#

# For a release compile with optimizes and strip debuginfo
#IFDEF RELEASE
  -O2
  -Xs
  #WRITE Compiling Release Version
#ENDIF

# For a debug version compile with debuginfo and all codegeneration checks on
#IFDEF DEBUG
  -gl
  -Crtoi
  #WRITE Compiling Debug Version
#ENDIF

# assembling
#ifdef darwin
# use pipes instead of temporary files for assembling
-ap
# path to Xcode 4.3+ utilities (no problem if it doesn't exist)
-FD/Applications/Xcode.app/Contents/Developer/usr/bin
#endif

# ----------------
# Parsing switches
# ----------------

# Pascal language mode
#      -Mfpc      free pascal dialect (default)
#      -Mobjfpc   switch some Delphi 2 extensions on
#      -Mdelphi   tries to be Delphi compatible
#      -Mtp       tries to be TP/BP 7.0 compatible
#      -Mgpc      tries to be gpc compatible
#      -Mmacpas   tries to be compatible to the macintosh pascal dialects
#
# Turn on Object Pascal extensions by default
#-Mobjfpc

# Assembler reader mode
#      -Rdefault  use default assembler
#      -Ratt      read AT&T style assembler
#      -Rintel    read Intel style assembler
#
# All assembler blocks are AT&T styled by default
#-Ratt

# Semantic checking
#      -S2        same as -Mobjfpc
#      -Sc        supports operators like C (*=,+=,/= and -=)
#      -Sa        include assertion code.
#      -Sd        same as -Mdelphi
#      -Se<x>     error options. <x> is a combination of the following:
#         <n> : compiler stops after <n> errors (default is 1)
#         w   : compiler stops also after warnings
#         n   : compiler stops also after notes
#         h   : compiler stops also after hints
#      -Sg        allow LABEL and GOTO
#      -Sh        Use ansistrings
#      -Si        support C++ styled INLINE
#      -Sk        load fpcylix unit
#      -SI<x>     set interface style to <x>
#         -SIcom    COM compatible interface (default)
#         -SIcorba  CORBA compatible interface
#      -Sm        support macros like C (global)
#      -So        same as -Mtp
#      -Sp        same as -Mgpc
#      -Ss        constructor name must be init (destructor must be done)
#      -Sx        enable exception keywords (default in Delphi/ObjFPC modes)
#
# Allow goto, inline, C-operators, C-vars
-Sgic

# ---------------
# Code generation
# ---------------

# Uncomment the next line if you always want static/dynamic units by default
# (can be overruled with -CD, -CS at the commandline)
#-CS
#-CD

# Set the default heapsize to 8Mb
#-Ch8000000

# Set default codegeneration checks (iocheck, overflow, range, stack)
#-Ci
#-Co
#-Cr
#-Ct

# Optimizer switches
# -Os        generate smaller code
# -Oa=N      set alignment to N
# -O1        level 1 optimizations (quick optimizations, debuggable)
# -O2        level 2 optimizations (-O1 + optimizations which make debugging more difficult)
# -O3        level 3 optimizations (-O2 + optimizations which also may make the program slower rather than faster)
# -Oo<x>     switch on optimalization x. See fpc -i for possible values
# -OoNO<x>   switch off optimalization x. See fpc -i for possible values
# -Op<x>     set target cpu for optimizing, see fpc -i for possible values

#ifdef darwin
#ifdef cpui386
-Cppentiumm
-Oppentiumm
#endif
#endif

# -----------------------
# Set Filenames and Paths
# -----------------------

# Both slashes and backslashes are allowed in paths

# path to the messagefile, not necessary anymore but can be used to override
# the default language
#-Fr/usr/lib/fpc/$fpcversion/msg/errore.msg
#-Fr/usr/lib/fpc/$fpcversion/msg/errorn.msg
#-Fr/usr/lib/fpc/$fpcversion/msg/errores.msg
#-Fr/usr/lib/fpc/$fpcversion/msg/errord.msg
#-Fr/usr/lib/fpc/$fpcversion/msg/errorr.msg

# searchpath for units and other system dependent things
-Fu/usr/lib/fpc/$fpcversion/units/$fpctarget
-Fu/usr/lib/fpc/$fpcversion/units/$fpctarget/*
-Fu/usr/lib/fpc/$fpcversion/units/$fpctarget/rtl

#IFDEF FPCAPACHE_1_3
-Fu/usr/lib/fpc/$fpcversion/units/$fpctarget/httpd13/
#ELSE
#IFDEF FPCAPACHE_2_0
-Fu/usr/lib/fpc/$fpcversion/units/$fpctarget/httpd20
#ELSE
-Fu/usr/lib/fpc/$fpcversion/units/$fpctarget/httpd22
#ENDIF
#ENDIF

# searchpath for fppkg user-specific packages
-Fu~/.fppkg/lib/fpc/$fpcversion/units/$FPCTARGET/*

# path to the gcclib
#ifdef cpui386
-Fl/usr/lib/gcc/i586-mageia-linux-gnu/4.7.2
#endif
#ifdef cpux86_64
-Fl/usr/lib/gcc/i586-mageia-linux-gnu/4.7.2
#endif

# searchpath for libraries
#-Fl/usr/lib/fpc/$fpcversion/lib
#-Fl/lib;/usr/lib
-Fl/usr/lib/fpc/$fpcversion/lib/$FPCTARGET

# searchpath for tools
-FD/usr/lib/fpc/$fpcversion/bin/$FPCTARGET

#IFNDEF CPUI386
#IFNDEF CPUAMD64
#DEFINE NEEDCROSSBINUTILS
#ENDIF
#ENDIF

#IFNDEF Linux
#DEFINE NEEDCROSSBINUTILS
#ENDIF

# binutils prefix for cross compiling
#IFDEF FPC_CROSSCOMPILING
#IFDEF NEEDCROSSBINUTILS
  -XP$FPCTARGET-
#ENDIF
#ENDIF


# -------------
# Linking
# -------------

# generate always debugging information for GDB (slows down the compiling
# process)
#      -gc        generate checks for pointers
#      -gd        use dbx
#      -gg        use gsym
#      -gh        use heap trace unit (for memory leak debugging)
#      -gl        use line info unit to show more info for backtraces
#      -gv        generates programs tracable with valgrind
#      -gw        generate dwarf debugging info
#
# Enable debuginfo and use the line info unit by default
#-gl

# always pass an option to the linker
#-k-s

# Always strip debuginfo from the executable
-Xs


# -------------
# Miscellaneous
# -------------

# Write always a nice FPC logo ;)
-l

# Verbosity
#      e : Show errors (default)       d : Show debug info
#      w : Show warnings               u : Show unit info
#      n : Show notes                  t : Show tried/used files
#      h : Show hints                  s : Show time stamps
#      i : Show general info           q : Show message numbers
#      l : Show linenumbers            c : Show conditionals
#      a : Show everything             0 : Show nothing (except errors)
#      b : Write file names messages   r : Rhide/GCC compatibility mode
#          with full path              x : Executable info (Win32 only)
#      v : write fpcdebug.txt with     p : Write tree.log with parse tree
#          lots of debugging info
#
# Display Info, Warnings and Notes
-viwn
# If you don't want so much verbosity use
#-vw

I have searched both the web and this forum for optimizations related to floating point numbers, but came up with nothing useful.

Thank you for any clues to guide me.

Lee
« Last Edit: October 16, 2014, 12:17:02 am by botster »
Lee

Blaazen

  • Hero Member
  • *****
  • Posts: 2782
  • POKE 54296,15
    • Eye-Candy Controls
Re: Pascal performance on polynomial benchmark slower than expected
« Reply #1 on: October 13, 2014, 07:39:09 pm »
Try to compile with higher level of optimizations:
Code: [Select]
fpc polynom.pas -O3
Lazarus 2.1.0 r59757M FPC 3.3.1 r40507 x86_64-linux-qt Chakra, Qt 4.8.7/5.11.2, Plasma 5.14.2
Lazarus 1.8.2 r57369 FPC 3.0.4 i386-win32-win32/win64 Wine 3.21

Try Eye-Candy Controls: https://sourceforge.net/projects/eccontrols/files/

Blaazen

  • Hero Member
  • *****
  • Posts: 2782
  • POKE 54296,15
    • Eye-Candy Controls
Re: Pascal performance on polynomial benchmark slower than expected
« Reply #2 on: October 13, 2014, 07:49:17 pm »
Also, try replace
Code: [Select]
... / 2.0;with
Code: [Select]
... * 0.5;Maybe it is done automatically, I'm not sure.
Lazarus 2.1.0 r59757M FPC 3.3.1 r40507 x86_64-linux-qt Chakra, Qt 4.8.7/5.11.2, Plasma 5.14.2
Lazarus 1.8.2 r57369 FPC 3.0.4 i386-win32-win32/win64 Wine 3.21

Try Eye-Candy Controls: https://sourceforge.net/projects/eccontrols/files/

serbod

  • Full Member
  • ***
  • Posts: 127
Re: Pascal performance on polynomial benchmark slower than expected
« Reply #3 on: October 13, 2014, 08:02:32 pm »
Lazarus 1.2.4 with FPC 2.6.4
(Win7-32, Intel Duo T5250, 2Gb RAM)

Your code in default GUI application (Create new.. application)
time: 00:00:10.681

no debugging, added -O3 optimizations
time: 00:00:22.319  (Why?!)

-Or (use register variables), {$MAXFPUREGISTERS 5}
time: 00:00:10.602  (Why?)

Looking at assembler - variables not in registers, even cycle counter.

Upd: * 0.5 instead of / 2.0
time: 00:00:10.371
« Last Edit: October 13, 2014, 08:08:34 pm by serbod »

jarto

  • Full Member
  • ***
  • Posts: 106
Re: Pascal performance on polynomial benchmark slower than expected
« Reply #4 on: October 13, 2014, 08:14:43 pm »
Tested on Linux Mint with Intel(R) Core(TM) i5 CPU       U 470  @ 1.33GHz

fpc 2.6.4, 64 bit, -O3, smart linking, no debug:
real   0m6.324s
user   0m6.319s
sys   0m0.000s

fpc 2.7.1, 64 bit, -O3, smart linking, no debug:
real   0m5.818s
user   0m5.808s
sys   0m0.004s

fpc 2.7.1, 64 bit, -O3, smart linking, no debug, a few tweaks to the code:
real   0m5.165s
user   0m5.159s
sys   0m0.004s

Almost always, if FPC performs really slow, you are compiling with debug on. Check if the compiled binary is about 35k or a lot bigger.
« Last Edit: October 13, 2014, 08:19:25 pm by jarto »

marcov

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 7495
Re: Pascal performance on polynomial benchmark slower than expected
« Reply #5 on: October 13, 2014, 08:17:46 pm »
Benchmarking rows of code that can be reduced compiletime to the answer is always a risk. Some compiler might bother to optimize such microbenchmark more than others, but that's hardly a measure of realworld performance.

Try to make the number of iterations or some other crucial number something that is read from a textfile to avoid this.

With that improvement, it is still a microbenchmark (since still a single procedure in a single source), but at least the really extreme differences should be tampered.
« Last Edit: October 13, 2014, 08:32:06 pm by marcov »

howardpc

  • Hero Member
  • *****
  • Posts: 3176
Re: Pascal performance on polynomial benchmark slower than expected
« Reply #6 on: October 13, 2014, 08:22:15 pm »
Does your hardware have an FPU, and do you know for sure it is being utilised in this exercise; or is the compiler emulating it?

DelphiFreak

  • Full Member
  • ***
  • Posts: 246
    • Fresh sound.
Re: Pascal performance on polynomial benchmark slower than expected
« Reply #7 on: October 13, 2014, 08:32:39 pm »
Hi,

I am not sure if your "translated" code is correct.
Because there a few things which could be optimized.

See my code.

Code: [Select]
program project1;
{$mode objfpc}

uses
  windows,
  interfaces,
  sysutils;

var
  z: integer;
  _start:longint;

  function DoIt(const x: double): double;
  var
    Mu: double = 10.0;
    Su: double;
    I, J, N: integer;
  begin
    _start:=GetTickCount;
    N := 500000;
    result := 0;

    for I := 0 to N - 1 do begin
      Su := 0.0;
      for J := 0 to 99 do begin
        Mu := (Mu + 2.0) / 2.0;
        Su := x * Su + Mu;
      end;
      result := result + Su;
    end;
  end;

begin
  _start:=GetTickCount;
  for z := 1 to 10 do writeln(DoIt(0.2));
  writeln(inttostr(GetTickCount-_Start));
  readln;
end. 
Linux Mint 19.1, Lazarus 2.0, Windows 7&10, Delphi 7, Delphi 10.3 Rio

kamischi

  • Full Member
  • ***
  • Posts: 177
Re: Pascal performance on polynomial benchmark slower than expected
« Reply #8 on: October 13, 2014, 11:09:24 pm »
Does Gambas float really correspond to double or to single? This could make quite some difference. Actually, on 64bit it does not make a difference. With ppc386 there is one. But a real boost came from changing that one line to:
   Mu := 0.5 * Mu + 1.0;
and compile with -O3
and the time went down to below 50%.
« Last Edit: October 13, 2014, 11:37:30 pm by kamischi »
fpc 2.6.4, lazarus 1.4.0, Mac OS X, fink

kamischi

  • Full Member
  • ***
  • Posts: 177
Re: Pascal performance on polynomial benchmark slower than expected
« Reply #9 on: October 13, 2014, 11:42:07 pm »
Also, try replace
Code: [Select]
... / 2.0;with
Code: [Select]
... * 0.5;Maybe it is done automatically, I'm not sure.
It does not seem to be done.
fpc 2.6.4, lazarus 1.4.0, Mac OS X, fink

botster

  • New Member
  • *
  • Posts: 18
Re: Pascal performance on polynomial benchmark slower than expected
« Reply #10 on: October 14, 2014, 01:16:23 am »
Wow. Thank you for all your fast and helpful responses!

@Blaazen: I apologize. I neglected to mention previously that I tried compiling with LEVEL3 optimizations, and that increased execution time (as 'serbod' also experienced). Also, your suggestion to change the divide by 2.0 to a multiply by 0.5 did make an appreciable difference -- more on that toward the bottom.

@jarto: The compiled binary is 133K. I tried compiling with {$DEBUGINFO OFF}, but that did not change the file size at all, nor did it speed up the execution. Perhaps I didn't use the correct directive?

@howardpc: Yes, my hardware does have a FPU. I did not target that before. Compiling with {$FPUTYPE SSE2} decreased the execution time consistently by about 1 second over not using that directive. Interestingly, having my web browser running with 13 tabs open causes about a 5-15% difference in execution speed (targeting the FPU vs. not targeting it). So, maybe what I am experiencing is an issue with not enough RAM.

@DelphiFreak: I agree. Storing values temporarily in that array is a bit redundant. But, I was striving for as near equivalent to the Gambas code as I could get. Using your optimized code did give quite a performance increase of about 35%; down to only 22s execution time. But, to compare apples with apples, I also optimized the Gambas code which then ran in only 5s; making the difference in speed a ratio of about 4:1 instead of only 2:1 as before. (By the way, why did the variable "result" not need to be declared in the "var" section before it was used?)

@kamischi: Yes, the Gambas float corresponds to a double. http://www.gambaswiki.org/wiki/cat/datatypes, "Float | Like the double datatype in C. | 8 bytes (Size in memory)".


There has been more than one suggestion to change the floating point divide ("... / 2.0") into a multiply ("... * 0.5"). And that did make a big difference -- down to only 24.253s from 33.462s. (Floating point division is inherently slower than multiplication?) Kamischi's suggestion to use "Mu := 0.5 * Mu + 1.0" and optimize with LEVEL3 did not increase that improvement.

But, I am still not getting anywhere near the negligible difference between the pre-compiled binary (Pascal) and the just-in-time compiled binary (Gambas) that another Gambas user did (5.376s and 4.172s, respectively).

I don't know what else I can do to fix my system, other than maybe add more RAM.

I used Turbo Pascal years ago. But I am only recently revisiting Pascal, so I am not well-versed in the optimization and configuration options that are available; though I have tried making use of the Programmers Guide.

I appreciate the help you all have provided.

Lee
Lee

jarto

  • Full Member
  • ***
  • Posts: 106
Re: Pascal performance on polynomial benchmark slower than expected
« Reply #11 on: October 14, 2014, 07:02:37 am »
On my 64bit computer:

fpc polynom.lpr
158600 bytes, 8.178s

fpc -B -MObjFPC -Scghi -CX -O3 -Or -Px86_64 -XX polynom.lpr
34856 bytes, 5.732s

fpc version 2.7.1 and -B -MObjFPC -Scghi -CX -O3 -Or -Px86_64 -XX polynom.lpr
35712 bytes, 5.156s

All tests run with kamischi's change (use multiplication instead of division)

engkin

  • Hero Member
  • *****
  • Posts: 2513
Re: Pascal performance on polynomial benchmark slower than expected
« Reply #12 on: October 14, 2014, 07:07:26 am »
I wonder if this 32bit version of procedure DoIt is any better.
Code: [Select]
const
  dblTen: double = 10.0;
  dblTwo: double = 2.0;
  dblHalf: double = 0.5;

function DoIt(x : double) : double;
  var aPoly : array [0..99] of double;

  label
    MainLoop,
    Loop1,
    Loop2;

begin
{$ASMMODE intel}
  asm
    fld qword ptr [dblTen] //st5 = 10.0
    fldz                   //st4 = 0.0 ~ Mu
    fldz                   //st3 = 0.0 ~ Su
    fld qword ptr [x]      //st2 = x
    fld qword ptr [dblTwo] //st1 = 2.0
    fld qword ptr [dblHalf]//st0 = 0.5
    fxch st(5)             //st0 = 10.0 , st5 = 0.5

    mov ecx, 500000           //ecx ~ J

    MainLoop://ecx

      xor eax, eax           //eax ~ I
      Loop1://eax
        inc eax
        cmp eax, 100
        fadd st, st(1)         //Mu := Mu + 2;
        fmul st, st(5)         //Mu := Mu * 0.5;
        fst qword ptr [aPoly+8*eax-8]  //aPoly[I] = Mu
      jl Loop1
      fld st(4)              //st4 = Mu

      xor eax, eax
      Loop2://eax
        add eax, 1
        cmp eax, 100
        fmul st, st(3)                 //Su := Su * x;
        fadd qword ptr [aPoly+8*eax-8] //Su := Su + aPoly[I];
      jl Loop2

      faddp st(4), st                //Pu := Pu + Su;

      sub ecx, 1
    jnz MainLoop

    fstp st(2)
    fstp st(1)
    fstp st
    fstp st(2)
    fstp st
    fstp qword ptr [Result]
  end;
end;
« Last Edit: October 14, 2014, 07:14:35 am by engkin »

kamischi

  • Full Member
  • ***
  • Posts: 177
Re: Pascal performance on polynomial benchmark slower than expected
« Reply #13 on: October 14, 2014, 07:23:30 am »
I did a test lately comparing fpc, gcc and gfortran. Their performance was very similar, unless i switched on fastmath optimization in gcc or gfortran. That made a difference of a factor 2. But i think that affects higher order functions like power, sin, ...

MiSchi
fpc 2.6.4, lazarus 1.4.0, Mac OS X, fink

marcov

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 7495
Re: Pascal performance on polynomial benchmark slower than expected
« Reply #14 on: October 14, 2014, 09:51:49 am »
With 2.7.1 try -O4 -Oloopunroll and maybe some inline here and there.  Specially the loopunroll might be interesting, since it is a constant number of iterations. (see my earlier suggestions)
« Last Edit: October 14, 2014, 10:17:25 am by marcov »