Author Topic: Code speed in 64bit compiler vs 32bit compiler  (Read 1887 times)

ad1mt

  • Sr. Member
  • ****
  • Posts: 327
    • Mark Taylor's Home Page
Code speed in 64bit compiler vs 32bit compiler
« on: November 10, 2024, 10:44:28 am »
I have written some code using 64bit integers, which I then compiled with both the 64bit and 32bit compilers.

I expected the code produced by the 64bit compiler to run faster, but that is not the case. The code produced by the 32bit compiler runs slightly faster.

Does anyone have any insight into why this is?

Thanks.

440bx

  • Hero Member
  • *****
  • Posts: 4908
Re: Code speed in 64bit compiler vs 32bit compiler
« Reply #1 on: November 10, 2024, 11:07:39 am »
Quote from: ad1mt
Does anyone have any insight into why this is?

Without seeing the code, only guesses can be provided.

Apparently the code you wrote takes fewer clock cycles to execute in 32 bit than in 64 bit. The reason why? Without seeing the code, who knows!
(FPC v3.0.4 and Lazarus 1.8.2) or (FPC v3.2.2 and Lazarus v3.2) on Windows 7 SP1 64bit.

marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 12005
  • FPC developer.
Re: Code speed in 64bit compiler vs 32bit compiler
« Reply #2 on: November 10, 2024, 11:54:55 am »
Or, if you calculate large tables or parse large data structures, the cache/memory might be the bottleneck, not the CPU.

Thaddy

  • Hero Member
  • *****
  • Posts: 16419
  • Censorship about opinions does not belong here.
Re: Code speed in 64bit compiler vs 32bit compiler
« Reply #3 on: November 10, 2024, 12:17:53 pm »
Or the FPU setting, for that matter. The compiler is very conservative up to and including -O2.
There is also a noticeable difference between ARM32 and AARCH64; it is not only Intel.
There is nothing wrong with being blunt. At a minimum it is also honest.

MathMan

  • Sr. Member
  • ****
  • Posts: 405
Re: Code speed in 64bit compiler vs 32bit compiler
« Reply #4 on: November 10, 2024, 12:45:10 pm »
Or the target - if the target is a 32 bit CPU, then ...

Regarding the source, I think ad1mt is referring to his Multi-Int package, as he posted the same comment here - https://forum.lazarus.freepascal.org/index.php/topic,68290.msg537657/topicseen.html#new
« Last Edit: November 10, 2024, 01:01:15 pm by MathMan »

LV

  • Full Member
  • ***
  • Posts: 197
Re: Code speed in 64bit compiler vs 32bit compiler
« Reply #5 on: November 10, 2024, 02:44:13 pm »
The floating point results are also interesting (FPC 3.2.2, no optimization).

Code: Pascal
program project1;

uses
  SysUtils;

var
  start: int64;
  i, j, k: integer;
  x: double;
  p: Pointer;
begin
  writeln('system ', SizeOf(p) * 8, '  bit');
  for k := 1 to 10 do
  begin
    start := gettickcount64;
    for j := 0 to 10000 do
      for i := 0 to 10000 do
        x := ln(i + 1);
    writeln(k, '  finished', ' took: ', gettickcount64 - start);
  end;
  readln;
end.

Code: Text
system 32  bit
1  finished took: 1485
2  finished took: 1437
3  finished took: 1422
4  finished took: 1422
5  finished took: 1437
6  finished took: 1422
7  finished took: 1438
8  finished took: 1406
9  finished took: 1437
10  finished took: 1422

system 64  bit
1  finished took: 2734
2  finished took: 2641
3  finished took: 2641
4  finished took: 2640
5  finished took: 2625
6  finished took: 2641
7  finished took: 2640
8  finished took: 2625
9  finished took: 2641
10  finished took: 2625

Windows 11.

Lazarus 2.2.6 (rev lazarus_2_2_6) FPC 3.2.2 x86_64-win64-win32/win64
« Last Edit: November 10, 2024, 02:53:33 pm by LV »

LV

  • Full Member
  • ***
  • Posts: 197
Re: Code speed in 64bit compiler vs 32bit compiler
« Reply #6 on: November 10, 2024, 03:36:09 pm »
Another test.

Code: Pascal
program project1;

uses
  SysUtils;

const
  MN = 8000;

type
  MyArr = array of array of double;

var
  Arrx, Arry, Arrz: MyArr;
  start: int64;
  i, j, k, l: integer;
  x: double;
  p: Pointer;

  procedure Computing(i: integer; j: integer);
  begin
    Arry[i, j] := Arrx[i, j] - (Arrx[i, j] * Arrx[i, j] - Arrx[i - 1, j] *
      Arrx[i - 1, j] / 2);
    Arrz[i, j] := Arrx[i, j] - (Arrx[i, j] * Arrx[i, j] - Arrx[i, j - 1] *
      Arrx[i, j - 1] / 2);
  end;

begin
  writeln('system ', SizeOf(p) * 8, '  bit');
  SetLength(Arrx, MN + 1, MN + 1);
  SetLength(Arry, MN + 1, MN + 1);
  SetLength(Arrz, MN + 1, MN + 1);
  for l := 1 to 5 do
  begin
    start := gettickcount64;
    for k := 1 to 5 do
      for i := 1 to MN do
        for j := 1 to MN do
          Computing(i, j);
    writeln(l, '  finished', ' took: ', gettickcount64 - start);
  end;
  readln;
end.

Code: Text
system 32  bit
1  finished took: 3281
2  finished took: 3250
3  finished took: 3250
4  finished took: 3266
5  finished took: 3234

system 64  bit
1  finished took: 2953
2  finished took: 2906
3  finished took: 2906
4  finished took: 2906
5  finished took: 2891

LV

  • Full Member
  • ***
  • Posts: 197
Re: Code speed in 64bit compiler vs 32bit compiler
« Reply #7 on: November 10, 2024, 04:15:30 pm »
I have a multithreaded program that I tested on a computer with an i7 8700 processor running Windows 11. I set it to run with 6 threads.

In the Win32 compilation mode (cross-compilation target: i386), the execution time was 16 seconds.
In the Win64 compilation mode (target: Default), the execution time was 8 seconds.

tetrastes

  • Hero Member
  • *****
  • Posts: 622
Re: Code speed in 64bit compiler vs 32bit compiler
« Reply #8 on: November 10, 2024, 04:33:49 pm »
Quote from: LV
The floating point results are also interesting (FPC 3.2.2, no optimization).

This is because ppcx64 does not use x87 FPU.

BrunoK

  • Hero Member
  • *****
  • Posts: 647
  • Retired programmer
Re: Code speed in 64bit compiler vs 32bit compiler
« Reply #9 on: November 10, 2024, 04:43:04 pm »
Code: Pascal
bk_1 : x := ln(i + 1);
bk_2 : procedure Computing(i: integer; j: integer);
Notes:
  bk_1 : The 32 bit mode uses the in-processor x87 ln intrinsic operation, while the 64 bit mode uses a more exacting compiler procedure in 3.2.2\rtl\inc\genmath.inc from line 1370. It seems that the intrinsic x87 floating point is twice as fast as the Windows x86_64 implementation.
  bk_2 : On my system, the 64 bit mode is slightly faster than the x87 FPU mode, probably because Double (64 bits) matches the register size (8 bytes).

As for compilation, try compiling with -O1 -OoREGVAR. The generated code remains very debuggable but is nearly as fast as with higher optimization.
« Last Edit: November 10, 2024, 04:45:11 pm by BrunoK »

Thaddy

  • Hero Member
  • *****
  • Posts: 16419
  • Censorship about opinions does not belong here.
Re: Code speed in 64bit compiler vs 32bit compiler
« Reply #10 on: November 10, 2024, 05:06:56 pm »
Quote from: tetrastes
  Quote from: LV
  The floating point results are also interesting (FPC 3.2.2, no optimization).
This is because ppcx64 does not use x87 FPU.

That is a misconception. SSE3+, AVX<x> and X86-64-V<x> are much faster and even with better precision in some cases. But all these do not belong to the default optimization up to level O2. You can get upwards of 128 bit - here up to 512 - precision from those and that is a lot more than 80 bit. It just takes a bit of understanding what you want exactly...and what the fpu settings do...
e.g. avx2+ is a lot faster and with more precision than the olden x87 concept if required. Although it might not be very well integrated in the compiler yet.
« Last Edit: November 10, 2024, 05:18:29 pm by Thaddy »

tetrastes

  • Hero Member
  • *****
  • Posts: 622
Re: Code speed in 64bit compiler vs 32bit compiler
« Reply #11 on: November 10, 2024, 05:47:06 pm »
Quote from: Thaddy
  Quote from: tetrastes
    Quote from: LV
    The floating point results are also interesting (FPC 3.2.2, no optimization).
  This is because ppcx64 does not use x87 FPU.
That is a misconception. SSE3+, AVX<x> and X86-64-V<x> are much faster and even with better precision in some cases. But all these do not belong to the default optimization up to level O2.

I was writing about this particular code. Why x87 is faster here despite all optimizations, you can read in BrunoK's answer.

Quote from: Thaddy
You can get upwards of 128 bit - here up to 512 - precision from those and that is a lot more than 80 bit. It just takes a bit of understanding what you want exactly...and what the fpu settings do...
e.g. avx2+ is a lot faster and with more precision than the olden x87 concept if required. Although it might not be very well integrated in the compiler yet.

That is the point. You cannot achieve any of this without using assembler. Not to mention that there are no high-precision floating point types; even Extended has been removed. Maybe they can be used to some extent (?) for all those cool things, at least to get 80-bit precision.
That said, the discussion about precision has no bearing on this code.


MathMan

  • Sr. Member
  • ****
  • Posts: 405
Re: Code speed in 64bit compiler vs 32bit compiler
« Reply #12 on: November 10, 2024, 06:57:09 pm »
@BrunoK

Quote
bk_1 : The 32 bit mode uses the in-processor x87 ln intrinsic operation, while the 64 bit mode uses a more exacting compiler procedure in 3.2.2\rtl\inc\genmath.inc from line 1370. It seems that the intrinsic x87 floating point is twice as fast as the Windows x86_64 implementation.

Nearly correct - it uses 'FYL2X FYL2XP1' plus a multiplication by a constant - taking ~80-90 cycles, as per Agner Fog's measurements. The 64 bit mode variant for Double is, however, not more exact: the Intel microcode implementations of the x87 'FYL2X FYL2XP1' instructions for Extended are correct to ~1.5 units in the last place, which is much better than anything achievable for Double. See the Intel x87 documentation, e.g. here: http://www.infophysics.net/x87.pdf.

Edit: Oops, sorry - maybe I misread 'intrinsic' as 'mnemonic' here?

In general terms there has been a lot of progress recently (last 10-15 years) on floating point libraries in terms of speed and accuracy for Single & Double algebraic and transcendental functions. Search for the 'Core Math' project of Paul Zimmermann if you want to take a peek. Modern libraries are now faster than the x87 microcode for Single & Double. If the library in FPC RTL is slower, as you suggest, then it probably hasn't been updated for quite some time.

The Extended type nowadays is on a major retreat as far as I can see.

@Thaddy

Quote
That is a misconception. SSE3+, AVX<x> and X86-64-V<x> are much faster and even with better precision in some cases. But all these do not belong to the default optimization up to level O2. You can get upwards of 128 bit - here up to 512 - precision from those and that is a lot more than 80 bit. It just takes a bit of understanding what you want exactly...and what the fpu settings do...

I think the misconception is on your side here, sorry. All vector extensions of Intel (and also ARM / RISC-V and others, as far as I am aware) top out at Double (the IEEE 64 bit floating point type, FP64). What they do offer is the ability to operate on multiple instances of FP32 or FP64 at the same time, but that does not add any precision. You can do e.g. 2 / 4 / 8 multiplications of FP64 in parallel on SSE / AVX2 / AVX512 - but this is not the same as doing 1 FP512 multiplication!

They are faster than the x87 on the basic operations like +,-,* but at the same time have lost all algebraic / transcendental functions (with the exception of SQRT and INVSQRT), which now must be implemented by libraries.

@tetrastes:

Quote
Not to mention that there are no high-precision floating point types; even Extended has been removed. Maybe they can be used to some extent (?) for all those cool things, at least to get 80-bit precision.

Yes they can - there are some libraries implementing what is called double-double or quad-double - a non-IEEE floating point format with 106 / 212 bit precision. However, this is stitched together from 2 / 4 IEEE FP64 values and, though fast, seems to have been neglected recently (<= I may be in error here, I haven't followed it thoroughly).
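
To illustrate the "stitched together from 2 IEEE FP64" idea, here is a minimal sketch of the basic building block such libraries rely on: an error-free addition (Knuth's TwoSum) that returns the rounded sum plus the rounding error as a second Double. The type name TDoubleDouble and the program itself are hypothetical and not taken from any particular library. The sketch assumes that Double arithmetic is really carried out in FP64 (as on x86_64 with SSE2) and that no fast-math style optimization is enabled; with x87 intermediate precision the error term may come out differently.

```pascal
program DoubleDoubleSketch;

{$mode objfpc}

type
  // Hypothetical illustration type: the represented value is hi + lo.
  TDoubleDouble = record
    hi, lo: Double;
  end;

// Knuth's TwoSum: hi is the rounded sum a + b, lo is the rounding error,
// so that hi + lo equals a + b exactly (barring overflow).
function TwoSum(a, b: Double): TDoubleDouble;
var
  bv: Double;
begin
  Result.hi := a + b;
  bv := Result.hi - a;
  Result.lo := (a - (Result.hi - bv)) + (b - bv);
end;

var
  r: TDoubleDouble;
begin
  // 1e-20 is far below the precision of a Double near 1.0, so a plain
  // addition drops it; TwoSum keeps it in the lo part.
  r := TwoSum(1.0, 1e-20);
  writeln('hi = ', r.hi);
  writeln('lo = ', r.lo);
end.
```

A full double-double type would build multiplication and the other operations on top of this, which is where most of the work in such libraries goes.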
« Last Edit: November 10, 2024, 07:03:12 pm by MathMan »

LV

  • Full Member
  • ***
  • Posts: 197
Re: Code speed in 64bit compiler vs 32bit compiler
« Reply #13 on: November 10, 2024, 07:10:14 pm »
Quote from: BrunoK
As for compilation, try to compile with -O1 -OoREGVAR. The generated code remains very debuggable but is nearly as fast as with higher optimization.

@BrunoK, thanks. That's what I usually do. In the test (Reply #7), a fairly large program for Win64 was compiled with the (-O1 + quick optimizations) (-O2) option. Out of curiosity, I checked it for Win32 with the same option.

For the sake of completeness, here are the results of tests with optimization (Reply #5):

Code: Text
system 32  bit   (-O1 + quick optimizations) (-O2)
1  finished took: 1391
2  finished took: 1375
3  finished took: 1375
4  finished took: 1359
5  finished took: 1391

system 64  bit   (-O1 + quick optimizations) (-O2)
1  finished took: 2719
2  finished took: 2687
3  finished took: 2641
4  finished took: 2703
5  finished took: 2703

and (Reply #6):

Code: Text
system 32  bit   (-O1 + quick optimizations) (-O2)
1  finished took: 1500
2  finished took: 1484
3  finished took: 1469
4  finished took: 1469
5  finished took: 1468

system 64  bit   (-O1 + quick optimizations) (-O2)
1  finished took: 1047
2  finished took: 1031
3  finished took: 1047
4  finished took: 1031
5  finished took: 1031

@MathMan, thank you; I have expanded my horizons.

Best regards

ad1mt

  • Sr. Member
  • ****
  • Posts: 327
    • Mark Taylor's Home Page
Re: Code speed in 64bit compiler vs 32bit compiler
« Reply #14 on: November 10, 2024, 07:17:22 pm »
It appears this might be rather complicated...
A very simple code test I just did, adding integers together, is sometimes faster with the 32bit compiler and sometimes faster with the 64bit compiler.

One of the things the speed depends on is the -O optimisation level. Another is mixing 32bit and 64bit variables in assignments or expressions.
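
A test of that kind could look roughly like the sketch below. It is not the code from this thread: the function names, the iteration count and the whole program are assumptions made for illustration. It times the same sum once with only 64bit variables and once with a 32bit counter added to a 64bit accumulator, so the two variants can be compared on each compiler and at each -O level.

```pascal
program MixedWidthSketch;

{$mode objfpc}

uses
  SysUtils;

const
  Iterations = 100000000; // hypothetical iteration count, adjust to taste

// Sums 1..n using only 64 bit variables.
function SumPure64(n: Int64): Int64;
var
  i: Int64;
begin
  Result := 0;
  i := 1;
  while i <= n do
  begin
    Result := Result + i;
    Inc(i);
  end;
end;

// Sums 1..n with a 32 bit counter added to a 64 bit accumulator,
// so each addition has to widen i to 64 bit first.
function SumMixed(n: LongInt): Int64;
var
  i: LongInt;
begin
  Result := 0;
  for i := 1 to n do
    Result := Result + i;
end;

var
  start: QWord;
  total: Int64;
begin
  writeln('system ', SizeOf(Pointer) * 8, ' bit');

  start := GetTickCount64;
  total := SumPure64(Iterations);
  writeln('pure 64 bit: ', total, ' took: ', GetTickCount64 - start);

  start := GetTickCount64;
  total := SumMixed(Iterations);
  writeln('mixed 32/64: ', total, ' took: ', GetTickCount64 - start);
end.
```

Printing the totals keeps the results in use and makes it easy to check that both variants compute the same value; the timings themselves will depend on the target, the optimisation level and the machine.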
« Last Edit: November 10, 2024, 07:19:12 pm by ad1mt »

 
