Recent

Author Topic: Overflow intrinsic  (Read 1157 times)

LemonParty

  • Sr. Member
  • ****
  • Posts: 355
Overflow intrinsic
« on: September 16, 2025, 12:16:37 am »
Hello.

I think there should be added function for adding or substracting which return overflow state. Because using exeptions is not performant.
It should look like this:
Code: Pascal  [Select][+][-]
  1. begin
  2.   {Result is where the sum of A and B written}
  3.   if AddOverflow(Result, A, B) then
  4.     Writeln('Overflow happened');
  5. end.
  6.  
Lazarus v. 4.99. FPC v. 3.3.1. Windows 11

Martin_fr

  • Administrator
  • Hero Member
  • *
  • Posts: 11793
  • Debugger - SynEdit - and more
    • wiki
Re: Overflow intrinsic
« Reply #1 on: September 16, 2025, 12:55:53 am »
Well, while not as fast as checking the carry flag of a CPU (if the CPU has the info / e.g. bit more complex with adding 2 int64 on a 32bit platform), you can do

1) If you add 2 integer
Code: Pascal  [Select][+][-]
  1. var t: int64;
  2. begin
  3.   t := int64(a) + int64(b);
  4.   overflow := (t > high(integer)) or (t < low(integer));
  5.   result := integer(t);  
  6. end;

If you add 2 QWORD, both positive.
Code: Pascal  [Select][+][-]
  1. var r QWord; // result
  2. begin
  3.   {$PUSH}{$R-}{$Q-} // no exceptions
  4.   r := a + b;
  5.   {$POP}
  6.   overflow := r < min(a, b);
  7. end;

Warfley

  • Hero Member
  • *****
  • Posts: 2020
Re: Overflow intrinsic
« Reply #2 on: September 16, 2025, 12:06:18 pm »
How about just doing it yourself:
Code: Pascal  [Select][+][-]
  1. generic function AddOverflow<T>(out outp: T; const lhs, rhs: T): Boolean; inline;
  2. begin
  3.   outp := lhs + rhs;
  4.   Result := (outp<lhs) and (outp<rhs);
  5. end;
Disabling short circuit bools will probably improve performance.

If you add 2 QWORD, both positive.
Code: Pascal  [Select][+][-]
  1. var r QWord; // result
  2. begin
  3.   {$PUSH}{$R-}{$Q-} // no exceptions
  4.   r := a + b;
  5.   {$POP}
  6.   overflow := r < min(a, b);
  7. end;

The min is not even necessary, if both are positive if no overflow the result should be larger than any. The min is only necessary if one can be negative
« Last Edit: September 16, 2025, 12:08:05 pm by Warfley »

LemonParty

  • Sr. Member
  • ****
  • Posts: 355
Re: Overflow intrinsic
« Reply #3 on: September 16, 2025, 03:21:18 pm »
The examples you shown are kind of slow too. The fast implementation is checking the CPU flag as mentioned.
« Last Edit: September 16, 2025, 03:39:12 pm by LemonParty »
Lazarus v. 4.99. FPC v. 3.3.1. Windows 11

Martin_fr

  • Administrator
  • Hero Member
  • *
  • Posts: 11793
  • Debugger - SynEdit - and more
    • wiki
Re: Overflow intrinsic
« Reply #4 on: September 16, 2025, 05:44:55 pm »
The examples you shown are kind of slow too. The fast implementation is checking the CPU flag as mentioned.
Actually, yes and no.

If you take the (quite limited) positive number only (Qword or Dword ...) then you end up with 2 < compares. If (and I don't know if...) the compiler generates branch-less code for that (at least the first) then you end up with
- 2 cmp
- 2 flag tests
- 1 branch at the end

versus
- 1 flag test
- 1 branch

On the paper that looks massive. But given the way modern CPU do prediction and elsewhat internal optimizations, you have to very special hand written asm code surrounding that to actually get a difference (and run it millions of times in a tight loop with very little else in that loop).

It may differ when you compile for a RISC target....

And well, despite the above, if it was avail, and I had code that needs the test => I would gladly use it. Probably desire it, too. So yes, I do backup the idea that it would be good to have.

Only, it will probably gain you at best (very optimistically) one or two percent.
(that is in an app wich does more than only adding that one percent is watered down, but the idea is that many such optimization, each contributing a percent on a different part of the code, and you will get some noticeable result)



Btw, if you do need to get something faster (as in by all means, and even if just a tiny bit) => try FPC 3.3.1. Some of my testcases (doesn't matter, I just use them to bench/compare diff fpc versions) had 3% or 4% gains, just from being compiled with 3.3.1 (at -O4 vs -O4 with 3.2.2 or 3.2.3)


LemonParty

  • Sr. Member
  • ****
  • Posts: 355
Re: Overflow intrinsic
« Reply #5 on: September 16, 2025, 06:50:05 pm »
I wrote a simple benchmark to test performance and got +7-9% in 64-bit mode. Difference in 32-bit mode should be bigger because Int64 there is not native.
Lazarus v. 4.99. FPC v. 3.3.1. Windows 11

Martin_fr

  • Administrator
  • Hero Member
  • *
  • Posts: 11793
  • Debugger - SynEdit - and more
    • wiki
Re: Overflow intrinsic
« Reply #6 on: September 16, 2025, 08:50:19 pm »
I tested on 64 bits.

Your asm returns true (has overflow) for
  2147483597 + 37

But that isn't true. The result is
2147483634
$7FFFFFF2

Code: Pascal  [Select][+][-]
  1.         writeln(AddOverflow(x1, $7FFFFFCD, 37));
  2.         writeln(x1);
  3. ///
  4. TRUE
  5. 2147483634




In my tests, the sign extend seems to be expensive.

Try
Code: Pascal  [Select][+][-]
  1. function AddOverflow(var aResult: LongInt; A, B: LongInt): Boolean;
  2. var
  3.         lResult: qword;
  4. begin
  5.         lResult:= DWORD(A) + DWORD(B);
  6.         Result:= lResult <= High(DWord);
  7.         aResult:= integer(lResult);
  8. end;
  9.  

The sign bit will still be added, and after casting have its original meaning.

The compiler adds some "and" at the start
Code: Text  [Select][+][-]
  1.         andl    %edx,%edx
  2.         andl    %r8d,%r8d
  3.         addq    %r8,%rdx
  4.         movl    $4294967295,%eax
  5.         cmpq    %rax,%rdx
  6.         setbeb  %al
  7.         movl    %edx,(%rcx)
  8.         ret
  9.  

I did not have time to hand optimize that, to have comparable asm.

After all, if we had the intrinsic, FPC might still add some strange to the code using the intrinsic. So you are not just comparing the method (carry flag vs compare), but also hand written asm vs compliler generated asm (that is in this case not optimally representing the task)

I also did not test, if an AND $FFFFFFFF00000000  and <> 0 would be better (unlikely)



I'll be away for the next 8 days. So I may not respond.

Martin_fr

  • Administrator
  • Hero Member
  • *
  • Posts: 11793
  • Debugger - SynEdit - and more
    • wiki
Re: Overflow intrinsic
« Reply #7 on: September 16, 2025, 08:52:17 pm »
in a hurry. of course ANDL not AND => 32 to 64

LemonParty

  • Sr. Member
  • ****
  • Posts: 355
Re: Overflow intrinsic
« Reply #8 on: September 16, 2025, 09:09:49 pm »
By the way, this is how checking for overflow of Int64 look like:
Code: Pascal  [Select][+][-]
  1. function SumOverflow(A, B, SumAB: Int64): Boolean;inline;
  2. begin
  3.   if (A xor B) < 0 {Different signs -> never overflows}
  4.     then Result:= False
  5.     else Result:= (SumAB xor B) < 0; {Sign changed -> Overflow}
  6. end;
  7.  
For this case intrinsic will be even better.
Lazarus v. 4.99. FPC v. 3.3.1. Windows 11

Martin_fr

  • Administrator
  • Hero Member
  • *
  • Posts: 11793
  • Debugger - SynEdit - and more
    • wiki
Re: Overflow intrinsic
« Reply #9 on: September 16, 2025, 09:34:18 pm »
Well, I can't do lots of tests now...

I am astonished it made 7% in your test. Then my guess was off..
Still that is 7% in a loop doing nothing else....

About the XOR => it will probably/maybe matter, if you can write it such that the 331 compiler generates branch-less code.
Also, it shouldn't do the full result, just the top bit, but I guess it was just to showcase the idea.

Warfley

  • Hero Member
  • *****
  • Posts: 2020
Re: Overflow intrinsic
« Reply #10 on: September 16, 2025, 11:48:57 pm »
But why does this need to be an intrinsic? If you look at the assembly clang generates for the same function I wrote earlier translated to C (after inlining) it does not generate a compare instruction, but instead recognizes the overflow check and exchanges it with a test for the overflow bit.

Adding an intrinsic, which is a permanent feature to the language, that is hardcoded in the compiler, should not be done for something that as demonstrated by clang, can be done by the optimizer without any changes to the language. Also FPC supports LLVM as backend for some systems, so chances are you get the same optimizations as clang does when you compile your pascal code using LLVM.
And if you need that optimization but can't use LLVM or wait for this optimization to be implemented in the FPC directly, just do what you already did and write an nostackframe assembly function

LemonParty

  • Sr. Member
  • ****
  • Posts: 355
Re: Overflow intrinsic
« Reply #11 on: September 17, 2025, 12:47:19 am »
The nostackframe assembly function is not inlined. Intrinsic can be inlined. That is why it should be an intrinsic.

For what targets used LLVM?
Lazarus v. 4.99. FPC v. 3.3.1. Windows 11

Warfley

  • Hero Member
  • *****
  • Posts: 2020
Re: Overflow intrinsic
« Reply #12 on: September 17, 2025, 01:16:01 am »
Yeah, but again an intrinsic is hardcoded in the compiler, take a look at this and this. Handling of intrinsics is done in one giant 1kloc case-of statement just for the first pass, and then another giant unit just for code generation. Adding an intrinsic permanently increases the complexity of this code further.
There should be a very good reason why this is actually needed as an intrinsic.

About LLVM, afaik LLVM is available for Unix-like systems, not yet Windows (but this information could already be outdated, that was a few years ago the case). It's its own target and you need an FPC version compiled for LLVM. If you have that you should be able to compile into LLVM bytecode.
« Last Edit: September 17, 2025, 01:18:33 am by Warfley »

LemonParty

  • Sr. Member
  • ****
  • Posts: 355
Re: Overflow intrinsic
« Reply #13 on: September 17, 2025, 02:11:55 pm »
OK. I understand.
Lazarus v. 4.99. FPC v. 3.3.1. Windows 11

runewalsh

  • Full Member
  • ***
  • Posts: 106
Re: Overflow intrinsic
« Reply #14 on: September 17, 2025, 04:40:39 pm »
As I read somewhere, Pascal (and Go) way is “80% optimizations for 5% efforts”, not “all imaginable optimizations, no matter how complex and how long it takes to compile”. Intrinsics are simpler than implicit optimizations.

Moreover, implicit optimizations will probably need the intrinsic anyway, e.g. FPC knows the x shl 27 or x shr 5RorDWord(x, 5) optimization but it obviously relies on the RorDWord intrinsic. Why not publish the intrinsic, then, so your critical places can both not rely on optimizations and be more readable (RorDWord vs. sequence of bitwise operations).

 

TinyPortal © 2005-2018