Topic: IndexDWord -> Repne scasd

mika

  • Full Member
  • ***
  • Posts: 102
IndexDWord -> Repne scasd
« on: January 18, 2023, 11:02:27 pm »
In issue https://gitlab.com/freepascal.org/fpc/source/-/issues/40119 the proposed solution will not change anything, because Repne scasd is not at fault.

64-bit uses the Generic IndexDWord, and the results are essentially the same as for 32-bit.

Output for 32-bit with Repne scasd:
Quote
Matrix4_x_Matrix4:         28 ns/call

System.IndexDWord(#15):    16 ns/call
Generic IndexDWord(#15):   7.1 ns/call

System.IndexDWord(#1007):  497 ns/call
Generic IndexDWord(#1007): 253 ns/call

Output for 64-bit, where System.IndexDWord == Generic IndexDWord:
Quote
Matrix4_x_Matrix4:         16 ns/call

System.IndexDWord(#15):    11 ns/call
Generic IndexDWord(#15):   6.7 ns/call

System.IndexDWord(#1007):  486 ns/call
Generic IndexDWord(#1007): 250 ns/call

mika

  • Full Member
  • ***
  • Posts: 102
Re: IndexDWord -> Repne scasd
« Reply #1 on: January 19, 2023, 04:25:13 pm »
I investigated this further and found that the IndexDWord version with Repne scasd is bad.
Generic IndexDWord is better, mostly because of the optimizations the compiler applies during RTL compilation.

Previously I was using an older version.
After updating I found that the optimizations have improved greatly.
Output for 64-bit, where System.IndexDWord == Generic IndexDWord:
Quote
Matrix4_x_Matrix4:         16 ns/call

System.IndexDWord(#15):    6.6 ns/call
Generic IndexDWord(#15):   7.0 ns/call

System.IndexDWord(#1007):  261 ns/call
Generic IndexDWord(#1007): 257 ns/call

Thaddy

  • Hero Member
  • *****
  • Posts: 14359
  • Sensorship about opinions does not belong here.
Re: IndexDWord -> Repne scasd
« Reply #2 on: January 19, 2023, 08:07:19 pm »
That depends heavily on the chip and the chip maker.
E.g. repne scasd and family used to be slow on Intel, but fast on AMD and VIA.
For FPC it also depends on your optimization settings for the CPU family (-Cp<cpu> and -Op<cpu>) and, for that matter, on cache control.
In general the compiler chooses the best option. I would not waste time on assembler optimizations unless you have to, and unless you know exactly which processor make and model you are optimizing for.
It often makes me laugh... :D "Hey, I can write optimized code in assembler!" (NOT!), which translates to: "Hey, I am wasting my time"... in most but not all cases.
« Last Edit: January 19, 2023, 08:18:23 pm by Thaddy »
Object Pascal programmers should get rid of their "component fetish" especially with the non-visuals.

marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 11445
  • FPC developer.
Re: IndexDWord -> Repne scasd
« Reply #3 on: January 19, 2023, 09:01:38 pm »
Ivy Bridge, -O4, 32-bit:

Quote
Matrix4_x_Matrix4:         27 ns/call

System.IndexDWord(#15):    26 ns/call
Generic IndexDWord(#15):   12 ns/call

System.IndexDWord(#1007):  569 ns/call
Generic IndexDWord(#1007): 549 ns/call

64-bit

Quote
Matrix4_x_Matrix4:         22 ns/call

System.IndexDWord(#15):    28 ns/call
Generic IndexDWord(#15):   11 ns/call

System.IndexDWord(#1007):  572 ns/call
Generic IndexDWord(#1007): 546 ns/call

This only seems to show slightly higher readings for short searches (which might be due to checks; still to be investigated).

That is the newest Intel that I own. I'll try some AMD (5700x/4800) tomorrow.

Note that the gcc note in the benchmark recommends using SIMD, which might be OK for 64-bit (since SSE2/3 is the minimum there), but would raise the minimum requirements for 32-bit.

That said, it probably does need some updating, but first let's get the facts on the table, starting with which processor your tests ran on.

mika

  • Full Member
  • ***
  • Posts: 102
Re: IndexDWord -> Repne scasd
« Reply #4 on: January 20, 2023, 05:39:27 pm »
You have strange benchmark results.

I wrote a SIMD version of IndexDWord, just to test.
AMD 2700X
32-bit, FPC 3.2.2
Code:
Matrix4_x_Matrix4:         28 ns/call

Repne Scasd Index (#15):   16 ns/call
System.IndexDWord(#15):    16 ns/call
Generic IndexDWord(#15):   7.7 ns/call
Xmm IndexDWord(#15):       8.5 ns/call
Ymm IndexDWord(#15):       8.8 ns/call

Repne Scasd Index (#1007): 518 ns/call
System.IndexDWord(#1007):  509 ns/call
Generic IndexDWord(#1007): 266 ns/call
Xmm IndexDWord (#1007):    146 ns/call
Ymm IndexDWord (#1007):    91 ns/call

64-bit, FPC 3.2.0
Code:
Matrix4_x_Matrix4:         28 ns/call

Repne Scasd Index (#15):   16 ns/call
System.IndexDWord(#15):    16 ns/call
Generic IndexDWord(#15):   7.5 ns/call
Xmm IndexDWord(#15):       8.3 ns/call
Ymm IndexDWord(#15):       8.6 ns/call

Repne Scasd Index (#1007): 504 ns/call
System.IndexDWord(#1007):  506 ns/call
Generic IndexDWord(#1007): 261 ns/call
Xmm IndexDWord (#1007):    149 ns/call
Ymm IndexDWord (#1007):    89 ns/call

SIMD versions are faster. Assembler isn't dead.
« Last Edit: January 20, 2023, 05:43:04 pm by mika »

marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 11445
  • FPC developer.
Re: IndexDWord -> Repne scasd
« Reply #5 on: January 21, 2023, 12:12:16 am »
32-bit:
Code:
Matrix4_x_Matrix4:         27 ns/call

Repne Scasd Index (#15):   26 ns/call
System.IndexDWord(#15):    25 ns/call
Generic IndexDWord(#15):   11 ns/call
Xmm IndexDWord(#15):       12 ns/call
Ymm IndexDWord(#15):       16 ns/call

Repne Scasd Index (#1007): 550 ns/call
System.IndexDWord(#1007):  547 ns/call
Generic IndexDWord(#1007): 537 ns/call
Xmm IndexDWord (#1007):    151 ns/call
An unhandled exception occurred at $0040176C:
EAccessViolation: Access violation
  $0040176C
  $00401CE5
  $0040215C

The crash is expected, since Ivy Bridge has no AVX2, only floating-point AVX1 (and the short array doesn't actually run the YMM code).

Laptop AMD 4800H, 32-bit:
Code:
Matrix4_x_Matrix4:         26 ns/call

Repne Scasd Index (#15):   16 ns/call
System.IndexDWord(#15):    16 ns/call
Generic IndexDWord(#15):   6.5 ns/call
Xmm IndexDWord(#15):       7.1 ns/call
Ymm IndexDWord(#15):       7.4 ns/call

Repne Scasd Index (#1007): 493 ns/call
System.IndexDWord(#1007):  493 ns/call
Generic IndexDWord(#1007): 249 ns/call
Xmm IndexDWord (#1007):    130 ns/call
Ymm IndexDWord (#1007):    77 ns/call

That said, maybe we can add some code paths for when the RTL is compiled with YMM support as the minimum. You might want to look at the vbroadcast opcode.
« Last Edit: January 24, 2023, 09:22:27 am by marcov »

mika

  • Full Member
  • ***
  • Posts: 102
Re: IndexDWord -> Repne scasd
« Reply #6 on: January 21, 2023, 02:38:11 am »
Quote
... since Ivy Bridge has no AVX2, only floating-point AVX1 (and the short array doesn't actually run the YMM code).

That said, maybe we can add some code paths for when the RTL is compiled with YMM support as the minimum. You might want to look at the vbroadcast opcode.

The version with XMM registers uses only the SSE2 instruction set. For small Length I did not use any SIMD, because the overhead does not pay off.
vbroadcast is AVX2, and it did not give any benefit time-wise.
It would be great if the RTL had CPU-dispatched code paths depending on the supported instruction set. But that goes against the FPC philosophy: one source for every use case.
« Last Edit: January 21, 2023, 02:51:18 am by mika »
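The approach described in this post (an SSE2 compare on XMM registers, with a plain loop for short inputs) can be illustrated with a minimal C sketch. This is not mika's code, which is not shown in the thread. The function names, the 16-element threshold, and the use of intrinsics instead of hand-written assembler are all assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
#define HAVE_SSE2 1
#else
#define HAVE_SSE2 0
#endif

/* Plain loop: used for short inputs and for the tail left over by the SIMD loop. */
static ptrdiff_t index_dword_scalar(const uint32_t *buf, size_t from,
                                    size_t len, uint32_t value)
{
    for (size_t i = from; i < len; i++)
        if (buf[i] == value)
            return (ptrdiff_t)i;
    return -1;
}

/* Returns the index of the first element equal to value, or -1 if absent. */
ptrdiff_t index_dword_sse2(const uint32_t *buf, size_t len, uint32_t value)
{
    size_t i = 0;
#if HAVE_SSE2
    if (len >= 16) {  /* assumed threshold: below it the SIMD setup is skipped */
        const __m128i needle = _mm_set1_epi32((int)value);
        for (; i + 4 <= len; i += 4) {
            /* Unaligned load of four dwords, compared lane by lane. */
            __m128i chunk = _mm_loadu_si128((const __m128i *)(buf + i));
            int mask = _mm_movemask_epi8(_mm_cmpeq_epi32(chunk, needle));
            if (mask != 0) {
                /* Each matching dword sets 4 consecutive mask bits. */
                size_t lane = 0;
                while (((mask >> (4 * lane)) & 1) == 0)
                    lane++;
                return (ptrdiff_t)(i + lane);
            }
        }
    }
#endif
    return index_dword_scalar(buf, i, len, value);
}
```

On a build without SSE2 the function falls back to the plain loop, which mirrors the point made earlier in the thread that SSE2 cannot be assumed on 32-bit x86.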

marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 11445
  • FPC developer.
Re: IndexDWord -> Repne scasd
« Reply #7 on: January 21, 2023, 02:15:42 pm »
As said, SSE2 can't be assumed for x86.

It can be for x86_64. Maybe SSE3 can also be assumed (only the original AMD Hammer series lacked it, but those CPUs also lacked certain scheduler instructions, so modern OSes don't support them either).

Quote
vbroadcast is AVX2, and it did not give any benefit time-wise.

I know, but I mentioned it because the AVX2 code didn't use it either (and it was AVX2, not AVX1, since it crashed). It was just an FYI.

Possible code:

     vmovd xmm0,b
     vpbroadcastd ymm0,xmm0

Quote
It would be great if the RTL had CPU-dispatched code paths depending on the supported instruction set. But that goes against the FPC philosophy: one source for every use case.

If the SIMD capabilities are stored in a few booleans on startup, the test costs only an extra compare and jump (to the unaligned trailing code portion). Currently there is no AVX1/2 support, though. (AVX1 is enough for e.g. Move.)
« Last Edit: January 22, 2023, 06:04:42 pm by marcov »
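The "few booleans on startup" idea can be sketched in C roughly as follows. This is only an illustration, not FPC RTL code: the flag and function names are hypothetical, the 16-element threshold is an assumption, and detection relies on the GCC/Clang builtins `__builtin_cpu_init` and `__builtin_cpu_supports`, which are only available on x86 targets.

```c
#include <stdbool.h>
#include <stddef.h>

/* Capability flags, filled in once at startup (hypothetical names). */
static bool cpu_has_sse2 = false;
static bool cpu_has_avx2 = false;

/* Query the CPU once; on other compilers or targets the flags stay false. */
void detect_simd(void)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    cpu_has_sse2 = __builtin_cpu_supports("sse2") != 0;
    cpu_has_avx2 = __builtin_cpu_supports("avx2") != 0;
#endif
}

typedef enum { PATH_GENERIC, PATH_SSE2, PATH_AVX2 } index_path;

/* Per-call dispatch cost: one compare-and-branch per flag tested. */
index_path choose_index_path(size_t len)
{
    if (len < 16)            /* assumed threshold: short inputs skip SIMD */
        return PATH_GENERIC;
    if (cpu_has_avx2)
        return PATH_AVX2;
    if (cpu_has_sse2)
        return PATH_SSE2;
    return PATH_GENERIC;
}
```

A real routine would jump to the matching implementation instead of returning an enum; the enum just keeps the sketch small while showing that the run-time cost is a couple of predictable branches.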

runewalsh

  • Jr. Member
  • **
  • Posts: 82
Re: IndexDWord -> Repne scasd
« Reply #8 on: April 18, 2023, 03:06:05 pm »
I’m necroing this a bit, and things have changed further since then (there are now SSE2 versions for i386; they are terrible, being GCC output rather than hand-written code, but better than nothing). I just wanted to say that it was not the optimizations that improved but the generic version itself; I made the proposal to remove the REP SCAS-based Index* only after that. What the author saw is that, without pulling and recompiling a fresh RTL, “System.IndexDWord” was the old generic version called from System, while “Generic IndexDWord” was the new one built into the benchmark.

REPs might have acceptable or even good (ERMSB) performance at large scale, but they also have a large startup cost. I noticed this cost a long time ago but didn’t attribute it to them at first. I think it has something to do with their CISCiness.
