IndexDWord -> Repne scasd

marcov:
32-bit:

--- Code: ---
Matrix4_x_Matrix4:         27 ns/call

Repne Scasd Index (#15):   26 ns/call
System.IndexDWord(#15):    25 ns/call
Generic IndexDWord(#15):   11 ns/call
Xmm IndexDWord(#15):       12 ns/call
Ymm IndexDWord(#15):       16 ns/call

Repne Scasd Index (#1007): 550 ns/call
System.IndexDWord(#1007):  547 ns/call
Generic IndexDWord(#1007): 537 ns/call
Xmm IndexDWord (#1007):    151 ns/call
An unhandled exception occurred at $0040176C:
EAccessViolation: Access violation
  $0040176C
  $00401CE5
  $0040215C

--- End code ---

... which crashes on the Ymm run, since Ivy Bridge has no AVX2, only floating-point AVX1 (the short array doesn't actually run the ymm code, which is why the #15 case still completes).

Laptop AMD 4800H, 32-bit:

--- Code: ---
Matrix4_x_Matrix4:         26 ns/call

Repne Scasd Index (#15):   16 ns/call
System.IndexDWord(#15):    16 ns/call
Generic IndexDWord(#15):   6.5 ns/call
Xmm IndexDWord(#15):       7.1 ns/call
Ymm IndexDWord(#15):       7.4 ns/call

Repne Scasd Index (#1007): 493 ns/call
System.IndexDWord(#1007):  493 ns/call
Generic IndexDWord(#1007): 249 ns/call
Xmm IndexDWord (#1007):    130 ns/call
Ymm IndexDWord (#1007):    77 ns/call

--- End code ---

That said, we could maybe add some special cases for when the RTL is compiled with YMM support as the minimum. You might want to look at the vbroadcast opcode.

mika:

--- Quote from: marcov on January 21, 2023, 12:12:16 am ---
... since Ivy Bridge has no AVX2, only floating point AVX1. (and the short array doesn't actually run the ymm code)

That said, we can maybe make some cases for when the RTL is compiled with YMM support as minimum.  You might want to look at opcode vbroadcast

--- End quote ---

The version with xmm registers uses only the SSE2 instruction set. For small Length I did not use any SIMD because the overhead does not pay off.
vbroadcast is AVX2 and it did not give any time benefit.
It would be great if the RTL had CPU-dispatched code paths depending on the supported instruction set. But that goes against the FPC philosophy: one source for every use case.
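
To illustrate the approach described above (SSE2 only, plain loop for short arrays), here is a minimal sketch in C with SSE2 intrinsics rather than FPC assembler. It is not the benchmark code from this thread, which is not shown here. The function name `index_dword_sse2` and the cutoff `SIMD_MIN_LEN` are hypothetical; the actual threshold used in the benchmark is not stated. Like IndexDWord, it returns -1 when the value is absent.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Use SSE2 only when the compiler targets it (always true on x86_64). */
#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>
#define HAVE_SSE2 1
#else
#define HAVE_SSE2 0
#endif

/* Hypothetical cutoff: below this length the plain loop is used,
   because the SIMD setup overhead does not pay off. */
enum { SIMD_MIN_LEN = 16 };

/* Returns the index of the first element equal to value, or -1. */
ptrdiff_t index_dword_sse2(const uint32_t *buf, ptrdiff_t len, uint32_t value)
{
    ptrdiff_t i = 0;
    assert(buf != NULL || len <= 0);
#if HAVE_SSE2
    if (len >= SIMD_MIN_LEN) {
        /* Copy the searched value into all four 32-bit lanes. */
        const __m128i needle = _mm_set1_epi32((int)value);
        for (; i + 4 <= len; i += 4) {
            /* Unaligned load, so buf needs no special alignment. */
            __m128i chunk = _mm_loadu_si128((const __m128i *)(buf + i));
            /* Any nonzero mask bit means one of the four lanes matched;
               the scalar loop below then pinpoints the exact lane. */
            if (_mm_movemask_epi8(_mm_cmpeq_epi32(chunk, needle)) != 0)
                break;
        }
    }
#endif
    /* Scalar path: short arrays, the matching block, and the tail. */
    for (; i < len; i++)
        if (buf[i] == value)
            return i;
    return -1;
}
```

On a 32-bit x86 build without SSE2 enabled at compile time, the sketch silently falls back to the plain loop, which mirrors the point that SSE2 cannot be assumed there.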

marcov:
As said, SSE2 can't be assumed for x86.

It can be for x86_64. Maybe SSE3 can also be assumed (only the original AMD Hammer series didn't have it, but those CPUs also lack certain scheduler instructions, which makes them unsupported by modern OSes anyway).


--- Quote from: mika on January 21, 2023, 02:38:11 am ---
vbroadcast is AVX2 and it did not give any time benefit.

--- End quote ---

I know, but I mentioned it because the AVX2 code didn't have it either (and it was AVX2, not AVX1, since it crashed). It was just an FYI.

Possible code:

     vmovd xmm0,b
     vpbroadcastd ymm0,xmm0


--- Quote ---It would be great if RTL had CPU dispatched code path depending on supported instruction set. But that goes against fpc philosophy - one source for every use case.

--- End quote ---

If the SIMD capabilities are stored in a few booleans at startup, the test costs only one extra compare and jump (to the unaligned trailing code portion). Currently there is no AVX1/2 support though (AVX1 is enough for e.g. move).
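
As a rough illustration of the boolean-at-startup idea, here is a sketch in C, not RTL code. It assumes GCC or Clang on x86 for `__builtin_cpu_supports` and the `target("avx2")` function attribute; the names `has_avx2`, `init_cpu_flags`, `scan_avx2` and `index_dword_dispatch` are hypothetical. The `_mm256_set1_epi32` call is the intrinsic counterpart of the vmovd + vpbroadcastd pair above. On other compilers or CPUs the sketch degrades to the plain loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if (defined(__GNUC__) || defined(__clang__)) && (defined(__x86_64__) || defined(__i386__))
#include <immintrin.h>
#define X86_DISPATCH 1
#else
#define X86_DISPATCH 0
#endif

static int cpu_checked = 0;
static int has_avx2 = 0;  /* hypothetical flag, filled in once at startup */

static void init_cpu_flags(void)
{
#if X86_DISPATCH
    __builtin_cpu_init();
    has_avx2 = __builtin_cpu_supports("avx2") ? 1 : 0;
#endif
    cpu_checked = 1;
}

#if X86_DISPATCH
/* Compiled for AVX2 regardless of the global target options.
   Returns the index at which the scalar loop should resume. */
__attribute__((target("avx2")))
static ptrdiff_t scan_avx2(const uint32_t *buf, ptrdiff_t len, uint32_t value)
{
    const __m256i needle = _mm256_set1_epi32((int)value); /* broadcast */
    ptrdiff_t i = 0;
    for (; i + 8 <= len; i += 8) {
        __m256i chunk = _mm256_loadu_si256((const __m256i *)(buf + i));
        if (_mm256_movemask_epi8(_mm256_cmpeq_epi32(chunk, needle)) != 0)
            break;  /* a match is somewhere in these 8 dwords */
    }
    return i;
}
#else
/* Stub for non-x86 builds; never called because has_avx2 stays 0. */
static ptrdiff_t scan_avx2(const uint32_t *buf, ptrdiff_t len, uint32_t value)
{
    (void)buf; (void)len; (void)value;
    return 0;
}
#endif

ptrdiff_t index_dword_dispatch(const uint32_t *buf, ptrdiff_t len, uint32_t value)
{
    ptrdiff_t i = 0;
    assert(buf != NULL || len <= 0);
    /* Lazy init keeps the sketch self-contained (not thread-safe);
       a runtime library would set the flag in its startup code. */
    if (!cpu_checked)
        init_cpu_flags();
    /* The dispatch itself: one compare and one jump. */
    if (has_avx2 && len >= 8)
        i = scan_avx2(buf, len, value);
    /* Trailing scalar portion, also the fallback on CPUs without AVX2. */
    for (; i < len; i++)
        if (buf[i] == value)
            return i;
    return -1;
}
```

Checking the flag before entering the ymm path is what would have avoided the access violation on the Ivy Bridge machine: the AVX2 code is simply never reached there.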

runewalsh:
I’m necroing this a bit, and things have changed further since then (there are now SSE2 versions for i386; they are GCC output rather than hand-written code, which is terrible, but better than nothing). I just wanted to say that it wasn’t the optimizations that were improved but the generic version itself; I made the proposal to remove the REP SCAS-based Index* only after that. What the author measured, without pulling and recompiling a fresh RTL, was the old generic version called from System as “System.IndexDWord” and the new one, built into the benchmark, as “Generic IndexDWord”.

REPs might have acceptable or even good (ERMSB) performance at large sizes, but they also have a large startup cost. I noticed this a long time ago but didn’t attribute it to them at first. I think it has something to do with their CISCiness.
