IndexDWord -> Repne scasd

mika

Full Member
Posts: 102

IndexDWord -> Repne scasd

« on: January 18, 2023, 11:02:27 pm »

The solution proposed in issue https://gitlab.com/freepascal.org/fpc/source/-/issues/40119
will not change anything, because Repne scasd is not at fault.

The 64-bit RTL uses the generic IndexDWord, and the results are essentially the same as for 32-bit.
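For readers unfamiliar with the routine, here is a minimal sketch of what a "generic" (plain Pascal, no string instructions) dword search looks like. It is not the actual RTL source; the name `SketchIndexDWord` is made up for illustration, and only the signature is modeled on `System.IndexDWord`.

```pascal
program GenericIndexDWordSketch;
{$mode objfpc}

// Hypothetical stand-in for the RTL's generic routine: a plain linear scan.
// Returns the zero-based index of the first dword equal to B, or -1.
function SketchIndexDWord(const Buf; Len: SizeInt; B: DWord): SizeInt;
var
  P: PDWord;
  I: SizeInt;
begin
  P := @Buf;
  for I := 0 to Len - 1 do
    if P[I] = B then
      Exit(I);
  Result := -1;
end;

var
  Data: array[0..14] of DWord;
  I: Integer;
begin
  for I := Low(Data) to High(Data) do
    Data[I] := DWord(I) * 10;
  WriteLn(SketchIndexDWord(Data, Length(Data), 70));   // 7
  WriteLn(SketchIndexDWord(Data, Length(Data), 71));   // -1
  WriteLn(System.IndexDWord(Data, Length(Data), 70));  // 7, same answer from the RTL
end.
```

The benchmark figures compare routines of this shape against the `repne scasd` implementation; which of the two is behind `System.IndexDWord` depends on the target and the RTL version.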

Output for 32-bit, with Repne scasd:

Quote

Matrix4_x_Matrix4: 28 ns/call

System.IndexDWord(#15): 16 ns/call
Generic IndexDWord(#15): 7.1 ns/call

System.IndexDWord(#1007): 497 ns/call
Generic IndexDWord(#1007): 253 ns/call

Output for 64-bit, where System.IndexDWord == Generic IndexDWord:

Quote

Matrix4_x_Matrix4: 16 ns/call

System.IndexDWord(#15): 11 ns/call
Generic IndexDWord(#15): 6.7 ns/call

System.IndexDWord(#1007): 486 ns/call
Generic IndexDWord(#1007): 250 ns/call


mika

Full Member
Posts: 102

Re: IndexDWord -> Repne scasd

« Reply #1 on: January 19, 2023, 04:25:13 pm »

I investigated this further and found that the IndexDWord version with Repne scasd is bad after all.
The generic IndexDWord is better mostly because of the optimizations the compiler applies during RTL compilation.

Previously I was using an older version.
After updating I found that the optimizations have improved greatly.
Output for 64-bit, where System.IndexDWord == Generic IndexDWord:

Quote

Matrix4_x_Matrix4: 16 ns/call

System.IndexDWord(#15): 6.6 ns/call
Generic IndexDWord(#15): 7.0 ns/call

System.IndexDWord(#1007): 261 ns/call
Generic IndexDWord(#1007): 257 ns/call


Thaddy

Hero Member
Posts: 14359
Sensorship about opinions does not belong here.

Re: IndexDWord -> Repne scasd

« Reply #2 on: January 19, 2023, 08:07:19 pm »

That highly depends on the chip and the chip maker.
E.g. repne scasd and family used to be slow on Intel, but fast on AMD and VIA.
For FPC it also depends on your optimization settings for the CPU family, -Cp<cpu> and -Op<cpu>, and on cache control, for that matter.
In general the compiler chooses a best option. I would not waste time on assembler optimizations unless you have to, and unless you know exactly the processor make and model you are optimizing for.
It often makes me laugh...

"Hey, I can write optimized code in assembler!" (NOT!), which translates to: "Hey, I am wasting my time"... in most but not all cases.

« Last Edit: January 19, 2023, 08:18:23 pm by Thaddy »



marcov

Administrator
Hero Member
Posts: 11445
FPC developer.

Re: IndexDWord -> Repne scasd

« Reply #3 on: January 19, 2023, 09:01:38 pm »

Ivy bridge -O4 32-bit

Quote

Matrix4_x_Matrix4: 27 ns/call

System.IndexDWord(#15): 26 ns/call
Generic IndexDWord(#15): 12 ns/call

System.IndexDWord(#1007): 569 ns/call
Generic IndexDWord(#1007): 549 ns/call

64-bit

Quote

Matrix4_x_Matrix4: 22 ns/call

System.IndexDWord(#15): 28 ns/call
Generic IndexDWord(#15): 11 ns/call

System.IndexDWord(#1007): 572 ns/call
Generic IndexDWord(#1007): 546 ns/call

This only seems to show slightly higher readings for short searches (which might be due to checks; still to be investigated).

That is the newest Intel that I own. I'll try some AMD (5700x/4800) tomorrow.

Note that the gcc note in the benchmark recommends using SIMD, which might be OK for 64-bit (since that has SSE2/3 as a minimum), but would raise the minimum requirements for 32-bit.

That said, it probably does require some updating, but first, let's get the facts on the table, starting with what processor your tests ran on.


mika

Full Member
Posts: 102

Re: IndexDWord -> Repne scasd

« Reply #4 on: January 20, 2023, 05:39:27 pm »

You have strange benchmark results.

I wrote a SIMD version of IndexDWord, just to test.
AMD 2700X
32-bit, FPC 3.2.2

Code: Text [Select][+]

Matrix4_x_Matrix4:         28 ns/call
 
Repne Scasd Index (#15):   16 ns/call
System.IndexDWord(#15):    16 ns/call
Generic IndexDWord(#15):   7.7 ns/call
Xmm IndexDWord(#15):       8.5 ns/call
Ymm IndexDWord(#15):       8.8 ns/call
 
Repne Scasd Index (#1007): 518 ns/call
System.IndexDWord(#1007):  509 ns/call
Generic IndexDWord(#1007): 266 ns/call
Xmm IndexDWord (#1007):    146 ns/call
Ymm IndexDWord (#1007):    91 ns/call

64-bit, FPC 3.2.0

Code: Text [Select][+]

Matrix4_x_Matrix4:         28 ns/call
 
Repne Scasd Index (#15):   16 ns/call
System.IndexDWord(#15):    16 ns/call
Generic IndexDWord(#15):   7.5 ns/call
Xmm IndexDWord(#15):       8.3 ns/call
Ymm IndexDWord(#15):       8.6 ns/call
 
Repne Scasd Index (#1007): 504 ns/call
System.IndexDWord(#1007):  506 ns/call
Generic IndexDWord(#1007): 261 ns/call
Xmm IndexDWord (#1007):    149 ns/call
Ymm IndexDWord (#1007):    89 ns/call

SIMD versions are faster. Assembler isn't dead.

Attachment: indexdword7.pas (12.5 kB)

« Last Edit: January 20, 2023, 05:43:04 pm by mika »


marcov

Administrator
Hero Member
Posts: 11445
FPC developer.

Re: IndexDWord -> Repne scasd

« Reply #5 on: January 21, 2023, 12:12:16 am »

32-bit:

Code: [Select]

Matrix4_x_Matrix4:         27 ns/call

Repne Scasd Index (#15):   26 ns/call
System.IndexDWord(#15):    25 ns/call
Generic IndexDWord(#15):   11 ns/call
Xmm IndexDWord(#15):       12 ns/call
Ymm IndexDWord(#15):       16 ns/call

Repne Scasd Index (#1007): 550 ns/call
System.IndexDWord(#1007):  547 ns/call
Generic IndexDWord(#1007): 537 ns/call
Xmm IndexDWord (#1007):    151 ns/call
An unhandled exception occurred at $0040176C:
EAccessViolation: Access violation
  $0040176C
  $00401CE5
  $0040215C

... since Ivy Bridge has no AVX2, only floating point AVX1. (and the short array doesn't actually run the ymm code)

laptop AMD 4800H 32-bit

Code: [Select]

Matrix4_x_Matrix4:         26 ns/call

Repne Scasd Index (#15):   16 ns/call
System.IndexDWord(#15):    16 ns/call
Generic IndexDWord(#15):   6.5 ns/call
Xmm IndexDWord(#15):       7.1 ns/call
Ymm IndexDWord(#15):       7.4 ns/call

Repne Scasd Index (#1007): 493 ns/call
System.IndexDWord(#1007):  493 ns/call
Generic IndexDWord(#1007): 249 ns/call
Xmm IndexDWord (#1007):    130 ns/call
Ymm IndexDWord (#1007):    77 ns/call

That said, we could maybe add special cases for when the RTL is compiled with YMM support as a minimum. You might want to look at the vbroadcast opcode.

« Last Edit: January 24, 2023, 09:22:27 am by marcov »


mika

Full Member
Posts: 102

Re: IndexDWord -> Repne scasd

« Reply #6 on: January 21, 2023, 02:38:11 am »

Quote from: marcov on January 21, 2023, 12:12:16 am

... since Ivy Bridge has no AVX2, only floating point AVX1. (and the short array doesn't actually run the ymm code)

That said, we could maybe add special cases for when the RTL is compiled with YMM support as a minimum. You might want to look at the vbroadcast opcode.

The version with xmm registers uses only the SSE2 instruction set. For small Length I did not use any SIMD, because the overhead does not pay off.
vbroadcast is AVX2 and it did not give any benefit time-wise.
It would be great if the RTL had CPU-dispatched code paths depending on the supported instruction set. But that goes against the FPC philosophy: one source for every use case.

« Last Edit: January 21, 2023, 02:51:18 am by mika »


marcov

Administrator
Hero Member
Posts: 11445
FPC developer.

Re: IndexDWord -> Repne scasd

« Reply #7 on: January 21, 2023, 02:15:42 pm »

As said, SSE2 can't be assumed for x86.

It can be for x86_64. Maybe SSE3 can be assumed as well (only AMD's original Hammer series lacked it, but those CPUs also lacked certain scheduler instructions, so modern OSes do not support them either).

Quote from: mika on January 21, 2023, 02:38:11 am

vbroadcast is AVX2 and it did not give any benefit time-wise.

I know, but I mentioned it because the AVX2 code didn't have it either (and it was AVX2, not AVX1, since it crashed). It was just an FYI.

Possible code:

vmovd xmm0,b
vpbroadcastd ymm0,xmm0

Quote

It would be great if the RTL had CPU-dispatched code paths depending on the supported instruction set. But that goes against the FPC philosophy: one source for every use case.

If the SIMD options are stored in a few booleans on startup, testing them costs only an extra compare and a jump (to the unaligned trailing code portion). Currently there is no AVX1/2 support, though. (AVX1 is enough for e.g. move.)
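A rough sketch of that idea in plain Pascal, to show how small the dispatch cost is. Everything here is hypothetical: `HasAvx2`, `DetectAvx2`, `SimdMinLen` and the routine names are invented for illustration, detection is a placeholder that always reports False, and the "AVX2" routine is a stub that reuses the scalar loop so the program runs on any CPU.

```pascal
program DispatchSketch;
{$mode objfpc}

const
  SimdMinLen = 16; // assumed threshold: short inputs stay on the scalar path

var
  HasAvx2: Boolean = False; // hypothetical flag, filled in once at startup

// Placeholder: real code would query CPUID and OS support for YMM state.
function DetectAvx2: Boolean;
begin
  Result := False;
end;

function IndexDWordScalar(const Buf; Len: SizeInt; B: DWord): SizeInt;
var
  P: PDWord;
  I: SizeInt;
begin
  P := @Buf;
  for I := 0 to Len - 1 do
    if P[I] = B then
      Exit(I);
  Result := -1;
end;

// Stand-in for a hand-written AVX2 routine; it just calls the scalar loop.
function IndexDWordAvx2Stub(const Buf; Len: SizeInt; B: DWord): SizeInt;
begin
  Result := IndexDWordScalar(Buf, Len, B);
end;

// The dispatch itself: one boolean test plus a length test per call.
function IndexDWordDispatch(const Buf; Len: SizeInt; B: DWord): SizeInt;
begin
  if HasAvx2 and (Len >= SimdMinLen) then
    Result := IndexDWordAvx2Stub(Buf, Len, B)
  else
    Result := IndexDWordScalar(Buf, Len, B);
end;

var
  Data: array[0..31] of DWord;
  I: Integer;
begin
  HasAvx2 := DetectAvx2;
  for I := Low(Data) to High(Data) do
    Data[I] := DWord(I);
  WriteLn(IndexDWordDispatch(Data, Length(Data), 20)); // 20
  WriteLn(IndexDWordDispatch(Data, Length(Data), 99)); // -1
end.
```

The length threshold reflects the earlier observation in this thread that SIMD does not pay off for short inputs; the value 16 is a guess, not a measured cutoff.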

« Last Edit: January 22, 2023, 06:04:42 pm by marcov »


runewalsh

Jr. Member
Posts: 82

Re: IndexDWord -> Repne scasd

« Reply #8 on: April 18, 2023, 03:06:05 pm »

I’m necroing this a bit, and things have changed further since then (there are now SSE2 versions for i386; terrible, being GCC output rather than hand-coded, but better than nothing). I just wanted to say that it’s not the optimizations that were improved, but the generic version itself; I made the proposal about removing the REP SCAS-based Index* only after that. What the author saw is that, without pulling and recompiling a fresh RTL, “System.IndexDWord” was the old generic version called from System, and “Generic IndexDWord” was the new one built into the benchmark.
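To make that distinction concrete, here is a minimal timing sketch that measures the routine linked in from the RTL next to a copy compiled into the program. It is not the benchmark from the issue: `LocalIndexDWord`, the repeat count, and the choice to search for the last element are assumptions, and `GetTickCount64` has only millisecond resolution, so the figures are coarse.

```pascal
program TimingSketch;
{$mode objfpc}
uses
  SysUtils;

const
  N = 1007;         // assumed to correspond to the "#1007" case
  Calls = 2000000;  // arbitrary repeat count

// Local copy of a plain linear scan, compiled with the program's own settings.
function LocalIndexDWord(const Buf; Len: SizeInt; B: DWord): SizeInt;
var
  P: PDWord;
  I: SizeInt;
begin
  P := @Buf;
  for I := 0 to Len - 1 do
    if P[I] = B then
      Exit(I);
  Result := -1;
end;

var
  Data: array[0..N - 1] of DWord;
  I: Integer;
  Sink: Int64 = 0;  // accumulates results so the calls are not dead code
  T0: QWord;
begin
  for I := 0 to N - 1 do
    Data[I] := DWord(I);

  // Whatever the installed RTL provides for this target.
  T0 := GetTickCount64;
  for I := 1 to Calls do
    Inc(Sink, System.IndexDWord(Data, N, DWord(N - 1)));
  WriteLn('System.IndexDWord: ', (GetTickCount64 - T0) * 1e6 / Calls:0:1, ' ns/call');

  // The copy built into this program.
  T0 := GetTickCount64;
  for I := 1 to Calls do
    Inc(Sink, LocalIndexDWord(Data, N, DWord(N - 1)));
  WriteLn('Local IndexDWord:  ', (GetTickCount64 - T0) * 1e6 / Calls:0:1, ' ns/call');

  WriteLn('checksum: ', Sink);
end.
```

The first figure changes only when the RTL itself is rebuilt or replaced, while the second follows whatever source is pasted into the program.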

REPs might have acceptable or even good (ERMSB) performance at large scale, but they also have a large startup cost. I noticed this a long time ago but didn’t attribute it to them at first. I think it has something to do with their CISCiness.
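One rough way to see a fixed cost in the numbers posted earlier is to fit a straight line, time = fixed + per-element × count, through the two measured sizes. The sketch below does that for the 32-bit AMD 2700X figures from reply #4, assuming that #15 and #1007 are the number of elements scanned. Two points are far too few for a real measurement, so treat the output as an illustration only.

```pascal
program ScanCostModel;
{$mode objfpc}

// Fits time = Fixed + PerElem * count through two (count, ns/call) points.
procedure Fit(const Name: string; N1, T1, N2, T2: Double);
var
  PerElem, Fixed: Double;
begin
  PerElem := (T2 - T1) / (N2 - N1);
  Fixed := T1 - PerElem * N1;
  WriteLn(Name, ': ', PerElem:0:3, ' ns/element, ', Fixed:0:1, ' ns fixed');
end;

begin
  // Figures copied from the 32-bit AMD 2700X run in reply #4.
  Fit('Repne Scasd Index', 15, 16, 1007, 518);
  Fit('Generic IndexDWord', 15, 7.7, 1007, 266);
end.
```

Under those assumptions the REPNE SCASD version comes out at roughly 0.5 ns per element plus about 8 ns fixed, against roughly 0.26 ns per element plus about 4 ns fixed for the generic version, so on that machine both the fixed part and the per-element part were about twice as large.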
