IndexDWord -> Repne scasd

mika:
In issue https://gitlab.com/freepascal.org/fpc/source/-/issues/40119
the proposed solution will not change anything, because Repne scasd is not at fault.

64-bit uses the generic IndexDWord, and the results are exactly the same as for 32-bit.

Output for 32-bit with Repne scasd:

--- Quote ---Matrix4_x_Matrix4: 28 ns/call

System.IndexDWord(#15): 16 ns/call
Generic IndexDWord(#15): 7.1 ns/call

System.IndexDWord(#1007): 497 ns/call
Generic IndexDWord(#1007): 253 ns/call
--- End quote ---

Output for 64-bit, where System.IndexDWord == Generic IndexDWord:

--- Quote ---Matrix4_x_Matrix4: 16 ns/call

System.IndexDWord(#15): 11 ns/call
Generic IndexDWord(#15): 6.7 ns/call

System.IndexDWord(#1007): 486 ns/call
Generic IndexDWord(#1007): 250 ns/call

--- End quote ---
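For readers who do not have the RTL source at hand, here is a minimal C sketch of what a generic, plain-loop search looks like. It is not the FPC implementation; the name `index_dword_generic` is made up for illustration, and the only thing taken from IndexDWord is its contract (index of the first matching element, or -1 if none). REPNE SCASD does the same job in a single instruction: it compares EAX against successive dwords at [EDI] until a match is found or ECX runs out.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not the RTL routine: returns the index of the
   first element equal to `value`, or -1 when there is no match
   (including when len <= 0). */
ptrdiff_t index_dword_generic(const uint32_t *buf, ptrdiff_t len, uint32_t value)
{
    for (ptrdiff_t i = 0; i < len; i++) {
        if (buf[i] == value)
            return i;   /* first match wins */
    }
    return -1;          /* not found */
}
```

A loop like this leaves the compiler free to unroll and schedule it, which is one plausible reason the generic numbers above depend on how the RTL was compiled.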

mika:
I investigated this further and found that the IndexDWord version using Repne scasd does perform badly.
The generic IndexDWord is faster, mostly because of the optimizations the compiler applies when compiling the RTL.

Previously I was using an older version.
After updating, I found that the optimizations have improved greatly.
Output for 64-bit, where System.IndexDWord == Generic IndexDWord:

--- Quote ---Matrix4_x_Matrix4: 16 ns/call

System.IndexDWord(#15): 6.6 ns/call
Generic IndexDWord(#15): 7.0 ns/call

System.IndexDWord(#1007): 261 ns/call
Generic IndexDWord(#1007): 257 ns/call

--- End quote ---
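The benchmark source is not shown in this thread, so the following is only a sketch of how an "ns/call" figure of this kind could be obtained. All names (`scan_dwords`, `ns_per_call`) are hypothetical, and it assumes that labels such as #15 and #1007 refer to how far into the buffer the search has to go.

```c
#include <stddef.h>
#include <stdint.h>
#include <time.h>

typedef ptrdiff_t (*index_fn)(const uint32_t *, ptrdiff_t, uint32_t);

/* Hypothetical routine under test: plain linear search, -1 if not found. */
ptrdiff_t scan_dwords(const uint32_t *buf, ptrdiff_t len, uint32_t value)
{
    for (ptrdiff_t i = 0; i < len; i++)
        if (buf[i] == value)
            return i;
    return -1;
}

/* Average CPU time per call in nanoseconds, measured with clock(). */
double ns_per_call(index_fn fn, const uint32_t *buf, ptrdiff_t len,
                   uint32_t value, long iterations)
{
    /* The volatile function pointer is reloaded on every iteration, so the
       compiler cannot inline the call and hoist it out of the loop. */
    index_fn volatile call = fn;
    volatile ptrdiff_t sink = 0;

    clock_t start = clock();
    for (long i = 0; i < iterations; i++)
        sink = call(buf, len, value);
    clock_t stop = clock();

    (void)sink;
    return (double)(stop - start) * 1e9 / CLOCKS_PER_SEC / (double)iterations;
}
```

With a setup like this, results only compare fairly when both routines are built with the same compiler version and optimization level, which is exactly what changed between the two runs above.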

Thaddy:
That depends heavily on the chip and the chip maker.
E.g. repne scasd and family used to be slow on Intel, but fast on AMD and VIA.
For FPC it also depends on your optimization settings for the CPU family, -Cp<cpu> and -Op<cpu>, and on cache control, for that matter.
In general the compiler chooses the best option. I would not waste time on assembler optimizations unless you have to and you know exactly which processor make and model you are optimizing for...
It often makes me laugh... :D "Hey, I can write optimized code in assembler!" (NOT!), which translates to: "Hey, I am wasting my time", in most but not all cases.

marcov:
Ivy bridge -O4 32-bit

--- Quote ---Matrix4_x_Matrix4: 27 ns/call

System.IndexDWord(#15): 26 ns/call
Generic IndexDWord(#15): 12 ns/call

System.IndexDWord(#1007): 569 ns/call
Generic IndexDWord(#1007): 549 ns/call

--- End quote ---

64-bit

--- Quote ---Matrix4_x_Matrix4: 22 ns/call

System.IndexDWord(#15): 28 ns/call
Generic IndexDWord(#15): 11 ns/call

System.IndexDWord(#1007): 572 ns/call
Generic IndexDWord(#1007): 546 ns/call

--- End quote ---

Which shows only slightly higher readings for short searches (which might be due to checks, still to be investigated).

That is the newest Intel that I own. I'll try some AMD (5700x/4800) tomorrow.

Note that the gcc note in the benchmark recommends using SIMD, which might be OK for 64-bit (since SSE2/3 is the minimum there), but would raise the minimum requirements for 32-bit.

That said, it probably does need some updating, but first, let's get the facts on the table, starting with which processor your tests ran on.
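To illustrate the 32-bit versus 64-bit point: SSE2 is part of the x86-64 baseline, while a 32-bit x86 build would have to check for it at run time before using a SIMD routine. A rough C sketch of such a check follows. The names `cpu_has_sse2` and `select_index_dword` are invented for this example, `__builtin_cpu_supports` is a GCC/Clang builtin, and nothing here reflects how the FPC RTL actually dispatches.

```c
#include <stddef.h>
#include <stdint.h>

typedef ptrdiff_t (*index_fn)(const uint32_t *, ptrdiff_t, uint32_t);

/* Returns 1 when SSE2 can be used, 0 otherwise. */
int cpu_has_sse2(void)
{
#if defined(__x86_64__) || defined(_M_X64)
    return 1;                                   /* guaranteed on x86-64 */
#elif defined(__i386__) && defined(__GNUC__)
    return __builtin_cpu_supports("sse2") != 0; /* ask the CPU at run time */
#else
    return 0;                                   /* unknown target: be conservative */
#endif
}

/* Picks the SIMD routine only when the CPU supports it. */
index_fn select_index_dword(index_fn generic_impl, index_fn sse2_impl)
{
    return cpu_has_sse2() ? sse2_impl : generic_impl;
}
```

The selection would normally happen once at startup, so the per-call cost is one indirect call.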

mika:

--- Quote from: marcov on January 19, 2023, 09:01:38 pm ---Ivy bridge -O4 32-bit

--- End quote ---

You have strange benchmark results.

I wrote a SIMD version of IndexDWord, just to test.
AMD 2700X
32-bit, FPC 3.2.2

--- Code: Text ---
Matrix4_x_Matrix4: 28 ns/call

Repne Scasd Index (#15): 16 ns/call
System.IndexDWord(#15): 16 ns/call
Generic IndexDWord(#15): 7.7 ns/call
Xmm IndexDWord(#15): 8.5 ns/call
Ymm IndexDWord(#15): 8.8 ns/call

Repne Scasd Index (#1007): 518 ns/call
System.IndexDWord(#1007): 509 ns/call
Generic IndexDWord(#1007): 266 ns/call
Xmm IndexDWord (#1007): 146 ns/call
Ymm IndexDWord (#1007): 91 ns/call
--- End code ---

64-bit, FPC 3.2.0

--- Code: Text ---
Matrix4_x_Matrix4: 28 ns/call

Repne Scasd Index (#15): 16 ns/call
System.IndexDWord(#15): 16 ns/call
Generic IndexDWord(#15): 7.5 ns/call
Xmm IndexDWord(#15): 8.3 ns/call
Ymm IndexDWord(#15): 8.6 ns/call

Repne Scasd Index (#1007): 504 ns/call
System.IndexDWord(#1007): 506 ns/call
Generic IndexDWord(#1007): 261 ns/call
Xmm IndexDWord (#1007): 149 ns/call
Ymm IndexDWord (#1007): 89 ns/call
--- End code ---

SIMD versions are faster. Assembler isn't dead.
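The SIMD routines themselves are not posted here, so as a rough idea of what an Xmm-style search can look like, here is a C sketch using SSE2 intrinsics. It is not mika's code. The name `index_dword_simd` is hypothetical, and the sketch assumes GCC or Clang (which define `__SSE2__`); on other targets it falls back to the scalar loop.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical sketch: index of the first element equal to `value`, or -1. */
ptrdiff_t index_dword_simd(const uint32_t *buf, ptrdiff_t len, uint32_t value)
{
    ptrdiff_t i = 0;
#if defined(__SSE2__)
    const __m128i needle = _mm_set1_epi32((int)value);
    /* Compare four dwords per iteration using unaligned loads. */
    for (; i + 4 <= len; i += 4) {
        __m128i chunk = _mm_loadu_si128((const __m128i *)(buf + i));
        int mask = _mm_movemask_epi8(_mm_cmpeq_epi32(chunk, needle));
        if (mask != 0) {
            /* Each matching lane sets 4 consecutive mask bits;
               report the lowest matching lane. */
            for (int lane = 0; lane < 4; lane++)
                if (mask & (0xF << (lane * 4)))
                    return i + lane;
        }
    }
#endif
    /* Scalar tail (or the whole buffer when SSE2 is unavailable). */
    for (; i < len; i++)
        if (buf[i] == value)
            return i;
    return -1;
}
```

A Ymm variant would follow the same pattern with eight dwords per iteration, at the price of requiring AVX2.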
