AVX512 Support

schuler:
Hi,
I'm wondering whether there is any flag (or other compilation setting) that enables AVX512 assembler instructions.

I miss coding things like:

--- Code: Pascal ---
vmovups zmm2, [rax]
vmovups zmm3, [rax+64]
--- End code ---

If I feel brave enough to add support for some AVX512 instructions to FPC myself, where should I start?

schuler:
I've just been told that this is actually in development :)
https://svn.freepascal.org/svn/fpc/branches/tg74/avx512/

marcov:
What is your intended application? I did some minor AVX2 work (*) last summer, but found that the lane concept was stifling for straightforward work.

(*) among others https://stackoverflow.com/questions/47478010/sse2-8x8-byte-matrix-transpose-code-twice-as-slow-on-haswell-then-on-ivy-bridge
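
For anyone who has not run into the lane issue: most AVX2 integer shuffles treat a 256-bit register as two independent 128-bit halves. The sketch below is a scalar C model of `vpunpcklbw` on ymm registers, written only to illustrate that behaviour. The function name `model_vpunpcklbw_ymm` is made up for this example.

```c
#include <stdint.h>

enum { YMM_BYTES = 32, LANE_BYTES = 16 };

/* Scalar model of "vpunpcklbw dst, a, b" on 256-bit registers.
   Each 128-bit lane is handled on its own: the low 8 bytes of that lane
   in a and b are interleaved, the high 8 bytes of the lane are not used. */
static void model_vpunpcklbw_ymm(uint8_t dst[YMM_BYTES],
                                 const uint8_t a[YMM_BYTES],
                                 const uint8_t b[YMM_BYTES])
{
    uint8_t tmp[YMM_BYTES];            /* lets dst alias a or b */

    for (int lane = 0; lane < YMM_BYTES / LANE_BYTES; ++lane) {
        int base = lane * LANE_BYTES;  /* first byte of this lane */
        for (int i = 0; i < LANE_BYTES / 2; ++i) {
            tmp[base + 2 * i]     = a[base + i];
            tmp[base + 2 * i + 1] = b[base + i];
        }
    }
    for (int i = 0; i < YMM_BYTES; ++i)
        dst[i] = tmp[i];
}
```

With a = 0..31, the upper half of the result starts with a[16], not a[8]. A true 32-byte-wide interleave would need an extra cross-lane permute, which is where the additional cycles tend to go.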

schuler:
I have mixed results with AVX / AVX2 and expect the same with AVX512. In my own application (neural networks), in some server systems I do perceive improvement with AVX2 over AVX. First experiments in a notebook were not promising.

All AVX-specific code in this library will get an AVX512 version:
https://sourceforge.net/p/cai/svncode/HEAD/tree/trunk/lazarus/libs/uvolume.pas

This is the typical code that I intend to improve (replacing ymm registers with zmm registers):

--- Code: Pascal ---
vmovups ymm2, [rax]
vmovups ymm3, [rax+32]
vmovups ymm4, [rax+64]
vmovups ymm5, [rax+96]
vaddps ymm2, ymm2, [rdx]
vaddps ymm3, ymm3, [rdx+32]
vaddps ymm4, ymm4, [rdx+64]
vaddps ymm5, ymm5, [rdx+96]
vmovups [rax], ymm2
vmovups [rax+32], ymm3
vmovups [rax+64], ymm4
vmovups [rax+96], ymm5
--- End code ---

The above code will only be fast if memory bandwidth is really good.
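
To make the bandwidth point concrete, here is a scalar C model of what the unrolled block computes. It is only a sketch: `add_floats` and `BLOCK_FLOATS` are names invented for this example and are not taken from uvolume.pas. Each 128-byte block reads 256 bytes and writes 128 bytes for just 32 additions, so the loop is likely to be limited by memory rather than by the adder.

```c
#include <stddef.h>

/* 4 ymm registers * 8 floats = 32 floats = 128 bytes per block.
   With zmm registers the same block would need 2 registers of 16 floats. */
enum { BLOCK_FLOATS = 32 };

/* dst[i] += src[i] for n floats (dst plays the role of [rax], src of [rdx]). */
static void add_floats(float *dst, const float *src, size_t n)
{
    size_t i = 0;

    /* Full blocks: this inner loop is what the assembly unrolls. */
    for (; i + BLOCK_FLOATS <= n; i += BLOCK_FLOATS)
        for (size_t j = 0; j < BLOCK_FLOATS; ++j)
            dst[i + j] += src[i + j];

    /* Tail: elements that do not fill a whole block. */
    for (; i < n; ++i)
        dst[i] += src[i];
}
```

A scalar version like this is also a convenient reference when checking the output of a vectorised variant.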

marcov:

--- Quote from: schuler on June 15, 2018, 09:59:00 am ---I have mixed results with AVX / AVX2 and expect the same with AVX512. In my own application (neural networks), in some server systems I do perceive improvement with AVX2 over AVX. First experiments in a notebook were not promising.

--- End quote ---

If you don't have to shuffle, AVX(2) is nice. But I have quite a few applications that need to shuffle.

I use either 8-bit monochrome or some colour format (RGB) that I have to rearrange so that I get a register of R's, a register of G's, etc.
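
In scalar terms the rearrangement looks roughly like the C sketch below. It assumes packed 3-byte pixels in R, G, B order, which is only one of the possible formats, and `split_rgb` is a made-up name for this example. The SIMD version has to achieve the same result with byte shuffles, which is where the lanes get in the way.

```c
#include <stddef.h>
#include <stdint.h>

/* Splits packed RGBRGB... pixels into three separate planes.
   Assumption: 3 bytes per pixel, in R, G, B order. */
static void split_rgb(const uint8_t *rgb, size_t pixels,
                      uint8_t *r, uint8_t *g, uint8_t *b)
{
    for (size_t i = 0; i < pixels; ++i) {
        r[i] = rgb[3 * i + 0];
        g[i] = rgb[3 * i + 1];
        b[i] = rgb[3 * i + 2];
    }
}
```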

I tried to scale up the SSE code in the earlier link (8x8 byte matrix rotation) to 16x16 using AVX2, but the number of cycles just went nuts. The code below is already 20 cycles, while 8x8 takes 10 cycles, so little gain is to be expected: every row still needs to be gathered from two registers and stored, and there was some other problem. (I can't quickly test, since this machine is not AVX2 capable.)

--- Code: ---
procedure rot16x16(src,dest:pbbyte;rowpitchsrc,rowpitchdest:integer;nrxstep,nrystep:integer ); [public,alias: 'rot16x16'];
// src rcx, dest rdx, rpsrc r8, rpdest r9
// vol: rax,r10,r11

// init src ptr at 0,0, outerloop: y+=step * rowpitch innerloop: x:=x+stepsize
// init dest ptr at width-stepsize, outerloop x:=x-stepsize innerloop: y:=y+step*rowpitch
begin
asm
{$ifdef iacamarker}
mov ebx, 111 // Start marker bytes
db $64, $67, $90 // Start marker bytes
{$endif}

mov r12,r8
shl r12,1
add r12,r8 // r12 = 3*rpsrc

// load 16x16 bytes into 8 32 byte registers while interleaving
vmovdqa xmm0, xmmword ptr [rcx]
vinsertf128 ymm0,ymm0,xmmword ptr [rcx+2*R8],1 // xmm0 a0..a15 b0..b15
vmovdqa xmm1, xmmword ptr [rcx+r8]
vinsertf128 ymm1,ymm1,xmmword ptr [rcx+R12],1 // xmm1 c0..c15 d0..d15
lea rcx,[rcx+4*r8]

vpunpcklbw ymm2,ymm0,ymm1 // xmm2 *2* a0c0..a7c7 *1* b0d0..b7d7
vpunpckhbw ymm3,ymm0,ymm1 // xmm3 *1* a8c8..a15c15 *2*b8d8..b15d15

vmovdqa xmm0, xmmword ptr [rcx]
vinsertf128 ymm0,ymm0,xmmword ptr [rcx+2*R8],1

VPERM2F128 ymm4,ymm2,ymm3,1+16*2 // xmm5 b0d0..b7d7 a8..c8..a15..b16
vpblendd ymm5,ymm2,ymm3,240 // xmm4 a0c0..a7c7 b8d8..b15d15

vmovdqa xmm1, xmmword ptr [rcx+r8]
vinsertf128 ymm1,ymm1,xmmword ptr [rcx+R12],1
lea rcx,[rcx+4*r8]

// we have a problem here. in the higher half, the arguments are swapped.

vpunpcklwd ymm6,ymm5,ymm4 // xmm6 a0b0cd0..a3b3c3d3 a8b8c8d8..a11b11c11d11
vpunpckhwd ymm7,ymm5,ymm4 // xmm7 a4b4cd4..a7b4c7d7 a12b12c12d12..a15b15c15d15

vpunpcklbw ymm2,ymm0,ymm1 // xmm2 *2* a0c0..a7c7 *1* b0d0..b7d7
vpunpckhbw ymm3,ymm0,ymm1 // xmm3 *1* a8c8..a15c15 *2*b8d8..b15d15
vmovdqa xmm0, xmmword ptr [rcx]
vinsertf128 ymm0,ymm0,xmmword ptr [rcx+2*R8],1

VPERM2F128 ymm4,ymm2,ymm3,1+16*2 // xmm5 b0d0..b7d7 a8..c8..a15..b16
vpblendd ymm5,ymm2,ymm3,240 // xmm4 a0c0..a7c7 b8d8..b15d15
vmovdqa xmm1, xmmword ptr [rcx+r8]
vinsertf128 ymm1,ymm1,xmmword ptr [rcx+R12],1
lea rcx,[rcx+4*r8]

vpunpcklwd ymm8,ymm5,ymm4 // xmm6 a0b0cd0..a3b3c3d3 a8b8c8d8..a11b11c11d11
vpunpckhwd ymm9,ymm5,ymm4 // xmm7 a4b4cd4..a7b4c7d7 a12b12c12d12..a15b15c15d15

vpunpcklbw ymm2,ymm0,ymm1 // xmm2 *2* a0c0..a7c7 *1* b0d0..b7d7
vpunpckhbw ymm3,ymm0,ymm1 // xmm3 *1* a8c8..a15c15 *2*b8d8..b15d15
vmovdqa xmm0, xmmword ptr [rcx]
vinsertf128 ymm0,ymm0,xmmword ptr [rcx+2*R8],1

VPERM2F128 ymm4,ymm2,ymm3,1+16*2 // xmm5 b0d0..b7d7 a8..c8..a15..b16
vpblendd ymm5,ymm2,ymm3,240 // xmm4 a0c0..a7c7 b8d8..b15d15
vmovdqa xmm1, xmmword ptr [rcx+r8]
vinsertf128 ymm1,ymm1,xmmword ptr [rcx+R12],1

vpunpcklwd ymm10,ymm5,ymm4 // xmm6 a0b0cd0..a3b3c3d3 a8b8c8d8..a11b11c11d11
vpunpckhwd ymm11,ymm5,ymm4 // xmm7 a4b4cd4..a7b4c7d7 a12b12c12d12..a15b15c15d15
vpunpcklbw ymm2,ymm0,ymm1 // xmm2 *2* a0c0..a7c7 *1* b0d0..b7d7
vpunpckhbw ymm3,ymm0,ymm1 // xmm3 *1* a8c8..a15c15 *2*b8d8..b15d15
VPERM2F128 ymm4,ymm2,ymm3,1+16*2 // xmm5 b0d0..b7d7 a8..c8..a15..b16
vpblendd ymm5,ymm2,ymm3,240 // xmm4 a0c0..a7c7 b8d8..b15d15
vpunpcklwd ymm12,ymm5,ymm4 // xmm6 a0b0cd0..a3b3c3d3 a8b8c8d8..a11b11c11d11
vpunpckhwd ymm13,ymm5,ymm4 // xmm7 a4b4cd4..a7b4c7d7 a12b12c12d12..a15b15c15d15

--- End code ---
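
For checking a vectorised routine like this, a plain scalar reference is handy. The C sketch below assumes that the per-block operation is a plain transpose (as in the linked 8x8 question) and that both row pitches are given in bytes. `transpose16x16_ref` and its parameter names are invented for this example and are not part of the routine above.

```c
#include <stddef.h>
#include <stdint.h>

enum { BLOCK = 16 };   /* block is BLOCK x BLOCK bytes */

/* Scalar reference: dst(x, y) = src(y, x) for one 16x16 byte block.
   src_pitch and dst_pitch are the distances in bytes between rows. */
static void transpose16x16_ref(const uint8_t *src, ptrdiff_t src_pitch,
                               uint8_t *dst, ptrdiff_t dst_pitch)
{
    for (int y = 0; y < BLOCK; ++y)
        for (int x = 0; x < BLOCK; ++x)
            dst[x * dst_pitch + y] = src[y * src_pitch + x];
}
```

Comparing the SIMD output against this for a block filled with distinct byte values should catch swapped halves quickly.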

I do have a colour distance routine in avx2 that is nice though, but a bit specific.
