Forum > FPC development

AVX512 Support


schuler:
Hi,
I'm wondering whether there is any flag (or compiler directive) that enables AVX512 assembler instructions.

I miss coding things like:


--- Code: Pascal ---
  vmovups zmm2, [rax]
  vmovups zmm3, [rax+64]
--- End code ---

And in case I feel brave enough to add support for some AVX512 instructions to FPC myself: where should I start?

schuler:
I've just been told that this is actually in development :)
https://svn.freepascal.org/svn/fpc/branches/tg74/avx512/

marcov:
What is your intended application? I did some minor AVX2 work (*) last summer, but found that the lane concept was stifling for straightforward work.


(*) among others https://stackoverflow.com/questions/47478010/sse2-8x8-byte-matrix-transpose-code-twice-as-slow-on-haswell-then-on-ivy-bridge

schuler:
I have mixed results with AVX / AVX2 and expect the same with AVX512. In my own application (neural networks), I do see an improvement with AVX2 over AVX on some server systems. First experiments on a notebook were not promising.

All AVX-specific code in this library will get an AVX512 version:
https://sourceforge.net/p/cai/svncode/HEAD/tree/trunk/lazarus/libs/uvolume.pas

This is the typical code that I intend to improve (by replacing ymm registers with zmm registers):


--- Code: Pascal ---
  vmovups ymm2, [rax]
  vmovups ymm3, [rax+32]
  vmovups ymm4, [rax+64]
  vmovups ymm5, [rax+96]

  vaddps  ymm2, ymm2, [rdx]
  vaddps  ymm3, ymm3, [rdx+32]
  vaddps  ymm4, ymm4, [rdx+64]
  vaddps  ymm5, ymm5, [rdx+96]

  vmovups [rax],    ymm2
  vmovups [rax+32], ymm3
  vmovups [rax+64], ymm4
  vmovups [rax+96], ymm5
--- End code ---

The code above will only be fast if memory bandwidth is really good.
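For readers who want to check the asm against something simpler, here is a minimal scalar sketch of what that block computes: 32 singles (128 bytes, i.e. four ymm registers or two zmm registers) at [rax] get the 32 singles at [rdx] added to them. The names TBlock and AddBlock are made up for this illustration and are not taken from uvolume.pas.

```pascal
program AddBlockSketch;
{$mode objfpc}

type
  // 32 singles = 128 bytes: four ymm registers, or two zmm registers.
  TBlock = array[0..31] of Single;

// Hypothetical helper (not from uvolume.pas): Dest[i] := Dest[i] + Src[i].
// Dest plays the role of [rax], Src the role of [rdx] in the asm above.
procedure AddBlock(var Dest: TBlock; const Src: TBlock);
var
  i: Integer;
begin
  for i := Low(Dest) to High(Dest) do
    Dest[i] := Dest[i] + Src[i];
end;

var
  A, B: TBlock;
  i: Integer;
begin
  for i := 0 to 31 do
  begin
    A[i] := i;
    B[i] := 2 * i;
  end;

  AddBlock(A, B);

  // Every element should now equal 3 * i (exact for these small integers).
  for i := 0 to 31 do
    if A[i] <> 3 * i then
    begin
      WriteLn('mismatch at index ', i);
      Halt(1);
    end;
  WriteLn('ok');
end.
```

Counting the traffic per block shows why bandwidth matters here: 256 bytes are read and 128 bytes are written for only 32 additions, so roughly 12 bytes move per floating-point add whenever the data does not already sit in cache.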

marcov:

--- Quote from: schuler on June 15, 2018, 09:59:00 am ---I have mixed results with AVX / AVX2 and expect the same with AVX512. In my own application (neural networks), in some server systems I do perceive improvement with AVX2 over AVX. First experiments in a notebook were not promising.

--- End quote ---

If you don't have to shuffle, AVX(2) is nice. But I have quite a few applications that need to shuffle.

I use either 8-bit monochrome or some color format (rgb) that I have to rearrange so I get a register of R's, a register of G's etc.
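As a rough scalar illustration of that rearrangement (not the SIMD code itself), the sketch below splits interleaved pixel bytes into one array per channel. The R,G,B byte order, the tiny pixel count and the name SplitRGB are assumptions made only for this example.

```pascal
program DeinterleaveSketch;
{$mode objfpc}

const
  PixelCount = 4;

type
  TPlane = array[0..PixelCount - 1] of Byte;

// Hypothetical helper: split interleaved R,G,B,R,G,B,... bytes into one
// plane per channel. A SIMD version would do this with shuffles, ending up
// with "a register of R's, a register of G's" and so on.
procedure SplitRGB(const Interleaved: array of Byte; out R, G, B: TPlane);
var
  i: Integer;
begin
  for i := 0 to PixelCount - 1 do
  begin
    R[i] := Interleaved[3 * i];
    G[i] := Interleaved[3 * i + 1];
    B[i] := Interleaved[3 * i + 2];
  end;
end;

const
  // Four pixels, assumed to be stored as R,G,B triplets.
  Pixels: array[0..3 * PixelCount - 1] of Byte =
    (10, 20, 30,  11, 21, 31,  12, 22, 32,  13, 23, 33);

var
  R, G, B: TPlane;
  i: Integer;
begin
  SplitRGB(Pixels, R, G, B);
  // Print one pixel per line as "R G B".
  for i := 0 to PixelCount - 1 do
    WriteLn(R[i], ' ', G[i], ' ', B[i]);
end.
```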

I tried to scale up the SSE code from the earlier link (8x8 byte matrix rotation) to 16x16 using AVX2, but the number of cycles just went nuts. The code below already takes 20 cycles, while 8x8 takes 10 cycles, so little gain is to be expected: every row still needs to be gathered from two registers and stored, and there was some other problem. (I can't quickly test, since this machine is not AVX2 capable.)


--- Code: ---procedure rot16x16(src,dest:pbbyte;rowpitchsrc,rowpitchdest:integer;nrxstep,nrystep:integer  ); [public,alias: 'rot16x16'];
// src rcx, dest rdx, rpsrc r8, rpdest r9
// volatile: rax, r10, r11

// init src ptr at 0,0; outer loop: y += step * rowpitch; inner loop: x := x + stepsize
// init dest ptr at width-stepsize; outer loop: x := x - stepsize; inner loop: y := y + step * rowpitch
begin
asm
 {$ifdef iacamarker}
          mov ebx, 111          // Start marker bytes
         db $64, $67, $90   // Start marker bytes
    {$endif}

  mov r12,r8
  shl r12,1
  add r12,r8   // r12 = 3*rpsrc

  // load 16x16 bytes into 8 32 byte registers while interleaving
   vmovdqa xmm0, xmmword ptr [rcx]
   vinsertf128  ymm0,ymm0,xmmword ptr [rcx+2*R8],1              // xmm0 a0..a15  b0..b15
   vmovdqa xmm1, xmmword ptr [rcx+r8]
   vinsertf128  ymm1,ymm1,xmmword ptr [rcx+R12],1             // xmm1 c0..c15  d0..d15
   lea rcx,[rcx+4*r8]

   vpunpcklbw ymm2,ymm0,ymm1                                  // xmm2 *2* a0c0..a7c7  *1* b0d0..b7d7
   vpunpckhbw ymm3,ymm0,ymm1                                  // xmm3 *1* a8c8..a15c15  *2*b8d8..b15d15

   vmovdqa xmm0, xmmword ptr [rcx]
   vinsertf128  ymm0,ymm0,xmmword ptr [rcx+2*R8],1


   VPERM2F128 ymm4,ymm2,ymm3,1+16*2                           // xmm5 b0d0..b7d7  a8..c8..a15..b16
   vpblendd   ymm5,ymm2,ymm3,240                              // xmm4 a0c0..a7c7  b8d8..b15d15

   vmovdqa xmm1, xmmword ptr [rcx+r8]
   vinsertf128  ymm1,ymm1,xmmword ptr [rcx+R12],1
   lea rcx,[rcx+4*r8]

   // we have a problem here. in the higher half, the arguments are swapped.

   vpunpcklwd ymm6,ymm5,ymm4                                  // xmm6 a0b0cd0..a3b3c3d3 a8b8c8d8..a11b11c11d11
   vpunpckhwd ymm7,ymm5,ymm4                                  // xmm7 a4b4cd4..a7b4c7d7 a12b12c12d12..a15b15c15d15

   vpunpcklbw ymm2,ymm0,ymm1                                  // xmm2 *2* a0c0..a7c7  *1* b0d0..b7d7
   vpunpckhbw ymm3,ymm0,ymm1                                  // xmm3 *1* a8c8..a15c15  *2*b8d8..b15d15
   vmovdqa xmm0, xmmword ptr [rcx]
   vinsertf128  ymm0,ymm0,xmmword ptr [rcx+2*R8],1

   VPERM2F128 ymm4,ymm2,ymm3,1+16*2                           // xmm5 b0d0..b7d7  a8..c8..a15..b16
   vpblendd   ymm5,ymm2,ymm3,240                              // xmm4 a0c0..a7c7  b8d8..b15d15
   vmovdqa xmm1, xmmword ptr [rcx+r8]
   vinsertf128  ymm1,ymm1,xmmword ptr [rcx+R12],1
   lea rcx,[rcx+4*r8]

   vpunpcklwd ymm8,ymm5,ymm4                                  // xmm6 a0b0cd0..a3b3c3d3 a8b8c8d8..a11b11c11d11
   vpunpckhwd ymm9,ymm5,ymm4                                  // xmm7 a4b4cd4..a7b4c7d7 a12b12c12d12..a15b15c15d15

   vpunpcklbw ymm2,ymm0,ymm1                                  // xmm2 *2* a0c0..a7c7  *1* b0d0..b7d7
   vpunpckhbw ymm3,ymm0,ymm1                                  // xmm3 *1* a8c8..a15c15  *2*b8d8..b15d15
   vmovdqa xmm0, xmmword ptr [rcx]
   vinsertf128  ymm0,ymm0,xmmword ptr [rcx+2*R8],1

   VPERM2F128 ymm4,ymm2,ymm3,1+16*2                           // xmm5 b0d0..b7d7  a8..c8..a15..b16
   vpblendd   ymm5,ymm2,ymm3,240                              // xmm4 a0c0..a7c7  b8d8..b15d15
   vmovdqa xmm1, xmmword ptr [rcx+r8]
   vinsertf128  ymm1,ymm1,xmmword ptr [rcx+R12],1

   vpunpcklwd ymm10,ymm5,ymm4                                  // xmm6 a0b0cd0..a3b3c3d3 a8b8c8d8..a11b11c11d11
   vpunpckhwd ymm11,ymm5,ymm4                                  // xmm7 a4b4cd4..a7b4c7d7 a12b12c12d12..a15b15c15d15
   vpunpcklbw ymm2,ymm0,ymm1                                  // xmm2 *2* a0c0..a7c7  *1* b0d0..b7d7
   vpunpckhbw ymm3,ymm0,ymm1                                  // xmm3 *1* a8c8..a15c15  *2*b8d8..b15d15
   VPERM2F128 ymm4,ymm2,ymm3,1+16*2                           // xmm5 b0d0..b7d7  a8..c8..a15..b16
   vpblendd   ymm5,ymm2,ymm3,240                              // xmm4 a0c0..a7c7  b8d8..b15d15
   vpunpcklwd ymm12,ymm5,ymm4                                  // xmm6 a0b0cd0..a3b3c3d3 a8b8c8d8..a11b11c11d11
   vpunpckhwd ymm13,ymm5,ymm4                                  // xmm7 a4b4cd4..a7b4c7d7 a12b12c12d12..a15b15c15d15

--- End code ---
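To make the lane issue concrete, here is a small scalar model of the 256-bit vpunpcklbw used above. The instruction interleaves each 128-bit lane separately and only takes the low 8 bytes of each lane, which is why the comments show a0c0..a7c7 in one half and b0d0..b7d7 in the other. The type and function names are made up for this sketch.

```pascal
program LaneSketch;
{$mode objfpc}

type
  TYmmBytes = array[0..31] of Byte;  // one 256-bit register viewed as bytes

// Scalar model of 256-bit vpunpcklbw: each 128-bit lane is interleaved on
// its own, using only the low 8 bytes of that lane from each source.
function UnpackLowBytes(const A, B: TYmmBytes): TYmmBytes;
var
  Lane, Base, i: Integer;
begin
  for Lane := 0 to 1 do
  begin
    Base := 16 * Lane;  // first byte of this 128-bit lane
    for i := 0 to 7 do
    begin
      Result[Base + 2 * i]     := A[Base + i];
      Result[Base + 2 * i + 1] := B[Base + i];
    end;
  end;
end;

var
  A, B, R: TYmmBytes;
  i: Integer;
begin
  for i := 0 to 31 do
  begin
    A[i] := i;        // bytes 0..31
    B[i] := 100 + i;  // bytes 100..131
  end;

  R := UnpackLowBytes(A, B);

  for i := 0 to 31 do
    Write(R[i], ' ');
  WriteLn;
end.
```

The output begins 0 100 1 101 and, after 7 107, jumps straight to 16 116: bytes 8..15 of the low lane never cross into the upper half. They only show up via vpunpckhbw, so a cross-lane step such as vperm2f128 is needed afterwards to get the rows back in order.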

I do have a colour distance routine in AVX2 that works nicely, though it is a bit specific.
