Author Topic: AVX512 Support  (Read 7683 times)

schuler

  • Full Member
  • ***
  • Posts: 223
AVX512 Support
« on: June 14, 2018, 09:33:35 am »
Hi,
I'm wondering whether there is any flag or compiler directive that enables AVX512 assembler instructions.

I miss coding things like:

Code: Pascal
  vmovups zmm2, [rax]
  vmovups zmm3, [rax+64]

In case I feel brave enough to add support for some AVX512 instructions to FPC, where should I start?
« Last Edit: June 14, 2018, 10:03:23 pm by schuler »

schuler

Re: AVX512 Support
« Reply #1 on: June 15, 2018, 01:43:48 am »
I've just been told that this is actually in development :)
https://svn.freepascal.org/svn/fpc/branches/tg74/avx512/


marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 11351
  • FPC developer.
Re: AVX512 Support
« Reply #2 on: June 15, 2018, 09:33:17 am »
What is your intended application?  I did some minor avx2 work (*) last summer, but found out that the lane concept was stifling for straight work.


(*) among others https://stackoverflow.com/questions/47478010/sse2-8x8-byte-matrix-transpose-code-twice-as-slow-on-haswell-then-on-ivy-bridge

schuler

Re: AVX512 Support
« Reply #3 on: June 15, 2018, 09:59:00 am »
I have mixed results with AVX / AVX2 and expect the same with AVX512. In my own application (neural networks), in some server systems I do perceive improvement with AVX2 over AVX. First experiments in a notebook were not promising.

All AVX-specific code in this library will get an AVX512 version:
https://sourceforge.net/p/cai/svncode/HEAD/tree/trunk/lazarus/libs/uvolume.pas

This is the typical code that I intend to improve (replacing ymm registers with zmm registers):

Code: Pascal
  vmovups ymm2, [rax]
  vmovups ymm3, [rax+32]
  vmovups ymm4, [rax+64]
  vmovups ymm5, [rax+96]

  vaddps  ymm2, ymm2, [rdx]
  vaddps  ymm3, ymm3, [rdx+32]
  vaddps  ymm4, ymm4, [rdx+64]
  vaddps  ymm5, ymm5, [rdx+96]

  vmovups [rax],    ymm2
  vmovups [rax+32], ymm3
  vmovups [rax+64], ymm4
  vmovups [rax+96], ymm5

The code above will only be fast if memory bandwidth is really good.
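To make the ymm to zmm swap concrete, here is a minimal sketch that computes the byte offsets and the number of elements handled per unrolled block. It assumes 4-byte single-precision elements (vmovups and vaddps are packed-single instructions) and four registers per block as in the snippet above; the helper name `block_layout` is made up for this illustration.

```python
def block_layout(reg_bytes, n_regs=4, elem_bytes=4):
    """Return (byte offsets, elements per block) for an unrolled block.

    reg_bytes  -- register width in bytes (32 for ymm, 64 for zmm)
    n_regs     -- number of registers used per block
    elem_bytes -- element size in bytes (4 for single precision)
    """
    offsets = [i * reg_bytes for i in range(n_regs)]
    return offsets, n_regs * reg_bytes // elem_bytes


print(block_layout(32))  # ymm version
print(block_layout(64))  # zmm version
```

With the same number of registers, the zmm version touches twice as many bytes per block, which is why memory bandwidth matters even more.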

marcov

Re: AVX512 Support
« Reply #4 on: June 15, 2018, 11:34:06 am »
Quote
I have mixed results with AVX / AVX2 and expect the same with AVX512. In my own application (neural networks), in some server systems I do perceive improvement with AVX2 over AVX. First experiments in a notebook were not promising.

If you don't have to shuffle, avx(2) is nice. But I have quite some applications that need to shuffle.

I use either 8-bit monochrome or some color format (rgb) that I have to rearrange so I get a register of R's, a register of G's etc.

I tried to scale up the SSE code in the earlier link  (8x8 byte matrix rotation) to 16x16 using AVX2, but the number of cycles just went nuts.  The code below is already 20 cycles while 8x8 takes 10 cycles. So little gain is to be expected (every row still needs to be gathered from two registers and stored, and there was some other problem. Can't quickly test since this machine is not avx2 capable)

Code:
procedure rot16x16(src,dest:pbbyte;rowpitchsrc,rowpitchdest:integer;nrxstep,nrystep:integer  ); [public,alias: 'rot16x16'];
// src rcx, dest rdx, rpsrc r8, rpdest r9
// vol:  rax,r10,r11

// init src ptr at 0,0;  outer loop: y := y + step*rowpitch;  inner loop: x := x + stepsize
// init dest ptr at width-stepsize;  outer loop: x := x - stepsize;  inner loop: y := y + step*rowpitch
begin
asm
 {$ifdef iacamarker}
          mov ebx, 111          // Start marker bytes
         db $64, $67, $90   // Start marker bytes
    {$endif}

  mov r12,r8
  shl r12,1
  add r12,r8   // r12 = 3*rpsrc

  // load 16x16 bytes into 8 32 byte registers while interleaving
   vmovdqa xmm0, xmmword ptr [rcx]
   vinsertf128  ymm0,ymm0,xmmword ptr [rcx+2*R8],1              // xmm0 a0..a15  b0..b15
   vmovdqa xmm1, xmmword ptr [rcx+r8]
   vinsertf128  ymm1,ymm1,xmmword ptr [rcx+R12],1             // xmm1 c0..c15  d0..d15
   lea rcx,[rcx+4*r8]

   vpunpcklbw ymm2,ymm0,ymm1                                  // xmm2 *2* a0c0..a7c7  *1* b0d0..b7d7
   vpunpckhbw ymm3,ymm0,ymm1                                  // xmm3 *1* a8c8..a15c15  *2*b8d8..b15d15

   vmovdqa xmm0, xmmword ptr [rcx]
   vinsertf128  ymm0,ymm0,xmmword ptr [rcx+2*R8],1


   VPERM2F128 ymm4,ymm2,ymm3,1+16*2                           // xmm5 b0d0..b7d7  a8..c8..a15..b16
   vpblendd   ymm5,ymm2,ymm3,240                              // xmm4 a0c0..a7c7  b8d8..b15d15

   vmovdqa xmm1, xmmword ptr [rcx+r8]
   vinsertf128  ymm1,ymm1,xmmword ptr [rcx+R12],1
   lea rcx,[rcx+4*r8]

   // we have a problem here. in the higher half, the arguments are swapped.

   vpunpcklwd ymm6,ymm5,ymm4                                  // xmm6 a0b0cd0..a3b3c3d3 a8b8c8d8..a11b11c11d11
   vpunpckhwd ymm7,ymm5,ymm4                                  // xmm7 a4b4cd4..a7b4c7d7 a12b12c12d12..a15b15c15d15

   vpunpcklbw ymm2,ymm0,ymm1                                  // xmm2 *2* a0c0..a7c7  *1* b0d0..b7d7
   vpunpckhbw ymm3,ymm0,ymm1                                  // xmm3 *1* a8c8..a15c15  *2*b8d8..b15d15
   vmovdqa xmm0, xmmword ptr [rcx]
   vinsertf128  ymm0,ymm0,xmmword ptr [rcx+2*R8],1

   VPERM2F128 ymm4,ymm2,ymm3,1+16*2                           // xmm5 b0d0..b7d7  a8..c8..a15..b16
   vpblendd   ymm5,ymm2,ymm3,240                              // xmm4 a0c0..a7c7  b8d8..b15d15
   vmovdqa xmm1, xmmword ptr [rcx+r8]
   vinsertf128  ymm1,ymm1,xmmword ptr [rcx+R12],1
   lea rcx,[rcx+4*r8]

   vpunpcklwd ymm8,ymm5,ymm4                                  // xmm6 a0b0cd0..a3b3c3d3 a8b8c8d8..a11b11c11d11
   vpunpckhwd ymm9,ymm5,ymm4                                  // xmm7 a4b4cd4..a7b4c7d7 a12b12c12d12..a15b15c15d15

   vpunpcklbw ymm2,ymm0,ymm1                                  // xmm2 *2* a0c0..a7c7  *1* b0d0..b7d7
   vpunpckhbw ymm3,ymm0,ymm1                                  // xmm3 *1* a8c8..a15c15  *2*b8d8..b15d15
   vmovdqa xmm0, xmmword ptr [rcx]
   vinsertf128  ymm0,ymm0,xmmword ptr [rcx+2*R8],1

   VPERM2F128 ymm4,ymm2,ymm3,1+16*2                           // xmm5 b0d0..b7d7  a8..c8..a15..b16
   vpblendd   ymm5,ymm2,ymm3,240                              // xmm4 a0c0..a7c7  b8d8..b15d15
   vmovdqa xmm1, xmmword ptr [rcx+r8]
   vinsertf128  ymm1,ymm1,xmmword ptr [rcx+R12],1

   vpunpcklwd ymm10,ymm5,ymm4                                  // xmm6 a0b0cd0..a3b3c3d3 a8b8c8d8..a11b11c11d11
   vpunpckhwd ymm11,ymm5,ymm4                                  // xmm7 a4b4cd4..a7b4c7d7 a12b12c12d12..a15b15c15d15
   vpunpcklbw ymm2,ymm0,ymm1                                  // xmm2 *2* a0c0..a7c7  *1* b0d0..b7d7
   vpunpckhbw ymm3,ymm0,ymm1                                  // xmm3 *1* a8c8..a15c15  *2*b8d8..b15d15
   VPERM2F128 ymm4,ymm2,ymm3,1+16*2                           // xmm5 b0d0..b7d7  a8..c8..a15..b16
   vpblendd   ymm5,ymm2,ymm3,240                              // xmm4 a0c0..a7c7  b8d8..b15d15
   vpunpcklwd ymm12,ymm5,ymm4                                  // xmm6 a0b0cd0..a3b3c3d3 a8b8c8d8..a11b11c11d11
   vpunpckhwd ymm13,ymm5,ymm4                                  // xmm7 a4b4cd4..a7b4c7d7 a12b12c12d12..a15b15c15d15

I do have a colour distance routine in avx2 that is nice though, but a bit specific.
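The lane issue in the fragment above can be illustrated with a small scalar model. This is only a sketch: it models how the 256-bit forms of vpunpcklbw and vpunpckhbw interleave bytes separately within each 128-bit lane, using the row labels from the comments in the fragment (a and b in one register, c and d in the other). The function names are made up for the illustration.

```python
def _interleave(a, b, start):
    """Interleave 8 bytes of a and b, independently in each 128-bit lane."""
    out = []
    for lane in (0, 16):                      # byte offset of each lane
        for i in range(start, start + 8):
            out += [a[lane + i], b[lane + i]]
    return out


def vpunpcklbw_ymm(a, b):
    """Model of 256-bit vpunpcklbw: low 8 bytes of every lane."""
    return _interleave(a, b, 0)


def vpunpckhbw_ymm(a, b):
    """Model of 256-bit vpunpckhbw: high 8 bytes of every lane."""
    return _interleave(a, b, 8)


def row(name):
    """16 labelled bytes, e.g. a0..a15."""
    return [f"{name}{i}" for i in range(16)]


ymm0 = row("a") + row("b")   # a0..a15 | b0..b15
ymm1 = row("c") + row("d")   # c0..c15 | d0..d15

low = vpunpcklbw_ymm(ymm0, ymm1)
high = vpunpckhbw_ymm(ymm0, ymm1)
print(low[:4], low[16:20])    # start of lane 0 and lane 1
print(high[:4], high[16:20])
```

Because each result mixes the first half of one row pair with the first half of the other, extra cross-lane instructions (the VPERM2F128 and vpblendd steps in the fragment) are needed before the next unpack stage.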

schuler

Re: AVX512 Support
« Reply #5 on: June 16, 2018, 04:00:14 am »
MARCOV!!!
Before I suggest anything, if I understand you well, you are doing something like this (simplified):

Code: Pascal
// transform this:
v1: RGBRGBRGB
v2: RGBRGBRGB
v3: RGBRGBRGB
v4: RGBRGBRGB
v5: RGBRGBRGB
v6: RGBRGBRGB
v7: RGBRGBRGB
v8: RGBRGBRGB
v9: RGBRGBRGB

// into this:
v1: RRRRRRRRR
v2: GGGGGGGGG
v3: BBBBBBBBB
v4: RRRRRRRRR
v5: GGGGGGGGG
v6: BBBBBBBBB
v7: RRRRRRRRR
v8: GGGGGGGGG
v9: BBBBBBBBB

Is the above correct (I'm not trying to match same number of vectors and sizes - the above was picked to be simple to understand)? If yes, I'll then try to code.

marcov

Re: AVX512 Support
« Reply #6 on: June 16, 2018, 10:58:59 am »
Quote
MARCOV!!!
Before I suggest anything, if I understand you well, you are doing something like this (simplified):

I already have it. What I didn't finish is the 16x16 bytewise rotation routine, because it looked like it wouldn't be faster than 4 times the (11 cycles) 8-bit routine.

As for the colour distance, after I have the RGB in separate vectors (like your result), I usually do a subtraction with a reference image, and then store the distance to a result image. (usually 24-bit+ 24-bit reference -> 16-bit scalar result)

Depending on the required correctness, computation time and memory bandwidth, it goes over HSV, or works on RGB ( |R-R'| + |B-B'| + |G-G'|, stored in 16 bits). This was done for a tender for a 15-20 GByte/s (4 NBase-T cameras) inspection system. We got the tender from the machine factory, but our customer's customer ordered a lower spec machine without vision in the end. Bummer.  The RGB abs routine is the only avx2 one, except for the fragment of the rot16x16 above. The rest is all still SSE2.
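A scalar sketch of the RGB variant of that distance, assuming 8-bit channels and a sum of absolute per-channel differences; the function names are made up for the illustration and this is not the actual AVX2 routine.

```python
def colour_distance(pixel, ref):
    """Sum of absolute channel differences between two (R, G, B) pixels."""
    return sum(abs(p - q) for p, q in zip(pixel, ref))


def distance_image(img, ref_img):
    """Per-pixel colour distance of two equally sized lists of RGB pixels."""
    return [colour_distance(p, q) for p, q in zip(img, ref_img)]


img = [(10, 20, 30), (255, 255, 255)]
ref = [(13, 18, 30), (0, 0, 0)]
print(distance_image(img, ref))
```

The worst case is 3 * 255 = 765, which does not fit in 8 bits without saturation, so a 16-bit result per pixel is the natural choice.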

In the final application to reach max speed, a bayer filter would have to be integrated.

Currently there are no colour projects, so I'm doing little asm again, unlike last summer.

Quote
Is the above correct (I'm not trying to match same number of vectors and sizes - the above was picked to be simple to understand)? If yes, I'll then try to code.

I already have various SSE2/3/4 routines for that. Also 24->32-bit conversion and vice versa etc. Mostly this is used for less realtime purposes like testing/tender stage and in apps to occasionally precalculate stuff (like the reference image)

Most of them are pretty useless to others, since they assume that start addresses and widths are aligned. That is something my application (and most industrial cameras) enforces, but it doesn't hold for e.g. web images. It reduces them to a single core SSE/AVX loop, without pre and post padding.

Going to the limit is not always useful; sometimes memory latency is king (images range from 1.3 to 20-30 Mpix). The tender case was extreme, with bayered images of up to 150 MByte coming in at scary speeds.

schuler

Re: AVX512 Support
« Reply #7 on: June 17, 2018, 01:21:02 am »
Hi Marcov,
I'm certain that you know more about x86 assembler than I do, so I won't try to teach the professor. This is just an idea to share regarding a 9x9 byte transpose (you can use bigger sizes; this is just an example):

We have 3 constants:

C1:FF0000FF0000FF0000
C2:00FF0000FF0000FF00
C3:0000FF0000FF0000FF

Inputs come as V1..V9 as per previous message. I would then shr V2 and V3

shr V2,  8
shr V3, 16

and V1,C1
and V2,C2
and V3,C3
 
// we now have:

v1: R 0 0 R 0 0 R 0 0
v2: 0 R 0 0 R 0 0 R 0
v3: 0 0 R 0 0 R 0 0 R

// we then join everything:

OR V1, V2
OR V1, V3

// We end with:
V1: R R R R R R R R R

Then, we need to do the same for the other 2 color channels.
We'll have 7 instructions per channel in total (2 SHRs, 3 ANDs and 2 ORs) plus some load/store instructions.  As these instructions are simple, I would assume they will run very fast.

I'm not sure if this idea is good. I'm just sharing thoughts.

The above would benefit from large AVX/AVX512 instructions for bigger transpositions.
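A scalar model of the shift/AND/OR idea for the R channel, as a sketch only. It follows the left-to-right byte picture used above rather than any particular instruction, and the helper names are made up for the illustration. Note that the result collects the R bytes of three input vectors in interleaved order.

```python
# Masks from the description: C1, C2, C3 select byte positions 0,3,6 / 1,4,7 / 2,5,8.
C1 = [0xFF, 0x00, 0x00] * 3
C2 = [0x00, 0xFF, 0x00] * 3
C3 = [0x00, 0x00, 0xFF] * 3


def shift_right(v, n):
    """Move every byte n positions to the right, filling with zeros."""
    return [0] * n + v[:len(v) - n]


def and_bytes(v, mask):
    return [x & m for x, m in zip(v, mask)]


def or_bytes(a, b):
    return [x | y for x, y in zip(a, b)]


def gather_red(v1, v2, v3):
    """Two shifts, three ANDs and two ORs, as in the description."""
    v2 = shift_right(v2, 1)      # shr V2, 8 bits
    v3 = shift_right(v3, 2)      # shr V3, 16 bits
    v1 = and_bytes(v1, C1)
    v2 = and_bytes(v2, C2)
    v3 = and_bytes(v3, C3)
    return or_bytes(or_bytes(v1, v2), v3)


# Three RGBRGBRGB vectors with distinct byte values.
v1 = [11, 12, 13, 14, 15, 16, 17, 18, 19]
v2 = [21, 22, 23, 24, 25, 26, 27, 28, 29]
v3 = [31, 32, 33, 34, 35, 36, 37, 38, 39]
print(gather_red(v1, v2, v3))
```

On real hardware the whole-register byte shift is the part to watch: the AVX2 byte shifts (vpslldq/vpsrldq) work within each 128-bit lane, so wider vectors would need extra handling at lane boundaries.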

marcov

Re: AVX512 Support
« Reply #8 on: June 19, 2018, 11:56:10 am »
To get the R's, G's and B's together in a register, I first do a vpshufb with this:

const
      splitsh6 :  array[0..31] of byte = (  $00,$04,$08,$0C,$01,$05,$09,$0d,
                                            $02,$06,$0A,$0E,$03,$07,$0B,$0F,
                                            $00,$04,$08,$0C,$01,$05,$09,$0d,
                                            $02,$06,$0A,$0E,$03,$07,$0B,$0F);

This sorts the bytes by colour channel. I then permute the dwords using vpermd with:

      permto8 :  array[0..31] of byte = (   $00,$00,$00,$00,$04,$00,$00,$00,
                                            $01,$00,$00,$00,$05,$00,$00,$00,
                                            $02,$00,$00,$00,$06,$00,$00,$00,
                                            $03,$00,$00,$00,$07,$00,$00,$00);

Which means that after two instructions I have

r0..r7,g0..g7,b0..b7,a0..a7 in one register. (In reality it is abs(r0-r0')..abs(r7-r7'); I already did a subtraction with saturation.)
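The two steps can be checked with a scalar model. This is a sketch under the assumption that the input register holds 8 pixels in r,g,b,a byte order; it models the in-lane behaviour of 256-bit vpshufb and the cross-lane dword selection of vpermd, with the dword indices read from the permto8 table as little-endian values. The function names are made up for the illustration.

```python
# Byte shuffle control from splitsh6 (the same 16 entries for both lanes).
SPLITSH6 = [0x00, 0x04, 0x08, 0x0C, 0x01, 0x05, 0x09, 0x0D,
            0x02, 0x06, 0x0A, 0x0E, 0x03, 0x07, 0x0B, 0x0F] * 2

# Dword indices encoded by permto8 (each index is a little-endian dword).
PERMTO8 = [0, 4, 1, 5, 2, 6, 3, 7]


def vpshufb_ymm(src, ctrl):
    """Model of 256-bit vpshufb: byte shuffle within each 128-bit lane."""
    out = []
    for lane in (0, 16):
        for i in range(16):
            c = ctrl[lane + i]
            # A set top bit zeroes the byte; otherwise the low 4 bits select.
            out.append(0 if c & 0x80 else src[lane + (c & 0x0F)])
    return out


def vpermd(src, idx):
    """Model of vpermd: dword i of the result is dword idx[i] of src."""
    dwords = [src[4 * i:4 * i + 4] for i in range(8)]
    out = []
    for i in idx:
        out += dwords[i & 7]
    return out


# 8 pixels in r,g,b,a order: r0 g0 b0 a0 r1 g1 b1 a1 ...
pixels = [f"{c}{p}" for p in range(8) for c in "rgba"]
result = vpermd(vpshufb_ymm(pixels, SPLITSH6), PERMTO8)
print(result)
```

After the vpshufb each lane holds its own four r's, g's, b's and a's; the vpermd then pairs up the matching dwords from both lanes.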

I then grow one register each with just the r's, g's and b's (ignoring a), add them, and store the result as 8-bit colour distance.

The RGB to RGBA conversion is SSE2/3 and uses shuffles, shifts and ors.

These are all utility routines to be able to prototype at a sane speed with large images (say 10 fps). They are optimized but still usually just one transformation per pass (which is fast enough, and often SSE since that is more convenient ).

For production, specialist routines that do multiple steps per pass (debayer - abs(subtract) - colour channel sorting - store) would be needed.


schuler

Re: AVX512 Support
« Reply #9 on: August 26, 2018, 03:13:34 am »
Hello,
In regards to testing this branch:
https://svn.freepascal.org/svn/fpc/branches/tg74/avx512/

This is working:

zmm registers are properly recognized:

Code: Pascal
  end  [
    'RAX', 'RCX', 'RDX',
    'zmm0', 'zmm1', 'zmm2'
  ];

These commands work:

Code: Pascal
asm
  VBROADCASTSS zmm0, [rdx]
  vmulps  zmm2, zmm0, [rax]
  vmulps  zmm3, zmm0, [rax+64]
  vmulps  zmm2, zmm5, [rdx]
  vmulps  zmm3, zmm5, [rdx+64]
  vmovups [rax],    zmm2
  vmovups [rax+64], zmm3
  vaddps  zmm2, zmm2, [rdx]
  vaddps  zmm3, zmm3, [rdx+64]
  vsubps  zmm2, zmm2, [rdx]
  vsubps  zmm3, zmm3, [rdx+64]
end;

Given that the above works, I have already started coding support for AVX512 in my own project (uvolume.pas).

Wish everyone happy pascal coding.

 
