AVX512 Support
schuler:
MARCOV!!!
Before I suggest anything, if I understand you well, you are doing something like this (simplified):
--- Code: Pascal ---
// transform this:
v1: RGBRGBRGB
v2: RGBRGBRGB
v3: RGBRGBRGB
v4: RGBRGBRGB
v5: RGBRGBRGB
v6: RGBRGBRGB
v7: RGBRGBRGB
v8: RGBRGBRGB
v9: RGBRGBRGB

// into this:
v1: RRRRRRRRR
v2: GGGGGGGGG
v3: BBBBBBBBB
v4: RRRRRRRRR
v5: GGGGGGGGG
v6: BBBBBBBBB
v7: RRRRRRRRR
v8: GGGGGGGGG
v9: BBBBBBBBB
Is the above correct (I'm not trying to match same number of vectors and sizes - the above was picked to be simple to understand)? If yes, I'll then try to code.
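As a scalar reference for the layout change sketched above, a minimal Free Pascal version might look like the following. The names SplitRGB and TChannel are made up for illustration, and the sketch assumes tightly packed 24-bit RGB without row padding; it is not a routine from this thread.

```pascal
program SplitRgbSketch;
{$mode objfpc}

type
  TChannel = array of Byte;  // hypothetical name for one planar channel

// Split packed RGBRGB... bytes into three planar arrays (R, G and B).
procedure SplitRGB(const Interleaved: array of Byte; out R, G, B: TChannel);
var
  i, PixelCount: Integer;
begin
  PixelCount := Length(Interleaved) div 3;
  SetLength(R, PixelCount);
  SetLength(G, PixelCount);
  SetLength(B, PixelCount);
  for i := 0 to PixelCount - 1 do
  begin
    R[i] := Interleaved[3 * i];
    G[i] := Interleaved[3 * i + 1];
    B[i] := Interleaved[3 * i + 2];
  end;
end;

const
  // Three example pixels: (1,2,3), (11,12,13), (21,22,23)
  Pixels: array[0..8] of Byte = (1, 2, 3, 11, 12, 13, 21, 22, 23);
var
  R, G, B: TChannel;
  i: Integer;
begin
  SplitRGB(Pixels, R, G, B);
  for i := 0 to High(R) do
    WriteLn('pixel ', i, ': R=', R[i], ' G=', G[i], ' B=', B[i]);
end.
```

A SIMD version does the same thing many bytes at a time; the scalar form is mainly useful as a correctness check for it.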
marcov:
--- Quote from: schuler on June 16, 2018, 04:00:14 am ---MARCOV!!!
Before I suggest anything, if I understand you well, you are doing something like this (simplified):
--- End quote ---
I already have it. What I didn't finish is the 16x16 bytewise rotation routine, because it looked like it wouldn't be faster than 4 times the (11 cycles) 8-bit routine.
As for the colour distance, after I have the RGB in separate vectors (like your result), I usually do a subtraction with a reference image, and then store the distance to a result image. (usually 24-bit + 24-bit reference -> 16-bit scalar result)
Depending on the required correctness, computation time and memory bandwidth, it goes over HSV, or works on RGB ( |R-R'| + |G-G'| + |B-B'|, stored in 16 bits ). This was done for a tender for a 15-20Gbyte/s (4 NBase-T cameras) inspection system. We got the tender from the machine factory, but our customer's customer ordered a lower spec machine without vision in the end. Bummer. The RGB abs routine is the only AVX2 one, except for the fragment of the rot16x16 above. The rest is all still SSE2.
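The RGB variant of the distance can be written down as a scalar sketch, assuming the sum of absolute channel differences described above. ColourDistance, DistanceImage and TWordArray are hypothetical names, and both inputs are assumed to be packed 24-bit RGB of equal length.

```pascal
program ColourDistanceSketch;
{$mode objfpc}

type
  TWordArray = array of Word;  // hypothetical name for the 16-bit result image

// Sum of absolute per-channel differences against a reference pixel.
// Three 8-bit differences add up to at most 3 * 255 = 765, so 16 bits suffice.
function ColourDistance(R, G, B, RefR, RefG, RefB: Byte): Word;
begin
  Result := Abs(Integer(R) - Integer(RefR))
          + Abs(Integer(G) - Integer(RefG))
          + Abs(Integer(B) - Integer(RefB));
end;

// Apply ColourDistance to every pixel of a packed 24-bit image.
procedure DistanceImage(const Image, Reference: array of Byte;
  out Distance: TWordArray);
var
  i: Integer;
begin
  SetLength(Distance, Length(Image) div 3);
  for i := 0 to High(Distance) do
    Distance[i] := ColourDistance(
      Image[3 * i], Image[3 * i + 1], Image[3 * i + 2],
      Reference[3 * i], Reference[3 * i + 1], Reference[3 * i + 2]);
end;

const
  Image: array[0..5] of Byte = (10, 20, 30, 0, 0, 0);
  Reference: array[0..5] of Byte = (13, 18, 30, 255, 255, 255);
var
  Distance: TWordArray;
  i: Integer;
begin
  DistanceImage(Image, Reference, Distance);
  for i := 0 to High(Distance) do
    WriteLn('distance[', i, '] = ', Distance[i]);
end.
```

The second example pixel shows the worst case (765), which is why an 8-bit result would need saturation or scaling while a 16-bit one does not.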
In the final application to reach max speed, a bayer filter would have to be integrated.
Currently, there are no color projects, so my asm activity is fairly low again, in contrast to last summer.
--- Quote ---Is the above correct (I'm not trying to match same number of vectors and sizes - the above was picked to be simple to understand)? If yes, I'll then try to code.
--- End quote ---
I already have various SSE2/3/4 routines for that. Also 24->32-bit conversion and vice versa etc. Mostly this is used for less realtime purposes like testing/tender stage and in apps to occasionally precalculate stuff (like the reference image)
Most of them are pretty useless for others since they assume that starts and widths are aligned, something that my application (and most industrial cameras) enforces, but that doesn't hold for e.g. web images. This reduces them to one core loop for the SSE/AVX case only, without pre and post padding.
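A caller that cannot guarantee this could test the precondition before choosing the SIMD path. The sketch below is only an illustration: IsSimdFriendly is a made-up name, and the 32-byte alignment in the example calls is an assumption (one AVX2 register), not a figure from the post.

```pascal
program AlignmentCheckSketch;
{$mode objfpc}

// True when both the start address and the row width in bytes are
// multiples of Alignment. Alignment must be non-zero.
function IsSimdFriendly(Start: Pointer; WidthBytes, Alignment: PtrUInt): Boolean;
begin
  Result := (PtrUInt(Start) mod Alignment = 0)
        and (WidthBytes mod Alignment = 0);
end;

begin
  // The pointers are fabricated addresses used only for the arithmetic;
  // they are never dereferenced.
  WriteLn(IsSimdFriendly(Pointer(PtrUInt(64)), 1280 * 3, 32));
  WriteLn(IsSimdFriendly(Pointer(PtrUInt(65)), 1280 * 3, 32));
  WriteLn(IsSimdFriendly(Pointer(PtrUInt(64)), 1001 * 3, 32));
end.
```

When the check fails, a general-purpose routine would need scalar pre and post loops around the vector loop, which is exactly the extra code the aligned-only routines avoid.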
Going to the limit is not always useful; sometimes memory latency is king (images range from 1.3 to 20-30Mpix). The tender case was extreme, with bayered images of up to 150MByte coming in at scary speeds.
schuler:
Hi Marcov,
I'm certain that you know more about x86 assembler than I do, and I'll not try to teach the professor. Therefore, this is just an idea to share regarding a 9x9 byte transpose (you can use bigger sizes - this is just an example):
We have 3 constants:
C1:FF0000FF0000FF0000
C2:00FF0000FF0000FF00
C3:0000FF0000FF0000FF
Inputs come as V1..V9 as per the previous message. I would then shr V2 and V3 and mask each vector:
shr V2, 8
shr V3, 16
and V1,C1
and V2,C2
and V3,C3
// we now have:
v1: R 0 0 R 0 0 R 0 0
v2: 0 R 0 0 R 0 0 R 0
v3: 0 0 R 0 0 R 0 0 R
// we then join everything:
OR V1, V2
OR V1, V3
// We end with:
V1: R R R R R R R R R
Then, we need to do the same for the other 2 color channels.
We'll have 7 instructions per channel in total (2 SHRs, 3 ANDs and 2 ORs) plus some load/store instructions. As these instructions are simple, I would assume they will run very fast.
I'm not sure if this idea is good. I'm just sharing thoughts.
The above would benefit from large AVX/AVX512 instructions for bigger transpositions.
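To make the data flow of the mask/shift/or idea concrete, here is a scalar model of it, written as a sketch under a few assumptions: each vector is modelled as 9 bytes, "shr by 8 bits" is modelled as moving every byte one position to the right in the printed layout, and all type and function names are hypothetical.

```pascal
program MaskShiftOrSketch;
{$mode objfpc}

type
  TVec = array[0..8] of Byte;  // one 9-byte "register" as in the example

const
  C1: TVec = ($FF, 0, 0, $FF, 0, 0, $FF, 0, 0);
  C2: TVec = (0, $FF, 0, 0, $FF, 0, 0, $FF, 0);
  C3: TVec = (0, 0, $FF, 0, 0, $FF, 0, 0, $FF);

// Move every byte Count positions to the right, filling with zeros.
function ShiftBytes(const V: TVec; Count: Integer): TVec;
var
  i: Integer;
begin
  for i := 0 to 8 do
    if i >= Count then
      Result[i] := V[i - Count]
    else
      Result[i] := 0;
end;

function AndVec(const A, B: TVec): TVec;
var
  i: Integer;
begin
  for i := 0 to 8 do
    Result[i] := A[i] and B[i];
end;

function OrVec(const A, B: TVec): TVec;
var
  i: Integer;
begin
  for i := 0 to 8 do
    Result[i] := A[i] or B[i];
end;

// 2 shifts, 3 ANDs and 2 ORs, matching the instruction count in the post.
function GatherRed(const V1, V2, V3: TVec): TVec;
begin
  Result := OrVec(
    OrVec(AndVec(V1, C1), AndVec(ShiftBytes(V2, 1), C2)),
    AndVec(ShiftBytes(V3, 2), C3));
end;

const
  // R values are 10..12, 20..22 and 30..32; G and B are dummy 1 and 2.
  V1: TVec = (10, 1, 2, 11, 1, 2, 12, 1, 2);
  V2: TVec = (20, 1, 2, 21, 1, 2, 22, 1, 2);
  V3: TVec = (30, 1, 2, 31, 1, 2, 32, 1, 2);
var
  Red: TVec;
  i: Integer;
begin
  Red := GatherRed(V1, V2, V3);
  for i := 0 to 8 do
    Write(Red[i], ' ');
  WriteLn;
end.
```

In this model the reds of the three inputs end up interleaved (one from V1, one from V2, one from V3, and so on) and the last R of V2 and V3 is shifted out, so a real implementation would still have to decide on the output order and on how to handle the vector edges.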
marcov:
To get the R's, G's and B's together in a register I first vpshufb with this
const
splitsh6 : array[0..31] of byte = ( $00,$04,$08,$0C,$01,$05,$09,$0d,
$02,$06,$0A,$0E,$03,$07,$0B,$0F,
$00,$04,$08,$0C,$01,$05,$09,$0d,
$02,$06,$0A,$0E,$03,$07,$0B,$0F);
which sorts the bytes by colour channel, and then I permute the dwords using vpermd with this:
permto8 : array[0..31] of byte = ( $00,$00,$00,$00,$04,$00,$00,$00,
$01,$00,$00,$00,$05,$00,$00,$00,
$02,$00,$00,$00,$06,$00,$00,$00,
$03,$00,$00,$00,$07,$00,$00,$00);
Which means that after two instructions I have
r0..r7,g0..g7,b0..b7,a0..a7 in one register. (In reality it is abs(r0-r0')..abs(r7-r7'), since I already did a subtraction with saturation.)
I then grow a register with just the r's, g's and b's each (ignoring a), add them, and store the result as an 8-bit colour distance.
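The effect of the two tables can be checked with a scalar model of the two instructions. This is a sketch, not the actual routine: it assumes 8 pixels stored as r,g,b,a bytes in one 256-bit register, models the byte shuffle as working on each 128-bit lane separately and vpermd as selecting source dwords by the low 3 bits of each index, and uses hypothetical names (TYmm, ModelPshufb, ModelPermd, SortChannels, MakeTestPixels).

```pascal
program ShuffleModelSketch;
{$mode objfpc}

type
  TYmm = array[0..31] of Byte;  // scalar stand-in for one 256-bit register

const
  splitsh6: TYmm = ($00,$04,$08,$0C,$01,$05,$09,$0D,
                    $02,$06,$0A,$0E,$03,$07,$0B,$0F,
                    $00,$04,$08,$0C,$01,$05,$09,$0D,
                    $02,$06,$0A,$0E,$03,$07,$0B,$0F);
  permto8: TYmm = ($00,$00,$00,$00,$04,$00,$00,$00,
                   $01,$00,$00,$00,$05,$00,$00,$00,
                   $02,$00,$00,$00,$06,$00,$00,$00,
                   $03,$00,$00,$00,$07,$00,$00,$00);

// Model of the 256-bit byte shuffle: each 128-bit lane is shuffled on its own.
function ModelPshufb(const Src, Mask: TYmm): TYmm;
var
  Lane, i, M: Integer;
begin
  for Lane := 0 to 1 do
    for i := 0 to 15 do
    begin
      M := Mask[Lane * 16 + i];
      if (M and $80) <> 0 then
        Result[Lane * 16 + i] := 0          // high bit set: write zero
      else
        Result[Lane * 16 + i] := Src[Lane * 16 + (M and $0F)];
    end;
end;

// Model of vpermd: destination dword i is source dword (index i and 7).
function ModelPermd(const Src, Idx: TYmm): TYmm;
var
  i, j, Sel: Integer;
begin
  for i := 0 to 7 do
  begin
    Sel := Idx[4 * i] and 7;  // low byte of the little-endian dword index
    for j := 0 to 3 do
      Result[4 * i + j] := Src[4 * Sel + j];
  end;
end;

function SortChannels(const Src: TYmm): TYmm;
begin
  Result := ModelPermd(ModelPshufb(Src, splitsh6), permto8);
end;

// Pixel p gets r = 10+p, g = 20+p, b = 30+p, a = 40+p.
function MakeTestPixels: TYmm;
var
  p: Integer;
begin
  for p := 0 to 7 do
  begin
    Result[4 * p]     := 10 + p;
    Result[4 * p + 1] := 20 + p;
    Result[4 * p + 2] := 30 + p;
    Result[4 * p + 3] := 40 + p;
  end;
end;

var
  Dst: TYmm;
  i: Integer;
begin
  Dst := SortChannels(MakeTestPixels);
  for i := 0 to 31 do
    Write(Dst[i], ' ');
  WriteLn;
end.
```

With this test data the first 8 output bytes are the reds, followed by 8 greens, 8 blues and 8 alphas, which is the r0..r7,g0..g7,b0..b7,a0..a7 order described above.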
The RGB to RGBA conversion is SSE2/3 and uses shuffles, shifts and ors.
These are all utility routines to be able to prototype at a sane speed with large images (say 10 fps). They are optimized, but still usually do just one transformation per pass (which is fast enough), and they are often SSE since that is more convenient.
For production you would want specialist routines that do multiple steps in one pass (debayer - abs(subtract) - colour channel sorting - store).
schuler:
Hello,
In regards to testing this branch:
https://svn.freepascal.org/svn/fpc/branches/tg74/avx512/
This is working:
zmm registers are properly recognized:
--- Code: Pascal ---
end [ 'RAX', 'RCX', 'RDX', 'zmm0', 'zmm1', 'zmm2' ];
These commands work:
--- Code: Pascal ---
asm
  VBROADCASTSS zmm0, [rdx]
  vmulps  zmm2, zmm0, [rax]
  vmulps  zmm3, zmm0, [rax+64]
  vmulps  zmm2, zmm5, [rdx]
  vmulps  zmm3, zmm5, [rdx+64]
  vmovups [rax], zmm2
  vmovups [rax+64], zmm3
  vaddps  zmm2, zmm2, [rdx]
  vaddps  zmm3, zmm3, [rdx+64]
  vsubps  zmm2, zmm2, [rdx]
  vsubps  zmm3, zmm3, [rdx+64]
end;
Given that the above works, I have already started coding support for AVX512 in my own project (uvolume.pas).
Wish everyone happy pascal coding.