* * *

Author Topic: AVX and SSE support question  (Read 31590 times)

dicepd

  • Full Member
  • ***
  • Posts: 163
Re: AVX and SSE support question
« Reply #165 on: December 24, 2017, 09:22:37 am »
Jerome,

Here are the helpers for *nix 64 and 32 bit, along with the main sources which include the three new methods in the base class. The avx helpers are just stubs at the moment.

AverageNorm4 is now completed, needs 1e-7 in the test as epsilon.
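
For anyone following along, a tolerance check of this kind can be sketched in plain Pascal as below. `NearlyEqual` is a hypothetical helper name, not something from the test unit, and an absolute epsilon of 1e-7 only makes sense for values of roughly unit magnitude or smaller.

```pascal
program EpsilonCompare;
{$mode objfpc}
{$assertions on}

// Hypothetical helper: True when a and b differ by at most eps.
function NearlyEqual(a, b: Single; eps: Single = 1e-7): Boolean;
begin
  Result := Abs(a - b) <= eps;
end;

begin
  // 0.5 and 0.25 are exactly representable, so these checks are exact.
  Assert(NearlyEqual(0.5, 0.5));
  Assert(not NearlyEqual(0.5, 0.25));
  WriteLn('ok');
end.
```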

Also in a folder is the example of this single function to show a 'submission' request for a 'new' method using the template.

Bit painful this as you put your helpers in the main code file so I had to sort out quite a bit before I could even compile and start work. Lots of minor problems with case etc, but no point listing all these when I can sort them out later.

Ready for more functions to work on now :)

Merry Christmas  O:-)
Lazarus 1.8rc5 Win64 / Linux gtk2 64 / FreeBSD qt4

BeanzMaster

  • Jr. Member
  • **
  • Posts: 87
Re: AVX and SSE support question
« Reply #166 on: December 26, 2017, 08:00:39 pm »
Hi Peter

I just created a repository for our 'SIMD Vector Math Unit Tester' on Github https://github.com/jdelauney/SIMD-VectorMath-UnitTest

I also added the type TGLZVector2f and implemented some functions in SSE (Win64), with tests.

See you soon

dicepd

  • Full Member
  • ***
  • Posts: 163
Re: AVX and SSE support question
« Reply #167 on: December 26, 2017, 09:31:38 pm »
Hi Jerome,

My GitHub handle is the same as here. I will do a pull and get some minor fixes diffed up.

Peter
Lazarus 1.8rc5 Win64 / Linux gtk2 64 / FreeBSD qt4

dicepd

  • Full Member
  • ***
  • Posts: 163
Re: AVX and SSE support question
« Reply #168 on: December 27, 2017, 02:36:46 am »
Ok I have the Unix 64 bit 2d vector working in 7 local commits along with the two native targets. Just awaiting the ability to push now.
Lazarus 1.8rc5 Win64 / Linux gtk2 64 / FreeBSD qt4

dicepd

  • Full Member
  • ***
  • Posts: 163
Re: AVX and SSE support question
« Reply #169 on: December 28, 2017, 09:34:18 am »
My first checkin of getting unix working along with the start of vector2f and vector4f structural changes. Will continue with all other configs so they compile by setting up stubs for work that needs doing.

Added a .gitignore in the project dir so git status does not fill the screen with crap.

Plane seems to be broken in win64 now, got normalize working by adding some var initialisation to overridden setup. (plane functions never worked in unix.)

Removed lps files.

I have a local mod here, not checked in, that changes the utils XML handling from WideString to UTF8 (more Lazarus friendly in my opinion). I will hold this until you agree; otherwise I have other work (ifdefs) to do to make the XML work in unix.
Lazarus 1.8rc5 Win64 / Linux gtk2 64 / FreeBSD qt4

BeanzMaster

  • Jr. Member
  • **
  • Posts: 87
Re: AVX and SSE support question
« Reply #170 on: December 28, 2017, 09:57:45 am »
No problem for me with changing the XML handling from WideString to UTF8  :D

dicepd

  • Full Member
  • ***
  • Posts: 163
Re: AVX and SSE support question
« Reply #171 on: December 28, 2017, 12:42:43 pm »
32 bit  unix added for sse.

For single results in 32bit the ABI wants the result value in st0.

We already have the result in xmm0, but xmm0 is a separate register from st0 (only the MMX registers mm0-mm7 alias the x87 stack).

So I cannot use nostackframe and have to copy the result to the stack; the compiler then loads this value from the stack into st0.

Does anyone have any ideas on a method to avoid this stack copy and just leave the value in xmm0?
Lazarus 1.8rc5 Win64 / Linux gtk2 64 / FreeBSD qt4

dicepd

  • Full Member
  • ***
  • Posts: 163
Re: AVX and SSE support question
« Reply #172 on: December 28, 2017, 05:39:14 pm »
Jerome,

Everything that is not win64 is created, stubbed and runs. I am not saying it works, just that you can run any test in unix64, unix32 and win32 without the compiler complaining or the runtime generating a seg fault.

I am sure this will not last as you get some more routines started, but I will try to keep it at least in this state, so you can concentrate on just win64 and one codebase.
Lazarus 1.8rc5 Win64 / Linux gtk2 64 / FreeBSD qt4

SonnyBoyXXl

  • New member
  • *
  • Posts: 40
Re: AVX and SSE support question
« Reply #173 on: January 04, 2018, 10:56:33 am »
Hi all,
I've finished the translation of the DirectX Math headers and am now testing the functions. I have a problem with this one:
Code: Pascal
function XMVectorSetBinaryConstant(constref C0: UINT32; constref C1: UINT32; constref C2: UINT32; constref C3: UINT32): TXMVECTOR;{ assembler;}
const
    g_vMask1: TXMVECTORU32 = (u: (1, 1, 1, 1));
asm
    // Move the parms to a vector
    // __m128i vTemp = _mm_set_epi32(C3,C2,C1,C0);
    MOVUPS      XMM0,TXMVECTOR([c3])
    MOVUPS      XMM1,TXMVECTOR([c2])
    MOVUPS      XMM2,TXMVECTOR([c1])
    MOVUPS      XMM3,TXMVECTOR([c0])
    PUNPCKLDQ   XMM3,XMM1
    PUNPCKLDQ   XMM2,XMM0
    PUNPCKLDQ   XMM3,XMM2  // XMM3 = vTemp
    // Mask off the low bits
    PAND    XMM3, [g_vMask1] // vTemp = _mm_and_si128(vTemp,g_vMask1);
    // 0xFFFFFFFF on true bits
    PCMPEQD XMM3, [g_vMask1] // vTemp = _mm_cmpeq_epi32(vTemp,g_vMask1);
    // 0xFFFFFFFF -> 1.0f, 0x00000000 -> 0.0f
    PAND    XMM3, [g_XMOne] // vTemp = _mm_and_si128(vTemp,g_XMOne);
    MOVUPS  TXMVECTOR([result]), XMM3// return _mm_castsi128_ps(vTemp);
end;

The result in XMM3 is correct, as I can see in the debugger, but the function doesn't return it.
Debugging the value of result shows strange behaviour: the result is available at a breakpoint on the MOVUPS line but not at end. See attached pictures.
What is wrong here? I use
Code: Pascal
{$ASMMODE intel}
{$Z4}
{$CODEALIGN CONSTMIN=16}
{$A4}
and the compiler flag -CfSSE.

The casting of the constants C0, C1, C2, C3 to TXMVECTOR is to avoid a compiler hint that MOVUPS needs an M128 address. If no casting is done, the result is the same.


CuriousKit

  • Jr. Member
  • **
  • Posts: 71
Re: AVX and SSE support question
« Reply #174 on: January 04, 2018, 11:50:25 pm »
Try looking at the disassembly of the program to see what it's doing in the function epilogue, and to also see what Result actually represents (likely a pre-reserved block of memory).

BeanzMaster

  • Jr. Member
  • **
  • Posts: 87
Re: AVX and SSE support question
« Reply #175 on: January 05, 2018, 12:33:49 am »
Hi
1st: instead of the cast and movups, use "movq".
2nd: for const access, if you are on a 64 bit system, use "RIP", e.g. movups xmm0, [RIP+MyConst]
3rd: don't cast the result; movups [Result], xmm0 is enough.

And like CuriousKit says, take a look at the .s file (see the compiler -a options).

SonnyBoyXXl

  • New member
  • *
  • Posts: 40
Re: AVX and SSE support question
« Reply #176 on: January 05, 2018, 01:25:25 am »
I found some time today to work on that problem.
First I changed the ASM code. I checked how MS VS 2017 handles the _mm_set_epi32 intrinsic. This is the new routine:
Code: Pascal
function XMVectorSetBinaryConstant(constref C0: UINT32; constref C1: UINT32; constref C2: UINT32; constref C3: UINT32): TXMVECTOR;
     assembler;
const
    g_vMask1: TXMVECTORU32 = (u: (1, 1, 1, 1));
asm
    // Move the parms to a vector
    // __m128i vTemp = _mm_set_epi32(C3,C2,C1,C0);
    movd        xmm0,dword ptr [C3]
    movd        xmm1,dword ptr[C2]
    movd        xmm2,dword ptr[C1]
    movd        xmm3,dword ptr[C0]
    punpckldq   xmm3,xmm1
    punpckldq   xmm2,xmm0
    punpckldq   xmm3,xmm2 // XMM3 = vTemp
    // Mask off the low bits
    PAND    XMM3, [g_vMask1] // vTemp = _mm_and_si128(vTemp,g_vMask1);
    // 0xFFFFFFFF on true bits
    PCMPEQD XMM3, [g_vMask1] // vTemp = _mm_cmpeq_epi32(vTemp,g_vMask1);
    // 0xFFFFFFFF -> 1.0f, 0x00000000 -> 0.0f
    PAND    XMM3, [g_XMOne] // vTemp = _mm_and_si128(vTemp,g_XMOne);
    MOVUPS  TXMVECTOR([result]), XMM3// return _mm_castsi128_ps(vTemp);
end;

When I set a breakpoint at the "movd        xmm1,dword ptr[C2]" line, I see in the debugger that the value of XMM0 is not what it should be.
So I looked at the .s file.

Quote
DIRECTX.MATH_$$_XMVECTORSETBINARYCONSTANT$LONGWORD$LONGWORD$LONGWORD$LONGWORD$$TXMVECTOR:
.Lc128:
.Ll314:
# [2903] g_vMask1: TXMVECTORU32 = (u: (1, 1, 1, 1));
   pushl   %ebp
.Lc130:
.Lc131:
   movl   %esp,%ebp
.Lc132:
# Var C0 located in register eax
# Var C1 located in register edx
# Var C2 located in register ecx
# Var C3 located at ebp+12, size=OS_32
# Var $result located at ebp+8, size=OS_32
.Ll315:
# [2907] movd        xmm0,dword ptr [C3]
   movd   12(%ebp),%xmm0
.Ll316:
# [2908] movd        xmm1,dword ptr[C2]
   movd   (%ecx),%xmm1
.Ll317:
# [2909] movd        xmm2,dword ptr[C1]
   movd   (%edx),%xmm2
.Ll318:
# [2910] movd        xmm3,dword ptr[C0]
   movd   (%eax),%xmm3
.Ll319:
# [2911] punpckldq   xmm3,xmm1
   punpckldq   %xmm1,%xmm3
.Ll320:
# [2912] punpckldq   xmm2,xmm0
   punpckldq   %xmm0,%xmm2
.Ll321:
# [2913] punpckldq   xmm3,xmm2 // XMM3 = vTemp
   punpckldq   %xmm2,%xmm3
.Ll322:
# [2915] PAND    XMM3, [g_vMask1] // vTemp = _mm_and_si128(vTemp,g_vMask1);
   pand   TC_$DIRECTX.MATH$_$XMVECTORSETBINARYCONSTANT$crcD1D7FBA5_$$_G_VMASK1,%xmm3
.Ll323:
# [2917] PCMPEQD XMM3, [g_vMask1] // vTemp = _mm_cmpeq_epi32(vTemp,g_vMask1);
   pcmpeqd   TC_$DIRECTX.MATH$_$XMVECTORSETBINARYCONSTANT$crcD1D7FBA5_$$_G_VMASK1,%xmm3
.Ll324:
# [2919] PAND    XMM3, [g_XMOne] // vTemp = _mm_and_si128(vTemp,g_XMOne);
   pand   TC_$DIRECTX.MATH_$$_G_XMONE,%xmm3
.Ll325:
# [2920] MOVUPS  TXMVECTOR([result]), XMM3// return _mm_castsi128_ps(vTemp);
   movups   %xmm3,8(%ebp)
.Ll326:
# [2921] end;
   leave
   ret   $8
.Lc129:
.Lt14:
.Ll327:

C3 is located on the stack. So I changed the function to
Code: Pascal
function XMVectorSetBinaryConstant(constref C0: UINT32; constref C1: UINT32; constref C2: UINT32; const C3: UINT32): TXMVECTOR;
     assembler;

The .s output is
Quote
DIRECTX.MATH_$$_XMVECTORSETBINARYCONSTANT$LONGWORD$LONGWORD$LONGWORD$LONGWORD$$TXMVECTOR:
.Lc128:
.Ll314:
# [2903] g_vMask1: TXMVECTORU32 = (u: (1, 1, 1, 1));
   pushl   %ebp
.Lc130:
.Lc131:
   movl   %esp,%ebp
.Lc132:
# Var C0 located in register eax
# Var C1 located in register edx
# Var C2 located in register ecx
# Var C3 located at ebp+12, size=OS_32
# Var $result located at ebp+8, size=OS_32
.Ll315:
# [2907] movd        xmm0,dword ptr [C3]
   movd   12(%ebp),%xmm0
.Ll316:
# [2908] movd        xmm1,dword ptr[C2]
   movd   (%ecx),%xmm1
.Ll317:
# [2909] movd        xmm2,dword ptr[C1]
   movd   (%edx),%xmm2
.Ll318:
# [2910] movd        xmm3,dword ptr[C0]
   movd   (%eax),%xmm3
.Ll319:
# [2911] punpckldq   xmm3,xmm1
   punpckldq   %xmm1,%xmm3
.Ll320:
# [2912] punpckldq   xmm2,xmm0
   punpckldq   %xmm0,%xmm2
.Ll321:
# [2913] punpckldq   xmm3,xmm2 // XMM3 = vTemp
   punpckldq   %xmm2,%xmm3
.Ll322:
# [2915] PAND    XMM3, [g_vMask1] // vTemp = _mm_and_si128(vTemp,g_vMask1);
   pand   TC_$DIRECTX.MATH$_$XMVECTORSETBINARYCONSTANT$crcD1D7FBA5_$$_G_VMASK1,%xmm3
.Ll323:
# [2917] PCMPEQD XMM3, [g_vMask1] // vTemp = _mm_cmpeq_epi32(vTemp,g_vMask1);
   pcmpeqd   TC_$DIRECTX.MATH$_$XMVECTORSETBINARYCONSTANT$crcD1D7FBA5_$$_G_VMASK1,%xmm3
.Ll324:
# [2919] PAND    XMM3, [g_XMOne] // vTemp = _mm_and_si128(vTemp,g_XMOne);
   pand   TC_$DIRECTX.MATH_$$_G_XMONE,%xmm3
.Ll325:
# [2920] MOVUPS  TXMVECTOR([result]), XMM3// return _mm_castsi128_ps(vTemp);
   movups   %xmm3,8(%ebp)
.Ll326:
# [2921] end;
   leave
   ret   $8
.Lc129:
.Lt14:
.Ll327:
As you see, the output is the same. But most importantly, the value in XMM0 is now valid.

The only remaining problem is that the result is still not valid.
If I change the routine so that the result is also in a register and not on the stack, everything works perfectly (this means I pass a TXMVector as input instead of the four UINT32, so I have the in-var in a register and also the out-var).
It seems this is a problem when result lies on the stack?
I have also found this post https://forum.lazarus.freepascal.org/index.php?topic=29097.0
This is the bug tracker entry https://bugs.freepascal.org/view.php?id=32710#c104254.

So I think the problem is the same on Windows?


« Last Edit: January 05, 2018, 01:32:35 am by SonnyBoyXXl »

SonnyBoyXXl

  • New member
  • *
  • Posts: 40
Re: AVX and SSE support question
« Reply #177 on: January 06, 2018, 05:54:01 pm »
I got the function running now with these modifications:

Code: Pascal
function XMVectorSetBinaryConstant(const C0: UINT32; const C1: UINT32; const C2: UINT32; const C3: UINT32): PXMVECTOR;
const
    g_vMask1: TXMVECTORU32 = (u: (1, 1, 1, 1));
var
    x: TXMVECTOR;
begin
    asm
        // Move the parms to a vector
        // __m128i vTemp = _mm_set_epi32(C3,C2,C1,C0);
        MOVD        XMM0, [C3]
        MOVD        XMM1, [C2]
        MOVD        XMM2, [C1]
        MOVD        XMM3, [C0]
        PUNPCKLDQ   XMM3,XMM1
        PUNPCKLDQ   XMM2,XMM0
        PUNPCKLDQ   XMM3,XMM2 // XMM3 = vTemp
        // Mask off the low bits
        PAND    XMM3, [g_vMask1] // vTemp = _mm_and_si128(vTemp,g_vMask1);
        // 0xFFFFFFFF on true bits
        PCMPEQD XMM3, [g_vMask1] // vTemp = _mm_cmpeq_epi32(vTemp,g_vMask1);
        // 0xFFFFFFFF -> 1.0f, 0x00000000 -> 0.0f
        PAND    XMM3, [g_XMOne] // vTemp = _mm_and_si128(vTemp,g_XMOne);
        MOVUPS  [x], XMM3// return _mm_castsi128_ps(vTemp);
    end;
    Result := @x; // caution: x is a stack local, so this pointer dangles once the function returns
end;

This is the .s output:

Quote
DIRECTX.MATH_$$_XMVECTORSETBINARYCONSTANT$LONGWORD$LONGWORD$LONGWORD$LONGWORD$$PXMVECTOR:
.Lc128:
.Ll314:
# [2925] begin
   pushl   %ebp
.Lc130:
.Lc131:
   movl   %esp,%ebp
.Lc132:
   leal   -80(%esp),%esp
# Var C0 located at ebp-16, size=OS_32
# Var C1 located at ebp-32, size=OS_32
# Var C2 located at ebp-48, size=OS_32
# Var C3 located at ebp+8, size=OS_32
# Var $result located at ebp-64, size=OS_32
# Var x located at ebp-80, size=OS_NO
   movl   %eax,-16(%ebp)
   movl   %edx,-32(%ebp)
   movl   %ecx,-48(%ebp)
#  CPU PENTIUM
.Ll315:
# [2929] movd        xmm0, [C3]
   movd   8(%ebp),%xmm0
.Ll316:
# [2930] movd        xmm1, [C2]
   movd   -48(%ebp),%xmm1
.Ll317:
# [2931] movd        xmm2, [C1]
   movd   -32(%ebp),%xmm2
.Ll318:
# [2932] movd        xmm3, [C0]
   movd   -16(%ebp),%xmm3
.Ll319:
# [2933] punpckldq   xmm3,xmm1
   punpckldq   %xmm1,%xmm3
.Ll320:
# [2934] punpckldq   xmm2,xmm0
   punpckldq   %xmm0,%xmm2
.Ll321:
# [2935] punpckldq   xmm3,xmm2 // XMM3 = vTemp
   punpckldq   %xmm2,%xmm3
.Ll322:
# [2937] PAND    XMM3, [g_vMask1] // vTemp = _mm_and_si128(vTemp,g_vMask1);
   pand   TC_$DIRECTX.MATH$_$XMVECTORSETBINARYCONSTANT$LONGWORD$LONGWORD$LONGWORD$LONGWORD$$PXMVECTOR_$$_G_VMASK1,%xmm3
.Ll323:
# [2939] PCMPEQD XMM3, [g_vMask1] // vTemp = _mm_cmpeq_epi32(vTemp,g_vMask1);
   pcmpeqd   TC_$DIRECTX.MATH$_$XMVECTORSETBINARYCONSTANT$LONGWORD$LONGWORD$LONGWORD$LONGWORD$$PXMVECTOR_$$_G_VMASK1,%xmm3
.Ll324:
# [2941] PAND    XMM3, [g_XMOne] // vTemp = _mm_and_si128(vTemp,g_XMOne);
   pand   TC_$DIRECTX.MATH_$$_G_XMONE,%xmm3
.Ll325:
# [2942] MOVUPS  [x], XMM3// return _mm_castsi128_ps(vTemp);
   movups   %xmm3,-80(%ebp)
#  CPU PENTIUM
.Ll326:
# [2944] result:=@x;
   leal   -80(%ebp),%eax
   movl   %eax,-64(%ebp)
.Ll327:
# [2945] end;
   movl   %ebp,%esp
   popl   %ebp
   ret   $4
.Lc129:
.Lt14:
.Ll328:

Why is this working?

dicepd

  • Full Member
  • ***
  • Posts: 163
Re: AVX and SSE support question
« Reply #178 on: January 06, 2018, 09:55:05 pm »
When you have parameters or returns on the stack you have to look at the size to try to work out if it is a value or a pointer.

If the return is a pointer then you can use something like this.
Code: Pascal
  mov    ebx,  [Result]
  vmovups [ebx], xmm0

For parameter pointers which are on the stack you will need something like this

Code: Pascal
  mov    ebx,  [right]
  movups xmm5, [ebx]

32 bit usually puts pointers for most things on the stack.

Looking at your case, you declared a local variable which was allocated space on the stack, which is why you have the following
Code: Pascal
movups   %xmm3,-80(%ebp)
This is a value on the stack, not a pointer.
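
One caveat about that version: `Result := @x` hands back the address of a local whose stack frame is released on return, so the caller ends up reading memory that any later call is free to overwrite. It only appears to work. Below is a minimal sketch, in plain Pascal, of returning by value instead. `TVec4` and `MakeVec` are hypothetical stand-ins, not names from the DirectX Math translation, and the local is filled with ordinary assignments where the real routine would use its asm block.

```pascal
program ReturnByValue;
{$mode objfpc}
{$assertions on}

type
  TVec4 = array[0..3] of Single;  // hypothetical stand-in for TXMVECTOR

// The local is filled first (an asm block could do this step), then copied
// into Result. The compiler deals with the hidden result pointer itself.
function MakeVec(x, y, z, w: Single): TVec4;
var
  tmp: TVec4;
begin
  tmp[0] := x; tmp[1] := y; tmp[2] := z; tmp[3] := w;
  Result := tmp;  // copies the value; nothing points into this frame afterwards
end;

var
  a, b: TVec4;
begin
  a := MakeVec(1, 2, 3, 4);
  b := MakeVec(5, 6, 7, 8);  // a second call must not disturb a
  Assert((a[0] = 1) and (a[3] = 4));
  Assert((b[0] = 5) and (b[3] = 8));
  WriteLn('ok');
end.
```

The extra copy costs a little, but the caller never holds a pointer into a dead frame.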
« Last Edit: January 06, 2018, 10:03:39 pm by dicepd »
Lazarus 1.8rc5 Win64 / Linux gtk2 64 / FreeBSD qt4

SonnyBoyXXl

  • New member
  • *
  • Posts: 40
Re: AVX and SSE support question
« Reply #179 on: January 07, 2018, 12:36:29 am »
Yes, it is really confusing.
 :(

I've now continued testing and have another function:
Code: Pascal
function XMVectorSet(const x, y, z, w: single): TXMVECTOR; assembler;
asm
    MOVD        XMM0, [w]
    MOVD        XMM1, [z]
    MOVD        XMM2, [y]
    MOVD        XMM3, [x]
    PUNPCKLDQ   XMM3,XMM1
    PUNPCKLDQ   XMM2,XMM0
    PUNPCKLDQ   XMM3,XMM2
    MOVUPS  [result], XMM3 // _mm_set_ps( w, z, y, x );
end;

As you see, this is the same assembler code as the first part of XMVectorSetBinaryConstant. The difference is that the input parameters are of type single.
Therefore the .s output is

Quote
DIRECTX.MATH_$$_XMVECTORSET$SINGLE$SINGLE$SINGLE$SINGLE$$TXMVECTOR:
.Lc261:
.Ll822:
# [5426] asm
   pushl   %ebp
.Lc263:
.Lc264:
   movl   %esp,%ebp
.Lc265:
# Var $result located in register eax
# Var x located at ebp+20, size=OS_F32
# Var y located at ebp+16, size=OS_F32
# Var z located at ebp+12, size=OS_F32
# Var w located at ebp+8, size=OS_F32
.Ll823:
# [5427] MOVD        XMM0, [w]
   movd   8(%ebp),%xmm0
.Ll824:
# [5428] MOVD        XMM1, [z]
   movd   12(%ebp),%xmm1
.Ll825:
# [5429] MOVD        XMM2, [y]
   movd   16(%ebp),%xmm2
.Ll826:
# [5430] MOVD        XMM3, [x]
   movd   20(%ebp),%xmm3
.Ll827:
# [5431] PUNPCKLDQ   XMM3,XMM1
   punpckldq   %xmm1,%xmm3
.Ll828:
# [5432] PUNPCKLDQ   XMM2,XMM0
   punpckldq   %xmm0,%xmm2
.Ll829:
# [5433] PUNPCKLDQ   XMM3,XMM2
   punpckldq   %xmm2,%xmm3
.Ll830:
# [5434] MOVUPS  [result], XMM3 // _mm_set_ps( w, z, y, x );
   movups   %xmm3,(%eax)
.Ll831:
# [5435] end;
   leave
   ret   $16

The difference is that here the result is in a register.

So what comes out is:

Same routine, input params as UINT32:
Quote
# Var C0 located in register eax
# Var C1 located in register edx
# Var C2 located in register ecx
# Var C3 located at ebp+12, size=OS_32
# Var $result located at ebp+8, size=OS_32
--> not working directly, address of result is on the stack, must be loaded into register


input params as SINGLE:
Quote
# Var $result located in register eax
# Var C0 located at ebp+20, size=OS_F32
# Var C1 located at ebp+16, size=OS_F32
# Var C2 located at ebp+12, size=OS_F32
# Var C3 located at ebp+8, size=OS_F32
--> working, because the address of result is located in a register

I've applied your comment about the stack parameter to the routine, and it is working now.

Code: Pascal
function XMVectorSetBinaryConstant(constref C0, C1, C2: UINT32; const c3: UINT32): TXMVECTOR; assembler;
const
    g_vMask1: TXMVECTOR = (u32: (1, 1, 1, 1));
asm
    // Move the parms to a vector
    // __m128i vTemp = _mm_set_epi32(C3,C2,C1,C0);
    MOVD        XMM0, [C3]
    MOVD        XMM1, [C2]
    MOVD        XMM2, [C1]
    MOVD        XMM3, [C0]
    PUNPCKLDQ   XMM3,XMM1
    PUNPCKLDQ   XMM2,XMM0
    PUNPCKLDQ   XMM3,XMM2 // XMM3 = vTemp
    // Mask off the low bits
    PAND    XMM3, [g_vMask1] // vTemp = _mm_and_si128(vTemp,g_vMask1);
    // 0xFFFFFFFF on true bits
    PCMPEQD XMM3, [g_vMask1] // vTemp = _mm_cmpeq_epi32(vTemp,g_vMask1);
    // 0xFFFFFFFF -> 1.0f, 0x00000000 -> 0.0f
    PAND    XMM3, [g_XMOne] // vTemp = _mm_and_si128(vTemp,g_XMOne);
    PUSH    EBX
    MOV     EBX, [result]
    MOVUPS  [EBX], XMM3 // return _mm_castsi128_ps(vTemp);
    POP     EBX
end;
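
To check the assembler version against something simple, a plain-Pascal reference can be sketched as below. It assumes g_XMOne holds 1.0 in all four lanes, as in DirectX Math. `TVec4` and `SetBinaryConstantRef` are hypothetical names for illustration only, not part of the translated unit.

```pascal
program BinaryConstantRef;
{$mode objfpc}
{$assertions on}

type
  TVec4 = array[0..3] of Single;  // hypothetical stand-in for TXMVECTOR

// Plain-Pascal equivalent of the asm: lane i becomes 1.0 when bit 0 of Ci
// is set and 0.0 otherwise (what PAND / PCMPEQD / PAND compute per lane).
function SetBinaryConstantRef(C0, C1, C2, C3: LongWord): TVec4;
var
  bits: array[0..3] of LongWord;
  i: Integer;
begin
  bits[0] := C0; bits[1] := C1; bits[2] := C2; bits[3] := C3;
  for i := 0 to 3 do
    if (bits[i] and 1) = 1 then
      Result[i] := 1.0
    else
      Result[i] := 0.0;
end;

var
  v: TVec4;
begin
  // 1 and 3 have bit 0 set; 0 and 2 do not.
  v := SetBinaryConstantRef(1, 0, 3, 2);
  Assert((v[0] = 1.0) and (v[1] = 0.0) and (v[2] = 1.0) and (v[3] = 0.0));
  WriteLn('ok');
end.
```

Running both versions over the same inputs and comparing lane by lane should catch lane-order mistakes as well as a result that never reaches the caller.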

Thanks!
« Last Edit: January 07, 2018, 01:09:10 am by SonnyBoyXXl »

 
