Author Topic: AVX and SSE support question  (Read 89728 times)

dicepd

  • Full Member
  • ***
  • Posts: 163
Re: AVX and SSE support question
« Reply #120 on: December 04, 2017, 05:25:46 am »

  mov ecx, RESULT
  mov [ecx], xmm0

Not working: vectormath_vector_win32_sse_imp.inc(269,5) Error: Asm: [mov mem??,xmmreg] invalid combination of opcode and operands

And this is what I have in the .S file:

Quote
.globl   GLZVECTORMATH$_$TGLZVECTOR4F_$__$$_DISTANCE$TGLZVECTOR4F$$SINGLE
GLZVECTORMATH$_$TGLZVECTOR4F_$__$$_DISTANCE$TGLZVECTOR4F$$SINGLE:
   # Register ebp allocated
# [258] Asm
   pushl   %ebp
   movl   %esp,%ebp
   leal   -4(%esp),%esp
# Var A located in register edx
# Var $self located in register eax
# Temp -4,4 allocated
# Var $result located at ebp-4, size=OS_F32
   # Register eax,ecx,edx allocated

Another example; this does not work either.


This one is easy: you should be returning a Single, not a 128-bit record, so use MOVSS, not movaps.

Peter
Lazarus 1.8rc5 Win64 / Linux gtk2 64 / FreeBSD qt4

CuriousKit

  • Jr. Member
  • **
  • Posts: 78
Re: AVX and SSE support question
« Reply #121 on: December 04, 2017, 01:37:23 pm »
I'm having some difficulty compiling the latest version of the unit from BeanzMaster - the GLZTypes unit has an awkward dependency on GLZVectorMath and others, since TGLZVector and TGLZVector2i are not defined.  It's easy enough to fix, but it means that GLZTypes is not self-contained.

BeanzMaster

  • Sr. Member
  • ****
  • Posts: 268
Re: AVX and SSE support question
« Reply #122 on: December 04, 2017, 03:57:17 pm »
Quote
Those errors are boring  >:D So perhaps making an external object library with masm or nasm/yasm would be better than using internal asm?

You would still have to conform to Pascal calling conventions, so there is not much gain in doing so; you would probably spend more time trying to get your params to your lib correctly.

I am writing some test cases; mark what is bad, carry on coding, and I'll try to sort out the 'annoying' errors.

I found the issue: in 32-bit the result is not aligned. I had "movaps [RESULT], xmm0"; with "movups [RESULT], xmm0" instead, it works.
Under 64-bit there is no problem, RESULT is aligned. But there is still a problem with the clamp, lerp, combine and combine 2/3 functions. All the others are OK in 32-bit.
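
The movaps/movups difference above comes down to whether the destination address is a multiple of 16. A minimal sketch of that check in plain Pascal follows; `IsAligned16` and the buffer are hypothetical names used only for illustration, not part of the library.

```pascal
program AlignCheck;
{$mode objfpc}

// True when the address is a multiple of 16, which is what movaps needs.
// movups accepts any address.
function IsAligned16(P: Pointer): Boolean;
begin
  Result := (PtrUInt(P) and 15) = 0;
end;

var
  Buf: array[0..31] of Byte;
  Base: PByte;
begin
  // Walk forward to the first 16-byte aligned address inside Buf.
  Base := @Buf[0];
  while not IsAligned16(Base) do
    Inc(Base);
  WriteLn('aligned address:   ', IsAligned16(Base));      // TRUE
  WriteLn('address + 4 bytes: ', IsAligned16(Base + 4));  // FALSE
end.
```

A check like this, applied to the address of RESULT in a debug build, is one way to confirm whether a given target hands out aligned result storage.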

As for this, I have got the following in unix64; it should work for win64 too, I think, from previous testing.

Code: Pascal
  class operator TGLZVector4f.+(constref A: TGLZVector4f; constref B:Single): TGLZVector4f; assembler; nostackframe; register;
  asm
    movaps xmm0,[A]
    movss  xmm1,[B]
    shufps xmm1, xmm1, $00
    addps  xmm0,xmm1
    movhlps xmm1,xmm0
  end;

Huh, do you get the right Result with this? movhlps moves the ZW values into the XY position. If I understood well, under Linux64 the result is split, with the low half in xmm0 and the high half in xmm1. Am I correct?
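
As background to the question above: under the System V x86-64 calling convention used on 64-bit Linux, a 16-byte record of four Singles is returned in two pieces, the first 8 bytes in the low half of xmm0 and the next 8 bytes in the low half of xmm1. The sketch below only models the data movement in plain Pascal; `TReg` and `Movhlps` are illustrative stand-ins, not real registers or library code.

```pascal
program MovhlpsModel;
{$mode objfpc}

type
  TReg = array[0..3] of Single;  // one XMM register, lane 0 is the lowest

// Model of "movhlps Dst, Src": the two high lanes of Src are copied into
// the two low lanes of Dst; the high lanes of Dst are left unchanged.
procedure Movhlps(var Dst: TReg; const Src: TReg);
begin
  Dst[0] := Src[2];
  Dst[1] := Src[3];
end;

var
  Xmm0: TReg = (1, 2, 3, 4);  // X, Y, Z, W after addps
  Xmm1: TReg = (0, 0, 0, 0);
begin
  Movhlps(Xmm1, Xmm0);
  // The caller rebuilds the record from low xmm0 (X, Y) and low xmm1 (Z, W).
  WriteLn(Xmm0[0]:0:0, ' ', Xmm0[1]:0:0, ' ', Xmm1[0]:0:0, ' ', Xmm1[1]:0:0);
  // prints: 1 2 3 4
end.
```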

Re comparison operators: in the pure Pascal code, as I read it, every element must pass the comparison test; that was not happening in the asm when one element failed. It passed my tests with the following, which also avoids branching. Comments please before I change a lot of code.
Code: Pascal
    cmpps  xmm0, xmm1, cSSE_OPERATOR_LESS_OR_EQUAL
    movmskps eax, xmm0     // copies a 4-bit mask to eax
    xor eax, $f            // only 1111 is correct for ANDed compares
    setz al                // true if zero

Edit 1: Negate fails the tests. That mask is doing a multiply by -1, not setting all items negative as the Pascal code does. Though I suspect the Pascal code is wrong. I never had a use for setting all negative, whereas *-1 is vector reverse.


I've tested it: it works, but the result is wrong

 if v1 = v2 then Cells[1,25] := 'TRUE' else Cells[1,25] := 'FALSE';

The zero flag is not set under 64-bit, so it always returns TRUE, but under 32-bit your function is OK and returns the right result.

For negate you are right: under 64-bit the result is wrong; normally, in our sample, the sign of the Y value should change. Under 32-bit the function returns the correct result.
X*-1 is equal to 0 - X, so I chose the latter; Sub is normally faster than Mul.

I'm having some difficulty compiling the latest version of the unit from BeanzMaster - the GLZTypes unit has an awkward dependency on GLZVectorMath and others, since TGLZVector and TGLZVector2i are not defined.  It's easy enough to fix, but it means that GLZTypes is not self-contained.

Ouch, sorry, I forgot to delete the TGLZVectorX in GLZTypes; this unit is only used by GLZMath. This happened because I added MinXYZComponent and MaxXYZComponent, and these two use the functions Min3s and Max3s from the GLZMath unit. So you can just copy/paste these 2 functions into the GLZVectorMath unit and delete the dependency on the GLZMath unit. Or simply comment out the MinXYZ/MaxXYZComponent functions  :-[. This comes from my own project. Sorry   %)
« Last Edit: December 04, 2017, 03:59:01 pm by BeanzMaster »

CuriousKit

  • Jr. Member
  • **
  • Posts: 78
Re: AVX and SSE support question
« Reply #123 on: December 04, 2017, 06:58:33 pm »
It happens. In the meantime, I'm writing my own test kit, drawing on my experience in SQA. It might be additional effort since there are two test kits, but it means we get to doubly test your library for correctness and robustness, and I have a framework from which to include and test my vector array functions.

One minor difference though is that in my own library, the CPU capabilities are checked upon program initialisation (via CPUID) and the best procedures selected based on what's available, using function pointers and inline wrappers. It also allows me to test and compare performance in just one cycle of the test kit by manually selecting which version of SSE or AVX to use. Makes things more complicated and greatly increases code size, but ensures it works on all platforms while taking advantage of modern features if they're available.
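
The dispatch scheme described above can be sketched as follows. This is only an illustration under stated assumptions: `CpuHasFastPath` is a hypothetical placeholder for a real CPUID query, and `AddFast` stands in for an SSE or AVX routine by reusing the Pascal code.

```pascal
program DispatchSketch;
{$mode objfpc}

type
  TVec4 = record
    X, Y, Z, W: Single;
  end;
  TAddFunc = function(const A, B: TVec4): TVec4;

// Plain Pascal fallback that works on every CPU.
function AddPascal(const A, B: TVec4): TVec4;
begin
  Result.X := A.X + B.X;
  Result.Y := A.Y + B.Y;
  Result.Z := A.Z + B.Z;
  Result.W := A.W + B.W;
end;

// Stand-in for an SSE/AVX version; here it simply calls the fallback.
function AddFast(const A, B: TVec4): TVec4;
begin
  Result := AddPascal(A, B);
end;

// Hypothetical capability check; a real one would execute CPUID.
function CpuHasFastPath: Boolean;
begin
  Result := False;
end;

var
  VecAdd: TAddFunc;  // chosen once at start-up, called everywhere else

procedure SelectImplementations;
begin
  if CpuHasFastPath then
    VecAdd := @AddFast
  else
    VecAdd := @AddPascal;
end;

const
  A: TVec4 = (X: 1; Y: 2; Z: 3; W: 4);
  B: TVec4 = (X: 10; Y: 20; Z: 30; W: 40);
var
  C: TVec4;
begin
  SelectImplementations;
  C := VecAdd(A, B);
  WriteLn(C.X:0:0, ' ', C.Y:0:0, ' ', C.Z:0:0, ' ', C.W:0:0);  // 11 22 33 44
end.
```

Assigning `VecAdd` by hand instead of calling `SelectImplementations` is what allows one test run to compare several instruction-set versions.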

dicepd

  • Full Member
  • ***
  • Posts: 163
Re: AVX and SSE support question
« Reply #124 on: December 04, 2017, 08:45:58 pm »
Quote
For negate you are right: under 64-bit the result is wrong; normally, in our sample, the sign of the Y value should change. Under 32-bit the function returns the correct result.
X*-1 is equal to 0 - X, so I chose the latter; Sub is normally faster than Mul.

So are you saying the native Pascal code is wrong?

Quote
I've tested it: it works, but the result is wrong

 if v1 = v2 then Cells[1,25] := 'TRUE' else Cells[1,25] := 'FALSE';

The zero flag is not set under 64-bit, so it always returns TRUE, but under 32-bit your function is OK and returns the right result.

Ok I am not understanding your reply here. All I am doing is checking asm code reflects what the Pascal code does. I am currently only testing linux 64 bit and that asm code works in 64 bit for me.

Anyway all of this is getting confusing. So here is the code I am using for testing, you may find it useful.
It uses FPCUnit and has a GUI runner and a command-line runner for the tests. Basically, I have recreated the native class by copying the inc file with a quick rename of the class, so I can run Pascal and assembly side by side and do the comparison in the same code base. To test different compiler options you have to change the options in the test project at the moment, but it should be possible to automate this by building multiple copies from the command line with differing parameters to the compiler.

Alternatively, you could just copy the lpi and rename it to reflect the build options that lpi uses, for example win32SSE, win32AVX etc., and open that project for testing.

Hacking the inc file is manual atm, but as it is just a single search and replace of TGLZVector4f with TNativeGLZVector4f, a quick sed line in an automated test script would be all that is needed.

It is a bit hard-coded to sitting in a folder alongside the code to be tested, but that is nothing that can't be overcome. It is just a first attempt, and it does make tracking down issues much easier and more reliable than eyeballing results on a screen.

I have included your code in this, so hopefully it just works out of the box; that code contains the unix64 mods. There are only three failures in what's in the test script: negate, pnegate and reflect.

Of course I still have to finish off the comparisons, as I am unsure of what your answer above means.
« Last Edit: December 04, 2017, 09:23:49 pm by dicepd »

BeanzMaster

  • Sr. Member
  • ****
  • Posts: 268
Re: AVX and SSE support question
« Reply #125 on: December 05, 2017, 12:02:21 am »
Hi Peter,

Your unit test is magic, I did not know about this  :D I'm just adding Vector.ToString and FloatToStrF to see the results with my own eyes.

I ran some tests and corrected the Win64_SSE code.
All tests now pass here except 'AngleBetween' and 'reflect', but I think we can say that 'reflect' is correct.

Quote
Vector Reflects do not match : Native = (X: 171.54222 ,Y: 677.06671 ,Z: 489.74261 ,W: 107.84930) --> SSE = (X: 171.54224 ,Y: 677.06677 ,Z: 489.74265 ,W: 107.84931)
As you can see, the results are very, very close. For AngleBetween, SSE returns NaN  :'(

Quote
So are you saying the Native Pascal code is wrong?

Yes, in fact, for me the native code should behave like Invert or the class operator -.

These 2 pieces of code give me the same result now:
Code: Pascal
  procedure TNativeGLZVector4f.pNegate;
  begin
    //if Self.X>0 then
    Self.X := -Self.X;
    //if Self.Y>0 then
    Self.Y := -Self.Y;
    //if Self.Z>0 then
    Self.Z := -Self.Z;
    //if Self.W>0 then
    Self.W := -Self.W;
  end;

  procedure TGLZVector4f.pNegate; assembler; nostackframe; register;
  asm
    movaps xmm0,[RCX]
    xorps xmm0, [RIP+cSSE_MASK_NEGATE]
    movaps [RCX],xmm0
  End;

But I'm a little disturbed by this, because in my previous test the result matched the native code, as we can see here http://forum.lazarus.freepascal.org/index.php/topic,32741.msg267332.html#msg267332
on the 2nd screenshot (on the 1st screenshot the results are different)  :o so now I can't say exactly what the real correct result is.
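
The two behaviours being discussed can be put side by side in plain Pascal. This is a sketch under assumptions: the value of cSSE_MASK_NEGATE is not shown in the thread, so the code assumes it sets only the sign bit ($80000000) in every lane, and `ForceNegative` is a guess at what the native code did while the commented-out `if` lines were active.

```pascal
program NegateSketch;
{$mode objfpc}

// Flip the sign bit of one Single, which is what xorps with a
// $80000000 mask does to each lane (assumed mask value).
function FlipSign(V: Single): Single;
var
  Tmp: Single;
begin
  Tmp := V;
  PLongWord(@Tmp)^ := PLongWord(@Tmp)^ xor LongWord($80000000);
  Result := Tmp;
end;

// Variant that only touches positive values, so every lane ends up
// negative (the behaviour the commented-out "if" lines would give).
function ForceNegative(V: Single): Single;
begin
  if V > 0 then
    Result := -V
  else
    Result := V;
end;

begin
  WriteLn(FlipSign(2.5):0:1, ' ', FlipSign(-7.0):0:1);            // -2.5 7.0
  WriteLn(ForceNegative(2.5):0:1, ' ', ForceNegative(-7.0):0:1);  // -2.5 -7.0
end.
```

The two variants agree on positive inputs and differ on negative ones, which would explain tests passing or failing depending on the sample vector.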

I've also synchronized the EQUAL function with your UNIX64_SSE code and it works (not tested with SSE4, but it should work too).

Code: Pascal
  class operator TGLZVector4f.= (constref A, B: TGLZVector4f): boolean; assembler; nostackframe; register;
  asm
    movaps xmm1,[A]
    movaps xmm0,[B]
    {$IFDEF USE_ASM_SSE_4}
      cmpps xmm0,xmm1, cSSE_OPERATOR_EQUAL
      ptest    xmm0, xmm1
      jnz @no_differences
      mov [RESULT],FALSE
      jmp @END_SSE
    {$ELSE}
      cmpps  xmm0, xmm1, cSSE_OPERATOR_EQUAL    //  Yes: $FFFFFFFF, No: $00000000 ; 0 = Operator Equal
      movmskps  eax, xmm0
      test  eax, eax
      setnz al
    {$ENDIF}
  end;

Tomorrow, if I have the time, I'll run the test unit on Win32.

Many thanks and great work Peter, as always 8-)

dicepd

  • Full Member
  • ***
  • Posts: 163
Re: AVX and SSE support question
« Reply #126 on: December 05, 2017, 12:13:46 am »
Code: Pascal
    cmpps  xmm0, xmm1, cSSE_OPERATOR_EQUAL    //  Yes: $FFFFFFFF, No: $00000000 ; 0 = Operator Equal
    movmskps  eax, xmm0
    test  eax, eax
    setnz al

This code will return equal if only one item is equal, not only when all items are equal.
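
To make the difference concrete, here is a plain Pascal model of the two flag tests. It is a sketch, not the library code: `EqualMask` imitates cmpps (equal) followed by movmskps, and the sample vectors are the ones used for testing later in the thread.

```pascal
program MaskSketch;
{$mode objfpc}

type
  TVec4 = array[0..3] of Single;  // X, Y, Z, W

// Model of cmpps (equal) + movmskps: bit i is set when lane i matches.
function EqualMask(const A, B: TVec4): LongWord;
var
  i: Integer;
begin
  Result := 0;
  for i := 0 to 3 do
    if A[i] = B[i] then
      Result := Result or (LongWord(1) shl i);
end;

// "test eax, eax / setnz al": true as soon as one lane matched.
function AnyLane(Mask: LongWord): Boolean;
begin
  Result := Mask <> 0;
end;

// "xor eax, $F / setz al": true only when all four lanes matched.
function AllLanes(Mask: LongWord): Boolean;
begin
  Result := (Mask xor $F) = 0;
end;

const
  V1: TVec4 = (2, 7, -6, 3);
  V2: TVec4 = (1, 12, -6, 8);
var
  M: LongWord;
begin
  M := EqualMask(V1, V2);  // only Z matches, so M = 4
  WriteLn(M, ' ', AnyLane(M), ' ', AllLanes(M));  // 4 TRUE FALSE
  M := EqualMask(V1, V1);  // every lane matches, so M = 15
  WriteLn(M, ' ', AnyLane(M), ' ', AllLanes(M));  // 15 TRUE TRUE
end.
```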

Quote
Vector Reflects do not match : Native = (X: 171.54222 ,Y: 677.06671 ,Z: 489.74261 ,W: 107.84930) --> SSE = (X: 171.54224 ,Y: 677.06677 ,Z: 489.74265 ,W: 107.84931)

For this you can adjust the epsilon in the test,  as in

Code: Pascal
  Compare(nt1,vt1, 1e-5)

See definition of Compare
Code: Pascal
  function Compare(constref A: TNativeGLZVector4f; constref B: TGLZVector4f;Espilon: Single = 1e-10): boolean;

So you can override the resolution of the test.
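
The body of Compare is not shown in the thread, so the following is only a guess at the idea: a hypothetical `NearlyEqual` that accepts two vectors when every component differs by at most an absolute tolerance. The values are the Reflect results quoted above.

```pascal
program EpsilonSketch;
{$mode objfpc}

type
  TVec4 = record
    X, Y, Z, W: Single;
  end;

// Hypothetical tolerance compare: each component may differ by at most Eps.
function NearlyEqual(const A, B: TVec4; Eps: Single): Boolean;
begin
  Result := (Abs(A.X - B.X) <= Eps) and (Abs(A.Y - B.Y) <= Eps) and
            (Abs(A.Z - B.Z) <= Eps) and (Abs(A.W - B.W) <= Eps);
end;

const
  Native: TVec4 = (X: 171.54222; Y: 677.06671; Z: 489.74261; W: 107.84930);
  Sse:    TVec4 = (X: 171.54224; Y: 677.06677; Z: 489.74265; W: 107.84931);
begin
  WriteLn(NearlyEqual(Native, Sse, 1e-3));  // TRUE
  WriteLn(NearlyEqual(Native, Sse, 1e-6));  // FALSE
end.
```

Under this absolute-difference assumption the Y components differ by roughly 6e-5, so the tolerance would have to be larger than that for the Reflect test to pass; a relative tolerance is another option for values of this magnitude.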
« Last Edit: December 05, 2017, 01:32:03 am by dicepd »

dicepd

  • Full Member
  • ***
  • Posts: 163
Re: AVX and SSE support question
« Reply #127 on: December 05, 2017, 12:44:34 am »
Ok I just tested the code I provided before on win64 and it works for me.

Code: Pascal
    cmpps  xmm0, xmm1, cSSE_OPERATOR_LESS_OR_EQUAL    //  Yes: $FFFFFFFF, No: $00000000 ; 2 = Operator Less or Equal
    movmskps eax, xmm0
    xor eax, $F
    setz al

What gets returned in EAX is a mask of matched tests, with bit 0 for x, bit 1 for y, bit 2 for z and bit 3 for w. So you could get 0101 in EAX, which means x and z are less or equal but y and w are greater.

Though the test runner is so SLOW in windows.

Code:
  22:58:04 - Running All Tests
  22:58:16 - Number of executed tests: 61  Time elapsed: 00:00:12.436

compared to linux

Code:
  12:13:25 - Running All Tests
  12:13:26 - Number of executed tests: 61  Time elapsed: 00:00:00.149

BeanzMaster

  • Sr. Member
  • ****
  • Posts: 268
Re: AVX and SSE support question
« Reply #128 on: December 05, 2017, 03:44:12 pm »
Quote
Ok I just tested the code I provided before on win64 and it works for me.

Code: Pascal
    cmpps  xmm0, xmm1, cSSE_OPERATOR_LESS_OR_EQUAL    //  Yes: $FFFFFFFF, No: $00000000 ; 2 = Operator Less or Equal
    movmskps eax, xmm0
    xor eax, $F
    setz al

What gets returned in EAX is a mask of matched tests, with bit 0 for x, bit 1 for y, bit 2 for z and bit 3 for w. So you could get 0101 in EAX, which means x and z are less or equal but y and w are greater.


Works for me too :) For testing I used:
Code: Pascal
   vt1.Create(2,  7,  -6, 3);
   vt2.Create(1, 12,  -6, 8);

But there is still 1 failure --> Vector AngleBetweens do not match : 1.932 --> Nan

Quote
Though the test runner is so SLOW in windows.

Code:
  22:58:04 - Running All Tests
  22:58:16 - Number of executed tests: 61  Time elapsed: 00:00:12.436

compared to linux

Code:
  12:13:25 - Running All Tests
  12:13:26 - Number of executed tests: 61  Time elapsed: 00:00:00.149


Yes, on Windows it is slow:
Code:
  15:36:46 - Running All Tests
  15:36:47 - Number of executed tests: 61  Time elapsed: 00:00:00.859

I've started the tests on win32. I'm on the right track: there are some errors, but if I have enough time tonight I think I can correct them all.

dicepd

  • Full Member
  • ***
  • Posts: 163
Re: AVX and SSE support question
« Reply #129 on: December 05, 2017, 05:51:45 pm »
Ok Jerome,

Here is linux 64 with test harness for SSE SSE3 SSE4 and AVX.

Finished off the rest of the tests for comparison operators.

100% pass rate in all tests for linux 64 across all settings. I have placed -dUSE_ASM etc in project files so I do not have to comment/uncomment defines in the code. Just open the project and it all looks good with highlighting also showing that the right settings are there.

Not saying it is the most efficient atm, just that it works! and is a good starting point for fine tuning :D

Next on my list is to make timing tests in a similar manner, and a small framework for developing new functions. I have one function I want to get done, as my program spends 30-40% of its time in this one function, according to callgrind.

Code: Pascal
  function TCutPlane.GetNorm(cen, up, left, down, right: PAffineVector): TAffineVector;
  var
    s,t,u,v: TAffineVector;
  begin
    VectorSubtract(up^,cen^,s{%H-});
    VectorSubtract(left^,cen^,t{%H-});
    VectorSubtract(down^,cen^,u{%H-});
    VectorSubtract(right^,cen^,v{%H-});

    Result.X := s.Y*t.Z - s.Z*t.Y + t.Y*u.Z - t.Z*u.Y + u.Y*v.Z - u.Z*v.Y + v.Y*s.Z - v.Z*s.Y;
    Result.Y := s.Z*t.X - s.X*t.Z + t.Z*u.X - t.x*u.Z + u.Z*v.X - u.X*v.Z + v.Z*s.X - v.X*s.Z;
    Result.Z := s.X*t.Y - s.Y*t.X + t.X*u.Y - t.Y*u.X + u.X*v.Y - u.Y*v.X + v.X*s.Y - v.Y*s.X;
    NormalizeVector(Result);
  end;
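
Read term by term, the three long expressions above are the sum of four cross products, s×t + t×u + u×v + v×s, before normalisation. The sketch below checks that reading on a simple case; `TVec3`, `Vec`, `Cross` and `CrossSum` are illustrative names, not the GLScene types used in the function.

```pascal
program CrossSumSketch;
{$mode objfpc}

type
  TVec3 = record
    X, Y, Z: Single;
  end;

function Vec(X, Y, Z: Single): TVec3;
begin
  Result.X := X; Result.Y := Y; Result.Z := Z;
end;

// Standard cross product; each component matches one pair of terms
// in the GetNorm expressions.
function Cross(const A, B: TVec3): TVec3;
begin
  Result.X := A.Y*B.Z - A.Z*B.Y;
  Result.Y := A.Z*B.X - A.X*B.Z;
  Result.Z := A.X*B.Y - A.Y*B.X;
end;

// Unnormalised result of GetNorm: s x t + t x u + u x v + v x s.
function CrossSum(const S, T, U, V: TVec3): TVec3;
var
  P: array[0..3] of TVec3;
  i: Integer;
begin
  P[0] := Cross(S, T);
  P[1] := Cross(T, U);
  P[2] := Cross(U, V);
  P[3] := Cross(V, S);
  Result := Vec(0, 0, 0);
  for i := 0 to 3 do
  begin
    Result.X := Result.X + P[i].X;
    Result.Y := Result.Y + P[i].Y;
    Result.Z := Result.Z + P[i].Z;
  end;
end;

var
  N: TVec3;
begin
  // Centre at the origin with up, left, down, right neighbours in the XY plane.
  N := CrossSum(Vec(0, 1, 0), Vec(-1, 0, 0), Vec(0, -1, 0), Vec(1, 0, 0));
  WriteLn(N.Z:0:1);  // 4.0, the sum points along +Z
end.
```

Seeing the function as four cross products is what makes a shuffle-and-multiply SSE version a natural fit, since each cross product is two shuffled multiplies and a subtract.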

« Last Edit: December 05, 2017, 05:53:42 pm by dicepd »

dicepd

  • Full Member
  • ***
  • Posts: 163
Re: AVX and SSE support question
« Reply #130 on: December 05, 2017, 07:07:21 pm »
Quote
But always 1 failure --> Vector AngleBetweens do not match : 1.932 --> Nan

Pass me your updated win64 code and I will have a look. This one was tricky, as it is not a pure asm function and the parameter ordering is so different when it is not pure asm: most are on the stack and need a mov of the pointer into a register before loading into an XMM register.

dicepd

  • Full Member
  • ***
  • Posts: 163
Re: AVX and SSE support question
« Reply #131 on: December 05, 2017, 08:23:50 pm »
Ok I got AngleBetween working in win64.

Here is the code to load the XMM regs correctly.

Code: Pascal
    movaps xmm0,[RCX]       //self is still in rcx
    mov rax, [A]            // A is a pointer on the stack
    movups xmm1, [RAX]
    mov rax, [ACenterPoint] // ACenterPoint is a pointer on the stack
    movups xmm2, [RAX]
  7.  

Peter

dicepd

  • Full Member
  • ***
  • Posts: 163
Re: AVX and SSE support question
« Reply #132 on: December 05, 2017, 11:04:29 pm »
Best results so far for me, now that everything works. For SSE 2 I have found the best compiler flags are:

Quote
-CfSSE3
-Sv
-dUSE_ASM

Others seem to make no difference or make things worse (esp. COREAVX, avoid like the plague).

Some initial results, not final report style yet

Code:
  TimeAddNative:  : 0.222999695688486 seconds
  TimeAddAsm:     : 0.0509998993948102 seconds

  TimeSubNative:  : 0.219000270590186 seconds
  TimeSubAsm:     : 0.0520000699907541 seconds

  TimeMulNative:  : 0.220999983139336 seconds
  TimeMulAsm:     : 0.0520000699907541 seconds

not bad speedups for such simple routines.
« Last Edit: December 05, 2017, 11:07:37 pm by dicepd »

BeanzMaster

  • Sr. Member
  • ****
  • Posts: 268
Re: AVX and SSE support question
« Reply #133 on: December 06, 2017, 01:12:42 am »
I've finished all the tests on Win64 SSE/3/4 and AVX (I've also updated the AVX code a little, and synchronized Distance and Length with the SSE4 instructions).
I've also finished the tests on win32 SSE. I need to check SSE3/4, and I'll do the AVX tests tomorrow if I can and post the updated code. In the meantime:

Testunit result for 64bit
Code:
  00:56:38 - Running All Tests
  00:56:38 - Number of executed tests: 68  Time elapsed: 00:00:00.124

Testunit result for 32bit
Code:
  01:05:10 - Running All Tests
  01:05:11 - Number of executed tests: 68  Time elapsed: 00:00:00.155

much better now  :)

Just one thing I don't understand well is your trick with "movhlps xmm1,xmm0". It is an issue with the stack, but something escapes me. Can you explain it to me again?

For
Code: Pascal
  function TCutPlane.GetNorm(cen, up, left, down, right: PAffineVector): TAffineVector;

this is what I'm starting with:

Code: Pascal
  function GetNormFromCutPlane(cen, up, left, down, right: TGLZVector4f): TGLZVector4f;
  //  s,t,u,v: xmm2,xmm3, xmm4, xmm5
  asm
    movaps xmm2, [cen]  //s
    movaps xmm3, xmm2   //t
    movaps xmm4, xmm2   //u
    movaps xmm5, xmm2   //v

    //VectorSubtract(up^,cen^,s{%H-});
    movaps xmm1, [up]
    subps xmm2, xmm1
    //VectorSubtract(left^,cen^,t{%H-});
    movaps xmm1, [left]
    subps xmm3, xmm1
    //VectorSubtract(down^,cen^,u{%H-});
    movaps xmm1, [down]
    subps xmm4, xmm1
    //VectorSubtract(right^,cen^,v{%H-});
    movaps xmm1, [right]
    subps xmm5, xmm1

    andps xmm2, [RIP+cSSE_MASK_NO_W]
    andps xmm3, [RIP+cSSE_MASK_NO_W]
    andps xmm4, [RIP+cSSE_MASK_NO_W]
    andps xmm5, [RIP+cSSE_MASK_NO_W]

    //------------------------------------
    // X := s.Y*t.Z,
    // Y := s.Z*t.X,
    // Z := s.X*t.Y
    // S =   w,z,y,x
    // T = * -,x,z,y
    shufps xmm6, xmm3, 11001001b
    mulps xmm6,xmm2

    // X := s.Z*t.Y
    // Y := s.X*t.Z
    // Z := s.Y*t.X
    // S =   w,z,y,x
    // t = * -,y,x,z
    shufps xmm7, xmm3, 11010010b
    mulps xmm7,xmm2

    //xmm6 = w,x,z,y
    //xmm7 = w,y,x,z
    subps xmm6,xmm7
    movaps xmm0, xmm6
    //-------------------------------------

    //  xmm0        =      xmm6       +        xmm7         +         xmm8        +         xmm2
    //Result.X := (s.Y*t.Z - s.Z*t.Y) + (t.Y*u.Z - t.Z*u.Y) + (u.Y*v.Z - u.Z*v.Y) + (v.Y*s.Z - v.Z*s.Y);
    //Result.Y := (s.Z*t.X - s.X*t.Z) + (t.Z*u.X - t.x*u.Z) + (u.Z*v.X - u.X*v.Z) + (v.Z*s.X - v.X*s.Z);
    //Result.Z := (s.X*t.Y - s.Y*t.X) + (t.X*u.Y - t.Y*u.X) + (u.X*v.Y - u.Y*v.X) + (v.X*s.Y - v.Y*s.X);

    addps xmm0,xmm7
    addps xmm0,xmm8
    addps xmm0,xmm2

    //NormalizeVector(Result);
  end;

EDIT: I also tried to do the compare with the SSE4 PTEST instruction, but I don't know how to do it without a jump. I found an interesting article
https://stackoverflow.com/questions/34951714/simd-instructions-for-floating-point-equality-comparison-with-nan-nan but I don't understand all of it very well  :-[


« Last Edit: December 06, 2017, 01:15:08 am by BeanzMaster »

CuriousKit

  • Jr. Member
  • **
  • Posts: 78
Re: AVX and SSE support question
« Reply #134 on: December 06, 2017, 04:51:21 am »
Yeah, it's a little complex, but I think what they're trying to get at is that they combine the results of an IEEE equality (i.e. floating-point "is equal to") and an integer equality (what they call bitwise-equal, but is actually just checking two 32-bit integers for identical values, which are the bit representations of the floating-point numbers, including NaNs).

Intuitively, the results would be combined with logical OR (actually bitwise OR because the results are either all 0s or all 1s), but because of the results of CMPNEQPS and PCMPEQD, they spell out the truth table to prove that the combining operation is ANDN instead.

I'm not certain, but there might be a slight performance penalty if you switch between floating-point and integer processing within the same vector processing unit - this is why there are different opcodes for MOVDQA and MOVAPS, for example, even though they both move 128 bits from aligned memory into an XMM register.

Whether you need a jump or not depends on the code.  If you just need to set a result based on the zero flag, then you can use SETZ or SETNZ. There's no straight answer.

 
