Author Topic: AVX and SSE support question  (Read 89741 times)

dicepd

  • Full Member
  • ***
  • Posts: 163
Re: AVX and SSE support question
« Reply #105 on: December 03, 2017, 01:29:46 am »
OK, setting
Code: Pascal
  {$CODEALIGN RECORDMIN=16}

certainly breaks some things, so it may be promising.

« Last Edit: December 03, 2017, 01:36:58 am by dicepd »
Lazarus 1.8rc5 Win64 / Linux gtk2 64 / FreeBSD qt4

CuriousKit

  • Jr. Member
  • **
  • Posts: 78
Re: AVX and SSE support question
« Reply #106 on: December 03, 2017, 01:41:07 am »
Ummm, check that it isn't aligning the individual fields instead (I'm not sure whether 'packed' overrides it anyway, in which case it will coincidentally work since it will align the first field).

Actually, that is exactly what $ALIGN does.  Hmmm... what directive forces memory alignment for a particular type?
« Last Edit: December 03, 2017, 02:20:22 am by CuriousKit »

CuriousKit

  • Jr. Member
  • **
  • Posts: 78
Re: AVX and SSE support question
« Reply #107 on: December 03, 2017, 02:20:38 am »
I might have found the answer.  This topic might be of interest to you - it seems that there's a somewhat undocumented feature for records in Pascal that controls memory alignment: https://forum.lazarus.freepascal.org/index.php/topic,27400.msg169251.html#msg169251

dicepd

  • Full Member
  • ***
  • Posts: 163
Re: AVX and SSE support question
« Reply #108 on: December 03, 2017, 12:33:14 pm »
I have done some tests and, whatever I do, I cannot get xmm transfers to align. Unix 64 passes const vectors in two xmm registers and returns them in xmm0 and xmm1.

However in my attempts to get things aligned I found something very interesting. I went back to the small test app to play with just a single function.

One idea I had was the following, in case I can't get records to align. The test for alignment was to change
Code: Pascal
  MOVUPS XMM2, [V1]    // get V1 in one hit
  // to
  MOVAPS XMM2, [V1]    // get V1 in one hit

If the test seg faulted, the record was not aligned. Inspecting the rdi reg in that case always showed 0x0????[8|4], where we need the last digit to always be 0 for an aligned transfer.
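
A crash-free way to run the same check is to look at the address itself rather than waiting for MOVAPS to fault. This is only a minimal sketch: `IsAligned16`, `TVec4` and the program name are made-up names for illustration and are not part of the test app.

```pascal
program AlignCheck;
{$mode objfpc}

type
  TVec4 = packed array[0..3] of Single;

// True when P sits on a 16-byte boundary, i.e. the low four address bits
// are zero (the last hex digit of the address is 0).
function IsAligned16(P: Pointer): Boolean;
begin
  Result := (PtrUInt(P) and 15) = 0;
end;

var
  V: TVec4;
begin
  // Prints the address of V and whether MOVAPS could safely touch it.
  WriteLn('V at $', HexStr(@V), '  16-byte aligned: ', IsAligned16(@V));
end.
```

Calling `IsAligned16` on the vector's address before each MOVAPS experiment tells you which alignment flag combination actually took effect, without needing a debugger.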

So let's try making an aligned array; arrays have been around forever, so they must align.

Align flags used:

Code: Pascal
  {$CODEALIGN CONSTMIN=16}
  {$CODEALIGN VARMIN=16}
  {$CODEALIGN RECORDMAX=4}
  {$CODEALIGN LOCALMIN=16}

  {$define USE_ARRAY}
  {.$define USE_RECORD_V}

  {$ifdef USE_ARRAY}
  TGLZVector4f = packed array[0..3] of Single;
  {$else}
  TGLZVector4fType = packed array[0..3] of Single;
  TGLZVector4f = record
    case Byte of
      0: (V: TGLZVector4fType);
      1: (X, Y, Z, W: Single);
      //2: (AsVector3f : TGLZVector3f);
  end;
  {$endif}

That really surprised me: it gave a 4x speedup across the board, native and SSE.  :D
On the right track here, I thought. Checked the calling regs: hmm, still passing vectors in two regs both ways.

OK then, back to ConstRef to try a register-address movups / movaps. Nope, nothing aligned. Why the speedup? Also, would this invalidate using records? OK, let's test moving data around as TGLZVector4fType, thus keeping the record usage. That worked, with similar speedups to just using an array.

So here is the test harness, this code may not work on other platforms but you could just substitute the code with something that works on your abi.

I am going to try to find out why I get a four times improvement, I presume there must be code elsewhere which has changed to give this speedup.
« Last Edit: December 03, 2017, 12:50:22 pm by dicepd »

CuriousKit

  • Jr. Member
  • **
  • Posts: 78
Re: AVX and SSE support question
« Reply #109 on: December 03, 2017, 01:29:18 pm »
Don't underestimate the power of aligned memory. There's a reason why a lot of code segments are aligned on 16-byte boundaries, and not just the tops of procedures - if you look at the disassembled code of a for-loop, for example, you'll find that the top of the loop is aligned to a 16-byte boundary, with preceding bytes filled with NOP instructions if necessary.

(And it turns out that Free Pascal doesn't support the "align" modifier described in the link a few posts back.)
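
Where no directive does the job, one fallback that does not rely on compiler support is to over-allocate and round the pointer up by hand. A minimal sketch, assuming nothing beyond the System unit; `AlignUp16`, `TVec4` and `PVec4` are illustrative names only.

```pascal
program AlignedAlloc;
{$mode objfpc}

type
  TVec4 = packed array[0..3] of Single;
  PVec4 = ^TVec4;

// Rounds P up to the next multiple of 16; returns P unchanged if it is
// already on a 16-byte boundary.
function AlignUp16(P: Pointer): Pointer;
begin
  Result := Pointer((PtrUInt(P) + 15) and not PtrUInt(15));
end;

var
  Raw: Pointer;
  V: PVec4;
begin
  // 15 spare bytes guarantee room to move forward to a 16-byte boundary.
  GetMem(Raw, SizeOf(TVec4) + 15);
  V := PVec4(AlignUp16(Raw));
  V^[0] := 1.0;
  WriteLn('16-byte aligned: ', (PtrUInt(V) and 15) = 0);
  // Always release the pointer GetMem returned, not the adjusted one.
  FreeMem(Raw);
end.
```

The cost is up to 15 wasted bytes per allocation and the need to keep the raw pointer around for freeing, so this suits a few long-lived buffers better than many small vectors.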
« Last Edit: December 03, 2017, 01:35:42 pm by CuriousKit »

photor

  • New Member
  • *
  • Posts: 49
Re: AVX and SSE support question
« Reply #110 on: December 03, 2017, 01:38:38 pm »
got it :)

dicepd

  • Full Member
  • ***
  • Posts: 163
Re: AVX and SSE support question
« Reply #111 on: December 03, 2017, 02:13:48 pm »
Ok just tested this on win7 64 and it goes the other way by a very small margin. Not enough difference to make a real call on. Time to get trunk and see what happens in linux 64 there.

dicepd

  • Full Member
  • ***
  • Posts: 163
Re: AVX and SSE support question
« Reply #112 on: December 03, 2017, 02:44:03 pm »
OK, looking at the compiler sources, it seems we will never get parameters passed in a way that allows a single move. So getting alignment right and using movaps with a ConstRef will be the quickest for larger structures.

How the compiler classifies s128 floattype arguments
Code: Pascal
  s128real:
    begin
      classes[0].typ:=X86_64_SSE_CLASS;
      classes[0].def:=carraydef.getreusable_no_free(s32floattype,2);
      classes[1].typ:=X86_64_SSEUP_CLASS;
      classes[1].def:=carraydef.getreusable_no_free(s32floattype,2);
      result:=2;
    end;

so I do not know how Jerome is getting good returns in Win10?

AVX moves are going to be interesting

Code: Pascal
          else
            { 4 can only happen for _m256 vectors, not yet supported }
            internalerror(2010021501);
        end;
      end;
« Last Edit: December 03, 2017, 02:50:14 pm by dicepd »

dicepd

  • Full Member
  • ***
  • Posts: 163
Re: AVX and SSE support question
« Reply #113 on: December 03, 2017, 06:41:57 pm »
Right, finally got movaps everywhere in my test app.

We will need a lot of this sort of layout in any classes.

Code: Pascal
  Tform1 = Class(Tform)
    Button1: TButton;
    Label1 : Tlabel;
    Label2 : Tlabel;
    Label3 : Tlabel;
    procedure Button1Click(Sender: TObject);
    Procedure Formcreate(Sender : Tobject);
    Procedure Formshow(Sender : Tobject);
  Private
  Public
  {$CODEALIGN RECORDMIN=16}
  vt1,vt2, vt3 : TGLZVector4f;
  {$CODEALIGN RECORDMIN=4}
   Fs1,Fs2 : Single;
  {$CODEALIGN RECORDMIN=1}
  // .... whatever here, boolean etc
  End;

BeanzMaster

  • Sr. Member
  • ****
  • Posts: 268
Re: AVX and SSE support question
« Reply #114 on: December 03, 2017, 06:54:02 pm »
Hi to all

Quote
so I do not know how Jerome is getting good returns in Win10?

I can't answer, I'm just coding, so.........

Anyway, I've split all the asm code into 6 include files (one for each case: Linux/Windows, 32/64-bit, SSE/AVX), which makes it easier to read and to debug. I've also added 2 of my other units and begun some other small tests. I've corrected some small spelling mistakes and added a few comments.

I've tested 32-bit with Lazarus 1.8rc3 but some errors occurred:
1st, the clamp functions work but raise a SIGSEGV just after.
2nd, the function with a Single result: the result is stored in the ST register. I tried to set it with the FSTP instruction, but without success.

I've also added some conditional directives for alignment and replaced MOVUPS with MOVAPS, and it works. I've added 2 or 3 other small functions too, plus AngleBetween in asm, which is not tested yet.
Performance varies and depends on the compiler options and on how the record is declared (packed or not).
The best results I've got are with SSE4/SSE3, not with AVX, so I think the AVX results will be better with matrix manipulation.

Peter, I haven't included your change for Unix; I can't test it and don't know exactly where it goes.

I've also tested your sample. It works in 32-bit with Laz1.8rc3 but not in Laz1.8rc4 64-bit. The best result I had was with {$define USE_RECORD_V}.

Now I have a headache! Next I'll begin some tests with arrays, matrices and quaternions.

Quote
Request to BeanzMaster - as well as timing checks, can you also implement some verification in your benchmark program? I have a feeling that some functions return incorrect results. Failing that, I can possibly design something a little more in-depth once I've finished my current task.

Yes, later. One of the first things needed is a check for division by 0. Otherwise, compared to the native code, the results are good.


dicepd

  • Full Member
  • ***
  • Posts: 163
Re: AVX and SSE support question
« Reply #115 on: December 03, 2017, 07:05:13 pm »

Quote
I've tested 32-bit with Lazarus 1.8rc3 but some errors occurred:
1st, the clamp functions work but raise a SIGSEGV just after.
2nd, the function with a Single result: the result is stored in the ST register. I tried to set it with the FSTP instruction, but without success.

That is usually a sign of stack corruption, such as moving a whole 128-bit xmm reg when there is only space for 32 or 64 bits. Usually I have found that if the variable is on the stack in 32 bits, the stack contains a pointer and not the variable, so you need a

mov eax, stackedvar
mov [eax], xmm reg
 

BeanzMaster

  • Sr. Member
  • ****
  • Posts: 268
Re: AVX and SSE support question
« Reply #116 on: December 03, 2017, 08:32:09 pm »

Quote
That is usually a sign of stack corruption, such as moving a whole 128-bit xmm reg when there is only space for 32 or 64 bits. Usually I have found that if the variable is on the stack in 32 bits, the stack contains a pointer and not the variable, so you need a

mov eax, stackedvar
mov [eax], xmm reg

  mov ecx, RESULT
  mov [ecx], xmm0

is not working: vectormath_vector_win32_sse_imp.inc(269,5) Error: Asm: [mov mem??,xmmreg] invalid combination of opcode and operands

and this is what I have in the .s file:

Quote
.globl   GLZVECTORMATH$_$TGLZVECTOR4F_$__$$_DISTANCE$TGLZVECTOR4F$$SINGLE
GLZVECTORMATH$_$TGLZVECTOR4F_$__$$_DISTANCE$TGLZVECTOR4F$$SINGLE:
   # Register ebp allocated
# [258] Asm
   pushl   %ebp
   movl   %esp,%ebp
   leal   -4(%esp),%esp
# Var A located in register edx
# Var $self located in register eax
# Temp -4,4 allocated
# Var $result located at ebp-4, size=OS_F32
   # Register eax,ecx,edx allocated

Another example; this does not work either:

Code: Pascal
  class operator TGLZVector4f.+(constref A: TGLZVector4f; constref B:Single): TGLZVector4f;assembler; //nostackframe; register;
  asm
    movups xmm0,[A]
    movss  xmm1,[B]
    shufps xmm1, xmm1, $00
    addps  xmm0,xmm1
    movaps [RESULT], xmm0
  end;

Quote
# Var A located in register eax
# Var B located in register edx
# Var $result located in register ecx
   # Register eax,ecx,edx allocated

Those errors are boring  >:D So perhaps making an external object library with masm or nasm/yasm would be better than using the internal asm ???

dicepd

  • Full Member
  • ***
  • Posts: 163
Re: AVX and SSE support question
« Reply #117 on: December 04, 2017, 02:55:07 am »
Quote
Those errors are boring  >:D So perhaps making an external object library with masm or nasm/yasm would be better than using the internal asm ???

You still have to conform to Pascal calling conventions, so there is not much gain in doing so; you would probably spend more time trying to get your params to your lib correctly.

I am writing some test cases. Mark what is bad, carry on coding, and I'll try to sort out the 'annoying' errors.
« Last Edit: December 04, 2017, 02:56:48 am by dicepd »

dicepd

  • Full Member
  • ***
  • Posts: 163
Re: AVX and SSE support question
« Reply #118 on: December 04, 2017, 03:06:29 am »
As for this, I have got the following working in unix64; I think it should work for win64 too, going by previous testing.

Code: Pascal
  class operator TGLZVector4f.+(constref A: TGLZVector4f; constref B:Single): TGLZVector4f; assembler; nostackframe; register;
  asm
    movaps xmm0,[A]
    movss  xmm1,[B]
    shufps xmm1, xmm1, $00
    addps  xmm0,xmm1
    movhlps xmm1,xmm0
  end;

dicepd

  • Full Member
  • ***
  • Posts: 163
Re: AVX and SSE support question
« Reply #119 on: December 04, 2017, 03:30:33 am »
Re comparison operators: in the pure Pascal code, as I read it, every element must pass the comparison test. That was not happening in the asm in the case where one element failed. It passes my tests with the following, which also avoids branching. Comments please before I change a lot of code.
Code: Pascal
    cmpps  xmm0, xmm1, cSSE_OPERATOR_LESS_OR_EQUAL
    movmskps eax, xmm0     // copies a 4 bit mask to eax
    xor eax, $f            // only 1111 is correct for anded compares
    setz al                // true if zero
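
For checking the asm against a reference, the same all-lanes-must-pass rule can be written in plain Pascal. This sketch emulates what MOVMSKPS followed by `xor eax, $f` / `setz al` computes; `TVec4`, `CompareMaskLE` and `AllLessOrEqual` are hypothetical names, not the library's.

```pascal
program CompareAll;
{$mode objfpc}

type
  TVec4 = array[0..3] of Single;

// Builds the 4-bit mask MOVMSKPS would give after a less-or-equal CMPPS:
// bit i is set when lane i passed the comparison.
function CompareMaskLE(const A, B: TVec4): Cardinal;
var
  i: Integer;
begin
  Result := 0;
  for i := 0 to 3 do
    if A[i] <= B[i] then
      Result := Result or (Cardinal(1) shl i);
end;

// True only when all four lanes passed, i.e. mask xor $F is zero.
function AllLessOrEqual(const A, B: TVec4): Boolean;
begin
  Result := (CompareMaskLE(A, B) xor $F) = 0;
end;

const
  A: TVec4 = (1, 2, 3, 4);
  B: TVec4 = (1, 2, 3, 5);
  C: TVec4 = (0, 9, 9, 9);
begin
  WriteLn(AllLessOrEqual(A, B));  // every lane of A is <= the lane of B
  WriteLn(AllLessOrEqual(A, C));  // lane 0 fails (1 > 0), so the whole test fails
end.
```

Comparing the Boolean from the asm operator with `AllLessOrEqual` over a handful of vectors where exactly one lane fails is a quick way to catch the partial-pass case described above.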

Edit 1: Negate fails the tests because the mask does a multiply by -1 rather than setting all items negative as the Pascal code does. Though I suspect the Pascal code is wrong: I have never had a use for setting all items negative, whereas *-1 is vector reversal.
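
The two behaviours differ only in which bit operation is applied to the sign bit. A scalar sketch of the difference, assuming the asm negates by XORing each lane with a sign mask (the post does not show that code); `FlipSign` and `ForceNegative` are made-up names for illustration.

```pascal
program NegateKinds;
{$mode objfpc}

// Flips the sign bit, which is what XOR with a sign mask does per lane.
// Equivalent to multiplying by -1.
function FlipSign(X: Single): Single;
begin
  PCardinal(@X)^ := PCardinal(@X)^ xor $80000000;
  Result := X;
end;

// Sets the sign bit, which is what OR with a sign mask does per lane.
// Equivalent to -Abs(X).
function ForceNegative(X: Single): Single;
begin
  PCardinal(@X)^ := PCardinal(@X)^ or $80000000;
  Result := X;
end;

begin
  // FlipSign reverses each value: 2 becomes -2 and -2 becomes 2.
  WriteLn(FlipSign(2.0):0:1, ' ', FlipSign(-2.0):0:1);
  // ForceNegative leaves -2 alone and turns 2 into -2.
  WriteLn(ForceNegative(2.0):0:1, ' ', ForceNegative(-2.0):0:1);
end.
```

A test vector with mixed signs separates the two immediately, since they only agree on lanes that start out positive.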
« Last Edit: December 04, 2017, 04:21:24 am by dicepd »