
Author Topic: AVX and SSE support question  (Read 89920 times)

dicepd

  • Full Member
  • ***
  • Posts: 163
Re: AVX and SSE support question
« Reply #60 on: November 30, 2017, 09:57:46 am »
Ok I have looked a bit deeper into this and have found the following.

If the compiler puts any of the Single refs on the stack, then accessing it using movss returns garbage.

If all the single refs are allocated to registers then it works perfectly.

Now whether this is a bug or not I do not know. (I took one look at 8086 assembly in 1985, said f@*k this, and have tried to stay away from it since; I have done a lot of assembly for other processors with cleaner instruction sets.) So I am not that up on exact memory-addressing syntax, and having to read Intel syntax as written alongside the gcc-style generated output gets quite confusing.

So if you check the assembler output on Linux, which works:

Code: Pascal
.Lc664:
        leaq    -16(%rsp),%rsp
# Var V2 located in register rsi
# Var F1 located in register rdx
# Var F2 located in register rcx
# Var $self located in register rdi
# Temp -16,16 allocated
# Var $result located at rbp-16, size=OS_128
        # Register rax,rcx,rdx,rsi,rdi,r8,r9,r10,r11 allocated

# [3512] movss xmm3, [F2]
   movss        (%rcx),%xmm3

versus Windows:

Code: Pascal
# Var V2 located in register r8
# Var F1 located in register r9
# Var $self located in register rcx
# Var $result located in register rdx
.seh_endprologue
# Var F2 located at rbp+48, size=OS_64
        # Register rax,rcx,rdx,r8,r9,r10,r11 allocated
.Ll1456:
# [3495] vmovups xmm0,[RCX]
        vmovups (%rcx),%xmm0
.Ll1457:
# [3500] vmovups xmm1, [V2]
        vmovups (%r8),%xmm1
.Ll1458:
# [3502] movss xmm2, [F2]
        movss   48(%rbp),%xmm2
.Ll1459:

Maybe someone else has an idea why the latter falls down and puts garbage into the xmm register.

BTW, not using constref was a bad idea, as in certain optimisation modes the compiler uses xmm registers for procedure parameters.

Peter
« Last Edit: November 30, 2017, 10:16:06 am by dicepd »
Lazarus 1.8rc5 Win64 / Linux gtk2 64 / FreeBSD qt4

marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 11383
  • FPC developer.
Re: AVX and SSE support question
« Reply #61 on: November 30, 2017, 10:04:09 am »
You are loading the value of the stack slot itself. You should probably load through the pointer that is ON the stack.

so

Code: Pascal
movq 48(%rbp),%rax  // or whatever free register.
movss (%rax),%xmm2


dicepd

  • Full Member
  • ***
  • Posts: 163
Re: AVX and SSE support question
« Reply #62 on: November 30, 2017, 10:31:55 am »
Hi marcov,

I was kind of coming to that solution myself. So we have to be a little inefficient in the case where the compiler has already put the pointer in a register, by copying it to another register, so that we are sure to have the pointer in a register in the case where the compiler puts it on the stack.

Peter
« Last Edit: November 30, 2017, 11:21:36 am by dicepd »
Lazarus 1.8rc5 Win64 / Linux gtk2 64 / FreeBSD qt4

dicepd

  • Full Member
  • ***
  • Posts: 163
Re: AVX and SSE support question
« Reply #63 on: November 30, 2017, 10:53:30 am »
@Jerome

Here is TGLZAVXVector4f.Combine2 reworked to be safe against the issues from the previous posts. Test this, and if it works there is quite a bit of code jockeying to do :)

It gave the right answer on my win7 64 box.

Code: Pascal
function TGLZAVXVector4f.Combine2(constref V2: TGLZAVXVector4f; constref F1: Single; constref F2: Single): TGLZAVXVector4f; assembler;
asm
{$ifdef UNIX}
  {$ifdef CPU64}
     vmovups xmm0,[RDI]
  {$else}
     vmovups xmm0,[EDI]
  {$endif}
{$else}
  {$ifdef CPU64}
     vmovups xmm0,[RCX]
  {$else}
     vmovups xmm0,[ECX]
  {$endif}
{$endif}
{$ifdef CPU64}
  mov RAX, V2
  vmovups xmm1, [RAX]

  mov RAX, F1
  movss xmm2, [RAX]

  mov RAX, F2
  movss xmm3, [RAX]
{$else}
  mov EAX, V2
  vmovups xmm1, [EAX]

  mov EAX, F1
  movss xmm2, [EAX]

  mov EAX, F2
  movss xmm3, [EAX]
{$endif}

  vshufps xmm2, xmm2, xmm2, $00 // replicate
  vshufps xmm3, xmm3, xmm3, $00 // replicate

  vmulps xmm0, xmm0, xmm2  // Self * F1
  vmulps xmm1, xmm1, xmm3  // V2 * F2

  vaddps xmm0, xmm0, xmm1  // (Self * F1) + (V2 * F2)

  vandps xmm0, xmm0, [RIP+cSSE_MASK_NO_W]
  vmovups [RESULT], xmm0
end
{$ifdef CPU64}
 ['RAX',
{$else}
 ['EAX',
{$endif} 'xmm0', 'xmm1','xmm2','xmm3'];


Edit: Made this 32/64-bit safe and nicer to the compiler; this is getting very messy very quickly! :(


« Last Edit: November 30, 2017, 12:35:47 pm by dicepd »
Lazarus 1.8rc5 Win64 / Linux gtk2 64 / FreeBSD qt4

dicepd

  • Full Member
  • ***
  • Posts: 163
Re: AVX and SSE support question
« Reply #64 on: November 30, 2017, 11:11:31 am »
Ok I did a quick google around and it would seem there is no guaranteed way to force the compiler to put the parameters into a register.

Could someone confirm this please? It would be really good if I was wrong.

Peter
Lazarus 1.8rc5 Win64 / Linux gtk2 64 / FreeBSD qt4

marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 11383
  • FPC developer.
Re: AVX and SSE support question
« Reply #65 on: November 30, 2017, 11:27:44 am »
Quote
Ok I did a quick google around and it would seem there is no guaranteed way to force the compiler to put the parameters into a register.

Could someone confirm this please? It would be really good if I was wrong.

Free Pascal has no "create my own calling convention" option. Newer architectures (x86_64 explicitly included) frown upon this anyway; doing so is something from 16-bit DOS times.

I do notice that you use RAX outside of CPU64 conditionals. If you have a bunch of routines like this, maybe a few macros like

{$ifdef unix}
{$ifdef CPU64}
  {$define asmfirstparam:=rdi}
{$else}
  {$define asmfirstparam:=edi}
{$endif}
{$else}
{$ifdef CPU64}
  {$define asmfirstparam:=rcx}
{$else}
  {$define asmfirstparam:=ecx}
{$endif}
{$endif}

would reduce the size. It is Delphi-incompatible anyway because of constref.

dicepd

  • Full Member
  • ***
  • Posts: 163
Re: AVX and SSE support question
« Reply #66 on: November 30, 2017, 11:36:45 am »
Hi marcov,

A few pages back this was suggested, but macros do not expand inside asm blocks, from my testing :(
« Last Edit: November 30, 2017, 11:46:16 am by dicepd »
Lazarus 1.8rc5 Win64 / Linux gtk2 64 / FreeBSD qt4

BeanzMaster

  • Sr. Member
  • ****
  • Posts: 268
Re: AVX and SSE support question
« Reply #67 on: November 30, 2017, 03:00:12 pm »
Thanks guys. So I took a look at the generated .s file. We can see:

For the native Combine2 function:
Quote
.section .text.n_glzvectormath_new$_$tglznativevector4f_$__$$_combine2$crcf2601943,"x"
   .balign 16,0x90
.globl   GLZVECTORMATH_NEW$_$TGLZNATIVEVECTOR4F_$__$$_COMBINE2$crcF2601943
GLZVECTORMATH_NEW$_$TGLZNATIVEVECTOR4F_$__$$_COMBINE2$crcF2601943:
.Lc126:
# Temps allocated between rbp-16 and rbp+0
.seh_proc GLZVECTORMATH_NEW$_$TGLZNATIVEVECTOR4F_$__$$_COMBINE2$crcF2601943
   # Register rbp allocated
.Ll257:
# [960] begin
   pushq   %rbp
.seh_pushreg %rbp
.Lc128:
.Lc129:
   movq   %rsp,%rbp
.Lc130:
   leaq   -48(%rsp),%rsp
.seh_stackalloc 48
# Var V2 located in register r8
# Var F1 located in register r9
# Var F2 located in register rcx
# Var $self located in register rax
# Var $result located in register rdx
# Temp -16,16 allocated
.seh_endprologue
   # Register rcx,rdx,r8,r9,rax allocated

and for the SSE version:

Quote
.section .text.n_glzvectormath_new$_$tglzssevector4f_$__$$_combine2$tglzssevector4f$single$single$$tglzssevector4f,"x"
   .balign 16,0x90
.globl   GLZVECTORMATH_NEW$_$TGLZSSEVECTOR4F_$__$$_COMBINE2$TGLZSSEVECTOR4F$SINGLE$SINGLE$$TGLZSSEVECTOR4F
GLZVECTORMATH_NEW$_$TGLZSSEVECTOR4F_$__$$_COMBINE2$TGLZSSEVECTOR4F$SINGLE$SINGLE$$TGLZSSEVECTOR4F:
.Lc403:
.seh_proc GLZVECTORMATH_NEW$_$TGLZSSEVECTOR4F_$__$$_COMBINE2$TGLZSSEVECTOR4F$SINGLE$SINGLE$$TGLZSSEVECTOR4F
   # Register rbp allocated
.Ll832:
# [2233] asm
   pushq   %rbp
.seh_pushreg %rbp
.Lc405:
.Lc406:
   movq   %rsp,%rbp
.Lc407:
   leaq   -32(%rsp),%rsp
.seh_stackalloc 32
# Var V2 located in register r8
# Var F1 located in register r9
# Var $self located in register rcx
# Var $result located in register rdx
.seh_endprologue
# Var F2 located at rbp+48, size=OS_64
   # Register rax,rcx,rdx,r8,r9,r10,r11 allocated

And now for the SSE version with the nostackframe and register options:
function TGLZSSEVector4f.Combine2(constref V2: TGLZSSEVector4f; constref F1, F2: Single): TGLZSSEVector4f; assembler; nostackframe; register;
(same result without the register option)
Quote
.section .text.n_glzvectormath_new$_$tglzssevector4f_$__$$_combine2$tglzssevector4f$single$single$$tglzssevector4f,"x"
   .balign 16,0x90
.globl   GLZVECTORMATH_NEW$_$TGLZSSEVECTOR4F_$__$$_COMBINE2$TGLZSSEVECTOR4F$SINGLE$SINGLE$$TGLZSSEVECTOR4F
GLZVECTORMATH_NEW$_$TGLZSSEVECTOR4F_$__$$_COMBINE2$TGLZSSEVECTOR4F$SINGLE$SINGLE$$TGLZSSEVECTOR4F:
.Lc403:
# Var V2 located in register r8
# Var F1 located in register r9
# Var $self located in register rcx
# Var $result located in register rdx
# Var F2 located at rbp+48, size=OS_64
# [2233] asm
   # Register rax,rcx,rdx,r8,r9,r10,r11 allocated

Note we'll see the stack size is the problem, and that Self is in RAX in the native version; we can see the difference in the allocated registers.

Just adding begin..end around the asm..end solves the problem, and the F2 var is then correctly assigned to an XMM register.

My question: is it a compiler issue? Is it possible to increase the stack allocation size manually? I tried with $M but without success.

And I confirm what Peter said:
Quote
macros do not expand inside asm blocks

dicepd

  • Full Member
  • ***
  • Posts: 163
Re: AVX and SSE support question
« Reply #68 on: November 30, 2017, 03:11:24 pm »
Jerome,

Interesting reading here http://wiki.lazarus.freepascal.org/Win64/AMD64_API

It would seem only 4 params are ever put in registers, Self being the first for objects / advanced records / classes.

I am missing something about your stack size statement
Quote
note we'll see the stack size is the problem
I don't quite understand that bit.
Lazarus 1.8rc5 Win64 / Linux gtk2 64 / FreeBSD qt4

BeanzMaster

  • Sr. Member
  • ****
  • Posts: 268
Re: AVX and SSE support question
« Reply #69 on: November 30, 2017, 04:09:19 pm »
Quote
note we'll see the stack size is the problem
don't quite understand that bit.

Hi, sorry for my bad English, or my misunderstanding :-[
In the native function we can see .seh_stackalloc 48, and in the SSE version .seh_stackalloc 32.

So I've made another little test without an advanced record, like this (check the comments):

Code: Pascal
Type

  { Tform1 }
  TGLZVector4fType = packed array[0..3] of Single;
  TGLZVector4f = packed record
      case Byte of
      0: (V: TGLZVector4fType);
      1: (X, Y, Z, W: Single);
      //2: (AsVector3f : TGLZVector3f);
  End;

  Tform1 = Class(Tform)
    Label1 : Tlabel;
    Label2 : Tlabel;
    Label3 : Tlabel;
    Procedure Formcreate(Sender : Tobject);
    Procedure Formshow(Sender : Tobject);
  Private

  Public
    vt1,vt2 : TGLZVector4f;
    Fs1,Fs2 : Single;
  End;

Var
  Form1 : Tform1;

Implementation

{$R *.lfm}

Const cSSE_MASK_NO_W : array [0..3] of UInt32 = ($FFFFFFFF, $FFFFFFFF, $FFFFFFFF, $00000000);

function CreateVector4f(Const aX,aY,aZ,aW : Single):TGLZVector4f;
begin
   Result.X := aX;
   Result.Y := aY;
   Result.Z := aZ;
   Result.W := aW;
end;

function Vector4fToString(aVector:TGLZVector4f) : String;
begin
   Result := '(X: '+FloatToStrF(aVector.X,ffFixed,5,5)+
             ' ,Y: '+FloatToStrF(aVector.Y,ffFixed,5,5)+
             ' ,Z: '+FloatToStrF(aVector.Z,ffFixed,5,5)+
             ' ,W: '+FloatToStrF(aVector.W,ffFixed,5,5)+')';
End;

function NativeCombine2(Const V1, V2: TGLZVector4f; Const F1, F2: Single): TGLZVector4f;
begin
   Result.X := (V1.X*F1) + (V2.X*F2);
   Result.Y := (V1.Y*F1) + (V2.Y*F2);
   Result.Z := (V1.Z*F1) + (V2.Z*F2);
   Result.W := 0;
end;

function SSECombine2(Const V1, V2: TGLZVector4f; Const F1, F2: Single): TGLZVector4f; assembler;
asm
  movups xmm0, [V1]
  movups xmm1, [V2]
  movss xmm2, F1     //--> unit1.pas(97,15) Warning: Check size of memory operand "movss: memory-operand-size is 32 bits, but expected [128 bits]"
  //movlps xmm2, F1  //--> unit1.pas(97,3) Error: Asm: [movlps xmmreg,xmmreg] invalid combination of opcode and operands

  //movlps xmm3, F2  //--> NO WARNING, NO ERROR with MOVSS xmm2, F1. But wrong result
  movss xmm3, F2     //--> unit1.pas(99,15) Warning: Check size of memory operand "movss: memory-operand-size is 32 bits, but expected [128 bits]"

  shufps xmm2, xmm2, $00 // replicate
  shufps xmm3, xmm3, $00 // replicate

  mulps xmm0, xmm2  // V1 * F1
  mulps xmm1, xmm3  // V2 * F2

  addps xmm0, xmm1  // (V1 * F1) + (V2 * F2)

  andps xmm0, [RIP+cSSE_MASK_NO_W]
  movups [RESULT], xmm0
end;

function AVXCombine2(Const V1, V2: TGLZVector4f; Const F1, F2: Single): TGLZVector4f; assembler;
asm
  vmovups xmm0, [V1]
  vmovups xmm1, [V2]
  //vmovss xmm2, F1   //--> unit1.pas(118,3) Error: Asm: [vmovss xmmreg,xmmreg] invalid combination of opcode and operands
  //vmovlps xmm2, F1  //--> unit1.pas(119,3) Error: Asm: [vmovlps xmmreg,xmmreg] invalid combination of opcode and operands
  //vmovlps xmm3, F2  // Same error here, also with vmovss
  //vmovups xmm3, F1  //--> unit1.pas(122,17) Warning: Check size of memory operand "vmovups: memory-operand-size is 32 bits, but expected [128 bits]"
  //vmovups xmm3, F2  //--> Idem above
  // ALL ABOVE GIVE WRONG RESULTS

  movss xmm2, [F1]   //--> Using the SSE instruction gives a good result, but always a warning
  movss xmm3, [F2]

  vshufps xmm2, xmm2, xmm2, $00 // replicate
  vshufps xmm3, xmm3, xmm3, $00 // replicate

  vmulps xmm0, xmm0, xmm2  // V1 * F1
  vmulps xmm1, xmm1, xmm3  // V2 * F2

  vaddps xmm0, xmm0, xmm1  // (V1 * F1) + (V2 * F2)

  vandps xmm0, xmm0, [RIP+cSSE_MASK_NO_W]
  vmovups [RESULT], xmm0
end;

{ Tform1 }

Procedure Tform1.Formcreate(Sender : Tobject);
Begin
  vt1 := CreateVector4f(5.850,-15.480,8.512,1.5);
  vt2 := CreateVector4f(1.558,6.512,4.525,1.0);
  Fs1 := 1.5;
  Fs2 := 5.5;
End;

Procedure Tform1.Formshow(Sender : Tobject);
Begin
  Label1.Caption := Vector4fToString(NativeCombine2(Vt1,Vt2,Fs1,Fs2));
  Label2.Caption := Vector4fToString(SSECombine2(Vt1,Vt2,Fs1,Fs2));
  Label3.Caption := Vector4fToString(AVXCombine2(Vt1,Vt2,Fs1,Fs2));
End;

Now, surrounding the asm..end block with begin..end solves the problem, but there are still warnings. Except for AVX: VMOVSS always returns an error, so the solution is to use MOVSS.
And now there are only warnings, and the results are correct.

So I think I'll just surround those 2 functions in the advanced record, or simply make them inline without ASM code and use operators: Result := (V1*F1) + (V2*F2); In any case those 2 functions are not really important and not used often in GLScene. So, I'll see later. And I keep in mind that the maximum number of args is 2 + the Self.

dicepd

  • Full Member
  • ***
  • Posts: 163
Re: AVX and SSE support question
« Reply #70 on: November 30, 2017, 04:18:07 pm »
unit1.pas(97,15) Warning: Check size of memory operand "movss: memory-operand-size is 32 bits, but expected [128 bits]

This warning is just wrong; there is nothing wrong with a movss moving only 32 bits. It is a really bad warning from the compiler and should just be suppressed with {%H-} wherever it is encountered.
Lazarus 1.8rc5 Win64 / Linux gtk2 64 / FreeBSD qt4

dicepd

  • Full Member
  • ***
  • Posts: 163
Re: AVX and SSE support question
« Reply #71 on: November 30, 2017, 06:13:07 pm »
Hi Jerome,

I have been playing around with this some more on Linux. I can get SSECombine2 down to just the following (from your little test code):

Code: Pascal
function SSECombine2(constref V1, V2: TGLZVector4f; Const F1, F2: Single): TGLZVector4f; assembler;
asm
  movups xmm2, [V1]
  movups xmm3, [V2]

  shufps xmm0, xmm0, $00 // replicate F1
  shufps xmm1, xmm1, $00 // replicate F2

  mulps xmm2, xmm0  // V1 * F1
  mulps xmm3, xmm1  // V2 * F2

  addps xmm2, xmm3  // (V1 * F1) + (V2 * F2)

  andps xmm2, [RIP+cSSE_MASK_NO_W]
  movups [RESULT], xmm2
end;

whereas the optimum for Windows would be:

Code: Pascal
function SSECombine2(constref V1, V2: TGLZVector4f; Const F1, F2: Single): TGLZVector4f; assembler;
asm
  movups xmm0, [V1]
  movups xmm1, [V2]
  movss xmm2, [F2{%H-}]  // F2 is the 5th slot, so it comes from the stack

  shufps xmm3, xmm3, $00 // replicate F1 (arrives in xmm3, the 4th Win64 slot)
  shufps xmm2, xmm2, $00 // replicate F2

  mulps xmm0, xmm3  // V1 * F1
  mulps xmm1, xmm2  // V2 * F2

  addps xmm0, xmm1  // (V1 * F1) + (V2 * F2)

  andps xmm0, [RIP+cSSE_MASK_NO_W]
  movups [RESULT], xmm0
end;


Might it not be better to have two include files for the implementation, one Linux-specific and one Windows-specific, each optimized according to its respective ABI?

Peter
« Last Edit: November 30, 2017, 07:38:43 pm by dicepd »
Lazarus 1.8rc5 Win64 / Linux gtk2 64 / FreeBSD qt4

marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 11383
  • FPC developer.
Re: AVX and SSE support question
« Reply #72 on: November 30, 2017, 07:43:40 pm »
FWIW, while this thread was running I've been playing with SSE too over the past two weeks (albeit in Delphi, since it was for work), so I thought I'd post some code.

It is more of an integer SSSE3 routine, rotating a block of 8x8 bytes with a loop around it for a bit of loop tiling.  See rot 8x8 here.

The related stackoverflow thread is at why does haswell+ suck?

CuriousKit

  • Jr. Member
  • **
  • Posts: 78
Re: AVX and SSE support question
« Reply #73 on: November 30, 2017, 09:46:48 pm »
On the subject, I wrote a load of SSE, AVX and FMA routines, primarily for graphics programming: taking an array of vectors and transforming them by a 4x4 matrix, for example. Would any of those be useful for your collection or for Lazarus in general? There's still some room for improvement though, since I don't take advantage of memory alignment.

I know the topic is mostly on compiler support and optimisation, but is it worth having some standardised vector and matrix functions that make use of SSE and AVX if available? I know FPC has some 2, 3 and 4-component vector and matrix functions, but they're very generalised and not particularly fast when dealing with large datasets.

BeanzMaster

  • Sr. Member
  • ****
  • Posts: 268
Re: AVX and SSE support question
« Reply #74 on: November 30, 2017, 10:50:35 pm »
Hi to all,

First:
You move the value to the stack. You should probably move to the pointer that is ON the stack.

so

Code: Pascal
movq 48(%rbp),%rax  // or whatever free register.
movss (%rax),%xmm2

I tried

Code: Pascal
  mov r12, [rbp+48]  // GLZVectorMath_NEW.pas(2264,19) Warning: Use of +offset(%ebp) for parameters invalid here
  movss xmm3, r12    // GLZVectorMath_NEW.pas(2265,3) Error: Asm: [movss xmmreg,reg64] invalid combination of opcode and operands

It doesn't work (I've also tried with rax),

and

Code: Pascal
  movss xmm3, [RBP+48]  // GLZVectorMath_NEW.pas(2265,21) Warning: Use of +offset(%ebp) for parameters invalid here

This compiles, but the result is wrong.

Hi Jerome,

I have been playing around with this some more in Linux. I can get the SSECombine2 down to just the following (from your little test code)

whereas the optimum for windows would be

Code: Pascal
function SSECombine2(constref V1, V2: TGLZVector4f; Const F1, F2: Single): TGLZVector4f; assembler;
asm
  movups xmm0, [V1]
  movups xmm1, [V2]
  movss xmm2, [F2{%H-}]

  shufps xmm3, xmm3, $00 // replicate F1
  shufps xmm2, xmm2, $00 // replicate F2

  mulps xmm0, xmm3  // V1 * F1
  mulps xmm1, xmm2  // V2 * F2

  addps xmm0, xmm1  // (V1 * F1) + (V2 * F2)

  andps xmm0, [RIP+cSSE_MASK_NO_W]
  movups [RESULT], xmm0
end;

It works, but not in the advanced record:

Quote
# Var V2 located in register r8
# Var F1 located in register r9
# Var $self located in register rcx
# Var $result located in register rdx
.seh_endprologue
# Var F2 located at rbp+48, size=OS_64

Actually the only thing that solves the problem is surrounding the asm..end block with a begin..end :'(

Code: Pascal
function TGLZSSEVector4f.Combine2(constref V2: TGLZSSEVector4f; constref F1, F2: Single): TGLZSSEVector4f;
Begin
  asm
  {$ifdef UNIX}
    {$ifdef CPU64}
       movups xmm0,[RDI]
    {$else}
       movups xmm0,[EDI]
    {$endif}
  {$else}
    {$ifdef CPU64}
       movups xmm0,[RCX]
    {$else}
       movups xmm0,[EAX]
    {$endif}
  {$endif}
    movups xmm1, [V2]

    movlps xmm2, [F1]
    movlps xmm3, [F2]

    shufps xmm2, xmm2, $00 // replicate
    shufps xmm3, xmm3, $00 // replicate

    mulps xmm0, xmm2  // Self * F1
    mulps xmm1, xmm3  // V2 * F2

    addps xmm0, xmm1  // (Self * F1) + (V2 * F2)
    {$IFDEF CPU64}
      andps xmm0, [RIP+cSSE_MASK_NO_W]
    {$ELSE}
      andps xmm0, [cSSE_MASK_NO_W]
    {$ENDIF}
    movups [RESULT], xmm0 // If I remember my last test, this line is not needed in 32-bit, because the result is stored in xmm0
  end;
End;

Might it not be better to have two inc files for the implementation which are linux and win specific and can be optimized according to their respective abis?

Yes, I think so too; I'll probably make 2 include files in the final unit.

FWIW, while this thread was running, I've been playing with SSE (albeit in Delphi, since for work) too in the past two weeks, so I thought I post some code.

It is more of an integer SSSE3 routine, rotating a block of 8x8 bytes with a loop around it for a bit of loop tiling.  See rot 8x8 here.

The related stackoverflow thread is at why does haswell+ suck?

Very interesting, but I don't understand it all yet :-[
It would be very interesting to make some tests with bitmaps.

On the subject, I wrote a load of SSE, AVX and FMA routines primarily for graphics programming, namely taking an array of vectors and transforming them by a 4x4 matrix, for example.  Would any of those be useful for your collection or for Lazarus in general?  There's still some room for improvement though, since I don't take advantage of memory alignment.

I know the topic is mostly on compiler support and optimisation, but is it worth having some standardised vector and matrix functions that make use of SSE and AVX if available? I know FPC has some 2, 3 and 4-component vector and matrix functions, but they're very generalised and not particularly fast when dealing with large datasets.

Yes, it's welcome; your code could help a lot (and not only me, I'm sure). Perhaps, if you agree, I'll try to implement your functions in GLScene and in my own project (a new GLScene, with its own fast bitmap management, and support for OpenGL Core and Vulkan 8) )

Cheers


 
