Author Topic: AVX and SSE support question  (Read 89738 times)

BeanzMaster

  • Sr. Member
  • ****
  • Posts: 268
Re: AVX and SSE support question
« Reply #150 on: December 09, 2017, 03:17:04 pm »
Anyway, my priority is to finish the test. I see you have added one of the features I had planned (Gain in %); the other I want to add is reporting the accuracy to how many decimal places. That is probably more important with larger routines than the ones we are doing now.

Yes, that's clear. You are right, I haven't tested Distance and Length yet.

Here is Distance:

Code: Pascal
function TGLZVector4f.Distance(constref A: TGLZVector4f):Single;assembler; nostackframe; register;
Asm
  vmovaps xmm0,[RCX]
  vsubps xmm0, xmm0, [A]   //xmm1
  vmulps xmm0, xmm0, xmm0
  vmovss xmm1, [RCX]8
  vmovss xmm2, [A]8
  vsubps xmm1, xmm1, xmm2
  vmulps xmm1, xmm1, xmm1
  vaddps xmm0, xmm0, xmm1
  vhaddps xmm0, xmm0, xmm0
  vsqrtss xmm0, xmm0, xmm0
end;

Vector Op Distance, 0.201000, 0.073001, 63.681 %

dicepd

  • Full Member
  • ***
  • Posts: 163
Re: AVX and SSE support question
« Reply #151 on: December 09, 2017, 04:38:39 pm »
That might not pass the test for win64. I would try:

Code: Pascal
function TGLZVector4f.Distance(constref A: TGLZVector4f):Single;assembler; nostackframe; register;
Asm
{$ifdef TEST}
  vmovq xmm0, [rcx]         // move 64 bits and clear top  x,y,0,0
  vmovq xmm1, [A]           // move 64 bits and clear top  x1,y1,0,0
  vsubps xmm0, xmm0, xmm1   // x-x1,y-y1,0,0
  vmulps xmm0, xmm0, xmm0   // (x-x1)^2,(y-y1)^2,0,0
  vmovss xmm1, [rcx]8       // z,0,0,0
  vmovss xmm2, [A]8         // z1,0,0,0
  vsubps xmm1, xmm1, xmm2   // z-z1,0,0,0
  vmulps xmm1, xmm1, xmm1   // (z-z1)^2,0,0,0
  vaddps xmm0, xmm0, xmm1   // (x-x1)^2+(z-z1)^2, (y-y1)^2, 0, 0
  vhaddps xmm0, xmm0, xmm0  // lane 0 = (x-x1)^2+(z-z1)^2+(y-y1)^2, lane 1 = 0 (lanes 2,3 repeat lanes 0,1)
  vsqrtss xmm0, xmm0, xmm0  // square root of lane 0 = distance

vmovq should be quicker as it only moves 64 bits. It is one trick to avoid having to load a no-W mask, which is what I was trying to do in the first place. It does not matter so much for this routine, but where we need to return a 0 in W it might be a quicker option than loading a mask. There may also be a cost from switching the pipeline between integer and float; I think I read there are some penalties there.
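For checking either assembler version, a plain Pascal reference is handy. The sketch below assumes, as the register comments above suggest, that Distance uses only X, Y and Z and ignores W. TDemoVector4f and NativeDistance are hypothetical names made up for this example, not the library's own types.

```pascal
program NativeDistanceDemo;
{$mode objfpc}

type
  // Hypothetical stand-in for the thread's 4-float vector record.
  TDemoVector4f = record
    X, Y, Z, W: Single;
  end;

function MakeVec(AX, AY, AZ, AW: Single): TDemoVector4f;
begin
  Result.X := AX;
  Result.Y := AY;
  Result.Z := AZ;
  Result.W := AW;
end;

// Euclidean distance over X, Y, Z only; W is deliberately ignored.
function NativeDistance(const A, B: TDemoVector4f): Single;
var
  dx, dy, dz: Single;
begin
  dx := A.X - B.X;
  dy := A.Y - B.Y;
  dz := A.Z - B.Z;
  Result := Sqrt(dx * dx + dy * dy + dz * dz);
end;

begin
  // A 3-4-5 triangle in the XY plane; the W values differ on purpose.
  WriteLn(NativeDistance(MakeVec(1, 2, 3, 9), MakeVec(4, 6, 3, -5)):0:3);
end.
```

The two W values are chosen to be far apart so that a version which accidentally folds W into the sum gives a visibly different answer.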
« Last Edit: December 09, 2017, 04:57:26 pm by dicepd »
Lazarus 1.8rc5 Win64 / Linux gtk2 64 / FreeBSD qt4

CuriousKit

  • Jr. Member
  • **
  • Posts: 78
Re: AVX and SSE support question
« Reply #152 on: December 09, 2017, 09:33:07 pm »
I have a question about the comparison operators. Should they return true only if all of the elements are true? (e.g. Input1 = Input2 only if all the elements match, and Input1 < Input2 only if all the elements in Input1 are smaller than those in Input2) Or are they designed to return true if at least one of the elements is equal, for example?
« Last Edit: December 09, 2017, 09:37:34 pm by CuriousKit »

dicepd

  • Full Member
  • ***
  • Posts: 163
Re: AVX and SSE support question
« Reply #153 on: December 09, 2017, 10:13:20 pm »
Quote from: CuriousKit
I have a question about the comparison operators. Should they return true only if all of the elements are true? (e.g. Input1 = Input2 only if all the elements match, and Input1 < Input2 only if all the elements in Input1 are smaller than those in Input2) Or are they designed to return true if at least one of the elements is equal, for example?

From what I can see in the Pascal code, all elements must pass the test, so in linux64 and 32 they all do.
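The "all elements must pass" rule can be sketched in plain Pascal as follows. This is only an illustration of the semantics described here, not the library's code; TDemoVector4f, AllEqual and AllLess are hypothetical names.

```pascal
program CompareAllDemo;
{$mode objfpc}

type
  // Hypothetical stand-in for the thread's 4-float vector record.
  TDemoVector4f = record
    X, Y, Z, W: Single;
  end;

function MakeVec(AX, AY, AZ, AW: Single): TDemoVector4f;
begin
  Result.X := AX;
  Result.Y := AY;
  Result.Z := AZ;
  Result.W := AW;
end;

// True only when every element of A equals the matching element of B.
function AllEqual(const A, B: TDemoVector4f): Boolean;
begin
  Result := (A.X = B.X) and (A.Y = B.Y) and (A.Z = B.Z) and (A.W = B.W);
end;

// True only when every element of A is smaller than the matching element of B.
function AllLess(const A, B: TDemoVector4f): Boolean;
begin
  Result := (A.X < B.X) and (A.Y < B.Y) and (A.Z < B.Z) and (A.W < B.W);
end;

var
  P, Q: TDemoVector4f;
begin
  P := MakeVec(1, 2, 3, 4);
  Q := MakeVec(1, 2, 3, 5);
  WriteLn('P = Q: ', AllEqual(P, Q));
  WriteLn('P < Q: ', AllLess(P, Q));
end.
```

With these inputs both results are FALSE: only W differs, so the vectors are not equal, and only W is smaller, so P is not "less" either. An SSE version has to reduce its per-lane compare mask the same way, requiring every lane to be set rather than any lane.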

CuriousKit

  • Jr. Member
  • **
  • Posts: 78
Re: AVX and SSE support question
« Reply #154 on: December 09, 2017, 10:17:44 pm »
I haven't finished my test kit yet and I still need to make the output a bit friendlier, but currently it fails for the = operator when compiled for SSE2 and SSE4.1, hence why I asked (haven't tested AVX yet).  Find attached said test kit.

(I just hope it compiles!)
« Last Edit: December 09, 2017, 10:27:44 pm by CuriousKit »

dicepd

  • Full Member
  • ***
  • Posts: 163
Re: AVX and SSE support question
« Reply #155 on: December 09, 2017, 10:29:12 pm »
Ok here is the latest version. It combines functional and timing tests along with a framework for testing code which you are trying to improve.

Enclose your new test code with

Code: Pascal  [Select][+][-]
  1. {$ifdef TEST}
  2. newcode here
  3. {$else}
  4. leave old code here
  5. {$endif}

There is a set of build modes; each comes with a _TEST variant, which should have the same flags as the build mode you are testing / developing for, along with a -dTEST flag to trigger the new code.
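As a minimal sketch of how the switch behaves, the program below picks one of two bodies depending on whether TEST is defined. SquareSum is a made-up function for illustration; only the {$ifdef TEST} pattern and the -dTEST flag come from the description above.

```pascal
program IfdefTestDemo;
{$mode objfpc}

// Build normally for the old code, or with "fpc -dTEST" for the new code.
function SquareSum(X, Y: Single): Single;
begin
{$ifdef TEST}
  // new code being tried out
  Result := X * X + Y * Y;
{$else}
  // old, known-good code stays here
  Result := Sqr(X) + Sqr(Y);
{$endif}
end;

begin
{$ifdef TEST}
  WriteLn('Path: TEST');
{$else}
  WriteLn('Path: baseline');
{$endif}
  WriteLn(SquareSum(3, 4):0:1);
end.
```

Both builds should print the same value; only the reported path changes, which is what makes a side-by-side timing comparison meaningful.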

There is a set of string values in config.inc which will report the flags used, if set properly. It should be fairly self-explanatory if you take a look in config.inc. This may look like a hassle, but trying to remember where your numbers came from is even worse. You should be able to get reproducible numbers and have some confidence in any improvements. If you are working in just one codebase (linux64) you could leave your test code in until it has been tested/transferred to other platforms.

What is there at the moment is just a base line. Create new build modes as required.

Full functional testing (the first three groups in the test harness) takes only a few tens of milliseconds, ~57 on my machine. Timing tests obviously take a bit longer, but they are all selectable, so timing the routine you are working on along with all functional tests will take half a second at the most.



dicepd

  • Full Member
  • ***
  • Posts: 163
Re: AVX and SSE support question
« Reply #156 on: December 10, 2017, 02:37:43 pm »
Ok this is now getting to where it should be usable for development.

I really like advanced records, they have allowed me to eat my own dog food and get to an environment where I can code my itches without polluting Beanz code base, and offer code back that may or may not be used but I can still use it alongside.

Using record helpers has meant I can do all coding for a function in one source file and test that functionality using one test file. This works in the unit test environment. Included in this release is a template.

Intended workflow is:
Copy and rename both glz_template_code.pas and glz_template_test_cases.pas to whatever floats your boat.
Make the unit name match the file name (ensure the same case, for us Unix folks).
Decide what your function is going to be called and replace all the YOUR_FUNCTION_NAME_HERE placeholders, along with the parameters needed.

Write your function in Pascal for the TNativeGLZVector4f variant at the bottom of the file.
Write some functional tests if this is new code and not just one of your old favourite working routines.
When happy with the Pascal code, copy it to the TGLZVector4f variant just above.
Write a compare test in the test file. At this point you can run the comparison test using one of the native config build modes.
Write the timing test while in this build mode.
That's it for test coding; everything is ready to start work on assembler.
Select another build mode you want to code for and the relevant function will become un-greyed.
Hit F9 and do a test build in this mode. There should now be a .s file in the output dir where you can work out which registers your parameters are in. Using this small file is much easier than looking through reams of output code from the main code base, and the sources are organised so the assembler call is right at the beginning of the .s file.

Code and test.
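To make the shape of this concrete, here is a rough sketch of a record helper carrying a reference routine and a candidate routine, followed by a compare test. All names (TDemoVector4f, TDemoVector4fHelper, SameWithin) and the 1e-6 tolerance are assumptions for illustration; the real template files are not reproduced here. In objfpc mode the helper syntax is assumed to need the advancedrecords modeswitch.

```pascal
program HelperCompareDemo;
{$mode objfpc}{$modeswitch advancedrecords}

type
  // Hypothetical stand-in for the library's vector record.
  TDemoVector4f = record
    X, Y, Z, W: Single;
  end;

  // The helper keeps the new function out of the record's own source file.
  TDemoVector4fHelper = record helper for TDemoVector4f
    function NativeLengthSquared: Single;
    function CandidateLengthSquared: Single;
  end;

// Reference version: straightforward left-to-right sum.
function TDemoVector4fHelper.NativeLengthSquared: Single;
begin
  Result := Self.X * Self.X + Self.Y * Self.Y + Self.Z * Self.Z + Self.W * Self.W;
end;

// Candidate version: pairwise sum, grouped the way a horizontal add would be.
function TDemoVector4fHelper.CandidateLengthSquared: Single;
begin
  Result := (Self.X * Self.X + Self.Y * Self.Y) + (Self.Z * Self.Z + Self.W * Self.W);
end;

// Compare test helper: equal within an absolute tolerance.
function SameWithin(A, B, Epsilon: Single): Boolean;
begin
  Result := Abs(A - B) <= Epsilon;
end;

var
  V: TDemoVector4f;
begin
  V.X := 1; V.Y := 2; V.Z := 3; V.W := 4;
  if SameWithin(V.NativeLengthSquared, V.CandidateLengthSquared, 1e-6) then
    WriteLn('compare test passed')
  else
    WriteLn('compare test FAILED');
end.
```

In the real workflow the candidate body would be the assembler version; a Pascal candidate is used here only so the sketch builds on any target.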

And hey, if no one else is interested in your specialist function you still get all the advantages as if it were part of the core code base.

As I said, I have dogfooded this approach myself and included in this dist is the code I am working on which shows what a working env looks like. Toss it out when you have had a look.

Other changes: I have moved the results files to a results subdirectory.

Best to put this in a new clean dir alongside Beanz code.

Oh by the way did I say I like advanced records :D
Peter
 



dicepd

  • Full Member
  • ***
  • Posts: 163
Re: AVX and SSE support question
« Reply #157 on: December 10, 2017, 07:36:47 pm »
And the good news is that when you convert something more complicated the speedups are dramatic.
Code: Pascal
Compiler Flags: -CfSSE3, -Sv, -O3, -dUSE_ASM, -dSSE_CONFIG_1
Test,           Native,   Assembler, Speedup
AverageNormal4, 1.554000, 0.076000,  20.447260

20x faster  8-)
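For reference, the two figures reported in this thread can be reproduced from the timings. The sketch below assumes "Gain in %" means (native - assembler) / native * 100, which matches the Distance line posted earlier (0.201000, 0.073001, 63.681 %), and that Speedup is native / assembler. GainPercent and Speedup are hypothetical names.

```pascal
program GainDemo;
{$mode objfpc}

// Percentage of the native run time saved by the assembler version (assumed formula).
function GainPercent(NativeTime, AsmTime: Double): Double;
begin
  Result := (NativeTime - AsmTime) / NativeTime * 100.0;
end;

// How many times faster the assembler version is (assumed formula).
function Speedup(NativeTime, AsmTime: Double): Double;
begin
  Result := NativeTime / AsmTime;
end;

begin
  WriteLn('Distance gain %: ', GainPercent(0.201000, 0.073001):0:3);
  WriteLn('AverageNormal4 speedup: ', Speedup(1.554000, 0.076000):0:3);
end.
```

The recomputed speedup differs from the reported 20.447260 in the fourth decimal place, presumably because the printed timings are already rounded to six places.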

dicepd

  • Full Member
  • ***
  • Posts: 163
Re: AVX and SSE support question
« Reply #158 on: December 11, 2017, 12:57:25 pm »
I am getting around to looking at pipeline  optimisations through reordering of naive working code.

Reordering certainly makes a difference, and in going through this process I wanted a measure of worse or better that was a little more certain than a 20m x run time.

I read the Intel doc here: https://www.intel.co.uk/content/www/uk/en/embedded/training/ia-32-ia-64-benchmark-code-execution-paper.html. But it is a bit of overkill for comparative timing.

I condensed this down to the following simple code, which is not 'perfect', but for the task of checking whether changes make things better or worse I think it will do.
Code: Pascal
// hacky we just take the low 32 bits of the cpu counter
// compiler protects regs no need to replicate here.
// we are only interested in the min value really
// so a loop of 100 will always get us the min
function ASMTick: int64; assembler;
asm
{$ifdef CPU64}
  XOR rax, rax
  CPUID              // serialising instruction; overwrites EAX, EBX, ECX and EDX
  RDTSC              // Get the CPU's time stamp counter: high 32 bits in EDX, low 32 bits in EAX.
  mov [Result], RAX  // RAX holds only the low 32 bits; the high half in EDX is dropped
{$else}
  XOR eax, eax
  CPUID              // serialising instruction; overwrites EAX, EBX, ECX and EDX
  RDTSC              // Get the CPU's time stamp counter: high 32 bits in EDX, low 32 bits in EAX.
  mov [Result], eax  // low 32 bits only
{$endif}
end;

Comments please before I release something as quick and dirty as this.
It gives results such as

Proc Tick AverageNormal4:, 203, 316, 205.40

Which is Min, Max, Average. As we are only really interested in seeing code changes which affect the Min, the odd bad number from a wrap-around is of no concern.
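The Min, Max, Average summary can be sketched in plain Pascal like this. The sample values are made up for illustration (they are not measurements), and TTickStats and SummarizeTicks are hypothetical names; in real use the samples would be tick deltas taken around each call.

```pascal
program TickStatsDemo;
{$mode objfpc}

type
  TTickStats = record
    MinTicks, MaxTicks: Int64;
    Average: Double;
  end;

// Summarises a run of tick deltas. Expects at least one sample.
function SummarizeTicks(const Samples: array of Int64): TTickStats;
var
  i: Integer;
  Sum: Int64;
begin
  Result.MinTicks := Samples[0];
  Result.MaxTicks := Samples[0];
  Sum := 0;
  for i := 0 to High(Samples) do
  begin
    if Samples[i] < Result.MinTicks then
      Result.MinTicks := Samples[i];
    if Samples[i] > Result.MaxTicks then
      Result.MaxTicks := Samples[i];
    Sum := Sum + Samples[i];
  end;
  Result.Average := Sum / Length(Samples);
end;

var
  S: TTickStats;
begin
  // Made-up deltas; one outlier stands in for an interrupted run.
  S := SummarizeTicks([203, 205, 316, 204, 203]);
  WriteLn('Proc Tick Demo:, ', S.MinTicks, ', ', S.MaxTicks, ', ', S.Average:0:2);
end.
```

The outlier pulls the Max and the Average up but leaves the Min untouched, which is why the Min is the figure worth comparing between code changes.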



BeanzMaster

  • Sr. Member
  • ****
  • Posts: 268
Re: AVX and SSE support question
« Reply #159 on: December 11, 2017, 02:56:51 pm »
Hi Peter, I'm working on the UnitTest and I've made some small improvements. I'll post an update later.
By the way, while waiting, for profiling/benchmarking you can take a look at the attached zip. It's a part of my project. It needs some small updates, but it should work well as is.

dicepd

  • Full Member
  • ***
  • Posts: 163
Re: AVX and SSE support question
« Reply #160 on: December 12, 2017, 09:38:14 am »
Jerome,
Now I am well impressed with SSE et al.
While awaiting your changes, I got the routine I was talking about before, backported into my test code for real time engraving. It spends most of its time calculating normals so the screen representation looks good during this call. Calculating the mesh (250,000 vertices) takes a fraction of the time. So pure pascal does it in ~32sec, replace the normal calc with SSE and it does it in ~21 secs. For one call not a bad improvement in speed.

Time to callgrind again and see where the next pinch point is, though from memory it's Point In Volume.

BeanzMaster

  • Sr. Member
  • ****
  • Posts: 268
Re: AVX and SSE support question
« Reply #161 on: December 17, 2017, 08:21:59 pm »
Hi all, I was very busy last week, so...
I took some time to code and post the new update of our VectorMath UnitTest.

What's new:
- I reorganized things and made some small changes (split into two projects, 32-bit/64-bit), added a BaseTimingTest class, modified Config.inc...
- I added DistanceSquare, LengthSquare and Spacing SSE functions for TGLZVector4f (native only has Spacing)
- I added some {$ifdef TEST} blocks in the SSE functions
- I made some small updates to my Profiler units (enough for now, but not totally finished) and added them for the timing tests
- I introduced Matrix and Quaternion with some SSE functions (sorry, only for win64 at this time)
- I introduced Vector Helper (including HmgPlane) and Matrix Helper
- I added Quaternion and Matrix test and timing cases
- I added a VectorAndHmgPlaneHelper test case
- I added a clean and full HTML output. Just click on the HTML file to see the results in your browser

Now we have around 170 tests!

By reading the code you'll discover some web links I found during my research. One of the coolest is
https://gcc.godbolt.org/. This helped me a lot with the Matrix Invert function.

You'll also find some small optimizations of the SSE code (mostly between {$ifdef TEST}), but not everywhere yet.

Bugs:
- Some VectorHelper test cases are wrong for SSE: CreatePlane, NormalizePlane and AverageNormal4 (wrong and not finished. I'm a little tired. I'll restart later)

Note: using the Profiler inside a loop is not advisable. Our tests are not complex enough: the RDTSC call disturbs the code and degrades
performance a lot, so the timing results are not very good. So I left the profiler outside the loops.
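A minimal sketch of timing outside the loop: one reading before the loop and one after, rather than a reading per iteration. It uses GetTickCount64 from SysUtils as a stand-in for the Profiler units, which are not shown in this thread; SumOfRoots and the iteration count are made up for illustration.

```pascal
program TimeOutsideLoopDemo;
{$mode objfpc}

uses
  SysUtils;

const
  Iterations = 20000000; // illustrative loop count, not taken from the thread

// Work to be timed; returns a value so the loop has an observable result.
function SumOfRoots(Count: Integer): Double;
var
  i: Integer;
begin
  Result := 0.0;
  for i := 1 to Count do
    Result := Result + Sqrt(i);
end;

var
  StartMs, ElapsedMs: QWord;
  Total: Double;
begin
  StartMs := GetTickCount64;              // single reading before the loop
  Total := SumOfRoots(Iterations);
  ElapsedMs := GetTickCount64 - StartMs;  // single reading after the loop
  WriteLn('total = ', Total:0:3, ', elapsed ms = ', ElapsedMs);
end.
```

The cost of the two clock readings is spread over all iterations, so it no longer dominates a short routine the way a per-iteration reading would.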

One thing I discovered: it's impossible to create more than one helper per record. Only the last one declared is taken into account. sniff :(  :'(

Peter: your ASMTick function does not work on Win10 64-bit. Mine in GLZCpuID (GetClockCycleTickCount) is OK.

Cheers
« Last Edit: December 17, 2017, 08:29:05 pm by BeanzMaster »

BeanzMaster

  • Sr. Member
  • ****
  • Posts: 268
Re: AVX and SSE support question
« Reply #162 on: December 17, 2017, 11:41:30 pm »
I'm back. I did some tests and noticed that the sqrtss instruction is very slow; it's better to use sqrtps. With sqrtss, my Quaternion.Magnitude function is slower than native, so
replace it with this (it's for SSE3):

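Code: Pascal
function TGLZQuaternion.Magnitude : Single; assembler; nostackframe; register;
asm
  movaps xmm0, [RCX]        // x, y, z, w
  mulps  xmm0, xmm0         // x^2, y^2, z^2, w^2
  movshdup    xmm1, xmm0    // y^2, y^2, w^2, w^2
  addps       xmm0, xmm1    // lane 0 = x^2+y^2, lane 2 = z^2+w^2
  movhlps     xmm1, xmm0    // low half of xmm1 = high half of xmm0 (z^2+w^2 in lane 0)
  addss       xmm0, xmm1    // lane 0 = x^2+y^2+z^2+w^2
  sqrtps xmm0, xmm0         // square root of every lane; the result is lane 0
end;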
 

This is the best-optimized code I have found for Length/Magnitude.



dicepd

  • Full Member
  • ***
  • Posts: 163
Re: AVX and SSE support question
« Reply #163 on: December 18, 2017, 08:15:32 pm »
Hi Jerome,

Just downloaded this, had a busy weekend with family pre christmas get together so not had much time this weekend.

Works fine in win64 for SSE (I like the HTML results), but as for getting started on a Linux version, getting one of the Native_CONFIG_X modes working first would have to be a priority so I can then have something to work against.

It is now getting a bit complicated to use the forum as a source sharing device, do you have any sort of github or other source server where collaboration would be a little easier?

BeanzMaster

  • Sr. Member
  • ****
  • Posts: 268
Re: AVX and SSE support question
« Reply #164 on: December 19, 2017, 02:26:33 pm »
Quote from: dicepd
It is now getting a bit complicated to use the forum as a source sharing device, do you have any sort of github or other source server where collaboration would be a little easier?

Hi Peter, yes, it would be better. I have an account on GitHub. After Christmas I'm on holiday. I'll set up the git repository and send you the URL  :)

Merry Christmas !  O:-)

 
