Lazarus

Programming => General => Topic started by: totya on June 09, 2019, 06:53:37 pm

Title: Stream read/write via inline assembler [SOLVED by ASerge]
Post by: totya on June 09, 2019, 06:53:37 pm
Hi!

I know asm isn't very popular, but I'd like to access stream data (TMemoryStream) via inline assembler (x86), because I want to execute a simple (byte!) operation on the whole stream (stream.size), and I think this is much faster with asm.

So I'd like an asm equivalent of this (only the for loop):

Code: Pascal
procedure TForm1.Operation(const StreamIn, StreamOut: TMemoryStream);
var
  i: integer;
  B: byte;
begin
  StreamIn.Position:=0;
  StreamOut.Clear;

  for i := 0 to StreamIn.Size - 1 do
  begin
    B := StreamIn.ReadByte;
    B := B + 1; // operation example...
    StreamOut.WriteByte(B);
  end;

  StreamOut.Position := 0;
end;

I guess StreamIn.Memory points to the stream's memory...

Thanks...
Title: Re: Stream read/write via inline assembler
Post by: LemonParty on June 09, 2019, 07:18:49 pm
Use method Read (https://www.freepascal.org/docs-html/rtl/classes/tstream.read.html).
Then write a separate function that processes the array of bytes you read (it may be written in pure assembler).
Then put the processed bytes back using TStream.Write.

This approach should really increase the performance.

(inlining does not work for functions that contain loops)
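
The read/process/write approach described above could look roughly like the following. This is only a sketch: the 64 KB ChunkSize, the ProcessBuffer helper and the "+1" operation are assumptions for illustration, and ProcessBuffer is the part that could later be rewritten in assembler.

```pascal
program ChunkedStreamOp;
{$mode objfpc}{$H+}

uses
  Classes;

const
  ChunkSize = 64 * 1024; // hypothetical buffer size, tune as needed

// Applies the example operation (add 1, wrapping at 255) to a buffer.
procedure ProcessBuffer(Buf: PByte; Count: SizeInt);
var
  i: SizeInt;
begin
  for i := 0 to Count - 1 do
    Buf[i] := (Buf[i] + 1) and $FF;
end;

// Reads StreamIn in chunks, processes each chunk, appends it to StreamOut.
procedure Operation(StreamIn, StreamOut: TStream);
var
  Buffer: array[0..ChunkSize - 1] of Byte;
  Got: LongInt;
begin
  StreamIn.Position := 0;
  repeat
    Got := StreamIn.Read(Buffer, SizeOf(Buffer));
    if Got > 0 then
    begin
      ProcessBuffer(@Buffer[0], Got);
      StreamOut.WriteBuffer(Buffer, Got);
    end;
  until Got <= 0;
  StreamOut.Position := 0;
end;

var
  InS, OutS: TMemoryStream;
  A, B: Byte;
begin
  InS := TMemoryStream.Create;
  OutS := TMemoryStream.Create;
  try
    InS.WriteByte(100);
    InS.WriteByte(255);
    Operation(InS, OutS);
    A := OutS.ReadByte;
    B := OutS.ReadByte;
    WriteLn(A, ' ', B); // 100 + 1 = 101, and 255 + 1 wraps to 0
  finally
    InS.Free;
    OutS.Free;
  end;
end.
```

Because it only relies on TStream.Read and TStream.WriteBuffer, the same routine also works for file streams, at the cost of one extra copy per chunk.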
Title: Re: Stream read/write via inline assembler
Post by: rvk on June 09, 2019, 07:26:10 pm
As you mentioned StreamIn.Memory contains the array of bytes so just loop that in assembler and change the byte. After that the InStream will have your changed stream. No need for read and write.

But I wonder how much performance gain you will get.

How large is the stream you are talking about?
Title: Re: Stream read/write via inline assembler
Post by: totya on June 09, 2019, 07:30:40 pm
Hi rvk master! :)

Well, StreamIn and StreamOut have different sizes, so I need to read from StreamIn and write to StreamOut (StreamOut has a header).

Quote
so just loop that in assembler and change the byte

I understand, but please show me an example :) I last used assembler a long time ago... and Google is not my friend in this case...

Edit, for the new question:
How large is the stream you are talking about?

The sources are files, and the total file size is about 20MB (at the moment). The speed is not too bad with Pascal, but I want to see it with asm... :)
Title: Re: Stream read/write via inline assembler
Post by: jamie on June 09, 2019, 08:01:47 pm
there is a property "Memory" which returns the pointer of the memory block

So if you know assembler you can do this..

I suppose I can code up an example but why? Unless you are doing a lot of short setups
I can't see a reason for it but who am I. Maybe I'll feel generous and code up an example.


Title: Re: Stream read/write via inline assembler
Post by: rvk on June 09, 2019, 08:03:10 pm
I don't think it will be much faster in assembler but you can try.
For me it has been even longer since I worked with assembler (around 1988).

But for move() in fpc, it is already in assembler. So you can create a stream (set size) and just do a move. Or don't work with streams at all and just work with arrays.

Code: Pascal
procedure Move(const source;var dest;count:SizeInt);[public, alias: 'FPC_MOVE'];assembler;nostackframe;
asm
  cmp     ecx,SMALLMOVESIZE
  ja      @Large
  cmp     eax,edx
  lea     eax,[eax+ecx]
  jle     @SmallCheck
@SmallForward:
  add     edx,ecx
  jmp     SmallForwardMove_3
@SmallCheck:
  je      @Done {For Compatibility with Delphi's move for Source = Dest}
  sub     eax,ecx
  jmp     SmallBackwardMove_3
@Large:
  jng     @Done {For Compatibility with Delphi's move for Count < 0}
  cmp     eax,edx
  jg      @moveforward
  je      @Done {For Compatibility with Delphi's move for Source = Dest}
  push    eax
  add     eax,ecx
  cmp     eax,edx
  pop     eax
  jg      @movebackward
@moveforward:
  jmp     dword ptr fastmoveproc_forward
@movebackward:
  jmp     dword ptr fastmoveproc_backward {Source/Dest Overlap}
@Done:
end;

{Move ECX Bytes from EAX to EDX, where EAX > EDX and ECX > 36 (SMALLMOVESIZE)}
procedure Forwards_SSE_3;assembler;nostackframe;
const
  LARGESIZE = 2048;
asm
  cmp     ecx,LARGESIZE
  jge     @FwdLargeMove
  cmp     ecx,SMALLMOVESIZE+32
  movups  xmm0,[eax]
  jg      @FwdMoveSSE
  movups  xmm1,[eax+16]
  movups  [edx],xmm0
  movups  [edx+16],xmm1
  add     eax,ecx
  add     edx,ecx
  sub     ecx,32
  jmp     SmallForwardMove_3
@FwdMoveSSE:
  push    ebx
  mov     ebx,edx
  {Align Writes}
  add     eax,ecx
  add     ecx,edx
  add     edx,15
  and     edx,-16
  sub     ecx,edx
  add     edx,ecx
  {Now Aligned}
  sub     ecx,32
  neg     ecx
@FwdLoopSSE:
  movups  xmm1,[eax+ecx-32]
  movups  xmm2,[eax+ecx-16]
  movaps  [edx+ecx-32],xmm1
  movaps  [edx+ecx-16],xmm2
  add     ecx,32
  jle     @FwdLoopSSE
  movups  [ebx],xmm0 {First 16 Bytes}
  neg     ecx
  add     ecx,32
  pop     ebx
  jmp     SmallForwardMove_3
@FwdLargeMove:
  push    ebx
  mov     ebx,ecx
  test    edx,15
  jz      @FwdLargeAligned
  {16 byte Align Destination}
  mov     ecx,edx
  add     ecx,15
  and     ecx,-16
  sub     ecx,edx
  add     eax,ecx
  add     edx,ecx
  sub     ebx,ecx
  {Destination now 16 Byte Aligned}
  call    SmallForwardMove_3
  mov     ecx,ebx
@FwdLargeAligned:
  and     ecx,-16
  sub     ebx,ecx {EBX = Remainder}
  push    edx
  push    eax
  push    ecx
  call    AlignedFwdMoveSSE_3
  pop     ecx
  pop     eax
  pop     edx
  add     ecx,ebx
  add     eax,ecx
  add     edx,ecx
  mov     ecx,ebx
  pop     ebx
  jmp     SmallForwardMove_3
end; {Forwards_SSE}
This can be shortened but you will only gain a few cycles because this procedure first determines the best way to do the move and then jumps to the appropriate function.

But I wouldn't focus on the move procedure because it's already in assembler. So create an array, first add the header, then do the move from instream.memory to your array and loop through it to perform your action.

If you look at the assembler "debug view" when you run the following snippet:
Code: Pascal
for i := 0 to 20 * 1024 * 1024 do
begin
  a[i] := a[i] + 1;
end;

You see something like this:
Code: Pascal
asm
  @back:
  movl   $0xffffffff,-0xdc(%ebp)
  mov    %esi,%esi
  mov    -0xdc(%ebp),%eax
  add    $0x1,%eax
  mov    %eax,-0xdc(%ebp)
  movzbl -0x70(%ebp,%eax,1),%eax
  add    $0x1,%eax
  mov    -0xdc(%ebp),%edx
  mov    %al,-0x70(%ebp,%edx,1)
  cmpl   $0x1400000,-0xdc(%ebp)
  jl     back
end;
What would you like to change performance-wise?

It would all depend on your "B := B + 1;" calculation because I don't think you can make the loop any more efficient. (But don't use the stream.read and stream.write because they do add some overhead)

Note: Back in the time I did assembler we only had ax, ah and al and such (8086 processor  8))
Title: Re: Stream read/write via inline assembler
Post by: totya on June 09, 2019, 08:11:23 pm
there is a property "Memory" which returns the pointer of the memory block So if you know assembler you can do this..

If I could see code that reads data from the stream/buffer into a register byte by byte, and writes it to the other stream/buffer, I think that would be a good start.
Title: Re: Stream read/write via inline assembler
Post by: rvk on June 09, 2019, 08:17:07 pm
The last snippet in my previous post shows the for loop to manipulate an array.

So first create the header in an array, then move the instream.memory after it and do the for loop in assembler.

But even if you don't do the for loop in assembler... You can do just your b := b + 1 in assembler (assuming it does something different than just adding 1).
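
A sketch of the "header first, then Move, then loop" idea from this post, written in plain Pascal. The four header bytes and the "+1" operation are made up for illustration; only Move and TMemoryStream.Memory come from the discussion.

```pascal
program HeaderThenMove;
{$mode objfpc}{$H+}

uses
  Classes;

const
  // Hypothetical 4-byte header, just to have something to copy.
  Header: array[0..3] of Byte = ($48, $44, $52, $31);

procedure Operation(StreamIn, StreamOut: TMemoryStream);
var
  P: PByte;
  i, N: SizeInt;
begin
  N := StreamIn.Size;
  StreamOut.Size := SizeOf(Header) + N;    // allocate the output once
  P := PByte(StreamOut.Memory);
  Move(Header, P^, SizeOf(Header));        // header goes first
  if N > 0 then
    Move(PByte(StreamIn.Memory)^, P[SizeOf(Header)], N); // payload after it
  // Apply the byte operation in place, skipping the header.
  for i := SizeOf(Header) to SizeOf(Header) + N - 1 do
    P[i] := (P[i] + 1) and $FF;
  StreamOut.Position := 0;
end;

var
  InS, OutS: TMemoryStream;
  i: Integer;
begin
  InS := TMemoryStream.Create;
  OutS := TMemoryStream.Create;
  try
    InS.WriteByte(100);
    InS.WriteByte(255);
    Operation(InS, OutS);
    // Prints the 4 header bytes followed by the 2 processed bytes.
    for i := 0 to OutS.Size - 1 do
      Write(OutS.ReadByte, ' ');
    WriteLn;
  finally
    InS.Free;
    OutS.Free;
  end;
end.
```

Setting StreamOut.Size up front avoids repeated reallocation, and the loop never calls a stream method per byte.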
Title: Re: Stream read/write via inline assembler
Post by: ASerge on June 09, 2019, 08:22:29 pm
If I could see code that reads data from the stream/buffer into a register byte by byte, and writes it to the other stream/buffer, I think that would be a good start.
Code: Pascal
{$ASMMODE INTEL}
procedure Operation(const StreamIn, StreamOut: TMemoryStream);
var
  LSize: SizeInt;
begin
  LSize := StreamIn.Size;
  StreamOut.Size := LSize;
  //repeat
  //  Dec(LSize);
  //  if LSize < 0 then
  //    Break;
  //  PByte(StreamOut.Memory)[LSize] := PByte(StreamIn.Memory)[LSize] + 1;
  //until False;
  asm
    mov  rsi, StreamIn
    mov  rdi, StreamOut
    mov  rcx, LSize
    @@StartLoop:
    dec  rcx
    jl   @@EndLoop
    mov  al, [rsi+rcx].TMemoryStream.Memory
    inc  al
    mov  [rdi+rcx].TMemoryStream.Memory, al
    jmp  @@StartLoop
    @@EndLoop:
  end ['rsi', 'rdi', 'rcx', 'rax'];
  StreamOut.Position := 0;
end;
When using "asm" inserts, FPC stops optimizing the surrounding code, so without asm it will be faster.
Title: Re: Stream read/write via inline assembler
Post by: LemonParty on June 09, 2019, 08:56:29 pm
Steroids (https://www.felixcloutier.com/x86/paddb:paddw:paddd:paddq) should make the loop run at the speed of light (at least 4 times faster than classic instructions).
But do you need the speed of light?
Title: Re: Stream read/write via inline assembler
Post by: totya on June 09, 2019, 09:10:53 pm
Code: Pascal
{$ASMMODE INTEL}...

Thank you for this very readable code! It's enough for me to get started... These look like x64 registers to me, but I don't think that's a big problem (rsi->esi).

Thank you too, LemonParty, rvk master and jamie, for the answers and information.
Title: Re: Stream read/write via inline assembler
Post by: totya on June 09, 2019, 10:10:26 pm

Hi!

My operation is more complicated than inc(), but unfortunately I got a SIGSEGV (at StreamOut.Position := 0;) with this untouched simple test code:

Code: Pascal
unit Unit1;

{$mode objfpc}{$H+}

interface

uses
  Classes, SysUtils, Forms, Controls, Graphics, Dialogs, StdCtrls;

type

  { TForm1 }

  TForm1 = class(TForm)
    Button1: TButton;
    Memo1: TMemo;
    procedure Button1Click(Sender: TObject);
  private
    procedure Operation(const StreamIn, StreamOut: TMemoryStream);

  end;

var
  Form1: TForm1;

implementation

{$R *.lfm}

{ TForm1 }

{$ASMMODE INTEL}
procedure TForm1.Operation(const StreamIn, StreamOut: TMemoryStream);
var
  LSize: SizeInt;
begin
  LSize := StreamIn.Size;
  StreamOut.Size := LSize;

  //repeat
  //  Dec(LSize);
  //  if LSize < 0 then
  //    Break;
  //  PByte(StreamOut.Memory)[LSize] := PByte(StreamIn.Memory)[LSize] + 1;
  //until False;
  asm
    mov  rsi, StreamIn
    mov  rdi, StreamOut
    mov  rcx, LSize
    @@StartLoop:
    dec  rcx
    jl   @@EndLoop
    mov  al, [rsi+rcx].TMemoryStream.Memory
    inc  al
    mov  [rdi+rcx].TMemoryStream.Memory, al
    jmp  @@StartLoop
    @@EndLoop:
  end ['rsi', 'rdi', 'rcx', 'rax'];

  StreamOut.Position := 0;
end;

procedure TForm1.Button1Click(Sender: TObject);
var
  StreamIn, StreamOut: TMemoryStream;
begin
  StreamIn := TMemoryStream.Create;
  StreamOut := TMemoryStream.Create;
  try
    StreamIn.WriteByte(100);
    StreamIn.WriteByte(100);

    Operation(StreamIn, StreamOut);

    Memo1.Lines.Add(IntToStr(StreamOut.ReadByte));
    Memo1.Lines.Add(IntToStr(StreamOut.ReadByte));
  finally
    StreamIn.Free;
    StreamOut.Free;
  end;
end;

end.

If I comment out
//inc  al
then this code runs without error, but the result is garbage... (120, 204).
Title: Re: Stream read/write via inline assembler
Post by: jamie on June 09, 2019, 10:36:43 pm
Or, you can use Move:

Move(SourceStream.Memory^, DestinationStream.Memory^, SourceStream.Size);

Set DestinationStream.Size first, then reset the Position back to zero or whatever.

Move is system level and should be better optimized than using the methods of the streams.
Title: Re: Stream read/write via inline assembler
Post by: ASerge on June 09, 2019, 11:57:03 pm
My operation is more complicated than inc(), but unfortunately I got a SIGSEGV (at StreamOut.Position := 0;) with this untouched simple test code:
That's an additional danger with assembler: it's easy to make a mistake. I forgot to dereference Memory, and also that the property has to be accessed directly through its field (FMemory), otherwise offset 0 is used.
Code: Pascal
procedure Operation(const StreamIn, StreamOut: TMemoryStream);
var
  LSize: SizeInt;
begin
  LSize := StreamIn.Size;
  StreamOut.Size := LSize;
  //repeat
  //  Dec(LSize);
  //  if LSize < 0 then
  //    Break;
  //  PByte(StreamOut.Memory)[LSize] := PByte(StreamIn.Memory)[LSize] + 1;
  //until False;
  asm
    mov  rsi, StreamIn
    mov  rdi, StreamOut
    mov  rcx, LSize
    @@StartLoop:
    dec  rcx
    jl   @@EndLoop
    mov  rdx, [rsi].TMemoryStream.FMemory
    mov  al, [rdx+rcx]
    inc  al
    mov  rdx, [rdi].TMemoryStream.FMemory
    mov  BYTE PTR [rdx+rcx], al
    jmp  @@StartLoop
    @@EndLoop:
  end ['rsi', 'rdi', 'rcx', 'rax', 'rdx'];
  StreamOut.Position := 0;
end;
Title: Re: Stream read/write via inline assembler
Post by: totya on June 10, 2019, 09:50:41 am
...

Big thanks to you, the result of this sample code is okay now... But now I have one less register that I can use for operations ;)
Title: Re: Stream read/write via inline assembler
Post by: totya on June 10, 2019, 01:07:05 pm
...

Thank you again!

With your sample, I was able to do what I wanted. As I remembered, assembler is much faster than any high-level language.

Algorithm selftest (hash compared!) with 300MB data:

pascal implementation: 35,32 MB/s
assembler implementation: 333,88 MB/s

It's about 10 times faster...

But I have a question. To tell the truth, I don't know exactly why your first sample didn't work (you wrote that you forgot to dereference the pointers), and in my algorithm I use constant array values. My question is: can I access this array constant (ADD_ARRAY_SAMPLE) safely with my code?

Sample (isn't the real algorithm), x86 version:

Code: Pascal
procedure TForm1.Operation2(const Stream: TMemoryStream);
const
  ADD_ARRAY_SAMPLE: array[0..2] of byte = ($01, $02, $03);
var
  LSize, Counter: SizeInt;
  ADD_ARRAY_SAMPLE_PTR: pointer;
begin
  LSize := Stream.Size;
  ADD_ARRAY_SAMPLE_PTR:= @ADD_ARRAY_SAMPLE;

  Counter:=0;

  {$ASMMODE INTEL}
  asm
     mov  esi, Stream
     mov  ecx, Counter // ADD_ARRAY_SAMPLE index
     mov  ebx, 0 //Stream index

     @@StartLoop:
     mov  edx, [esi].TMemoryStream.FMemory
     mov  al, [edx+ebx]

     mov  edx, ADD_ARRAY_SAMPLE_PTR
     mov  ah, [edx+ecx]

     add al, ah

     mov  edx, [esi].TMemoryStream.FMemory
     mov  BYTE PTR [edx+ebx], al

     cmp  ecx, 2
     je   @@ARRAY_MAX
     inc  ecx
     jmp  @@Next

     @@ARRAY_MAX:
     mov ecx, 0

     @@Next:
     inc  ebx
     cmp ebx, LSize
     je @@EndLoop

     jmp  @@StartLoop
     @@EndLoop:
   end ['esi', 'eax', 'ebx', 'ecx', 'edx'];

  Stream.Position := 0;
end;

... and I have a simple question: why doesn't Ctrl-D (JEDI code format) work if there is asm code in the source?

Thank you!
Title: Re: Stream read/write via inline assembler
Post by: rvk on June 10, 2019, 01:17:08 pm
pascal implementation: 35,32 MB/s
assembler implementation: 333,88 MB/s
It's about 10 times faster...
Can we see your pascal implementation?
Did it also work with tstream.memory as array or did you use tstream.read and write?

Did you try the repeat/until ASerge showed as commented code?
Because I don't think you should get that much of a difference.

Because I suspect using repeat/until with just your algorithm in assembler will be just as fast or maybe slightly slower. But not by a factor of 10.

Title: Re: Stream read/write via inline assembler
Post by: totya on June 10, 2019, 02:29:02 pm
Can we see your pascal implementation?

rvk master, I knew it, I knew it :)

So, the algorithm's author sent me the algorithm to use (his app doesn't work correctly), but I don't have permission to show the original code. It is something like this:

Code: Pascal

Size:=MemStreamIn.Size;
MemStreamOut.Size:=Size+HeaderSize;
Counter:=0;

for I := 0 to Size - 1 do
  begin
    B := MemStreamIn.ReadByte;

    { simple bitwise operations with an const array... array element choice: see: Counter }

    MemStreamOut.WriteByte(B);

    if Counter < X then
      Inc(Counter)
    else
      Counter := 0;
  end;

Edit, because I forgot to answer your other questions:
Did it also work with tstream.memory as array or did you use tstream.read and write?

As you see I use simple Stream ReadByte/WriteByte.

Did you try the repeat/until ASerge showed as commented code?

I tried it after I saw that the first assembler code was unusable, but I didn't look into it any deeper.
Title: Re: Stream read/write via inline assembler
Post by: rvk on June 10, 2019, 02:42:04 pm
... but something about this:
There is the problem.
Your pascal code is still using MemStreamIn.ReadByte and MemStreamOut.WriteByte which slows down your code considerably. No wonder it is 10 times slower than the assembler implementation.

You need to do something like this:

Code: Pascal
Size := MemStreamIn.Size;
MemStreamOut.Size := Size + HeaderSize;
pIn := 0;
pOut := HeaderSize;
pTo := MemStreamOut.Size;
repeat
  B :=  PByte(MemStreamIn.Memory)[pIn];

  asm
    // put here your assembler code of just the algorithm.
  end;

  PByte(MemStreamOut.Memory)[pOut] := B;
  Inc(pIn);
  Inc(pOut);
until pOut > pTo;
(just typed out of my head so the > and begin and end values might be slightly off)

But you will see with this implementation, the pascal code will still be very fast because you don't use the .readbyte and writebyte functions.

(And even this probably can be more optimized)
Title: Re: Stream read/write via inline assembler
Post by: totya on June 10, 2019, 03:01:40 pm
(And even this probably can be more optimized)

Hi, I tried your code, with the operation in Pascal, and I got 192,74 MB/s. It's not bad, much faster than the original, but slower than 333,88 MB/s :) "more optimized"
Title: Re: Stream read/write via inline assembler
Post by: rvk on June 10, 2019, 03:04:33 pm
Ok, I thought it would be almost as fast because there are no extra calls.

(I take it you do your testing outside the ide without debugger)
Title: Re: Stream read/write via inline assembler
Post by: totya on June 10, 2019, 03:06:56 pm
(I take it you do your testing outside the ide without debugger)

No speed difference with or without the debugger (MB/s):

193,74
194,12
Title: Re: Stream read/write via inline assembler
Post by: totya on June 10, 2019, 03:15:08 pm
But what a surprise: when I compiled the same code (rvk's, with the operation in Pascal) for x64, I got 389,05 MB/s
Title: Re: Stream read/write via inline assembler
Post by: totya on June 12, 2019, 10:38:10 pm
Code: Pascal
Size := MemStreamIn.Size;
MemStreamOut.Size := Size + HeaderSize;
pIn := 0;
pOut := HeaderSize;
pTo := MemStreamOut.Size -1; // orig: pTo := MemStreamOut.Size;
repeat
  B :=  PByte(MemStreamIn.Memory)[pIn];

  asm
    // put here your assembler code of just the algorithm.
  end;

  PByte(MemStreamOut.Memory)[pOut] := B;
  Inc(pIn);
  Inc(pOut);
until pOut > pTo;

Small bug corrected ;) See: //
Title: Re: Stream read/write via inline assembler [SOLVED by ASerge]
Post by: engkin on June 12, 2019, 11:59:09 pm
If you are after speed, try using SIMD instructions. Or unroll the loop.
Title: Re: Stream read/write via inline assembler
Post by: rvk on June 13, 2019, 12:49:43 am
Small bug corrected ;) See: //
Not a 'bug'... That's why I added the extra note  :P

(just typed out of my head so the > and begin and end values might be slightly off)
Title: Re: Stream read/write via inline assembler
Post by: totya on June 13, 2019, 05:11:05 pm
and begin and end values might be slightly off)

Ha master! :)

With my weak English I didn't understand this sentence, but now I can imagine what it means :)
Title: Re: Stream read/write via inline assembler [SOLVED by ASerge]
Post by: totya on June 13, 2019, 05:14:52 pm
If you are after speed, try using SIMD instructions. Or unroll the loop.

Hi, thanks, the speed is okay for me now, but if you show me a working sample (like ASerge's asm code) I can try it.
Title: Re: Stream read/write via inline assembler [SOLVED by ASerge]
Post by: engkin on June 13, 2019, 06:52:59 pm
With big size like:
The sources are files, and total file size about 20MB (at the moment).

and the type of instructions you want:
Code: Pascal
...
    { simple bitwise operations with an const array... array element choice: see: Counter }

It makes it sound like a perfect candidate for using SIMD instructions.

if you show me a working sample (like ASerge's asm code) I can try it.

It is exactly the same code, but instead of using XOR you use its SIMD counterpart PXOR. You don't deal with normal CPU registers like EAX, EDX, etc. You have a different set of registers, like XMM1, and these registers are bigger: EAX is 4 bytes while XMM1 is 16 bytes. CPUs with AVX have even bigger SIMD registers (32-byte YMM).

Here is an example (https://forum.lazarus.freepascal.org/index.php/topic,33761.msg219682.html#msg219682).
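
For readers who cannot follow the link, here is a minimal sketch of the XOR-to-PXOR idea (it is not the linked code). It assumes an x86-64 target and FPC's Intel asm mode; the routine name XorBlock16 and its parameters are invented for this example, and MOVDQU is used so that neither pointer has to be 16-byte aligned.

```pascal
program PxorSketch;
{$mode objfpc}{$H+}
{$ifdef CPUX86_64}
  {$asmmode intel}
{$endif}

// XORs the 16 bytes at PData with the 16 bytes at PKey, in place.
procedure XorBlock16(PData, PKey: PByte);
{$ifndef CPUX86_64}
var
  i: Integer;
{$endif}
begin
  {$ifdef CPUX86_64}
  asm
    mov    rax, PData
    mov    rdx, PKey
    movdqu xmm0, [rax]      // unaligned 16-byte load of the data
    movdqu xmm1, [rdx]      // unaligned 16-byte load of the key
    pxor   xmm0, xmm1       // 16 byte-wise XORs in one instruction
    movdqu [rax], xmm0      // store the result back
  end ['rax', 'rdx', 'xmm0', 'xmm1'];
  {$else}
  // Plain Pascal fallback for other targets.
  for i := 0 to 15 do
    PData[i] := PData[i] xor PKey[i];
  {$endif}
end;

var
  Data, Key: array[0..15] of Byte;
  i: Integer;
begin
  for i := 0 to 15 do
  begin
    Data[i] := i;
    Key[i] := $FF;
  end;
  XorBlock16(@Data[0], @Key[0]);
  for i := 0 to 15 do
    if Data[i] <> (i xor $FF) then
    begin
      WriteLn('mismatch at index ', i);
      Halt(1);
    end;
  WriteLn('ok');
end.
```

In a real loop the key would be loaded into its register once, before the loop, and only the data load, PXOR and store would be repeated for each 16-byte block.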
Title: Re: Stream read/write via inline assembler [SOLVED by ASerge]
Post by: totya on June 13, 2019, 11:27:24 pm
Here is an example (https://forum.lazarus.freepascal.org/index.php/topic,33761.msg219682.html#msg219682).

Thank you! I will look at it on the weekend... if my old Intel Core 2 Duo supports it...
Title: Re: Stream read/write via inline assembler [SOLVED by ASerge]
Post by: totya on June 15, 2019, 04:40:05 pm
Here is an example (https://forum.lazarus.freepascal.org/index.php/topic,33761.msg219682.html#msg219682).

Hi!

Thanks for this sample, but I got warnings at the beginning (compiled to 64 bit).

Quote
project1.lpr(21,1) Warning: Object file "unit1.o" contains 32-bit absolute relocation to symbol ".data.n_tc_$unit1$_$tform1_$_convert$tmemorystream$tmemorystream_$$_onemask".

Code: Pascal
{$asmmode intel}
procedure TForm1.Convert(const StreamIn, StreamOut: TMemoryStream);
const
 ONEMASK: array[0..15] of byte=($01,$01,$01,$01,$01,$01,$01,$01,$01,$01,$01,$01,$01,$01,$01,$01);
begin
  asm
    MOVDQA xmm0, ONEMASK
  end;
end;
Title: Re: Stream read/write via inline assembler [SOLVED by ASerge]
Post by: engkin on June 15, 2019, 05:24:14 pm
How about this (https://forum.lazarus.freepascal.org/index.php/topic,39098.msg267861.html#msg267861).
Title: Re: Stream read/write via inline assembler [SOLVED by ASerge]
Post by: totya on June 15, 2019, 11:18:23 pm
How about this (https://forum.lazarus.freepascal.org/index.php/topic,39098.msg267861.html#msg267861).

That one doesn't help me. But it's not a big problem, because with ASerge's very usable code I can use a TMemoryStream in place of the const array. And that's better, because a stream is more flexible than a const array... (e.g. it can be passed as a variable parameter).

The asm code is still under development (only partially working so far), because if the "key array" is shorter than 16 bytes and 16 is not divisible by its length, then as far as I can see I need to create an array table... but that will have to wait until tomorrow...

for example if 16 mod KeyArray.Size  > 0, it's a problem... :)

sample keyarray:
$EA  $FA $AA $22 $11

then I need a lookup table for the 16-byte operation, similar to this:
$EA  $FA $AA $22 $11 $EA  $FA $AA $22 $11 $EA  $FA $AA $22 $11 $EA
$FA $AA $22 $11 $EA  $FA $AA $22 $11 $EA  $FA $AA $22 $11 $EA  $FA
$AA $22 $11 $EA  $FA $AA $22 $11 $EA  $FA $AA $22 $11 $EA  $FA $AA
$22 .. and so on. :)
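
One way to build such a table, sketched in plain Pascal: repeat the key until the table length is a common multiple of the key length and 16, so that every 16-byte row starts at the right key offset and the pattern repeats cleanly. The names BuildKeyTable and GreatestCommonDivisor are invented here; for the 5-byte sample key the table comes out as 80 bytes (five rows of 16).

```pascal
program KeyTableSketch;
{$mode objfpc}{$H+}

type
  TKeyTable = array of Byte;

function GreatestCommonDivisor(A, B: SizeInt): SizeInt;
var
  T: SizeInt;
begin
  while B <> 0 do
  begin
    T := A mod B;
    A := B;
    B := T;
  end;
  Result := A;
end;

// Repeats Key until the length is the least common multiple of
// Length(Key) and 16, so the table splits into whole 16-byte rows.
function BuildKeyTable(const Key: array of Byte): TKeyTable;
var
  i, TableLen: SizeInt;
begin
  Result := nil;
  if Length(Key) = 0 then
    Exit;
  TableLen := (Length(Key) div GreatestCommonDivisor(Length(Key), 16)) * 16;
  SetLength(Result, TableLen);
  for i := 0 to TableLen - 1 do
    Result[i] := Key[i mod Length(Key)];
end;

const
  KeyArray: array[0..4] of Byte = ($EA, $FA, $AA, $22, $11);
var
  Table: TKeyTable;
  Row: Integer;
begin
  Table := BuildKeyTable(KeyArray);
  WriteLn('table length: ', Length(Table));
  // Show the first byte of every 16-byte row.
  for Row := 0 to (Length(Table) div 16) - 1 do
    WriteLn('row ', Row, ' starts with $', HexStr(Table[Row * 16], 2));
end.
```

With this layout, the 16-byte block that starts at stream offset p (p a multiple of 16) uses the row at Table[p mod Length(Table)], and a row never wraps around the end of the table.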
Title: Re: Stream read/write via inline assembler [SOLVED by ASerge]
Post by: engkin on June 15, 2019, 11:24:21 pm
didn't you mention 20MB?  But yes, it is asm after all.
Title: Re: Stream read/write via inline assembler [SOLVED by ASerge]
Post by: totya on June 15, 2019, 11:50:10 pm
With rvk's "pascal" code the speed is okay, similar to asm, a few hundred MB/s. It's more than enough, but I'd like to see the speed with SIMD (SSE) registers :)

By the way, I don't see any push/pop (I know it needs an alternative way) for the XMM registers in your code... I can do it this way:

Code: Pascal
           @@EndLoop:
           MOV     RDI, RemainStartIndex
           MOV     [RDI], RAX

  end ['rsi', 'rdi', 'rax', 'rbx', 'r10', 'r11', 'xmm1'];

Unfortunately this last line kills the JEDI Code Format function (but at least I finally know why JEDI doesn't work when asm code is present). I need to file a bug report...
Title: Re: Stream read/write via inline assembler [SOLVED by ASerge]
Post by: marcov on June 16, 2019, 03:25:21 am
Thanks for this sample, but I got warnings at the beginning (compiled to 64 bit).

In 64-bit, you need to work via RIP

    MOVDQA xmm0, [rip+ONEMASK]

movdqu is safer though: movdqa faults if ONEMASK is not 16-byte aligned.
Title: Re: Stream read/write via inline assembler [SOLVED by ASerge]
Post by: totya on June 16, 2019, 08:45:12 am
Thanks for this sample, but I got warnings at the beginning (compiled to 64 bit).

In 64-bit, you need to work via RIP

    MOVDQA xmm0, [rip+ONEMASK]

movdqu is safer though: movdqa faults if ONEMASK is not 16-byte aligned.

OMG, I rewrote my code yesterday, because as far as I could see this operation can't handle an offset that is not divisible by 16... and as you wrote, an unaligned version is available... thanks for this valuable information!  :)
Title: Re: Stream read/write via inline assembler [SOLVED by ASerge]
Post by: marcov on June 16, 2019, 09:37:20 am
Note that unaligned access in a loop doubles the needed bandwidth. (unaligned typically reads two aligned 16-byte areas, and gives you the needed part from it)

So it is better to

1. check if your count is large enough (don't bother if <256 or so)
2. process a few bytes on the normal CPU to align
3. process with SSE until the remainder is <16
4. process the remaining bytes on the normal CPU again
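
The four steps above can be sketched in plain Pascal as follows. AddOneBlock16 is only a stand-in for the SSE part (a real version would use aligned 16-byte loads and stores there); all names, the "+1" operation and the threshold constant are assumptions for illustration.

```pascal
program AlignedSplitSketch;
{$mode objfpc}{$H+}

const
  MinSimdCount = 256; // step 1: below this, don't bother with blocks

// Scalar path: handles any count, one byte at a time.
procedure AddOneScalar(P: PByte; Count: SizeInt);
var
  i: SizeInt;
begin
  for i := 0 to Count - 1 do
    P[i] := (P[i] + 1) and $FF;
end;

// Stand-in for the SSE path: P is 16-byte aligned here.
procedure AddOneBlock16(P: PByte);
begin
  AddOneScalar(P, 16);
end;

procedure AddOne(P: PByte; Count: SizeInt);
var
  Head: SizeInt;
begin
  if Count < MinSimdCount then
  begin
    AddOneScalar(P, Count);
    Exit;
  end;
  // Step 2: bytes needed to reach the next 16-byte boundary.
  Head := SizeInt(PtrUInt(P) and 15);
  if Head <> 0 then
    Head := 16 - Head;
  AddOneScalar(P, Head);
  Inc(P, Head);
  Dec(Count, Head);
  // Step 3: whole aligned 16-byte blocks.
  while Count >= 16 do
  begin
    AddOneBlock16(P);
    Inc(P, 16);
    Dec(Count, 16);
  end;
  // Step 4: the remaining tail of fewer than 16 bytes.
  AddOneScalar(P, Count);
end;

var
  Buf: array[0..1023] of Byte;
  i: Integer;
begin
  for i := 0 to High(Buf) do
    Buf[i] := i and $FF;
  AddOne(@Buf[1], 1000); // deliberately start one byte into the buffer
  for i := 1 to 1000 do
    if Buf[i] <> ((i + 1) and $FF) then
    begin
      WriteLn('mismatch at index ', i);
      Halt(1);
    end;
  if (Buf[0] <> 0) or (Buf[1001] <> (1001 and $FF)) then
  begin
    WriteLn('wrote outside the requested range');
    Halt(1);
  end;
  WriteLn('ok');
end.
```

Only the destination alignment is handled here; if the source and destination are different buffers with different alignment, one side still needs unaligned loads.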
Title: Re: Stream read/write via inline assembler [SOLVED by ASerge]
Post by: totya on June 16, 2019, 05:19:45 pm
Note that unaligned access in a loop doubles the needed bandwidth. (unaligned typically reads two aligned 16-byte areas, and gives you the needed part from it)

Your original idea is fantastic, so I use that. The asm/pas code is simpler that way, thanks again! The typical file size is only a few MB, so the doubled memory bandwidth doesn't matter. The asm code is nearly finished and the result is okay with my own test app; I will try it with the real app soon.
Title: Re: Stream read/write via inline assembler [SOLVED by ASerge]
Post by: totya on June 16, 2019, 09:49:26 pm
Now that the storm has mostly passed, I got back to my computer and could try the new code...

With the initial "unoptimized" Pascal code, in the past

I got 35 MB/s

So, with rvk master code (https://forum.lazarus.freepascal.org/index.php/topic,45672.msg323295.html#msg323295)

I got 180 MB/s (compiled to x86) (300MB test size)
I got 370 MB/s (compiled to x64) (300MB test size)

Nice speed increase from x64...

So, now these are very nice speeds...

Now I got idea from engkin (https://forum.lazarus.freepascal.org/index.php/topic,45672.msg323582.html#msg323582)

In short, I can use 128-bit registers. After a lot of struggling, I created working asm code. With this

I got 860 MB/s (compiled to x64) (300MB test size) (but the compile target doesn't really matter)

I expected a higher speed than this, but the truth is, as rvk master said, the compiler works very well... (it produces ugly but fast code).

Thanks to everyone!