I don't think it will be much faster in assembler, but you can try.
For me it has been even longer since I last worked with assembler (around 1988).
That said, Move() in FPC is already written in assembler. So you can create a stream, set its size, and do a single Move. Or skip streams entirely and just work with arrays.
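A minimal sketch of the "set size, then one Move" idea, in case it helps. The names InStream, OutStream and DataSize, and the sample fill loop, are made up for illustration and are not from your code:

```pascal
program StreamMoveSketch;
{$mode objfpc}{$H+}
uses
  Classes;
const
  DataSize = 1024; // hypothetical payload size
var
  InStream, OutStream: TMemoryStream;
  i: Integer;
begin
  InStream := TMemoryStream.Create;
  OutStream := TMemoryStream.Create;
  try
    // Stand-in for the real input: fill the source buffer with sample bytes.
    InStream.Size := DataSize;
    for i := 0 to DataSize - 1 do
      PByte(InStream.Memory)[i] := Byte(i);

    // Allocate the destination once, then copy everything in a single call
    // instead of many small Read/Write calls.
    OutStream.Size := DataSize;
    Move(InStream.Memory^, OutStream.Memory^, DataSize);

    WriteLn('Copied ', OutStream.Size, ' bytes');
  finally
    OutStream.Free;
    InStream.Free;
  end;
end.
```

Setting Size up front means the stream allocates its buffer once, so the copy itself is the only real work left.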
procedure Move(const source;var dest;count:SizeInt);[public, alias: 'FPC_MOVE'];assembler;nostackframe;
asm
cmp ecx,SMALLMOVESIZE
ja @Large
cmp eax,edx
lea eax,[eax+ecx]
jle @SmallCheck
@SmallForward:
add edx,ecx
jmp SmallForwardMove_3
@SmallCheck:
je @Done {For Compatibility with Delphi's move for Source = Dest}
sub eax,ecx
jmp SmallBackwardMove_3
@Large:
jng @Done {For Compatibility with Delphi's move for Count < 0}
cmp eax,edx
jg @moveforward
je @Done {For Compatibility with Delphi's move for Source = Dest}
push eax
add eax,ecx
cmp eax,edx
pop eax
jg @movebackward
@moveforward:
jmp dword ptr fastmoveproc_forward
@movebackward:
jmp dword ptr fastmoveproc_backward {Source/Dest Overlap}
@Done:
end;
{Move ECX Bytes from EAX to EDX, where EAX > EDX and ECX > 36 (SMALLMOVESIZE)}
procedure Forwards_SSE_3;assembler;nostackframe;
const
LARGESIZE = 2048;
asm
cmp ecx,LARGESIZE
jge @FwdLargeMove
cmp ecx,SMALLMOVESIZE+32
movups xmm0,[eax]
jg @FwdMoveSSE
movups xmm1,[eax+16]
movups [edx],xmm0
movups [edx+16],xmm1
add eax,ecx
add edx,ecx
sub ecx,32
jmp SmallForwardMove_3
@FwdMoveSSE:
push ebx
mov ebx,edx
{Align Writes}
add eax,ecx
add ecx,edx
add edx,15
and edx,-16
sub ecx,edx
add edx,ecx
{Now Aligned}
sub ecx,32
neg ecx
@FwdLoopSSE:
movups xmm1,[eax+ecx-32]
movups xmm2,[eax+ecx-16]
movaps [edx+ecx-32],xmm1
movaps [edx+ecx-16],xmm2
add ecx,32
jle @FwdLoopSSE
movups [ebx],xmm0 {First 16 Bytes}
neg ecx
add ecx,32
pop ebx
jmp SmallForwardMove_3
@FwdLargeMove:
push ebx
mov ebx,ecx
test edx,15
jz @FwdLargeAligned
{16 byte Align Destination}
mov ecx,edx
add ecx,15
and ecx,-16
sub ecx,edx
add eax,ecx
add edx,ecx
sub ebx,ecx
{Destination now 16 Byte Aligned}
call SmallForwardMove_3
mov ecx,ebx
@FwdLargeAligned:
and ecx,-16
sub ebx,ecx {EBX = Remainder}
push edx
push eax
push ecx
call AlignedFwdMoveSSE_3
pop ecx
pop eax
pop edx
add ecx,ebx
add eax,ecx
add edx,ecx
mov ecx,ebx
pop ebx
jmp SmallForwardMove_3
end; {Forwards_SSE}
This could be shortened, but you would only gain a few cycles, because the procedure just works out the best way to do the move and then jumps to the appropriate routine.
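One visible effect of that dispatching is that Move picks a forward or backward copy depending on how source and destination overlap. A small sketch to try it out (the array A and its contents are just sample data):

```pascal
program OverlapMoveSketch;
{$mode objfpc}{$H+}
var
  A: array[0..7] of Byte = (0, 1, 2, 3, 4, 5, 6, 7);
  i: Integer;
begin
  // Shift the first six bytes two places to the right.
  // Source (A[0..5]) and destination (A[2..7]) overlap, so Move has to
  // pick a copy direction that does not overwrite bytes it still needs.
  Move(A[0], A[2], 6);

  for i := Low(A) to High(A) do
    Write(A[i], ' ');
  WriteLn;
end.
```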
I wouldn't focus on the Move procedure, though, because it's already in assembler. Instead, create an array, write the header into it first, then Move the data from instream.memory into the array and loop over it to perform your action.
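Roughly like this. It is only a sketch: the 4-byte header, DataSize, Buf and the "+ 1" action are placeholders I made up, so substitute your own header layout and calculation:

```pascal
program ArrayHeaderSketch;
{$mode objfpc}{$H+}
uses
  Classes;
const
  HeaderSize = 4;    // hypothetical header length
  DataSize   = 1024; // hypothetical payload size
var
  InStream: TMemoryStream;
  Buf: array of Byte;
  i: Integer;
begin
  InStream := TMemoryStream.Create;
  try
    // Stand-in for the real input stream: DataSize zero bytes.
    InStream.Size := DataSize;
    FillChar(InStream.Memory^, DataSize, 0);

    // One allocation for header plus payload.
    SetLength(Buf, HeaderSize + DataSize);

    // Write the (made-up) header first.
    Buf[0] := Ord('H');
    Buf[1] := Ord('D');
    Buf[2] := Ord('R');
    Buf[3] := 0;

    // Copy the whole payload behind the header in a single Move.
    Move(InStream.Memory^, Buf[HeaderSize], DataSize);

    // Perform the per-byte action directly on the array.
    for i := HeaderSize to High(Buf) do
      Buf[i] := Buf[i] + 1;

    WriteLn('Processed ', Length(Buf) - HeaderSize, ' payload bytes');
  finally
    InStream.Free;
  end;
end.
```

Passing Buf[HeaderSize] rather than Buf matters here: with a dynamic array, Buf itself is a reference, so Move needs the first element it should write to.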
If you look at the assembler "debug view" when you run the following snippet:
for i := 0 to 20 * 1024 * 1024 do
begin
a[i] := a[i] + 1;
end;
You see something like this:
asm
@back:
movl $0xffffffff,-0xdc(%ebp)
mov %esi,%esi
mov -0xdc(%ebp),%eax
add $0x1,%eax
mov %eax,-0xdc(%ebp)
movzbl -0x70(%ebp,%eax,1),%eax
add $0x1,%eax
mov -0xdc(%ebp),%edx
mov %al,-0x70(%ebp,%edx,1)
cmpl $0x1400000,-0xdc(%ebp)
jl back
end;
What would you want to change there, performance-wise?
It all depends on your "B := B + 1;" calculation, because I don't think the loop itself can be made any more efficient. (But don't use stream.read and stream.write, because they do add some overhead.)
Note: back when I did assembler, we only had ax, ah, al and such (8086 processor).