Lazarus

Free Pascal => General => Topic started by: Seenkao on September 16, 2021, 10:04:39 am

Title: Zeroing the array.
Post by: Seenkao on September 16, 2021, 10:04:39 am
Всем привет!
Rus: Допустим, мы имеем массив из Byte (n: array[1..20] of Byte). Когда я хочу его обнулить
Eng (yandex translate): Let's say we have an array of Byte (n: array[1..20] of Byte). When I want to reset it
Code: Pascal  [Select][+][-]
  1. for i := 1 to 20 do
  2.   n[i] := 0;
Rus: мы получаем (ассемблер x86):
Eng: we get (x86 assembler):
Code: Text  [Select][+][-]
  1. movb $0x0,(%rsp)
  2. movb $0x0,0x1(%rsp)
  3. movb $0x0,0x2(%rsp)
  4. ...
  5. movb $0x0,0x13(%rsp)

Rus: понимаем, что в 32-х/64-х битных системах, это не очень хороший подход.
  Добавляю в код переменную-указатель (Pn: PLongWord), меняю код в паскале
Eng: we understand that in 32-bit / 64-bit systems, this is not a very good approach.
  I add a pointer variable (Pn: PLongWord) to the code, I change the code in pascal
Code: Pascal  [Select][+][-]
  1. Pn := @n;
  2. for i := 0 to 4 do
  3.   Pn[i] := 0;
Rus: конечный код в ассемблере
Eng: the final code in the assembler
Code: Text  [Select][+][-]
  1. mov %rsp,%rax
  2. movl $0x0,(%rax)
  3. movl $0x0,0x4(%rax)
  4. movl $0x0,0x8(%rax)
  5. movl $0x0,0xc(%rax)
  6. movl $0x0,0x10(%rax)

Rus: Есть ли способы более быстрого очищения массива? Есть ли способы, которые автоматизированы в паскале и не надо самому писать дополнительный код, для ускорения работы программы? С какими подводными камнями мы можем встретиться кроме того, что при неосторожности вылезем за пределы массива и очистим какую-то рабочую область?
Eng: Are there any ways to clear the array faster? Are there any ways that are automated in pascal and you don't need to write additional code yourself to speed up the program? What pitfalls can we encounter besides the fact that, if we are careless, we will get out of the array and clear some working area?
Title: Re: Zeroing the array.
Post by: Zvoni on September 16, 2021, 10:07:07 am
FillByte-Function?
https://www.freepascal.org/docs-html/rtl/system/fillbyte.html

Default-Function?
https://www.freepascal.org/docs-html/rtl/system/default.html
Title: Re: Zeroing the array.
Post by: Seenkao on September 16, 2021, 10:28:17 am
Rus: Нет, функция FillByte работает дольше даже простой очистки массива байтов. А установку по умолчанию, мы не можем всегда использовать, есть вызываемые функции, которые так же будут изначально очищать эти массивы, что так же будет дольше.
Eng: No, the FillByte function works longer than even a simple cleaning of an array of bytes. And we can't always use the default setting, there are called functions that will also initially clear these arrays, which will also take longer.

P.S. чем больше я занимаюсь паскалем, тем больше понимаю, что достаточно много не совершенно даже в исполняемом коде...
P.S. the more I study pascal, the more I understand that quite a lot is not perfect even in executable code...
Title: Re: Zeroing the array.
Post by: y.ivanov on September 16, 2021, 12:00:26 pm
Всем привет!
Rus: Допустим, мы имеем массив из Byte (n: array[1..20] of Byte). Когда я хочу его обнулить
Eng (yandex translate): Let's say we have an array of Byte (n: array[1..20] of Byte). When I want to reset it
Code: Pascal  [Select][+][-]
  1. for i := 1 to 20 do
  2.   n[i] := 0;
Rus: мы получаем (ассемблер x86):
Eng: we get (x86 assembler):
Code: Text  [Select][+][-]
  1. movb $0x0,(%rsp)
  2. movb $0x0,0x1(%rsp)
  3. movb $0x0,0x2(%rsp)
  4. ...
  5. movb $0x0,0x13(%rsp)

Rus: понимаем, что в 32-х/64-х битных системах, это не очень хороший подход.
  Добавляю в код переменную-указатель (Pn: PLongWord), меняю код в паскале
Eng: we understand that in 32-bit / 64-bit systems, this is not a very good approach.

In the particular case, that is the result from the loop unrolling optimization. If you use larger array (perhaps >32) it will not be unrolled. Use -OoNOLOOPUNROLL option to disable unrolling if you don't like the resulting code for some reason.

  I add a pointer variable (Pn: PLongWord) to the code, I change the code in pascal
Code: Pascal  [Select][+][-]
  1. Pn := @n;
  2. for i := 0 to 4 do
  3.   Pn[i] := 0;
Rus: конечный код в ассемблере
Eng: the final code in the assembler
Code: Text  [Select][+][-]
  1. mov %rsp,%rax
  2. movl $0x0,(%rax)
  3. movl $0x0,0x4(%rax)
  4. movl $0x0,0x8(%rax)
  5. movl $0x0,0xc(%rax)
  6. movl $0x0,0x10(%rax)

Rus: Есть ли способы более быстрого очищения массива? Есть ли способы, которые автоматизированы в паскале и не надо самому писать дополнительный код, для ускорения работы программы? С какими подводными камнями мы можем встретиться кроме того, что при неосторожности вылезем за пределы массива и очистим какую-то рабочую область?
Eng: Are there any ways to clear the array faster? Are there any ways that are automated in pascal and you don't need to write additional code yourself to speed up the program? What pitfalls can we encounter besides the fact that, if we are careless, we will get out of the array and clear some working area?

You can always write your own inline assembly procedure FillByte() to exploit the longword assembly instructions, e.g. REP STOSW/D/Q.
Title: Re: Zeroing the array.
Post by: marcov on September 16, 2021, 12:11:02 pm
Some notes:

Title: Re: Zeroing the array.
Post by: Warfley on September 16, 2021, 02:10:16 pm
Sadly the FPC is simply not very good at optimizing. If you compare it with the libc memset function, it is much much slower:

Even when reimplementing the LIBC code line by line it  is much  slower:
Code: Pascal  [Select][+][-]
  1. program Project1;
  2.  
  3. {$mode objfpc}{$H+}
  4.  
  5. uses
  6.   SysUtils, cmem;
  7.  
  8. {$IfDef WINDOWS}
  9. const LIBC = 'msvcrt';
  10. {$Else}    
  11. const LIBC = 'c';
  12. {$ENDIF WINDOWS}
  13.  
  14. // LibC memset linking
  15. function libc_memset(destpp: pointer; c: Integer; len: SizeUInt): Pointer; external LIBC name 'memset';
  16.  
  17. // exact reimplementation of GLIBC memset function: https://github.com/lattera/glibc/blob/master/string/memset.c
  18. procedure memset(dstpp: Pointer; c: Integer; len: SizeInt);
  19. type
  20.   TOP = SizeUInt;
  21.   POP = ^TOP;
  22. const
  23.   OPSIZE = sizeof(TOP);
  24. var
  25.   dstp, xlen: SizeInt;
  26.   cccc: TOP;
  27. begin
  28.   dstp := SizeInt(dstpp);
  29.   if len >= 8 then
  30.   begin
  31.     cccc := Byte(c);
  32.     cccc := cccc or (cccc shl 8);
  33.     cccc := cccc or (cccc shl 16);
  34.     if OPSIZE > 4 then
  35.       cccc := cccc or ((cccc shl 16) shl 16);
  36.     while dstp mod OPSIZE <> 0 do
  37.     begin
  38.       PByte(dstp)[0] := c;
  39.       Inc(dstp);
  40.       Dec(len);
  41.     end;
  42.  
  43.     xlen := len div (OPSIZE * 8);
  44.     while xlen > 0 do
  45.     begin
  46.       POP(dstp)[0] := cccc;
  47.       POP(dstp)[1] := cccc;
  48.       POP(dstp)[2] := cccc;
  49.       POP(dstp)[3] := cccc;
  50.       POP(dstp)[4] := cccc;
  51.       POP(dstp)[5] := cccc;
  52.       POP(dstp)[6] := cccc;
  53.       POP(dstp)[7] := cccc;
  54.       Inc(dstp, OPSIZE * 8);
  55.       Dec(xlen);
  56.     end;
  57.     len := len mod (OPSIZE * 8);
  58.  
  59.     xlen := len div OPSIZE;
  60.     while xlen > 0 do
  61.     begin
  62.       POP(dstp)[0] := cccc;
  63.       Inc(dstp, OPSIZE);
  64.       Dec(xlen);
  65.     end;
  66.     len := len mod OPSIZE;
  67.   end;
  68.     len := len mod (OPSIZE * 8);
  69.   while len > 0 do
  70.   begin
  71.     PByte(dstp)[0] := c;
  72.     Inc(dstp);
  73.     Dec(len);
  74.   end;
  75. end;
  76.  
  77. procedure TestFillChar(Buff: PByte; Size: SizeInt; Iterations: Integer);
  78. var
  79.   i: Integer;
  80.   sidx, eidx: Int64;
  81. begin
  82.   for i:=1 to Iterations do
  83.   begin
  84.     sidx := Random(64);
  85.     eidx := size - 1 - Random(64);
  86.     FillChar(Buff[sidx], eidx - sidx, Random(256));
  87.   end;
  88. end;
  89.  
  90. procedure TestMemset(Buff: PByte; Size: SizeInt; Iterations: Integer);
  91. var
  92.   i: Integer;
  93.   sidx, eidx: Int64;
  94. begin
  95.   for i:=1 to Iterations do
  96.   begin
  97.     sidx := Random(64);
  98.     eidx := size - 1 - Random(64);
  99.     memset(@Buff[sidx], Random(256), eidx - sidx);
  100.   end;
  101. end;
  102.  
  103. procedure TestCMemset(Buff: PByte; Size: SizeInt; Iterations: Integer);
  104. var
  105.   i: Integer;
  106.   sidx, eidx: Int64;
  107. begin
  108.   for i:=1 to Iterations do
  109.   begin
  110.     sidx := Random(64);
  111.     eidx := size - 1 - Random(64);
  112.     libc_memset(@Buff[sidx], Random(256), eidx - sidx);
  113.   end;
  114. end;
  115.  
  116. const
  117.   bsize = 1024*32;
  118.   iters = 1000000;
  119. var
  120.   start: QWord;
  121.   buff: array[0..bsize-1] of byte;
  122. begin
  123.   Randomize;
  124.   start := GetTickCount64;
  125.   TestFillChar(@buff, bsize, iters);
  126.   WriteLn('FillChar: ', GetTickCount64 - start);
  127.   start := GetTickCount64;
  128.   TestMemset(@buff, bsize, iters);
  129.   WriteLn('Memset:  ', GetTickCount64 - start);
  130.   start := GetTickCount64;
  131.   TestCMemset(@buff, bsize, iters);
  132.   WriteLn('LibC Memset:  ', GetTickCount64 - start);
  133.   ReadLn;
  134. end.
  135.  
Results (in ms):
Quote
FillChar: 1156
Memset:  1188
LibC Memset:  594
Title: Re: Zeroing the array.
Post by: marcov on September 16, 2021, 02:13:23 pm
Sadly the FPC is simply not very good at optimizing. If you compare it with the libc memset function, it is much much slower:

libc or msvcrt ?  And does whatever you choose implement libc_s like behaviour, iow selecting functions in a processor specific manner?
Title: Re: Zeroing the array.
Post by: Warfley on September 16, 2021, 03:29:19 pm
libc or msvcrt ?
Both. The results are pretty much identical on windows with msvcrt and on my 64 bit Arch Linux with the GLIBC (just generally a little slower I guess because I am running it in a VM, but the order of magnitude and differences are the same)
Sadly the FPC is simply not very good at optimizing. If you compare it with the libc memset function, it is much much slower:
And does whatever you choose implement libc_s like behaviour, iow selecting functions in a processor specific manner?
I don't think so. AFAIK the glibc used by Arch is just a binary package compiled for x86_64. But to verify this I would need to compile it myself, but as this would take at least 5 hours in a VM I would need to try this on my native installation with all my CPU cores available (and even then it would be at least 2 hours) and I can't try this right now.
But AFAIK the glibc memset code is literally just the C code (I translated to pascal in my example) compiled with a high degree of optimization.

Something that might also be interesting with this regard is that, as the FPC has an LLVM backend, this could provide a great benefit in this regard. LLVM has a memset intrinsic that will be compiled to the most efficient code (either portably or processor specific, depending on the compiling flags). And the best thing is, you don't need to implement anything fancy, LLVMs optimization passes automatically recognize memset loops (as long as  they aren't too complicated) and replaces them with the intrinsic
Title: Re: Zeroing the array.
Post by: ASerge on September 16, 2021, 06:22:17 pm
Results (in ms):
Quote
FillChar: 1156
Memset:  1188
LibC Memset:  594
Windows 7 x64, FPC 3.2.2 x64, -O1
Code: Text  [Select][+][-]
  1. FillChar: 1560
  2. Memset:  3667
  3. LibC Memset:  1575
Same with -O3
Code: Text  [Select][+][-]
  1. FillChar: 1575
  2. Memset:  1716
  3. LibC Memset:  1607

A lot depends on the random number generator, but at the 3rd level of optimization, all three functions are comparable in speed.
Title: Re: Zeroing the array.
Post by: Warfley on September 16, 2021, 06:43:53 pm
My results where all O3. But it's interesting that on your machine the libc version is an orders of magnitude slower than on my machine.
Also my results where with no debug information and running without a debugger.
Title: Re: Zeroing the array.
Post by: marcov on September 17, 2021, 12:45:25 pm
I missed 64-bit in my first comment. IIRC there is some bugreport about the 64-bit primitives, though I wouldn't know how to find it in gitlab :-)

I tried to compile the test using 32-bit, and then fillchar is better. libc doesn't work:

Quote from: 32bit
FillChar: 844
Memset:  2500

Quote from: 64bit
FillChar: 1218
Memset:  1250
LibC Memset:  735

So then fillchar and memset are close. It is indeed the 64-bit fill that is lacking.
Title: Re: Zeroing the array.
Post by: Warfley on September 17, 2021, 03:15:30 pm
I tried to compile the test using 32-bit, and then fillchar is better. libc doesn't work:
Thats because I'm an idiot and forgot the cdecl calling convention for the memset (and on 64 bit where everything fit the registers this didn't matter), here is the working version:
Code: Pascal  [Select][+][-]
  1. program Project1;
  2.  
  3. {$mode objfpc}{$H+}
  4.  
  5. uses
  6.   SysUtils;
  7.  
  8. {$IfDef WINDOWS}
  9. const LIBC = 'msvcrt';
  10. {$Else}    
  11. const LIBC = 'c';
  12. {$ENDIF WINDOWS}
  13.  
  14. // LibC memset linking
  15. function libc_memset(destpp: pointer; c: Integer; len: SizeUInt): Pointer; cdecl; external LIBC name 'memset';
  16.  
  17. // exact reimplementation of GLIBC memset function: https://github.com/lattera/glibc/blob/master/string/memset.c
  18. procedure memset(dstpp: Pointer; c: Integer; len: SizeInt);
  19. type
  20.   TOP = SizeUInt;
  21.   POP = ^TOP;
  22. const
  23.   OPSIZE = sizeof(TOP);
  24. var
  25.   dstp, xlen: SizeInt;
  26.   cccc: TOP;
  27. begin
  28.   dstp := SizeInt(dstpp);
  29.   if len >= 8 then
  30.   begin
  31.     cccc := Byte(c);
  32.     cccc := cccc or (cccc shl 8);
  33.     cccc := cccc or (cccc shl 16);
  34.     if OPSIZE > 4 then
  35.       cccc := cccc or ((cccc shl 16) shl 16);
  36.     while dstp mod OPSIZE <> 0 do
  37.     begin
  38.       PByte(dstp)[0] := c;
  39.       Inc(dstp);
  40.       Dec(len);
  41.     end;
  42.  
  43.     xlen := len div (OPSIZE * 8);
  44.     while xlen > 0 do
  45.     begin
  46.       POP(dstp)[0] := cccc;
  47.       POP(dstp)[1] := cccc;
  48.       POP(dstp)[2] := cccc;
  49.       POP(dstp)[3] := cccc;
  50.       POP(dstp)[4] := cccc;
  51.       POP(dstp)[5] := cccc;
  52.       POP(dstp)[6] := cccc;
  53.       POP(dstp)[7] := cccc;
  54.       Inc(dstp, OPSIZE * 8);
  55.       Dec(xlen);
  56.     end;
  57.     len := len mod (OPSIZE * 8);
  58.  
  59.     xlen := len div OPSIZE;
  60.     while xlen > 0 do
  61.     begin
  62.       POP(dstp)[0] := cccc;
  63.       Inc(dstp, OPSIZE);
  64.       Dec(xlen);
  65.     end;
  66.     len := len mod OPSIZE;
  67.   end;
  68.     len := len mod (OPSIZE * 8);
  69.   while len > 0 do
  70.   begin
  71.     PByte(dstp)[0] := c;
  72.     Inc(dstp);
  73.     Dec(len);
  74.   end;
  75. end;
  76.  
  77. procedure TestFillChar(Buff: PByte; Size: SizeInt; Iterations: Integer);
  78. var
  79.   i: Integer;
  80.   sidx, eidx: Int64;
  81. begin
  82.   for i:=1 to Iterations do
  83.   begin
  84.     sidx := Random(64);
  85.     eidx := size - 1 - Random(64);
  86.     FillChar(Buff[sidx], eidx - sidx, Random(256));
  87.   end;
  88. end;
  89.  
  90. procedure TestMemset(Buff: PByte; Size: SizeInt; Iterations: Integer);
  91. var
  92.   i: Integer;
  93.   sidx, eidx: Int64;
  94. begin
  95.   for i:=1 to Iterations do
  96.   begin
  97.     sidx := Random(64);
  98.     eidx := size - 1 - Random(64);
  99.     memset(@Buff[sidx], Random(256), eidx - sidx);
  100.   end;
  101. end;
  102.  
  103. procedure TestCMemset(Buff: PByte; Size: SizeInt; Iterations: Integer);
  104. var
  105.   i: Integer;
  106.   sidx, eidx: Int64;
  107. begin
  108.   for i:=1 to Iterations do
  109.   begin
  110.     sidx := Random(64);
  111.     eidx := size - 1 - Random(64);
  112.     libc_memset(@Buff[sidx], Random(256), eidx - sidx);
  113.   end;
  114. end;
  115.  
  116. const
  117.   bsize = 1024*32;
  118.   iters = 1000000;
  119. var
  120.   start: QWord;
  121.   buff: array[0..bsize-1] of byte;
  122. begin
  123.   Randomize;
  124.   start := GetTickCount64;
  125.   TestFillChar(@buff, bsize, iters);
  126.   WriteLn('FillChar: ', GetTickCount64 - start);
  127.   start := GetTickCount64;
  128.   TestMemset(@buff, bsize, iters);
  129.   WriteLn('Memset:  ', GetTickCount64 - start);
  130.   start := GetTickCount64;
  131.   TestCMemset(@buff, bsize, iters);
  132.   WriteLn('LibC Memset:  ', GetTickCount64 - start);
  133.   ReadLn;
  134. end.
  135.  

My results for 32 bit windows:
Quote
FillChar: 1079
Memset:  2281
LibC Memset:  672
I can't test on linux because I don't have a 32bit libc on my Linux.

To me this result suggests that on 32 bit the optimization isn't as good as on 64 bit, as  memset is much slower. I've seen that FillChar in the RTL is already implemented in assembly (probably for optimization reasons), so it looks to me as if  the compiled memset on 64 bit is much closer to such an assembly implementation than the 32 bit one is
Title: Re: Zeroing the array.
Post by: marcov on September 17, 2021, 03:37:04 pm
My results for 32 bit windows:
Quote
FillChar: 1079
Memset:  2281
LibC Memset:  672

i7-3770:

Quote from: 32bit
FillChar: 828
Memset:  2469
LibC Memset:  797

A what more modern CPU, a  Ryzen 7 4800H:

Quote from: 32bit
FillChar: 484
Memset:  1953
LibC Memset:  281

Suddenly a factor 2, so I suspect msvcrt doing CPU dependent memset. 
Title: Re: Zeroing the array.
Post by: MathMan on September 17, 2021, 03:43:54 pm
@Warfly - you are aware that your measurement may be disproportional due to the numerous calls to 'Random()'? 'Random()' is quite time consuming, being based on Mersenne Twister.

I'd rather prepare an array holding the parameter values and pick from that on each call to the respective variants of memory setting.

The differences between the three variants will come out more pronounced I'd guess.

Cheers,
MathMan
Title: Re: Zeroing the array.
Post by: olly on September 17, 2021, 03:52:55 pm
I guess it averages out, but with each iteration of each test calling Random(something) for start/end + value it can't be ideal for such a test. Also a different seed each time you run.

I redeclared Random to just return a static value:

Code: Pascal  [Select][+][-]
  1. function Random(i: Integer): Integer;
  2. begin
  3.   Result := 50;
  4. end;

i7 7600k.
(Would RAM speed affect this?)

Code: [Select]
FillChar: 1016
Memset:  1062
LibC Memset:  532
Title: Re: Zeroing the array.
Post by: JdeHaan on September 17, 2021, 05:26:32 pm
On my MacBook Pro (2,6 GHz 6-Core Intel Core i7):

FillChar: 364
Memset:  1777
LibC Memset:  361

edit:
using Warfly's program
Title: Re: Zeroing the array.
Post by: Warfley on September 17, 2021, 05:51:43 pm
@Warfly - you are aware that your measurement may be disproportional due to the numerous calls to 'Random()'? 'Random()' is quite time consuming, being based on Mersenne Twister.

I'd rather prepare an array holding the parameter values and pick from that on each call to the respective variants of memory setting.

The differences between the three variants will come out more pronounced I'd guess.

Cheers,
MathMan
As I am not interested in the absolute numbers but only in the relative differences, the overhead should average out. But sure, I also have thought about that but simply was too lazy

I guess it averages out, but with each iteration of each test calling Random(something) for start/end + value it can't be ideal for such a test. Also a different seed each time you run.

I redeclared Random to just return a static value:

The overhead of random  should average out over all tests, and to have a different random seed is important to do multiple trials, as I don't want to use the same random values everywhere (maybe one of the tests is just incredibly lucky).

The problem with using a static value rather than random values is twofold. First the Fillchar is dependent on the alignment of the data,  which is why the start and the end should be random to cover all possible cases, second modern CPUs use predictive caching, meaning when they reach a slow instruction (e.g. a branch) the CPU assumes what will happen based  on earlier encounters of the same instruction and continues computing with that estimate. If that turns out to be false, it does a rollback, if not it saved a lot of time not having to compute the instruction. So if you have multiple runs with the exact same input values, the CPU will simply learn what happens and skips doing the actual work.
So you don't actually measure the performance of the algorithms but rather how well the CPU does predictive caching.
Title: Re: Zeroing the array.
Post by: ASerge on September 17, 2021, 11:14:22 pm
I tried storing random numbers separately so that the tests for each function were the same, as well as reinitializing the buffer after each test:
Code: Pascal  [Select][+][-]
  1. {$MODE OBJFPC}
  2. {$LONGSTRINGS ON}
  3. {$APPTYPE CONSOLE}
  4.  
  5. uses SysUtils;
  6.  
  7. {$IfDEF WINDOWS}
  8. const LIBC = 'msvcrt';
  9. {$ELSE}
  10. const LIBC = 'c';
  11. {$ENDIF WINDOWS}
  12.  
  13. // LibC memset linking
  14. function libc_memset(destpp: pointer; c: Integer; len: SizeUInt): Pointer;
  15.   cdecl; external LIBC name 'memset';
  16.  
  17. // Exact reimplementation of GLIBC memset function:
  18. // https://github.com/lattera/glibc/blob/master/string/memset.c
  19. procedure memset(dstpp: Pointer; c: Integer; len: SizeInt);
  20. type
  21.   TOP = SizeUInt;
  22.   POP = ^TOP;
  23. const
  24.   OPSIZE = SizeOf(TOP);
  25. var
  26.   dstp, xlen: SizeInt;
  27.   cccc: TOP;
  28. begin
  29.   dstp := SizeInt(dstpp);
  30.   if len >= 8 then
  31.   begin
  32.     cccc := Byte(c);
  33.     cccc := cccc or (cccc shl 8);
  34.     cccc := cccc or (cccc shl 16);
  35.     if OPSIZE > 4 then
  36.       cccc := cccc or ((cccc shl 16) shl 16);
  37.     while dstp mod OPSIZE <> 0 do
  38.     begin
  39.       PByte(dstp)[0] := c;
  40.       Inc(dstp);
  41.       Dec(len);
  42.     end;
  43.  
  44.     xlen := len div (OPSIZE * 8);
  45.     while xlen > 0 do
  46.     begin
  47.       POP(dstp)[0] := cccc;
  48.       POP(dstp)[1] := cccc;
  49.       POP(dstp)[2] := cccc;
  50.       POP(dstp)[3] := cccc;
  51.       POP(dstp)[4] := cccc;
  52.       POP(dstp)[5] := cccc;
  53.       POP(dstp)[6] := cccc;
  54.       POP(dstp)[7] := cccc;
  55.       Inc(dstp, OPSIZE * 8);
  56.       Dec(xlen);
  57.     end;
  58.     len := len mod (OPSIZE * 8);
  59.  
  60.     xlen := len div OPSIZE;
  61.     while xlen > 0 do
  62.     begin
  63.       POP(dstp)[0] := cccc;
  64.       Inc(dstp, OPSIZE);
  65.       Dec(xlen);
  66.     end;
  67.     len := len mod OPSIZE;
  68.   end;
  69.     len := len mod (OPSIZE * 8);
  70.   while len > 0 do
  71.   begin
  72.     PByte(dstp)[0] := c;
  73.     Inc(dstp);
  74.     Dec(len);
  75.   end;
  76. end;
  77.  
  78. const
  79.   CTestCount = 1000000;
  80.  
  81. type
  82.   TRandomIndexes = array[0..CTestCount - 1] of record
  83.     StartIndex: SizeInt;
  84.     Count: SizeInt;
  85.     FillValue: Byte;
  86.   end;
  87.  
  88. procedure TestFillChar(Buff: PByte; constref Randoms: TRandomIndexes);
  89. var
  90.   i: SizeInt;
  91. begin
  92.   for i := Low(Randoms) to High(Randoms) do
  93.     FillChar(
  94.       Buff[Randoms[i].StartIndex],
  95.       Randoms[i].Count,
  96.       Randoms[i].FillValue);
  97. end;
  98.  
  99. procedure TestMemset(Buff: PByte; constref Randoms: TRandomIndexes);
  100. var
  101.   i: SizeInt;
  102. begin
  103.   for i := Low(Randoms) to High(Randoms) do
  104.     memset(
  105.       @Buff[Randoms[i].StartIndex],
  106.       Randoms[i].FillValue,
  107.       Randoms[i].Count);
  108. end;
  109.  
  110. procedure TestCMemset(Buff: PByte; constref Randoms: TRandomIndexes);
  111. var
  112.   i: SizeInt;
  113. begin
  114.   for i := Low(Randoms) to High(Randoms) do
  115.     libc_memset(
  116.       @Buff[Randoms[i].StartIndex],
  117.       Randoms[i].FillValue,
  118.       Randoms[i].Count);
  119. end;
  120.  
  121. procedure Test;
  122. const
  123.   CBufSize = 1024 * 32;
  124.  
  125. var
  126.   DefBuffer, WorkBuffer: array[0..CBufSize] of Byte;
  127.   Indexes: ^TRandomIndexes;
  128.  
  129.   procedure InitValues;
  130.   var
  131.     i: SizeInt;
  132.   begin
  133.     Randomize;
  134.     for i := Low(DefBuffer) to High(DefBuffer) do
  135.       DefBuffer[i] := Random(256);
  136.     for i := Low(Indexes^) to High(Indexes^) do
  137.     begin
  138.       Indexes^[i].FillValue := Random(256);
  139.       Indexes^[i].StartIndex := Random(64);
  140.       Indexes^[i].Count := (CBufSize - 1) - Random(64) - Indexes^[i].StartIndex;
  141.     end;
  142.   end;
  143.  
  144. var
  145.   TimeStart: QWord;
  146. begin
  147.   New(Indexes);
  148.   try
  149.     InitValues;
  150.  
  151.     WorkBuffer := DefBuffer;
  152.     TimeStart := GetTickCount64;
  153.     TestFillChar(@WorkBuffer, Indexes^);
  154.     WriteLn('FillChar: ', GetTickCount64 - TimeStart);
  155.  
  156.     WorkBuffer := DefBuffer;
  157.     TimeStart := GetTickCount64;
  158.     TestMemset(@WorkBuffer, Indexes^);
  159.     WriteLn('Memset:  ', GetTickCount64 - TimeStart);
  160.  
  161.     WorkBuffer := DefBuffer;
  162.     TimeStart := GetTickCount64;
  163.     TestCMemset(@WorkBuffer, Indexes^);
  164.     WriteLn('LibC Memset:  ', GetTickCount64 - TimeStart);
  165.   finally
  166.     Dispose(Indexes);
  167.   end;
  168. end;
  169.  
  170. begin
  171.   Test;
  172.   ReadLn;
  173. end.
But the result on my machine has not changed.
With -O1 (without debugging)
Code: Text  [Select][+][-]
  1. FillChar: 1482
  2. Memset:  3619
  3. LibC Memset:  1529
With -O3
Code: Text  [Select][+][-]
  1. FillChar: 1482
  2. Memset:  1623
  3. LibC Memset:  1497
Title: Re: Zeroing the array.
Post by: MathMan on September 18, 2021, 12:37:40 pm
@Warfly, ASerge, marcov - first, I have to apologize. I somehow missed the fact that the buffer to be filled is 32k in size and only up to 64 bytes are nibbled away from beginning and end. So actually mid-sized fills are tested, and in that range the contribution of 'Random()' execution time will be minor.

That said, the results are explainable from my point of view

* c-memset has received a lot of attention from the community wrt optimizing execution times. It uses processor specific variants exploiting GPR, SSE, AVX (-2, -512) capabilities to reduce execution time. That explains, why e.g. marcov has seen such difference between the i7-3770 (Sandy Bridge without VMOVDQA/U instruction and therefore using SSE based fills) and the more modern AMD 4880H (using said).

* Intel even experimented with some store elimination schemes to further increase memset execution speed. You can read the whole story here https://travisdowns.github.io/blog/2020/05/13/intel-zero-opt.html and here https://travisdowns.github.io/blog/2020/05/18/icelake-zero-opt.html. Caveat - that's deep down stuff and inbetween Intel seems to have reverted this with recent microcode upgrades.

* there is still some discussion ongoing in the community to scrap all nifty memset / memcopy implementations and replace with simple 'rep stosb' (or similar) as these opcodes have received decent improvement in most recent architectures from Intel / AMD.

* that the pascal source based memset is slower than the RTL fillchar is no surprise too - in fact I think it is doing quite good when compared on architectures which have no AVX or SSE support (comparing apples to apples).

Cheers,
MathMan
Title: Re: Zeroing the array.
Post by: marcov on September 18, 2021, 05:32:09 pm
Also random numbers might the result a bit hard to interpret. If one fill routine has a different cut-off wrt just doing rep stosb and go the alignment and multiple words route, that gives a jumbled number, and no real indication what is wrong.

So running the benchmarks multiple times on fresh memory for a profile of sizes (now the maximum seems 64-byte, which is a bit low) would give much more info. (if you want pointers, see the old unit tests of the fastcode project)

I searched for the bugreport I talked about in early messages, and it turns out to be only for larger sizes: https://gitlab.com/freepascal.org/fpc/source/-/issues/32637  (fillword etc)
Title: Re: Zeroing the array.
Post by: Seenkao on September 18, 2021, 10:28:48 pm
Some notes:

  • I don't see any numbers, just a blanket statement that fillbyte/dwordis no good. It is possible that for low bytecounts it just does this.
  • In the generate case this is asking for a  vectorizing compiler. An loop with array operation in single elements is implemented as a loop using an instruction that does multiple elements (4) at once.
  • the proposed solution does this vectorizing without any alignment checks, the badness of this is CPU dependent. As inline assembler it might also affect spilling if it is in a larger loop (?) 
  • On very modern (10th core and up) rep stosb has been accelerated, and it might be that fillbyte is faster then
  • Two ways to work around this
    • is some peephole that detects this filling construct, but that is a lot of work
    • An intrinsic maybe that for short constant fills inlines code, and otherwise calls fillbyte?

Rus: В Windows предлагаю вашему вниманию изменённый код (на ассемблере) для x86_64 и i 386:
Eng:  In Windows, I offer you the modified code (in assembler) for x86_64 and i386:
Code: Pascal  [Select][+][-]
  1. procedure memset(dstpp: Pointer; c: Integer; len: SizeInt);
  2. begin
  3.   {$IfDef cpui386}
  4.   asm
  5.     movl dstpp,%edi
  6.     movl len,%ecx
  7.     dec %ecx
  8.     cld
  9.     movb c,%al
  10.     rep stosb
  11.   end;
  12.   {$Else}
  13.   asm
  14.     movq dstpp,%rdi
  15.     movq len, %rcx
  16.     dec %ecx
  17.     cld
  18.     movb c,%al
  19.     rep stosb
  20.   end;
  21.   {$EndIf}
  22. end;    
Rus: у меня данный код выдал (на Windows) ускорение работы кода почти в 10 раз! Я считаю, что это далеко не лучший результат. Но вот для Linux, он не даёт такого результата, а лишь приближает к "memset" из libc. Но как я и писал выше, я даже не ускорял данный код. Тестируйте, возможно у вас будет по другому! Наверняка у нас разные компьютеры.
Eng: I have this code issued (on Windows) an acceleration of the code almost 10 times! I think that this is not the best result. But for Linux, it does not give such a result, but only brings it closer to the "memset" from libc. But as I wrote above, I didn't even speed up this code. Test it, it may be different for you! We probably have different computers.

marcov
rus: всё, что я писал в изначальном посте, я лишь показывал код, и при простом сравнении 20 команд в любом случае будут дольше выполняться чем 5 подобных команд.
eng: everything I wrote in the original post, I just showed the code, and with a simple comparison, 20 commands will take longer to execute than 5 similar commands in any case.  :)

Amendment:
Code: Pascal  [Select][+][-]
  1. procedure memset(dstpp: Pointer; c: Integer; len: SizeInt);
  2. begin
  3.   {$IfDef cpui386}
  4.   asm
  5.     movl dstpp,%edi
  6.     movl len,%ecx
  7.     dec %ecx
  8.     cld
  9.     movb c,%al
  10.     rep stosb
  11.   end;
  12.   {$Else}
  13.   asm
  14.     movq dstpp,%rdi
  15.     movq len, %rcx;    // not ecx
  16.     dec %ecx
  17.     cld
  18.     movb c,%al
  19.     rep stosb
  20.   end;
  21.   {$EndIf}
  22. end;    
Title: Re: Zeroing the array.
Post by: Seenkao on September 18, 2021, 10:36:17 pm
Rus: Забыл, я отключил все рандомные значения в программе предложенной Warfley. При включении случайных чисел, очень сильно падает разница, видимо рандом очень сильную нагрузку даёт. Но всё равно процедура на чистом ассемблере быстрее для Windows.
Eng: I forgot, I disabled all random values in the program proposed by Waffle. When you turn on random numbers, the difference drops very much, apparently random gives a very strong load. But still the procedure in pure assembler is faster for Windows.
Title: Re: Zeroing the array.
Post by: Seenkao on September 20, 2021, 07:25:22 am
Code: Pascal  [Select][+][-]
  1. program project1;
  2.  
  3. {$mode objfpc}{$H+}
  4. {$LONGSTRINGS ON}
  5. {$APPTYPE CONSOLE}
  6.  
  7. uses
  8.   SysUtils, {$IfDef UNIX}UnixType{$EndIf} {$IfDef Windows} Windows{$EndIf};
  9. {$IfDef WINDOWS}
  10. const LIBC = 'msvcrt';
  11. {$IFDEF FPC}
  12.   {$LINKLIB libmsvcrt.a}        // Taken from ZenGL. The program on Windows did not start without them.
  13. {$ENDIF}
  14. {$Else}
  15. const LIBC = 'libc';
  16. {$ENDIF WINDOWS}
  17.  
  18. var
  19.   timeStart, timeEnd, timerStart: Double;
  20.   {$IfDef UNIX}{$IfNDef MAC_COCOA}
  21.   timerTimeVal: TimeVal;
  22.   {$Else}
  23.   timerTimebaseInfo: mach_timebase_info_t;
  24.   {$ENDIF}
  25.   {$ENDIF}
  26.   {$IFDEF WINDOWS}
  27.   timerFrequency: Int64;
  28.   {$ENDIF}
  29.   timeOld, timeNew: Double;
  30.  
  31.   {$IfDef UNIX}{$IfNDef MAC_COCOA}
  32.   function fpGetTimeOfDay(val: PTimeVal; tzp: Pointer): Integer; cdecl; external 'libc' name 'gettimeofday';
  33.   {$Else}
  34.   type
  35.     mach_timebase_info_t = record
  36.       numer: LongWord;
  37.       denom: LongWord;
  38.     end;
  39.  
  40.     function mach_timebase_info(var info: mach_timebase_info_t): Integer; cdecl; external 'libc';
  41.     function mach_absolute_time: QWORD; cdecl; external 'libc';
  42.   {$ENDIF}{$EndIf}
  43.  
  44. // LibC memset linking
  45. function libc_memset(destpp: pointer; c: Integer; len: SizeUInt): Pointer; external LIBC name 'memset';
  46.  
  47. // запрос таймера, оставляю для всех систем
  48. function timer_GetTicks: Double;
  49. {$IFDEF WINDOWS}
  50. var
  51.   t: int64;
  52.   m: LongWord;
  53. {$EndIf}
  54. begin
  55.   {$IfDef UNIX}{$IfNDef MAC_COCOA}
  56.   fpGetTimeOfDay(@timerTimeVal, nil);
  57.   {$Q-}
  58.   // FIXME: почему-то overflow вылетает с флагом -Co
  59.   Result := timerTimeVal.tv_sec * 1000 + timerTimeVal.tv_usec / 1000 - timerStart;
  60.   {$Q+}
  61. {$Else}
  62.   Result := mach_absolute_time() * timerTimebaseInfo.numer / timerTimebaseInfo.denom / 1000000 - timerStart;
  63. {$ENDIF}{$EndIf}
  64. {$IFDEF WINDOWS}
  65.   m := SetThreadAffinityMask(GetCurrentThread(), 1);
  66.   QueryPerformanceCounter(t);
  67.   Result := 1000 * t / timerFrequency - timerStart;
  68.   SetThreadAffinityMask(GetCurrentThread(), m);
  69. {$ENDIF}
  70. end;
  71.  
  72. procedure memset(dstpp: Pointer; c: byte; len: SizeInt);   // replaced, only for i386/x86_64 processors!!!
  73. begin
  74.   {$IfDef cpui386}
  75.   asm
  76.     movl dstpp,%edi
  77.     movl len,%ecx
  78.     dec %ecx
  79.     cld
  80.     movb c,%al
  81.     rep stosb
  82.   end;
  83.   {$Else}
  84.   asm
  85.     movq dstpp,%rdi
  86.     movq len,%rcx
  87.     dec %ecx
  88.     cld
  89.     movb c,%al
  90.     rep stosb
  91.   end;
  92.   {$EndIf}
  93. end;
  94.  
  95. procedure TestFillChar(Buff: PByte; Size: SizeInt; Iterations: Integer);
  96. var
  97.   i: Integer;
  98.   sidx, eidx: Int64;
  99. begin
  100.   for i:=1 to Iterations do
  101.   begin
  102.     sidx := 33; // Random(64);
  103.     eidx := size - 1;// - Random(64);
  104.     FillChar(Buff[sidx], eidx - sidx, {Random(256)}67);
  105.   end;
  106. end;
  107.  
  108. procedure TestMemset(Buff: PByte; Size: SizeInt; Iterations: Integer);
  109. var
  110.   i: Integer;
  111.   sidx, eidx: Int64;
  112. begin
  113.   for i:=1 to Iterations do
  114.   begin
  115.     sidx := 33; // Random(64);
  116.     eidx := size - 1;// - Random(64);
  117.     memset(@Buff[sidx], {Random(256)}67, eidx - sidx);
  118.   end;
  119. end;
  120.  
  121. procedure TestCMemset(Buff: PByte; Size: SizeInt; Iterations: Integer);
  122. var
  123.   i: Integer;
  124.   sidx, eidx: Int64;
  125. begin
  126.   for i:=1 to Iterations do
  127.   begin
  128.     sidx := 33; // Random(64);
  129.     eidx := size - 1;// - Random(64);
  130.     libc_memset(@Buff[sidx], {Random(256)}67, eidx - sidx);
  131.   end;
  132. end;
  133.  
  134. const
  135.   bsize = 1024*32;
  136.   iters = 1000000;
  137. var
  138.   start: Double;
  139.   buff: array[0..bsize-1] of byte;
  140.   i: Integer;
  141.  
  142. begin
  143.   {$IFDEF WINDOWS}
  144.   SetThreadAffinityMask(GetCurrentThread(), 1);
  145.   QueryPerformanceFrequency(timerFrequency);
  146.   {$ENDIF}
  147.   timerStart := timer_GetTicks();
  148.  
  149.   Randomize;
  150.   timerStart := timer_GetTicks();
  151.   timeOld := 0;
  152.   timeNew := 0;
  153.  
  154.   start := timer_GetTicks;
  155.   TestFillChar(@buff, bsize, iters);
  156.   WriteLn('FillChar: ', timer_GetTicks - start);
  157.   start := timer_GetTicks;
  158.   TestMemset(@buff, bsize, iters);
  159.   WriteLn('Memset:  ', timer_GetTicks - start);
  160.   start := timer_GetTicks;
  161.   TestCMemset(@buff, bsize, iters);
  162.   WriteLn('LibC Memset:  ', timer_GetTicks - start);
  163.   ReadLn;
  164. end.

Tests Linux:
1388
1020(1020.9)
1020(1020.5)

Tests Windows:
1425
174.9 !!!!  ???  I checked, the data is being transferred.
1394.9

Intel(R) Core(TM) i5-3230M CPU @ 2.60 2.60
Title: Re: Zeroing the array.
Post by: bytebites on September 20, 2021, 08:04:17 am
Linux
FillChar:  9.1158398437500000E+002
Memset:   2.9922192382812500E+002
LibC Memset:   2.3490307617187500E+002

Ryzen 3900X
Title: Re: Zeroing the array.
Post by: Zvoni on September 20, 2021, 08:44:03 am
Win10-Pro 64-Bit
FPC3.2.0/LAZ2.0.12 - 32-Bit
FillChar:  4.7453450012207031E+002
Memset:   4.8254630017280579E+002
LibC Memset:   3.5183000564575195E+000

i5-8365U @ 1.6GHZ - 16GB RAM
Title: Re: Zeroing the array.
Post by: Seenkao on September 20, 2021, 11:23:22 pm
FPC3.2.0/LAZ2.0.12 - 32-Bit
LibC Memset:   3.5183000564575195E+000
Build a 64-bit version for the test. Or find the msvcrt 32-bit version.

AMD Athlon II X2 240 (Linux)
FillChar:  7.7355908203125000E+002
Memset:   7.7813085937500000E+002 (assembler!!!)
LibC Memset:   1.0655258789062500E+003

As you can see, there is no absolute solution for different machines.
TinyPortal © 2005-2018