Recent

Author Topic: Zeroing the array.  (Read 5829 times)

JdeHaan

  • Full Member
  • ***
  • Posts: 115
Re: Zeroing the array.
« Reply #15 on: September 17, 2021, 05:26:32 pm »
On my MacBook Pro (2,6 GHz 6-Core Intel Core i7):

FillChar: 364
Memset:  1777
LibC Memset:  361

edit:
using Warfly's program
« Last Edit: September 17, 2021, 05:28:19 pm by JdeHaan »

Warfley

  • Hero Member
  • *****
  • Posts: 1499
Re: Zeroing the array.
« Reply #16 on: September 17, 2021, 05:51:43 pm »
@Warfly - you are aware that your measurement may be disproportional due to the numerous calls to 'Random()'? 'Random()' is quite time consuming, being based on Mersenne Twister.

I'd rather prepare an array holding the parameter values and pick from that on each call to the respective variants of memory setting.

The differences between the three variants will come out more pronounced I'd guess.

Cheers,
MathMan
As I am not interested in the absolute numbers but only in the relative differences, the overhead should average out. But sure, I also have thought about that but simply was too lazy

I guess it averages out, but with each iteration of each test calling Random(something) for start/end + value it can't be ideal for such a test. Also a different seed each time you run.

I redeclared Random to just return a static value:

The overhead of random  should average out over all tests, and to have a different random seed is important to do multiple trials, as I don't want to use the same random values everywhere (maybe one of the tests is just incredibly lucky).

The problem with using a static value rather than random values is twofold. First the Fillchar is dependent on the alignment of the data,  which is why the start and the end should be random to cover all possible cases, second modern CPUs use predictive caching, meaning when they reach a slow instruction (e.g. a branch) the CPU assumes what will happen based  on earlier encounters of the same instruction and continues computing with that estimate. If that turns out to be false, it does a rollback, if not it saved a lot of time not having to compute the instruction. So if you have multiple runs with the exact same input values, the CPU will simply learn what happens and skips doing the actual work.
So you don't actually measure the performance of the algorithms but rather how well the CPU does predictive caching.
« Last Edit: September 17, 2021, 05:55:29 pm by Warfley »

ASerge

  • Hero Member
  • *****
  • Posts: 2212
Re: Zeroing the array.
« Reply #17 on: September 17, 2021, 11:14:22 pm »
I tried storing random numbers separately so that the tests for each function were the same, as well as reinitializing the buffer after each test:
Code: Pascal  [Select][+][-]
  1. {$MODE OBJFPC}
  2. {$LONGSTRINGS ON}
  3. {$APPTYPE CONSOLE}
  4.  
  5. uses SysUtils;
  6.  
  7. {$IfDEF WINDOWS}
  8. const LIBC = 'msvcrt';
  9. {$ELSE}
  10. const LIBC = 'c';
  11. {$ENDIF WINDOWS}
  12.  
  13. // LibC memset linking
  14. function libc_memset(destpp: pointer; c: Integer; len: SizeUInt): Pointer;
  15.   cdecl; external LIBC name 'memset';
  16.  
  17. // Exact reimplementation of GLIBC memset function:
  18. // https://github.com/lattera/glibc/blob/master/string/memset.c
  19. procedure memset(dstpp: Pointer; c: Integer; len: SizeInt);
  20. type
  21.   TOP = SizeUInt;
  22.   POP = ^TOP;
  23. const
  24.   OPSIZE = SizeOf(TOP);
  25. var
  26.   dstp, xlen: SizeInt;
  27.   cccc: TOP;
  28. begin
  29.   dstp := SizeInt(dstpp);
  30.   if len >= 8 then
  31.   begin
  32.     cccc := Byte(c);
  33.     cccc := cccc or (cccc shl 8);
  34.     cccc := cccc or (cccc shl 16);
  35.     if OPSIZE > 4 then
  36.       cccc := cccc or ((cccc shl 16) shl 16);
  37.     while dstp mod OPSIZE <> 0 do
  38.     begin
  39.       PByte(dstp)[0] := c;
  40.       Inc(dstp);
  41.       Dec(len);
  42.     end;
  43.  
  44.     xlen := len div (OPSIZE * 8);
  45.     while xlen > 0 do
  46.     begin
  47.       POP(dstp)[0] := cccc;
  48.       POP(dstp)[1] := cccc;
  49.       POP(dstp)[2] := cccc;
  50.       POP(dstp)[3] := cccc;
  51.       POP(dstp)[4] := cccc;
  52.       POP(dstp)[5] := cccc;
  53.       POP(dstp)[6] := cccc;
  54.       POP(dstp)[7] := cccc;
  55.       Inc(dstp, OPSIZE * 8);
  56.       Dec(xlen);
  57.     end;
  58.     len := len mod (OPSIZE * 8);
  59.  
  60.     xlen := len div OPSIZE;
  61.     while xlen > 0 do
  62.     begin
  63.       POP(dstp)[0] := cccc;
  64.       Inc(dstp, OPSIZE);
  65.       Dec(xlen);
  66.     end;
  67.     len := len mod OPSIZE;
  68.   end;
  69.     len := len mod (OPSIZE * 8);
  70.   while len > 0 do
  71.   begin
  72.     PByte(dstp)[0] := c;
  73.     Inc(dstp);
  74.     Dec(len);
  75.   end;
  76. end;
  77.  
  78. const
  79.   CTestCount = 1000000;
  80.  
  81. type
  82.   TRandomIndexes = array[0..CTestCount - 1] of record
  83.     StartIndex: SizeInt;
  84.     Count: SizeInt;
  85.     FillValue: Byte;
  86.   end;
  87.  
  88. procedure TestFillChar(Buff: PByte; constref Randoms: TRandomIndexes);
  89. var
  90.   i: SizeInt;
  91. begin
  92.   for i := Low(Randoms) to High(Randoms) do
  93.     FillChar(
  94.       Buff[Randoms[i].StartIndex],
  95.       Randoms[i].Count,
  96.       Randoms[i].FillValue);
  97. end;
  98.  
  99. procedure TestMemset(Buff: PByte; constref Randoms: TRandomIndexes);
  100. var
  101.   i: SizeInt;
  102. begin
  103.   for i := Low(Randoms) to High(Randoms) do
  104.     memset(
  105.       @Buff[Randoms[i].StartIndex],
  106.       Randoms[i].FillValue,
  107.       Randoms[i].Count);
  108. end;
  109.  
  110. procedure TestCMemset(Buff: PByte; constref Randoms: TRandomIndexes);
  111. var
  112.   i: SizeInt;
  113. begin
  114.   for i := Low(Randoms) to High(Randoms) do
  115.     libc_memset(
  116.       @Buff[Randoms[i].StartIndex],
  117.       Randoms[i].FillValue,
  118.       Randoms[i].Count);
  119. end;
  120.  
  121. procedure Test;
  122. const
  123.   CBufSize = 1024 * 32;
  124.  
  125. var
  126.   DefBuffer, WorkBuffer: array[0..CBufSize] of Byte;
  127.   Indexes: ^TRandomIndexes;
  128.  
  129.   procedure InitValues;
  130.   var
  131.     i: SizeInt;
  132.   begin
  133.     Randomize;
  134.     for i := Low(DefBuffer) to High(DefBuffer) do
  135.       DefBuffer[i] := Random(256);
  136.     for i := Low(Indexes^) to High(Indexes^) do
  137.     begin
  138.       Indexes^[i].FillValue := Random(256);
  139.       Indexes^[i].StartIndex := Random(64);
  140.       Indexes^[i].Count := (CBufSize - 1) - Random(64) - Indexes^[i].StartIndex;
  141.     end;
  142.   end;
  143.  
  144. var
  145.   TimeStart: QWord;
  146. begin
  147.   New(Indexes);
  148.   try
  149.     InitValues;
  150.  
  151.     WorkBuffer := DefBuffer;
  152.     TimeStart := GetTickCount64;
  153.     TestFillChar(@WorkBuffer, Indexes^);
  154.     WriteLn('FillChar: ', GetTickCount64 - TimeStart);
  155.  
  156.     WorkBuffer := DefBuffer;
  157.     TimeStart := GetTickCount64;
  158.     TestMemset(@WorkBuffer, Indexes^);
  159.     WriteLn('Memset:  ', GetTickCount64 - TimeStart);
  160.  
  161.     WorkBuffer := DefBuffer;
  162.     TimeStart := GetTickCount64;
  163.     TestCMemset(@WorkBuffer, Indexes^);
  164.     WriteLn('LibC Memset:  ', GetTickCount64 - TimeStart);
  165.   finally
  166.     Dispose(Indexes);
  167.   end;
  168. end;
  169.  
  170. begin
  171.   Test;
  172.   ReadLn;
  173. end.
But the result on my machine has not changed.
With -O1 (without debugging)
Code: Text  [Select][+][-]
  1. FillChar: 1482
  2. Memset:  3619
  3. LibC Memset:  1529
With -O3
Code: Text  [Select][+][-]
  1. FillChar: 1482
  2. Memset:  1623
  3. LibC Memset:  1497

MathMan

  • Sr. Member
  • ****
  • Posts: 325
Re: Zeroing the array.
« Reply #18 on: September 18, 2021, 12:37:40 pm »
@Warfly, ASerge, marcov - first, I have to apologize. I somehow missed the fact that the buffer to be filled is 32k in size and only up to 64 bytes are nibbled away from beginning and end. So actually mid-sized fills are tested, and in that range the contribution of 'Random()' execution time will be minor.

That said, the results are explainable from my point of view

* c-memset has received a lot of attention from the community wrt optimizing execution times. It uses processor specific variants exploiting GPR, SSE, AVX (-2, -512) capabilities to reduce execution time. That explains, why e.g. marcov has seen such difference between the i7-3770 (Sandy Bridge without VMOVDQA/U instruction and therefore using SSE based fills) and the more modern AMD 4880H (using said).

* Intel even experimented with some store elimination schemes to further increase memset execution speed. You can read the whole story here https://travisdowns.github.io/blog/2020/05/13/intel-zero-opt.html and here https://travisdowns.github.io/blog/2020/05/18/icelake-zero-opt.html. Caveat - that's deep down stuff and inbetween Intel seems to have reverted this with recent microcode upgrades.

* there is still some discussion ongoing in the community to scrap all nifty memset / memcopy implementations and replace with simple 'rep stosb' (or similar) as these opcodes have received decent improvement in most recent architectures from Intel / AMD.

* that the pascal source based memset is slower than the RTL fillchar is no surprise too - in fact I think it is doing quite good when compared on architectures which have no AVX or SSE support (comparing apples to apples).

Cheers,
MathMan

marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 11351
  • FPC developer.
Re: Zeroing the array.
« Reply #19 on: September 18, 2021, 05:32:09 pm »
Also random numbers might the result a bit hard to interpret. If one fill routine has a different cut-off wrt just doing rep stosb and go the alignment and multiple words route, that gives a jumbled number, and no real indication what is wrong.

So running the benchmarks multiple times on fresh memory for a profile of sizes (now the maximum seems 64-byte, which is a bit low) would give much more info. (if you want pointers, see the old unit tests of the fastcode project)

I searched for the bugreport I talked about in early messages, and it turns out to be only for larger sizes: https://gitlab.com/freepascal.org/fpc/source/-/issues/32637  (fillword etc)
« Last Edit: September 18, 2021, 05:39:15 pm by marcov »

Seenkao

  • Hero Member
  • *****
  • Posts: 545
    • New ZenGL.
Re: Zeroing the array.
« Reply #20 on: September 18, 2021, 10:28:48 pm »
Some notes:

  • I don't see any numbers, just a blanket statement that fillbyte/dwordis no good. It is possible that for low bytecounts it just does this.
  • In the generate case this is asking for a  vectorizing compiler. An loop with array operation in single elements is implemented as a loop using an instruction that does multiple elements (4) at once.
  • the proposed solution does this vectorizing without any alignment checks, the badness of this is CPU dependent. As inline assembler it might also affect spilling if it is in a larger loop (?) 
  • On very modern (10th core and up) rep stosb has been accelerated, and it might be that fillbyte is faster then
  • Two ways to work around this
    • is some peephole that detects this filling construct, but that is a lot of work
    • An intrinsic maybe that for short constant fills inlines code, and otherwise calls fillbyte?

Rus: В Windows предлагаю вашему вниманию изменённый код (на ассемблере) для x86_64 и i 386:
Eng:  In Windows, I offer you the modified code (in assembler) for x86_64 and i386:
Code: Pascal  [Select][+][-]
  1. procedure memset(dstpp: Pointer; c: Integer; len: SizeInt);
  2. begin
  3.   {$IfDef cpui386}
  4.   asm
  5.     movl dstpp,%edi
  6.     movl len,%ecx
  7.     dec %ecx
  8.     cld
  9.     movb c,%al
  10.     rep stosb
  11.   end;
  12.   {$Else}
  13.   asm
  14.     movq dstpp,%rdi
  15.     movq len, %rcx
  16.     dec %ecx
  17.     cld
  18.     movb c,%al
  19.     rep stosb
  20.   end;
  21.   {$EndIf}
  22. end;    
Rus: у меня данный код выдал (на Windows) ускорение работы кода почти в 10 раз! Я считаю, что это далеко не лучший результат. Но вот для Linux, он не даёт такого результата, а лишь приближает к "memset" из libc. Но как я и писал выше, я даже не ускорял данный код. Тестируйте, возможно у вас будет по другому! Наверняка у нас разные компьютеры.
Eng: I have this code issued (on Windows) an acceleration of the code almost 10 times! I think that this is not the best result. But for Linux, it does not give such a result, but only brings it closer to the "memset" from libc. But as I wrote above, I didn't even speed up this code. Test it, it may be different for you! We probably have different computers.

marcov
rus: всё, что я писал в изначальном посте, я лишь показывал код, и при простом сравнении 20 команд в любом случае будут дольше выполняться чем 5 подобных команд.
eng: everything I wrote in the original post, I just showed the code, and with a simple comparison, 20 commands will take longer to execute than 5 similar commands in any case.  :)

Amendment:
Code: Pascal  [Select][+][-]
  1. procedure memset(dstpp: Pointer; c: Integer; len: SizeInt);
  2. begin
  3.   {$IfDef cpui386}
  4.   asm
  5.     movl dstpp,%edi
  6.     movl len,%ecx
  7.     dec %ecx
  8.     cld
  9.     movb c,%al
  10.     rep stosb
  11.   end;
  12.   {$Else}
  13.   asm
  14.     movq dstpp,%rdi
  15.     movq len, %rcx;    // not ecx
  16.     dec %ecx
  17.     cld
  18.     movb c,%al
  19.     rep stosb
  20.   end;
  21.   {$EndIf}
  22. end;    
« Last Edit: September 20, 2021, 07:03:43 am by Seenkao »
Rus: Стремлюсь к созданию минимальных и достаточно быстрых приложений.

Eng: I strive to create applications that are minimal and reasonably fast.
Working on ZenGL

Seenkao

  • Hero Member
  • *****
  • Posts: 545
    • New ZenGL.
Re: Zeroing the array.
« Reply #21 on: September 18, 2021, 10:36:17 pm »
Rus: Забыл, я отключил все рандомные значения в программе предложенной Warfley. При включении случайных чисел, очень сильно падает разница, видимо рандом очень сильную нагрузку даёт. Но всё равно процедура на чистом ассемблере быстрее для Windows.
Eng: I forgot, I disabled all random values in the program proposed by Waffle. When you turn on random numbers, the difference drops very much, apparently random gives a very strong load. But still the procedure in pure assembler is faster for Windows.
Rus: Стремлюсь к созданию минимальных и достаточно быстрых приложений.

Eng: I strive to create applications that are minimal and reasonably fast.
Working on ZenGL

Seenkao

  • Hero Member
  • *****
  • Posts: 545
    • New ZenGL.
Re: Zeroing the array.
« Reply #22 on: September 20, 2021, 07:25:22 am »
Code: Pascal  [Select][+][-]
  1. program project1;
  2.  
  3. {$mode objfpc}{$H+}
  4. {$LONGSTRINGS ON}
  5. {$APPTYPE CONSOLE}
  6.  
  7. uses
  8.   SysUtils, {$IfDef UNIX}UnixType{$EndIf} {$IfDef Windows} Windows{$EndIf};
  9. {$IfDef WINDOWS}
  10. const LIBC = 'msvcrt';
  11. {$IFDEF FPC}
  12.   {$LINKLIB libmsvcrt.a}        // Taken from ZenGL. The program on Windows did not start without them.
  13. {$ENDIF}
  14. {$Else}
  15. const LIBC = 'libc';
  16. {$ENDIF WINDOWS}
  17.  
  18. var
  19.   timeStart, timeEnd, timerStart: Double;
  20.   {$IfDef UNIX}{$IfNDef MAC_COCOA}
  21.   timerTimeVal: TimeVal;
  22.   {$Else}
  23.   timerTimebaseInfo: mach_timebase_info_t;
  24.   {$ENDIF}
  25.   {$ENDIF}
  26.   {$IFDEF WINDOWS}
  27.   timerFrequency: Int64;
  28.   {$ENDIF}
  29.   timeOld, timeNew: Double;
  30.  
  31.   {$IfDef UNIX}{$IfNDef MAC_COCOA}
  32.   function fpGetTimeOfDay(val: PTimeVal; tzp: Pointer): Integer; cdecl; external 'libc' name 'gettimeofday';
  33.   {$Else}
  34.   type
  35.     mach_timebase_info_t = record
  36.       numer: LongWord;
  37.       denom: LongWord;
  38.     end;
  39.  
  40.     function mach_timebase_info(var info: mach_timebase_info_t): Integer; cdecl; external 'libc';
  41.     function mach_absolute_time: QWORD; cdecl; external 'libc';
  42.   {$ENDIF}{$EndIf}
  43.  
  44. // LibC memset linking
  45. function libc_memset(destpp: pointer; c: Integer; len: SizeUInt): Pointer; external LIBC name 'memset';
  46.  
  47. // запрос таймера, оставляю для всех систем
  48. function timer_GetTicks: Double;
  49. {$IFDEF WINDOWS}
  50. var
  51.   t: int64;
  52.   m: LongWord;
  53. {$EndIf}
  54. begin
  55.   {$IfDef UNIX}{$IfNDef MAC_COCOA}
  56.   fpGetTimeOfDay(@timerTimeVal, nil);
  57.   {$Q-}
  58.   // FIXME: почему-то overflow вылетает с флагом -Co
  59.   Result := timerTimeVal.tv_sec * 1000 + timerTimeVal.tv_usec / 1000 - timerStart;
  60.   {$Q+}
  61. {$Else}
  62.   Result := mach_absolute_time() * timerTimebaseInfo.numer / timerTimebaseInfo.denom / 1000000 - timerStart;
  63. {$ENDIF}{$EndIf}
  64. {$IFDEF WINDOWS}
  65.   m := SetThreadAffinityMask(GetCurrentThread(), 1);
  66.   QueryPerformanceCounter(t);
  67.   Result := 1000 * t / timerFrequency - timerStart;
  68.   SetThreadAffinityMask(GetCurrentThread(), m);
  69. {$ENDIF}
  70. end;
  71.  
  72. procedure memset(dstpp: Pointer; c: byte; len: SizeInt);   // replaced, only for i386/x86_64 processors!!!
  73. begin
  74.   {$IfDef cpui386}
  75.   asm
  76.     movl dstpp,%edi
  77.     movl len,%ecx
  78.     dec %ecx
  79.     cld
  80.     movb c,%al
  81.     rep stosb
  82.   end;
  83.   {$Else}
  84.   asm
  85.     movq dstpp,%rdi
  86.     movq len,%rcx
  87.     dec %ecx
  88.     cld
  89.     movb c,%al
  90.     rep stosb
  91.   end;
  92.   {$EndIf}
  93. end;
  94.  
  95. procedure TestFillChar(Buff: PByte; Size: SizeInt; Iterations: Integer);
  96. var
  97.   i: Integer;
  98.   sidx, eidx: Int64;
  99. begin
  100.   for i:=1 to Iterations do
  101.   begin
  102.     sidx := 33; // Random(64);
  103.     eidx := size - 1;// - Random(64);
  104.     FillChar(Buff[sidx], eidx - sidx, {Random(256)}67);
  105.   end;
  106. end;
  107.  
  108. procedure TestMemset(Buff: PByte; Size: SizeInt; Iterations: Integer);
  109. var
  110.   i: Integer;
  111.   sidx, eidx: Int64;
  112. begin
  113.   for i:=1 to Iterations do
  114.   begin
  115.     sidx := 33; // Random(64);
  116.     eidx := size - 1;// - Random(64);
  117.     memset(@Buff[sidx], {Random(256)}67, eidx - sidx);
  118.   end;
  119. end;
  120.  
  121. procedure TestCMemset(Buff: PByte; Size: SizeInt; Iterations: Integer);
  122. var
  123.   i: Integer;
  124.   sidx, eidx: Int64;
  125. begin
  126.   for i:=1 to Iterations do
  127.   begin
  128.     sidx := 33; // Random(64);
  129.     eidx := size - 1;// - Random(64);
  130.     libc_memset(@Buff[sidx], {Random(256)}67, eidx - sidx);
  131.   end;
  132. end;
  133.  
  134. const
  135.   bsize = 1024*32;
  136.   iters = 1000000;
  137. var
  138.   start: Double;
  139.   buff: array[0..bsize-1] of byte;
  140.   i: Integer;
  141.  
  142. begin
  143.   {$IFDEF WINDOWS}
  144.   SetThreadAffinityMask(GetCurrentThread(), 1);
  145.   QueryPerformanceFrequency(timerFrequency);
  146.   {$ENDIF}
  147.   timerStart := timer_GetTicks();
  148.  
  149.   Randomize;
  150.   timerStart := timer_GetTicks();
  151.   timeOld := 0;
  152.   timeNew := 0;
  153.  
  154.   start := timer_GetTicks;
  155.   TestFillChar(@buff, bsize, iters);
  156.   WriteLn('FillChar: ', timer_GetTicks - start);
  157.   start := timer_GetTicks;
  158.   TestMemset(@buff, bsize, iters);
  159.   WriteLn('Memset:  ', timer_GetTicks - start);
  160.   start := timer_GetTicks;
  161.   TestCMemset(@buff, bsize, iters);
  162.   WriteLn('LibC Memset:  ', timer_GetTicks - start);
  163.   ReadLn;
  164. end.

Tests Linux:
1388
1020(1020.9)
1020(1020.5)

Tests Windows:
1425
174.9 !!!!  ???  I checked, the data is being transferred.
1394.9

Intel(R) Core(TM) i5-3230M CPU @ 2.60 2.60
« Last Edit: September 20, 2021, 07:38:13 am by Seenkao »
Rus: Стремлюсь к созданию минимальных и достаточно быстрых приложений.

Eng: I strive to create applications that are minimal and reasonably fast.
Working on ZenGL

bytebites

  • Hero Member
  • *****
  • Posts: 624
Re: Zeroing the array.
« Reply #23 on: September 20, 2021, 08:04:17 am »
Linux
FillChar:  9.1158398437500000E+002
Memset:   2.9922192382812500E+002
LibC Memset:   2.3490307617187500E+002

Ryzen 3900X

Zvoni

  • Hero Member
  • *****
  • Posts: 2300
Re: Zeroing the array.
« Reply #24 on: September 20, 2021, 08:44:03 am »
Win10-Pro 64-Bit
FPC3.2.0/LAZ2.0.12 - 32-Bit
FillChar:  4.7453450012207031E+002
Memset:   4.8254630017280579E+002
LibC Memset:   3.5183000564575195E+000

i5-8365U @ 1.6GHZ - 16GB RAM
One System to rule them all, One Code to find them,
One IDE to bring them all, and to the Framework bind them,
in the Land of Redmond, where the Windows lie
---------------------------------------------------------------------
Code is like a joke: If you have to explain it, it's bad

Seenkao

  • Hero Member
  • *****
  • Posts: 545
    • New ZenGL.
Re: Zeroing the array.
« Reply #25 on: September 20, 2021, 11:23:22 pm »
FPC3.2.0/LAZ2.0.12 - 32-Bit
LibC Memset:   3.5183000564575195E+000
Build a 64-bit version for the test. Or find the msvcrt 32-bit version.

AMD Athlon II X2 240 (Linux)
FillChar:  7.7355908203125000E+002
Memset:   7.7813085937500000E+002 (assembler!!!)
LibC Memset:   1.0655258789062500E+003

As you can see, there is no absolute solution for different machines.
Rus: Стремлюсь к созданию минимальных и достаточно быстрых приложений.

Eng: I strive to create applications that are minimal and reasonably fast.
Working on ZenGL

 

TinyPortal © 2005-2018