Recent

Author Topic: Zeroing the array.  (Read 5831 times)

Seenkao

  • Hero Member
  • *****
  • Posts: 545
    • New ZenGL.
Zeroing the array.
« on: September 16, 2021, 10:04:39 am »
Всем привет!
Rus: Допустим, мы имеем массив из Byte (n: array[1..20] of Byte). Когда я хочу его обнулить
Eng (yandex translate): Let's say we have an array of Byte (n: array[1..20] of Byte). When I want to reset it
Code: Pascal  [Select][+][-]
  1. for i := 1 to 20 do
  2.   n[i] := 0;
Rus: мы получаем (ассемблер x86):
Eng: we get (x86 assembler):
Code: Text  [Select][+][-]
  1. movb $0x0,(%rsp)
  2. movb $0x0,0x1(%rsp)
  3. movb $0x0,0x2(%rsp)
  4. ...
  5. movb $0x0,0x13(%rsp)

Rus: понимаем, что в 32-х/64-х битных системах, это не очень хороший подход.
  Добавляю в код переменную-указатель (Pn: PLongWord), меняю код в паскале
Eng: we understand that in 32-bit / 64-bit systems, this is not a very good approach.
  I add a pointer variable (Pn: PLongWord) to the code, I change the code in pascal
Code: Pascal  [Select][+][-]
  1. Pn := @n;
  2. for i := 0 to 4 do
  3.   Pn[i] := 0;
Rus: конечный код в ассемблере
Eng: the final code in the assembler
Code: Text  [Select][+][-]
  1. mov %rsp,%rax
  2. movl $0x0,(%rax)
  3. movl $0x0,0x4(%rax)
  4. movl $0x0,0x8(%rax)
  5. movl $0x0,0xc(%rax)
  6. movl $0x0,0x10(%rax)

Rus: Есть ли способы более быстрого очищения массива? Есть ли способы, которые автоматизированы в паскале и не надо самому писать дополнительный код, для ускорения работы программы? С какими подводными камнями мы можем встретиться кроме того, что при неосторожности вылезем за пределы массива и очистим какую-то рабочую область?
Eng: Are there any ways to clear the array faster? Are there any ways that are automated in pascal and you don't need to write additional code yourself to speed up the program? What pitfalls can we encounter besides the fact that, if we are careless, we will get out of the array and clear some working area?
Rus: Стремлюсь к созданию минимальных и достаточно быстрых приложений.

Eng: I strive to create applications that are minimal and reasonably fast.
Working on ZenGL

Zvoni

  • Hero Member
  • *****
  • Posts: 2300
« Last Edit: September 16, 2021, 10:11:06 am by Zvoni »
One System to rule them all, One Code to find them,
One IDE to bring them all, and to the Framework bind them,
in the Land of Redmond, where the Windows lie
---------------------------------------------------------------------
Code is like a joke: If you have to explain it, it's bad

Seenkao

  • Hero Member
  • *****
  • Posts: 545
    • New ZenGL.
Re: Zeroing the array.
« Reply #2 on: September 16, 2021, 10:28:17 am »
Rus: Нет, функция FillByte работает дольше даже простой очистки массива байтов. А установку по умолчанию, мы не можем всегда использовать, есть вызываемые функции, которые так же будут изначально очищать эти массивы, что так же будет дольше.
Eng: No, the FillByte function works longer than even a simple cleaning of an array of bytes. And we can't always use the default setting, there are called functions that will also initially clear these arrays, which will also take longer.

P.S. чем больше я занимаюсь паскалем, тем больше понимаю, что достаточно много не совершенно даже в исполняемом коде...
P.S. the more I study pascal, the more I understand that quite a lot is not perfect even in executable code...
Rus: Стремлюсь к созданию минимальных и достаточно быстрых приложений.

Eng: I strive to create applications that are minimal and reasonably fast.
Working on ZenGL

alpine

  • Hero Member
  • *****
  • Posts: 1032
Re: Zeroing the array.
« Reply #3 on: September 16, 2021, 12:00:26 pm »
Всем привет!
Rus: Допустим, мы имеем массив из Byte (n: array[1..20] of Byte). Когда я хочу его обнулить
Eng (yandex translate): Let's say we have an array of Byte (n: array[1..20] of Byte). When I want to reset it
Code: Pascal  [Select][+][-]
  1. for i := 1 to 20 do
  2.   n[i] := 0;
Rus: мы получаем (ассемблер x86):
Eng: we get (x86 assembler):
Code: Text  [Select][+][-]
  1. movb $0x0,(%rsp)
  2. movb $0x0,0x1(%rsp)
  3. movb $0x0,0x2(%rsp)
  4. ...
  5. movb $0x0,0x13(%rsp)

Rus: понимаем, что в 32-х/64-х битных системах, это не очень хороший подход.
  Добавляю в код переменную-указатель (Pn: PLongWord), меняю код в паскале
Eng: we understand that in 32-bit / 64-bit systems, this is not a very good approach.

In the particular case, that is the result from the loop unrolling optimization. If you use larger array (perhaps >32) it will not be unrolled. Use -OoNOLOOPUNROLL option to disable unrolling if you don't like the resulting code for some reason.

  I add a pointer variable (Pn: PLongWord) to the code, I change the code in pascal
Code: Pascal  [Select][+][-]
  1. Pn := @n;
  2. for i := 0 to 4 do
  3.   Pn[i] := 0;
Rus: конечный код в ассемблере
Eng: the final code in the assembler
Code: Text  [Select][+][-]
  1. mov %rsp,%rax
  2. movl $0x0,(%rax)
  3. movl $0x0,0x4(%rax)
  4. movl $0x0,0x8(%rax)
  5. movl $0x0,0xc(%rax)
  6. movl $0x0,0x10(%rax)

Rus: Есть ли способы более быстрого очищения массива? Есть ли способы, которые автоматизированы в паскале и не надо самому писать дополнительный код, для ускорения работы программы? С какими подводными камнями мы можем встретиться кроме того, что при неосторожности вылезем за пределы массива и очистим какую-то рабочую область?
Eng: Are there any ways to clear the array faster? Are there any ways that are automated in pascal and you don't need to write additional code yourself to speed up the program? What pitfalls can we encounter besides the fact that, if we are careless, we will get out of the array and clear some working area?

You can always write your own inline assembly procedure FillByte() to exploit the longword assembly instructions, e.g. REP STOSW/D/Q.
"I'm sorry Dave, I'm afraid I can't do that."
—HAL 9000

marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 11351
  • FPC developer.
Re: Zeroing the array.
« Reply #4 on: September 16, 2021, 12:11:02 pm »
Some notes:

  • I don't see any numbers, just a blanket statement that fillbyte/dwordis no good. It is possible that for low bytecounts it just does this.
  • In the generate case this is asking for a  vectorizing compiler. An loop with array operation in single elements is implemented as a loop using an instruction that does multiple elements (4) at once.
  • the proposed solution does this vectorizing without any alignment checks, the badness of this is CPU dependent. As inline assembler it might also affect spilling if it is in a larger loop (?) 
  • On very modern (10th core and up) rep stosb has been accelerated, and it might be that fillbyte is faster then
  • Two ways to work around this
    • is some peephole that detects this filling construct, but that is a lot of work
    • An intrinsic maybe that for short constant fills inlines code, and otherwise calls fillbyte?

Warfley

  • Hero Member
  • *****
  • Posts: 1499
Re: Zeroing the array.
« Reply #5 on: September 16, 2021, 02:10:16 pm »
Sadly the FPC is simply not very good at optimizing. If you compare it with the libc memset function, it is much much slower:

Even when reimplementing the LIBC code line by line it  is much  slower:
Code: Pascal  [Select][+][-]
  1. program Project1;
  2.  
  3. {$mode objfpc}{$H+}
  4.  
  5. uses
  6.   SysUtils, cmem;
  7.  
  8. {$IfDef WINDOWS}
  9. const LIBC = 'msvcrt';
  10. {$Else}    
  11. const LIBC = 'c';
  12. {$ENDIF WINDOWS}
  13.  
  14. // LibC memset linking
  15. function libc_memset(destpp: pointer; c: Integer; len: SizeUInt): Pointer; external LIBC name 'memset';
  16.  
  17. // exact reimplementation of GLIBC memset function: https://github.com/lattera/glibc/blob/master/string/memset.c
  18. procedure memset(dstpp: Pointer; c: Integer; len: SizeInt);
  19. type
  20.   TOP = SizeUInt;
  21.   POP = ^TOP;
  22. const
  23.   OPSIZE = sizeof(TOP);
  24. var
  25.   dstp, xlen: SizeInt;
  26.   cccc: TOP;
  27. begin
  28.   dstp := SizeInt(dstpp);
  29.   if len >= 8 then
  30.   begin
  31.     cccc := Byte(c);
  32.     cccc := cccc or (cccc shl 8);
  33.     cccc := cccc or (cccc shl 16);
  34.     if OPSIZE > 4 then
  35.       cccc := cccc or ((cccc shl 16) shl 16);
  36.     while dstp mod OPSIZE <> 0 do
  37.     begin
  38.       PByte(dstp)[0] := c;
  39.       Inc(dstp);
  40.       Dec(len);
  41.     end;
  42.  
  43.     xlen := len div (OPSIZE * 8);
  44.     while xlen > 0 do
  45.     begin
  46.       POP(dstp)[0] := cccc;
  47.       POP(dstp)[1] := cccc;
  48.       POP(dstp)[2] := cccc;
  49.       POP(dstp)[3] := cccc;
  50.       POP(dstp)[4] := cccc;
  51.       POP(dstp)[5] := cccc;
  52.       POP(dstp)[6] := cccc;
  53.       POP(dstp)[7] := cccc;
  54.       Inc(dstp, OPSIZE * 8);
  55.       Dec(xlen);
  56.     end;
  57.     len := len mod (OPSIZE * 8);
  58.  
  59.     xlen := len div OPSIZE;
  60.     while xlen > 0 do
  61.     begin
  62.       POP(dstp)[0] := cccc;
  63.       Inc(dstp, OPSIZE);
  64.       Dec(xlen);
  65.     end;
  66.     len := len mod OPSIZE;
  67.   end;
  68.     len := len mod (OPSIZE * 8);
  69.   while len > 0 do
  70.   begin
  71.     PByte(dstp)[0] := c;
  72.     Inc(dstp);
  73.     Dec(len);
  74.   end;
  75. end;
  76.  
  77. procedure TestFillChar(Buff: PByte; Size: SizeInt; Iterations: Integer);
  78. var
  79.   i: Integer;
  80.   sidx, eidx: Int64;
  81. begin
  82.   for i:=1 to Iterations do
  83.   begin
  84.     sidx := Random(64);
  85.     eidx := size - 1 - Random(64);
  86.     FillChar(Buff[sidx], eidx - sidx, Random(256));
  87.   end;
  88. end;
  89.  
  90. procedure TestMemset(Buff: PByte; Size: SizeInt; Iterations: Integer);
  91. var
  92.   i: Integer;
  93.   sidx, eidx: Int64;
  94. begin
  95.   for i:=1 to Iterations do
  96.   begin
  97.     sidx := Random(64);
  98.     eidx := size - 1 - Random(64);
  99.     memset(@Buff[sidx], Random(256), eidx - sidx);
  100.   end;
  101. end;
  102.  
  103. procedure TestCMemset(Buff: PByte; Size: SizeInt; Iterations: Integer);
  104. var
  105.   i: Integer;
  106.   sidx, eidx: Int64;
  107. begin
  108.   for i:=1 to Iterations do
  109.   begin
  110.     sidx := Random(64);
  111.     eidx := size - 1 - Random(64);
  112.     libc_memset(@Buff[sidx], Random(256), eidx - sidx);
  113.   end;
  114. end;
  115.  
  116. const
  117.   bsize = 1024*32;
  118.   iters = 1000000;
  119. var
  120.   start: QWord;
  121.   buff: array[0..bsize-1] of byte;
  122. begin
  123.   Randomize;
  124.   start := GetTickCount64;
  125.   TestFillChar(@buff, bsize, iters);
  126.   WriteLn('FillChar: ', GetTickCount64 - start);
  127.   start := GetTickCount64;
  128.   TestMemset(@buff, bsize, iters);
  129.   WriteLn('Memset:  ', GetTickCount64 - start);
  130.   start := GetTickCount64;
  131.   TestCMemset(@buff, bsize, iters);
  132.   WriteLn('LibC Memset:  ', GetTickCount64 - start);
  133.   ReadLn;
  134. end.
  135.  
Results (in ms):
Quote
FillChar: 1156
Memset:  1188
LibC Memset:  594

marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 11351
  • FPC developer.
Re: Zeroing the array.
« Reply #6 on: September 16, 2021, 02:13:23 pm »
Sadly the FPC is simply not very good at optimizing. If you compare it with the libc memset function, it is much much slower:

libc or msvcrt ?  And does whatever you choose implement libc_s like behaviour, iow selecting functions in a processor specific manner?

Warfley

  • Hero Member
  • *****
  • Posts: 1499
Re: Zeroing the array.
« Reply #7 on: September 16, 2021, 03:29:19 pm »
libc or msvcrt ?
Both. The results are pretty much identical on windows with msvcrt and on my 64 bit Arch Linux with the GLIBC (just generally a little slower I guess because I am running it in a VM, but the order of magnitude and differences are the same)
Sadly the FPC is simply not very good at optimizing. If you compare it with the libc memset function, it is much much slower:
And does whatever you choose implement libc_s like behaviour, iow selecting functions in a processor specific manner?
I don't think so. AFAIK the glibc used by Arch is just a binary package compiled for x86_64. But to verify this I would need to compile it myself, but as this would take at least 5 hours in a VM I would need to try this on my native installation with all my CPU cores available (and even then it would be at least 2 hours) and I can't try this right now.
But AFAIK the glibc memset code is literally just the C code (I translated to pascal in my example) compiled with a high degree of optimization.

Something that might also be interesting with this regard is that, as the FPC has an LLVM backend, this could provide a great benefit in this regard. LLVM has a memset intrinsic that will be compiled to the most efficient code (either portably or processor specific, depending on the compiling flags). And the best thing is, you don't need to implement anything fancy, LLVMs optimization passes automatically recognize memset loops (as long as  they aren't too complicated) and replaces them with the intrinsic

ASerge

  • Hero Member
  • *****
  • Posts: 2212
Re: Zeroing the array.
« Reply #8 on: September 16, 2021, 06:22:17 pm »
Results (in ms):
Quote
FillChar: 1156
Memset:  1188
LibC Memset:  594
Windows 7 x64, FPC 3.2.2 x64, -O1
Code: Text  [Select][+][-]
  1. FillChar: 1560
  2. Memset:  3667
  3. LibC Memset:  1575
Same with -O3
Code: Text  [Select][+][-]
  1. FillChar: 1575
  2. Memset:  1716
  3. LibC Memset:  1607

A lot depends on the random number generator, but at the 3rd level of optimization, all three functions are comparable in speed.

Warfley

  • Hero Member
  • *****
  • Posts: 1499
Re: Zeroing the array.
« Reply #9 on: September 16, 2021, 06:43:53 pm »
My results where all O3. But it's interesting that on your machine the libc version is an orders of magnitude slower than on my machine.
Also my results where with no debug information and running without a debugger.
« Last Edit: September 16, 2021, 06:50:03 pm by Warfley »

marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 11351
  • FPC developer.
Re: Zeroing the array.
« Reply #10 on: September 17, 2021, 12:45:25 pm »
I missed 64-bit in my first comment. IIRC there is some bugreport about the 64-bit primitives, though I wouldn't know how to find it in gitlab :-)

I tried to compile the test using 32-bit, and then fillchar is better. libc doesn't work:

Quote from: 32bit
FillChar: 844
Memset:  2500

Quote from: 64bit
FillChar: 1218
Memset:  1250
LibC Memset:  735

So then fillchar and memset are close. It is indeed the 64-bit fill that is lacking.

Warfley

  • Hero Member
  • *****
  • Posts: 1499
Re: Zeroing the array.
« Reply #11 on: September 17, 2021, 03:15:30 pm »
I tried to compile the test using 32-bit, and then fillchar is better. libc doesn't work:
Thats because I'm an idiot and forgot the cdecl calling convention for the memset (and on 64 bit where everything fit the registers this didn't matter), here is the working version:
Code: Pascal  [Select][+][-]
  1. program Project1;
  2.  
  3. {$mode objfpc}{$H+}
  4.  
  5. uses
  6.   SysUtils;
  7.  
  8. {$IfDef WINDOWS}
  9. const LIBC = 'msvcrt';
  10. {$Else}    
  11. const LIBC = 'c';
  12. {$ENDIF WINDOWS}
  13.  
  14. // LibC memset linking
  15. function libc_memset(destpp: pointer; c: Integer; len: SizeUInt): Pointer; cdecl; external LIBC name 'memset';
  16.  
  17. // exact reimplementation of GLIBC memset function: https://github.com/lattera/glibc/blob/master/string/memset.c
  18. procedure memset(dstpp: Pointer; c: Integer; len: SizeInt);
  19. type
  20.   TOP = SizeUInt;
  21.   POP = ^TOP;
  22. const
  23.   OPSIZE = sizeof(TOP);
  24. var
  25.   dstp, xlen: SizeInt;
  26.   cccc: TOP;
  27. begin
  28.   dstp := SizeInt(dstpp);
  29.   if len >= 8 then
  30.   begin
  31.     cccc := Byte(c);
  32.     cccc := cccc or (cccc shl 8);
  33.     cccc := cccc or (cccc shl 16);
  34.     if OPSIZE > 4 then
  35.       cccc := cccc or ((cccc shl 16) shl 16);
  36.     while dstp mod OPSIZE <> 0 do
  37.     begin
  38.       PByte(dstp)[0] := c;
  39.       Inc(dstp);
  40.       Dec(len);
  41.     end;
  42.  
  43.     xlen := len div (OPSIZE * 8);
  44.     while xlen > 0 do
  45.     begin
  46.       POP(dstp)[0] := cccc;
  47.       POP(dstp)[1] := cccc;
  48.       POP(dstp)[2] := cccc;
  49.       POP(dstp)[3] := cccc;
  50.       POP(dstp)[4] := cccc;
  51.       POP(dstp)[5] := cccc;
  52.       POP(dstp)[6] := cccc;
  53.       POP(dstp)[7] := cccc;
  54.       Inc(dstp, OPSIZE * 8);
  55.       Dec(xlen);
  56.     end;
  57.     len := len mod (OPSIZE * 8);
  58.  
  59.     xlen := len div OPSIZE;
  60.     while xlen > 0 do
  61.     begin
  62.       POP(dstp)[0] := cccc;
  63.       Inc(dstp, OPSIZE);
  64.       Dec(xlen);
  65.     end;
  66.     len := len mod OPSIZE;
  67.   end;
  68.     len := len mod (OPSIZE * 8);
  69.   while len > 0 do
  70.   begin
  71.     PByte(dstp)[0] := c;
  72.     Inc(dstp);
  73.     Dec(len);
  74.   end;
  75. end;
  76.  
  77. procedure TestFillChar(Buff: PByte; Size: SizeInt; Iterations: Integer);
  78. var
  79.   i: Integer;
  80.   sidx, eidx: Int64;
  81. begin
  82.   for i:=1 to Iterations do
  83.   begin
  84.     sidx := Random(64);
  85.     eidx := size - 1 - Random(64);
  86.     FillChar(Buff[sidx], eidx - sidx, Random(256));
  87.   end;
  88. end;
  89.  
  90. procedure TestMemset(Buff: PByte; Size: SizeInt; Iterations: Integer);
  91. var
  92.   i: Integer;
  93.   sidx, eidx: Int64;
  94. begin
  95.   for i:=1 to Iterations do
  96.   begin
  97.     sidx := Random(64);
  98.     eidx := size - 1 - Random(64);
  99.     memset(@Buff[sidx], Random(256), eidx - sidx);
  100.   end;
  101. end;
  102.  
  103. procedure TestCMemset(Buff: PByte; Size: SizeInt; Iterations: Integer);
  104. var
  105.   i: Integer;
  106.   sidx, eidx: Int64;
  107. begin
  108.   for i:=1 to Iterations do
  109.   begin
  110.     sidx := Random(64);
  111.     eidx := size - 1 - Random(64);
  112.     libc_memset(@Buff[sidx], Random(256), eidx - sidx);
  113.   end;
  114. end;
  115.  
  116. const
  117.   bsize = 1024*32;
  118.   iters = 1000000;
  119. var
  120.   start: QWord;
  121.   buff: array[0..bsize-1] of byte;
  122. begin
  123.   Randomize;
  124.   start := GetTickCount64;
  125.   TestFillChar(@buff, bsize, iters);
  126.   WriteLn('FillChar: ', GetTickCount64 - start);
  127.   start := GetTickCount64;
  128.   TestMemset(@buff, bsize, iters);
  129.   WriteLn('Memset:  ', GetTickCount64 - start);
  130.   start := GetTickCount64;
  131.   TestCMemset(@buff, bsize, iters);
  132.   WriteLn('LibC Memset:  ', GetTickCount64 - start);
  133.   ReadLn;
  134. end.
  135.  

My results for 32 bit windows:
Quote
FillChar: 1079
Memset:  2281
LibC Memset:  672
I can't test on linux because I don't have a 32bit libc on my Linux.

To me this result suggests that on 32 bit the optimization isn't as good as on 64 bit, as  memset is much slower. I've seen that FillChar in the RTL is already implemented in assembly (probably for optimization reasons), so it looks to me as if  the compiled memset on 64 bit is much closer to such an assembly implementation than the 32 bit one is

marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 11351
  • FPC developer.
Re: Zeroing the array.
« Reply #12 on: September 17, 2021, 03:37:04 pm »
My results for 32 bit windows:
Quote
FillChar: 1079
Memset:  2281
LibC Memset:  672

i7-3770:

Quote from: 32bit
FillChar: 828
Memset:  2469
LibC Memset:  797

A what more modern CPU, a  Ryzen 7 4800H:

Quote from: 32bit
FillChar: 484
Memset:  1953
LibC Memset:  281

Suddenly a factor 2, so I suspect msvcrt doing CPU dependent memset. 
« Last Edit: September 17, 2021, 03:43:18 pm by marcov »

MathMan

  • Sr. Member
  • ****
  • Posts: 325
Re: Zeroing the array.
« Reply #13 on: September 17, 2021, 03:43:54 pm »
@Warfly - you are aware that your measurement may be disproportional due to the numerous calls to 'Random()'? 'Random()' is quite time consuming, being based on Mersenne Twister.

I'd rather prepare an array holding the parameter values and pick from that on each call to the respective variants of memory setting.

The differences between the three variants will come out more pronounced I'd guess.

Cheers,
MathMan

olly

  • New Member
  • *
  • Posts: 42
Re: Zeroing the array.
« Reply #14 on: September 17, 2021, 03:52:55 pm »
I guess it averages out, but with each iteration of each test calling Random(something) for start/end + value it can't be ideal for such a test. Also a different seed each time you run.

I redeclared Random to just return a static value:

Code: Pascal  [Select][+][-]
  1. function Random(i: Integer): Integer;
  2. begin
  3.   Result := 50;
  4. end;

i7 7600k.
(Would RAM speed affect this?)

Code: [Select]
FillChar: 1016
Memset:  1062
LibC Memset:  532
« Last Edit: September 17, 2021, 03:57:20 pm by olly »

 

TinyPortal © 2005-2018