Author Topic: Why CMem?  (Read 2329 times)

k1ng

  • New Member
  • *
  • Posts: 36
Re: Why CMem?
« Reply #15 on: August 17, 2019, 10:53:00 am »
There are some benchmarks in mORMot source code:
Code:
  Some raw numbers, from TestSQL3 string allocation tests (single threaded):
    - FPC default heap
     500000 interning 8 KB in 77.34ms i.e. 6,464,959/s, aver. 0us, 98.6 MB/s
     500000 direct 7.6 MB in 100.73ms i.e. 4,963,518/s, aver. 0us, 75.7 MB/s
    - glibc 2.23
     500000 interning 8 KB in 76.06ms i.e. 6,573,152/s, aver. 0us, 100.2 MB/s
     500000 direct 7.6 MB in 36.64ms i.e. 13,645,915/s, aver. 0us, 208.2 MB/s
    - jemalloc 3.6
     500000 interning 8 KB in 78.60ms i.e. 6,361,323/s, aver. 0us, 97 MB/s
     500000 direct 7.6 MB in 58.08ms i.e. 8,608,667/s, aver. 0us, 131.3 MB/s
    - Intel TBB 4.4
     500000 interning 8 KB in 61.96ms i.e. 8,068,810/s, aver. 0us, 123.1 MB/s
     500000 direct 7.6 MB in 36.46ms i.e. 13,711,402/s, aver. 0us, 209.2 MB/s
    for multi-threaded process, we observed best scaling with TBB on this system
    BUT memory consumption rose to 60x more space (glibc=2.6GB vs TBB=170GB)!
    -> so for serious server work, glibc (FPC_SYNCMEM) sounds the best candidate
Unfortunately, they didn't publish the results of the multi-threaded tests, afaik.

Note that there is also a fork of FastMM (FastMM4-AVX) which seems to do pretty well with multi-threading.

Delphi 10.2 Tokyo:
Code:
                     Xeon E5-2543v2 2*CPU      i7-7700K CPU
                    (allocated 20 logical   (8 logical threads,
                     threads, 10 physical    4 physical cores),
                     cores, NUMA), AVX-1          AVX-2

                    Orig.  AVX-br.  Ratio   Orig.  AVX-br. Ratio
                    ------  -----  ------   -----  -----  ------
02-threads realloc   96552  59951  62.09%   65213  49471  75.86%
04-threads realloc   97998  39494  40.30%   64402  47714  74.09%
08-threads realloc   98325  33743  34.32%   64796  58754  90.68%
16-threads realloc  116273  45161  38.84%   70722  60293  85.25%
31-threads realloc  122528  53616  43.76%   70939  62962  88.76%
64-threads realloc  137661  54330  39.47%   73696  64824  87.96%
NexusDB 02 threads  122846  90380  73.72%   79479  66153  83.23%
NexusDB 04 threads  122131  53103  43.77%   69183  43001  62.16%
NexusDB 08 threads  124419  40914  32.88%   64977  33609  51.72%
NexusDB 12 threads  181239  55818  30.80%   83983  44658  53.18%
NexusDB 16 threads  135211  62044  43.61%   59917  32463  54.18%
NexusDB 31 threads  134815  48132  33.46%   54686  31184  57.02%
NexusDB 64 threads  187094  57672  30.25%   63089  41955  66.50%

Delphi 10.2 Update 3 (note that it uses different CPUs as well)
Code:
                     Xeon E5-2667v4 2*CPU       i9-7900X CPU
                    (allocated 32 logical   (20 logical threads,
                     threads, 16 physical    10 physical cores),
                     cores, NUMA), AVX-2          AVX-512

                    Orig.  AVX-br.  Ratio   Orig.  AVX-br. Ratio
                    ------  -----  ------   -----  -----  ------
02-threads realloc   80544  60025  74.52%   66100  55854  84.50%
04-threads realloc   80751  47743  59.12%   64772  40213  62.08%
08-threads realloc   82645  32691  39.56%   62246  27056  43.47%
12-threads realloc   89951  43270  48.10%   65456  25853  39.50%
16-threads realloc   95729  56571  59.10%   67513  27058  40.08%
31-threads realloc  109099  97290  89.18%   63180  28408  44.96%
64-threads realloc  118589 104230  87.89%   57974  28951  49.94%
NexusDB 01 thread   160100 121961  76.18%   93341  95807 102.64%
NexusDB 02 threads  115447  78339  67.86%   77034  70056  90.94%
NexusDB 04 threads  107851  49403  45.81%   73162  50039  68.39%
NexusDB 08 threads  111490  36675  32.90%   70672  42116  59.59%
NexusDB 12 threads  148148  46608  31.46%   92693  53900  58.15%
NexusDB 16 threads  111041  38461  34.64%   66549  37317  56.07%
NexusDB 31 threads  123496  44232  35.82%   62552  34150  54.60%
NexusDB 64 threads  179924  62414  34.69%   83914  42915  51.14%

julkas

  • Sr. Member
  • ****
  • Posts: 348
  • KISS principle / Lazarus 2.0.0 / FPC 3.0.4
Re: Why CMem?
« Reply #16 on: August 17, 2019, 11:06:56 am »
 Malloc intro by Dan Luu - https://danluu.com/malloc-tutorial/.
« Last Edit: August 17, 2019, 11:11:50 am by julkas »
procedure mulu64(a, b: QWORD; out clo, chi: QWORD); assembler;
asm
  mov rax, a
  mov rdx, b
  mul rdx
  mov [clo], rax
  mov [chi], rdx
end;
(* Pointer game *) Inc(ptr, 1); (* vs *) ptr := ptr + 1;

BrunoK

  • Full Member
  • ***
  • Posts: 174
  • Retired programmer
Re: Why CMem?
« Reply #17 on: August 17, 2019, 12:59:11 pm »
I find this article and ensuing discussion pretty interesting on the relative nature of memory allocator performance.
Lazarus trunk r. 59978/03.01.2019 (+/- patches regarding enabled, TScrollBar, TCursorImage). FPC 3.0.4 32 bits. (+heaptrc with leaked ClassName+Revisited TList) , Windows 10 Pro x64 (v. 1903)

julkas

  • Sr. Member
  • ****
  • Posts: 348
  • KISS principle / Lazarus 2.0.0 / FPC 3.0.4
Re: Why CMem?
« Reply #18 on: August 17, 2019, 01:25:16 pm »
Quote
I find this article and ensuing discussion pretty interesting on the relative nature of memory allocator performance.

+1.

BrunoK

  • Full Member
  • ***
  • Posts: 174
  • Retired programmer
Re: Why CMem?
« Reply #19 on: August 17, 2019, 02:27:15 pm »

marcov

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 7359
Re: Why CMem?
« Reply #20 on: August 17, 2019, 05:14:08 pm »
If you benchmarked on Linux, your results are already old because the heapmanager changed:

Quote
r42713 florian 2019-08-16 22:47:37 +0200 (Fri, 16 Aug 2019)
+ make use of the mremap syscall of linux to re-allocate large memory blocks faster
Commit consists out of 1 line

    M /trunk/rtl/inc/heap.inc
    M /trunk/rtl/linux/ossysc.inc
    M /trunk/rtl/unix/sysheap.inc
    A /trunk/tests/test/theap2.pp

Also, do such micro-benchmarks really mean anything? A memory manager that keeps large per-thread reserves of 8 KB blocks wins the 8 KB benchmark, but in practice it might only be wasting memory.
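To make the quoted commit a bit more concrete, here is a minimal C sketch of the underlying Linux primitive: resizing an anonymous mapping with mremap, so the kernel remaps pages instead of copying the data. It is not the code from heap.inc; the helper names big_alloc, big_realloc and big_free are made up for illustration, and the sketch assumes Linux with glibc.

```c
#define _GNU_SOURCE            /* exposes mremap() and MREMAP_MAYMOVE in glibc */
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/* Hypothetical helper: map an anonymous, private region of `size` bytes. */
static void *big_alloc(size_t size)
{
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/* Hypothetical helper: resize a region obtained from big_alloc().
   MREMAP_MAYMOVE lets the kernel move the mapping to a new address
   if it cannot be resized in place; the contents are preserved. */
static void *big_realloc(void *p, size_t old_size, size_t new_size)
{
    void *q = mremap(p, old_size, new_size, MREMAP_MAYMOVE);
    return q == MAP_FAILED ? NULL : q;
}

/* Hypothetical helper: unmap the region again. */
static void big_free(void *p, size_t size)
{
    munmap(p, size);
}
```

Note that the caller has to pass the old size to big_realloc, which is one more place where an allocator needs to know the size of a block.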

Thaddy

  • Hero Member
  • *****
  • Posts: 8680
Re: Why CMem?
« Reply #21 on: August 17, 2019, 05:20:55 pm »
Yes. Then again I ran into the problem that in FPC we have to maintain and store size, otherwise Lazarus crashes..... Which is silly.
That slows down our memory manager considerably. The size can on many platforms be obtained elsewhere:
- windows _memsize
- linux & bsd  malloc_usable_size

For pure pascal programs you can skip the size requirement in the memory manager.
Lazarus relies on something that should be an implementation detail.

That said, I have come to realize that a modern cmem is not like it was in 2005, more like 2015, and it indeed scales better than ours.
After a lot of testing today: yes, a modern cmem is faster in ALL cases I tested. Even with Florian's improvement.
« Last Edit: August 17, 2019, 05:26:49 pm by Thaddy »
Most people that want to use threading should learn to patch their jeans first: use a needle.

marcov

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 7359
Re: Why CMem?
« Reply #22 on: August 17, 2019, 05:33:45 pm »
Quote
Yes. Then again I ran into the problem that in FPC we have to maintain and store size, otherwise Lazarus crashes..... Which is silly.

It is a property of the language.

Quote
That slows down our memory manager considerably. The size can on many platforms be obtained elsewhere:
- windows _memsize
- linux & bsd  malloc_usable_size

No. malloc_usable_size is a property of the heapmanager (here: malloc), just like FPC's heapmanager has its own equivalent.

You are confusing base memory primitives (like mmap and realloc) with alternate heapmanagers. FPC's heapmanager is an alternative to malloc, so telling it to use malloc is not logical.

I can't find a URL for memsize. It seems to be an (msv)crt internal function, but FPC doesn't use the crt at all. That is the same mistake again: the FPC RTL is a replacement for the crt, not something on top of it.

Maybe there are better ways than storing the size in every memory block (like keeping the administration in unallocated memory). That could avoid some SSE alignment penalties for the actual blocks, but it could also cause an access performance hit when the size is queried relatively often by a language.
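The "store the size in every memory block" approach can be sketched in C as follows. This is only an illustration of the general technique, not the code of FPC's heapmanager or of the cmem unit; block_header, hdr_alloc, hdr_size and hdr_free are made-up names. The header is padded to 16 bytes on the assumption that malloc returns 16-byte-aligned blocks (as glibc does on x86-64), so the payload stays aligned at the cost of wasted bytes per block.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Header placed in front of every payload. The union pads it to 16 bytes
   so the payload keeps 16-byte alignment when malloc's result has it. */
typedef union {
    size_t        size;      /* requested payload size */
    unsigned char pad[16];   /* padding only, never read */
} block_header;

/* Allocate `size` payload bytes and remember the size in the header. */
static void *hdr_alloc(size_t size)
{
    if (size > SIZE_MAX - sizeof(block_header))
        return NULL;                          /* would overflow */
    block_header *h = malloc(sizeof(block_header) + size);
    if (h == NULL)
        return NULL;
    h->size = size;
    return h + 1;                             /* payload starts after header */
}

/* Size query: step back from the payload pointer to the header. */
static size_t hdr_size(const void *p)
{
    return ((const block_header *)p - 1)->size;
}

/* Release the block; the real allocation starts at the header. */
static void hdr_free(void *p)
{
    if (p != NULL)
        free((block_header *)p - 1);
}
```

The size lookup is a single load next to the payload, which is cheap, but every block pays for the header, and a header of only sizeof(size_t) bytes would shift the payload off a 16-byte boundary.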

Thaddy

  • Hero Member
  • *****
  • Posts: 8680
Re: Why CMem?
« Reply #23 on: August 17, 2019, 06:07:48 pm »
_msize, Marco: https://docs.microsoft.com/en-us/cpp/c-runtime-library/reference/msize?view=vs-2019 Sorry. That is the Windows equivalent.
I did not write the correct call, but on the major platforms it is not necessary to keep the size (Windows, Apple, Linux and BSD all have a call that does the same).
You confuse the lowest level: the OS.
Delphi doesn't keep size.  (not the first time I mentioned it.....like 8 years ago)
It is a major hindrance to improving the FPC memory manager: too much housekeeping on top.
« Last Edit: August 17, 2019, 06:12:00 pm by Thaddy »

marcov

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 7359
Re: Why CMem?
« Reply #24 on: August 17, 2019, 09:04:12 pm »
Quote
_msize, Marco: https://docs.microsoft.com/en-us/cpp/c-runtime-library/reference/msize?view=vs-2019 Sorry. That is the Windows equivalent.

No, it is not. It is part of the C language runtime, which sits on top of the original Windows API (kernel32 and user32). Of course msvcrt is pretty standard now, but it is a totally different thing.

Quote
I did not write the correct call but for the major platforms it is not needed to keep size (Win, apple, linux, bsd all have a call that does the same)

All heapmanagers have such an ability. Malloc, FPC's heapmgr, all of them. But it needs to be implemented somewhere, either in the FPC heapmgr or in some malloc. *nix itself only has mmap/sbrk.

malloc and the crt's memory manager are on the same level as the FPC heapmanager: part of a language runtime. IOW, a call to a malloc or crt function doesn't solve the problem for the FPC heapmanager.

Quote
You confuse the lowest level: the OS.
Delphi doesn't keep size.  (not the first time I mentioned it.....like 8 years ago)

Possible. There are legacy features where nobody knows exactly why they are implemented the way they are. Sometimes in the past the simplest solution was chosen. But it is also possible that this makes the memsize() call very cheap.

Quote
It is a major hindrance to improving the FPC memory manager: too much housekeeping on top.

Now is the time. Sooner or later we will have to emit only aligned blocks (on 16- or 32-byte boundaries), and then this must change as well.

PascalDragon

  • Hero Member
  • *****
  • Posts: 573
  • Compiler Developer
Re: Why CMem?
« Reply #25 on: August 18, 2019, 08:55:36 pm »
Quote
Yes. Then again I ran into the problem that in FPC we have to maintain and store size, otherwise Lazarus crashes..... Which is silly.
That slows down our memory manager considerably. The size can on many platforms be obtained elsewhere:
- windows _memsize
- linux & bsd  malloc_usable_size

How did you come to that conclusion? FPC does not enforce that at all; you just need to make sure that your implementation of the functions of TMemoryManager behaves correctly. And if Lazarus (or some package of it) depends on the internal workings of the memory manager (e.g. by assuming a size field in front of the allocated block), then that is a bug and should be reported.
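The idea that a pluggable manager only has to answer the size query correctly, without being told how, can be sketched in C. This is a hypothetical analogue, not FPC's TMemoryManager record: mem_manager and the libc_* functions are made-up names. The size callback here simply delegates to glibc's malloc_usable_size, so the wrapper itself stores no size field.

```c
#include <assert.h>
#include <malloc.h>    /* malloc_usable_size (glibc extension) */
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical table of callbacks that a runtime would call
   instead of a built-in heap manager. */
typedef struct {
    void  *(*get_mem)(size_t size);
    void   (*free_mem)(void *p);
    size_t (*mem_size)(void *p);
} mem_manager;

static void *libc_get_mem(size_t size) { return malloc(size); }
static void  libc_free_mem(void *p)    { free(p); }

/* No header in front of the block: the underlying allocator is asked.
   The result may be larger than the size originally requested. */
static size_t libc_mem_size(void *p)   { return malloc_usable_size(p); }

static const mem_manager libc_manager = {
    libc_get_mem,
    libc_free_mem,
    libc_mem_size
};
```

Code that only goes through the three callbacks keeps working with this manager. Code that peeks at the bytes in front of the block to find a size field does not, which is the kind of dependency described above as a bug.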

Thaddy

  • Hero Member
  • *****
  • Posts: 8680
Re: Why CMem?
« Reply #26 on: August 18, 2019, 09:03:21 pm »
@PascalDragon
In general, but focusing on cmem:
If you leave out the size, Lazarus doesn't work (but FPC does!). Try it yourself using cmem as a template: simply rip out the silly size management.
So Lazarus relies on implementation detail.
This behavior is also documented, although obfuscated (freemem remark) https://forum.lazarus.freepascal.org/index.php?action=post;topic=46420.15;last_msg=331064
The "should" should be replaced with "must" but only for Lazarus.
(Actually I think this is quite a substantial bug in Lazarus)
It is enough to move the declarations from the interface to the implementation section. (They should not be of any use in the interface, whatever the memory manager implementation.)
Actually, I would recommend moving the declarations to the implementation section just because of that!

Problem: Lazarus people: And now what??????????

To my knowledge, Compiler and RTL do not rely on size in the memory manager at all.
« Last Edit: August 18, 2019, 09:37:54 pm by Thaddy »

k1ng

  • New Member
  • *
  • Posts: 36
Re: Why CMem?
« Reply #27 on: August 18, 2019, 10:01:43 pm »
Quote
This behavior is also documented, although obfuscated (freemem remark) https://forum.lazarus.freepascal.org/index.php?action=post;topic=46420.15;last_msg=331064

I'm sure your link is wrong %)

Quote
Actually, I would recommend to move the declarations to the implementation section just because of that!

Problem: Lazarus people: And now what??????????

Fix it in Lazarus code? ;)

jamie

  • Hero Member
  • *****
  • Posts: 1901
Re: Why CMem?
« Reply #28 on: August 18, 2019, 11:08:53 pm »
Since we are on the subject of memory: I've wondered at times how FPC knows how to release a pointer allocation.

I know it's obvious that if you simply pass the same pointer to FreeMem it should work, because it knows where the block is in memory. However, what happens when I increment that pointer? Now it's no longer at its original starting point.

Do we save the actual location of the pointer itself, so that the block can be identified no matter what value the pointer gets changed to, or do we try to find a memory block that the pointer fits into and assume that is the correct one? I would think the latter would be a little slow, but the former would be great, since the address of the pointer body shouldn't change.
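For comparison, this is how the same situation is usually handled in C, where free() only accepts the exact pointer that malloc() returned and passing an incremented pointer is undefined behaviour. The allocator does not track the pointer variable; the program keeps the original value and walks the block with a second pointer. The function make_filled is a made-up example, and whether FPC's heap manager behaves identically in every detail is an assumption not verified here.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Builds a string of `n` copies of `fill`. A separate cursor is advanced
   through the block; `base` is never modified and is the only value that
   may later be handed to free(). */
static char *make_filled(size_t n, char fill)
{
    char *base = malloc(n + 1);
    if (base == NULL)
        return NULL;

    char *cur = base;              /* working copy that gets incremented */
    for (size_t i = 0; i < n; i++)
        *cur++ = fill;
    *cur = '\0';

    return base;                   /* original start of the block */
}
```

The caller releases the result with free() on the returned pointer, not on any pointer derived from it by arithmetic.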

PascalDragon

  • Hero Member
  • *****
  • Posts: 573
  • Compiler Developer
Re: Why CMem?
« Reply #29 on: August 19, 2019, 09:45:47 am »
Quote
@PascalDragon
In general, but focusing on cmem:
If you leave out the size, Lazarus doesn't work (but FPC does!). Try it yourself using cmem as a template: simply rip out the silly size management.
So Lazarus relies on implementation detail.
This behavior is also documented, although obfuscated (freemem remark) https://forum.lazarus.freepascal.org/index.php?action=post;topic=46420.15;last_msg=331064
The "should" should be replaced with "must" but only for Lazarus.
(Actually I think this is quite a substantial bug in Lazarus)

First off, your link is wrong. I assume you meant this?
It's only a possible implementation. Maybe that should be clarified by providing an alternative (e.g. "or the underlying memory manager provides a function to retrieve the size of a memory block"). If some code relies on that, then this is a bug in that code and needs to be reported.

Quote
It is enough to move the declarations from interface to implementation. (These should not have any use there whatever any memory manager implementation)
Actually, I would recommend to move the declarations to the implementation section just because of that!

What declarations? Those of TMemoryManager and its function variables? No. Those must be in the interface section, because they need to be usable from other units, namely to implement memory managers.

Quote
Problem: Lazarus people: And now what??????????

Report a bug with them.