
Author Topic: Threaded application in MacOS not faster as in Windows and Linux  (Read 2136 times)

AlanTheBeast

  • Sr. Member
  • ****
  • Posts: 390
  • My software never cras....
Re: Threaded application in MacOS not faster as in Windows and Linux
« Reply #15 on: June 24, 2025, 04:44:47 pm »
I have added threads to my application to speed it up. On Windows and Linux it now runs about twice as fast as the single-threaded version, but not on MacOS. On an Intel Mac the processing speed is about the same, and for an M-processor Mac I get reports that it is much slower than the single-threaded version.

Furthermore, System.CPUCount indicates one CPU in the debugger, but the Activity Monitor shows more: 6 or 7 threads.

Does anybody have an idea what could cause this poor performance and how to fix it? I have tried including cmem in the .lpr file, but it doesn't help.

I do a lot of threaded testing and generally I get much more when threaded, but not in proportion to the number of threads (up to n performance cores or all cores).
But definitely where needed, threading boosts overall output.

Some odd things though ...

While InterlockedExchange(ZZZ,1)>0 do;           // (Spinlock) will bog down the computer when multiple
                                                 // instances (threads) of the same function are accessing that lock.
While InterlockedExchange(ZZZ,1)>0 do sleep(0);  // will not improve things much, but
While InterlockedExchange(ZZZ,1)>0 do sleep(1);  // will improve things a lot.

Maybe that's to be expected, but I don't have (I think) the same issues on x86 (MacOS) or ARM (RaspPi, Ultibo).

I haven't used CriticalSection for the above as such has a higher cycle cost in and out.
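
For anyone who wants to try the sleep(1) variant, here is a minimal sketch of it as a complete console program. The names LockFlag, Counter, AcquireLock, ReleaseLock and TWorker are made up for illustration (LockFlag plays the role of ZZZ), and the explicit ReadWriteBarrier calls are my own precaution, on the assumption that InterlockedExchange may not imply a full memory barrier on every target. It is not a tuned implementation.

```pascal
program SpinLockSketch;
{$mode objfpc}{$H+}

uses
  {$IFDEF UNIX}cthreads,{$ENDIF}
  Classes, SysUtils;

var
  LockFlag: LongInt = 0;  // 0 = free, 1 = held (hypothetical stand-in for ZZZ)
  Counter: Int64 = 0;     // shared data protected by the lock

procedure AcquireLock;
begin
  // InterlockedExchange returns the previous value: non-zero means
  // another thread already holds the lock, so back off and retry.
  while InterlockedExchange(LockFlag, 1) <> 0 do
    Sleep(1);
  ReadWriteBarrier;  // precaution: keep the protected accesses after the acquire
end;

procedure ReleaseLock;
begin
  ReadWriteBarrier;  // precaution: keep the protected accesses before the release
  InterlockedExchange(LockFlag, 0);
end;

type
  TWorker = class(TThread)
  protected
    procedure Execute; override;
  end;

procedure TWorker.Execute;
var
  i: Integer;
begin
  for i := 1 to 10000 do
  begin
    AcquireLock;
    Inc(Counter);  // the critical section
    ReleaseLock;
  end;
end;

var
  Workers: array[0..3] of TWorker;
  i: Integer;
begin
  for i := Low(Workers) to High(Workers) do
    Workers[i] := TWorker.Create(False);  // False = start running immediately
  for i := Low(Workers) to High(Workers) do
  begin
    Workers[i].WaitFor;
    Workers[i].Free;
  end;
  // With mutual exclusion intact the total is 4 * 10000.
  WriteLn('Counter = ', Counter);
end.
```

Swapping Sleep(1) for Sleep(0) or an empty statement in AcquireLock reproduces the other two variants, which makes it easy to compare them on a given OS.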

« Last Edit: June 24, 2025, 05:11:00 pm by AlanTheBeast »
Everyone talks about the weather but nobody does anything about it.
..Samuel Clemens.

han

  • Full Member
  • ***
  • Posts: 130
Re: Threaded application in MacOS not faster as in Windows and Linux
« Reply #16 on: June 24, 2025, 05:04:04 pm »
The problem is solved. I keep a locally modified copy of lazarus/components/multithreadprocs/mtpcpu.pas

There the constant name _SC_NPROCESSORS_ONLN is changed to _SC_NPROCESSORS_CONF; the original value, 83, is correct for that name.

https://gitlab.com/freepascal.org/lazarus/lazarus/-/issues/41659
https://gitlab.com/freepascal.org/fpc/source/-/issues/41187

And a fix for my virtual machine, where the number of configured CPUs is reported as 128:

Code: Pascal
function GetSystemThreadCount: integer;
// returns a good default for the number of threads on this system
{$IF defined(windows)}
//returns total number of processors available to system including logical hyperthreaded processors
var
  i: Integer;
  ProcessAffinityMask, SystemAffinityMask: DWORD_PTR;
  Mask: DWORD;
  SystemInfo: SYSTEM_INFO;
begin
  if GetProcessAffinityMask(GetCurrentProcess, ProcessAffinityMask, SystemAffinityMask)
  then begin
    Result := 0;
    for i := 0 to 31 do begin
      Mask := DWord(1) shl i;
      if (ProcessAffinityMask and Mask)<>0 then
        inc(Result);
    end;
  end else begin
    //can't get the affinity mask so we just report the total number of processors
    GetSystemInfo(SystemInfo);
    Result := SystemInfo.dwNumberOfProcessors;
  end;
end;
{$ELSEIF defined(UNTESTEDsolaris)}
  begin
    //untested branch: assign the function result (was "t = ...", which is not valid Pascal)
    Result := sysconf(_SC_NPROC_CONF);
  end;
{$ELSEIF defined(freebsd) or defined(darwin)}
type
  PSysCtl = {$IF FPC_FULLVERSION>=30200}pcint{$ELSE}pchar{$ENDIF};
var
  mib: array[0..1] of cint;
  len: csize_t;
  t: cint;
begin
  mib[0] := CTL_HW;
  mib[1] := HW_NCPU;
  len := sizeof(t);
  fpsysctl(PSysCtl(@mib), 2, @t, @len, Nil, 0);
  Result:=t;
end;
{$ELSEIF defined(linux)}
  begin
    Result:=sysconf(_SC_NPROCESSORS_CONF);
    if result=128 then
       result:=sysconf(84 {_SC_NPROCESSORS_ONLN}); //fix for VMWare virtual machine
  end;
{$ELSE}
  begin
    Result:=1;
  end;
{$ENDIF}
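
To see what each call reports on a given machine, a small stand-alone check along these lines may help. It is only a sketch: the program name is made up, the sysconf binding is declared by hand rather than taken from mtpcpu.pas, and the values 83 and 84 are the Linux/glibc numbers used above, so the sysconf part is compiled on Linux only.

```pascal
program CpuCountSketch;
{$mode objfpc}{$H+}

uses
  ctypes;

{$IFDEF LINUX}
const
  // Linux/glibc values; other systems use different numbers.
  _SC_NPROCESSORS_CONF = 83;  // processors configured
  _SC_NPROCESSORS_ONLN = 84;  // processors currently online

// Hand-written binding to libc's sysconf (an assumption for this sketch).
function sysconf(name: cint): clong; cdecl; external 'c' name 'sysconf';
{$ENDIF}

begin
  // The RTL value mentioned earlier in the thread.
  WriteLn('System.CPUCount               : ', System.CPUCount);
{$IFDEF LINUX}
  WriteLn('sysconf(_SC_NPROCESSORS_CONF) : ', sysconf(_SC_NPROCESSORS_CONF));
  WriteLn('sysconf(_SC_NPROCESSORS_ONLN) : ', sysconf(_SC_NPROCESSORS_ONLN));
{$ENDIF}
end.
```

If the CONF line prints 128 inside a virtual machine while the ONLN line prints the real count, that is the case the fallback in the Linux branch above is meant to catch.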


Thaddy

  • Hero Member
  • *****
  • Posts: 17396
  • Ceterum censeo Trump esse delendam
Re: Threaded application in MacOS not faster as in Windows and Linux
« Reply #17 on: June 24, 2025, 05:28:02 pm »
Yes, it is really odd, but what odd things are you thinking of?
Code: Pascal
Some odd things though ...

While InterlockedExchange(ZZZ,1)>0 do;           // (Spinlock) will bog down the computer when multiple
                                                 // instances (threads) of the same function are accessing that lock.
While InterlockedExchange(ZZZ,1)>0 do sleep(0);  // will not improve things much, but
While InterlockedExchange(ZZZ,1)>0 do sleep(1);  // will improve things a lot.

That is really bad code and slows down on all platforms unless the values are absolutely atomic.
You must know that the LCL itself is not thread-safe.
If you are using trunk/main I can give you a much better example.
« Last Edit: June 24, 2025, 05:43:26 pm by Thaddy »
Due to censorship, I changed this to "Nelly the Elephant". Keeps the message clear.

AlanTheBeast

  • Sr. Member
  • ****
  • Posts: 390
  • My software never cras....
Re: Threaded application in MacOS not faster as in Windows and Linux
« Reply #18 on: June 24, 2025, 05:38:36 pm »
M3 (4 performance, 4 efficiency)

Single threaded:

 Bub:   1 100000     6.526s.   Swaps: 2503772928
QSor:   1 100000     0.014s. 


4 threads (each thread working on a different array of 100,000 random doubles).
(Bubble sort array [n] is identical to, but separate from, QSort array [n]; the arrays for the different n's are unique.)
 Bub:   1 100000     6.431s.   Swaps: 2500476391
 Bub:   2 100000     6.440s.   Swaps: 2502651257
 Bub:   4 100000     6.548s.   Swaps: 2493082950
 Bub:   3 100000     6.656s.   Swaps: 2504529526
QSor:   3 100000     0.010s. 
QSor:   2 100000     0.010s. 
QSor:   4 100000     0.011s. 
QSor:   1 100000     0.011s. 

Note that the QSorts are started after all of the BSorts are completed, but that doesn't really matter much given how fast QSort is (usually I run all 8 or 16 threads at the same time, but for the table above I separated the two bunches).

So, about 5% slower per thread, but 380% as much work done (Bubble sort).

The cost on the QSort is higher (overhead being more of a fixed penalty than proportional, and the time resolution being pretty coarse at 1 ms).

In all cases the sorts don't start until the top of a new millisecond.


For 8 you ask?  Such a deal:
 Bub:   8 100000    12.420s.   Swaps: 2497630351
 Bub:   5 100000    12.582s.   Swaps: 2494023876
 Bub:   7 100000    12.599s.   Swaps: 2502442288
 Bub:   4 100000    12.700s.   Swaps: 2493234764
 Bub:   6 100000    12.753s.   Swaps: 2497173207
 Bub:   3 100000    12.764s.   Swaps: 2501689957
 Bub:   2 100000    12.802s.   Swaps: 2507992479
 Bub:   1 100000    12.861s.   Swaps: 2502666380
QSor:   4 100000     0.010s.   
QSor:   1 100000     0.011s.   
QSor:   3 100000     0.011s.   
QSor:   2 100000     0.011s.   
QSor:   6 100000     0.012s.   
QSor:   5 100000     0.013s.   
QSor:   8 100000     0.013s.   
QSor:   7 100000     0.013s.   

Here the relative slowness of the efficiency cores pops in to slow everything down.
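
As a rough cross-check of the M3 tables, throughput relative to the single-threaded run can be estimated as (number of jobs × single-thread time) / (time of the slowest thread). The sketch below just plugs in the bubble sort figures from the tables above; the function name RelativeThroughput is made up, and taking the slowest thread as the wall-clock time is my assumption about how to compare the runs.

```pascal
program ThroughputSketch;
{$mode objfpc}{$H+}

// Work done relative to one single-threaded run, using the slowest
// thread's time as the wall-clock time of the whole batch.
function RelativeThroughput(Jobs: Integer; SingleTime, SlowestTime: Double): Double;
begin
  Result := (Jobs * SingleTime) / SlowestTime;
end;

const
  SingleBubble = 6.526;    // single-threaded bubble sort, seconds
  Slowest4     = 6.656;    // slowest of the 4 threads, seconds
  Slowest8     = 12.861;   // slowest of the 8 threads, seconds

begin
  WriteLn('4 threads: ', RelativeThroughput(4, SingleBubble, Slowest4):0:2, 'x');
  WriteLn('8 threads: ', RelativeThroughput(8, SingleBubble, Slowest8):0:2, 'x');
end.
```

With these figures the 4-thread run comes out a little under 4x, in the same range as the roughly 380% quoted above, while the 8-thread run gains almost nothing beyond that, which fits the efficiency cores holding the batch back.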

Edit: if I have time this afternoon I'll try the above on a 2012 iMac (i7 4core hyperthreaded).

On an iMac 2012 i7 Quad (hyperthreaded) we get
4 threads:
 Bub:   1 100000    24.385s.   Swaps: 2499644665
 Bub:   2 100000    24.395s.   Swaps: 2499245409
 Bub:   3 100000    24.416s.   Swaps: 2498038873
 Bub:   4 100000    24.438s.   Swaps: 2507638488
QSor:   1 100000     0.010s.   
QSor:   3 100000     0.010s.   
QSor:   4 100000     0.013s.   
QSor:   2 100000     0.014s.   

During which the "HT" virtual core was not close to saturation.

8 Threads:
 Bub:   6 100000    34.444s.   Swaps: 2496430314
 Bub:   7 100000    34.506s.   Swaps: 2496810509
 Bub:   3 100000    34.540s.   Swaps: 2497711630
 Bub:   8 100000    34.619s.   Swaps: 2497535014
 Bub:   1 100000    34.626s.   Swaps: 2502372368
 Bub:   2 100000    34.628s.   Swaps: 2502359431
 Bub:   5 100000    34.684s.   Swaps: 2496849057
 Bub:   4 100000    34.706s.   Swaps: 2503886671
QSor:   3 100000     0.016s.   
QSor:   1 100000     0.017s.   
QSor:   5 100000     0.017s.   
QSor:   4 100000     0.017s.   
QSor:   6 100000     0.018s.   
QSor:   7 100000     0.017s.   
QSor:   8 100000     0.017s.   
QSor:   2 100000     0.020s.   
During which the 4 cores were saturated and the 4 virtual cores about 80-90%.


Or for S&G, Intel code running on the M3!!!
Surprisingly, it is only about 18% slower than the ARM code on the M3.

 Bub:   2 100000    14.713s.   Swaps: 2499702458
 Bub:   3 100000    14.830s.   Swaps: 2508447563
 Bub:   7 100000    14.880s.   Swaps: 2490812949
 Bub:   1 100000    14.888s.   Swaps: 2496611076
 Bub:   8 100000    14.892s.   Swaps: 2490643169
 Bub:   6 100000    14.911s.   Swaps: 2500327739
 Bub:   5 100000    14.963s.   Swaps: 2494789694
 Bub:   4 100000    14.975s.   Swaps: 2496071083
QSor:   5 100000     0.019s.   
QSor:   4 100000     0.024s.   
QSor:   2 100000     0.024s.   
QSor:   7 100000     0.013s.   
QSor:   6 100000     0.015s.   
QSor:   1 100000     0.027s.   
QSor:   3 100000     0.013s.   
QSor:   8 100000     0.016s.   

(Edit: Swap count for QSort was very wrong, and the overhead to include it for multiple parallel instances would defeat the timing purpose, so it is omitted. Suffice it to say there are far, far fewer swaps with QSort than with bubble sort, esp. as the number of sorted elements rises.)
« Last Edit: June 25, 2025, 04:49:45 pm by AlanTheBeast »

AlanTheBeast

  • Sr. Member
  • ****
  • Posts: 390
  • My software never cras....
Re: Threaded application in MacOS not faster as in Windows and Linux
« Reply #19 on: June 24, 2025, 05:42:48 pm »
Yes, it is really odd, but what odd things are you thinking of?
Code: Pascal
Some odd things though ...

While InterlockedExchange(ZZZ,1)>0 do;           // (Spinlock) will bog down the computer when multiple
                                                 // instances (threads) of the same function are accessing that lock.
While InterlockedExchange(ZZZ,1)>0 do sleep(0);  // will not improve things much, but
While InterlockedExchange(ZZZ,1)>0 do sleep(1);  // will improve things a lot.

That is really bad code and slows down on all platforms unless the values are absolutely atomic.

Be more specific please. Do note the above are separate versions of the same entry point to atomic operations within a thread. Of course the ZZZ interlock is set back to 0 when the critical section of code is done.

The above is just to show that the OS (and possibly hardware platform) has great effect on such operations.

Do also note I'm working command line only, no object coding, etc.

Thaddy

  • Hero Member
  • *****
  • Posts: 17396
  • Ceterum censeo Trump esse delendam
Re: Threaded application in MacOS not faster as in Windows and Linux
« Reply #20 on: June 24, 2025, 05:46:45 pm »
I added an example, just yesterday,  for console apps to the threading course in the wiki.
That needs trunk, but shows a pattern that does not need all the interlocked primitives and delegates to a controller thread.

AlanTheBeast

  • Sr. Member
  • ****
  • Posts: 390
  • My software never cras....
Re: Threaded application in MacOS not faster as in Windows and Linux
« Reply #21 on: June 24, 2025, 05:53:55 pm »
I added an example, just yesterday,  for console apps to the threading course in the wiki.
That needs trunk, but shows a pattern that does not need all the interlocked primitives and delegates to a controller thread.

I'll take a look when I have a moment.  That said, when I do such as I noted above, I usually find an optimized value for the sleep period that varies between 1 and 10 - all at the expense of boring testing that usually lasts longer than the actual project.  No need to pass via a controlling process.

It is not clear to me what the MacOS preemption rate is anymore.  2 decades ago it was 10ms.  But I think it's shorter now, and probably more variable (with things like thread coalescing going on with blackbox definition).

Thaddy

  • Hero Member
  • *****
  • Posts: 17396
  • Ceterum censeo Trump esse delendam
Re: Threaded application in MacOS not faster as in Windows and Linux
« Reply #22 on: June 24, 2025, 06:05:50 pm »
It is much shorter. In the sub nanosecond range on AARCH.
Anyway: if you run multiple processes/cores over the same interfaces you need  a controlling thread.
The example I refer to runs six workers over 6 cores + a controller thread and the main thread for a total of 8. The workload is 1000ms for all. The main process finishes when the controller thread is finished and takes .... 1000ms in total for all threads. Main finishes when the longest workload is finished. Just as expected.
There is no to very minimal overhead. You do not even have to bother about affinity: makes no difference.

The example looks deceptively simple but it isn't.
https://wiki.freepascal.org/Multithreaded_Application_Tutorial#Waiting_for_another_thread_part_2,_the_future
« Last Edit: June 24, 2025, 06:13:19 pm by Thaddy »

AlanTheBeast

  • Sr. Member
  • ****
  • Posts: 390
  • My software never cras....
Re: Threaded application in MacOS not faster as in Windows and Linux
« Reply #23 on: June 24, 2025, 07:54:52 pm »
It is much shorter. In the sub nanosecond range on AARCH.
Anyway: if you run multiple processes/cores over the same interfaces you need  a controlling thread.
The example I refer to runs six workers over 6 cores + a controller thread and the main thread for a total of 8. The workload is 1000ms for all. The main process finishes when the controller thread is finished and takes .... 1000ms in total for all threads. Main finishes when the longest workload is finished. Just as expected.
There is no to very minimal overhead. You do not even have to bother about affinity: makes no difference.


As I pointed out, the result on Ultibo RTL (RaspPi / ARM) is not round nanoseconds, thus you need to get the "scale factor" for the ticks.  Whether this resolves to greater or finer than a nanosecond is hardware dependent.  On an M3 it could theoretically be 4 ticks per ns (maybe 8).  But that wouldn't mean much, as a sampling function would be limited by its own execution cycles to even pull the value, never mind the call and return cost.  Yes - it could be inlined - but then as you mentioned it's abstracted out of view and that usually means overhead - even if little.

All that said, my current burden is ms.  µs would be a huge improvement (and adequate to 99% of my needs) and anything finer than that a bonus.

On Ultibo RTL (RaspPi 4) I used affinity but it was overkill for the application.

Unless you meant pre-emption? By pre-emption rate I was referring to the length of time a process could run before the OS would pre-empt it and give control of the CPU (or core) to another process.  This cannot be on the order of nanoseconds (wait 50 or so years and then maybe....?).

On the RaspPi 4 Ultibo RTL the pre-emption period was 0.5 ms (IIRC).   I'd have been happy with 1ms as, IAC, there was far better thread control with Ultibo's FPC library than FPC for MacOS/Windows/Linux offers.
« Last Edit: June 24, 2025, 08:18:55 pm by AlanTheBeast »

AlanTheBeast

  • Sr. Member
  • ****
  • Posts: 390
  • My software never cras....
Re: Threaded application in MacOS not faster as in Windows and Linux
« Reply #24 on: June 26, 2025, 02:37:10 am »
The example looks deceptively simple but it isn't.
https://wiki.freepascal.org/Multithreaded_Application_Tutorial#Waiting_for_another_thread_part_2,_the_future

Want to try that but I'm on fpc 3.2.2 and this example demands 3.3.1 or better.

To get 3.3.1 apparently I need to install FpcUpDeluxe ... which (from Github) appears damaged ...

Anyway, in time ... maybe.

dbannon

  • Hero Member
  • *****
  • Posts: 3407
    • tomboy-ng, a rewrite of the classic Tomboy
Re: Threaded application in MacOS not faster as in Windows and Linux
« Reply #25 on: June 26, 2025, 09:39:08 am »

Want to try that but I'm on fpc 3.2.2 and this example demands 3.3.1 or better.

To get 3.3.1 apparently I need to install FpcUpDeluxe ... which (from Github) appears damaged ...

Your fpc322 will comfortably build 331 for you. See the wiki.

Davo
Lazarus 3, Linux (and reluctantly Win10/11, OSX Monterey)
My Project - https://github.com/tomboy-notes/tomboy-ng and my github - https://github.com/davidbannon

AlanTheBeast

  • Sr. Member
  • ****
  • Posts: 390
  • My software never cras....
Re: Threaded application in MacOS not faster as in Windows and Linux
« Reply #26 on: June 26, 2025, 03:33:38 pm »

Want to try that but I'm on fpc 3.2.2 and this example demands 3.3.1 or better.

To get 3.3.1 apparently I need to install FpcUpDeluxe ... which (from Github) appears damaged ...

Your fpc322 will comfortably build 331 for you. See the wiki.

Davo

Good point.  But I've also had a reply from DonAlfredo over fixing the other issue so I'll start there.

Thanks.

PS - will be forced to do so to test try another bit of code from Thaddy.

(Crossed out "test" as I'm sure it will work very well and probably well beyond my needs).
« Last Edit: June 26, 2025, 04:00:14 pm by AlanTheBeast »

 
