Recent

Author Topic: would multi threading help this project?  (Read 2429 times)

LeP

  • Full Member
  • ***
  • Posts: 124
Re: would multi threading help this project?
« Reply #15 on: January 22, 2026, 03:21:08 pm »
You can use the SysUtils.Now function before and after doing some work, if the difference isn't too small. For smaller/more precise durations you can get the time from the CPU's built-in timer, in Windows via QueryPerformanceCounter/-Frequency:
You can use this (diagnostic.pas) that have TStopWatch implementation: https://forum.lazarus.freepascal.org/index.php/topic,71540.msg558558.html#msg558558

MathMan

  • Sr. Member
  • ****
  • Posts: 472

MathMan

  • Sr. Member
  • ****
  • Posts: 472
Re: would multi threading help this project?
« Reply #17 on: January 22, 2026, 03:49:15 pm »
Just to give speter a ballpark figure here - on modern CPU cores, modern OS, the creation and start of a thread is in the range of multiple 10,000 CPU cycles. One can do substantial work in that time indeed ...
I don't think we should think this way. Initializing something always has its burden, obviously, but it's also necessary to consider the benefits it can bring... 20 minutes of processing time can easily be shaved off using threads, and initialization would have absolutely no impact on the process.
If we assume that only 10 threads can process 10 puzzles in parallel, each lasting 359 ms (the time claimed by the author), we would have a load of 370 ms (considering that each thread takes 1 ms to initialize and schedule... a HUGE and improbable amount of time) versus 3590 ms, or just over 1/9 of the time. If we increase the threads to 16 (as is likely), it's easy to see that the 20 minutes would drop to a few minutes (perhaps even less).
The author should have a processor with 8 power cores (16 threads) and 16 ec cores.
Using the 16 power core threads alone would significantly reduce processing time.
And I doubt the processor would saturate to 100% (it could be around 65%).

I agree with the above. However you seem to have misread my statement. It was given in light of 440bx comment where I just wanted to give a watermark below which one might not gain from multi-threading without further careful analysis.

BeniBela

  • Hero Member
  • *****
  • Posts: 955
    • homepage
Re: would multi threading help this project?
« Reply #18 on: January 22, 2026, 04:59:51 pm »
If you're creating files anyway, you can use multi processes.

Simply start the program ten times, create ten files (of 95k puzzles each), and then merge them.

speter

  • Sr. Member
  • ****
  • Posts: 487
Re: would multi threading help this project?
« Reply #19 on: January 23, 2026, 02:14:14 am »
My thanks to everyone, but particularly Lep and BeniBela.

Food for thought.

cheers
S.

PS: at present I am using GetTickCount64() for timing.
« Last Edit: January 23, 2026, 02:16:05 am by speter »
I climbed mighty mountains, and saw that they were actually tiny foothills. :)

LeP

  • Full Member
  • ***
  • Posts: 124
Re: would multi threading help this project?
« Reply #20 on: January 23, 2026, 09:48:31 am »
My thanks to everyone, but particularly Lep and BeniBela.
Food for thought.
cheers
I'm happy to help.

PS: at present I am using GetTickCount64() for timing.

The resolution is in ms. (milliseconds) and depend from system. Typically it has the granularity between 10 to 16 milliseconds. This granularity cannot be changed even with the dedicated system APIs (like multimedia).
If you use TStopWatch you can have a view on timing more accurate.

Bye

speter

  • Sr. Member
  • ****
  • Posts: 487
Re: would multi threading help this project?
« Reply #21 on: January 23, 2026, 11:34:46 am »
If you use TStopWatch you can have a view on timing more accurate.
Thanks, I had thought GetTick was more accurate.

cheers
S.
I climbed mighty mountains, and saw that they were actually tiny foothills. :)

Thaddy

  • Hero Member
  • *****
  • Posts: 18729
  • To Europe: simply sell USA bonds: dollar collapses
Re: would multi threading help this project?
« Reply #22 on: January 23, 2026, 12:14:32 pm »
Yes, that is misleading: GetTickCount64 does not count "ticks"
QueryPerformanceCounter on Windows does...As does the unix kernel which has depending on CPU nanosecond resolution or better.
Timing is more accurate on Linux, but a bit over the top for everyday use.

Also note that in every case long running timers loose their precision.
These high resolution timers are meant for short timings, unless you run a RTOS.
« Last Edit: January 23, 2026, 12:16:32 pm by Thaddy »
If Europe sells their USA bonds the USD will collapse. Europe can affort that given average state debts. The USA can't affort that. Just an advice...

440bx

  • Hero Member
  • *****
  • Posts: 6069
Re: would multi threading help this project?
« Reply #23 on: January 23, 2026, 01:48:19 pm »
Just for the record, for reasonably accurate measurements that would provide a reasonable idea of how well a particular algorithm performs, neither GetTickCount nor querying the performance counter (rdtsc and/or rdtscp) will be truly accurate (however, they'll provide a good ballpark figure.)

A more accurate measurement can be obtained by querying the number of clock cycles a thread has used from one instant in time to another.  That way, unlike rdtsc/rdtscp, the time when the thread was not executing is _not_ included in the counter thereby yielding a more accurate measurement.

Unfortunately, IIRC, the API(s) that return the number of clock cycles used by the process and each thread are not documented (they are in ntdll) but, they can be seen in action in utilities such as Process Explorer (go to the threads tab) and Process Hacker/System Informer.

Anyway, in the majority of cases, rdtsc/rdtscp can provide a reasonably good idea of relative performance.
FPC v3.2.2 and Lazarus v4.0rc3 on Windows 7 SP1 64bit.

LeP

  • Full Member
  • ***
  • Posts: 124
Re: would multi threading help this project?
« Reply #24 on: January 23, 2026, 03:15:34 pm »
Anyway, in the majority of cases, rdtsc/rdtscp can provide a reasonably good idea of relative performance.
You're right, but keep in mind that massive use of these instructions overloads the processor.
If you have 16 threads and use those instructions for each thread (for example, to monitor small parts like an algorithm), you risk using a lot of CPU time. The rdtsc/rdtscp instructions are "heavy" to execute.

When I used them, for example, with about twenty (maybe even 24) threads, the processor load increased by 10-12% on an Intel I9 14900K CPU (already loaded with 50% processing) compared to using TStopWatch (which uses QueryPerformanceCounter).

LV

  • Sr. Member
  • ****
  • Posts: 408
Re: would multi threading help this project?
« Reply #25 on: January 23, 2026, 05:10:30 pm »
If the algorithm can be parallelized, and as the thread's author mentions, "At present my "creator" takes about 20minutes" then using multithreading will clearly be beneficial. Additionally, a wall clock without a second hand will be adequate for tracking time. 🕰️  😉

440bx

  • Hero Member
  • *****
  • Posts: 6069
Re: would multi threading help this project?
« Reply #26 on: January 23, 2026, 06:48:25 pm »
The rdtsc/rdtscp instructions are "heavy" to execute.
Yes, they are. 

If a "relative" measurement is acceptable, QueryPerformanceCounter is fine.  OTH, it isn't as cheap as I'd like it to be given that it causes a ring transition and, that can also end up consuming a fair number of clock cycles if done "too often".

rdtsc (not the p version) should be ok because it doesn't cause serialization which is the reason for the high overhead.

For those interested in measurement accuracy, the following links provide interesting reading:
https://stackoverflow.com/questions/12631856/difference-between-rdtscp-rdtsc-memory-and-cpuid-rdtsc
https://stackoverflow.com/questions/27693145/rdtscp-versus-rdtsc-cpuid

I haven't checked but, off hand, I believe that rdtsc _without_ the use of CPUID _may_ provide the best performance/methodology if instruction re-ordering is acceptable (allowed to be reflected in the measurements.)

The point I'm trying to make is that, the pursuit of precise measurement is not as simple as it may initially appear.  GetTickCount64() is definitely quite "coarse" but, could definitely be acceptable in some cases where additional precision and accuracy wouldn't change the final _relative_ results.
FPC v3.2.2 and Lazarus v4.0rc3 on Windows 7 SP1 64bit.

LV

  • Sr. Member
  • ****
  • Posts: 408
Re: would multi threading help this project?
« Reply #27 on: January 24, 2026, 09:49:56 am »
This code is simply for morning exercise. I based it on reply #6 and made slight modifications to warm up the processor.

Code: Pascal  [Select][+][-]
  1. program threadbenchmark;
  2.  
  3. {$mode objfpc}{$H+}
  4.  
  5. uses
  6.   SysUtils, Classes;
  7.  
  8. const
  9.   num = 100000000;
  10.   Tests: array[0..8] of Integer = (1, 2, 4, 6, 8, 10, 12, 14, 16);
  11.  
  12. type
  13.   TRec = record
  14.     a, b, c: Integer;
  15.   end;
  16.   TArray = array of TRec;
  17.   PRec = ^TRec;
  18.  
  19. type
  20.   TWorkerThread = class(TThread)
  21.   private
  22.     FL, FR: Integer;
  23.     FData: PRec;
  24.   protected
  25.     procedure Execute; override;
  26.   public
  27.     constructor Create(L, R: Integer; Data: PRec);
  28.   end;
  29.  
  30. constructor TWorkerThread.Create(L, R: Integer; Data: PRec);
  31. begin
  32.   inherited Create(True);
  33.   FL := L;
  34.   FR := R;
  35.   FData := Data;
  36. end;
  37.  
  38. procedure TWorkerThread.Execute;
  39. var
  40.   i: Integer;
  41. begin
  42.   for i := FL to FR do
  43.   begin
  44.     FData[i].a := i mod 10;
  45.     FData[i].b := (i + 5) mod 10;
  46.     FData[i].c := FData[i].a + FData[i].b;
  47.   end;
  48. end;
  49.  
  50. function RunTest(var Arr: TArray; CurrentThreads: Integer): QWord;
  51. var
  52.   Threads: array of TWorkerThread;
  53.   i, L, R: Integer;
  54.   t0: QWord;
  55. begin
  56.  
  57.   FillChar(Arr[0], Length(Arr) * SizeOf(TRec), 0);
  58.  
  59.   SetLength(Threads, CurrentThreads);
  60.   t0 := GetTickCount64;
  61.  
  62.   for i := 0 to CurrentThreads - 1 do
  63.   begin
  64.     L := (i * Length(Arr)) div CurrentThreads;
  65.     R := ((i + 1) * Length(Arr)) div CurrentThreads - 1;
  66.     Threads[i] := TWorkerThread.Create(L, R, @Arr[0]);
  67.     Threads[i].Start;
  68.   end;
  69.  
  70.   for i := 0 to High(Threads) do
  71.   begin
  72.     Threads[i].WaitFor;
  73.     Threads[i].Free;
  74.   end;
  75.  
  76.   Result := GetTickCount64 - t0;
  77. end;
  78.  
  79. procedure Benchmark;
  80. var
  81.   Arr: TArray;
  82.   tSingle, tCurrent: QWord;
  83.   Speedup, Efficiency: Double;
  84.   i: Integer;
  85. begin
  86.   Writeln('Allocating ', (Int64(num) * SizeOf(TRec)) div (1024*1024), ' MB RAM...');
  87.   SetLength(Arr, num);
  88.  
  89.   Writeln('Running benchmarks...');
  90.   Writeln;
  91.   Writeln('Threads | Time (ms) | Speedup | Efficiency');
  92.   Writeln('-------------------------------------------');
  93.  
  94.   tSingle := RunTest(Arr, 1);
  95.   Writeln(1:7, ' | ', tSingle:9, ' |  1.00x  |   100.0%');
  96.  
  97.   for i := 1 to High(Tests) do
  98.   begin
  99.     tCurrent := RunTest(Arr, Tests[i]);
  100.  
  101.     if tCurrent > 0 then
  102.     begin
  103.       Speedup := tSingle / tCurrent;
  104.       Efficiency := (Speedup / Tests[i]) * 100;
  105.       Writeln(Tests[i]:7, ' | ', tCurrent:9, ' | ', Speedup:6:2, 'x  | ', Efficiency:7:1, '%');
  106.     end;
  107.   end;
  108.  
  109.   Writeln('-------------------------------------------');
  110.   Writeln('Done. Press Enter to exit.');
  111.   Readln;
  112. end;
  113.  
  114. begin
  115.   Benchmark;
  116. end.
  117.  

On my old guy i7 8700 output:

Code: Text  [Select][+][-]
  1. Allocating 1144 MB RAM...
  2. Running benchmarks...
  3.  
  4. Threads | Time (ms) | Speedup | Efficiency
  5. -------------------------------------------
  6.       1 |      1328 |  1.00x  |   100.0%
  7.       2 |       641 |   2.07x  |   103.6%
  8.       4 |       359 |   3.70x  |    92.5%
  9.       6 |       250 |   5.31x  |    88.5%
  10.       8 |       235 |   5.65x  |    70.6%
  11.      10 |       188 |   7.06x  |    70.6%
  12.      12 |       172 |   7.72x  |    64.3%
  13.      14 |       218 |   6.09x  |    43.5%
  14.      16 |       234 |   5.68x  |    35.5%
  15. -------------------------------------------
  16. Done. Press Enter to exit.
  17.  

LeP

  • Full Member
  • ***
  • Posts: 124
Re: would multi threading help this project?
« Reply #28 on: January 24, 2026, 01:38:42 pm »
This is from mine (14900HX):
Code: Text  [Select][+][-]
  1. Allocating 1144 MB RAM...
  2. Running benchmarks...
  3.  
  4. Threads | Time (ms) | Speedup | Efficiency
  5. -------------------------------------------
  6.       1 |       360 |  1.00x  |   100.0%
  7.       2 |       188 |   1.91x  |    95.7%
  8.       4 |       125 |   2.88x  |    72.0%
  9.       6 |       140 |   2.57x  |    42.9%
  10.       8 |        93 |   3.87x  |    48.4%
  11.      10 |        94 |   3.83x  |    38.3%
  12.      12 |        78 |   4.62x  |    38.5%
  13.      14 |        62 |   5.81x  |    41.5%
  14.      16 |        62 |   5.81x  |    36.3%
  15. -------------------------------------------
  16. Done. Press Enter to exit.

BrunoK

  • Hero Member
  • *****
  • Posts: 765
  • Retired programmer
Re: would multi threading help this project?
« Reply #29 on: January 24, 2026, 04:48:28 pm »
I modified a bit LV's test program and made a few change so it runs also on my Linux mint installation.

At the bottom, some timings obtained running that program (-O1 and -OoREGVAR optimizations).

The objective of modifying LV's test was to compare various ways of getting sub millisecond time values.

Using my uRdtsc, column RDTSC gives time measured using RDTSC that  has been calibrated with Windows QueryPerformance counter. The linux version uses the much finer grained (in linux) GetTickCount64 ms for calibration.

In the Windows run, RDTSC and QPerfC columns are very near in values hence showing RDTSC can be quite safely used for short timing benchmarks.

In the Linux run, the "Time (ms)" column matches RDTSC column indicating that RDTSC can be quite safely used for short timing benchmarks.

Windows 10, on my system, the maximum throughput is obtained for thread count = GetCPUCount.

CPU count  8 --> Windows
Allocating 1144 MB RAM...
Running benchmarks...

Threads | Time (ms) |    RDTSC   |  QPerfC  |  Speedup | Efficiency
-------------------------------------------------------------------
      1 |       516 | 514.277 ms |  514.276 |   1.00x  |     0.0%
      1 |       531 | 518.190 ms |  518.188 |   0.97x  |    97.2%
      2 |       265 | 266.587 ms |  266.586 |   1.95x  |    97.4%
      3 |       172 | 183.179 ms |  183.177 |   3.00x  |   100.0%
      4 |       203 | 187.850 ms |  187.849 |   2.54x  |    63.5%
      5 |       172 | 171.002 ms |  171.001 |   3.00x  |    60.0%
      6 |       156 | 168.870 ms |  168.869 |   3.31x  |    55.1%
      7 |       141 | 147.154 ms |  147.153 |   3.66x  |    52.3%
      8 |       141 | 137.597 ms |  137.596 |   3.66x  |    45.7%
      9 |       172 | 178.021 ms |  178.020 |   3.00x  |    33.3%
     10 |       156 | 148.710 ms |  148.709 |   3.31x  |    33.1%
     11 |       156 | 147.953 ms |  147.952 |   3.31x  |    30.1%
     12 |       156 | 153.956 ms |  153.955 |   3.31x  |    27.6%
     13 |       140 | 140.507 ms |  140.506 |   3.69x  |    28.4%
     14 |       157 | 146.378 ms |  146.378 |   3.29x  |    23.5%
     15 |       141 | 139.950 ms |  139.949 |   3.66x  |    24.4%
     16 |       140 | 145.673 ms |  145.672 |   3.69x  |    23.0%
-------------------------------------------------------------------
Done. Press Enter to exit.

CPU count  1 --> Linux
Allocating 1144 MB RAM...
Running benchmarks...

Threads | Time (ms) |    RDTSC   |  QPerfC  |  Speedup | Efficiency
-------------------------------------------------------------------
      1 |       501 | 500.970 ms |  ???.??? |   1.00x  |     0.0%
      1 |       501 | 501.097 ms |  ???.??? |   1.00x  |   100.0%
      2 |       301 | 301.004 ms |  ???.??? |   1.66x  |    83.2%
      3 |       201 | 200.685 ms |  ???.??? |   2.49x  |    83.1%
      4 |       201 | 200.786 ms |  ???.??? |   2.49x  |    62.3%
      5 |       201 | 200.822 ms |  ???.??? |   2.49x  |    49.9%
      6 |       201 | 200.892 ms |  ???.??? |   2.49x  |    41.5%
      7 |       201 | 201.131 ms |  ???.??? |   2.49x  |    35.6%
      8 |       201 | 201.185 ms |  ???.??? |   2.49x  |    31.2%
      9 |       202 | 201.354 ms |  ???.??? |   2.48x  |    27.6%
     10 |       208 | 207.612 ms |  ???.??? |   2.41x  |    24.1%
     11 |       205 | 204.436 ms |  ???.??? |   2.44x  |    22.2%
     12 |       209 | 208.845 ms |  ???.??? |   2.40x  |    20.0%
     13 |       207 | 207.361 ms |  ???.??? |   2.42x  |    18.6%
     14 |       207 | 207.307 ms |  ???.??? |   2.42x  |    17.3%
     15 |       211 | 210.910 ms |  ???.??? |   2.37x  |    15.8%
     16 |       211 | 210.598 ms |  ???.??? |   2.37x  |    14.8%
-------------------------------------------------------------------
Done. Press Enter to exit.

Attached, project that outputs above timings.

 

TinyPortal © 2005-2018