🤔 To demonstrate multithreaded scalability, let's burden the threads with intensive algebraic operations.
Test machine: Intel Core i7-12700H (6 P-cores + 8 E-cores, 20 hardware threads).
Pascal FPC 3.2.2 -O3 (Double Precision + Heavy Algebra)
-----------------------------------------------------------------------------
Allocating 2288 MB RAM...
-----------------------------------------------------------------------------
Threads | Time (ms) | Speedup | Efficiency | Bandwidth (GB/s) | Validate
-----------------------------------------------------------------------------
1 | 9109 | 1.00x | 100.00% | 0.25 | OK
2 | 4750 | 1.92x | 95.88% | 0.47 | OK
4 | 2594 | 3.51x | 87.79% | 0.86 | OK
6 | 1797 | 5.07x | 84.48% | 1.24 | OK
8 | 1656 | 5.50x | 68.76% | 1.35 | OK
10 | 1453 | 6.27x | 62.69% | 1.54 | OK
12 | 1344 | 6.78x | 56.48% | 1.66 | OK
14 | 1250 | 7.29x | 52.05% | 1.79 | OK
16 | 1187 | 7.67x | 47.96% | 1.88 | OK
18 | 1172 | 7.77x | 43.18% | 1.91 | OK
20 | 1187 | 7.67x | 38.37% | 1.88 | OK
22 | 1141 | 7.98x | 36.29% | 1.96 | OK
24 | 1125 | 8.10x | 33.74% | 1.99 | OK
-----------------------------------------------------------------------------
C++ gcc 14.2.0 -O3 (Double Precision + Heavy Algebra)
-----------------------------------------------------------------------------
Allocating 2288 MB RAM...
-----------------------------------------------------------------------------
Threads | Time (ms) | Speedup | Efficiency | Bandwidth (GB/s) | Validate
-----------------------------------------------------------------------------
1 | 9340 | 1.00x | 100.00% | 0.24 | OK
2 | 4571 | 2.04x | 102.17% | 0.49 | OK
4 | 2504 | 3.73x | 93.25% | 0.89 | OK
6 | 1739 | 5.37x | 89.52% | 1.29 | OK
8 | 1535 | 6.08x | 76.06% | 1.46 | OK
10 | 1410 | 6.62x | 66.24% | 1.59 | OK
12 | 1312 | 7.12x | 59.32% | 1.70 | OK
14 | 1244 | 7.51x | 53.63% | 1.80 | OK
16 | 1161 | 8.04x | 50.28% | 1.93 | OK
18 | 1086 | 8.60x | 47.78% | 2.06 | OK
20 | 1174 | 7.96x | 39.78% | 1.90 | OK
22 | 1256 | 7.44x | 33.80% | 1.78 | OK
24 | 1292 | 7.23x | 30.12% | 1.73 | OK
-----------------------------------------------------------------------------
The P-Core Zone (1–6 Threads)
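Near-Linear Scaling: Efficiency stays at roughly 84-90% through 6 threads (Pascal 84.48%, C++ 89.52%). These are the 6 high-performance cores working at full tilt. C++ shows the better speedup here (5.37x vs. Pascal's 5.07x), though part of that gap comes from its slower single-thread baseline (9340 ms vs. 9109 ms); in absolute time it leads by about 3% at 6 threads (1739 ms vs. 1797 ms). GCC 14's more aggressive instruction scheduling is one possible explanation. The 102.17% efficiency C++ reports at 2 threads is most likely run-to-run variation rather than true superlinear scaling.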
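For reference, the Speedup and Efficiency columns follow the usual definitions: speedup is the single-thread time divided by the n-thread time, and efficiency is that speedup divided by n. A minimal sketch is below; the function names are illustrative and not taken from the benchmark source.

```cpp
#include <cassert>
#include <cstdio>

// Speedup relative to the single-thread run: S(n) = T(1) / T(n).
double speedup(double t1_ms, double tn_ms) {
    assert(tn_ms > 0.0);
    return t1_ms / tn_ms;
}

// Parallel efficiency in percent: E(n) = S(n) / n * 100.
double efficiency_pct(double t1_ms, double tn_ms, int threads) {
    assert(threads > 0);
    return speedup(t1_ms, tn_ms) / threads * 100.0;
}

// Prints one row with the thread count, time, speedup and efficiency.
void print_row(int threads, double t1_ms, double tn_ms) {
    std::printf("%7d | %9.0f | %6.2fx | %9.2f%%\n", threads, tn_ms,
                speedup(t1_ms, tn_ms), efficiency_pct(t1_ms, tn_ms, threads));
}
```

Because each compiler's speedup is measured against its own single-thread time, a slower baseline inflates the speedup figure, so absolute times are the fairer way to compare Pascal and C++.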
The E-Core Entry (8–14 Threads)
This is where it gets interesting. After the 6th thread, the speedup stops being linear.
Heterogeneity in Action: At 14 threads (6P + 8E), the execution time reaches 1244–1250 ms. Adding the 8 efficiency cores gives a real boost (roughly 1.4x faster than at 6 threads), but efficiency drops to ~52-54%. This is expected: E-cores are slower than P-cores, so they pull down the average performance per thread.
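The drop in per-thread contribution can be read straight off the tables by dividing the speedup gained between two rows by the number of threads added. The helper below is a hypothetical illustration, and it assumes nothing about which cores the scheduler actually placed the extra threads on.

```cpp
#include <cassert>

// Average speedup gained per extra thread between two table rows.
// s_from / s_to are the Speedup column values at n_from / n_to threads.
double marginal_speedup(double s_from, int n_from, double s_to, int n_to) {
    assert(n_to > n_from);
    return (s_to - s_from) / (n_to - n_from);
}
```

With the C++ rows, marginal_speedup(5.37, 6, 7.51, 14) comes to about 0.27x per added thread, against roughly 0.9x per thread for the first six (5.37 / 6). The Pascal rows give a similar figure of about 0.28x.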
Hyper-Threading & Over-Saturation (16–24 Threads)
Peak Performance: C++ peaks at 18 threads (1086ms), while Pascal peaks at 24 threads (1125ms).
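C++ Performance Drop: Look at the C++ results after 18 threads: the time climbs steadily (1086 -> 1174 -> 1256 -> 1292 ms). A plausible cause is execution-unit contention: the two Hyper-Threading siblings on each P-core share one divider, so extra threads compete for DIVSD (division) throughput that is likely already saturated. Beyond 20 threads the CPU has no hardware threads left, so the 22- and 24-thread runs also pay for oversubscription (context switches).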
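The divider-contention explanation is a hypothesis, and one way to probe it would be to time a deliberately division-heavy loop at different thread counts. The sketch below is not the benchmark used for the tables: the kernel, the function names, and the even work split are all assumptions made for illustration.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical kernel: four independent divisions per iteration, intended
// to keep the divide unit busy instead of waiting on memory.
double divide_kernel(std::size_t iters) {
    double a = 1.0, b = 2.0, c = 3.0, d = 4.0;
    for (std::size_t i = 1; i <= iters; ++i) {
        const double k = static_cast<double>(i);
        a = 1.0 + 1.0 / (a + k);
        b = 1.0 + 1.0 / (b + k);
        c = 1.0 + 1.0 / (c + k);
        d = 1.0 + 1.0 / (d + k);
    }
    return a + b + c + d;
}

// Splits total_iters evenly across `threads` std::threads (fixed total work)
// and returns the wall time in milliseconds. Each thread stores its result
// in `out` so the compiler cannot discard the computation.
double time_run_ms(int threads, std::size_t total_iters, std::vector<double>& out) {
    assert(threads > 0);
    const std::size_t n = static_cast<std::size_t>(threads);
    const std::size_t per_thread = total_iters / n;
    out.assign(n, 0.0);
    std::vector<std::thread> pool;
    pool.reserve(n);
    const auto start = std::chrono::steady_clock::now();
    for (std::size_t t = 0; t < n; ++t) {
        pool.emplace_back([&out, t, per_thread] { out[t] = divide_kernel(per_thread); });
    }
    for (auto& th : pool) {
        th.join();
    }
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Calling time_run_ms for 18, 20 and 24 threads with the same total_iters, and comparing against std::thread::hardware_concurrency(), would show whether the slowdown starts once SMT siblings are in use or only once the software thread count exceeds the hardware thread count. Several repetitions per thread count would be needed before drawing conclusions.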
Pascal's Stability: Pascal holds up better at very high thread counts. Apart from a small bump at 20 threads (1187 ms), it is still improving at 24 threads (1125 ms). One possibility is that the code generated by FPC puts less pressure on the instruction cache or interacts differently with the Windows thread scheduler, but the data here cannot distinguish between these explanations.
🚴