Topic: Benchmark: converting array of structures to structure of arrays

LemonParty

After this topic https://forum.lazarus.freepascal.org/index.php/topic,74158.0.html I became curious how fast the conversion discussed there can be done.

Here is a simple benchmark that works on both 64-bit x86 and AArch64. It should work on Linux too. Older compiler versions may have problems compiling it for AArch64.
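
For anyone who has not read the linked topic, the conversion being timed looks roughly like the sketch below. This is not the benchmark code itself, only a minimal illustration. The record layout (TVec3 with three Single fields), the TVec3SoA type, the tiny element count and all routine names are made up for the example, and the "Unrolled" routine simply assumes a 4x manual unroll.

```pascal
program AosToSoaSketch;
{$mode objfpc}

const
  Count = 10; // hypothetical size; not a multiple of 4, so the tail loop runs

type
  // Hypothetical array-of-structures element.
  TVec3 = record
    X, Y, Z: Single;
  end;

  // Hypothetical structure-of-arrays layout: one array per field.
  TVec3SoA = record
    X, Y, Z: array[0..Count - 1] of Single;
  end;

var
  Src: array[0..Count - 1] of TVec3;
  Dst: TVec3SoA;
  K: Integer;

// Naive: copy one element per loop iteration.
procedure ConvertNaive;
var
  I: Integer;
begin
  for I := 0 to Count - 1 do
  begin
    Dst.X[I] := Src[I].X;
    Dst.Y[I] := Src[I].Y;
    Dst.Z[I] := Src[I].Z;
  end;
end;

// Unrolled: copy four elements per iteration, then finish the remainder.
procedure ConvertUnrolled;
var
  I: Integer;
begin
  I := 0;
  while I + 4 <= Count do
  begin
    Dst.X[I]     := Src[I].X;     Dst.Y[I]     := Src[I].Y;     Dst.Z[I]     := Src[I].Z;
    Dst.X[I + 1] := Src[I + 1].X; Dst.Y[I + 1] := Src[I + 1].Y; Dst.Z[I + 1] := Src[I + 1].Z;
    Dst.X[I + 2] := Src[I + 2].X; Dst.Y[I + 2] := Src[I + 2].Y; Dst.Z[I + 2] := Src[I + 2].Z;
    Dst.X[I + 3] := Src[I + 3].X; Dst.Y[I + 3] := Src[I + 3].Y; Dst.Z[I + 3] := Src[I + 3].Z;
    Inc(I, 4);
  end;
  // Scalar tail for the elements left over after the 4x blocks.
  while I < Count do
  begin
    Dst.X[I] := Src[I].X;
    Dst.Y[I] := Src[I].Y;
    Dst.Z[I] := Src[I].Z;
    Inc(I);
  end;
end;

// True when every field of Dst equals the corresponding field of Src.
function Matches: Boolean;
var
  I: Integer;
begin
  Result := True;
  for I := 0 to Count - 1 do
    if (Dst.X[I] <> Src[I].X) or (Dst.Y[I] <> Src[I].Y) or (Dst.Z[I] <> Src[I].Z) then
      Result := False;
end;

begin
  // Fill the source with small integers, which Single represents exactly.
  for K := 0 to Count - 1 do
  begin
    Src[K].X := K;
    Src[K].Y := K + 100;
    Src[K].Z := K + 200;
  end;

  ConvertNaive;
  WriteLn('Naive matches source:    ', Matches);

  FillChar(Dst, SizeOf(Dst), 0); // clear so the second check is meaningful
  ConvertUnrolled;
  WriteLn('Unrolled matches source: ', Matches);
end.
```

Both routines produce the same output; they differ only in how much work each loop iteration does, which is the kind of difference the Naive and Unrolled timings compare.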

Results on Intel Core Ultra 7 258V:

4096 ELEMENTS BY 100 RESULTS
Naive    : 4792
Unrolled : 3412
SIMD     : 2810
1048576 ELEMENTS BY 4 RESULTS
Naive    : 1477308
Unrolled : 1093105
SIMD     : 760640

4096 ELEMENTS BY 100 RESULTS
Naive    : 5152
Unrolled : 3441
SIMD     : 2701
1048576 ELEMENTS BY 4 RESULTS
Naive    : 1460518
Unrolled : 1080429
SIMD     : 766542

4096 ELEMENTS BY 100 RESULTS
Naive    : 5429
Unrolled : 5035
SIMD     : 2695
1048576 ELEMENTS BY 4 RESULTS
Naive    : 1467502
Unrolled : 1067923
SIMD     : 791595

4096 ELEMENTS BY 100 RESULTS
Naive    : 5038
Unrolled : 3374
SIMD     : 3051
1048576 ELEMENTS BY 4 RESULTS
Naive    : 1464983
Unrolled : 1054564
SIMD     : 775087

4096 ELEMENTS BY 100 RESULTS
Naive    : 4840
Unrolled : 3608
SIMD     : 2955
1048576 ELEMENTS BY 4 RESULTS
Naive    : 1483384
Unrolled : 1108301
SIMD     : 806910

Results on Raspberry Pi 5:

4096 ELEMENTS BY 100 RESULTS
Naive    : 47603
Unrolled : 46531
SIMD     : 5484
1048576 ELEMENTS BY 4 RESULTS
Naive    : 8293058
Unrolled : 7676892
SIMD     : 1382494

Naive    : 47603
Unrolled : 46531
SIMD     : 5484
1048576 ELEMENTS BY 4 RESULTS
Naive    : 8293058
Unrolled : 7676892
SIMD     : 1382494

4096 ELEMENTS BY 100 RESULTS
Naive    : 29455
Unrolled : 29864
SIMD     : 3426
1048576 ELEMENTS BY 4 RESULTS
Naive    : 8010053
Unrolled : 7741117
SIMD     : 1422446

4096 ELEMENTS BY 100 RESULTS
Naive    : 29617
Unrolled : 29096
SIMD     : 3426
1048576 ELEMENTS BY 4 RESULTS
Naive    : 8031165
Unrolled : 7690447
SIMD     : 1343017

4096 ELEMENTS BY 100 RESULTS
Naive    : 43800
Unrolled : 43603
SIMD     : 5153
1048576 ELEMENTS BY 4 RESULTS
Naive    : 8238321
Unrolled : 7689008
SIMD     : 1389328

Notes:
1. Unrolling works really well on x86 (roughly 35 to 50% faster than Naive in all but one of the runs above), while on ARM it gives only about 4 to 8% on the large case and almost nothing on the small one;
2. The "SIMD" version on ARM does not actually use SIMD instructions; it just makes use of a lot of registers. Even so, it is almost 9x faster than Naive on small chunks and about 6x faster on big chunks. That difference is impressive (I hope I didn't make a mistake in the "SIMD" code);
3. The SIMD solution on x86 is about twice as fast as the Naive approach (roughly 1.6x to 2.0x in the runs above).
Lazarus v. 4.99. FPC v. 3.3.1. Windows 11
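
The ratios in the notes can be rechecked directly from the listings. The sketch below does this for the first run on each machine only; Speedup is a hypothetical helper written for this example, and it assumes the printed numbers are elapsed times in the benchmark's own units, so lower is faster.

```pascal
program SpeedupSketch;
{$mode objfpc}

// How many times faster the variant is than the baseline
// (both arguments are elapsed times, lower is faster).
function Speedup(Baseline, Variant: Double): Double;
begin
  Result := Baseline / Variant;
end;

begin
  // First Intel Core Ultra 7 258V run, 4096 elements.
  WriteLn('x86 Unrolled vs Naive: ', Speedup(4792, 3412):0:2, 'x');
  WriteLn('x86 SIMD     vs Naive: ', Speedup(4792, 2810):0:2, 'x');

  // First Raspberry Pi 5 run, 4096 elements.
  WriteLn('ARM Unrolled vs Naive: ', Speedup(47603, 46531):0:2, 'x');
  WriteLn('ARM SIMD     vs Naive: ', Speedup(47603, 5484):0:2, 'x');

  // First Raspberry Pi 5 run, 1048576 elements.
  WriteLn('ARM Unrolled vs Naive (large): ', Speedup(8293058, 7676892):0:2, 'x');
  WriteLn('ARM SIMD     vs Naive (large): ', Speedup(8293058, 1382494):0:2, 'x');
end.
```

For these first runs that works out to about 1.40x and 1.71x on the Intel machine, and about 1.02x and 8.68x (small case) and 1.08x and 6.00x (large case) on the Pi 5. Other runs vary, so treat these as single data points rather than averages.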

 
