After this topic
https://forum.lazarus.freepascal.org/index.php/topic,74158.0.html I became curious how fast the conversion discussed there can be done.
Here is a simple benchmark that works on both x86 64-bit and AArch64. It should work on Linux too. Older compiler versions may have problems compiling it for AArch64.
Results on Intel Core Ultra 7 258V:
4096 ELEMENTS BY 100 RESULTS
Naive : 4792
Unrolled : 3412
SIMD : 2810
1048576 ELEMENTS BY 4 RESULTS
Naive : 1477308
Unrolled : 1093105
SIMD : 760640
4096 ELEMENTS BY 100 RESULTS
Naive : 5152
Unrolled : 3441
SIMD : 2701
1048576 ELEMENTS BY 4 RESULTS
Naive : 1460518
Unrolled : 1080429
SIMD : 766542
4096 ELEMENTS BY 100 RESULTS
Naive : 5429
Unrolled : 5035
SIMD : 2695
1048576 ELEMENTS BY 4 RESULTS
Naive : 1467502
Unrolled : 1067923
SIMD : 791595
4096 ELEMENTS BY 100 RESULTS
Naive : 5038
Unrolled : 3374
SIMD : 3051
1048576 ELEMENTS BY 4 RESULTS
Naive : 1464983
Unrolled : 1054564
SIMD : 775087
4096 ELEMENTS BY 100 RESULTS
Naive : 4840
Unrolled : 3608
SIMD : 2955
1048576 ELEMENTS BY 4 RESULTS
Naive : 1483384
Unrolled : 1108301
SIMD : 806910
Results on Raspberry Pi 5:
4096 ELEMENTS BY 100 RESULTS
Naive : 47603
Unrolled : 46531
SIMD : 5484
1048576 ELEMENTS BY 4 RESULTS
Naive : 8293058
Unrolled : 7676892
SIMD : 1382494
4096 ELEMENTS BY 100 RESULTS
Naive : 29455
Unrolled : 29864
SIMD : 3426
1048576 ELEMENTS BY 4 RESULTS
Naive : 8010053
Unrolled : 7741117
SIMD : 1422446
4096 ELEMENTS BY 100 RESULTS
Naive : 29617
Unrolled : 29096
SIMD : 3426
1048576 ELEMENTS BY 4 RESULTS
Naive : 8031165
Unrolled : 7690447
SIMD : 1343017
4096 ELEMENTS BY 100 RESULTS
Naive : 43800
Unrolled : 43603
SIMD : 5153
1048576 ELEMENTS BY 4 RESULTS
Naive : 8238321
Unrolled : 7689008
SIMD : 1389328
Notes:
1. Unrolling works really well on x86 (up to almost +50% speed), whereas on ARM it gives only about +7%;
2. The "SIMD" version on ARM does not actually use SIMD; it just uses a lot of registers. Even so, it is almost 9.0x faster than Naive on small chunks and 6.0x faster on big chunks. That difference is impressive (I hope I didn't make a mistake in the "SIMD" code);
3. The SIMD solution on x86 is about twice as fast as the Naive approach.
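
To make the ratios in the notes easier to check, here is a small Python sketch that computes the speedup of each variant relative to Naive. It is not part of the benchmark itself. The numbers are copied from the first run on each machine above; the dictionary layout and the `speedups` helper are just my own illustration. The benchmark output does not state a unit, so only ratios are computed.

```python
# Timings copied from the first run on each machine in the results above.
# The unit is not stated in the output, so only ratios are meaningful here.
first_run = {
    "Intel Core Ultra 7 258V": {
        "4096 x 100": {"Naive": 4792, "Unrolled": 3412, "SIMD": 2810},
        "1048576 x 4": {"Naive": 1477308, "Unrolled": 1093105, "SIMD": 760640},
    },
    "Raspberry Pi 5": {
        "4096 x 100": {"Naive": 47603, "Unrolled": 46531, "SIMD": 5484},
        "1048576 x 4": {"Naive": 8293058, "Unrolled": 7676892, "SIMD": 1382494},
    },
}


def speedups(timings):
    """Return how many times faster each variant is than Naive (hypothetical helper)."""
    base = timings["Naive"]
    return {name: base / value for name, value in timings.items()}


if __name__ == "__main__":
    for machine, sizes in first_run.items():
        print(machine)
        for size, timings in sizes.items():
            ratios = speedups(timings)
            line = ", ".join(f"{name} {ratio:.2f}x" for name, ratio in ratios.items())
            print(f"  {size}: {line}")
```

The other runs can be added to the dictionary in the same way. The ratios vary noticeably between runs, especially on the small 4096-element test, so a single run should not be over-interpreted.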