Why?
There is an explanation in this answer.
In short, the processor can only do calculations as fast as the cache can feed it.
For example, you get 64 bytes from the cache at a time, which is 16 dwords. That means you can process at most 16 dword-sized or 64 byte-sized integers per unit of time.
This doesn't depend on which instructions you use, because the speed of RAM is constant.
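To make that arithmetic concrete, here is a small C sketch. It assumes a 64-byte cache line and compares 16-byte (SSE2) against 32-byte (AVX2) vectors. The function names are made up for illustration and the model ignores prefetching and everything else a real CPU does.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: 64-byte cache lines. */
enum { CACHE_LINE_BYTES = 64 };

/* How many elements of elem_size bytes arrive with one cache line. */
size_t elems_per_line(size_t elem_size)
{
    assert(elem_size > 0 && elem_size <= CACHE_LINE_BYTES);
    return CACHE_LINE_BYTES / elem_size;
}

/* Cache lines that must be fetched to stream through n_bytes of cold data.
   Note that the vector width does not appear here at all. */
size_t lines_touched(size_t n_bytes)
{
    return (n_bytes + CACHE_LINE_BYTES - 1) / CACHE_LINE_BYTES;
}

/* Vector loads needed to consume n_bytes with vectors of vector_bytes. */
size_t vector_loads(size_t n_bytes, size_t vector_bytes)
{
    assert(vector_bytes > 0);
    return (n_bytes + vector_bytes - 1) / vector_bytes;
}
```

For a 4096-byte buffer, `lines_touched` gives 64 either way, while `vector_loads` gives 256 for 16-byte vectors and 128 for 32-byte vectors. Wider vectors halve the instruction count, not the memory traffic.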
Yes. That is possible if your dataset is much larger than your cache, so that it can be assumed cold, and the instructions are relatively simple (an unpack - add - pack cycle unrolled a few times, or not even that, I assume). Memory sizes, cache sizes etc. ARE increasing considerably, though. Today's cold load might still be in cache tomorrow; look at the sizes on these puppies.
I have some SIMD code for work, mostly dealing with image format transformation and kernel operations. It is simplified (64-bit only, aligned only, only widths that are multiples of 32px etc.).
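For readers who haven't seen one, an unpack - add - pack cycle like the one mentioned above could look roughly like this in C with SSE2 intrinsics. This is only a sketch, not the code from work: `add_u8_saturate` is a hypothetical name, the length is assumed to be a multiple of 16, and there is a scalar fallback for targets without SSE2. A plain saturating byte add could be done with `_mm_adds_epu8` alone; the widening is kept here only to show the pattern.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
#define HAVE_SSE2 1
#endif

/* Hypothetical helper: dst[i] = min(255, a[i] + b[i]).
   Assumption: n is a multiple of 16 (one SSE2 vector of bytes). */
void add_u8_saturate(const uint8_t *a, const uint8_t *b, uint8_t *dst, size_t n)
{
    assert(n % 16 == 0);
#ifdef HAVE_SSE2
    const __m128i zero = _mm_setzero_si128();
    for (size_t i = 0; i < n; i += 16) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        /* unpack: widen 16 bytes into two vectors of eight 16-bit values */
        __m128i a_lo = _mm_unpacklo_epi8(va, zero);
        __m128i a_hi = _mm_unpackhi_epi8(va, zero);
        __m128i b_lo = _mm_unpacklo_epi8(vb, zero);
        __m128i b_hi = _mm_unpackhi_epi8(vb, zero);
        /* add: 16-bit sums cannot overflow (at most 510) */
        __m128i s_lo = _mm_add_epi16(a_lo, b_lo);
        __m128i s_hi = _mm_add_epi16(a_hi, b_hi);
        /* pack: narrow back to bytes with unsigned saturation */
        _mm_storeu_si128((__m128i *)(dst + i), _mm_packus_epi16(s_lo, s_hi));
    }
#else
    /* Scalar fallback with the same result. */
    for (size_t i = 0; i < n; i++) {
        unsigned s = (unsigned)a[i] + (unsigned)b[i];
        dst[i] = (uint8_t)(s > 255 ? 255 : s);
    }
#endif
}
```

The loop does two loads and one store for a handful of cheap ALU instructions, which is the kind of ratio where a cold dataset leaves the core waiting on memory.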
Most of it is still SSE2 for similar reasons. Only color distance and YUV/HSV conversions are AVX2. For three reasons:
- simple code doesn't benefit as much
- I use Delphi, which @$*@$HYQ#E@# still doesn't support AVX2. The AVX2 code lives in FPC-generated DLLs, but I only do that when it matters
- The AVX2 shuffle units still have some limitations when moving 1- and 2-byte quantities across 128-bit lanes. It is probably possible, but a whole lot more complicated
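The lane limitation in the last bullet can be illustrated with a scalar model of the AVX2 byte shuffle (`_mm256_shuffle_epi8`, i.e. VPSHUFB on 256-bit registers). This is plain C that mimics the documented per-lane behaviour, not real SIMD code and not my production code; `shuffle_bytes_in_lane` is a made-up name.

```c
#include <assert.h>
#include <stdint.h>

enum { LANE_BYTES = 16, VEC_BYTES = 32 };

/* Scalar model of a 256-bit in-lane byte shuffle:
   each output byte picks a source byte from its OWN 128-bit lane.
   Only the low 4 bits of the index are used; bit 7 set means "write zero". */
void shuffle_bytes_in_lane(const uint8_t src[VEC_BYTES],
                           const uint8_t idx[VEC_BYTES],
                           uint8_t dst[VEC_BYTES])
{
    assert(src != dst); /* the model reads src while writing dst */
    for (int i = 0; i < VEC_BYTES; i++) {
        int lane_base = (i / LANE_BYTES) * LANE_BYTES; /* 0 or 16 */
        if (idx[i] & 0x80)
            dst[i] = 0;
        else
            dst[i] = src[lane_base + (idx[i] & 0x0F)];
    }
}
```

If every index is set to 31, the low lane gets `src[15]` and the high lane gets `src[31]`: the low half simply cannot reach byte 31. As far as I know, the AVX2 permutes that do cross lanes work on 32-bit or larger elements, so moving single bytes or words across lanes takes a combination of instructions.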