Author Topic: The fastest integer type?  (Read 839 times)

marcov

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 7350
Re: The fastest integer type?
« Reply #15 on: August 10, 2019, 01:36:45 pm »
Is the code already AVX2?
SSE2 is enough to beat AVX2 that is working with Int32.

Why? Show and explain, so we can learn. Are you missing certain instructions in AVX2? Do you have a processor (like Ryzen before the 3000 series) that implements AVX2 with two pipes? Is the 128-bit lane shuffle limit somehow a problem?

I mostly do bytewise SSE2 (and in rare cases AVX2), but intermediate results are often 16-bit.

Thaddy

  • Hero Member
  • *****
  • Posts: 8664
Re: The fastest integer type?
« Reply #16 on: August 10, 2019, 01:52:07 pm »
Is the code already AVX2?
SSE2 is enough to beat AVX2 that is working with Int32.

Please answer my question! You are completely incomprehensible. That is usually caused by language problems (fair) or having no clue at all (worrying). For now I assume the latter.
« Last Edit: August 10, 2019, 01:54:38 pm by Thaddy »
Most people that want to use threading should learn to patch their jeans first: use a needle.

LemonParty

  • New Member
  • *
  • Posts: 28
Re: The fastest integer type?
« Reply #17 on: August 10, 2019, 02:07:23 pm »
Why?
There is an explanation in this answer.
In short, the processor can calculate quickly until the cache becomes the limit.
For example, one fetch from cache gives you 64 bytes, which is either 64 byte-sized values or 16 dwords, so you can process 64 or 16 integers per unit of time.
This doesn't depend on the instructions, because the speed of RAM is constant.
« Last Edit: August 10, 2019, 02:10:39 pm by LemonParty »
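
The arithmetic in the post above can be sketched in C. This is only an illustration: the 64-byte line size is an assumption (it holds for most current x86 CPUs), and `elems_per_line` is a hypothetical helper name that does not come from the thread.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: 64-byte cache lines, as on most current x86 CPUs. */
#define CACHE_LINE_BYTES 64u

/* Hypothetical helper: how many elements of elem_size bytes fit in one cache line. */
static size_t elems_per_line(size_t elem_size)
{
    assert(elem_size > 0 && elem_size <= CACHE_LINE_BYTES);
    return CACHE_LINE_BYTES / elem_size;
}
```

`elems_per_line(1)` gives 64 and `elems_per_line(4)` gives 16, so a loop that is limited by memory traffic moves four times as many byte-sized values per line fetched, whatever instruction set does the arithmetic.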

marcov

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 7350
Re: The fastest integer type?
« Reply #18 on: August 10, 2019, 02:33:07 pm »
Why?
There is an explanation in this answer.
In short, the processor can calculate quickly until the cache becomes the limit.
For example, one fetch from cache gives you 64 bytes, which is either 64 byte-sized values or 16 dwords, so you can process 64 or 16 integers per unit of time.
This doesn't depend on the instructions, because the speed of RAM is constant.

Yes. That is possible if your dataset is much larger than your cache, so that it can be assumed cold, and the instructions are relatively simple (an unpack - add - pack cycle unrolled a few times, or not even that, I assume). Memory and cache sizes ARE increasing considerably, though. Today's cold load might still be in cache tomorrow; look at the sizes on these puppies.
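
For readers who have not seen an unpack - add - pack cycle, here is a minimal sketch in C. It is not code from the thread: `add_sat_u8` is a hypothetical name, the operation (a clamped byte sum) is chosen only to show the pattern, and the SSE2 path is guarded so that the scalar loop is used where SSE2 is not available.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical helper: dst[i] = min(255, a[i] + b[i]).
   Assumption: n is a multiple of 16, so no tail loop is needed. */
static void add_sat_u8(const uint8_t *a, const uint8_t *b, uint8_t *dst, size_t n)
{
    assert(n % 16 == 0);
#if defined(__SSE2__)
    const __m128i zero = _mm_setzero_si128();
    for (size_t i = 0; i < n; i += 16) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        /* unpack: widen 16 bytes into two vectors of eight 16-bit values */
        __m128i a_lo = _mm_unpacklo_epi8(va, zero);
        __m128i a_hi = _mm_unpackhi_epi8(va, zero);
        __m128i b_lo = _mm_unpacklo_epi8(vb, zero);
        __m128i b_hi = _mm_unpackhi_epi8(vb, zero);
        /* add: the 16-bit intermediate results cannot overflow (max 510) */
        __m128i lo = _mm_add_epi16(a_lo, b_lo);
        __m128i hi = _mm_add_epi16(a_hi, b_hi);
        /* pack: narrow back to bytes with unsigned saturation */
        _mm_storeu_si128((__m128i *)(dst + i), _mm_packus_epi16(lo, hi));
    }
#else
    for (size_t i = 0; i < n; i++) {
        uint16_t sum = (uint16_t)(a[i] + b[i]);      /* widen and add */
        dst[i] = (uint8_t)(sum > 255 ? 255 : sum);   /* narrow with clamp */
    }
#endif
}
```

Each iteration loads 32 bytes and stores 16 while doing only a handful of cheap operations, which is why a loop like this tends to wait on memory once the data no longer fits in cache.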

I have some SIMD code for work, mostly dealing with image format transformation and kernel operations. It is simplified (64-bit only, aligned only, only widths that are multiples of 32px, etc.).

Most of it is still SSE2 for similar reasons. Only color distance and YUV/HSV conversions are AVX2. For three reasons:

  • simple code doesn't benefit as much
  • I use Delphi, which @$*@$HYQ#E@# still doesn't support AVX2. The AVX2 code is in FPC-generated DLLs, but I only do that when it matters
  • The shuffle units of AVX2 still have some limitations when moving 1- and 2-byte quantities across 128-bit lanes. It is probably possible, but a whole lot more complicated
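
The lane limitation in the last point can be illustrated with a scalar model in C. This is a sketch of how AVX2's 256-bit byte shuffle (`vpshufb`) selects bytes, written as plain loops so it runs anywhere; `shuffle_bytes_in_lane` is a hypothetical name and this is not the real instruction.

```c
#include <stdint.h>

/* Scalar model of a 256-bit byte shuffle that works per 128-bit lane.
   dst must not overlap src or ctrl. */
static void shuffle_bytes_in_lane(const uint8_t src[32], const uint8_t ctrl[32],
                                  uint8_t dst[32])
{
    for (int lane = 0; lane < 2; lane++) {
        const int base = lane * 16;
        for (int i = 0; i < 16; i++) {
            uint8_t c = ctrl[base + i];
            /* High bit set: the result byte is zero. Otherwise only the low
               4 bits select a source byte, so the index can never reach
               into the other 128-bit lane. */
            dst[base + i] = (c & 0x80) ? 0 : src[base + (c & 0x0F)];
        }
    }
}
```

With a control vector of all zeros, the lower half of the result is filled with source byte 0 and the upper half with source byte 16. Moving a byte from one half to the other therefore needs an extra cross-lane step, which is the added complication mentioned above.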


LemonParty

  • New Member
  • *
  • Posts: 28
Re: The fastest integer type?
« Reply #19 on: August 10, 2019, 02:47:13 pm »
CPUs with 8+ cores can unlock the potential of AVX2 instructions in multithreaded code, because the bigger the cache, the more data you can prefetch.

circular

  • Hero Member
  • *****
  • Posts: 2954
    • Personal webpage
Re: The fastest integer type?
« Reply #20 on: August 19, 2019, 02:43:57 pm »
I would agree that cache is the main limit. I have been trying to use 64-bit integers instead of 32-bit integers, but as a matter of fact it did not increase the speed significantly. Hitting the cache limit, however, makes a very big difference.
Conscience is the debugger of the mind
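
A rough way to reason about the cache limit mentioned above is to compare the working set with the cache size. The sketch below is an assumption-laden illustration: `working_set_bytes` and `fits_in_cache` are hypothetical helpers, and the cache size is whatever the caller passes in, not a measured value.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes touched when streaming over count elements of elem_size bytes each. */
static size_t working_set_bytes(size_t count, size_t elem_size)
{
    assert(elem_size > 0);
    return count * elem_size;
}

/* Hypothetical check: returns 1 if the data fits in a cache of cache_bytes. */
static int fits_in_cache(size_t count, size_t elem_size, size_t cache_bytes)
{
    return working_set_bytes(count, elem_size) <= cache_bytes;
}
```

Assuming an 8 MiB cache, 1,500,000 values stored as 32-bit integers occupy 6,000,000 bytes and fit, while the same values stored as 64-bit integers occupy 12,000,000 bytes and do not. Widening the element type doubles the memory traffic even when the arithmetic itself costs the same.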

marcov

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 7350
Re: The fastest integer type?
« Reply #21 on: August 19, 2019, 04:58:43 pm »
CPUs with 8+ cores can unlock the potential of AVX2 instructions in multithreaded code, because the bigger the cache, the more data you can prefetch.

On AMD, afaik, cores can only use the cache of their own core complex. Larger core counts are typically split over multiple core complexes.