Topic: Benchmark aligned vs unaligned memory access (Read 829 times)

LemonParty

  • Hero Member
  • *****
  • Posts: 537
Re: Benchmark aligned vs unaligned memory access
« Reply #15 on: June 04, 2026, 01:40:07 pm »
Seenkao, is MoveData3 the most optimized version? As far as I can see, you use sequential memory access. Is this the optimization you talked about in the video?
Lazarus v. 4.99. FPC v. 3.3.1. Windows 11

Seenkao

  • Hero Member
  • *****
  • Posts: 761
    • New ZenGL.
Re: Benchmark aligned vs unaligned memory access
« Reply #16 on: June 04, 2026, 02:30:22 pm »
Originally, memory access was not sequential (MoveData1, MoveData2): data could be written to different regions of memory, jumping from one cache line to another. MoveData3 was optimized so that memory is filled as sequentially as possible, one cell after another. Where MoveData1 and MoveData2 had to switch cache lines, MoveData3 either does not switch at all or switches only once.
I strive to create applications that are minimal and reasonably fast.
Working on ZenGL

LemonParty

  • Hero Member
  • *****
  • Posts: 537
Re: Benchmark aligned vs unaligned memory access
« Reply #17 on: June 04, 2026, 04:44:49 pm »
Yes, sequential memory access is a classic optimization technique. But this benchmark touches on a slightly different topic. Computer science literature often says that data requires proper alignment, so I created this benchmark to see what the real penalty for unaligned memory access is.

It would be interesting to see how RISC-V does with this benchmark. At the moment I don't have a RISC-V board.

440bx

  • Hero Member
  • *****
  • Posts: 6542
Re: Benchmark aligned vs unaligned memory access
« Reply #18 on: June 04, 2026, 05:11:51 pm »
I've only quickly read the posts in this thread... I always thought alignment would result in faster access, i.e., less time spent getting the information, but it seems that is not always the case.

In the thread:
https://forum.lazarus.freepascal.org/index.php/topic,70108.msg546071.html#msg546071
some timings are shown for packed vs. unpacked records, which translates to unaligned vs. aligned access. The results seem to favor unaligned access when dealing with blocks of data: more packed records than unpacked ones fit in the CPU cache, so the timings favor packed/unaligned access.
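
The cache argument can be made concrete with a small sketch. The two records below are hypothetical (they are not taken from the linked thread), and 32 KB is only an assumed L1 data cache size; the point is simply that the packed layout is smaller, so more elements fit in the same amount of cache.

```pascal
program PackedVsPadded;
{$mode objfpc}

type
  // Hypothetical records for illustration only.
  TPackedRec = packed record
    B: Byte;
    Q: QWord;   // starts at offset 1, so it is not naturally aligned
  end;

  TPaddedRec = record
    B: Byte;
    Q: QWord;   // the compiler may insert padding before this field
  end;

// Number of whole records of RecSize bytes that fit in Bytes bytes.
function RecordsThatFit(Bytes, RecSize: SizeInt): SizeInt;
begin
  Result := Bytes div RecSize;
end;

begin
  WriteLn('packed size: ', SizeOf(TPackedRec));
  WriteLn('padded size: ', SizeOf(TPaddedRec));
  // 32 KB is an assumed cache size, not a measured one.
  WriteLn('packed records per 32 KB: ',
    RecordsThatFit(32 * 1024, SizeOf(TPackedRec)));
  WriteLn('padded records per 32 KB: ',
    RecordsThatFit(32 * 1024, SizeOf(TPaddedRec)));
end.
```

The packed record is 9 bytes. The size of the padded one depends on the target and on the record alignment settings; on 64-bit targets with default settings it is typically 16 bytes, so noticeably fewer of them fit in the same cache.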

I found that surprising and, as a result, I now believe that the best way is to measure the performance for the specific case at hand because the intuitive conclusion may not reflect reality when maximum speed/performance is the goal.

HTH.
FPC v3.2.2 and Lazarus v4.0rc3 on Windows 7 SP1 64bit.

Seenkao

  • Hero Member
  • *****
  • Posts: 761
    • New ZenGL.
Re: Benchmark aligned vs unaligned memory access
« Reply #19 on: June 04, 2026, 07:49:43 pm »
Quote
Yes, sequential memory access is a classic optimization technique. But this benchmark touches on a slightly different topic. Computer science literature often says that data requires proper alignment, so I created this benchmark to see what the real penalty for unaligned memory access is.
That is why I wrote that your test is specific and in many situations will not affect performance at all.
Accesses need to land in different cache lines; more precisely, the data must not be strictly sequential. If you look more closely, my test is actually more indicative here. Even when the data is laid out more sequentially and accessed more sequentially, there are still jumps to other regions of memory, so the cache still has to handle different lines and possibly load data into them. In your test this hardly ever happens, and when it does, it may not affect performance at all. It follows that it does not matter whether the data is aligned: working with aligned or unaligned data may make almost no difference. Because access is always sequential, the processor simply loads the data and processes it, replacing line after line as needed.

LemonParty

  • Hero Member
  • *****
  • Posts: 537
Re: Benchmark aligned vs unaligned memory access
« Reply #20 on: June 04, 2026, 09:13:30 pm »
Quote
Data access needs to occur in different cache lines.
Do you understand that this is a separate benchmark?

The current benchmark tests how well the architecture handles unaligned memory access. If we switch to random access, the results will also depend on the efficiency and capacity of the cache, and that is a whole new topic.

Look at the numbers from the Raspberry (unaligned read). There you can see exactly what was studied.

Seenkao

  • Hero Member
  • *****
  • Posts: 761
    • New ZenGL.
Re: Benchmark aligned vs unaligned memory access
« Reply #21 on: June 04, 2026, 09:28:38 pm »
Quote
Do you understand that this is a separate benchmark?
You are wrong. This is not a separate measurement.
You are forgetting a simple thing: data is aligned so that, at any moment and for any access, it falls within a cache line. In your variant, where all the data is accessed sequentially, this is hardly critical. Even if the data was not in the cache the first time, subsequent accesses will always (I stress, ALWAYS!!!) hit a cache line. When the line ends, the next cache line is processed, and it will already be ready for use.
But if we access the data randomly rather than sequentially, alignment can play a role, and a substantial one. Aligned data does not need to cross from one line to another, so the needed line is loaded right away. Unaligned data, however, can end up spanning different cache lines, which forces another line to be loaded.

Honestly, I do not know how else to explain it to you. I am not the only one who has written to you about this, but you do not want to understand it.

LemonParty

  • Hero Member
  • *****
  • Posts: 537
Re: Benchmark aligned vs unaligned memory access
« Reply #22 on: June 04, 2026, 10:24:50 pm »
Quote
Because even if the data doesn't hit the cache the first time, subsequent accesses will always (and I mean ALWAYS!!!) hit a cache line.
That is true. But I am measuring something entirely different. When you do random access over large chunks of memory, you are testing cache misses, and that is what I do not test in this benchmark (this is exactly why there is a subcategory for 32 KB of data). Again, I do not test cache misses. The subject of study is the handling of misaligned reads and writes. I don't want cache misses to interfere with the final results, so I picked sequential access. There is one more reason I picked sequential access: it is close to real-world cases.

Testing the situation where a record straddles two cache lines is interesting (it is going to impact performance), but that is a different benchmark. Maybe I will extend this benchmark to handle such cases.
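
Whether a given access falls across two cache lines can be checked with simple arithmetic. The sketch below is only an illustration, not part of the benchmark: `CrossesCacheLine` is a hypothetical helper, and the 64-byte line size is an assumption that should be checked for the actual CPU.

```pascal
program CacheLineStraddle;
{$mode objfpc}

const
  CacheLineSize = 64; // assumption: common on x86-64 and many ARM cores

// True if an access of Size bytes starting at Addr touches two cache lines.
// Assumes 1 <= Size <= CacheLineSize.
function CrossesCacheLine(Addr, Size: PtrUInt): Boolean;
begin
  Result := (Addr mod CacheLineSize) + Size > CacheLineSize;
end;

begin
  WriteLn(CrossesCacheLine(56, 8)); // bytes 56..63 stay inside one line
  WriteLn(CrossesCacheLine(57, 8)); // bytes 57..64 spill into the next line
end.
```

An 8-byte value that is naturally aligned can never cross a 64-byte line, because its start offset within the line is a multiple of 8. An unaligned one crosses only when it starts in the last 7 bytes of a line.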

Seenkao

  • Hero Member
  • *****
  • Posts: 761
    • New ZenGL.
Re: Benchmark aligned vs unaligned memory access
« Reply #23 on: June 04, 2026, 10:45:14 pm »
Quote
Again I do not test cache misses.
But that is precisely what alignment is for. :) To get rid of unnecessary cache misses.

LemonParty

  • Hero Member
  • *****
  • Posts: 537
Re: Benchmark aligned vs unaligned memory access
« Reply #24 on: June 05, 2026, 04:01:02 pm »
Updated benchmark.

It now tests random access over various ranges of elements, up to 1024 * 1024 elements.

Conclusions:
1. The assumption that alignment affects performance was right. This is especially visible on big arrays in the Intel results;
2. The Raspberry handles writes much better on large chunks;
3. With reads, there seems to be no big difference between aligned and unaligned on either architecture.

So the overall conclusion is: with sequential access, unaligned data is more or less OK; when random access to the data is expected, you should consider aligned structures.
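
For readers who want to try a random-access pass themselves, here is a minimal sketch of one way to produce a random visiting order. It is an assumption about how such a test could be built, not the code of the updated benchmark: `ShuffledIndices` is a hypothetical helper that returns every index exactly once, shuffled with Fisher-Yates.

```pascal
program RandomOrder;
{$mode objfpc}

type
  TIndexArray = array of SizeInt;

// Returns the indices 0..Count-1 in random order (Fisher-Yates shuffle).
function ShuffledIndices(Count: SizeInt): TIndexArray;
var
  I, J, Tmp: SizeInt;
begin
  Result := nil;
  SetLength(Result, Count);
  for I := 0 to Count - 1 do
    Result[I] := I;
  for I := Count - 1 downto 1 do
  begin
    J := Random(I + 1); // 0 <= J <= I
    Tmp := Result[I];
    Result[I] := Result[J];
    Result[J] := Tmp;
  end;
end;

var
  Order: TIndexArray;
  K: SizeInt;
begin
  Randomize;
  Order := ShuffledIndices(8);
  // Visit elements in Order[K] order instead of 0, 1, 2, ...
  for K := 0 to High(Order) do
    Write(Order[K], ' ');
  WriteLn;
end.
```

Building the order before the timed loop keeps the cost of the random number generator out of the measurement, although reading the index array adds memory traffic of its own.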

LeP

  • Sr. Member
  • ****
  • Posts: 347
Re: Benchmark aligned vs unaligned memory access
« Reply #25 on: June 05, 2026, 06:33:08 pm »
@LemonParty, I tried your code with various $align and $codealign values, but nothing changed. This is as I expected.

Also try these records, one naturally aligned and one unaligned (but only by a few bytes):

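Code: Pascal
type
  // 16 bytes; every field sits on its natural boundary.
  TAlignedRec = packed record
    Q: QWord;         // offset 0
    D: DWord;         // offset 8
    W: Word;          // offset 12
    _padding: Word;   // offset 14
  end;

  // 16 bytes; the leading byte shifts every field off its natural boundary.
  TUnAlignedRec = packed record
    _padding: byte;   // offset 0
    Q: QWord;         // offset 1
    D: DWord;         // offset 9
    W: Word;          // offset 13
    __padding: byte;  // offset 15
  end;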

As you can see, the results are random. I think that's right.
Why is this right?
Because modern Intel processors, every one since "Sandy Bridge", handle unaligned reads and writes internally in hardware (ref. Intel® 64 and IA-32 Architectures Software Developer’s Manual). Note that hardware alignment (called "natural" in the Intel docs) is 4 or 8 bytes. For some SIMD instructions it is 16 bytes.
Of course, this can be influenced by many factors, such as cache-line hits at all cache levels.

I believe the difference you noted in your tests was primarily related to the different sizes of the two records. Furthermore, we must not forget all the hardware features aimed at improving performance (out-of-order execution, pipelines, etc.).
It should be noted that some instructions can be executed in parallel in hardware, some up to 7 at once.
Furthermore, on a system like Windows, benchmarks conducted this way only make sense if the difference is significant (more than double, and even then appropriate distinctions must be made).

Note: when you construct an array of records with an unaligned layout, some elements become naturally aligned (typically every fourth element in the worst case).
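
The note above can be illustrated with a hypothetical 14-byte packed record (not one of the two records posted earlier). Assuming the array itself starts on an 8-byte boundary, the `Q` field of element `Index` starts at byte `Index * 14`, which is a multiple of 8 only for every fourth element.

```pascal
program ArrayAlignment;
{$mode objfpc}

type
  // Hypothetical 14-byte packed record, used only for this illustration.
  TPacked14 = packed record
    Q: QWord;
    D: DWord;
    W: Word;
  end;

// True if field Q of element Index is 8-byte aligned, assuming the
// array itself starts at an 8-byte aligned address.
function QIsAligned(Index: SizeInt): Boolean;
begin
  Result := (Index * SizeOf(TPacked14)) mod SizeOf(QWord) = 0;
end;

var
  I: SizeInt;
begin
  for I := 0 to 7 do
    WriteLn('element ', I, ': Q aligned = ', QIsAligned(I));
end.
```

Under the same assumption, elements 0 and 4 are aligned and the rest are not. For the 16-byte TUnAlignedRec shown earlier the picture is different: `Q` sits at offset 1 in every element, so it never lands on an 8-byte boundary.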

Edit: added a TXT file with the results.
« Last Edit: June 05, 2026, 06:37:20 pm by LeP »
Un Sistema per domarli, un IDE per trovarli, un codice per ghermirli e nel framework incatenarli.
An operating system to tame them, an IDE to find them, a code to catch them and in the framework chain them.

 
