Recent

Author Topic: InterlockedExchange v. CriticalSection  (Read 4701 times)

Red_prig

  • Full Member
  • ***
  • Posts: 153
Re: InterlockedExchange v. CriticalSection
« Reply #15 on: July 15, 2023, 09:44:53 pm »

I've never seen this: Write/ReadBarrier before. 

I just looked at the description ( https://www.freepascal.org/docs-html/rtl/system/readdependencybarrier.html ) and it's not exaclty clear what is happening.

What is the scope of it?  The one following line of code?

Really ambiguous to me.
https://en.wikipedia.org/wiki/Memory_barrier

Warfley

  • Hero Member
  • *****
  • Posts: 2067
Re: InterlockedExchange v. CriticalSection
« Reply #16 on: July 15, 2023, 09:48:10 pm »
Sleep(0) as I understand it, will yield to the next executable thread.

If I wanted a SpinLock, I would have simply done        While .... do ;

As stuff I do usually has other things to do, I usually sleep(0) to give other things the opportunity to do what they need to do.
This is a really common misconception about sleep, but that's not the case. It will yield to another thread if there is another thread to yield to. When I on my 24 core PC have a loop with sleep 0 it will still get a 100% CPU load because there is no other task requiring the same CPU, and the task will not be re-scheduled. So on a system without much load a sleep(0) will just be skipped
So a sleep(0) will be a spin lock unless there is so much load on the system that it must reschedule.

Just try that measurement again with the process pinned to a CPU and then put some load on that CPU (e.g. by having a spinning process also pinned to that CPU) and measure again.

Martin_fr

  • Administrator
  • Hero Member
  • *
  • Posts: 12430
  • Debugger - SynEdit - and more
    • wiki
Re: InterlockedExchange v. CriticalSection
« Reply #17 on: July 15, 2023, 09:58:44 pm »
I've never seen this: Write/ReadBarrier before. 

I just looked at the description ( https://www.freepascal.org/docs-html/rtl/system/readdependencybarrier.html ) and it's not exaclty clear what is happening.

What is the scope of it?  The one following line of code?

To simplify it as much as possible. Modern CPU can execute instructions out of order.

So
Code: Pascal  [Select][+][-]
  1. a:=1; b:=2;
may be executed in reverse order. They don't depend on each other.

But if you write data for another thread, and then you set the "lock-var" to 0, then you must be sure all other data has actually been written, and the CPU doesn't re-order the lock-var to be written earlier.

A write Barrier means, in the most basic terms, that all mem-writes in the code up to the barrier are executed, before any mem-write in code after the barrier.

The same for read barrier...


InterlockedExchange includes read/write barriers (afaik)

So do most other thread locking methods, like CS, SetEvent, .... (again afaik)


So if the receiving thread uses InterlockedExchange  or CS then it can read any memory afterwards, without needing to fear that the CPU may have executed those reads in advance.
But if the receiving thread doesn't use anything that use a barrier, then might need a ReadBarrier.



You never heard of it, because it is always already included, if you stick to the normal ways of synchronizing threads.

And if you don't, and forget the barrier => you may have to wait a few years before you (you personally) run into the bug. (If others report it, you will just look at a riddle)....

Threads... Just a lot of fun.
« Last Edit: July 15, 2023, 10:06:36 pm by Martin_fr »

MarkMLl

  • Hero Member
  • *****
  • Posts: 8572
Re: InterlockedExchange v. CriticalSection
« Reply #18 on: July 15, 2023, 10:48:05 pm »
To simplify it as much as possible. Modern CPU can execute instructions out of order.

So
Code: Pascal  [Select][+][-]
  1. a:=1; b:=2;
may be executed in reverse order. They don't depend on each other.

Umm. I wonder whether that is entirely accurate...

It might be safer to say that a modern CPU goes to significant trouble to execute the instructions in order with the correct dependencies, but does not guarantee that memory access operations are in the same order. As an example, if there were two 32-bit writes scheduled to consecutive addresses with a 32-bit write to the next address between them, the CPU might defer and reorder at least the first write so it can do all three as a single 64-bit operation.

MarkMLl
MT+86 & Turbo Pascal v1 on CCP/M-86, multitasking with LAN & graphics in 128Kb.
Logitech, TopSpeed & FTL Modula-2 on bare metal (Z80, '286 protected mode).
Pet hate: people who boast about the size and sophistication of their computer.
GitHub repositories: https://github.com/MarkMLl?tab=repositories

AlanTheBeast

  • Sr. Member
  • ****
  • Posts: 407
  • My software never cras....
Re: InterlockedExchange v. CriticalSection
« Reply #19 on: July 15, 2023, 11:11:48 pm »

I've never seen this: Write/ReadBarrier before. 

I just looked at the description ( https://www.freepascal.org/docs-html/rtl/system/readdependencybarrier.html ) and it's not exaclty clear what is happening.

What is the scope of it?  The one following line of code?

Really ambiguous to me.
https://en.wikipedia.org/wiki/Memory_barrier

To be clear - I've never seen (in Pascal) so it's not clear exactly what is going on.

So the correct usage would be:

Code: Pascal  [Select][+][-]
  1.       n := n + 1;
  2.       writebarrier;        // assuring the n is actually in memory.
  3.  

Further, the compiler just threw:

thr.pas(35,2) Note: Call to subroutine "procedure WriteBarrier;" marked as inline is not inlined


Sleep(0) as I understand it, will yield to the next executable thread.

If I wanted a SpinLock, I would have simply done        While .... do ;

As stuff I do usually has other things to do, I usually sleep(0) to give other things the opportunity to do what they need to do.
This is a really common misconception about sleep, but that's not the case. It will yield to another thread if there is another thread to yield to. When I on my 24 core PC have a loop with sleep 0 it will still get a 100% CPU load because there is no other task requiring the same CPU, and the task will not be re-scheduled. So on a system without much load a sleep(0) will just be skipped
So a sleep(0) will be a spin lock unless there is so much load on the system that it must reschedule.

Just try that measurement again with the process pinned to a CPU and then put some load on that CPU (e.g. by having a spinning process also pinned to that CPU) and measure again.

    do ;

Is a spinlock.

   sleep(0);

Is not.  Because - and entirely supported by what you wrote - nobody knows what is waiting in the thread queue and ready to go.
So where " do ; " will hog the CPU until pre-emption,  " do sleep(0); " will yield if there's something to yield to.

« Last Edit: July 15, 2023, 11:20:07 pm by AlanTheBeast »
Everyone talks about the weather but nobody does anything about it.
..Samuel Clemens.

Red_prig

  • Full Member
  • ***
  • Posts: 153
Re: InterlockedExchange v. CriticalSection
« Reply #20 on: July 15, 2023, 11:22:52 pm »
With sleep(0); the algorithm does not cease to be a spin lock, it can be called a spin lock with preemption, that's all, as described earlier, it can have various problems with the load on the processor.
PS: Memory Barriers are mostly needed for advanced lock-free algorithms, no wonder you haven't seen this.

Martin_fr

  • Administrator
  • Hero Member
  • *
  • Posts: 12430
  • Debugger - SynEdit - and more
    • wiki
Re: InterlockedExchange v. CriticalSection
« Reply #21 on: July 15, 2023, 11:29:05 pm »
To simplify it as much as possible. Modern CPU can execute instructions out of order.

So
Code: Pascal  [Select][+][-]
  1. a:=1; b:=2;
may be executed in reverse order. They don't depend on each other.
Umm. I wonder whether that is entirely accurate...

Well, I wasn't looking to be accurate. But to simplify.

Warfley

  • Hero Member
  • *****
  • Posts: 2067
Re: InterlockedExchange v. CriticalSection
« Reply #22 on: July 15, 2023, 11:40:12 pm »
    do ;

Is a spinlock.

   sleep(0);

Is not.  Because - and entirely supported by what you wrote - nobody knows what is waiting in the thread queue and ready to go.
So where " do ; " will hog the CPU until pre-emption,  " do sleep(0); " will yield if there's something to yield to.
Yes, I know, but this is what explains the performance differences. Sleep(0) only yields the CPU if there is something else in the queue (also it's not just on the queue, modern schedulers try to schedule with CPU affinity and other stuff in mind, so there can be something in the queue but is prioritized on another CPU and therefore not chosen to interrupt your process). Critical sections on the other hand will always yield the CPU for at least one time slice.

So while you may observe a performance benefit on a system without much load compared to a critical section, there are two caveats to this. First, in this situation you are spinning your CPU to 100%, so if you trade of reaction time for energy consumption, which depending on your environment can be an issue.
The second caveat is, that it is only faster in the case of low load. In the case of high load, when Sleep(0) will actually yield, there shouldn't be much of a performance difference. Because Sleep(0) is an unconditional yield, compared to the critical section, where the OS knows that this is waiting for the release of the CS, it could be that the OS will prioritize the process comming out of the CS over one that just called sleep, meaning that in this situation it may actually be slower.

So using a sleep(0) loop is unreliably faster when there is a low load on the system, and potentially even slower than a CS. When you try to have a faster locking mechanism than classical critical sections, this is usually solved is actually differently, for example the Windows API Critical Section has a combination of Spin Wait and Sleep Wait, where at first it tries for a pre-defined time to spin lock (e.g. for a few nanoseconds), and if the lock is not freed during the spin phase, the thread will be put to sleep, which due to it yielding to the scheduler may add an additional multi millisecond delay.

If your goal is to have a better performance than regular critical sections without relying solely on spin locks, this is the way to go, because it is reliable. In comparison your method is reliant on the external state of the system and how much load is on that, something you cannot control or parametize to ensure a consisten performance
« Last Edit: July 15, 2023, 11:41:57 pm by Warfley »

AlanTheBeast

  • Sr. Member
  • ****
  • Posts: 407
  • My software never cras....
Re: InterlockedExchange v. CriticalSection
« Reply #23 on: July 16, 2023, 12:10:01 am »
With sleep(0); the algorithm does not cease to be a spin lock, it can be called a spin lock with preemption, that's all, as described earlier, it can have various problems with the load on the processor.
PS: Memory Barriers are mostly needed for advanced lock-free algorithms, no wonder you haven't seen this.

Calling Sleep(0) is not pre-emption.  It is giving up the CPU so any thread ready to go can run.

If the "timeslice" is 10ms long and I do that loop at the beginning of the 10ms and the thread holding the lock doesn't give it up, then sleep(0) allows some other ready thread to have the CPU; otherwise "    do ;   " would burn up the 10ms until the pre-emption occurs. 
Pretty unfriendly code.

I'll grant that it is still a SpinLock - but a cooperative one.

As to memory barriers, if I call writebarrier, no parameter is setup, but there is a call to _SYSTEM_$$_WRITEBARRIER

No assembly code is generated ( wmb ).

And the compiler throws a warning:
      Call to subroutine "procedure WriteBarrier;" marked as inline is not inlined

So I don't get to see the h/w based instructions (if any) that are used.
Everyone talks about the weather but nobody does anything about it.
..Samuel Clemens.

AlanTheBeast

  • Sr. Member
  • ****
  • Posts: 407
  • My software never cras....
Re: InterlockedExchange v. CriticalSection
« Reply #24 on: July 16, 2023, 01:20:19 am »
    do ;

Is a spinlock.

   sleep(0);

Is not.  Because - and entirely supported by what you wrote - nobody knows what is waiting in the thread queue and ready to go.
So where " do ; " will hog the CPU until pre-emption,  " do sleep(0); " will yield if there's something to yield to.
Yes, I know, but this is what explains the performance differences. Sleep(0) only yields the CPU if there is something else in the queue (also it's not just on the queue, modern schedulers try to schedule with CPU affinity and other stuff in mind, so there can be something in the queue but is prioritized on another CPU and therefore not chosen to interrupt your process). Critical sections on the other hand will always yield the CPU for at least one time slice.

So while you may observe a performance benefit on a system without much load compared to a critical section, there are two caveats to this. First, in this situation you are spinning your CPU to 100%, so if you trade of reaction time for energy consumption, which depending on your environment can be an issue.
The second caveat is, that it is only faster in the case of low load. In the case of high load, when Sleep(0) will actually yield, there shouldn't be much of a performance difference. Because Sleep(0) is an unconditional yield, compared to the critical section, where the OS knows that this is waiting for the release of the CS, it could be that the OS will prioritize the process comming out of the CS over one that just called sleep, meaning that in this situation it may actually be slower.

So using a sleep(0) loop is unreliably faster when there is a low load on the system, and potentially even slower than a CS. When you try to have a faster locking mechanism than classical critical sections, this is usually solved is actually differently, for example the Windows API Critical Section has a combination of Spin Wait and Sleep Wait, where at first it tries for a pre-defined time to spin lock (e.g. for a few nanoseconds), and if the lock is not freed during the spin phase, the thread will be put to sleep, which due to it yielding to the scheduler may add an additional multi millisecond delay.

If your goal is to have a better performance than regular critical sections without relying solely on spin locks, this is the way to go, because it is reliable. In comparison your method is reliant on the external state of the system and how much load is on that, something you cannot control or parametize to ensure a consisten performance

Stretching it, hu?.  CPU affinity is not normally a concern.  A core is available for work and there is a thread needing a core, the marriage is made.   (I do use affinity on some board projects where I reserve 1 or 2 cores for certain tasks and relegate all others to the remaining cores, but these are for very narrow cases where input data senescence is a concern).

My point is I'd rather (in many cases) give up the CPU to any thread that needs it than hold the CPU waiting for a state to clear.  That also has arguments for efficiency (if needed) as well as progressing other work on the system.
To be sure, in many cases, the duration of a locked section of code is very short and that is desirable so a hard SpinLock is used, but it is not always the case.

The performance loss I observe is most likely due to the higher overhead of operating CS over IE and that is all.

My real concern is about what is the true advantage of one over the other, and that seems to come down to the higher level of safety of CS where i/o and recursion are concerned - and that would likely be worth the higher overhead.
Everyone talks about the weather but nobody does anything about it.
..Samuel Clemens.

440bx

  • Hero Member
  • *****
  • Posts: 6542
Re: InterlockedExchange v. CriticalSection
« Reply #25 on: July 16, 2023, 04:17:26 am »
As is usually the case, the "correct" answer is dependent on a number of factors, among them contention and frequency.

Of course, the best thing is understanding the differences between the various synchronization mechanisms (and doing a little application specific testing doesn't hurt either.)  To help getting to that understanding, I recommend reading Chapter 7 of Pavel Yosifovitch's Windows 10 System Programming Part 1 (can be downloaded from the net but doing so is a violation of the author's copyright, therefore I won't post any of those links.)  The sample programs can be legally obtained from his github repo, specifically https://github.com/zodiacon/Win10SysProgBookSamples

Note that compiling the samples can be a bit problematic due to his use of some "libraries" that need to be refreshed.  (but, maybe he has solved that problem by now - of course, MSVC is required.)

Another good source is Jeffrey Richter's Windows Via C, CPP, particularly chapter 8.  The book's sample programs can be downloaded from his Wintellect website.

Both of them include sample to determine the elapsed time to achieve synchronization in various cases. 

Both books deserve to be recommended and are a good investment of any programmer's time who wants to understand how things work.

HTH.
FPC v3.2.2 and Lazarus v4.0rc3 on Windows 7 SP1 64bit.

Lulu

  • Sr. Member
  • ****
  • Posts: 405
Re: InterlockedExchange v. CriticalSection
« Reply #26 on: July 16, 2023, 10:53:35 am »
OMG! we're a long way from the days of the z80...!!  :D
wishing you a nice life!
GitHub repositories https://github.com/Lulu04

MarkMLl

  • Hero Member
  • *****
  • Posts: 8572
Re: InterlockedExchange v. CriticalSection
« Reply #27 on: July 16, 2023, 11:26:45 am »
OMG! we're a long way from the days of the z80...!!  :D

Indeed. But a Z80 can run coroutines hence potentially threads, and one if its immediate successors was supposed to be able to implement SMP.

So /potentially/, people could have got bogged down by these problems 40 years ago: when there was /even/ /less/ of an understanding of how to deal with them.

MarkMLl
MT+86 & Turbo Pascal v1 on CCP/M-86, multitasking with LAN & graphics in 128Kb.
Logitech, TopSpeed & FTL Modula-2 on bare metal (Z80, '286 protected mode).
Pet hate: people who boast about the size and sophistication of their computer.
GitHub repositories: https://github.com/MarkMLl?tab=repositories

Warfley

  • Hero Member
  • *****
  • Posts: 2067
Re: InterlockedExchange v. CriticalSection
« Reply #28 on: July 16, 2023, 12:07:14 pm »
Stretching it, hu?.  CPU affinity is not normally a concern.  A core is available for work and there is a thread needing a core, the marriage is made.   (I do use affinity on some board projects where I reserve 1 or 2 cores for certain tasks and relegate all others to the remaining cores, but these are for very narrow cases where input data senescence is a concern).
With CPU affinity I do not mean CPU pinning, I mean that the OS will try to keep threads running on the same CPU to avoid cache invalidation. When a task is re-scheduled it is not simply assigned the first CPU thats available, if it has run before on one CPU the OS will preferr continuing that task on the same CPU sometimes at the cost of other already waiting tasks.

If you look at long running tasks taking a lot of CPU time you will observe that they will mostly always be rescheduled on the same CPU. Unless there is heavy load on the system, all long running tasks will basically get their own CPU, while short running/often breaking tasks will be sharing the remaining ones.

For example, I have a Threadripper with 24 CPUs, when I run any loop with Sleep(0) it will pretty much always get one CPU and solely spin that up to 100%, because I never have more other High CPU usage tasks, such that the OS would decide to break that process up, even though with "Sleep(0)" if it was just first come first serve it should. It needs around 20-22 other high CPU usage long running tasks before the OS will start interrupting the Sleep(0) for another task.

High CPU usage tasks, including Sleep(0) loops, will only be interrupted if there is a high load. So in effect they basically get a higher priority by the OS scheduler than tasks that get regularly interrupted. This is something you always need to keep in mind when writing a Sleep(0) loop. It will never starve other processes, but sd I already said, it will spin up the CPU to 100% and this has some potentially negative side effects (like energy usage, or simply noise level from spinning up the fan)

AlanTheBeast

  • Sr. Member
  • ****
  • Posts: 407
  • My software never cras....
Re: InterlockedExchange v. CriticalSection
« Reply #29 on: July 16, 2023, 03:08:34 pm »
As is usually the case, the "correct" answer is dependent on a number of factors, among them contention and frequency.

Of course, the best thing is understanding the differences between the various synchronization mechanisms (and doing a little application specific testing doesn't hurt either.)  To help getting to that understanding, I recommend reading Chapter 7 of Pavel Yosifovitch's Windows 10 System Programming Part 1 (can be downloaded from the net but doing so is a violation of the author's copyright, therefore I won't post any of those links.)  The sample programs can be legally obtained from his github repo, specifically https://github.com/zodiacon/Win10SysProgBookSamples

Note that compiling the samples can be a bit problematic due to his use of some "libraries" that need to be refreshed.  (but, maybe he has solved that problem by now - of course, MSVC is required.)

Another good source is Jeffrey Richter's Windows Via C, CPP, particularly chapter 8.  The book's sample programs can be downloaded from his Wintellect website.

Both of them include sample to determine the elapsed time to achieve synchronization in various cases. 

Both books deserve to be recommended and are a good investment of any programmer's time who wants to understand how things work.


Well, I'm doing these experiments on a Mac and the actual application is OS-less RTL on a Raspberry Pi ... so not sure how Windows examples will play out.  I'll go grab them and see what's there.

Thanks.

Everyone talks about the weather but nobody does anything about it.
..Samuel Clemens.

 

TinyPortal © 2005-2018