Author Topic: Scalable RWLock version 1.12  (Read 12095 times)

aminer

  • Hero Member
  • *****
  • Posts: 956
Scalable RWLock version 1.12
« on: September 23, 2013, 09:40:56 pm »

Hello,

SemaCondvar was updated to version 1.11 and Scalable RWLock was updated to version 1.12. I have used a Ticket spinlock with an exponential backoff inside SemaCondvar and SemaMonitor; the Ticket spinlock now avoids the starvation problem and has decent performance.
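For illustration, the idea of a ticket spinlock with exponential backoff can be sketched in C11 (a minimal sketch under my own naming; the yield-based backoff stands in for whatever pause the actual SemaCondvar code uses):

```c
#include <stdatomic.h>
#include <sched.h>   /* sched_yield() as a stand-in for a pause */

/* Illustrative ticket spinlock with exponential backoff. */
typedef struct {
    atomic_uint next_ticket;   /* ticket dispenser */
    atomic_uint now_serving;   /* ticket currently allowed in */
} ticket_lock;

static void ticket_lock_init(ticket_lock *l) {
    atomic_init(&l->next_ticket, 0);
    atomic_init(&l->now_serving, 0);
}

static void ticket_lock_acquire(ticket_lock *l) {
    /* take a ticket: FIFO order is what prevents starvation */
    unsigned my = atomic_fetch_add(&l->next_ticket, 1);
    unsigned delay = 1;
    while (atomic_load_explicit(&l->now_serving, memory_order_acquire) != my) {
        /* exponential backoff: wait longer each round to cut bus traffic */
        for (unsigned i = 0; i < delay; i++)
            sched_yield();
        if (delay < 1024)
            delay <<= 1;
    }
}

static void ticket_lock_release(ticket_lock *l) {
    unsigned cur = atomic_load_explicit(&l->now_serving, memory_order_relaxed);
    atomic_store_explicit(&l->now_serving, cur + 1, memory_order_release);
}
```

The ticket dispenser hands out turns in arrival order, which is exactly the starvation-freedom property discussed in this thread.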


You can download SemaCondvar and my scalable RWLock from:

http://pages.videotron.com/aminer/


Thank you,
Amine Moulay Ramdane.


eny

  • Hero Member
  • *****
  • Posts: 1659
Re: Scalable RWLock version 1.12
« Reply #1 on: September 23, 2013, 10:39:57 pm »
Three new topics about the same subject?
Really?
All posts based on: Win11; Lazarus 4_4  (x64) 12-02-2026 (unless specified otherwise...)

aminer

  • Hero Member
  • *****
  • Posts: 956
Re: Scalable RWLock version 1.12
« Reply #2 on: September 23, 2013, 10:54:46 pm »

Hello,

That's not the end of the story. I have benchmarked the ticket spinlock
that I am using, which you can find here:

http://code.google.com/p/gpdelphiunits/source/browse/trunk/src/SpinLock.pas?r=37

and found that it has poor performance. With a spinlock without tickets, when the lock is released, the first thread that reaches lock() enters the locked section immediately. That is not the case with a Ticket spinlock, because if the first thread to reach lock() does not hold the next ticket, the threads that do hold the ticket and are waiting for their turn will wait longer. This is why the Ticket spinlock has poor performance compared to a simple spinlock with a backoff.

So I will advise you to avoid the Ticket spinlock.




Thank you,
Amine Moulay Ramdane.

aminer

  • Hero Member
  • *****
  • Posts: 956
Re: Scalable RWLock version 1.12
« Reply #3 on: September 23, 2013, 11:08:41 pm »
I corrected some English mistakes, please read again...



Hello,

That's not the end of the story. I have benchmarked the ticket spinlock
that I am using, which you can find here:

http://code.google.com/p/gpdelphiunits/source/browse/trunk/src/SpinLock.pas?r=37

and found that it has poor performance. With a spinlock without tickets, when the lock is released, the first thread that reaches lock() enters the locked section immediately. That is not the case with a Ticket spinlock, because if the first thread to reach lock() does not hold the ticket to enter, the thread that does hold it and is waiting for its turn will wait longer. This is why the Ticket spinlock has poor performance compared to a simple spinlock with a backoff.

So I will advise you to avoid the Ticket spinlock.




Thank you,
Amine Moulay Ramdane.


aminer

  • Hero Member
  • *****
  • Posts: 956
Re: Scalable RWLock version 1.12
« Reply #4 on: September 23, 2013, 11:32:20 pm »

Hello,

Finally, I have read the following, which says:

"Starvation freedom is desirable, but not essential
—practical locks: many permit starvation, although it is unlikely to occur"

You can read it here:

http://www.cs.rice.edu/~vs3/comp422/lecture-notes/comp422-lec19-s08-v1.pdf

So as you have just read, starvation is unlikely to occur with a spinlock with backoff; the probability of starvation is very low. So I will return to the spinlock with exponential backoff inside my SemaMonitor and SemaCondvar, because it has decent performance.
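For reference, the plain spinlock with exponential backoff (the test-and-test-and-set variety) can be sketched like this; a minimal C11 sketch with illustrative names, not the code from SpinLock.pas:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <sched.h>   /* sched_yield() as a stand-in for a pause */

/* Illustrative test-and-test-and-set spinlock with exponential backoff. */
typedef struct { atomic_bool locked; } backoff_lock;

static void backoff_lock_init(backoff_lock *l) {
    atomic_init(&l->locked, false);
}

static void backoff_lock_acquire(backoff_lock *l) {
    unsigned delay = 1;
    for (;;) {
        /* test first on the (possibly cached) flag, then try the exchange */
        if (!atomic_load_explicit(&l->locked, memory_order_relaxed) &&
            !atomic_exchange_explicit(&l->locked, true, memory_order_acquire))
            return;
        /* back off before retrying; doubling the wait reduces
           cache-coherence traffic under contention */
        for (unsigned i = 0; i < delay; i++)
            sched_yield();
        if (delay < 1024)
            delay <<= 1;
    }
}

static void backoff_lock_release(backoff_lock *l) {
    atomic_store_explicit(&l->locked, false, memory_order_release);
}
```

Note that nothing here orders the waiters, which is why this lock can in principle starve a thread, the trade-off being weighed in this thread.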



Thank you,
Amine Moulay Ramdane.









aminer

  • Hero Member
  • *****
  • Posts: 956
Re: Scalable RWLock version 1.12
« Reply #5 on: September 23, 2013, 11:47:42 pm »

Hello,

And look at the graph of the "Lock comparison" up to 80 cores:
the test&set lock with an exponential backoff has good performance,
so there is no need for an MCS queue lock. I will return to
the spinlock with an exponential backoff, because it has good performance.

Here is the graph, below on this PDF page:

http://www.cs.rice.edu/~vs3/comp422/lecture-notes/comp422-lec19-s08-v1.pdf




Thank you,
Amine Moulay Ramdane.


aminer

  • Hero Member
  • *****
  • Posts: 956
Re: Scalable RWLock version 1.12
« Reply #6 on: September 24, 2013, 12:18:04 am »

Hello,

SemaCondvar was updated to version 1.12 and Scalable RWLock was updated to version 1.13. I have returned to the spinlock with an exponential backoff inside SemaCondvar and SemaMonitor, and it now gives good performance.


You can download SemaCondvar and my scalable RWLock from:

http://pages.videotron.com/aminer/


Thank you,
Amine Moulay Ramdane.

aminer

  • Hero Member
  • *****
  • Posts: 956
Re: Scalable RWLock version 1.12
« Reply #7 on: September 24, 2013, 02:23:59 am »

Hello,

Please take a look at the following MCS queue lock algorithm:

==

mcs_node{
      mcs_node next;
      int is_locked;
}

mcs_lock{
      mcs_node queue;
}

function Lock(mcs_lock lock, mcs_node my_node){
      my_node.next = NULL;
      mcs_node predecessor = fetch_and_store(lock.queue, my_node);
      //if it's null, we now own the lock and can leave, else....
      if (predecessor != NULL){
          my_node.is_locked = true;
          //when the predecessor unlocks, it will give us the lock
          predecessor.next = my_node;
          while(my_node.is_locked){}
      }
}

function UnLock(mcs_lock lock, mcs_node my_node){
      //is there anyone to give the lock to?
      if (my_node.next == NULL){
            //need to make sure there wasn't a race here
            if (compare_and_swap(lock.queue, my_node, NULL)){
                 return;
            }
            else{
                 //someone has executed fetch_and_store but not set our next field
                 while(my_node.next==NULL){}
           }
      }
     //if we made it here, there is someone waiting, so lets wake them up
     my_node.next.is_locked=false;
}

===
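The pseudocode above can be transcribed into C11 atomics roughly as follows (a sketch of the same MCS algorithm; the names and memory orders are my own choices, not from the original):

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* C11 sketch of the MCS queue lock shown above. */
typedef struct mcs_node {
    struct mcs_node *_Atomic next;
    atomic_bool locked;
} mcs_node;

typedef struct { mcs_node *_Atomic tail; } mcs_lock;

static void mcs_acquire(mcs_lock *l, mcs_node *me) {
    atomic_store_explicit(&me->next, NULL, memory_order_relaxed);
    /* fetch_and_store: atomically append ourselves at the queue tail */
    mcs_node *pred = atomic_exchange_explicit(&l->tail, me, memory_order_acq_rel);
    if (pred != NULL) {
        atomic_store_explicit(&me->locked, true, memory_order_relaxed);
        atomic_store_explicit(&pred->next, me, memory_order_release);
        /* spin on our own node: each waiter spins on a distinct cache line */
        while (atomic_load_explicit(&me->locked, memory_order_acquire))
            ;
    }
}

static void mcs_release(mcs_lock *l, mcs_node *me) {
    mcs_node *succ = atomic_load_explicit(&me->next, memory_order_acquire);
    if (succ == NULL) {
        /* no visible successor: try to swing the tail back to empty */
        mcs_node *expected = me;
        if (atomic_compare_exchange_strong(&l->tail, &expected, NULL))
            return;
        /* a successor is mid-enqueue; wait for it to link itself in */
        while ((succ = atomic_load_explicit(&me->next, memory_order_acquire)) == NULL)
            ;
    }
    atomic_store_explicit(&succ->locked, false, memory_order_release);
}
```

The key property is that each waiter spins on its own node's flag rather than on a shared word, which is what keeps the coherence traffic low.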


First, you will notice that it uses an atomic fetch_and_store,
plus a spin wait, so it is more expensive than
the spinlock with an exponential backoff. I prefer the
spinlock with an exponential backoff, because the backoff
reduces the cache-coherence traffic and the CPU utilisation
when there is more contention; this is why the spinlock with an exponential backoff gives good performance, almost the same as the MCS queue lock.
Proof of that? Look at the graph of the "Lock comparison" up to 80 cores, below in the following PDF file, and
you will notice that the test&set lock with an exponential backoff
has almost the same performance as the MCS queue lock:

http://www.cs.rice.edu/~vs3/comp422/lecture-notes/comp422-lec19-s08-v1.pdf


So I think I will stay with the spinlock with an exponential backoff,
because it has good performance.



Thank you,
Amine Moulay Ramdane.

 



aminer

  • Hero Member
  • *****
  • Posts: 956
Re: Scalable RWLock version 1.12
« Reply #8 on: September 24, 2013, 04:57:40 am »
Hello,

That's not the end of the story. I have benchmarked
the following spinlock with exponential backoff that I am
using:

http://code.google.com/p/gpdelphiunits/source/browse/trunk/src/SpinLock.pas?r=37

And it is not good at all, because there is a problem with
the exponential backoff. I have benchmarked it and noticed
that the backoff slows down the critical section a lot even when contention is not high, so this TTAS with an exponential backoff is a poor choice, and the ticket spinlock is a poor choice as well.

I have also benchmarked the ticket spinlock and found that it has poor performance. With a spinlock without tickets, when the lock is released, the first thread that reaches lock() enters the locked section immediately. That is not the case with a Ticket spinlock, because if the first thread to reach lock() does not yet hold the ticket to enter, the thread that does hold it and is waiting for its turn will wait longer. This is also why the Ticket spinlock has poor performance compared to a simple spinlock with a backoff.

So I will advise you to avoid the spinlock with exponential backoff and the Ticket spinlock.


Thank you,
Amine Moulay Ramdane.






ludob

  • Hero Member
  • *****
  • Posts: 1173
Re: Scalable RWLock version 1.12
« Reply #9 on: September 24, 2013, 08:44:46 am »
Quote
I corrected some English mistakes, please read again...
I have told you before: you can edit your older posts. There is no need to beef up your post count and fill up the Recent Posts view by rewriting the same text.
Quote
So I think I will stay with the spinlock with an exponential backoff,
because it has good performance.
Quote
So I will advise you to avoid the spinlock with exponential backoff and the Ticket spinlock.
You can also delete a complete thread if you want. Have you ever tried to imagine how people reading this thread feel about your work? If in the same thread you say one thing and then its opposite, you are just considered a clown. And because of your netiquette, a spamming clown. Is that the reputation you're after?

If you want to jot down your feelings while browsing through the complex world of parallel processing, open yourself a blog. They are made for that. This is a developers forum, not a blog.


aminer

  • Hero Member
  • *****
  • Posts: 956
Re: Scalable RWLock version 1.12
« Reply #10 on: September 24, 2013, 04:38:23 pm »

Hello,

Here is my new algorithm for a distributed priority and
non-priority LOCK. This algorithm greatly reduces the
cache-coherence traffic, so follow me please...


First you allocate an array with the same number of elements
as the number of cores; the elements of the array must be 64-byte
cache-aligned, and each must be a record containing one element:
a flag for the first CAS, which will be used as a critical section.

You first initialise the distributed priority LOCK by setting
the flag for the second CAS (which will also be used as a critical section) to 0, and you set the flag of the first CAS to 1 (to permit the first thread to enter lock()).

The lock() function will look like this:

Every thread that enters lock() must acquire its processor
number using the GetProcessorNumber() function, but this function
will be optimized to amortize the number of calls to it, and that's
easy to do.

You enter the lock() function by pushing the thread's processor number into a queue or priority queue that will be optimized. This queue will use a CAS on push() but no CAS on pop(); we don't need one on pop(), because only one thread will pop(), from the unlock() function. After that, you enter a repeat loop where you test, with the first CAS and with the second CAS, the flag of the corresponding element of the array, using the thread's processor number as an index. The flag for the second CAS will be set to 1, so the first thread will enter lock(); the other threads will test in a repeat loop on the first CAS and the second CAS, and since their flags will have been set to zero, they will wait.

After that, the first thread will arrive at the unlock() function;
it will pop() the processor number from the optimized priority
or non-priority queue and set the flag of the first CAS to 0 for the corresponding processor core, which will allow a thread to enter the lock. If there are no elements in the queue, the thread will set the flag of the second CAS to zero, which will also allow a thread to enter the lock.


So as you have noticed, my algorithm is also efficient because, if
there are threads waiting, the cache-coherence traffic will be reduced
a lot, since we are using local variables in each element of the array, aligned to 64 bytes.


So if you have noticed, I am in fact using 2 CASes, so my algorithm
is good.

This was my algorithm for an efficient distributed priority and
non-priority LOCK, and I will code it as soon as possible.



Thank you,
Amine Moulay Ramdane.


aminer

  • Hero Member
  • *****
  • Posts: 956
Re: Scalable RWLock version 1.12
« Reply #11 on: September 24, 2013, 05:47:46 pm »


I corrected some typos, please read again...

Hello,

Here is my new algorithm for a distributed priority and
non-priority LOCK. This algorithm greatly reduces the
cache-coherence traffic, so follow me please...


First you allocate an array with the same number of elements
as the number of cores; the elements of the array must be 64-byte
cache-aligned, and each must be a record containing one element:
a flag for the first CAS, which will be used as a critical section.

You first initialise the distributed priority LOCK by setting
the flag for the second CAS (which will also be used as a critical section) to 0, and you set the flag of the first CAS to 1 (to permit the first thread to enter lock()).

The lock() function will look like this:

Every thread that enters lock() must acquire its processor
number using the GetProcessorNumber() function, but this function
will be optimized to amortize the number of calls to it, and that's
easy to do.

You enter the lock() function by pushing the thread's processor number into a queue or priority queue that will be optimized. This queue will use a CAS on push() but no CAS on pop(); we don't need one on pop(), because only one thread will pop(), from the unlock() function. After that, you enter a repeat loop where you test, with the first CAS and with the second CAS, the flag of the corresponding element of the array, using the thread's processor number as an index. The flag for the second CAS will be set to 1, so the first thread will enter lock(); the other threads will test in a repeat loop on the first CAS and the second CAS, and since their flags will have been set to zero, they will wait.

After that, the first thread will arrive at the unlock() function;
it will pop() the processor number from the optimized priority
or non-priority queue and set the flag of the first CAS to 1 for the corresponding processor core, which will allow a thread to enter the lock. If there are no elements in the queue, the thread will set the flag of the second CAS to 1, which will also allow a thread to enter the lock.


So as you have noticed, my algorithm is also efficient because, if
there are threads waiting, the cache-coherence traffic will be reduced
a lot, since we are using local variables in each element of the array, aligned to 64 bytes.


So if you have noticed, I am in fact using 2 CASes, so my algorithm
is good.

This was my algorithm for an efficient distributed priority and
non-priority LOCK, and I will code it as soon as possible.
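The per-core, cache-line-padded flag idea described above is close in spirit to an array-based queue lock (Anderson's lock). A minimal C11 sketch of that simpler form, using a ticket-assigned slot instead of the processor number and omitting the priority queue (so the names and simplifications here are mine, not the algorithm as specified):

```c
#include <stdatomic.h>
#include <stdbool.h>

#define NSLOTS 64   /* assumed slot count; must cover the maximum waiters */

/* One flag per slot, padded so that each occupies its own 64-byte cache
   line: every waiter spins on its own line, which is what cuts the
   cache-coherence traffic. */
typedef struct {
    _Alignas(64) atomic_bool can_enter;
    char pad[64 - sizeof(atomic_bool)];
} slot_t;

typedef struct {
    slot_t slots[NSLOTS];
    atomic_uint next;   /* dispenser assigning each thread a slot */
} array_lock;

static void array_lock_init(array_lock *l) {
    atomic_init(&l->next, 0);
    for (int i = 0; i < NSLOTS; i++)
        atomic_init(&l->slots[i].can_enter, i == 0);  /* slot 0 open first */
}

static unsigned array_lock_acquire(array_lock *l) {
    unsigned slot = atomic_fetch_add(&l->next, 1) % NSLOTS;
    while (!atomic_load_explicit(&l->slots[slot].can_enter, memory_order_acquire))
        ;                /* spin only on our own padded line */
    return slot;         /* caller passes this back to release */
}

static void array_lock_release(array_lock *l, unsigned slot) {
    atomic_store_explicit(&l->slots[slot].can_enter, false, memory_order_relaxed);
    /* hand the lock to the next slot in line */
    atomic_store_explicit(&l->slots[(slot + 1) % NSLOTS].can_enter,
                          true, memory_order_release);
}
```

The handoff from one padded flag to the next plays the role of the pop()-and-set step in the prose above; the priority-queue variant would choose which flag to set rather than always taking the next slot.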



Thank you,
Amine Moulay Ramdane.

aminer

  • Hero Member
  • *****
  • Posts: 956
Re: Scalable RWLock version 1.12
« Reply #12 on: September 24, 2013, 06:42:25 pm »

Hello,

Be smart please: to greatly reduce the cache-coherence traffic,
you have to choose a queue that reduces it, and for that you have to code the queue as a linked list
(similar to the MCS queue) that reduces the cache-coherence traffic. But you have to avoid lock-free queues, because they will increase the cache-coherence traffic. This is how you have to code my distributed LOCK, and it will be efficient and good.



Thank you,
Amine Moulay Ramdane.


aminer

  • Hero Member
  • *****
  • Posts: 956
Re: Scalable RWLock version 1.12
« Reply #13 on: September 24, 2013, 06:45:34 pm »

A correction:

Hello,

Be smart please: to greatly reduce the cache-coherence traffic,
you have to choose a queue that reduces it, and for that you have to code the queue as a linked list
(similar to the MCS queue) that reduces the cache-coherence traffic. But you have to avoid lock-free queues, because they will increase the cache-coherence traffic. This is how you have to code my distributed LOCK, and it will be efficient and good.



Thank you,
Amine Moulay Ramdane.


aminer

  • Hero Member
  • *****
  • Posts: 956
Re: Scalable RWLock version 1.12
« Reply #14 on: September 24, 2013, 07:11:41 pm »

Hello,


I have come to an interesting subject about lock-free queues.
If you take a look at those lock-free queues, like in transactional memory, if the data has changed, the lock-free queue retries and loops again; but this lock-free mechanism generates a lot of cache-coherence traffic, so I would not advise it. So how
can we do it? You can take a look at the MCS queue lock; I
think it works like a wait-free linked list that doesn't spin and
that greatly reduces the cache-coherence traffic. Other than that,
if your memory manager is not optimal, uses a lock-free mechanism, and generates a lot of cache-coherence traffic, you have to use a freelist to lower the cache-coherence traffic.
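In its simplest form, a freelist just recycles nodes instead of going back to the allocator. A minimal single-threaded C sketch (illustrative names; a per-thread instance would avoid contention entirely):

```c
#include <stdlib.h>
#include <stddef.h>

/* Minimal freelist: recycled nodes are reused before calling malloc,
   so hot paths touch no shared allocator state. */
typedef struct node { struct node *next; int value; } node;

typedef struct { node *head; } freelist;

static node *fl_get(freelist *fl) {
    if (fl->head != NULL) {          /* reuse a recycled node */
        node *n = fl->head;
        fl->head = n->next;
        return n;
    }
    return malloc(sizeof(node));     /* fall back to the allocator */
}

static void fl_put(freelist *fl, node *n) {
    n->next = fl->head;              /* push the node back for reuse */
    fl->head = n;
}
```

Keeping one such list per thread means get/put never contend, which is the point being made about lowering the cache-coherence traffic of the memory manager.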


So be smart please and think about that.


Thank you,
Amine Moulay Ramdane.


 
