Recent

Author Topic: RWLock version 1.01  (Read 9327 times)

aminer

  • Hero Member
  • *****
  • Posts: 956
RWLock version 1.01
« on: June 18, 2013, 03:52:55 pm »
Hello,

I have deleted my previous implementation of RWLock cause it was not
good on scalability and i have completly reinvented a new RWLock that is scalable, this one uses LockedExchangeAdd() on he read side and is more efficient than lockfree TOmniMREW on frequent reads and my new implementation doesn't starve the writes.

Description: A fast Multiple-Readers-Exclusive-Writer Lock that works across process and threads.


A Read/Write Lock is a performance improvement over a standard mutex for cases where reads outnumber writes. with a Read/Write Lock multiple simultaneous read locks may be held, but write locks are exclusively held.

The exclusive writing lock ensures that race conditions do not occur,  since if one client is writing to the data no other client may read or write. Also, the allowance for multiple simultaneous read locks decreases  esource contention since multiple readers can safely use the shared data.

This increases performance over a standard mutex for the assumed usage pattern of frequent simultaneous reads and infrequent writes.

I have looked at another lockfree Read/Write Lock implementation called TOmniMREW that you will find in the OmniThread library(at http://otl.17slon.com/), but my implementation is more efficient when there is frequent simultaneoous reads cause it uses a LockedExchangeAdd on the read side , not a CAS as is using TOmniMREW, and my implementation doesn't starve the writes.

Please take a look a the test.pas Object Pascal demo inside the zipfile, compile and run it...



You can download RWLock versoin 1.01 from:


http://pages.videotron.com/aminer/


Language: FPC Pascal v2.2.0+ / Delphi 7+: http://www.freepascal.org/

Operating Systems: Windows, Mac OSX , Linux , Unix...


Required FPC switches: -O3 -Sd -dFPC -dFreePascal

-Sd for delphi mode....



Thank you,
Amine Moulay Ramdane.


aminer

  • Hero Member
  • *****
  • Posts: 956
Re: RWLock version 1.01
« Reply #1 on: June 18, 2013, 04:57:50 pm »

Hello,


You will find my new implementation of RWLock here:

http://pages.videotron.com/aminer/

I will explain my algorithm:

You will read this inside the RLock() method:

--
procedure TRWLOCK.RLock;
var count:ulong;
begin

repeat

count:=LockedExchangeAdd(FCount2^,2);

if (count mod 2) = 0
then break
else LockedExchangeAdd(FCount2^,-2);
if mysleep=1 then sleep(random(2))
else sleep(0);
until false;
     
end;

--

As you will notice this will be amortized to only one LockedExchangeAdd() when there is frequent reads, this
is why my RWLock is more efficient than lockfree TOmniMREW that is using a CAS, i have benchmarked my RWLock against TOmniMREW
and it's 2X times faster than lockfree TOmniMREW, try my RWLock yourself  and be happy, and as you have noticed i am testing with an
if (count mod 2) = 0 to see if there no writers threads that entered
the WLock() method and incremented FCount2^, if the thread that
entered WLock have not yet incremented FCount2^ the threads on
the RLock side will proceed and the thread that entered WLock
will wait until FCount2^ will be equal to 1, so all is correct here,
so as you have noticed with my algorithm the writers will not be starved. 

On the WLock() and WUnlock() side you will read this:

--

procedure TRWLOCK.WLock;

var count:ulong;
begin

while not CAS(FCount1^,0,1)
do
 begin
  if mysleep=1 then sleep(random(2))
  else sleep(0);
 end;

LockedExchangeAdd(FCount2^,1);


while (FCount2^ <>1)
do
begin
if mysleep=1 then sleep(random(2))
else sleep(0);
end;
end;
//==============================================================================

procedure TRWLOCK.WUnlock;
begin
LockedExchangeAdd(FCount2^,-1);
FCount1^:=0;
end;
end.

---

As you have noticed i am using a CAS() as a critical section
cause only one thread must enter the WLock, after that
i am incrementing FCount2^ by 1 to block the new
readers threads entering the RLock() method, after that
i am waiting for FCount2^ to equal 1 that means
i am waiting for all the threads that entered the RLock()
to exit with a RUnlock(), so i think my algorithm is correct.




Thank you,
Amine Moulay Ramdane.








aminer

  • Hero Member
  • *****
  • Posts: 956
Re: RWLock version 1.01
« Reply #2 on: June 18, 2013, 08:53:48 pm »

Hello all,

I have enhanced my RWLock algorithm, now please
follow with me..

In my previous version the RLock() method was like this:

==
procedure TRWLOCK.RLock;
var count:ulong;
begin

repeat

count:=LockedExchangeAdd(FCount2^,2);

if (count mod 2) = 0
then break
else LockedExchangeAdd(FCount2^,-2);
if mysleep=1 then sleep(random(2))
else sleep(0);
until false;

end;

==


and the WLock() method was like this:

==

procedure TRWLOCK.WLock;

var count:ulong;
begin

while not CAS(FCount1^,0,1)
do
 begin
  if mysleep=1 then sleep(random(2))
  else sleep(0);
 end;

LockedExchangeAdd(FCount2^,1);


while (FCount2^ <>1)
do
begin
if mysleep=1 then sleep(random(2))
else sleep(0);
end;
end;
==



So what was the problem then ? the problem is when the
WLock() method will execute the following part of the code:

--
while (FCount2^ <>1)
do
begin
if mysleep=1 then sleep(random(2))
else sleep(0);
end;
end;
--

It can starve , like a deadlock, cause the RLock() is spinning and incrementing and decrementing by 2 the FCount2^, so to solve
this problem i have to stop the readers that enter the RLock() method  when the writer have incremented FCount2^ by 1, and this is done by
my new algorithm, hwew is the new RWLock() method:

==

procedure TRWLOCK.RLock;
var count,count1:int;
begin

repeat

count:=LockedExchangeAdd(FCount2^,2);

if (count mod 2) = 0
then break
else
begin

count:= LockedExchangeAdd(FCount2^,-2);

while (FCount2^ and (not(not 1))) =  1
do if mysleep=1 then sleep(random(2))
else sleep(0);
end;

if mysleep=1 then sleep(random(2))
else sleep(0);
until false;

     
end;

--

So as you have noticed there is a
FCount2^ and (not(not 1))) =  1
to test on the first bit , if it is equal
to 1 that means that the writer have entered and incremented
FCount2^ by 1, hence the readers will not spin by incrementing
and decrementing FCount2^ by 2 , but they will spin on a sleep(0)
until FCount2^ will not equal to 1, so my new algorithm is efficient now
and it is 2X faster than the lockfree TOmniMREW that you will
find on the Omnithread library.


You can download my new RWLock versoin 1.1 with my new algorithm   
from:

http://pages.videotron.com/aminer/


Thank you,
Amine Moulay Ramdane.




aminer

  • Hero Member
  • *****
  • Posts: 956
Re: RWLock version 1.01
« Reply #3 on: June 18, 2013, 09:51:19 pm »

Hello all,

Of course i have included MSync library inside RWLock zip file,
this MSync library include the Distributed RWLock by Dmitry Vyukov
that i have enhanced and wrote in Object Pascal, this Distributed
RWLock will use my RWLock and make it more scalable, please
take a look at TDRWLock class inside the MSync library inside the RWLock zipfile.


And of course you can port my SemaCondvar algorithm and my new
RWLock algorithm to C or C++ if you want.


You can download my RWLock 1.1 from:

http://pages.videotron.com/aminer/


Thank you,
Amine Moulay Ramdane.



aminer

  • Hero Member
  • *****
  • Posts: 956
Re: RWLock version 1.01
« Reply #4 on: June 19, 2013, 09:54:20 pm »

Hello,

I have updated my RWLock to version 1.11, i have benchmarked it against lockfree TOmniMREW and now it's 2.5X times faster than the lockfree TOmniMREW of the OmniThread library.

You can download my RWLock version  1.11 from:

http://pages.videotron.com/aminer/


Thank you,
Amine Moulay Ramdane.





aminer

  • Hero Member
  • *****
  • Posts: 956
Re: RWLock version 1.01
« Reply #5 on: June 19, 2013, 11:29:00 pm »

Hello,

I have wrote a webpage that explains my RWLock algorithm, here it is:

http://pages.videotron.com/aminer/rwlock.html



Thank you,
Amine Moulay Ramdane.

aminer

  • Hero Member
  • *****
  • Posts: 956
Re: RWLock version 1.01
« Reply #6 on: June 20, 2013, 01:44:11 am »

Hello,

I have read this paper about NUMA aware Reader-Writer locks:

http://mcg.cs.tau.ac.il/papers/ppopp2013-rwlocks.pdf


But i don't think you need this complexity of NUMA, my RWLock algorithm and the DRWLock distributed algorithm will be efficient and fast in the Intel SCC 48-core x86 Processor, this Intel SCC 48 cores processor is arranged in 24 "tiles" - what you can think of as 24 separate dual-core processors.  These tiles are then connected via a 2D mesh network that can support up to 256 GB/s of bandwidth!  There are four integrated DDR3 memory controllers (though no mention of how many channels each controller is) that can house as much as 64GB of total memory.

Read this:

http://www.pcper.com/reviews/Processors/Intel-Shows-48-core-x86-Processor-Single-chip-Cloud-Computer

http://news.cnet.com/8301-30685_3-10407818-264.html

http://www.pcworld.com/article/18365 /Intel_48Core_SingleChip_Cloud_Computer_Improves_Power_Efficiency.html




Thank you,
Amine Moulay Ramdane.

aminer

  • Hero Member
  • *****
  • Posts: 956
Re: RWLock version 1.01
« Reply #7 on: June 20, 2013, 02:22:37 am »

Hello,


In NUMA i think you have to be unfair also , cause requests that arrive at the same lock might be spread over the physical cores of the system. In these cases, it might be beneficial to group together requests coming from cores that are grouped by there NUMA node number, in order to improve memory performance, as these cores are more probable to share memory data. Moreover, being unfair allows the use of more advanced techniques, like lock cohorts, that reduce traffic on the interconnect between the physical modules, by letting the lock be transferred "locally", without generating cache-coherence-related interconnect traffic.

So as you have noticed if you are in a NUMA system you have to rewrite
RWLocks to take advantage of NUMA, but as i said before i don't think you need this complexity of NUMA, my RWLock algorithm (and the DRWLock distributed algorithm inside MSync library) will be efficient and fast in the Intel SCC 48-core x86 Processor, this Intel SCC 48 cores processor is arranged in 24 "tiles" - what you can think of as 24 separate dual-core processors.  These tiles are then connected via a 2D mesh network that can support up to 256 GB/s of bandwidth!  There are four integrated DDR3 memory controllers (though no mention of how many channels each controller is) that can house as much as 64GB of total memory.

Read this:

http://www.pcper.com/reviews/Processors/Intel-Shows-48-core-x86-Processor-Single-chip-Cloud-Computer

http://news.cnet.com/8301-30685_3-10407818-264.html
 


Thank you,
Amine Moulay Ramdane.


aminer

  • Hero Member
  • *****
  • Posts: 956
Re: RWLock version 1.01
« Reply #8 on: June 20, 2013, 02:52:41 am »

Hello,

I have read the following paper:

http://mcg.cs.tau.ac.il/papers/ppopp2013-rwlocks.pdf



So as you have noticed they are using the same method that are using
the Distributed RWLock by Dmitry Vyukov, so we have to change the
"function GetCurrentProcessorNumber:long;" inside MSync library by function that returns the current NUMA number, by doing this we will
reduce coherence traffic and the distributed RWLock will be NUMA aware. And that's easy to do.




Thank you,
Amine Moulay Ramdane.




aminer

  • Hero Member
  • *****
  • Posts: 956
Re: RWLock version 1.01
« Reply #9 on: June 20, 2013, 07:46:01 pm »

Hello,

I have looked at the Dmitry Vyukov asymmetric rwmutex algorithm
with no atomic_rmw/membars on reader side, here it is:

https://groups.google.com/forum/?fromgroups#!original/comp.programming.threads/t9O4wI-co8Y/CIy9UAQ0rHoJ


https://groups.google.com/forum/?fromgroups#!original/comp.programming.threads/t9O4wI-co8Y/ya3QnzvbajUJ

and i have tried to do some benchmarks with 4 threads
on four core looping 100000 times in each thread and entering the rlock() and leaving with and wunlock() and i have noticed that you can gain only 40 milliseconds compared to my RWLock, i think that 40 milliseconds is not so important in none Realtime applications under Windows and Linux and Mac OSX, hence i will still use my RWLock algorithm.



Thank you,
Amine Moulay Ramdane.



aminer

  • Hero Member
  • *****
  • Posts: 956
Re: RWLock version 1.01
« Reply #10 on: June 20, 2013, 08:15:32 pm »

Question:

Amine, how can you say this? in the Dmitry Vyukov
asymmetric rwmutex algorithm there is no hotspot
in the reader side , so it's really scalable.

Answer:

I will still use my RWLock algorithm with the Distributed RWLock
of Dmitry Vyukov that i have wrote in Object Pascal and
included in the MSync library, this distributed RWLock will eliminate the contention so there will be no hotspot and my RWLock combined with
the Distributed RWLock will then scale very well.


Thank you,
Amine Moulay Ramdane.

aminer

  • Hero Member
  • *****
  • Posts: 956
Re: RWLock version 1.01
« Reply #11 on: June 20, 2013, 08:53:04 pm »

Hello,

As i said the Dmitry Vyukov's distributed rwlock
eleminates the contention, so it eliminates the
hotspot on the reader side, so it will scale very well,
and of course it's NUMA aware, and for this you have
to add only one function to adapt it to NUMA, the function
that also permit Dmitry Vyukov's distributed rwlock
to scale is "function GetCurrentProcessorNumber:long;"
i have wrote it in Object Pascal, to be NUMA aware
you have to add another function that returns the NUMA
node number and that's easy to do, so Dmitry Vyukov's
distributed rwlock combined with my RWLock is the answer
and it will scale very well.

Dmitry Vyukov's asymmetric rwmutex algorithm is not NUMA aware,
so i prefer to use Dmitry Vyukov's distributed rwlock combined
with my RWLock algorithm.




Thank you,
Amine Moulay Ramdane.



 

aminer

  • Hero Member
  • *****
  • Posts: 956
Re: RWLock version 1.01
« Reply #12 on: June 20, 2013, 09:02:28 pm »

On 6/20/2013 2:51 PM, aminer wrote:> Dmitry Vyukov's asymmetric rwmutex algorithm is not NUMA aware,
> so i prefer to use Dmitry Vyukov's distributed rwlock combined
> with my RWLock algorithm.



Sorry, the Dmitry Vyukov's asymmetric rwmutex algorithm is scalable
on the reader side on NUMA also, but i will use the Dmitry Vyukov's distributed rwlock combined with my RWLock algorithm.



Thank you,
Amine Moulay Ramdane.

aminer

  • Hero Member
  • *****
  • Posts: 956
Re: RWLock version 1.01
« Reply #13 on: June 20, 2013, 09:45:31 pm »

Hello,


I have wrote the parallel part of Parallel hashlist
(a parallel hash table), and it scales very well:

"A parallel HashList with O(1) best case and O(log(n)) worst case access that uses lock striping and lightweight MREWs(multiple-readers-exclusive-writer) , this allows multiple threads to write and read concurently. also parallelhashlist maintains an independant counter , that counts the number of entries , for each segment of the hashtable and uses a lock for each counter, this is also for better scalability."

http://pages.videotron.com/aminer/
 
Now i didn't upgrade parallel hashlist to adapt to NUMA systems, but that's easy to do, i have to use the Dmitry Vyukov's NUMA aware distributed RWLock (combined with my RWLock) for each segment of the hash table and parallel hashlist will then be scalable on NUMA systems too.



Thank you,
Amine Moulay Ramdane.
 





 

TinyPortal © 2005-2018