
#### aminer

• Hero Member
• Posts: 956
« on: September 15, 2012, 01:21:28 am »

Hello,

I have implemented an Object Pascal Distributed Reader-Writer Mutex
based on the C++ Distributed Reader-Writer Mutex by Dmitry Vyukov.
In this post I will discuss the TOmniMREW reader-writer lock by
Primoz Gabrijelcic.

Let's take a look at the source code; my explanation will follow:

====
unit MREW;

{$IFDEF FPC}
{$ASMMODE intel}
{$ENDIF FPC}

interface

uses
  sysutils, classes;

{$I defines.inc}

type

{$IFDEF CPU64}
  int = int64;
{$ENDIF CPU64}
{$IFDEF CPU32}
  int = integer;
{$ENDIF CPU32}

  typecache1 = array[0..15] of longword;

  TOmniMREW = class
  private
    omrewReference: int; // Reference.Bit0 is the 'writing in progress' flag
    // cache: typecache1;
  public
    procedure EnterReadLock;
    procedure ExitReadLock;
    procedure EnterWriteLock;
    procedure ExitWriteLock;
  end; { TOmniMREW }

implementation

{ TOmniMREW }

function LockedExchangeAdd(var Target: int; Value: int): int;
asm
{$IFDEF CPU32}
  // --> EAX Target
  //     EDX Value
  // <-- EAX Result
  MOV       ECX, EAX
  MOV       EAX, EDX
  LOCK XADD [ECX], EAX   // atomically add Value, return the old value
{$ENDIF CPU32}
{$IFDEF CPU64}
  // --> RCX Target
  //     RDX Value
  // <-- RAX Result
  MOV       RAX, RDX
  LOCK XADD [RCX], RAX   // atomically add Value, return the old value
{$ENDIF CPU64}
end;

function CAS(var Target: int; Comp, Exch: int): boolean; assembler; stdcall;
asm
{$IFDEF CPU64}
  mov  rax, comp
  lock cmpxchg [Target], Exch
  setz al
{$ENDIF CPU64}
{$IFDEF CPU32}
  mov  eax, comp
  mov  ecx, Target
  mov  edx, exch
  lock cmpxchg [ecx], edx
  setz al
{$ENDIF CPU32}
end; { CAS }

procedure TOmniMREW.EnterReadLock;
var
  currentReference: int;
begin
  // Wait for the writer to reset the write flag, i.e. Reference.Bit0 must be 0,
  // then increase Reference by 2
  repeat
    currentReference := omrewReference AND NOT 1;
  until CAS(omrewReference, currentReference, currentReference + 2);
end; { TOmniMREW.EnterReadLock }

procedure TOmniMREW.ExitReadLock;
begin
  // Decrease omrewReference by 2
  LockedExchangeAdd(omrewReference, -2);
end; { TOmniMREW.ExitReadLock }

procedure TOmniMREW.EnterWriteLock;
var
  currentReference: int;
begin
  // Wait for the writer to reset the write flag, i.e. omrewReference.Bit0 must
  // be 0, then set omrewReference.Bit0
  repeat
    currentReference := omrewReference AND NOT 1;
  until CAS(omrewReference, currentReference, currentReference + 1);
  // Now wait for the remaining readers to exit
  repeat
  until omrewReference = 1;
end; { TOmniMREW.EnterWriteLock }

procedure TOmniMREW.ExitWriteLock;
begin
  omrewReference := 0;
end; { TOmniMREW.ExitWriteLock }

end.
====

This method uses the following technique:

currentReference := omrewReference AND NOT 1;

So as long as omrewReference is 0 or an even number (a multiple of 2),
a reader can enter the MREW lock and increment omrewReference by 2,
or a writer can enter and increment omrewReference by 1; this stops
readers and the writer from crossing the CAS at the same time. After
that, the writer waits for the remaining readers to exit by spinning
in a repeat loop like this:

repeat
until omrewReference = 1;

As you will have noticed, with this method the writers will not starve
forever, and this MREW lock is also very fast. But even though it is
very fast, it doesn't scale, because there is a single point of access
on the CAS, and this can cause a lot of contention; the inter-thread
communication can also be expensive if the reader's time spent under
the MREW lock is not significant.
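Since the unit above is a port from C++, the same even/odd-reference technique can be sketched in standard C++ with std::atomic (the class and member names here are my own illustration, not taken from TOmniMREW or Vyukov's code):

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Bit 0 of `ref` is the 'writing in progress' flag; each reader adds 2.
class MREWSketch {
    std::atomic<long> ref{0};
public:
    void enterRead() {
        long cur;
        do {
            cur = ref.load() & ~1L;   // expected value has the write flag clear
        } while (!ref.compare_exchange_weak(cur, cur + 2));
    }
    void exitRead() { ref.fetch_sub(2); }
    void enterWrite() {
        long cur;
        do {
            cur = ref.load() & ~1L;   // expected value has the write flag clear
        } while (!ref.compare_exchange_weak(cur, cur + 1));
        while (ref.load() != 1) {}    // spin until the remaining readers drain
    }
    void exitWrite() { ref.store(0); }
};
```

As in the Pascal version, the CAS on a single word is the serialization point: every reader and writer hammers the same cache line, which is exactly the contention the distributed variant below is meant to spread out.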

That is why I have implemented the Distributed Reader-Writer Mutex.
I have included the following mutexes inside it: TOmniMREW, a
lightweight MREW that is very fast, and TMultiReadExclusiveWrite from
the JCL; now both of them can scale better when distributed. I have
modified Dmitry Vyukov's Distributed Reader-Writer Mutex: in the first
version I have not used GetCurrentProcessorNumber(). I have also
provided a second version that scales better; I have given you a
test.pas example for the first version and test1.pas for the second
version. To be able to use the second version, just uncomment version2
inside defines.inc and comment out version1. I have also added a cache
line alignment to TOmniMREW, and this has allowed DRWLock to scale
better.

I have provided you with the source code; please take a look at it to
understand better. The Object Pascal Distributed Reader-Writer Mutex
is based on the C++ Distributed Reader-Writer Mutex by Dmitry Vyukov.

I have also modified Dmitry Vyukov's Distributed Reader-Writer Mutex
to use a variable number of MREWs; you can pass the number of MREWs to
the constructor like this:

drw := TDRWLOCK.create(100);

You have four methods:

procedure wlock;   // same as EnterWriteLock of TOmniMREW
procedure wunlock; // same as ExitWriteLock
procedure rlock;   // same as EnterReadLock
procedure runlock; // same as ExitReadLock

and you have to pass the number of MREWs (multiple-readers-exclusive-writer)
to the constructor like this:

drw := TDRWLOCK.create(200); // here we are creating 200 MREWs
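The same four-method shape can be sketched in C++ to show the distributed idea (a sketch under my own naming; I use one std::shared_mutex per slot instead of TOmniMREW, and a thread-id hash to pick the reader's slot, so this is not the TDRWLOCK implementation itself): a reader locks just one slot of the array, while a writer locks every slot in order and so excludes all readers.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <shared_mutex>
#include <thread>
#include <vector>

class DRWSketch {
    // Pad each slot to its own cache line so readers on different
    // slots do not false-share (64 bytes is a typical x86 line size).
    struct alignas(64) Slot { std::shared_mutex m; };
    std::vector<Slot> slots;
public:
    explicit DRWSketch(std::size_t n) : slots(n) {}  // n = number of rwlocks
    std::size_t slotFor() const {                    // hash thread id to a slot
        return std::hash<std::thread::id>{}(std::this_thread::get_id())
               % slots.size();
    }
    void rlock(std::size_t s)   { slots[s].m.lock_shared(); }
    void runlock(std::size_t s) { slots[s].m.unlock_shared(); }
    void wlock()   { for (auto& s : slots) s.m.lock(); }   // always in order,
    void wunlock() { for (auto& s : slots) s.m.unlock(); } // so no deadlock
};
```

Readers scale because each one touches only its own slot's cache line; writers pay a cost proportional to the number of slots, which is the usual trade-off of this design.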

Here are some scalability numbers:

When I have used TOmniMREW of the OmniThreadLibrary on its own, it
gave a negative scalability of -5.51x.

And when I have used the second version of the Distributed
Reader-Writer Mutex, using only rlock() and runlock(), it gave me a
+3.94x scalability with four threads on four cores. So now it's
scaling.

And about the second version: don't forget to initialize the number
that you pass to rlock() and runlock() to 0 before calling rlock()
and runlock().

In the previous versions I have aligned the array elements on cache
line boundaries, as Dmitry Vyukov has done, but that didn't work
correctly when I tested the second version. So I have thought about
it and decided not to align the array elements on cache line
boundaries, but instead to add cache line padding to TOmniMREW, for
example; this time it has worked perfectly, and now the second version
is scaling perfectly.
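The padding fix described above can be sketched like this (a minimal illustration; the 64-byte cache line size and the `PaddedRef` name are my assumptions, not part of the Pascal unit): rather than relying on aligned allocation of each array element, give the lock word trailing padding so that two adjacent locks can never share a cache line.

```cpp
#include <cassert>
#include <cstddef>

struct PaddedRef {
    long ref;                        // the lock word itself
    char pad[64 - sizeof(long)];     // fill out the rest of the cache line
};
// Adjacent PaddedRef array elements now start 64 bytes apart, so a CAS on
// one lock cannot cause false sharing with its neighbor.
static_assert(sizeof(PaddedRef) == 64, "one lock per cache line");
```

Padding only guarantees the stride; if the array's base address is also line-aligned, each lock sits exactly in its own line, but even without that, no two locks ever overlap the same line.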

And if you have noticed, there is still a weakness in Dmitry Vyukov's
C++ Distributed Reader-Writer Mutex: since he is using
GetCurrentProcessorNumber(), he is limiting the array of rwlocks to
the number of available cores, and I think this is not good, because
if you have many more threads than available cores and there is high
contention, performance will degrade. So I have decided to change
that in my implementation: I have used a variable number of
rwlocks/MREWs so that you can lower the contention further, and this
is better for scalability.