
Author Topic: A new and scalable FIFO fair lock  (Read 29711 times)

aminer

  • Hero Member
  • *****
  • Posts: 956
A new and scalable FIFO fair lock
« on: April 11, 2014, 04:05:29 am »

Hello,

I was thinking more these days about my scalable array-based lock,
but the problem with it is that each element of the array
must be the same size as a cache line, hence my scalable array-based lock takes more memory. So I have decided to come up with another new scalable FIFO fair lock that does not take too much memory and that is scalable. You have to be smart please and have to know that on very small critical sections the TicketSpinlock is 2x faster than my scalable FIFO fair lock, but with a slightly bigger critical section my scalable FIFO fair lock will have almost the same speed as the TicketSpinlock on a quad core, and it will be faster when you use more cores and more threads with a slightly bigger critical section.


Description:

A scalable FIFO fair lock.

Please take a look at the test.pas Object Pascal demo inside the zipfile; compile and run it...


You can download scalable FIFO fair lock from:

http://pages.videotron.com/aminer/


Language: FPC Pascal v2.2.0+ / Delphi 7+: http://www.freepascal.org/

Operating Systems: Windows, Mac OS X, Linux, Unix...


Required FPC switches: -O3 -Sd -dFPC -dFreePascal

-Sd for delphi mode....

{$DEFINE CPU32} and {$DEFINE Windows32} for 32 bit systems

{$DEFINE CPU64} and {$DEFINE Windows64} for 64 bit systems



Thank you,
Amine Moulay Ramdane.


hinst

  • Sr. Member
  • ****
  • Posts: 303
Re: A new and scalable FIFO fair lock
« Reply #1 on: April 11, 2014, 12:08:22 pm »
I am currently looking at your library "FIFO fair lock", and I have some questions, if you don't mind.
What I have been thinking about lately is: how to make reference counting in FPC both fast and thread-safe. I currently use standard TRTLCriticalSection type from FPC RTL to synchronize reference counting for my refcounted objects.

While for trivial tasks thread synchronization performance is not critical, I believe that it becomes critical in the case of reference-counted objects. If I want automatic memory management for most objects in my application, then I end up with a critical section attached to every single object; and the more refcounted objects I use, the more frequently the reference counting functions get called.

I wanted to compare the performance of various approaches to thread synchronization, but to do this I need at least one more way of synchronizing threads, and currently I only have TRTLCriticalSection.

I Googled "faster thread synchronization" and found this article:
http://stackoverflow.com/questions/10770171/how-to-make-thread-synchronization-without-using-mutex-semorphore-spinlock-and
The author suggests using "interlocked" functions to synchronize threads. This way of synchronizing is a lot faster, but only when the synchronized operations are short and a race is unlikely to occur, which is exactly the case with reference counting.
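As an illustration, interlocked reference counting can be sketched in C with C11 atomics; the RcObject type and the rc_* function names below are made up for illustration, not taken from any of the libraries discussed in this thread:

```c
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical refcounted object: type and function names are illustrative. */
typedef struct {
    atomic_int refcount;
    int payload;
} RcObject;

RcObject *rc_new(int payload) {
    RcObject *o = malloc(sizeof *o);
    atomic_init(&o->refcount, 1);   /* a new object starts with one owner */
    o->payload = payload;
    return o;
}

void rc_retain(RcObject *o) {
    /* one interlocked increment: no critical section needed */
    atomic_fetch_add_explicit(&o->refcount, 1, memory_order_relaxed);
}

int rc_release(RcObject *o) {
    /* interlocked decrement; the last owner frees the object */
    if (atomic_fetch_sub_explicit(&o->refcount, 1, memory_order_acq_rel) == 1) {
        free(o);
        return 1;   /* object was freed */
    }
    return 0;       /* other owners remain */
}
```

Each retain/release is a single atomic instruction rather than a lock acquire/release pair, which is why this approach wins for short operations such as reference counting.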

I did not fully understand how exactly to implement thread synchronization using interlocked functions, but then I realized that your library does just this. Is that right?

I noticed that on your web page you have a list of various libraries like "scalable array based lock", "scalable FIFO fair lock" and others. Can you tell which of these libraries is best fit to synchronize reference counting? My guess is "fair lock".

I see that some of your functions contain assembler code. I have no idea what it does. How stable is it? Well, if it works, and also works faster than OS-level synchronization primitives like mutexes and critical sections, then it's just great.

BTW: what I think could be done to further improve the performance is to change your locker type from class to object. Maybe the compiler will generate fewer pointer-dereferencing instructions this way; but I'm not 100% sure.
« Last Edit: April 11, 2014, 12:10:37 pm by hinst »
Too late to escape fate

aminer

  • Hero Member
  • *****
  • Posts: 956
Re: A new and scalable FIFO fair lock
« Reply #2 on: April 11, 2014, 04:52:55 pm »

hinst wrote:

>I noticed that on your web page you have a list of various libraries like "scalable array based lock", "scalable FIFO fair lock" and others. Can you tell which of these libraries is best fit to synchronize reference counting? My guess is "fair lock".


You can use the TicketSpinlock with a proportional backoff; it is on my website
and it will do the job very well.
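For illustration, a ticket spinlock with proportional backoff can be sketched in C with C11 atomics; this is a minimal sketch of the general technique, not the Object Pascal code from the website:

```c
#include <stdatomic.h>

typedef struct {
    atomic_uint next_ticket;  /* ticket dispenser */
    atomic_uint now_serving;  /* ticket currently allowed into the lock */
} TicketLock;

static void cpu_relax(void) {
#if defined(__x86_64__) || defined(__i386__)
    __asm__ __volatile__("pause");
#else
    atomic_signal_fence(memory_order_seq_cst);
#endif
}

void ticket_lock(TicketLock *l) {
    unsigned my = atomic_fetch_add(&l->next_ticket, 1);
    for (;;) {
        unsigned cur = atomic_load(&l->now_serving);
        if (cur == my)
            break;
        /* back off in proportion to our distance from the head of the line,
           so far-away waiters poll now_serving less often */
        for (unsigned i = 0; i < (my - cur) * 64u; i++)
            cpu_relax();
    }
}

void ticket_unlock(TicketLock *l) {
    atomic_fetch_add(&l->now_serving, 1);  /* admit the next ticket holder */
}
```

The proportional part is the inner loop: a thread whose ticket is far from now_serving waits longer between loads, reducing cache-coherence traffic on the shared counter.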

You can download TicketSpinlock with a proportional backoff from:

http://pages.videotron.com/aminer/


Thank you,
Amine Moulay Ramdane.

aminer

  • Hero Member
  • *****
  • Posts: 956
Re: A new and scalable FIFO fair lock
« Reply #3 on: April 12, 2014, 12:53:23 am »
Hello,

I have just corrected a bug in my new scalable FIFO fair lock,
and I have tested it thoroughly; it is stable now.
Other than that, as you have noticed, I am using a
FIFO queue that uses a Ticket spinlock with a proportional backoff,
but to optimize it more for speed you can use a waitfree queue.

Why have I invented my new scalable FIFO fair lock instead of
just using a TicketSpinlock with a proportional backoff?
Because with the TicketSpinlock, if the time spent in the critical section
becomes bigger, the proportional backoff will not be able to
do its job correctly and the TicketSpinlock will not scale to more and more cores...

You can download my new scalable FIFO fair lock version 1.01
from:

http://pages.videotron.com/aminer/


Thank you,
Amine Moulay Ramdane.




aminer

  • Hero Member
  • *****
  • Posts: 956
Re: A new and scalable FIFO fair lock
« Reply #4 on: April 12, 2014, 02:50:12 am »
Hello,


We have to be smart please: as you have noticed, I have
written another scalable FIFO fair lock and I have
studied it deeply, and I have come to the conclusion
that it is useful to learn and useful to use. I
began to test it under contention to see how it behaves,
under contention with 2 threads on a quad core and
under contention with 4 threads. What I discovered is
that under contention with two threads,
the following section in the Enter() method
is reached about 1/2 of the total number of times the threads enter the Enter() method.

Here is the section; it is taken when PMyRecord4(a)^.myid equals -1:

if PMyRecord4(a)^.myid=0
then
  begin
    PMyRecord4(a)^.count:=1;
    exit;
  end
else
  begin
    freemem(PMyRecord4(a)^.mem); { <- this one }
  end;
end

But under contention with 4 threads this same section
is reached almost 0 times. So if you want to test
the scalability of the two-thread scenario under contention,
you will not be able to do it accurately, because under contention
with two threads the queue.pop(tobject(a)) statement inside the Enter() method will be cheap compared to the 4-thread scenario under contention. But what I can tell you for sure is that with 4 threads and more, and with more cores under contention, my FIFO fair lock
will be scalable.

But we have to be smarter than that. You might say that my
scalable FIFO fair lock is almost 2x slower than the
Ticket spinlock with a proportional backoff, but you have
to understand that the TicketSpinlock will not scale with
bigger critical sections, while my scalable FIFO fair lock
will. Other than that, you have to understand that as the time
inside the critical section becomes bigger, the speed
of my scalable FIFO fair lock becomes almost equal to that of
the TicketSpinlock with a proportional backoff. Other than
that, to make my scalable FIFO fair lock faster you can use
a waitfree FIFO queue on the pop side of the queue.

You can download my scalable FIFO fair lock from:

http://pages.videotron.com/aminer/




Thank you,
Amine Moulay Ramdane.
 

aminer

  • Hero Member
  • *****
  • Posts: 956
Concurrent FIFO queue version 2.01
« Reply #5 on: April 12, 2014, 03:37:28 am »
Hello...


I have come to another subject...

Take a look at my scalable and relaxed MPMC and almost strict FIFO priority Queue version 1.07 here:

http://pages.videotron.com/aminer/


As I have told you, this queue is very important because it is scalable.
That means that if you have four threads on a quad core, each pushing at, for example, 1,000,000 pushes per second, you will be able to have
a total throughput of 4,000,000 pushes per second, so it will scale, and with more cores
it will scale more and you will have even more throughput than that...


But you have to be smart please: as you have noticed, my scalable
almost-strict FIFO queue is composed of multiple queues that run in parallel, and it has to be used cleverly to be scalable. So imagine that you have a quad core with two producer threads and
4 consumer threads, and you want all 4 cores to be utilized fully by the consumer threads; then you have to model my scalable queue as
follows:

You have to create 2 of my scalable and almost-strict FIFO queues,
each with 4 queues, and the first producer thread has to distribute the items in a round-robin manner to the two queues..., and the second producer thread must do the same. This is how you can model a scenario with a number of producers smaller than the number
of consumer threads (equal to the number of cores) so that it is scalable.
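The round-robin distribution step described above can be sketched as follows; this is a minimal C sketch where NQUEUES and rr_pick are illustrative names, not part of the library (rr_pick just selects which sub-queue receives the next item):

```c
/* One producer distributes its items across NQUEUES sub-queues in
   round-robin order; rr_pick returns the index of the sub-queue that
   should receive the next item. */
#define NQUEUES 4

typedef struct {
    unsigned next;  /* index of the sub-queue for the next item */
} RoundRobin;

unsigned rr_pick(RoundRobin *rr) {
    unsigned q = rr->next;
    rr->next = (rr->next + 1u) % NQUEUES;  /* cycle 0,1,...,NQUEUES-1,0,... */
    return q;
}
```

Each producer thread keeps its own RoundRobin state, so the distribution needs no shared counter and adds no contention of its own.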

The other scenario is the following: if you have more producer threads
than the number of cores that you want to utilize fully, you have
to create one of my scalable queues with the same number of queues as the number of producers, and the consumers have to enter a semaphore with a counter equal to the number of cores.


Hope you have understood my idea.


You can download my scalable and relaxed MPMC and almost strict FIFO priority Queue version 1.07 from:

http://pages.videotron.com/aminer/


Thank you,
Amine Moulay Ramdane.

aminer

  • Hero Member
  • *****
  • Posts: 956
Concurrent FIFO queue version 2.01
« Reply #6 on: April 13, 2014, 08:08:55 pm »
Hello,

I have studied deeply the following FIFO queue;
it is waitfree on the push() and lockfree on the pop()...

http://pastebin.com/f72cc3cc1


What I have done is replace the array-based FIFO queue inside
my scalable FIFO fair lock with the node-based FIFO queue
above that is waitfree on the push... But what I have noticed
in my benchmarks is that it is 32% slower than my array-based
FIFO queue that uses a Ticket spinlock with a proportional backoff.
And if you take a look at the FIFO queue above, it is using
the same technique as the scalable MCS lock or the scalable CLH lock,
which use a node-based FIFO queue; so I think that since the scalable MCS lock and the scalable CLH lock are using a node-based FIFO queue,
they become slower than the array-based FIFO queues by 32%.
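For reference, the node-based FIFO queueing technique used by the CLH lock can be sketched in C; this is a minimal single-flag sketch of the general idea (assuming the caller recycles the returned predecessor node), not any of the actual library code being benchmarked:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Each thread enqueues its own node and spins on its predecessor's flag,
   forming an implicit node-based FIFO queue of waiters. */
typedef struct clh_node {
    atomic_bool locked;
} clh_node;

typedef struct {
    _Atomic(clh_node *) tail;  /* most recently enqueued waiter */
} clh_lock;

/* Returns the predecessor node (NULL if the lock was free); the caller
   typically recycles it as its own node for the next acquisition. */
clh_node *clh_acquire(clh_lock *l, clh_node *my) {
    atomic_store(&my->locked, true);
    clh_node *pred = atomic_exchange(&l->tail, my);
    if (pred != NULL)
        while (atomic_load(&pred->locked))
            ;  /* spin on the predecessor's flag only: local spinning */
    return pred;
}

void clh_release(clh_node *my) {
    atomic_store(&my->locked, false);  /* hand the lock to our successor */
}
```

The single atomic_exchange on tail gives FIFO ordering for free, which is the same queue-of-waiters structure the pastebin queue builds with nodes.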


Thank you,
Amine Moulay Ramdane.


 

aminer

  • Hero Member
  • *****
  • Posts: 956
Concurrent FIFO queue version 2.01
« Reply #7 on: April 14, 2014, 10:40:12 pm »
Hello,


We have to be smart more than that...

So follow with me please...

The following Bakery concurrent FIFO queue algorithm has a weakness:

https://groups.google.com/d/topic/lock-free/acjQ3-89abE/discussion


Look at the consumer function:

double consumer() {
    uint32_t ver = XADD(&tail, 1);
    cell& c = cells[ver & (N - 1)];
    while (LOAD(&c.ver) != ver + 1) backoff();
    double state = c.state;
    STORE(&c.ver, ver + N);
    return state;
}



As you have noticed, when LOAD(&c.ver) == ver + 1
the thread will then pop the item... but this is a weakness;
I will explain to you why...

With a CAS-based method, the pop() method can use a backoff mechanism
when the CAS fails under high contention, and this gives
3x more throughput than the bakery algorithm on the pop()
side. I have benchmarked it and it has been confirmed by
my benchmarks: the CAS-based method with a backoff gives
35 million pops per second on my quad core, and the Bakery gives
12 million pops per second on my quad core.
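The CAS-with-backoff technique being compared can be sketched in C; cas_claim_slot is a made-up name and only the index-claiming step of a pop() is shown (not a full queue), with an exponential backoff after each failed CAS:

```c
#include <stdatomic.h>

/* Claims the next slot index with a CAS, backing off exponentially after
   each failed attempt. Under high contention the backoff spaces out the
   retries instead of hammering the shared tail counter. */
unsigned cas_claim_slot(atomic_uint *tail) {
    unsigned t = atomic_load(tail);
    unsigned delay = 1;
    while (!atomic_compare_exchange_weak(tail, &t, t + 1)) {
        /* a failed CAS reloads the current value into t; wait before retrying */
        for (volatile unsigned i = 0; i < delay; i++)
            ;  /* ideally one "pause" per iteration */
        if (delay < 1024u)
            delay <<= 1;  /* exponential backoff, capped */
    }
    return t;  /* the index this thread now owns */
}
```

The key contrast with the bakery consumer above is that a failed CAS lets the thread back off and retry, whereas the bakery thread must keep waiting for its exact ticket.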

To alleviate this weakness of the Bakery concurrent FIFO queue
you will have to choose another model, like the following.

Imagine that you want to design a Threadpool that is efficient.
If you use a single FIFO queue for all the consumers, this will not
be so fast and efficient; to be more efficient and fast you have to use a concurrent FIFO queue for each consumer in your Threadpool,
but the producer has to be a single thread that
pops from a concurrent FIFO queue where multiple threads
push their items. This single producer thread of the Threadpool
engine will round-robin between the queues of each consumer
and put in the items. Since you are using a single producer thread,
variables like the tail variable above will be accessed only from the local cache, so there will be no cache transfer for the tail variable etc., so this will be cheaper and faster, and I think this will give the same throughput as the CAS with a backoff on the pop side: this will give us 35 million pops per second
on my quad core, and this method can be applied to the Bakery concurrent
FIFO queue.

In my previous post I said that the service rate of the pop() is limited by the arrival rate of the push() under high contention,
but that's not completely true, because with the Threadpool engine
the service rate on the consumer side is lowered anyway
(the consumers have to do some work), so the service rate can be made
faster, by 3x, by applying the above method.



Thank you,
Amine Moulay Ramdane.

aminer

  • Hero Member
  • *****
  • Posts: 956
Concurrent FIFO queue version 2.01
« Reply #8 on: April 19, 2014, 04:24:23 am »
Hello,


A concurrent FIFO Queue version 2.0


Authors: Amine Moulay Ramdane and Dariusz Mazur.


Description:

A concurrent FIFO queue that satisfies many requirements: it is FIFO fair on the push() side and lockfree on the pop() side. It uses only an atomic increment plus a proportional backoff on the push() side to optimize it more, and it uses a CAS on the pop() side. It minimizes the cache-coherence traffic efficiently, and it is energy efficient on the pop() side: when there are no items in the queue it will not spin-wait, but will block and wait on my SemaMonitor. This concurrent FIFO queue is limited to 1000 threads on the push() side; if you want to raise that, just modify the "margin" constant in the source code. This concurrent FIFO queue gives a throughput of 3.2 million transactions per second on my 2.4 GHz quad core; that's cool and a decent throughput.

Please see more information here: Concurrent FIFO queue.

Please take a look at the test.pas Object Pascal demo inside the zipfile; compile and run it...


You can download my concurrent FIFO queue from:

http://pages.videotron.com/aminer/


Language: FPC Pascal v2.2.0+ / Delphi 7+: http://www.freepascal.org/

Operating Systems: Windows, Mac OS X, Linux, Unix...


Required FPC switches: -O3 -Sd -dFPC -dFreePascal

-Sd for delphi mode....

{$DEFINE CPU32} and {$DEFINE Windows32} for 32 bit systems

{$DEFINE CPU64} and {$DEFINE Windows64} for 64 bit systems



Thank you,
Amine Moulay Ramdane.

aminer

  • Hero Member
  • *****
  • Posts: 956
Re: Concurrent FIFO queue version 2.0
« Reply #9 on: April 19, 2014, 08:39:21 pm »

Hello,

The zip link was not working correctly; it must be CQueue1.zip.
I have just corrected that. Please download my concurrent FIFO Queue version 2.0 again from:
 

http://pages.videotron.com/aminer/


Thank you,
Amine Moulay Ramdane.

aminer

  • Hero Member
  • *****
  • Posts: 956
Concurrent FIFO queue version 2.01
« Reply #10 on: April 20, 2014, 02:43:51 am »
Hello,


I have discovered a serious problem today. When you are using a TicketSpinlock or an array-based lock, don't use only the "pause" asm instruction when you are spinning waiting for the lock, and don't use a proportional backoff that uses only the "pause" asm instruction when spinning either, because this can lead to a big problem that looks like a lock convoy: if you start more threads than cores, for example, the system will switch between the threads, giving each thread its time quantum even if the thread does not enter the Enter() method of the lock, and this can slow down by much the threads entering the lock. This looks like a lock convoy and it is not acceptable.

So the solution is to use a sleep(0) when you are spinning waiting for the lock inside the Enter() method of the lock. But I have noticed that sleep(0), for example, lowers the throughput of my concurrent FIFO queue a lot, by 3x, and SwitchToThread() also lowers the throughput of my concurrent FIFO queue by 3x... so this is very sad! So from now on be careful with this serious problem; I have just tested it on my computer, it showed up, and it is a serious problem!
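The sleep(0)-style fix described above can be sketched as a spin loop that yields after a bounded number of iterations; this is a C sketch using POSIX sched_yield as the analogue of Windows Sleep(0), and spin_until_set and the 1000-iteration threshold are illustrative assumptions:

```c
#include <stdatomic.h>
#include <sched.h>

/* Spin briefly, then give up the rest of the time quantum so waiters don't
   burn whole time slices when there are more threads than cores (the
   convoy-like problem described above). */
void spin_until_set(atomic_int *flag) {
    int spins = 0;
    while (!atomic_load_explicit(flag, memory_order_acquire)) {
        if (++spins >= 1000) {
            spins = 0;
            sched_yield();  /* POSIX analogue of Windows Sleep(0) */
        }
    }
}
```

Yielding lets a descheduled lock holder run again instead of the waiter spinning through its entire quantum, at the cost of the extra context switches that hurt throughput in the benchmark above.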


Thank you,
Amine Moulay Ramdane.

aminer

  • Hero Member
  • *****
  • Posts: 956
Re: Please read the following
« Reply #11 on: April 20, 2014, 03:59:24 am »

Hello...

I think I will now change the fair lock to an unfair lock inside my SemaCondvar.pas, which is used by my concurrent FIFO queue; this has given a throughput of 7.4 million transactions per second for my concurrent FIFO queue. And what about the starvation problem?

Please read this:

"The change to unfair locks clearly has the risk of leading to starvation.  But, statistically speaking, timing in concurrent systems tends to be so volatile that each thread will eventually get its turn to run, probabilistically speaking.  Many more programs would suffer from the convoy problems resulting from fair locks than would notice starvation happening in production systems as a result of unfair locks."


Please read the following:



http://joeduffyblog.com/2006/12/14/anticonvoy-locks-in-windows-server-2003-sp1-and-windows-vista/



Thank you,
Amine Moulay Ramdane.

aminer

  • Hero Member
  • *****
  • Posts: 956
Concurrent FIFO queue version 2.01
« Reply #12 on: April 21, 2014, 10:00:31 pm »
Hello,

I have updated my concurrent FIFO queue to version 2.01:
I have added a sleep(0) with a proportional backoff
to the TicketSpinlock to avoid the serious problem that I spoke
about in my previous post, and my concurrent FIFO queue
is now giving 3.2 million transactions per second on
my 2.4 GHz quad core... and it is stable now. I have applied
this patch to all my other programs too, so I advise you to
download all my other programs again if you need them.


Description:

A concurrent FIFO queue that satisfies many requirements: it is FIFO fair on the push() side and lockfree on the pop() side. It uses only an atomic increment plus a proportional backoff on the push() side to optimize it more, and it uses a CAS on the pop() side. It minimizes the cache-coherence traffic efficiently, and it is energy efficient on the pop() side: when there are no items in the queue it will not spin-wait, but will block and wait on my SemaMonitor. This concurrent FIFO queue is limited to 1000 threads on the push() side; if you want to raise that, just modify the "margin" constant in the source code. This concurrent FIFO queue gives a throughput of 3.2 million transactions per second on my 2.4 GHz quad core; that's cool and a decent throughput.


You can download my concurrent FIFO queue from:

http://pages.videotron.com/aminer/


Please take a look at the test.pas Object Pascal demo inside the zipfile; compile and run it...

Language: FPC Pascal v2.2.0+ / Delphi 7+: http://www.freepascal.org/

Operating Systems: Windows, Mac OS X, Linux, Unix...


Required FPC switches: -O3 -Sd -dFPC -dFreePascal

-Sd for delphi mode....

{$DEFINE CPU32} and {$DEFINE Windows32} for 32 bit systems

{$DEFINE CPU64} and {$DEFINE Windows64} for 64 bit systems



Thank you,
Amine Moulay Ramdane.





aminer

  • Hero Member
  • *****
  • Posts: 956
A concurrent waitfree FIFO queue version 1.0
« Reply #13 on: April 22, 2014, 01:32:20 am »

Hello,


A concurrent waitfree FIFO queue version 1.0


Authors: Christopher Michael Thomasson and Amine Moulay Ramdane (author of the Object Pascal code and SemaMonitor)


Description:

A concurrent FIFO queue that satisfies many requirements: it is waitfree on the push() side and waitfree on the pop() side, it minimizes the cache-coherence traffic efficiently, and it is energy efficient on the pop() side: when there are no items in the queue it will not spin-wait, but will block and wait on my SemaMonitor. This concurrent FIFO queue gives a throughput of 3.2 million transactions per second on my 2.4 GHz quad core; that's cool and a decent throughput.

You can download my concurrent waitfree FIFO queue from:

http://pages.videotron.com/aminer/


Please take a look at the test.pas Object Pascal demo inside the zipfile; compile and run it...

Language: FPC Pascal v2.2.0+ / Delphi 7+: http://www.freepascal.org/

Operating Systems: Windows, Mac OS X, Linux, Unix...


Required FPC switches: -O3 -Sd -dFPC -dFreePascal

-Sd for delphi mode....

{$DEFINE CPU32} and {$DEFINE Windows32} for 32 bit systems

{$DEFINE CPU64} and {$DEFINE Windows64} for 64 bit systems


Thank you,
Amine Moulay Ramdane.



aminer

  • Hero Member
  • *****
  • Posts: 956
A fast concurrent FIFO queue version 1.0
« Reply #14 on: April 23, 2014, 01:05:48 am »

Hello...


I have replaced the TicketSpinlock with a Spinlock with a backoff,
and the throughput has been enhanced to 6.4 million transactions per second,
and that's a very good throughput. It is still FIFO fair on the push() side and on the pop() side.

And what about starvation ?

Read this:

"The change to unfair locks clearly has the risk of leading to starvation.  But, statistically speaking, timing in concurrent systems tends to be so volatile that each thread will eventually get its turn to run, probabilistically speaking.  Many more programs would suffer from the convoy problems resulting from fair locks than would notice starvation happening in production systems as a result of unfair locks."

Please read the following:

http://joeduffyblog.com/2006/12/14/anticonvoy-locks-in-windows-server-2003-sp1-and-windows-vista/


Authors: Christopher Michael Thomasson and Amine Moulay Ramdane (author of the Object Pascal code and SemaMonitor)


Description:

A fast concurrent FIFO queue that satisfies many requirements: it is FIFO fair on the push() and pop() sides, it minimizes the cache-coherence traffic efficiently, and it is energy efficient on the pop() side: when there are no items in the queue it will not spin-wait, but will block and wait on my SemaMonitor. This fast concurrent FIFO queue gives a throughput of 6.4 million transactions per second on my 2.4 GHz quad core; that's a very good throughput.

Please take a look at the test.pas Object Pascal demo inside the zipfile; compile and run it...

Language: FPC Pascal v2.2.0+ / Delphi 7+: http://www.freepascal.org/

Operating Systems: Windows, Mac OS X, Linux, Unix...


Required FPC switches: -O3 -Sd -dFPC -dFreePascal

-Sd for delphi mode....

{$DEFINE CPU32} and {$DEFINE Windows32} for 32 bit systems

{$DEFINE CPU64} and {$DEFINE Windows64} for 64 bit systems


Thank you,
Amine Moulay Ramdane.


 
