Recent

Author Topic: Lockfree_mpmc and scalability ...  (Read 8876 times)

aminer

  • Hero Member
  • *****
  • Posts: 956
Lockfree_mpmc and scalability ...
« on: May 28, 2012, 09:49:52 pm »
Hello all,


I have finally found why lockfree_mpmc doesn't scale...

you can get the the source code of lockfree_mpmc from:

http://pages.videotron.com/aminer/

So please follow with me..

If you take a look at lockfree_mpmc object pascal
source code you will read this on the push side:


---

function TLockfree_MPMC.push(tm : tNodeQueue):boolean;
var lasttail,newtemp:longword;
i,j:integer;
begin

if getlength >= fsize
 then
  begin
    result:=false;
   exit;
end;

result:=true;

newTemp:=LockedIncLong(temp);
lastTail:=newTemp-1;

setObject(lastTail,tm);

repeat
 if CAS(tail,lasttail,newtemp)
 then
   begin
    exit;
   end;
asm pause end;
until false;
end;

---

When i have tested the push() side with 4 threads i have noticed that lockfree_mpmc
doesn't scale at all., in fact i have got a retrograde throughput, that means that
i got less throughput than on a single thread  test.. and i have finally found
why lockfree_mpmc doesn't scale.  When you are using a lockfree_mpmc
on a single thread test the CAS does read and update the variables on the
level 1 cache, and it's fast, but when you are using 4 threads it does get
too slow cause we are reading and updating from the L2  and from the memory.

I have thried to play with the affinity mask and i have found that when i am
using two threads on my tests and reading and updating from the same level 2 cache
it does scale a little bit more and i have got more throughput with two threads
on different cores and on the same level 2 cache than the single threadtest.


I have also modified lockfree_mpmc to not touch the CAS and
the cache when tail and lasttail are not equal by using the following code inside
the repeat until loop:

if tail <> lasttail
then
begin
continue;
end;

and it does give  better  performance with this method

here is the final code of the push() side of lockfree_mpmc..

i think i will modify the pop() side like that...


---
function TLockfree_MPMC.push(tm : tNodeQueue):boolean;
var lasttail,newtemp:longword;
i,j:integer;
begin

if getlength >= fsize
 then
  begin
    result:=false;
   exit;
end;

result:=true;

newTemp:=LockedIncLong(temp);
lastTail:=newTemp-1;

setObject(lastTail,tm);

repeat

if tail <> lasttail
then
 begin
  continue;
 end;

 if CAS(tail,lasttail,newtemp)
 then
   begin
    exit;
   end;
asm pause end;
until false;
end;
---

But as i have said before lockfree_mpmc doesn't scale when we are
using different cores and WE ARE NOT  sharing the same cache,
that means that on my Intel Core 2 Quad Q6600 it does scale only
when we are using 2 threads on different cores that shares the same
level2 cache.



Thank you.


Amine Moulay Ramdane.


aminer

  • Hero Member
  • *****
  • Posts: 956
Re: Lockfree_mpmc and scalability ...
« Reply #1 on: May 28, 2012, 10:41:49 pm »

Hello,


I have only tested lockfree_mpmc on a Intel Core 2 Quad Q6600,
i don't have here an L3 cache, but perhaps  lockfree_mpmc
will scale on an x86 that have an L3 cache.


I didn't tested it with an L3 cache, can you please do it for me
if you have an L3 cache and a quad core or more core on your x86 computer ?



Just download the Lockfree MPMC and SPMC fifo queues version 1.12
from http://pages.videotron.com/aminer/ and look inside the zip
file i have put a push.pas and a pop.pas tests, just open for example the
push.pas test and test it with a single threads and after that with 4 threads 
by giving the variable a the value of 1 and after that 4 and after that just
email me the throughput for 1 and 4 threads.



Thank you.

Amine Moulay Ramdane.



aminer

  • Hero Member
  • *****
  • Posts: 956
Re: Lockfree_mpmc and scalability ...
« Reply #2 on: May 29, 2012, 12:52:47 am »

Hello,

Here it is, i have compiled the push.pas to
push1.exe (one thread) and push4.exe (four threads)

Here they are:

http://pages.videotron.com/aminer/push1.exe

http://pages.videotron.com/aminer/push4.exe


If your computer is an x86 and you have an L3 cache
and your computer have  4 or more cores  , can please
run those two programs and give me there output...



Thank you.

Amine Moulay Ramdane.


 


Blaazen

  • Hero Member
  • *****
  • Posts: 2994
  • POKE 54296,15
    • Eye-Candy Controls
Re: Lockfree_mpmc and scalability ...
« Reply #3 on: May 29, 2012, 01:20:05 am »
I don't have such CPU.
So here I paste link to list of CPUs to see which one has L3 cache:
http://en.wikipedia.org/wiki/Comparison_of_Intel_processors
(there is also link to AMD CPUs below)
Lazarus 2.1.0 r64115 FPC 3.3.1 r40507 x86_64-linux-qt Chakra, Qt 4.8.7/5.13.2, Plasma 5.17.3
Lazarus 1.8.2 r57369 FPC 3.0.4 i386-win32-win32/win64 Wine 3.21

Try Eye-Candy Controls: https://sourceforge.net/projects/eccontrols/files/

BigChimp

  • Hero Member
  • *****
  • Posts: 5740
  • Add to the wiki - it's free ;)
    • FPCUp, PaperTiger scanning and other open source projects
Re: Lockfree_mpmc and scalability ...
« Reply #4 on: May 29, 2012, 08:03:55 am »
Thanks Blaazen - on a Core i7 quad core, Windows Vista x64, I get:

push1.exe
Code: [Select]
throughput is: 6045617,03 push transations/s

push4.exe
Code: [Select]
throughput is: 6148518,32 push transations/s
Want quicker answers to your questions? Read http://wiki.lazarus.freepascal.org/Lazarus_Faq#What_is_the_correct_way_to_ask_questions_in_the_forum.3F

Open source including papertiger OCR/PDF scanning:
https://bitbucket.org/reiniero

Lazarus trunk+FPC trunk x86, Windows x64 unless otherwise specified

ttomas

  • Full Member
  • ***
  • Posts: 198
Re: Lockfree_mpmc and scalability ...
« Reply #5 on: May 29, 2012, 09:46:34 am »
Win 7 Pro 64b, CPU Core i5 2500k, 4 Cores, 6M L3 Cashe

D:\test>push1.exe
throughput is: 10993436,92 push transations/s
D:\test>push1.exe
throughput is: 10997357,05 push transations/s
D:\test>push1.exe
throughput is: 11093107,64 push transations/s

D:\test>push4.exe
throughput is: 8976093,31 push transations/s
D:\test>push4.exe
throughput is: 9159408,90 push transations/s
D:\test>push4.exe
throughput is: 9161709,71 push transations/s

D:\test>

aminer

  • Hero Member
  • *****
  • Posts: 956
Re: Lockfree_mpmc and scalability ...
« Reply #6 on: May 29, 2012, 03:28:00 pm »


Hello,


I think that it doesn't scale , cause when we are using 4 threads
lockfree-mpmc is reading and updating from the memory
also.

But i will try to align the array on 64 bytes (cache line length) and 
making the data inside the array the same size as a cache line and
i will ask you to retest and see if this will enhance the scalability.


Thank you.

Amine Moulay Ramdane.




aminer

  • Hero Member
  • *****
  • Posts: 956
Re: Lockfree_mpmc and scalability ...
« Reply #7 on: May 29, 2012, 04:57:07 pm »

Hello,

Here it is, i have aligned the array on 64 bytes (cache line length)
and  have made the data inside the array to the same size as
the cache line and i have compiled push.pas to
push1.exe (one thread) and push2.exe (two threads) and
push4.exe (four threads)

Here they are:

http://pages.videotron.com/aminer/push1.exe

http://pages.videotron.com/aminer/push2.exe

http://pages.videotron.com/aminer/push4.exe



If your computer is an x86 and you have an L3 cache
and your computer have  4 or more cores  , can you please
restest with those programs and give me there output...



Thank you.

Amine Moulay Ramdane.



SlowCoder

  • New member
  • *
  • Posts: 8
Re: Lockfree_mpmc and scalability ...
« Reply #8 on: May 29, 2012, 08:02:14 pm »
Hello,

On WinXP-pro-SP3, i3 CPU 540@3.07GHz;

10001

throughput is: 5663080.66 push transations/s

10001

throughput is: 5780925.11 push transations/s

10001

throughput is: 5797681.42 push transations/s

20002

throughput is: 4144633.42 push transations/s

20002

throughput is: 4541780.40 push transations/s

20002

throughput is: 3922730.12 push transations/s

40004

throughput is: 242588.17 push transations/s

40004

throughput is: 203511.24 push transations/s

40004

throughput is: 586285.25 push transations/s

Last 4004 result made me do some more runs of push4:
258,959.10
2,452,427.77
2,585,406.95
1,477,034.48
207,278.92
2,720,620.36
2,561,731.67
1,491,295.58
1,277,307.76
268,322.96
2,581,236.40
2,540,743.21

Good luck

aminer

  • Hero Member
  • *****
  • Posts: 956
Re: Lockfree_mpmc and scalability ...
« Reply #9 on: May 29, 2012, 08:10:16 pm »


Hello,

I don't think i3 CPU 540 have an L3 cache.


Can somebody that has an L3 cache with 4 cores or more
redo this tests for me ?

Here they are:

http://pages.videotron.com/aminer/push1.exe

http://pages.videotron.com/aminer/push2.exe

http://pages.videotron.com/aminer/push4.exe



Amine Moulay Ramdane.



SlowCoder

  • New member
  • *
  • Posts: 8
Re: Lockfree_mpmc and scalability ...
« Reply #10 on: May 29, 2012, 08:18:47 pm »
Hello,

i3 540 has 4M L3 cache, but as I see now, only 2 cores.

Anyways
Good luck

aminer

  • Hero Member
  • *****
  • Posts: 956
Re: Lockfree_mpmc and scalability ...
« Reply #11 on: May 29, 2012, 08:50:00 pm »

Hello,

Do you know why this lock free fifo doesn't scale, cause
look at the following code on the push() side:

--

function TLockfree_MPMC.push(tm : tNodeQueue):boolean;
var lasttail,newtemp:longword;
i,j:integer;
begin

 if getlength >= fsize
  then
      begin
          result:=false;
          exit;
      end;
result:=true;
newTemp:=LockedIncLong(temp);

lastTail:=newTemp-1;
setObject(lastTail,tm);

repeat

 if CAS(tail,lasttail,newtemp)
   then
      begin
        exit;
      end;
asm pause end;

 until false;


end;

---



You have two thinks:

[1] newTemp:=LockedIncLong(temp);

[2] CAS(tail,lasttail,newtemp)

In the 4 threads scenario , as you can see
in [1] temp has to be loaded from the L2 cache
of the other cores, and that's slow compared
to the single thread version that loads the value
from the local L1 cache.

that's the same for [2] , tail has to be loaded from
the L2 cache of the other cores, and that's slow
compared to the single thread version that loads
the value from the L1 cahe.

It's why you are getting a retrograde throughput on
the four threads test.

In the two thread scenario, you have to do a load
from the local L2 cache in [1] and [2] and this loads makes
the S part of the Amadahl equation much bigger than
the P part, it's why the two threads  version doens't scale
either.

So in general i think it's not possible  to make lockfree
fifo queues to scale when the lockfree code is sharing variables
between the cores, cause sharing variables is so expensive..


Thank you.

Amine Moulay Ramdane.




 






 






















aminer

  • Hero Member
  • *****
  • Posts: 956
Re: Lockfree_mpmc and scalability ...
« Reply #12 on: May 29, 2012, 09:04:12 pm »

Hello,


Please read again...

I have receaived the benchmarks from some persons
that have an L3 cache, and i have noticed that lockfree_mpmc
doesn't scale either on with an L3 cache.
Do you know why this lock free fifo doesn't scale, cause
look at the following code on the push() side:

--

function TLockfree_MPMC.push(tm : tNodeQueue):boolean;
var lasttail,newtemp:longword;
i,j:integer;
begin

if getlength >= fsize
then
begin
result:=false;
exit;
end;
result:=true;
newTemp:=LockedIncLong(temp);

lastTail:=newTemp-1;
setObject(lastTail,tm);

repeat

if CAS(tail,lasttail,newtemp)
then
begin
exit;
end;
asm pause end;

until false;


end;

---


You have two thinks:

[1] newTemp:=LockedIncLong(temp);

[2] CAS(tail,lasttail,newtemp)

In the 4 threads scenario , as you can see
in [1] temp has to be loaded from the L3 cache
of the other cores on computers that have an L3 cache
but on my also from memory on my Intel Core 2 Quad Q6600
that doesn't have an L2 cache(just an L2 cache for every two cores) ,
so that will make the the four thread test with an L3 cache a little bit
slower than the single thread version and much slower without an
L3 cache compared to the single thread version that loads the values
from the L1 cache. That's the same for [2] , tail has to be loaded the same
way.

It's why i am getting a retrograde throughput on my
Intel Core 2 Quad Q6600 and alomost the same thoughput
as the single thread on a computer with an L3 cache.

In the two thread scenario, you have to do a load
from the local L2 cache in [1] and [2] and this loads makes
the S part of the Amadahl equation much bigger than
the P part, it's why the two threads version doens't scale
either.

So in general i think it's not possible to make lockfree
fifo queues to scale when the lockfree code is sharing variables
between the cores, cause sharing variables is so expensive..


Thank you.

Amine Moulay Ramdane.

aminer

  • Hero Member
  • *****
  • Posts: 956
Re: Lockfree_mpmc and scalability ...
« Reply #13 on: May 29, 2012, 09:07:29 pm »

Hello,

Of course i was speaking about the x86 architecture...


Amine Moulay Ramdane.


aminer

  • Hero Member
  • *****
  • Posts: 956
Re: Lockfree_mpmc and scalability ...
« Reply #14 on: May 29, 2012, 09:23:02 pm »

I have corrected some typos , please read again...

Hello,


I have received the benchmarks from some persons
that have an L3 cache, and i have noticed that lockfree_mpmc
doesn't scale either on with an L3 cache.
Do you know why this lock free fifo doesn't scale, cause
look at the following code on the push() side:

--

function TLockfree_MPMC.push(tm : tNodeQueue):boolean;
var lasttail,newtemp:longword;
i,j:integer;
begin

if getlength >= fsize
then
begin
result:=false;
exit;
end;
result:=true;
newTemp:=LockedIncLong(temp);

lastTail:=newTemp-1;
setObject(lastTail,tm);

repeat

if CAS(tail,lasttail,newtemp)
then
begin
exit;
end;
asm pause end;

until false;


end;

---


You have two thinks:

[1] newTemp:=LockedIncLong(temp);

[2] CAS(tail,lasttail,newtemp)

In the 4 threads scenario , as you can see in [1] temp has to be
loaded from the L3 cache on computers that have an L3 cache ,
but on my  Intel Core 2 Quad Q6600 that doesn't have an
L3 cache(just an L2 cache for every two cores)  i think it has to
be loaded  from memory, so that will make  the four thread test
with an L3 cache a little bit slower than the single thread version
that loads the values from the L1 cache and much slower on a computer
without an L3 cache. That's the same for [2] , tail has to be loaded the
same way.

It's why i am getting a retrograde throughput with four threads
on my Intel Core 2 Quad Q6600 and almost the same thoughput
as the single thread on a computer with an L3 cache.

In the two thread scenario, you have to do a load
from the local L2 cache in [1] and [2] and this loads makes
the S part of the Amadahl equation much bigger than
the P part, it's why the two threads version doens't scale
either.

So in general i think it's not possible to make lockfree
fifo queues to scale when the lockfree code is sharing variables
between the cores, cause sharing variables is so expensive..


Thank you.

Amine Moulay Ramdane.



 

TinyPortal © 2005-2018