Lazarus
Announcements => Third party => Topic started by: aminer on May 07, 2010, 07:35:52 am
-
Hello all,
I have used my Parallel Sort Library and benchmarked an Object Pascal
program that sorts a dynamic array of 10 millions of strings.
Please look at the benchmarks in following page:
http://pages.videotron.com/aminer/parallelsort/parallelsort.htm
On four cores four threads, here is the results:
Parallel Quicksort gives 5.31x speed and 2.877 seconds
Parallel Heapsort gave me 4.72x and 7.452 seconds
Note: Parallel Quicksort is much faster in pratice than parallel heapsort
or parallel mergesort.
Sincerely
Amine Moulay Ramdane.
-
really nice job!
-
Interesting.
How can quicksort became 3.35 times faster with 2 processor cores?
I would guess the theoretical maximum speed gain is *2.
The same thing with 4 cores. How can it become more than 4 times faster?
I looked at the home page: http://pages.videotron.com/aminer/
but didn't see any download link.
How can I get the source code? Does it even exist?
Regards,
Juha
-
I looked at the home page: http://pages.videotron.com/aminer/
but didn't see any download link.
How can I get the source code? Does it even exist?
There is a ZIP link at the right side of the table (column:Files)
-
There is a ZIP link at the right side of the table (column:Files)
Oops, yes, it does exist.
Has someone tested it with FPC? There seem to be some Delphi units copied which could easily be replaced with what FPC offers.
Can anyone explain the speed difference. How can 2 cores make it more than 2 times faster? Is it some cache hit issue?
Now I don't have a multi-core CPU around. Some day I will test it myself.
Juha
-
I spent some time trying to get a decent test case done. When I try to run it on my Ubuntu 9.10:
An unhandled exception occurred at $0809F010 :
EAccessViolation : Access violation
$0809F010 GETTHREADID, of LockFreePrimitives.pas
$0809E9C2 TGJRINGBUFFER__MEASUREEXECUTIONTIMES, line 240 of RingBuffer.pas
I noticed, that there's quite a bit of assembler code in these files:
classespatch.pas
lockfree_mpmc.pas
LockFreePrimitives.pas
ParallelQueue.pas
parallelqueuerb.pas
ParallelSort.pas
threadpool.pas
I've never done any asm-programming on Delphi/FPC, but I presume that the library is Windows-only?
I also noticed that the test runs with different cores were not done with identical data. That's why I wanted to make a test case, that sorts identical data with different number of cores.
If someone could try this on Windows, here's my testcase:
http://www.starsoft.fi/jarto/ParallelSortTest.zip (http://www.starsoft.fi/jarto/ParallelSortTest.zip)
In my code, you can decide the number of names and number of cores to test here:
TestParallelSort(10000,1,4); //Tests 10000 names with 1 -> 4 cores.
-
Can anyone explain the speed difference. How can 2 cores make it more than 2 times faster? Is it some cache hit issue?
I suppose that this differenced is based on the base performance.
In the single core test the core is not 100% working on the sort, in fact while it testing there are a lot of other programs that windows must execute, think at an antivirus, or any other software.
So the performance is not right one core, but 0.75 core.
In the other tests, we can think that the first core work at 75% but the other can run at 100% so you have e final result that is bigger than the number of core.
This is what I suppose with my knowledge of System Administrator, I don't know if in the algoritm of quicksort there is something that take advance with parallel processing, making it result better.
p.s. sorry for my english
Roberto Gibertini
-
Here's an updated version of my test program. In this one you can specify the number of names as a command line argument. It also saves the original names and results as text files as long as number of names does not exceed 100000.
http://www.starsoft.fi/jarto/ParallelSortTest2.zip (http://www.starsoft.fi/jarto/ParallelSortTest2.zip)
With my Dell Latitute E5500 with 2,40GHz P8600 I got this kind of results:
Cores: 2
4000000 names sorted in 00:13:360
Cores: 1
4000000 names sorted in 00:24:000
But the biggest concern was, that the result files did not always match. One or a few names were not sorted properly in some of the test runs.
-
With my Dell Latitute E5500 with 2,40GHz P8600 I got this kind of results:
Cores: 2
4000000 names sorted in 00:13:360
Cores: 1
4000000 names sorted in 00:24:000
But the biggest concern was, that the result files did not always match. One or a few names were not sorted properly in some of the test runs.
Thanks for testing!
So 2 cores didn't make it more than 2 times faster. This looks more realistic.
It is a big concern if the result is not sorted correctly, yes.
I hope the original author, Aminer, could comment these results.
BTW: The assembly code doesn't necessarily make this "Windows only" but it makes this "Intel only". It is used for locking threads and for atomic moves. I guess it could be replaced with portable code.
Juha
-
>It is a big concern if the result is not sorted correctly.
I can not reproduce this...
Also in the TParallelSort.create(n,ctQuicksort);
the number of cores must be 2^N: 1,2,4 etc.
>BTW: The assembly code doesn't necessarily
>make this "Windows >only" but it makes this "Intel only"
rigth.
Amine.
-
> hope the original author, Aminer, could comment these results.
As i have noticed , you are using this:
IntelĀ® PentiumĀ® Processor E5500
(2M Cache, 2.80 GHz, 800 MHz FSB)
http://ark.intel.com/Product.aspx?id=42800
It has only *1* L2 cache of 2M.
I have used CPU-Z on my hardware:
http://www.cpuid.com/cpuz.php
and it gives me this:
Level 1 Data: 4 x 34 KBytes 8ways
Level 1 Inst.: 4 x 34 KBytes 8 ways
Level 2: 2 x 4096 KBytes 16 ways
So, mine has *2* L2 caches of 4M
So , i think this explains the difference...
Amine.
-
I have on my hardware:
Level 1 Data: 4 x 34 KBytes 8ways
Level 1 Inst.: 4 x 34 KBytes 8 ways
Level 2: 2 x 4096 KBytes 16 ways
So, when i am quicksorting a big array on 1 core ,
there is much higher cache misses than
on 2 threads/2 cores - cause i am breaking the array
in 4 parts and quicksorting in parallel, also ,
i have 2 x L2 caches of 4M - and you have only one
one on your E5500 - and this explains my results...
Amine.
-
>But the biggest concern was, that the result files did not always >match. One or a few names were not sorted properly in some of >the test runs.
As i have noticed, you are using this:
TestParallelSort(10000,1,4); //Tests 10000 names with 1 -> 4 cores.
As i said before, the number of cores must be 2^N: 1,2,4 etc.
So, you have to avoid *3*, just: 1,2,4,6,8 etc.
I can not reproduce your problem here.
Amine.
-
>As i have noticed, you are using this:
>TestParallelSort(10000,1,4); //Tests 10000 names with 1 -> 4 cores.
>As i said before, the number of cores must be 2^N: 1,2,4 etc.
>So, you have to avoid *3*, just: 1,2,4,6,8 etc.
I have reproduced your problem with a number of cores
equal to 3.
So, try to use a number that is not 3 and that is 2^N,
example: 1,2,4,6,8 ,10,12,14,16 etc.
and you will be safe.
You are welcome :)
Sincerely,
Amine.
-
>I spent some time trying to get a decent test case done. When I >try to run it on my Ubuntu 9.10:
>An unhandled exception occurred at $0809F010 :
>EAccessViolation : Access violation
> $0809F010 GETTHREADID, of LockFreePrimitives.pas
> $0809E9C2 TGJRINGBUFFER__MEASUREEXECUTIONTIMES, line >240 of RingBuffer.pas
I think my ParallelQueue will compile correctly...
RingBuffer is from
http://www.odsrv.com/RingBuffer/RingBuffer.htm
So, if you want to run it on Ubuntu 9.10 just
delete the 'uses parallelqueuerb,ringbuffer' inside
threadpool.pas, after that compile it and tell me
if it's working on Ubuntu 9.10 correctly.
Amine.
-
>An unhandled exception occurred at $0809F010 :
>EAccessViolation : Access violation
> $0809F010 GETTHREADID, of LockFreePrimitives.pas
> $0809E9C2 TGJRINGBUFFER__MEASUREEXECUTIONTIMES, line >240 of RingBuffer.pas
[...]
So, if you want to run it on Ubuntu 9.10 just
delete the 'uses parallelqueuerb,ringbuffer' inside
threadpool.pas, after that compile it and tell me
if it's working on Ubuntu 9.10 correctly.
I think the same applies to any Linux, not just Ubuntu. I have OpenSuse.
The code doesn't even compile as is.
Amine, if you plan to support other platforms than Windows, please do the following changes:
1. Replace "IFDEF FreePascal" with "IFDEF FPC".
2. Units are not found in a casesensitive file system. Either rename all units to lowercase or match their (casesensitive) names exactly in USES section.
After those changes I could compile and run and then I also got Access violation.
Then I commented out {parallelqueuerb,ringbuffer} as you suggested and got an exception + SIGSEGV in TThreadPool.Terminate on line :
for I := 0 to FThreadCount - 1 do FThreads.Terminate;
This code is not inherently Windows dependent so I think it is worth porting and testing it under other platforms, too.
Juha
-
JuhaManninen wrote
>I think the same applies to any Linux, not just Ubuntu. I have >OpenSuse.
>The code doesn't even compile as is.
>Amine, if you plan to support other platforms than Windows, >please do the following changes:
>1. Replace "IFDEF FreePascal" with "IFDEF FPC".
>2. Units are not found in a casesensitive file system. Either rename >all units to lowercase or match their (casesensitive) names exactly >in USES section.
I understand , and i will make those changes..
>After those changes I could compile and run and then I also got >Access violation.
>Then I commented out {parallelqueuerb,ringbuffer} as you >suggested and got an exception + SIGSEGV in >TThreadPool.Terminate on line :
> for I := 0 to FThreadCount - 1 do FThreads.Terminate;
>This code is not inherently Windows dependent so I think it is >worth porting and testing it under other platforms, too.
That's strange, why are you getting SIGSEGV on
FThreads.Terminate ?
Why it's working on Windows and not working on Linux ?
I don't have Linux or MacOSX , to test it here.
I will think more about this...
Amine.
-
That's strange, why are you getting SIGSEGV on
FThreads.Terminate ?
Why it's working on Windows and not working on Linux ?
I don't have Linux or MacOSX , to test it here.
Actually the exception happens before FThreads.Terminate, inside a thread I guess.
The cursor was just placed on FThreads.Terminate.
Debugging threaded code can be little tricky.
Well, Linux is free and easy to install (nowadays). :-)
Have you considered publishing the code on a revision control system, to make contributions easier?
Juha
-
>As i have noticed, you are using this:
>TestParallelSort(10000,1,4); //Tests 10000 names with 1 -> 4 cores.
>As i said before, the number of cores must be 2^N: 1,2,4 etc.
>So, you have to avoid *3*, just: 1,2,4,6,8 etc.
I have reproduced your problem with a number of cores
equal to 3.
My first test case (which I made blind) didn't notice that. However, my second test case takes care of avoiding cores 3, 5 etc..
I was able to reproduce the sorting problem by running my test program many times in a bat file:
test.bat:
parallelsorttest 100000
diff results-1.txt results-2.txt
parallelsorttest 1000
diff results-1.txt results-2.txt
parallelsorttest 100000
diff results-1.txt results-2.txt
parallelsorttest 20000
diff results-1.txt results-2.txt
parallelsorttest 20000
diff results-1.txt results-2.txt
parallelsorttest 30000
diff results-1.txt results-2.txt
parallelsorttest 100000
diff results-1.txt results-2.txt
parallelsorttest 100000
diff results-1.txt results-2.txt
parallelsorttest 60000
diff results-1.txt results-2.txt
parallelsorttest 100000
diff results-1.txt results-2.txt
parallelsorttest 100000
diff results-1.txt results-2.txt
parallelsorttest 100000
diff results-1.txt results-2.txt
parallelsorttest 100000
diff results-1.txt results-2.txt
When I run the bat file, in some test runs I get a lot of output from the diff program.
Just out of curiosity, could you post here some results from your 4 core CPU using my test program?
-
JuhaManninen wrote:
>Actually the exception happens before FThreads.Terminate, inside >a thread I guess.
>The cursor was just placed on FThreads.Terminate.
>Debugging threaded code can be little tricky.
>Well, Linux is free and easy to install (nowadays). :-)
Can you advice me where i can download a *good* version
of Linux from the *Internet* - for example a .iso image fo dvd
or cd - , so that i can test my libraries on it ?
>Have you considered publishing the code on a revision control >system, to make contributions easier?
Not yet.
>Juha
Aminw.
-
jarto wrote
>My first test case (which I made blind) didn't notice
>that. However, my second test case takes care of
>avoiding cores 3, 5 etc..
>I was able to reproduce the sorting problem by running
>my test program many times in a bat file:
Are you using Parallel QuickSort ?
Tell me if you can reproduce the same thing with Parallel ctHeapSort?
>Just out of curiosity, could you post here some results
>from your 4 core CPU using my test program?
My hardware is too fast that it gives 00:00:000 on 2 cores.
Amine.
-
jarto write:
>I was able to reproduce the sorting problem by running my test >program many times in a bat file:
I see. It is the 'merge' procedure that is not working
properly , i will take a look at it , and see what's the
problem...
Take care.
Amine.
-
Can you advice me where i can download a *good* version
of Linux from the *Internet* - for example a .iso image fo dvd
or cd - , so that i can test my libraries on it ?
The popular desktop distros: Ubuntu, OpenSuse, Fedora, Mandriva
are all good choices. You can find them by simple google search or such.
They provide live CD / DVDs so you can test before installing.
They support dual boot and your Windows still remains functional.
If you have lots of memory you can even run different operating systems under virtual machines at the same time.
Juha
-
jarto wrote:
>My first test case (which I made blind) didn't notice that. >However, my second test case takes care of avoiding cores 3, 5 >etc..
>I was able to reproduce the sorting problem by running my test >program many times in a bat file:
I have tested it on Delphi 7 and i think this problem doesn't
show on Delphi 7.
Can you please test it on delphi and confirm to me ?
Amine.
-
Tell me if you can reproduce the same thing with Parallel ctHeapSort?
Yes, I can reproduce the same problem with ctHeapSort. It can also be reproduced with Delphi 7.
My hardware is too fast that it gives 00:00:000 on 2 cores.
Just increase the number of names to sort. You can specify the namecount and cores to test on this line:
TestParallelSort(NameCount, 1, 4)
And make sure to code from my 2nd version:
http://www.starsoft.fi/jarto/ParallelSortTest2.zip (http://www.starsoft.fi/jarto/ParallelSortTest2.zip)
/jarto
-
jarto wrote:
>Yes, I can reproduce the same problem with ctHeapSort. It can >also be reproduced with Delphi 7.
Here it is jarto , i have solved the problem, you can
download version 2.11 from my website:
http://pages.videotron.com/aminer/
Please test it and tell me if all is ok...
Note: the problem was not on my algorithm , it was
very difficult to spot the problem ...
Sincerely,
Amine.
-
Please JuhaMannin, can you test version 2.11 and
tell me if you have the same exception + SIGSEGV
in Linux ?
Amine.
-
Please JuhaMannin, can you test version 2.11 and tell me if you have the same exception + SIGSEGV in Linux ?
I downloaded:
http://pages.videotron.com/aminer/zip/parallelsort.zip
I hope the link is correct.
Then I had to make the same changes as before to make it compile and run.
And Yes, I got the same exception + SIGSEGV still.
I have no idea what you have changed because I had to download the whole package and replace the old files.
Amine, you should really consider putting your code into a revision control system. It would help you to track your changes. It would help any tester (like me now) to see the changes. It would allow someone to send you patches. You would still have full control over your code, nobody could change it without your permission.
If you for example open a GitHub repository, I promise to fork it and send you some changes.
Juha
-
Hello JuhaMannine,
I have installed OpenSUSE , but when i have tried to
install FreePascal it complains that libinfo.so is not present,
in wich package i can find this dynamic library ?
Amine.
-
I have installed OpenSUSE , but when i have tried to install FreePascal it complains that libinfo.so is not present, in wich package i can find this dynamic library ?
There is no such library in Suse. I guess the RPM file is for another distro and you could ignore the dependency and force install.
The easiest way to install FPC 2.2.4 and Lazarus 0.9.28.2 is from Suse repositories (add Packman repo there).
I use the latest SVN versions of FPC and Lazarus myself. That requires installing many development packages. It is not difficult either because they are all found in Suse repositories.
Juha
-
You could have quick help for your installation in #lazarus-ide chat channel. I also will be present there now for ~2 hours.
Juha
-
Hello all, hello JuhaManninen,
I have installed Linux OpenSUSE and i have solved
the problem, you have to add cthreads to the uses
and delete ringbuffer and parallequeuerb.
You can download version 2.12 from my website..
http://pages.videotron.com/aminer/
and don't forget to compile with -dUnix
Amine.
-
Here it is jarto , i have solved the problem, you can
download version 2.11 from my website:
http://pages.videotron.com/aminer/
Please test it and tell me if all is ok...
This version works well. I could not reproduce the bug any more. I'm still interested in seeing results from you using my test app...
-
Hello all,
Now, i have tested my Parallel Sort Library with
Windows and Linux and it is working correctly..
Can someone else test it on Mac OSX (x86)... ?
Amine.
-
Next step i will compile and test my Parallel Compression
Library etc. in Linux...
FreePascal is fun...
Amine.
-
Some feedback about your uses-lines:
You don't have to add cmem and cthreads to the unit files. They only have to be once in the project and they are in the *.lpr -file. Now you have cmem ParallelSort.pas and ThreadPool.pas, which is completely unnecessary.
SysUtils, Classes, SyncObjs do not need any ifdefs either. They are all available in both Delphi and FPC/Lazarus on all platforms.
This should clean up your code somewhat.
-
jarto wrote
>This version works well. I could not reproduce
>the bug any more.
Happy for you.
>I'm still interested in seeing results from you using my test app...
Here is it is with i have tested it with 1 and 2 cores:
With 10 millions it's still ~3x on two cores.
D:\tmp10>parallelsorttest 10000
Cores: 2
10000 names sorted in 00:00:015
Cores: 1
10000 names sorted in 00:00:032
D:\tmp10>parallelsorttest 50000
Cores: 2
50000 names sorted in 00:00:063
Cores: 1
50000 names sorted in 00:00:141
D:\tmp10>parallelsorttest 200000
Cores: 2
200000 names sorted in 00:00:375
Cores: 1
200000 names sorted in 00:00:719
D:\tmp10>parallelsorttest 300000
Cores: 2
300000 names sorted in 00:00:609
Cores: 1
300000 names sorted in 00:01:110
D:\tmp10>diff results-1.txt results-2.txt
D:\tmp10>del results-1.txt
D:\tmp10>del results-2.txt
D:\tmp10>parallelsorttest 5000000
Cores: 2
5000000 names sorted in 00:15:047
Cores: 1
5000000 names sorted in 00:27:688
D:\tmp10>parallelsorttest 10000000
Cores: 2
10000000 names sorted in 00:32:890
Cores: 1
10000000 names sorted in 01:01:797
Amine.
-
Hello all,
I have updated Parallel Sort Library to version 2.13
http://pages.videotron.com/aminer/
Inside ParallelSort.pas , i have corrected a
minor memory leak in the following part:
------------------------------------
if rest<>0
then
begin
for i:=0 to nbrprocs-1 do TParams(arr1).free;
end
else
begin
for i:=0 to nbrprocs-1 do TParams(arr1).free;
end;
-----------------------------------
I have changed it to
------------------------------------
if rest<>0
then
begin
for i:=0 to nbrprocs do TParams(arr1).free;
end
else
begin
for i:=0 to nbrprocs-1 do TParams(arr1).free;
end;
-----------------------------------
Sincerely,
Amine.
-
Here is it is with i have tested it with 1 and 2 cores:
With 10 millions it's still ~3x on two cores.
...
D:\tmp10>parallelsorttest 10000000
Cores: 2
10000000 names sorted in 00:32:890
Cores: 1
10000000 names sorted in 01:01:797
Are you saying that 01:01:797 is 3x of 00:32:890?
-
Are you saying that 01:01:797 is 3x of 00:32:890?
Nice catch.. ;)
-
All I can say is; O:-)
Thank you very much :D
-
xenablaise wrote:
>All I can say is;
>Thank you very much
Thank you...
And as you have noticed it's very easy to work
with my Parallel Sort Library...
Long life to FreePascal !
You are welcome:
http://pages.videotron.com/aminer/
Sincerely,
Amine Moulay Ramadne.
-
aminer;
I'm a delphi/lazarus newbie, all I know are basic programming in delphi/lazarus, basic file handling, basic multi-threading, basic database development, basics. But I'm learning each day, repeating task.
To tell you frankly, I don't know how to use your units, only your samples/demos/links made me convinced that your a delphi/lazarus genius.
And I'm keeping this post in my favorites.
To let you know, I'm glad participating on this post.
I hope you can help me if I apply your paralleled ideas to my app [in-case] if I want a speed up. PM would be better.
thank you :D