ANN: Parallel Sort library version 3.3 [Edit] #2

Hello....


My Parallel Sort library version 3.3 is here:

https://sites.google.com/site/aminer68/parallel-sort-library


As i have told you , my new parallel Sort algorithm has become more cache-aware, and since it has become more cache-aware it has induced a super linear speedup and super linear scalability when using more cores and more L2 caches, i have done some benchmarks on my Quadcore that uses two L2 caches and it has given a super linear speedup of 5X scalability on my Quadcore when sorting strings even though i am using only 4 cores, that's easy to understand cause when you use only one thread it will use only one
 L2 cache, but when you use more threads on multiple cores and with multiple L2 caches it will use more L2 caches and it will parallelize the access to those multiple L2 caches , this is why my new parallel algorithm has given a super linear speedup when sorting strings.

I have to be frank, we have to be smart when inventing or doing parallel sort algorithms, my new parallel sort algorithm is smart, cause it is more cache-aware, but you have to be carefull cause look
at my other parallel quicksort algorithm here:

https://sites.google.com/site/aminer68/parallel-quicksort

You have to know that this parallel quicksort that uses the classical way of sorting is not so good, cause  when its partition function is used recursively, this parallel quicksort algorithm will dispatch the arrays to be partitioned each time to different threads, and this will make it less cache-aware than my new parallel sort algorithm, and since it will make it less cache-aware , so you will not get much than 3X scalability by sorting strings and you will get less that 3X scalability by sorting intege
rs or doubles for example,  So i advice you to use my new parallel sort algorithm of my parallel sort library version 3.3 that is more cache-aware and that gives you super linear scalability when sorting strings and it gives you good scalability when sorting integers and doubles etc. 

Let's talk more computer science...

Finally i have arrived to an important subject that is "optimization", and you will notice that computer science learns you one of the most important thing, is how to amortize and optimize by optimizing more your time complexity that we express mathematically with a O(), this is how i have done it with my new parallel sort algorithm that has a "very" good time complexity, and what learns you also computer science is how to optimize more in the source code's part or parts that takes the greater percentage 
of running time, and what learns you also computer science is to do  the greater percentage of all the optimizations and to not  do the smaller optimizations, this is also a good way to follow to optimize your code, so if you look at my new parallel conjugate gradient solver library , i am not using SIMD SSE2 instructions , but i am just parallelizing the memory transfers cause my parallel algorithm is NUMA-aware and i am also parallelizing also the multiplication and addition of the double's type etc. an
d also my parallel algorithm is cache-aware so i am also parallelizing more the L2 cache memory transfers , so this optimization will make it a very greater percentage of all the optimization that we can do on my parallel algorithm, so in large scale industrial problems i don't think that my parallel conjugate gradient algorithm needs to do SIMD SSE2 or AVX optimization , cause SIMD SSE2 or AVX will give just a small improvement to my parallel algorithm. 

I have finally arrived to an important subject...it's still on the subject of parallel optimization, but this time i will talk more about the Cilk-style scheduler.. as you have seen me in my previous post i have talked about the Dmitry Vyukov's Cilk-style scheduler that supports system-topology awareness, hyper-threading awareness, affinity-awareness, batch-spawn capability and manual task-depth control, here it is:

http://www.1024cores.net/home/parallel-computing/radix-sort


and as you have seen me talking about my parallel quicksort, i have said that it's not so good cause since it dispatches the indexes of the arrays to be partitionned  each time to different threads of my classical threadpool, since the my threadpool don't look like Dmitry Vyukov's scheduler , so this will render the parallel quicksort algorithm less cache-aware, so this will not scale much than 3X for strings and it will scale less for integers and doubles... but if we are using the Dmitry Vyukov's schedu
ler, as we recursively dispatch the array's indexes to the Scheduler's threads that will call the partition function, the Dmitry Vyukov's Cilk-style scheduler will dispatch those array's indexes to threads that will reuse "efficiently" the data of the caches , so this will reuse efficiently the L2-cache's data and L3-cache's data efficiently , but even though it will reuse the data of caches efficiently, this optimizations of the Dmitry Vyukov's scheduler will not bring more than a 2X improvement , in fac
t you will get less and less improvement when using "bigger" and bigger data, so this is not so important improvement that will bring the Dmitry Vyukov's scheduler, so what you have to do is to render your parallel sort algorithm into a NUMA-aware algorithm this way you will render your parallel algorithm into a scalable algorithm, this is why i have told you in my previous post that the Dmitry's Sheduler will not bring an important improvement, and if you take a look at my new parallel sort algorithm of 
my parallel sort library version 3.3, its parallel  sorting part doesn't contain parallel recursive calls and its parallel merging part doesn't contain parallel recursive calls, so the Dmitry Vyukov's scheduler will not bring improvement to my parallel algorithm, so for all those reasons this why i have decided to not implement a scheduler that look like the Dmitry Vyukov's scheduler and this why i have decided to use my classical threadpool engine instead.

Finally Parallel Sort Library that supports Parallel Quicksort, Parallel HeapSort and Parallel MergeSort on Multicores systems is here.

Parallel Sort Library uses my Thread Pool Engine and sort many array parts - of your array -  in parallel using Quicksort or HeapSort or MergeSort and after that it finally merge them - with the merge() procedure - 

In the previous parallelsort version i have parallelized only the sort part, but in this new parallelsort version i have parallelized also the merge procedure part and it gives better performance.

My new parallel sort algorithm has become more cache-aware, and i have done some benchmarks with my new parallel algorithm and it has given up to 5X scalability on a Quadcore when sorting strings, other than that i have cleaned more the code and i think my parallel Sort library has become a more professional and industrial parallel Sort library , you can be confident cause i have tested it thoroughly and no bugs have showed , so i hope you will be happy with my new Parallel Sort library. 


I have also included a "test.pas" example, just compile first the "gendata.pas" inside the zip file and run it first, after that compile the "test.pas" example and run it and do your benchmarks. 

I have implemented a Parallel hybrid divide-and-conquer merge algorithm that performs 0.9-5.8 times better than sequential merge, on a quad-core processor, with larger arrays outperforming by over 5 times. Parallel processing combined with a hybrid algorithm approach provides a powerful high performance result.

The best case complexity of ParallelSort using mergesort  is:

 ((n/p)* log(n/p)) + O(n/p)

p: is the number of cores

the ((n/p)* log(n/p)) is the complexity of the sorting part.

O(n/p) is the best case complexity of the merging part.

so the best case complexity  is:  ((n/p)* log(n/p))

The worst case complexity of parallel sort library using mergesort is:

 ((n/p)* log(n/p)) +  O(n)

the ((n/p)* log(n/p)) is the complexity of the sorting part.

O(n) is the worst  case complexity of the merging part.

so the worst case complexity of parallelsort using mergesort is approximatly: O(n)

I have done some tests with my ParallelSort library and i have noticed that it can give up to 5X scalability with strings, and it gives 3x scalability with integers on a quad cores.

So, why it scales to 5X with strings and only 3x with integers on a quad cores ?


I explain:

In the SequentialMerge() method and QSort() method inside Parallel Sort library, i am calling the Scompare() method and also in both of them i am copying to the memory system.

So when i am using strings the SCompare() method is more expensive, so the parallel part p in the Amdahl equation 1/ S + P/N (S: the serial part, P: parallel part and N: the number of cores) is bigger than with integers so the Amdahl equation will scale better, but when we are using integers the SCompare() method is less expensive than the SCompare() with strings, so the parallel part p in the Amdahl equation is less bigger than with strings. so this is why parallel sorting with strings scales better than
 with integers.

I have implemented mergsort and quicksort, but as you know the complexity of mergesort in the worst case is better than quicksort , and the mergesort that i have implemented is faster than quicksort, but mergesort takes more space..

The reccurence equation of the complexity of mergesort is:

T(n) = 2 * T(n/2) + n

cause it take O(n) for the merging part.

It gives:

T(n) = 2 * T(n/2) + n

= 2 (2T(n/4) +n/2) + n

=4T(n/4)+n+n

=4T(n/4)+2*n

=4 (2T(n/8) +n/4) + 2*n

=8T(n/8)+n+2n

=8T(n/8)+3*n

=2k T(n/2^k) + k*n

We want:

n/2^k = 1

n = 2^k

log n = k

so the reccurence equation gives:

= n*T(1) +n*log(n)

= n+ (n * log(n))

So the mergesort complexity in the best case and in the worst case is:

n * log(n)

But the complexity of the quicksort in the worst case is:

T(n)= n + T(n-1)

it gives:

T(n) = n + (n-1) + T(n-2)

T(n) = n + (n-1) + (n-2)+ T(n-3)

T(n) = 1 + 2+ 3+.+N

..T(n) = O(n^2)

And Quicksort in best case performance is: 
 
T(n) = T(n/2) + T(n/2) + O(n)
       = 2T(n/2) + (n)

this gives: T(n) = (n lg n)

Parallelizing the Sorts

One way to parallelize the sorts is:

Divide the data among the processors
Sort the data on the individual processors.
Merge the various data
Note that the merge operation is a reduction operation !



You can dowload my Parallel Sort library that is much more faster than my Parallel Quickort library from:

https://sites.google.com/site/aminer68/parallel-sort-library




Thank you for your time.


Amine Moulay Ramdane.

Edited by: Amine Moulay Ramdane on Feb 7, 2015 1:33 PM
0
Amine
2/7/2015 9:33:28 PM
embarcadero.delphi.tools 5366 articles. 3 followers. Follow

0 Replies
652 Views

Similar Articles

[PageSpeed] 0

Reply:

Similar Artilces:

ANN: Parallel Sort library version 3.3 [Edit]
Hello, My Parallel Sort library was updated to version 3.3, my new parallel sort algorithm has become more cache-aware, and i have done some benchmarks with my new parallel algorithm and it has given up to 5X scalability on a Quadcore when sorting strings, other than that i have cleaned more the code and i think my parallel Sort library has become a more professional and industrial parallel Sort library , you can be confident cause i have tested it thoroughly and no bugs have showed , so i hope you will be happy with my new Parallel Sort library. I have also included a "test.pa...

ANN: Parallel Sort library version 3.3 [Edit]
Hello, My Parallel Sort library was updated to version 3.3, my new parallel sort algorithm has become more cache-aware, and i have done some benchmarks on my new parallel algorithm and it has given up to 5X scalability on a Quadcore when sorting strings, other than that i have cleaned more the code and i think my parallel Sort library has become a more professional and industrial parallel Sort library , you can be confident cause i have tested it thoroughly and no bugs have showed , so i hope you will be happy w ith my new Parallel Sort library. I have also included a "test.pas&...

ANN: Parallel Compression Library 3.3 and Parallel archiver 2.2 [Edit]
Hello, Parallel Compression Library was updated to version 3.3 and Parallel archiver was updated to version 2.2 I have stress tested all the day my Parallel archiver and my parallel compression library, and i have discovered a bug that i have corrected , and i have also optimized them more and now parallel LZMA compression have the same speed as the parallel compression LZMA of 7Zip.. i have also cleaned more the code and i can tell you now that Parallel archiver and Parallel Compression library are very stable now, you can really be confident with both my Parallel archiver and ...

ANN: Parallel Sort library version 3.3
Hello, My Parallel Sort library was updated to version 3.3, my new parallel sort algorithm has become more cache-aware, and i have done some benchmarks on my new parallel algorithm and it has given up to 5X scalability on a Quadcore when sorting strings, other than that i have cleaned more the code and i think my parallel Sort library has become a more professional and industrial parallel Sort library , you can be confident cause i have tested it thoroughly and no bugs have showed , so i hope you will be happy w ith my new Parallel Sort library. I have also included a "test.pas&...

ANN: Parallel Sort library version 3.3
Hello.... My Parallel Sort library version 3.3 is here: https://sites.google.com/site/aminer68/parallel-sort-library As i have told you , my new parallel Sort algorithm has become more cache-aware, and since it has become more cache-aware it have induced a super linear speedup and super linear scalability when using more cores and more L2 caches, i have done some benchmarks on my Quadcore that uses two L2 caches and it has given a super linear speedup of 5X scalability on my Quadcore when sorting strings even though i am using only 4 cores, that's easy to understand cause when...

[ANN] Release of Bugzilla 3.2.2, 3.0.8, and 3.3.3
Bugzilla 3.2.1, 3.0.7, and 3.3.2 contained a bug that was critical for any installation running under mod_perl, due to an unintentional interaction between the various security fixes in those releases. We are releasing three new releases today to fix the critical issue: 3.2.2, 3.0.8, and 3.3.3. They are identical to the previous release except that they have this one fix for installations running under mod_perl. Download -------- Bugzilla is available at: http://www.bugzilla.org/download/ Security Advisory ----------------- Details of the fix are in the Security Adviso...

[ANN] Parallel Sort library version 3.32
Hello, I have updated my Parallel Sort library to version 3.32, i have just corrected a minor bug and i have stress tested it and it i think it is now really stable. You can download my Parallel Sort library version 3.32 from: https://sites.google.com/site/aminer68/parallel-sort-library But don't forget to put the "cmem" unit as the first unit in your "uses" statement when you want to use my Parallel Sort library with the Delphi graphical user interface, that's mandatory, and that's mandatory also with my other parallel libraries. Than...

[ANN] Release of Bugzilla 3.2.8, 3.4.8, 3.6.2, and 3.7.3
Today we have four new releases. One new development snapshot (3.7.3), one new stable release (3.6.2) and two security updates for the old stable releases (3.4.8 and 3.2.8). Bugzilla 3.6.2 is our latest stable release. It contains various useful bug fixes and security improvements for the 3.6 branch. Bugzilla 3.4.8 and 3.2.8 are security updates for the 3.4 branch and the 3.2 branch, respectively. Bugzilla 3.7.3 is our third unstable development release leading to Bugzilla 4.0. We have done a fair amount of QA on this release. However, QA found many bugs that have not ye...

[ANN] Release of Bugzilla 3.2.1, 3.0.7, 2.22.7, and 3.3.2
Today we have some major security improvements for Bugzilla in the form of four releases. We strongly recommend that all Bugzilla administrators read the Security Advisory for these releases, which is linked below in this email. Bugzilla 3.2.1 is our latest stable release. It contains various useful bug fixes in addition to major security improvements. Bugzilla 3.0.7 and Bugzilla 2.22.7 are security updates for their branches. Bugzilla 3.3.2 is an unstable development release. In addition to the security fixes that all the other releases contain, this release contains n...

ANN: Parallel compression library version 3.33 [Edit]
Hello, My Parallel compression library was updated to version 3.33, like in design by contract, i have added preconditions to the methods etc. You can download Parallel compression library 3.33 from: https://sites.google.com/site/aminer68/parallel-compression-library Thank you, Amine Moulay Ramdane. Edited by: Amine Moulay Ramdane on Apr 19, 2015 11:28 AM ...

ANN: Parallel archiver version 3.1 is here... [Edit] #2
Hello.. Parallel archiver version 3.1 is here... I have optimized more my Parallel archiver, now in this new version 3.1, the loading of the index is done 3X times faster than before, and i have done a benchmark with my new Parallel archiver version 3.1 and i have noticed that it can load now an index of 80000 files in about 9.5 seconds on an SSD drive, that's really fast ! so i think that this new version 3.1 has attained now a good quality and it can be considered an industrial library now and i think that it's now a stable version, so i think that yo u can be confident wi...

[ANN] Security Advisory for Bugzilla 3.2.8, 3.4.8, 3.6.2, and 3.7.3
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Summary ======= Bugzilla is a Web-based bug-tracking system used by a large number of software projects. The following security issues have been discovered in Bugzilla: * There is a way to inject both headers and content to users, causing a serious Cross-Site Scripting vulnerability. * It was possible to see graphs from Old Charts even if you did not have access to a particular product, and you could browse a particular URL to see all product names. * YUI 2.8.1, which shipped with Bugzilla starting with 3.7.x, contain...

ANN: Parallel Sort library was ported to Delphi XE versions [Edit]
Hello, I have ported my Parallel Sort library to the delphi XE versions on the Win32 and Win64 platforms... Since i have only a computer with a Quadcore, i have tried to do a scalability prediction of my Parallel Sortt library using the Amdahl equation and the numbers are good, and since i have completly parallelized the sorting part and the merging part, my Parallel Sort library is more scalable than my Parallel Quicksort library since on the parallel quicksort you can not completly parallelize the partition() procedure and you will still have a serial part in the partition() proced...

Web resources about - ANN: Parallel Sort library version 3.3 [Edit] #2 - embarcadero.delphi.tools

Parallel - Wikipedia, the free encyclopedia
Text is available under the Creative Commons Attribution-ShareAlike License ;additional terms may apply. By using this site, you agree to the ...

Does the opportunity to lead a parallel life online mean we are more likely to cheat?
Here's the bad news: According to some experts, you're more likely to find yourself with a cheating partner these days.  

Woman accusing Daw of rape was teased over parallels with novel
Being made fun of by classmates when they studied a novel in 2007 led woman who accuses former AFL player of raping her to report the incident ...

Harper review: Government faces stoush with authors over removal of parallel book imports
The Turnbull government is set for a fiery showdown with some of Australia's most popular authors over its plan to loosen restrictions on book ...

Parallels $79.99 Black Friday Bundle: Free 1Password, Evernote, Camtasia, Pocket, PDFPen, Snagit, more ...
Parallels has one of the most impressive bundles we’ve seen launching today . For the normal price of Parallels ( $79.99 / $50 upgrade ), you ...

LLVM to get FORTRAN compiler that targets parallel GPUs in clusters
... NVIDIA that would be enhancing the LLVM compiler collection. The goal will be to port an existing FORTRAN compiler that targets massively parallel ...

Windows 10 on Mac: Boot Camp vs Parallels
... run Windows 10 on a Mac there are really only two options worth considering: a native install using Boot Camp or virtualization through Parallels. ...

5 Reasons Why Parallel Parenting Works Better Than Co-Parenting
... to have a co-parenting relationship with my ex. I stumbled upon an article over at the Huffington Post that introduced me to a new term: Parallel ...

"Every year at the holidays, millions of Americans partake in a whole parallel tradition: avoiding the ...
"... that one exhausting relative who treats every event as a chance to assault you with their fringe political ideas, hector you about your ...

Tesla automatic parallel parking video - Business Insider
... can now activate " Autopilot " — a mode that enables the car to steer, change lanes, and avoid other cars and obstacles. It can even parallel ...

Resources last updated: 12/4/2015 5:00:25 PM