Performance of parallel computing.

Hi everybody!

I have recently played a bit with somewhat intense computations and tried to parallelize them among a couple of threaded workers. The results were somewhat... eh... discouraging. To sum up my findings I wrote a simple demo benchmark:

     use Digest::SHA;
     use Bench;

     sub worker ( Str:D $str ) {
         my $digest = $str;

         for 1..100 {
             $digest = sha256 $digest;
         }
     }

     sub run ( Int $workers ) {
         my $c = Channel.new;

         my @w;
         @w.push: start {
             for 1..50 {
                 $c.send(
                     (1..1024).map( { (' '..'Z').pick } ).join
                 );
             }
             LEAVE $c.close;
         }

         for 1..$workers {
             @w.push: start {
                 react {
                     whenever $c -> $str {
                         worker( $str );
                     }
                 }
             }
         }

         await @w;
     }

     my $b = Bench.new;
     $b.cmpthese(
         1,
         {
             workers1 => sub { run( 1 ) },
             workers5 => sub { run( 5 ) },
             workers10 => sub { run( 10 ) },
             workers15 => sub { run( 15 ) },
         }
     );

I tried this code with a macOS installation of Rakudo and with Linux in a VM box. Here are the macOS results (6 CPU cores):

Timing 1 iterations of workers1, workers10, workers15, workers5...
  workers1: 27.176 wallclock secs (28.858 usr 0.348 sys 29.206 cpu) @ 0.037/s (n=1)
		(warning: too few iterations for a reliable count)
 workers10: 7.504 wallclock secs (56.903 usr 10.127 sys 67.030 cpu) @ 0.133/s (n=1)
		(warning: too few iterations for a reliable count)
 workers15: 7.938 wallclock secs (63.357 usr 9.483 sys 72.840 cpu) @ 0.126/s (n=1)
		(warning: too few iterations for a reliable count)
  workers5: 9.452 wallclock secs (40.185 usr 4.807 sys 44.992 cpu) @ 0.106/s (n=1)
		(warning: too few iterations for a reliable count)
O-----------O----------O----------O-----------O-----------O----------O
|           | s/iter   | workers1 | workers10 | workers15 | workers5 |
O===========O==========O==========O===========O===========O==========O
| workers1  | 27176370 | --       | -72%      | -71%      | -65%     |
| workers10 | 7503726  | 262%     | --        | 6%        | 26%      |
| workers15 | 7938428  | 242%     | -5%       | --        | 19%      |
| workers5  | 9452421  | 188%     | -21%      | -16%      | --       |
----------------------------------------------------------------------

And Linux (4 virtual cores):

Timing 1 iterations of workers1, workers10, workers15, workers5...
  workers1: 27.240 wallclock secs (29.143 usr 0.129 sys 29.272 cpu) @ 0.037/s (n=1)
		(warning: too few iterations for a reliable count)
 workers10: 10.339 wallclock secs (37.964 usr 0.611 sys 38.575 cpu) @ 0.097/s (n=1)
		(warning: too few iterations for a reliable count)
 workers15: 10.221 wallclock secs (35.452 usr 1.432 sys 36.883 cpu) @ 0.098/s (n=1)
		(warning: too few iterations for a reliable count)
  workers5: 10.663 wallclock secs (36.983 usr 0.848 sys 37.831 cpu) @ 0.094/s (n=1)
		(warning: too few iterations for a reliable count)
O-----------O----------O----------O----------O-----------O-----------O
|           | s/iter   | workers5 | workers1 | workers15 | workers10 |
O===========O==========O==========O==========O===========O===========O
| workers5  | 10663102 | --       | 155%     | -4%       | -3%       |
| workers1  | 27240221 | -61%     | --       | -62%      | -62%      |
| workers15 | 10220862 | 4%       | 167%     | --        | 1%        |
| workers10 | 10338829 | 3%       | 163%     | -1%       | --        |
----------------------------------------------------------------------

Am I missing something here? Am I doing something wrong? Because it just doesn't fit into my mind...

As a side note: playing with 1, 2, and 3 workers I see that each new thread gradually improves the total run time until a plateau is reached. The plateau is seemingly defined by the number of cores or, more correctly, by the number of supported threads. Proving this hypothesis would require more time than I have on my hands right now, and I'm not even sure such a proof would make sense.
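
(A quick way to sanity-check the plateau hypothesis would be to compare it against what Rakudo itself reports; a sketch, assuming the Rakudo-specific $*KERNEL.cpu-cores and $*SCHEDULER.max_threads dynamics:

     say "CPU cores:       { $*KERNEL.cpu-cores }";        # logical CPUs reported by the OS
     say "Thread pool max: { $*SCHEDULER.max_threads }";   # ceiling of the default ThreadPoolScheduler

If the plateau sits at one of those two numbers, that would support the hypothesis.)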

Best regards,
Vadim Belman


vrurg
12/7/2018 1:39:23 AM

Not sure if your test is measuring what you expect: the setup of generating 50 x 1k strings takes 2.7 sec on my laptop, and that's reducing the apparent effect of parallelism.

$ perl6
To exit type 'exit' or '^D'
> my $c = Channel.new;
Channel.new
> { for 1..50 {$c.send((1..1024).map( { (' '..'Z').pick } ).join);}; say now - ENTER now; }
2.7289092

I'd move the setup outside the "cmpthese", try again, and re-think the new results; something like the sketch below.
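
A rough sketch of that restructuring, reusing worker() from the original post (the pre-built @strings array is my addition): the strings are generated once, outside the timed region, so the producer thread only re-sends them and cmpthese mostly measures the hashing.

     # Build the test strings once, outside the timed region.
     my @strings = (1..50).map: { (1..1024).map( { (' '..'Z').pick } ).join };

     sub run ( Int $workers ) {
         my $c = Channel.new;

         my @w;
         @w.push: start {
             $c.send($_) for @strings;   # re-sending is cheap compared to generating
             LEAVE $c.close;
         }

         for 1..$workers {
             @w.push: start {
                 react {
                     whenever $c -> $str { worker( $str ) }
                 }
             }
         }

         await @w;
     }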



On 12/6/18, Vadim Belman <vrurg@lflat.org> wrote:
> [...]


-- 
-y
yary
12/7/2018 6:56:44 AM
That was a bit vague. I meant that I suspect the workers are being
starved, since you have many consumers and only a single thread
generating the 1k strings. I would prime the channel so it starts out
full, or do some other restructuring to ensure all threads are kept
busy; for instance, along the lines of the sketch below.
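
A minimal sketch of the priming idea (run-primed is a made-up name; worker() is as in the original post). Channel.new is unbounded, so it can be filled completely and closed before any consumer starts, ruling out producer starvation:

     sub run-primed ( Int $workers ) {
         my $c = Channel.new;

         # Prime the channel: every string is queued before a worker runs.
         $c.send( (1..1024).map( { (' '..'Z').pick } ).join ) for 1..50;
         $c.close;

         # A closed channel still delivers its queued values to whenever.
         await (1..$workers).map: {
             start {
                 react {
                     whenever $c -> $str { worker( $str ) }
                 }
             }
         }
     }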

-y

On Thu, Dec 6, 2018 at 10:56 PM yary <not.com@gmail.com> wrote:
> [...]
yary
12/7/2018 7:06:01 AM

There is no need to fill the channel prior to starting the workers. First of all, the 100 repetitions of SHA256 in a single worker() call take ~0.7 sec on my system. I didn't benchmark the generator thread, but considering that even your timing gives ~0.054 sec per string, it will most definitely remain fast enough to provide all workers with data (at those rates a single producer could keep roughly a dozen workers fed). But even with this in mind I re-ran the test with only 100-character strings being generated. Here is what I've got:

Benchmark:
Timing 1 iterations of workers1, workers10, workers15, workers2, workers3, workers5...
  workers1: 22.473 wallclock secs (22.609 usr 0.231 sys 22.840 cpu) @ 0.044/s (n=1)
		(warning: too few iterations for a reliable count)
 workers10: 6.154 wallclock secs (44.087 usr 11.149 sys 55.236 cpu) @ 0.162/s (n=1)
		(warning: too few iterations for a reliable count)
 workers15: 6.165 wallclock secs (50.206 usr 9.540 sys 59.745 cpu) @ 0.162/s (n=1)
		(warning: too few iterations for a reliable count)
  workers2: 14.102 wallclock secs (26.524 usr 0.618 sys 27.142 cpu) @ 0.071/s (n=1)
		(warning: too few iterations for a reliable count)
  workers3: 10.553 wallclock secs (27.808 usr 1.404 sys 29.213 cpu) @ 0.095/s (n=1)
		(warning: too few iterations for a reliable count)
  workers5: 7.650 wallclock secs (31.099 usr 3.803 sys 34.902 cpu) @ 0.131/s (n=1)
		(warning: too few iterations for a reliable count)
O-----------O----------O----------O-----------O----------O-----------O----------O----------O
|           | s/iter   | workers3 | workers15 | workers5 | workers10 | workers2 | workers1 |
O===========O==========O==========O===========O==========O===========O==========O==========O
| workers3  | 10553022 | --       | -42%      | -28%     | -42%      | 34%      | 113%     |
| workers15 | 6165235  | 71%      | --        | 24%      | -0%       | 129%     | 265%     |
| workers5  | 7650413  | 38%      | -19%      | --       | -20%      | 84%      | 194%     |
| workers10 | 6154300  | 71%      | 0%        | 24%      | --        | 129%     | 265%     |
| workers2  | 14101512 | -25%     | -56%      | -46%     | -56%      | --       | 59%      |
| workers1  | 22473185 | -53%     | -73%      | -66%     | -73%      | -37%     | --       |
--------------------------------------------------------------------------------------------

What's more important is the observation of the CPU consumption of the moar process. Depending on the number of workers I was getting figures from 100% load for a single worker up to 1000% for the whole bunch of 15. This perfectly corresponds with the 6 cores / 2 threads per core of my CPU.

> On Dec 7, 2018, at 02:06, yary <not.com@gmail.com> wrote:
> [...]

Best regards,
Vadim Belman


vrurg
12/7/2018 3:27:41 PM
Is it possible that OS caching is having an effect on the performance?
It's sometimes necessary to run the same code several times before it
settles down to consistent results.
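
For what it's worth, Bench's own warning ("too few iterations for a reliable count") points the same way. A sketch of how that could look with the original run(): a few discarded warm-up passes first, then several timed iterations passed to cmpthese:

     run( $_ ) for 1, 5, 10, 15;   # warm-up passes; results discarded
     my $b = Bench.new;
     $b.cmpthese(
         5,                        # average over several timed iterations
         {
             workers1  => sub { run( 1 ) },
             workers5  => sub { run( 5 ) },
             workers10 => sub { run( 10 ) },
             workers15 => sub { run( 15 ) },
         }
     );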

On 12/7/18, Vadim Belman <vrurg@lflat.org> wrote:
> There is not need for filling in the channel prior to starting workers.
> First of all, 100 repetitions of SHA256 per worker makes takes ~0.7sec on=
 my
> system. I didn't do benchmarking of the generator thread, but considering
> that even your timing gives 0.054sec/per string =E2=80=93 I will most def=
initely
> remain fast enough to provide all workers with data. But even with this i=
n
> mind I re-run the test with only 100 characters long strings being
> generated. Here is what I've got:
>
> Benchmark:
> Timing 1 iterations of workers1, workers10, workers15, workers2, workers3=
,
> workers5...
>   workers1: 22.473 wallclock secs (22.609 usr 0.231 sys 22.840 cpu) @
> 0.044/s (n=3D1)
> 		(warning: too few iterations for a reliable count)
>  workers10: 6.154 wallclock secs (44.087 usr 11.149 sys 55.236 cpu) @
> 0.162/s (n=3D1)
> 		(warning: too few iterations for a reliable count)
>  workers15: 6.165 wallclock secs (50.206 usr 9.540 sys 59.745 cpu) @ 0.16=
2/s
> (n=3D1)
> 		(warning: too few iterations for a reliable count)
>   workers2: 14.102 wallclock secs (26.524 usr 0.618 sys 27.142 cpu) @
> 0.071/s (n=3D1)
> 		(warning: too few iterations for a reliable count)
>   workers3: 10.553 wallclock secs (27.808 usr 1.404 sys 29.213 cpu) @
> 0.095/s (n=3D1)
> 		(warning: too few iterations for a reliable count)
>   workers5: 7.650 wallclock secs (31.099 usr 3.803 sys 34.902 cpu) @ 0.13=
1/s
> (n=3D1)
> 		(warning: too few iterations for a reliable count)
> O-----------O----------O----------O-----------O----------O-----------O----------O----------O
> |           | s/iter   | workers3 | workers15 | workers5 | workers10 | workers2 | workers1 |
> O===========O==========O==========O===========O==========O===========O==========O==========O
> | workers3  | 10553022 | --       | -42%      | -28%     | -42%      | 34%      | 113%     |
> | workers15 | 6165235  | 71%      | --        | 24%      | -0%       | 129%     | 265%     |
> | workers5  | 7650413  | 38%      | -19%      | --       | -20%      | 84%      | 194%     |
> | workers10 | 6154300  | 71%      | 0%        | 24%      | --        | 129%     | 265%     |
> | workers2  | 14101512 | -25%     | -56%      | -46%     | -56%      | --       | 59%      |
> | workers1  | 22473185 | -53%     | -73%      | -66%     | -73%      | -37%     | --       |
> --------------------------------------------------------------------------------------------
>
> What's more important is the observation of CPU consumption by the
> moar process. Depending on the number of workers I was getting
> anywhere from 100% load for a single one up to 1000% for the whole
> bunch of 15. This perfectly corresponds with the 6 cores/2 threads per
> core of my CPU.
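
(The ~0.7 sec per work item figure above is easy to check with the
now/ENTER timing idiom quoted further down; a sketch, with an all-'x'
string standing in for a generated one:)

     use Digest::SHA;

     sub worker ( Str:D $str ) {
         my $digest = $str;
         for 1..100 {
             $digest = sha256 $digest;
         }
     }

     # Time a single work item:
     { worker( 'x' x 1024 ); say now - ENTER now; }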
>
>> On Dec 7, 2018, at 02:06, yary <not.com@gmail.com> wrote:
>>
>> That was a bit vague - I meant that I suspect the workers are being
>> starved, since you have many consumers and only a single thread
>> generating the 1k strings. I would prime the channel so it starts out
>> full, or otherwise restructure to ensure all threads are kept busy
>> (see the sketch below).
>>
>> -y
>>
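
A minimal sketch of that restructuring, reusing worker() from the
original post; run-primed is a hypothetical variant of run() that
generates every string up front, loads the (unbounded) channel, closes
it, and only then starts the workers:

     sub run-primed ( Int $workers ) {
         # All inputs are generated before any worker starts, so no
         # worker ever waits on a generator thread.
         my @strings = (1..50).map: {
             (1..1024).map( { (' '..'Z').pick } ).join
         };

         # A Raku Channel is unbounded, so it can be fully loaded
         # and closed up front.
         my $c = Channel.new;
         $c.send($_) for @strings;
         $c.close;

         # Workers drain whatever is buffered; each react block ends
         # once the closed channel runs dry.
         my @w = (1..$workers).map: {
             start {
                 react {
                     whenever $c -> $str {
                         worker( $str );
                     }
                 }
             }
         };
         await @w;
     }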
>> On Thu, Dec 6, 2018 at 10:56 PM yary <not.com@gmail.com> wrote:
>>>
>>> Not sure if your test is measuring what you expect - the setup of
>>> generating 50 x 1k strings is taking 2.7 sec on my laptop, and that's
>>> reducing the apparent effect of parallelism.
>>>
>>> $ perl6
>>> To exit type 'exit' or '^D'
>>>> my $c = Channel.new;
>>> Channel.new
>>>> { for 1..50 {$c.send((1..1024).map( { (' '..'Z').pick } ).join);}; say
>>>> now - ENTER now; }
>>> 2.7289092
>>>
>>> I'd move the setup outside the "cmpthese" and try again, then re-think
>>> the new results.
>>>
>>> --
>>> -y
>>
>
> Best regards,
> Vadim Belman
>
>
0
1parrota
12/7/2018 3:55:18 PM
OK... going back to the hypothesis in the OP

> The plateau is seemingly defined by the number of cores or, more correctly, by the number of supported threads.

This suggests that the benchmark is CPU-bound, which is supported by
your more recent observation of "100% load for a single one".

Also, you mentioned running macOS with two threads per core, which
implies Intel's hyperthreading. Depending on the workload, CPU-bound
processes sharing a hyperthreaded core see a speedup of 0-30%, as
opposed to running on separate cores, which can give a speedup of 100%.
(Back when I searched for large primes, HT gave a 25% speed boost.) So
with 6 cores and 2 HT threads per core, I would expect a maximum
parallel boost of 6 * (1x + 0.30x) = 7.8x - and your test is only
giving half that.
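
As a throwaway check of that arithmetic (the 0.30 is yary's assumed
per-core HT bonus, not a measured constant):

     my $cores    = 6;
     my $ht-bonus = 0.30;            # assumed gain from the HT sibling
     say $cores * (1 + $ht-bonus);   # prints 7.8, the theoretical max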

-y
0
not
12/7/2018 5:04:00 PM
You're damn right here. First of all, I must admit that I've
misinterpreted the benchmark results (guilty). Yet I think I now know
what's really happening here. To make things really clear I ran the
benchmark for every number of workers from 1 to 9. Here is a cleaned-up
output:

     Timing 1 iterations of worker1, worker2, worker3, worker4, worker5,
     worker6, worker7, worker8, worker9...
        worker1: 22.125 wallclock secs (22.296 usr 0.248 sys 22.544 cpu) @ 0.045/s (n=1)
        worker2: 12.554 wallclock secs (24.221 usr 0.715 sys 24.936 cpu) @ 0.080/s (n=1)
        worker3: 9.330 wallclock secs (25.708 usr 1.316 sys 27.024 cpu) @ 0.107/s (n=1)
        worker4: 8.221 wallclock secs (28.151 usr 2.676 sys 30.827 cpu) @ 0.122/s (n=1)
        worker5: 7.131 wallclock secs (30.395 usr 3.658 sys 34.053 cpu) @ 0.140/s (n=1)
        worker6: 7.180 wallclock secs (34.496 usr 4.479 sys 38.975 cpu) @ 0.139/s (n=1)
        worker7: 7.050 wallclock secs (38.267 usr 5.453 sys 43.720 cpu) @ 0.142/s (n=1)
        worker8: 6.668 wallclock secs (41.607 usr 5.586 sys 47.194 cpu) @ 0.150/s (n=1)
        worker9: 7.220 wallclock secs (46.762 usr 11.647 sys 58.409 cpu) @ 0.139/s (n=1)
     O---------O----------O---------O
     |         | s/iter   | worker1 |
     O=========O==========O=========O
     | worker1 | 22125229 | --      |
     | worker2 | 12554094 | 76%     |
     | worker3 | 9329865  | 137%    |
     | worker4 | 8221486  | 169%    |
     | worker5 | 7130758  | 210%    |
     | worker6 | 7180343  | 208%    |
     | worker7 | 7049935  | 214%    |
     | worker8 | 6667794  | 232%    |
     | worker9 | 7219864  | 206%    |
     --------------------------------

The plateau is there, but it is reached even before we run out of
available cores: 5 workers already take all of the CPU power. Yet the
speedup achieved is much less than I'd expected... But then I realized
that there is another player on the field: thermal throttling. And that
effectively makes any further measurements on my notebook useless.

This also answers Parrot's suggestion about possible cache involvement:
that's not it, for sure, especially given that the numbers were more or
less the same on every benchmark run.
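
One way to probe the throttling hypothesis (a sketch; run() is the
benchmark sub from the original post, and $*KERNEL.cpu-cores is how
Rakudo reports the available core count):

     # Time several identical passes back to back; if wallclock times
     # keep growing, the CPU is most likely being throttled.
     my $workers = $*KERNEL.cpu-cores;
     for 1..5 -> $pass {
         my $t0 = now;
         run( $workers );
         say "pass $pass: { now - $t0 } sec";
     }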


Best regards,
Vadim Belman


0
vrurg
12/8/2018 3:18:12 AM