- About Scala
- Documentation
- Code Examples
- Software
- Scala Developers
REPL: null-values in parallel ranges?
Hi scalaists
this evening I started playing a little with the parallel collections and this
crossed my way...
This is most likely only a REPL display effect - but it felt disturbing
somehow (thinking about races and uninitialized values):
-----------------
bjo@noordx:~/Tools/scala-2.9.0.RC1> bin/scala -J-Xmx3000m
Welcome to Scala version 2.9.0.RC1 (Java HotSpot(TM) 64-Bit Server VM, Java
1.6.0_24).
Type in expressions to have them evaluated.
Type :help for more information.
scala> val big = (1 to 10000000)
big: scala.collection.immutable.Range.Inclusive = Range(1, 2, 3, 4, 5, 6, 7,
8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27,
28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46,
47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65,
66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84,
85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102,
103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117,
118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132,
133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147,
148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162,
163, 164, 165, 166, 167, 168, 169, 170,...
scala> val bigger = big.map(e => -e)
bigger: scala.collection.immutable.IndexedSeq[Int] = Vector(-1, -2, -3, -4,
-5, -6, -7, -8, -9, -10, -11, -12, -13, -14, -15, -16, -17, -18, -19, -20,
-21, -22, -23, -24, -25, -26, -27, -28, -29, -30, -31, -32, -33, -34, -35,
-36, -37, -38, -39, -40, -41, -42, -43, -44, -45, -46, -47, -48, -49, -50,
-51, -52, -53, -54, -55, -56, -57, -58, -59, -60, -61, -62, -63, -64, -65,
-66, -67, -68, -69, -70, -71, -72, -73, -74, -75, -76, -77, -78, -79, -80,
-81, -82, -83, -84, -85, -86, -87, -88, -89, -90, -91, -92, -93, -94, -95,
-96, -97, -98, -99, -100, -101, -102, -103, -104, -105, -106, -107, -108,
-109, -110, -111, -112, -113, -114, -115, -116, -117, -118, -119, -120, -121,
-122, -123, -124, -125, -126, -127, -128, -129, -130, -131, -132, -133, -134,
-135, -136, -137, -138, -139, -140, -141,...
scala> val bigpar = big.par
bigpar: scala.collection.parallel.immutable.ParRange = ParRange(1, 2, 3, 4, 5,
6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25,
26, 251, 27, 252, 28, 253, 29, 30, null, null, null, null, null, null, null,
null, null, null, null, null, null, null, null, null, null, null, null, null,
null, null, null, null, null, null, null, null, null, null, null, 511, 261,
46, 512, 47, 513, 262, 48, 514, 49, 515, 50, 263, 51, 516, 52, 264, 53, 517,
54, 518, 55, 265, 520, 56, 521, 57, 522, 58, 523, 59, 524, 60, 525, 266, 61,
526, 62, 527, 267, 528, 529, 530, 268, 531, 63, 269, 532, 533, 64, 270, 534,
535, 65, 536, 271, 537, 66, 538, 272, 539, 67, 540, 273, 541, 68, 542, 274,
543, 69, 544, 545, 275, 546, 547, 70, 548, 276, 549, 71, 126, 550, 277, 551,
72, 552, 127, 278, 553, 73,...
------------------
needless to say that I cannot reproduce this faithfully... most of the time it
looks unsuspicious. But as I saw it once I retried several times - this was
the second time I got this - directly after the start of the REPL session (the
transcript is unaltered)
I'm not sure if this qualifies as bug, as I was not able to get a hold on the
"nulls" during range processing. So it's maybe only a REPL print thingy.
Greetings
Bernd










Re: REPL: null-values in parallel ranges?
Am Samstag, 2. April 2011, 00:13:39 schrieb Alex Cruise:
> I think a lot of people have a mental model of parallel collections as
> "just like sequential, only hopefully faster." Whether this is similar to
> the intent of the designers of 2.9's parallel collections is an
> interesting question.
My first concern goes towards documentation: if foreach on Seq in a parallel
context is going to be evaluated out of order (i.e. parallelized) there should
be a reminder in the documentation for Seq.foreach(). I.e. if somebody is
going to write something which accepts a Seq and is going to execute side
effects on this Seq by means of foreach has to cope with out of order
execution because the Seq might be a parallelized Seq.
> But when side effects are the whole point, I think many people really will
> care about the order in which the actions are taken, so for whatever it's
> worth, I think I'll go along with the "keep foreach sequential, and come up
> something new that works like foreach but makes you sign an explicit
> waiver that you won't get upset when things happen out of order" camp. At
> least until convinced otherwise. :)
I support this. In my experience (as a member of regression test camps) many
errors in production code can be traced back to unexpected "out of order"
execution or unexpected sort odering of things. It's amacing how often I have
seen code break on the expectation that a "select * from whatever" imposes
some order naturally. And its even more amacing how long this error can go
unnoticed... Had the inventor of SQL specified that "select" always requires a
"order by" whereas "unorderedselect" takes no order at all we would have been
spared from some of the most severe errors.
With this experience in the past I recommend to keep the "ordered" behaviour
of foreach() because breaking this will introduce a whole bunch of errors in
parallel context. Errors for which scala is not to be blamed... but it's
terribly error prune to switch from guaranteed ordered / sequencial to
unordered behaviour in parallel context.
I would even go as far as to recommend an artifical "unordered" behaviour of
the sequencial foreach if the scala team decides to keep the parallel
behaviour.
Greetings
Bernd
Re: REPL: null-values in parallel ranges?
On 04/02/11 04:24, Bernd Johannes wrote:
> (...) I.e. if somebody is
> going to write something which accepts a Seq and is going to execute side
> effects on this Seq by means of foreach(...)
given foreach's signature(s):
(f: (A) => Unit) : Unit
[U] (f: (A) => U) : Unit
(mind the trailing ':Unit's) - *every* application of foreach is going to be
invoked for its side-effects.
Regards
Re: REPL: null-values in parallel ranges?
col.foreach(f => ()) // LOOK MAH, NO SIDE EFFECTS!
On Mon, Apr 4, 2011 at 1:00 PM, Martin S. Weber <martin [dot] weber [at] nist [dot] gov> wrote:
Re: REPL: null-values in parallel ranges?
another example is logging. while it does have an effect, the effect does only apply to the "outside world"
Am 04.04.2011 19:14, schrieb Josh Suereth:
Re: REPL: null-values in parallel ranges?
In which case what will happen is that people will come complain that they changed their collection to parallel, but nothing happened! At which point we'll tell them, "No, it's not just changing the collection to parallel. You'll have to rewrite your whole code as well."
I someone uses a parallel collection, he _expects_ things to be parallel. Sure, they'll break code that worked for serial collection in new and interesting ways, but that is easily fixed: don't use parallel collections. You can even call .seq before sending a collection to that part of the code, and call .par on anything getting back.
--
Daniel C. Sobral
I travel to the future all the time.
Re: REPL: null-values in parallel ranges?
If foreach is executed in parallel in some cases, that is going to
cause a lot of bugs from users in a couple related ways:
* Code that took a generic sequence and expected the order to be
consistent.
* Code that builds up _any_ data structure that is not thread-safe.
Someone may have some legacy code that traverses a sequence and put
them into a Btree. If they are now passed a parallel collection
(which they knew nothing about at the time of the original code), and
were a bit careless, that code will likely explode. It may have been
bad code in the first place, but that is entirely besides the point.
The semantics of an API are changing in a way that will break existing
code. The compiler doesn't yell at you and break with "sorry, that
is not a pure function!"
How much code out there using foreach on a sequence was written in a
way that won't break if executed with parallel threads?
Users of parallel collections can call .seq before handing off to
elsewhere, but this is not type safe and error prone. Are parallel
collections meant to be type safe with respect to ordinary
collections? Or should consumers add boilerplate to check for
parallel or call .seq for every one they receive? Should all legacy
code be converted to check?
I'm far too new to Scala to know what can and can't be done with types
here, but my gut tells me that legacy code shouldn't break or require
change, and new code shouldn't require boilerplate type checks or
conversions in order to be safe.
On Apr 2, 11:03 am, Daniel Sobral wrote:
> On Sat, Apr 2, 2011 at 05:24, Bernd Johannes wrote:
>
> > With this experience in the past I recommend to keep the "ordered"
> > behaviour
> > of foreach() because breaking this will introduce a whole bunch of errors
> > in
> > parallel context. Errors for which scala is not to be blamed... but it's
> > terribly error prune to switch from guaranteed ordered / sequencial to
> > unordered behaviour in parallel context.
>
> > I would even go as far as to recommend an artifical "unordered" behaviour
> > of
> > the sequencial foreach if the scala team decides to keep the parallel
> > behaviour.
>
> In which case what will happen is that people will come complain that they
> changed their collection to parallel, but nothing happened! At which point
> we'll tell them, "No, it's not just changing the collection to parallel.
> You'll have to rewrite your whole code as well."
>
> I someone uses a parallel collection, he _expects_ things to be parallel.
> Sure, they'll break code that worked for serial collection in new and
> interesting ways, but that is easily fixed: don't use parallel collections.
> You can even call .seq before sending a collection to that part of the code,
> and call .par on anything getting back.
>
> --
> Daniel C. Sobral
>
> I travel to the future all the time.
Re: REPL: null-values in parallel ranges?
if you need a sequencial foreach, make your parallel collection a
sequential one. according to the documentation, this conversion is
basically free.
for side effect free code, the order doesn't matter at all. i don't see
a big problem here.
Am 02.04.2011 10:24, schrieb Bernd Johannes:
> Am Samstag, 2. April 2011, 00:13:39 schrieb Alex Cruise:
>> I think a lot of people have a mental model of parallel collections as
>> "just like sequential, only hopefully faster." Whether this is similar to
>> the intent of the designers of 2.9's parallel collections is an
>> interesting question.
> My first concern goes towards documentation: if foreach on Seq in a parallel
> context is going to be evaluated out of order (i.e. parallelized) there should
> be a reminder in the documentation for Seq.foreach(). I.e. if somebody is
> going to write something which accepts a Seq and is going to execute side
> effects on this Seq by means of foreach has to cope with out of order
> execution because the Seq might be a parallelized Seq.
>
>> But when side effects are the whole point, I think many people really will
>> care about the order in which the actions are taken, so for whatever it's
>> worth, I think I'll go along with the "keep foreach sequential, and come up
>> something new that works like foreach but makes you sign an explicit
>> waiver that you won't get upset when things happen out of order" camp. At
>> least until convinced otherwise. :)
> I support this. In my experience (as a member of regression test camps) many
> errors in production code can be traced back to unexpected "out of order"
> execution or unexpected sort odering of things. It's amacing how often I have
> seen code break on the expectation that a "select * from whatever" imposes
> some order naturally. And its even more amacing how long this error can go
> unnoticed... Had the inventor of SQL specified that "select" always requires a
> "order by" whereas "unorderedselect" takes no order at all we would have been
> spared from some of the most severe errors.
>
> With this experience in the past I recommend to keep the "ordered" behaviour
> of foreach() because breaking this will introduce a whole bunch of errors in
> parallel context. Errors for which scala is not to be blamed... but it's
> terribly error prune to switch from guaranteed ordered / sequencial to
> unordered behaviour in parallel context.
>
> I would even go as far as to recommend an artifical "unordered" behaviour of
> the sequencial foreach if the scala team decides to keep the parallel
> behaviour.
>
> Greetings
> Bernd
>
>
Re: REPL: null-values in parallel ranges - Seq.foreach acting u
Am Samstag, 2. April 2011, 13:14:11 schrieb HamsterofDeath:
> if you need a sequencial foreach, make your parallel collection a
> sequential one. according to the documentation, this conversion is
> basically free.
> for side effect free code, the order doesn't matter at all. i don't see
> a big problem here.
It's true - it's just about operations where the order of sequencial traversal
matters (either globally or locally). That's why I don't see big issues with
map() and the like.
Maybe I am naiv - or the average coding quality has improved a lot beyond that
what I used to see during my testing years (that's been a while in the past
from now).
With the obvious and simple example below I want to illustrate where I fear
that parallelized foreach will wreak havoc. It's not obvious during
compilation - maybe it goes unnoticed on small test systems... but it will
fail in production environment at some point in time.
And that's what I am concerned about. It's not obvious, not fast failing and
may be introduced almost "silently" many many lines of code away...
Just my 5 cents...
Greetings
Bernd
scala> def writeResult(res: scala.collection.immutable.Seq[String]): Unit = {
| // normally does things like derive filename, cleanup old versions
| // opens streams, error handling and the like and then writes
| // the content
| res.foreach((s) => println(s))
| }
writeResult: (res: scala.collection.immutable.Seq[String])Unit
// just a make up result (and almost always in result files order matters)
scala> val a = List("a", "b", "c", "d", "e")
a: List[java.lang.String] = List(a, b, c, d, e)
// that one is ok
scala> writeResult(a)
a
b
c
d
e
// that one might be introduced because the Seq is a bigger one
// and there are many cores awaiting work...
scala> val pa = a.par
pa: scala.collection.parallel.immutable.ParSeq[java.lang.String] =
ParVector(a, b, c, d, e)
// after some processing the Seq is to be saved...
scala> writeResult(pa)
a
b
d
e
c
// an unexpected behaviour that spreads from the point of par
// introduction into
// formerly proper working pieces of code which silently assumed that
// foreach is acting faithfully on the sequence of the Seq
Re: REPL: null-values in parallel ranges - Seq.foreach acting u
if the order matters, you cannot use multiple threads. every
multithreading coding knows that. or am i wrong?
Am 02.04.2011 15:34, schrieb Bernd Johannes:
> Am Samstag, 2. April 2011, 13:14:11 schrieb HamsterofDeath:
>> if you need a sequencial foreach, make your parallel collection a
>> sequential one. according to the documentation, this conversion is
>> basically free.
>> for side effect free code, the order doesn't matter at all. i don't see
>> a big problem here.
> It's true - it's just about operations where the order of sequencial traversal
> matters (either globally or locally). That's why I don't see big issues with
> map() and the like.
>
> Maybe I am naiv - or the average coding quality has improved a lot beyond that
> what I used to see during my testing years (that's been a while in the past
> from now).
> With the obvious and simple example below I want to illustrate where I fear
> that parallelized foreach will wreak havoc. It's not obvious during
> compilation - maybe it goes unnoticed on small test systems... but it will
> fail in production environment at some point in time.
>
> And that's what I am concerned about. It's not obvious, not fast failing and
> may be introduced almost "silently" many many lines of code away...
>
> Just my 5 cents...
> Greetings
> Bernd
>
> scala> def writeResult(res: scala.collection.immutable.Seq[String]): Unit = {
> | // normally does things like derive filename, cleanup old versions
> | // opens streams, error handling and the like and then writes
> | // the content
> | res.foreach((s) => println(s))
> | }
> writeResult: (res: scala.collection.immutable.Seq[String])Unit
>
> // just a make up result (and almost always in result files order matters)
> scala> val a = List("a", "b", "c", "d", "e")
> a: List[java.lang.String] = List(a, b, c, d, e)
>
> // that one is ok
> scala> writeResult(a)
> a
> b
> c
> d
> e
>
> // that one might be introduced because the Seq is a bigger one
> // and there are many cores awaiting work...
> scala> val pa = a.par
> pa: scala.collection.parallel.immutable.ParSeq[java.lang.String] =
> ParVector(a, b, c, d, e)
>
> // after some processing the Seq is to be saved...
> scala> writeResult(pa)
> a
> b
> d
> e
> c
>
> // an unexpected behaviour that spreads from the point of par
> // introduction into
> // formerly proper working pieces of code which silently assumed that
> // foreach is acting faithfully on the sequence of the Seq
>
Re: REPL: null-values in parallel ranges - Seq.foreach acting u
… from which follows that parallel collections cannot be subtypes of Seq.
It has been said several times and is obviously true, so why not acknowledge the fact?
(pesky old me)
On Apr 2, 2011, at 16:00, HamsterofDeath wrote:
> if the order matters, you cannot use multiple threads. every
> multithreading coding knows that. or am i wrong?
>
> Am 02.04.2011 15:34, schrieb Bernd Johannes:
>> Am Samstag, 2. April 2011, 13:14:11 schrieb HamsterofDeath:
>>> if you need a sequencial foreach, make your parallel collection a
>>> sequential one. according to the documentation, this conversion is
>>> basically free.
>>> for side effect free code, the order doesn't matter at all. i don't see
>>> a big problem here.
>> It's true - it's just about operations where the order of sequencial traversal
>> matters (either globally or locally). That's why I don't see big issues with
>> map() and the like.
>>
>> Maybe I am naiv - or the average coding quality has improved a lot beyond that
>> what I used to see during my testing years (that's been a while in the past
>> from now).
>> With the obvious and simple example below I want to illustrate where I fear
>> that parallelized foreach will wreak havoc. It's not obvious during
>> compilation - maybe it goes unnoticed on small test systems... but it will
>> fail in production environment at some point in time.
>>
>> And that's what I am concerned about. It's not obvious, not fast failing and
>> may be introduced almost "silently" many many lines of code away...
>>
>> Just my 5 cents...
>> Greetings
>> Bernd
>>
>> scala> def writeResult(res: scala.collection.immutable.Seq[String]): Unit = {
>> | // normally does things like derive filename, cleanup old versions
>> | // opens streams, error handling and the like and then writes
>> | // the content
>> | res.foreach((s) => println(s))
>> | }
>> writeResult: (res: scala.collection.immutable.Seq[String])Unit
>>
>> // just a make up result (and almost always in result files order matters)
>> scala> val a = List("a", "b", "c", "d", "e")
>> a: List[java.lang.String] = List(a, b, c, d, e)
>>
>> // that one is ok
>> scala> writeResult(a)
>> a
>> b
>> c
>> d
>> e
>>
>> // that one might be introduced because the Seq is a bigger one
>> // and there are many cores awaiting work...
>> scala> val pa = a.par
>> pa: scala.collection.parallel.immutable.ParSeq[java.lang.String] =
>> ParVector(a, b, c, d, e)
>>
>> // after some processing the Seq is to be saved...
>> scala> writeResult(pa)
>> a
>> b
>> d
>> e
>> c
>>
>> // an unexpected behaviour that spreads from the point of par
>> // introduction into
>> // formerly proper working pieces of code which silently assumed that
>> // foreach is acting faithfully on the sequence of the Seq
>>
>
Re: REPL: null-values in parallel ranges - Seq.foreach acting u
A sequence simply means one can expect to tell member keep position relative to each other. One _can_ have a sequence and make operations on it and still have these operations processed in parallel.
What I don't understand is why people want to remove this desirable property just so people can actually call ".par" on something and see nothing happen?
I'll grant that foreach, which cannot do anything without side-effects, is a prime candidate to be left sequential... aside from the unfortunate fact that all Traversable methods are based on it.
--
Daniel C. Sobral
I travel to the future all the time.
Re: REPL: null-values in parallel ranges - Seq.foreach acting u
a parallel sequence can replace a sequencial sequence (feels weird
writing that) for every method except foreach. maybe there should be an
ordered-foreach-able trait and a foreachable-random-order-trait? then it
could be added to any collection that can do it.
Am 02.04.2011 16:28, schrieb Roland Kuhn:
> … from which follows that parallel collections cannot be subtypes of Seq.
>
> It has been said several times and is obviously true, so why not acknowledge the fact?
>
> (pesky old me)
>
> On Apr 2, 2011, at 16:00, HamsterofDeath wrote:
>
>> if the order matters, you cannot use multiple threads. every
>> multithreading coding knows that. or am i wrong?
>>
>> Am 02.04.2011 15:34, schrieb Bernd Johannes:
>>> Am Samstag, 2. April 2011, 13:14:11 schrieb HamsterofDeath:
>>>> if you need a sequencial foreach, make your parallel collection a
>>>> sequential one. according to the documentation, this conversion is
>>>> basically free.
>>>> for side effect free code, the order doesn't matter at all. i don't see
>>>> a big problem here.
>>> It's true - it's just about operations where the order of sequencial traversal
>>> matters (either globally or locally). That's why I don't see big issues with
>>> map() and the like.
>>>
>>> Maybe I am naiv - or the average coding quality has improved a lot beyond that
>>> what I used to see during my testing years (that's been a while in the past
>>> from now).
>>> With the obvious and simple example below I want to illustrate where I fear
>>> that parallelized foreach will wreak havoc. It's not obvious during
>>> compilation - maybe it goes unnoticed on small test systems... but it will
>>> fail in production environment at some point in time.
>>>
>>> And that's what I am concerned about. It's not obvious, not fast failing and
>>> may be introduced almost "silently" many many lines of code away...
>>>
>>> Just my 5 cents...
>>> Greetings
>>> Bernd
>>>
>>> scala> def writeResult(res: scala.collection.immutable.Seq[String]): Unit = {
>>> | // normally does things like derive filename, cleanup old versions
>>> | // opens streams, error handling and the like and then writes
>>> | // the content
>>> | res.foreach((s) => println(s))
>>> | }
>>> writeResult: (res: scala.collection.immutable.Seq[String])Unit
>>>
>>> // just a make up result (and almost always in result files order matters)
>>> scala> val a = List("a", "b", "c", "d", "e")
>>> a: List[java.lang.String] = List(a, b, c, d, e)
>>>
>>> // that one is ok
>>> scala> writeResult(a)
>>> a
>>> b
>>> c
>>> d
>>> e
>>>
>>> // that one might be introduced because the Seq is a bigger one
>>> // and there are many cores awaiting work...
>>> scala> val pa = a.par
>>> pa: scala.collection.parallel.immutable.ParSeq[java.lang.String] =
>>> ParVector(a, b, c, d, e)
>>>
>>> // after some processing the Seq is to be saved...
>>> scala> writeResult(pa)
>>> a
>>> b
>>> d
>>> e
>>> c
>>>
>>> // an unexpected behaviour that spreads from the point of par
>>> // introduction into
>>> // formerly proper working pieces of code which silently assumed that
>>> // foreach is acting faithfully on the sequence of the Seq
>>>
Re: REPL: null-values in parallel ranges - Seq.foreach acting u
A traversable class might or might not have two properties: strictness and orderedness. Neither is represented as a type.
If Traversable does not make any guarantees on strictness and orderedness, then consequently its methods cannot assume anything about that either. That includes the foreach method. Right?
2011/4/2 HamsterofDeath <h-star [at] gmx [dot] de>
Re: REPL: null-values in parallel ranges - Seq.foreach acting u
"If a collection is an instance of an ordered collection class, traversing
its elements with `foreach` will always visit elements in the
same order, even for different runs of the program."
-- Jim
On Mon, Apr 4, 2011 at 2:31 AM, Wilfred Springer <wilfredspringer [at] gmail [dot] com> wrote:
Re: REPL: null-values in parallel ranges - Seq.foreach acting u
"If the class is not ordered,
foreachcan visit elements in different orders for different runs (but it will keep the same order in the same run)."So `foreach` promises to be deterministic within a run of a program, even when no order is defined on the collection.
-- Jim
On Mon, Apr 4, 2011 at 3:31 AM, Jim Balter <Jim [at] balter [dot] name> wrote:
Re: REPL: null-values in parallel ranges - Seq.foreach acting un
Just to recap:
Also, it additionally implicitly promises that 2 elements will not be
traversed at the same time.
ParSeq's are special in the sense that while they have an order
implied by element indices, they process them in no prespecified
order, and concurrently.
Various methods, such as firstIndexOf, take that ordering into account
(the first element in the ordering is found).
Alex
On Apr 4, 12:35 pm, Jim Balter wrote:
> And I should have included the rest of it:
>
> "If the class is not ordered, foreach can visit elements in different orders
> for different runs (but it will keep the same order in the same run)."
>
> So `foreach` promises to be deterministic within a run of a program, even
> when no order is defined on the collection.
>
Re: Re: REPL: null-values in parallel ranges - Seq.foreach acti
Hi Aleksandar,
I have an off-the-wall idea for parallel foreach. I do like the
consistency that letting foreach be parallel like all the other higher
order methods on parallel collections has, but I worry it is error
prone. I think it would be safer if by default foreach was sequential,
but there was some other way to get a parallel foreach if you want
one. pforeach is one way, but it feels inconsistent. The other problem
is pforeach can't be used from a for expression.
What if parallel collections by default did a sequential foreach, but
there was a method on parallel collections that you call to get
another parallel collection with everything the same but a parallel
foreach. So given a parallel array a, this would iterate sequentially:
for (i <- 1 until N)
println(a(i))
But this would execute a parallel foreach:
for (i <- (1 until N).pfor)
a(i) = a(i) * scale
Something like that. Sequential would still be the default on parallel
collections, which I think is frankly what people would really want
from foreach. It's simple to do the safe thing. But when you really
want a parallel foreach you can get it, and you don't have to tighten
the contract of foreach in Traversable or add pforeach to it. (Instead
you'd add a method like "pfor".) And you can use for expressions to
get parallel foreach.
Bill
On Mon, Apr 4, 2011 at 3:41 AM, Aleksandar Prokopec
wrote:
> Just to recap:
>
> Also, it additionally implicitly promises that 2 elements will not be
> traversed at the same time.
>
> ParSeq's are special in the sense that while they have an order
> implied by element indices, they process them in no prespecified
> order, and concurrently.
> Various methods, such as firstIndexOf, take that ordering into account
> (the first element in the ordering is found).
>
> Alex
>
>
> On Apr 4, 12:35 pm, Jim Balter wrote:
>> And I should have included the rest of it:
>>
>> "If the class is not ordered, foreach can visit elements in different orders
>> for different runs (but it will keep the same order in the same run)."
>>
>> So `foreach` promises to be deterministic within a run of a program, even
>> when no order is defined on the collection.
>>
Re: Re: REPL: null-values in parallel ranges - Seq.foreach acti
Hi Bill,
A parallel foreach does make sense in quite a few situations, and it can be used within a `for`. For example:
val cmap = new java.util.concurrent.ConcurrentHashMap[Int, Int]
for (i <- (0 until N).par) { // this iterates over a ParRange
cmap.put(i, i)
}
or:
@volatile val flag = false
for (i <- (0 until N).par) {
if (check(i)) flag = true
}
In fact, any kind of properly synchronized side-effect access is ok, provided that the side-effect is commutative and associative. In the above example, that is the case. Having an AtomicInteger, AtomicDouble, ..., and incrementing it; adding a disjunct set of items in a set that is properly synchronized, removing a disjunct set of items to filter out some collection, doing a query and logging it somewhere within a synchronized block ... these are all examples where you want to traverse things in parallel.
The issue with a parallel foreach is not so much the issue of the out-of-order execution, as it is the issue of concurrent execution.
So, there is a very good use case for a parallel foreach - not all side-effects are exclusive with parallelism.
The problem, of course, as you've put it is in the most common use-case. People have the mindset that foreach can be used with any kind of a side-effecting closure.
I am still unsure whether this use case is common enough to introduce a non-uniformity which says that foreach behave differently.
Alex
Re: Re: REPL: null-values in parallel ranges - Seq.foreach acti
Hi Alex,
On Wed, Apr 6, 2011 at 12:03 AM, Aleksandar Prokopec
wrote:
> On 06/04/11 00:32, Bill Venners wrote:
>
> Hi Aleksandar,
>
> I have an off-the-wall idea for parallel foreach. I do like the
> consistency that letting foreach be parallel like all the other higher
> order methods on parallel collections has, but I worry it is error
> prone. I think it would be safer if by default foreach was sequential,
> but there was some other way to get a parallel foreach if you want
> one. pforeach is one way, but it feels inconsistent. The other problem
> is pforeach can't be used from a for expression.
>
> What if parallel collections by default did a sequential foreach, but
> there was a method on parallel collections that you call to get
> another parallel collection with everything the same but a parallel
> foreach. So given a parallel array a, this would iterate sequentially:
>
> for (i <- 1 until N)
> println(a(i))
>
> But this would execute a parallel foreach:
>
> for (i <- (1 until N).pfor)
> a(i) = a(i) * scale
>
> Something like that. Sequential would still be the default on parallel
> collections, which I think is frankly what people would really want
> from foreach. It's simple to do the safe thing. But when you really
> want a parallel foreach you can get it, and you don't have to tighten
> the contract of foreach in Traversable or add pforeach to it. (Instead
> you'd add a method like "pfor".) And you can use for expressions to
> get parallel foreach.
>
> Hi Bill,
>
> A parallel foreach does make sense in quite a few situations, and it can be
> used within a `for`. For example:
>
> val cmap = new java.util.concurrent.ConcurrentHashMap[Int, Int]
>
> for (i <- (0 until N).par) { // this iterates over a ParRange
> cmap.put(i, i)
> }
>
> or:
>
> @volatile val flag = false
>
> for (i <- (0 until N).par) {
> if (check(i)) flag = true
> }
>
> In fact, any kind of properly synchronized side-effect access is ok,
> provided that the side-effect is commutative and associative. In the above
> example, that is the case. Having an AtomicInteger, AtomicDouble, ..., and
> incrementing it; adding a disjunct set of items in a set that is properly
> synchronized, removing a disjunct set of items to filter out some
> collection, doing a query and logging it somewhere within a synchronized
> block ... these are all examples where you want to traverse things in
> parallel.
>
> The issue with a parallel foreach is not so much the issue of the
> out-of-order execution, as it is the issue of concurrent execution.
> So, there is a very good use case for a parallel foreach - not all
> side-effects are exclusive with parallelism.
>
> The problem, of course, as you've put it is in the most common use-case.
> People have the mindset that foreach can be used with any kind of a
> side-effecting closure.
> I am still unsure whether this use case is common enough to introduce a
> non-uniformity which says that foreach behave differently.
>
I'm not sure either. I have notions of how I'd like to use parallel
collections, but I don't know what I'll really think about how I want
to use them until I get some experience with them. Nor do I have any
idea how the community at large will end up wanting to use them.
One way I think people will want to use them is as performance
optimizations applied when there's a performance problem. In this use
case I think a parallel foreach like you have now is probably fine,
because the .par's would be applied very locally and rarely. This is
probably how I'll start out, by dipping my toe in by seeing if I can
improve performance issues.
But I have felt the promise of parallel collections is that they make
it much easier to write programs that take performance advantage of
multiple cores by using them more generally. Starting out I won't
change my Seq's to AnySeq's in type signatures. Until experience
proves that's the right thing to do I'd rather keep sequential types
the lingua franca of interfaces, and maybe we'll discover sequential
the right way to do signtatures. Don't know yet. But inside methods,
I'd like to try using parallel collections generally (a little bit at
a time, to see how it goes).
The reason is that (foreach excepted) I already program in a
thread-safe style when I call higher order methods on collections. I
didn't do it to be thread-safe, but just because that's how I think
about those methods. They are data transformers, and I pass in a
function that takes one kind of data and transforms it to another,
with no side effects. Or I pass in a predicate that evaluates input
and returns true or false, with no side effects. Now I may have in the
past written some existing code passed to a map, or in a foreach with
a yield, that has a side effect, because I wasn't really trying to
make sure I had no side effects before. It's possible, but if it
happened at all it didn't happen much. And I think it wouldn't be that
difficult for me to be careful not to have side effects in my
passed-in functions once I start using parallel collections more
generally (other than to foreach).
I do fairly regularly use for without yield (a for "loop"), because
sometimes I need to program side effects, and I want to avoid indexing
with while, or avoid recursion, etc. So I use a for loop, and in that
case, I'd prefer to not have to say .seq because my guess is that this
is the most common thing I'll want to do with a for loop. I.e., if
foreach is parallel like the other methods, the way it is in .RC1, I
can just do this when I want a sequential foreach on an otherwise
parallel collection:
for (e <- pList.seq)
println(e)
That's not so terrible, but if a sequential foreach that's what I want
to do most of the time, and a parallel foreach rarely, then I'd rather
have a way to get a type that has all higher order functions other
than foreach run in parallel, and foreach run sequentially, with a
method like .pfor I could call to get a fully parallel variant. And
the latter I would use more as a performance optimization:
@volatile val flag = false
// performance optimization
for (i <- (0 until N).par.pfor) {
if (check(i)) flag = true
}
Otherwise, when I'm using parallel collections more generally,
whenever I do a for loop I'd need to either call .seq or use
java.util.concurrent, or volatile, etc. I think what I'm after as a
user is that I can program in the style I have been programming, just
being more careful to be sure I don't have side effects in functions
passed to non-foreach higher order functions, using parallel
collections instead of the regular ones. That also means that I don't
have to change how I think about for loops. I.e., I don't have to
worry about making those thread safe, unless I say something extra
like .pfor.
Bill
----
Bill Venners
Artima, Inc.
http://www.artima.com
Re: Re: REPL: null-values in parallel ranges - Seq.foreach acti
Hi Bill,
These are good points about the uses of parallel collections.
Well, I guess the semantics of the `foreach` in parallel collections is
for the users to decide in the end.
From a uniformity and principle of least surprise point of view, I
think it's somewhat non-uniform.
From a use case point of view, you're right, `for` is mostly used for
side-effects.
From my point of view, this is a trivial refactoring, and it can be
changed anytime.
Cheers,
Alex
> I'm not sure either. I have notions of how I'd like to use parallel
> collections, but I don't know what I'll really think about how I want
> to use them until I get some experience with them. Nor do I have any
> idea how the community at large will end up wanting to use them.
>
> One way I think people will want to use them is as performance
> optimizations applied when there's a performance problem. In this use
> case I think a parallel foreach like you have now is probably fine,
> because the .par's would be applied very locally and rarely. This is
> probably how I'll start out, by dipping my toe in by seeing if I can
> improve performance issues.
>
> But I have felt the promise of parallel collections is that they make
> it much easier to write programs that take performance advantage of
> multiple cores by using them more generally. Starting out I won't
> change my Seq's to AnySeq's in type signatures. Until experience
> proves that's the right thing to do I'd rather keep sequential types
> the lingua franca of interfaces, and maybe we'll discover sequential
> the right way to do signtatures. Don't know yet. But inside methods,
> I'd like to try using parallel collections generally (a little bit at
> a time, to see how it goes).
>
> The reason is that (foreach excepted) I already program in a
> thread-safe style when I call higher order methods on collections. I
> didn't do it to be thread-safe, but just because that's how I think
> about those methods. They are data transformers, and I pass in a
> function that takes one kind of data and transforms it to another,
> with no side effects. Or I pass in a predicate that evaluates input
> and returns true or false, with no side effects. Now I may have in the
> past written some existing code passed to a map, or in a foreach with
> a yield, that has a side effect, because I wasn't really trying to
> make sure I had no side effects before. It's possible, but if it
> happened at all it didn't happen much. And I think it wouldn't be that
> difficult for me to be careful not to have side effects in my
> passed-in functions once I start using parallel collections more
> generally (other than to foreach).
>
> I do fairly regularly use for without yield (a for "loop"), because
> sometimes I need to program side effects, and I want to avoid indexing
> with while, or avoid recursion, etc. So I use a for loop, and in that
> case, I'd prefer to not have to say .seq because my guess is that this
> is the most common thing I'll want to do with a for loop. I.e., if
> foreach is parallel like the other methods, the way it is in .RC1, I
> can just do this when I want a sequential foreach on an otherwise
> parallel collection:
>
> for (e<- pList.seq)
> println(e)
>
> That's not so terrible, but if a sequential foreach that's what I want
> to do most of the time, and a parallel foreach rarely, then I'd rather
> have a way to get a type that has all higher order functions other
> than foreach run in parallel, and foreach run sequentially, with a
> method like .pfor I could call to get a fully parallel variant. And
> the latter I would use more as a performance optimization:
>
> @volatile val flag = false
>
> // performance optimization
> for (i<- (0 until N).par.pfor) {
> if (check(i)) flag = true
> }
>
> Otherwise, when I'm using parallel collections more generally,
> whenever I do a for loop I'd need to either call .seq or use
> java.util.concurrent, or volatile, etc. I think what I'm after as a
> user is that I can program in the style I have been programming, just
> being more careful to be sure I don't have side effects in functions
> passed to non-foreach higher order functions, using parallel
> collections instead of the regular ones. That also means that I don't
> have to change how I think about for loops. I.e., I don't have to
> worry about making those thread safe, unless I say something extra
> like .pfor.
>
> Bill
> ----
> Bill Venners
> Artima, Inc.
> http://www.artima.com
>
Re: Re: REPL: null-values in parallel ranges - Seq.foreach acti
On Thu, Apr 7, 2011 at 4:22 AM, Aleksandar Prokopec <aleksandar [dot] prokopec [at] gmail [dot] com> wrote:
What is the purpose of uniformity?
In terms of least surprise, I think a sequential foreach gives the least surprise.
How about in 2.9 be more on the conservative side and "leave" foreach sequential, and if there is demand for a parallel foreach (e.g. people are surprised by the non-uniformity) then change it post-2.9.
If you take a risk of upsetting people, make it parallel, and find out that it did indeed upset people, it will be much more of a deal to change it back. Plus it could give the whole "parallel collections experiment" a bad name (perhaps among uninformed people, but that is still a factor).
In other words, since there is a doubt what to do, which means that either way we go we may potentially have regret, better to risk regretting conservativeness and correct it (make the change later), than risk regretting haste.
Re: Re: REPL: null-values in parallel ranges - Seq.foreach acti
Now that old code can't be passed a parallel collection, I think it is a mistake to let programmers who are explicitly going for parallel code to get away with incorrect code. If a parallel sequence ever promises a sequential foreach, it will never be reverted.
So, what is the _gain_ of making foreach sequential? That it will allow people who explicitly ask for a parallel collection to do non-parallel things with it without asking for a non-parallel collection first? I fail to see the use in that.
It seems people are afraid that people trying to write parallel code will make mistakes, and that the solution to that is make their code non-parallel.
--
Daniel C. Sobral
I travel to the future all the time.
Re: Re: REPL: null-values in parallel ranges - Seq.foreach acti
I agree with this comment, Daniel. We now have a way to prevent (old)
code from using a parallel collection unless it explicitly states that.
Alex
> Now that old code can't be passed a parallel collection, I think it is
> a mistake to let programmers who are explicitly going for parallel
> code to get away with incorrect code. If a parallel sequence ever
> promises a sequential foreach, it will never be reverted.
>
> So, what is the _gain_ of making foreach sequential? That it will
> allow people who explicitly ask for a parallel collection to do
> non-parallel things with it without asking for a non-parallel
> collection first? I fail to see the use in that.
>
> It seems people are afraid that people trying to write parallel code
> will make mistakes, and that the solution to that is make their code
> non-parallel.
>
>
Re: Re: REPL: null-values in parallel ranges - Seq.foreach acti
Hi Naftoli,
On Thu, Apr 7, 2011 at 6:01 PM, Naftoli Gugenheim wrote:
>
>
> On Thu, Apr 7, 2011 at 4:22 AM, Aleksandar Prokopec
> wrote:
>>
>> From a uniformity and principle of least surprise point of view, I think
>> it's somewhatnon-uniform.
>
> What is the purpose of uniformity?
> In terms of least surprise, I think a sequential foreach gives the least
> surprise.
>
I think the value of uniformity is that the code does what it looks
like it does. If you say .par, you get parallel. For example, I think
this code looks like it would do a parallel foreach:
for (i <- (1 until 10000).par)
a(i) = a(i) * scale
>> From a use case point of view, you're right, `for` is mostly used for
>> side-effects.
>
> How about in 2.9 be more on the conservative side and "leave" foreach
> sequential, and if there is demand for a parallel foreach (e.g. people are
> surprised by the non-uniformity) then change it post-2.9.
> If you take a risk of upsetting people, make it parallel, and find out that
> it did indeed upset people, it will be much more of a deal to change it
> back. Plus it could give the whole "parallel collections experiment" a bad
> name (perhaps among uninformed people, but that is still a factor).
> In other words, since there is a doubt what to do, which means that either
> way we go we may potentially have regret, better to risk regretting
> conservativeness and correct it (make the change later), than risk
> regretting haste.
>
Isn't it the other way around? If .par gives a sequential foreach in
2.9, then changing that to a parallel foreach later would break all
code written under the assumption foreach is sequential. But if .par
gives a parallel foreach in 2.9 and that turns out to be too
error-prone in practice, it could be changed to a sequential foreach
later without breaking code (other than slowing things down in some
cases). Maybe slowing things down would qualify as breaking code in
some situations, but that's far better than introducing concurrency
bugs. You really could never go from a sequential foreach to a
parallel one later with the same method name.
But I think the best thing to do in 2.9 is find some way to give
people both. For example, either let .par give a sequential foreach
with a .pfor method to get a parallel foreach, Or let .par give a
parallel foreach with a .sfor method to get a sequential foreach. If
that's the direction the hard question is which should be the default.
I think .par meaning parallel everything provides the clearest code.
That for expression I showed earlier would do exactly what it looks
like it does, run that for loop in parallel. There's value in that.
And "coll.par.sfor" look like it means "a parallel collection with a
sequential foreach" to me, which is exactly what it would mean. The
downside of that is the default you get with .par is the less safe of
the two variants. It is also possibly the less commonly desired of the
two. I think most times I use parallel collections I'll probably want
the sequential foreach, though I won't know for sure until I've
actually used these things for a good while.
If there's a solution that provides the clarity of .par giving
parallel everything including foreach with a subsequent .sfor giving a
parallel collection with a sequential foreach, but gives the
sequential one as the default, that would be ideal. Maybe there are
two method names, one of which would take you right to a parallel
collection with parallel foreach and the other to a parallel
collection with a sequential foreach, but I can't think of anything as
clear as .par.sfor. And frankly I think we could live with the default
being parallel everything so long as there's another step that would
produce the sequential variant. I think people would learn it pretty
quickly, and probably not forget to call an extra method. But the best
would be a solution that gives both clear code and a safer default.
Bill
----
Bill Venners
Artima, Inc.
http://www.artima.com
Re: Re: REPL: null-values in parallel ranges - Seq.foreach acti
On Fri, Apr 8, 2011 at 5:18 AM, Bill Venners wrote:
> I think .par meaning parallel everything provides the clearest code.
> That for expression I showed earlier would do exactly what it looks
> like it does, run that for loop in parallel. There's value in that.
> And "coll.par.sfor" look like it means "a parallel collection with a
> sequential foreach" to me, which is exactly what it would mean.
Here, I'm not so sure if "the looks" are relevant here. Because, I
don't think that's the real use case. Let me say first, that I've
pretty much agreed with all of your well-reasoned contributions to
this discussion. Particularly, these points:
* even if all others operations are parallel, `foreach` seems like a
special case because it is likely that a concurrent foreach will
produce all sorts of problems
* however, there are still good use cases for both a sequential and
parallel version of `foreach`
* the question is now what `foreach` should do by default and how to
access the other version
That said, I don't see anyone using this particular construct:
`coll.par.sfor`. Why should anyone use it? Only if one wants to use
the same collection for both parallel operations and a sequential
foreach, but then again you could just use coll.seq before doing a
sequential foreach (if coll is not known to be sequential).
At the point where you would make the collection parallel you will
often not know how it is used later on. You know if you need a
sequential `foreach` when you are writing the loop body, not
necessarily when you are creating the collection. That seems like the
basic flaw in your suggestion: `sfor` looks like it would depend on
the collection if a parallel `foreach` is safe but in most (all?)
cases it depends on the loop body.
So, even without `sfor`, the rules how to handle `foreach` are pretty
simple: If your method accepts an AnySeq (or AnyWhatever) to be as
generic as possible and uses `foreach` to loop over the elements (or a
for comprehension), you have to make sure the loop body can be
executed concurrently. If it isn't safe to execute concurrently (or
you don't know), you have to use `coll.seq` to make it sequential
before. This way, the one actually writing the loop over an
AnyCollection is the one to make sure everything works concurrently.
In short:
AnyWhatever.foreach is like the proposed 'pforeach', i.e. possibly
executed in parallel
AnyWhatever.seq.foreach is like the old foreach: always executed sequential
AnyWhatever.par.foreach is always executed parallel
Do we really need more?
Re: Re: REPL: null-values in parallel ranges - Seq.foreach acti
Hi Johannes,
On Fri, Apr 8, 2011 at 1:35 AM, Johannes Rudolph
wrote:
> On Fri, Apr 8, 2011 at 5:18 AM, Bill Venners wrote:
>> I think .par meaning parallel everything provides the clearest code.
>> That for expression I showed earlier would do exactly what it looks
>> like it does, run that for loop in parallel. There's value in that.
>> And "coll.par.sfor" look like it means "a parallel collection with a
>> sequential foreach" to me, which is exactly what it would mean.
>
> Here, I'm not so sure if "the looks" are relevant here. Because, I
> don't think that's the real use case. Let me say first, that I've
> pretty much agreed with all of your well-reasoned contributions to
> this discussion. Particularly, these points:
> * even if all others operations are parallel, `foreach` seems like a
> special case because it is likely that a concurrent foreach will
> produce all sorts of problems
> * however, there are still good use cases for both a sequential and
> parallel version of `foreach`
> * the question is now what `foreach` should do by default and how to
> access the other version
>
Agreed.
> That said, I don't see anyone using this particular construct:
> `coll.par.sfor`. Why should anyone use it? Only if one wants to use
> the same collection for both parallel operations and a sequential
> foreach, but then again you could just use coll.seq before doing a
> sequential foreach (if coll is not known to be sequential).
> At the point where you would make the collection parallel you will
> often not know how it is used later on. You know if you need a
> sequential `foreach` when you are writing the loop body, not
> necessarily when you are creating the collection. That seems like the
> basic flaw in your suggestion: `sfor` looks like it would depend on
> the collection if a parallel `foreach` is safe but in most (all?)
> cases it depends on the loop body.
>
Well in that design .par.sfor would give you a collection on which all
functional higher-order methods execute in parallel but foreach
executes sequentially. My guess is that the functional methods would
be sticky. If I call map on such a collection I'd get the same kind
back, all parallel but with a sequential foreach. So then many
functional data transformations later when you used a foreach it would
be sequential.
I'm guessing one of the styles that people will want to use with
parallel collections is parallel functional style methods with
sequential foreach, because it is easier and less error prone.
> So, even without `sfor`, the rules how to handle `foreach` are pretty
> simple: If your method accepts an AnySeq (or AnyWhatever) to be as
> generic as possible and uses `foreach` to loop over the elements (or a
> for comprehension), you have to make sure the loop body can be
> executed concurrently. If it isn't safe to execute concurrently (or
> you don't know), you have to use `coll.seq` to make it sequential
> before. This way, the one actually writing the loop over an
> AnyCollection is the one to make sure everything works concurrently.
>
> In short:
> AnyWhatever.foreach is like the proposed 'pforeach', i.e. possibly
> executed in parallel
> AnyWhatever.seq.foreach is like the old foreach: always executed sequential
> AnyWhatever.par.foreach is always executed parallel
>
> Do we really need more?
>
You can say .seq anytime you want a sequential for loop. But if
usually that's what you want it would be more convenient to get it by
default. Plus it is better to have the safer thing the default,
because that would make it harder for users to make mistakes.
But certainly if .par gives parallel everything in 2.9, including
foreach, you wouldn't need something like .sfor in 2.9. If it turned
out to be useful after people get real experience, it could be added
later. But once .par gives parallel everything it can't be changed,
and right there is the big decision for 2.9 I think.
I'm worried that if .par gives parallel everything it makes parallel
collections harder and more risky to use, and I think that trumps the
inconsistency. Scala collections are already inconsistent because
Scala itself leans towards the functional style, but allows the
imperative style. Every method in collections except foreach is a
functional style method, but foreach is imperative style. Foreach is
the odd man out and is non-uniform. Because of that difference, using
parallel foreach is an order of magnitude more difficult and
error-prone than using a parallel any other method on collections.
Parallel collections are power tools. They will make you more
productive at making things perform well, but can also give you
concurrency bugs when you make a mistake. Again this is because Scala
supports both functional and imperative styles. When you hand someone
a chain saw, you'd lock the chain so it is less likely that person
will accidentally pull the trigger and cut themselves when they grab
it. Let them unlock it. If you pass someone a gun, you'd make sure the
safety was on first. If you pass them a pocket knife, you'd close it
first. Let them open it when they are ready. I think when you pass
someone a parallel collection when they invoke .par, foreach should be
sequential. That means that .par would mean functional parallel -- all
the functional style methods run in parallel, foreach does not, unless
the user flips off the safety switch.
A pfor method could be something you call any collection (it could be
declared on AnyTraversable alongside seq and par) that gives you that
parallel foreach. So a use case would look like:
for (i <- (1 until 10000).pfor)
a(i) = a(i) * scale
That's pretty clear code. Downside is this code would be misleading:
for (i <- (1 until 10000).par)
a(i) = a(i) * scale
People might write that thinking the for loop would run in parallel.
People reading the code might think that. This would make people less
productive slightly, but I think not as less productive is they would
be if they accidentally get a parallel for loop when it looks like
sequential one.
So on AnyTraversable:
.seq would give you a fully sequential collection
.par would give you a functional parallel collection (foreach still sequential)
.pfor would give you a fully parallel collection (including foreach)
Alternatively, pfor could just give you the same thing with just a
parallel foreach. Then to get a fully parallel one you'd say .par.pfor
or .pfor.par, etc. But I don't see a use case for something with a
parallel foreach and sequential everything else, so I'd just have pfor
go fully parallel. pfor doesn't look like that so maybe there's a
better name.
Bill
----
Bill Venners
Artima, Inc.
http://www.artima.com
Re: Re: REPL: null-values in parallel ranges - Seq.foreach acti
On Fri, Apr 8, 2011 at 7:00 PM, Bill Venners wrote:
> I'm guessing one of the styles that people will want to use with
> parallel collections is parallel functional style methods with
> sequential foreach, because it is easier and less error prone.
I understand where you are getting at, actually that's what I thought
before myself. But I changed my mind and I think we now disagree
whether one has to worry about programmers who deliberately used
'.par' to make their collection parallel and then fail to understand
the consequences of using `foreach` on this collection. Especially
when the choice is to give up consistency, instead, I'd wager for
consistency. Shooting yourself in the foot with concurrency is easy
enough anyways.
The rules I posted before were for the other cases (which I'm more
worried about to get things right from the beginning): the cases where
the person calling `.par` is not the one writing the code using the
collection in the end. Here we should have some advice how to handle
those cases correctly without running into the parallel foreach trap.
BTW: Given we had `.pfor`: In a library method you want to loop over a
collection passed into the method. Now you write the loop body and
know it is thread-safe. Still you don't want to make the traversal
parallel in every case (as parallel calculations come with a price)
but you want the caller of the method to decide which that is. How
would you do that elegantly with `.pfor`?
Re: Re: REPL: null-values in parallel ranges - Seq.foreach acti
Hi Johannes,
On Sun, Apr 10, 2011 at 9:24 AM, Johannes Rudolph
wrote:
> On Fri, Apr 8, 2011 at 7:00 PM, Bill Venners wrote:
>> I'm guessing one of the styles that people will want to use with
>> parallel collections is parallel functional style methods with
>> sequential foreach, because it is easier and less error prone.
>
> I understand where you are getting at, actually that's what I thought
> before myself. But I changed my mind and I think we now disagree
> whether one has to worry about programmers who deliberately used
> '.par' to make their collection parallel and then fail to understand
> the consequences of using `foreach` on this collection. Especially
> when the choice is to give up consistency, instead, I'd wager for
> consistency. Shooting yourself in the foot with concurrency is easy
> enough anyways.
>
There may still be a design that gives you both consistency,
convenience, and safety that hasn't been suggested yet. That would be
ideal.
> The rules I posted before were for the other cases (which I'm more
> worried about to get things right from the beginning): the cases where
> the person calling `.par` is not the one writing the code using the
> collection in the end. Here we should have some advice how to handle
> those cases correctly without running into the parallel foreach trap.
>
> BTW: Given we had `.pfor`: In a library method you want to loop over a
> collection passed into the method. Now you write the loop body and
> know it is thread-safe. Still you don't want to make the traversal
> parallel in every case (as parallel calculations come with a price)
> but you want the caller of the method to decide which that is. How
> would you do that elegantly with `.pfor`?
>
AnySeq would mean a collection that could be either parallel or
sequential, and whose foreach could be parallel or sequential. So that
does let the caller decide. I'm not sure how often people will want to
use AnySeq, AnyMap, etc., on APIs versus letting the parallelism be an
implementation decision taken to improve performance. I'm certainly
not planning on changing the collection types up to AnyX in APIs any
time soon, though I will try and see what kind of performance benefits
I can get from them internally.
The trouble with pfor, though, is not just that it makes .par
inconsistent but that it also complicates the space of possibilities.
If you do take an AnySeq, you may want to make sure it has a
sequential for. So you'd want to be able to call .sfor on it. So
you'd need both sfor and pfor. I'm guessing we might want to use the
Parallel marker trait to recognize a parallel collection in a pattern
match. Well you might reasonably want to be able to recognize a
parallel collection with a sequential foreach also. So you'd need
ParallelForeach and SequentialForeach traits? There are too many
combinations.
Perhaps the best best design seen so far is simply the original one
EPFL had at some point prior to 2.9.0.RC1. Foreach is always
sequential, and parallel collections also have a pforeach that gives
you a parallel foreach. Now the space is just two kinds of collection,
sequential and parallel. Though the contract of foreach still would
allow for parallel collections, as it always has, foreach would always
be sequential on any collection coming from the Scala API.
To those worried about inconsistency of .par, I'd suggest you look at
your existing code and count how many uses of these methods are
already ready to be run in parallel. In my code I'd guess more than 99
out of 100 times my uses of functional-style methods like map,
flatMap, and filter, etc., are already ready to be run in parallel.
But on foreach, more than 99 times out of 100 times it would NOT be
ready to be run in parallel. I think that's not just because I haven't
been thinking about it. It's because that's how I want to use foreach.
99 times out of 100 I actually want a sequential foreach no matter
whether the rest of the collection is sequential or parallel. So to
force people to say .seq .seq .seq .seq .seq .seq .seq in their for
loops 99 times just so they don't have to say it the one time they
want a parallel for lop is inconvenient as well as error prone. Or put
another way, a .par that gives you a parallel foreach is inconsistent
with how people will most often want to *use* parallel collections.
I've seen three examples shown so far that justify the need for a
parallel foreach. There are others, but these are cherry picked. How
often do you think you'd need to use a parallel foreach to scale an
array?
for (i <- (0 until a.length).par)
a(i) = a(i) * scale
Probably in less than 1 out of a 100 for loops. And frankly, except on
large arrays for which you actually have evidence it matters to
performance, it would probably be premature optimization to do it that
way instead of:
a map (_ * scale)
How often would you feel the need to do an exists like this:
@volatile var flag = false
for (i <- (0 until a.length).par)
if (a(i) < 0) flag = true
Instead of:
a exists (_ < 0)
The earlier one is a performance optimization, which if you don't have
actual evidence of a performance problem I'd call a premature
optimization. It is more complicated to read than the latter, and more
error prone. Or how often would you feel the need to do this:
@volatile var sum = 0
for (i <- (0 until a.length).par)
sum += a(i)
Instead of:
a.sum
Again it is a performance optimization that is more verbose and error
prone than the functional alternative. And speaking of error prone,
are you sure you'd want to run that for loop in parallel?
This seems useful:
val cmap = new java.util.concurrent.ConcurrentHashMap[Int, Int]
for (i <- (0 until n).par)
cmap.put(i, i)
But looking at your existing code again, how often have you been
populating a concurrent collection like this? From time to time, but
likely less than 1 in a 100 for loops. Moreover, if you don't actually
have evidence there's a performance problem, I'd say a sequential
population would be just as good:
val cmap = new java.util.concurrent.ConcurrentHashMap[Int, Int]
for (i <- 0 until n)
cmap.put(i, i)
That said, one thing that bugged me about pforeach is I couldn't use a
parallel foreach via a for expression. But truth is, making pforeach a
tad uglier might be good from human factors perspective. It would
discourage its use. But on the other hand, if it is just being used
now and then when there is evidence of a performance problem, this
code isn't too bad looking:
(0 until a.length) pforeach { i =>
a(i) = a(i) * scale
}
Moreover, the pforeach design doesn't mean you absolutely can't get a
parallel foreach out of a for expression, just that you couldn't get
one out of the collections coming from the Scala API. The 2.8
collections design makes it quite straightforward to create your own
collection implementations, and the contract for foreach allows a
parellel implementation. If you're doing some image processing where
you need to do a bunch of scaling on large arrays, with just a few
lines of code you could create your own AnySeq on which foreach
forwards to pforeach. Then you could do this:
val a = MyParallelForeachAnySeq(originalA)
for (i <- 0 until a.length)
a(i) = a(i) * scale
So it would be possible. But requiring people to say pforeach or write
their own wrapper class to get a parallel for loop would be enough
resistence to push people towards using sequential foreach most of the
time, and I think that's important, because people make mistakes.
Making a parallel foreach the default for .par is kind of like putting
the pilot eject button right next to the intercom button on an
airplane cockpit. You probably want to put the eject button behind a
little door at the very least.
Everything has software in it nowadays, and over time more and more
apps will be running on multi-core CPUs, so I expect languages like
Scala, Clojure, Fortress, etc., are going to be used more and more.
The more Scala gets used, the greater chance that programming errors
will do serious harm. Scala code could show up in an airplane someday,
or an nuclear power plant, or a spacecraft, or a chain saw. Scala code
will be written by all kinds of programmers, including ones who don't
quite understand threads well. Even the programmers that do understand
threads well are human and will from time to time make mistakes. If
the convenience argument doesn't sway you, then hopefully the safety
one will.
Bill
>
> --
> Johannes
>
> -----------------------------------------------
> Johannes Rudolph
> http://virtual-void.net
>
Re: Re: REPL: null-values in parallel ranges - Seq.foreach acti
Right, so what you're saying is that we don't need parallel collections for this at all unless we do performance testing. Sequential is fine.
I don't think the earlier one is a performance optimization, not without a break anyway.
But again, your point is that parallel collections aren't necessary.
I'm sure I wouldn't. += isn't atomic, and even if it were, the accumulating is likely to be the bottleneck so you wouldn't speed it up anyway. a.par.sum could properly divide-and-conquer the problem. But why use parallel collections if you're not performance-bound?
Yet again, why use parallel collections _at all_ unless there's evidence of a performance problem?
Here's something I surely would rather not do slowly sequentially:
for (datum <- data.par) {
writeToFile( horriblySlowConversionToXML( datum ) )
}
Or maybe this:
for (imagename <- names.par) {
writeNewImage( processImage( readOldImage( imagename ) ) )
}
Right now, you can do these things with futures, but it's awkward. A simple for loop would be much easier. Thus, I disagree that sequential for loops are the overwhelmingly common use case. When you have sequential collections, of course your for loops look sequential. With parallel collections, different use cases arise.
Again, and I don't think I can stress this point enough, there is no reason to use a parallel collection in any form unless one has some sort of performance issue. How often are you going to have a performance problem when you use "map" but not when you use "foreach"? If you don't have a performance problem, why are you using a parallel collection with its concomitant issues with debugging, memory usage, profiling, reasoning about order of operations, and so on?
The principle, I think, should be: when you need parallel collections, know that you are using them, and then be careful to do it sensibly. Maybe a compiler plugin that that warned on using mutable structures in closures would help for those people who are still learning what one has to know when writing parallel code.
--Rex
Re: Re: REPL: null-values in parallel ranges - Seq.foreach acti
Hi Rex,
> Right now, you can do these things with futures, but it's awkward. A simple
> for loop would be much easier. Thus, I disagree that sequential for loops
> are the overwhelmingly common use case. When you have sequential
> collections, of course your for loops look sequential. With parallel
> collections, different use cases arise.
>
> Again, and I don't think I can stress this point enough, there is no reason
> to use a parallel collection in any form unless one has some sort of
> performance issue. How often are you going to have a performance problem
> when you use "map" but not when you use "foreach"? If you don't have a
> performance problem, why are you using a parallel collection with its
> concomitant issues with debugging, memory usage, profiling, reasoning about
> order of operations, and so on?
>
I'm wondering this may be a main source of the different opinions
here. If very pointed solutions to performance problems is the main
way parallel collections will be used, then I agree that .par giving a
parallel foreach is fine. I'm imagining people will want to use
parallel collections more generally because the future is more and
more cores, and Scala will make it pretty easy to use them. Not
without risk, but pretty low risk. I think if you're careful, you can
avoid side effects when using parallel collections, other than with
foreach.
For example, imagine you're writing a library routine that takes a
Seq[Int] and does a half a page of of code processing on it, then
returns a Seq[String] say. This will be used by different apps. You're
not sure if it is a bottleneck, and you're not sure how big the Seqs
you get will usually be, but sometimes it could be large. You might
think you'd be shirking your duties to not run at least large datasets
in parallel. Maybe you'd check the size at the top of your routine. If
the size is greater than some threshold, you call .par on what was
passed, otherwise not, and the rest of your method uses that.
val anySeq = if (seq.size > 100) seq.par else seq
At the end you call .seq on what you return.
The parallelness of the collection passed in would be sticky. Each
time you transform a sequential collection you'd get another
sequential one as a result. But each time you transform a parallel
collection you'd get a parallel one as a result. So in the parallel
case, the entire body of your routine would be doing things in
parallel, and in there you may want to sprinkle a few for loops. Why?
Because they make for clearer code (or easier to write code) than
recursion, less error prone than a while loop, etc. If you need to
write .seq on them to make sure they execute sequentially, it's a pain
and error prone.
So that's where I'm coming from. I'm expecting that in the age of
multicore the parallel collections (excluding foreach) will be used
more generally than just as pointed performance optimization. But that
a parallel foreach should be used only as a performance optimization
because it is so much more difficult to get right. Some may consider
it premature optimization to use parallel collections more generally,
but the more cores that show up I expect some programmers will use
them more generally like this. This may be how we need to program in
the multi-core world.
And by the way, one other thing I realized this morning that would be
nice about having a different method name like pforeach or pareach is
that you can more easily search for them when you want to inspect the
code for potential errors.
Bill
----
Bill Venners
Artima, Inc.
http://www.artima.com
Re: Re: REPL: null-values in parallel ranges - Seq.foreach acti
On Mon, Apr 11, 2011 at 6:49 PM, Bill Venners wrote:
> Some may consider it premature optimization to use parallel collections more generally,
> but the more cores that show up I expect some programmers will use
> them more generally like this. This may be how we need to program in
> the multi-core world.
Something to consider is that not all applications are "take-over the
machine" type. Just because there are many cores doesn't mean that
your app should use them all as they may be shared among many
applications (in some cases via a virtualisation layer).
Best,
Ismael
Re: Re: REPL: null-values in parallel ranges - Seq.foreach acti
For that matter, I'm more inclined to work as you say than to resort to parallel only as optimization. However, I do not agree with how you follow it up.
And this is where I disagree with. If the goal is to get used to write multicore code, then having the code depend on sequential behavior is a problem. That dependency is what is error prone, not the parallelism. To paraphrase Miyagi, if you do guess-so parallelism, you are screwed. Either do it, or don't.
Or, in other words, if you need code to be sequential, mark it so. If you mark code to be parallel, make sure it truly is.
--
Daniel C. Sobral
I travel to the future all the time.
Re: Re: REPL: null-values in parallel ranges - Seq.foreach acti
On Mon, Apr 11, 2011 at 2:01 PM, Daniel Sobral <dcsobral [at] gmail [dot] com> wrote:
Mr Miyagi said that? Was that in the Karate Kid I or II? I'm vaguely remembering something like that from the "wax-on, wax-off" sequence...but maybe you're paraphrasing?
For sure "Either do it, or don't" you are confusing with Yoda: "Do or do not! There is no try." You really need to get your pop cultural references straight!
Brian Maso
--
Best regards,
Brian Maso
(949) 395-8551
brian [at] blumenfeld-maso [dot] com
Re: Re: REPL: null-values in parallel ranges - Seq.foreach acti
http://www.imdb.com/title/tt0087538/quotes
Search for "road".
--
Daniel C. Sobral
I travel to the future all the time.
Re: Re: REPL: null-values in parallel ranges - Seq.foreach acti
Hi Daniel,
On Mon, Apr 11, 2011 at 3:25 PM, Daniel Sobral wrote:
> On Mon, Apr 11, 2011 at 18:27, Brian Maso wrote:
>>
>>
>> Mr Miyagi said that? Was that in the Karate Kid I or II? I'm vaguely
>> remembering something like that from the "wax-on, wax-off" sequence...but
>> maybe you're paraphrasing?
>>
>> For sure "Either do it, or don't" you are confusing with Yoda: "Do or do
>> not! There is no try." You really need to get your pop cultural references
>> straight!
>
> http://www.imdb.com/title/tt0087538/quotes
>
> Search for "road".
>
Ah, Daniel-san, you misunderstand what road mean. Left side of road is
imperative style. Program sequentially on left side. Safe. Right side
of road is functional style. Program parallel on right side. Safe.
Walk down middle, program parallel in functional style except for
foreach, which you program parallel in imperative style, sooner or
later get squished like grape.
A .par that provides a sequential foreach is not guess-so parallelism,
it is functional-style parallelism. The safe kind.
I think it is interesting that everyone else in this discussion seems
to prefer consistency in the sense that all methods on a collection
produced by .par are parallel. That is nice, but produces an
inconsistency in how people use the collections, because foreach is
already very inconsistent with the other methods. In the parallel code
you'd either need to call .seq or actually make the passed-in function
thread safe. In sequential code you don't need to worry about it. I
think that's a worse inconsistency than the one you're concerned
about.
With one .par you enter the world of parallel programming, but the
code thereafter looks exactly the same as sequential code. If .par
gives a parallel foreach, then every time I do a for without a yield
I'll feel a nudge to check to make sure I'm sure I'm sure this isn't
parallel code, because otherwise I need a .seq. That makes Scala
programming less smooth, and will produce errors. The two styles of
programming look identical after the .par.
I think what I'm hoping for as a Scala user is that I can pretty much
think and program the same way when using sequential and parallel
collections: program in a functional style when using the functional
style methods; program in an imperative style when using the lone
imperative style method, foreach.
There's no such thing as transparent parallelism. We do need to be
more careful when programming parallel that our passed-in functions
have no side effects. But it isn't that hard to pass in side-effect
free functions in sequential code for non-foreach methods. If you
remove the special case of foreach is different between sequential and
parallel, then I think programming in Scala post 2.9 will be a lot
more pleasant and less error prone.
Bill
----
Bill Venners
Artima, Inc.
http://www.artima.com
Re: Re: REPL: null-values in parallel ranges - Seq.foreach acti
Matthew
On 11 April 2011 22:01, Daniel Sobral <dcsobral [at] gmail [dot] com> wrote:
--
Matthew Pocockmailto: turingatemyhamster [at] gmail [dot] comgchat: turingatemyhamster [at] gmail [dot] com msn: matthew_pocock [at] yahoo [dot] co [dot] ukirc.freenode.net: drdozer(0191) 2566550
Re: Re: REPL: null-values in parallel ranges - Seq.foreach acti
Hi Johannes,
On Sun, Apr 10, 2011 at 9:24 AM, Johannes Rudolph
wrote:
> On Fri, Apr 8, 2011 at 7:00 PM, Bill Venners wrote:
>> I'm guessing one of the styles that people will want to use with
>> parallel collections is parallel functional style methods with
>> sequential foreach, because it is easier and less error prone.
>
> I understand where you are getting at, actually that's what I thought
> before myself. But I changed my mind and I think we now disagree
> whether one has to worry about programmers who deliberately used
> '.par' to make their collection parallel and then fail to understand
> the consequences of using `foreach` on this collection. Especially
> when the choice is to give up consistency, instead, I'd wager for
> consistency. Shooting yourself in the foot with concurrency is easy
> enough anyways.
>
There may still be a design that gives you both consistency,
convenience, and safety that hasn't been suggested yet. That would be
ideal.
> The rules I posted before were for the other cases (which I'm more
> worried about to get things right from the beginning): the cases where
> the person calling `.par` is not the one writing the code using the
> collection in the end. Here we should have some advice how to handle
> those cases correctly without running into the parallel foreach trap.
>
> BTW: Given we had `.pfor`: In a library method you want to loop over a
> collection passed into the method. Now you write the loop body and
> know it is thread-safe. Still you don't want to make the traversal
> parallel in every case (as parallel calculations come with a price)
> but you want the caller of the method to decide which that is. How
> would you do that elegantly with `.pfor`?
>
AnySeq would mean a collection that could be either parallel or
sequential, and whose foreach could be parallel or sequential. So that
does let the caller decide. I'm not sure how often people will want to
use AnySeq, AnyMap, etc., on APIs versus letting the parallelism be an
implementation decision taken to improve performance. I'm certainly
not planning on changing the collection types up to AnyX in APIs any
time soon, though I will try and see what kind of performance benefits
I can get from them internally.
The trouble with pfor, though, is not just that it makes .par
inconsistent but that it also complicates the space of possibilities.
If you do take an AnySeq, you may want to make sure it has a
sequential for. So you'd want to be able to call .sfor on it. So
you'd need both sfor and pfor. I'm guessing we might want to use the
Parallel marker trait to recognize a parallel collection in a pattern
match. Well you might reasonably want to be able to recognize a
parallel collection with a sequential foreach also. So you'd need
ParallelForeach and SequentialForeach traits? There are too many
combinations.
Perhaps the best best design seen so far is simply the original one
EPFL had at some point prior to 2.9.0.RC1. Foreach is always
sequential, and parallel collections also have a pforeach that gives
you a parallel foreach. Now the space is just two kinds of collection,
sequential and parallel. Though the contract of foreach still would
allow for parallel collections, as it always has, foreach would always
be sequential on any collection coming from the Scala API.
To those worried about inconsistency of .par, I'd suggest you look at
your existing code and count how many uses of these methods are
already ready to be run in parallel. In my code I'd guess more than 99
out of 100 times my uses of functional-style methods like map,
flatMap, and filter, etc., are already ready to be run in parallel.
But on foreach, more than 99 times out of 100 times it would NOT be
ready to be run in parallel. I think that's not just because I haven't
been thinking about it. It's because that's how I want to use foreach.
99 times out of 100 I actually want a sequential foreach no matter
whether the rest of the collection is sequential or parallel. So to
force people to say .seq .seq .seq .seq .seq .seq .seq in their for
loops 99 times just so they don't have to say it the one time they
want a parallel for lop is inconvenient as well as error prone. Or put
another way, a .par that gives you a parallel foreach is inconsistent
with how people will most often want to *use* parallel collections.
I've seen three examples shown so far that justify the need for a
parallel foreach. There are others, but these are cherry picked. How
often do you think you'd need to use a parallel foreach to scale an
array?
for (i <- (0 until a.length).par)
a(i) = a(i) * scale
Probably in less than 1 out of a 100 for loops. And frankly, except on
large arrays for which you actually have evidence it matters to
performance, it would probably be premature optimization to do it that
way instead of:
a map (_ * scale)
How often would you feel the need to do an exists like this:
@volatile var flag = false
for (i <- (0 until a.length).par)
if (a(i) < 0) flag = true
Instead of:
a exists (_ < 0)
The earlier one is a performance optimization, which if you don't have
actual evidence of a performance problem I'd call a premature
optimization. It is more complicated to read than the latter, and more
error prone. Or how often would you feel the need to do this:
@volatile var sum = 0
for (i <- (0 until a.length).par)
sum += a(i)
Instead of:
a.sum
Again it is a performance optimization that is more verbose and error
prone than the functional alternative. And speaking of error prone,
are you sure you'd want to run that for loop in parallel?
This seems useful:
val cmap = new java.util.concurrent.ConcurrentHashMap[Int, Int]
for (i <- (0 until n).par)
cmap.put(i, i)
But looking at your existing code again, how often have you been
populating a concurrent collection like this? From time to time, but
likely less than 1 in a 100 for loops. Moreover, if you don't actually
have evidence there's a performance problem, I'd say a sequential
population would be just as good:
val cmap = new java.util.concurrent.ConcurrentHashMap[Int, Int]
for (i <- 0 until n)
cmap.put(i, i)
That said, one thing that bugged me about pforeach is I couldn't use a
parallel foreach via a for expression. But truth is, making pforeach a
tad uglier might be good from human factors perspective. It would
discourage its use. But on the other hand, if it is just being used
now and then when there is evidence of a performance problem, this
code isn't too bad looking:
(0 until a.length) pforeach { i =>
a(i) = a(i) * scale
}
Moreover, the pforeach design doesn't mean you absolutely can't get a
parallel foreach out of a for expression, just that you couldn't get
one out of the collections coming from the Scala API. The 2.8
collections design makes it quite straightforward to create your own
collection implementations, and the contract for foreach allows a
parellel implementation. If you're doing some image processing where
you need to do a bunch of scaling on large arrays, with just a few
lines of code you could create your own AnySeq on which foreach
forwards to pforeach. Then you could do this:
val a = MyParallelForeachAnySeq(originalA)
for (i <- 0 until a.length)
a(i) = a(i) * scale
So it would be possible. But requiring people to say pforeach or write
their own wrapper class to get a parallel for loop would be enough
resistence to push people towards using sequential foreach most of the
time, and I think that's important, because people make mistakes.
Making a parallel foreach the default for .par is kind of like putting
the pilot eject button right next to the intercom button on an
airplane cockpit. You probably want to put the eject button behind a
little door at the very least.
Everything has software in it nowadays, and over time more and more
apps will be running on multi-core CPUs, so I expect languages like
Scala, Clojure, Fortress, etc., are going to be used more and more.
The more Scala gets used, the greater chance that programming errors
will do serious harm. Scala code could show up in an airplane someday,
or an nuclear power plant, or a spacecraft, or a chain saw. Scala code
will be written by all kinds of programmers, including ones who don't
quite understand threads well. Even the programmers that do understand
threads well are human and will from time to time make mistakes. If
the convenience argument doesn't sway you, then hopefully the safety
one will.
Bill
>
> --
> Johannes
>
> -----------------------------------------------
> Johannes Rudolph
> http://virtual-void.net
>
Re: Re: REPL: null-values in parallel ranges - Seq.foreach acti
>>>>> "Bill" == Bill Venners writes:
Bill> But certainly if .par gives parallel everything in 2.9, including
Bill> foreach, you wouldn't need something like .sfor in 2.9. If it
Bill> turned out to be useful after people get real experience, it
Bill> could be added later. But once .par gives parallel everything it
Bill> can't be changed, and right there is the big decision for 2.9 I
Bill> think.
".par gives parallel everything" is consistent and simple, and did I
mention consistent? I'm not seeing the case for doing it any other way.
As you point out:
Bill> Downside is this code would be misleading:
Bill> for (i <- (1 until 10000).par) a(i) = a(i) * scale
For this not to run in parallel would verge on bizarre, IMO.
Parallel is parallel.
Re: Re: REPL: null-values in parallel ranges - Seq.foreach acti
Hi Seth,
On Fri, Apr 8, 2011 at 11:59 AM, Seth Tisue wrote:
>>>>>> "Bill" == Bill Venners writes:
>
> Bill> But certainly if .par gives parallel everything in 2.9, including
> Bill> foreach, you wouldn't need something like .sfor in 2.9. If it
> Bill> turned out to be useful after people get real experience, it
> Bill> could be added later. But once .par gives parallel everything it
> Bill> can't be changed, and right there is the big decision for 2.9 I
> Bill> think.
>
> ".par gives parallel everything" is consistent and simple, and did I
> mention consistent? I'm not seeing the case for doing it any other way.
>
Simple to understand, yes, but to use?
> As you point out:
>
> Bill> Downside is this code would be misleading:
> Bill> for (i <- (1 until 10000).par) a(i) = a(i) * scale
>
> For this not to run in parallel would verge on bizarre, IMO.
>
> Parallel is parallel.
>
Yes, it sure looks like that should run in parallel. If .par means
fully parallel, then another method could mean "functional methods in
parallel and foreach sequential". fpar perhaps. 2.9 could provide just
.par and if experience shows people could use a second method it could
be added later.
Bill
> --
> Seth Tisue | Northwestern University | http://tisue.net
> lead developer, NetLogo: http://ccl.northwestern.edu/netlogo/
>
Re: Re: REPL: null-values in parallel ranges - Seq.foreach acti
Very well put, agreed without reservation.
-0xe1a
Re: Re: REPL: null-values in parallel ranges - Seq.foreach acti
Hi Alex,
One other way you could consider doing it is letting .par mean
parallel everything, including foreach, then offer an sfor method on
that to get a parallel collection with sequential foreach. So what I'd
have to say to get what (I think) I want is:
mySet.par.sfor
Bill
On Thu, Apr 7, 2011 at 1:22 AM, Aleksandar Prokopec
wrote:
> Hi Bill,
>
> These are good points about the uses of parallel collections.
>
> Well, I guess the semantics of the `foreach` in parallel collections is for
> the users to decide in the end.
>
> From a uniformity and principle of least surprise point of view, I think
> it's somewhat non-uniform.
> From a use case point of view, you're right, `for` is mostly used for
> side-effects.
> From my point of view, this is a trivial refactoring, and it can be changed
> anytime.
>
> Cheers,
> Alex
>
>
>> I'm not sure either. I have notions of how I'd like to use parallel
>> collections, but I don't know what I'll really think about how I want
>> to use them until I get some experience with them. Nor do I have any
>> idea how the community at large will end up wanting to use them.
>>
>> One way I think people will want to use them is as performance
>> optimizations applied when there's a performance problem. In this use
>> case I think a parallel foreach like you have now is probably fine,
>> because the .par's would be applied very locally and rarely. This is
>> probably how I'll start out, by dipping my toe in by seeing if I can
>> improve performance issues.
>>
>> But I have felt the promise of parallel collections is that they make
>> it much easier to write programs that take performance advantage of
>> multiple cores by using them more generally. Starting out I won't
>> change my Seq's to AnySeq's in type signatures. Until experience
>> proves that's the right thing to do I'd rather keep sequential types
>> the lingua franca of interfaces, and maybe we'll discover sequential
>> the right way to do signtatures. Don't know yet. But inside methods,
>> I'd like to try using parallel collections generally (a little bit at
>> a time, to see how it goes).
>>
>> The reason is that (foreach excepted) I already program in a
>> thread-safe style when I call higher order methods on collections. I
>> didn't do it to be thread-safe, but just because that's how I think
>> about those methods. They are data transformers, and I pass in a
>> function that takes one kind of data and transforms it to another,
>> with no side effects. Or I pass in a predicate that evaluates input
>> and returns true or false, with no side effects. Now I may have in the
>> past written some existing code passed to a map, or in a foreach with
>> a yield, that has a side effect, because I wasn't really trying to
>> make sure I had no side effects before. It's possible, but if it
>> happened at all it didn't happen much. And I think it wouldn't be that
>> difficult for me to be careful not to have side effects in my
>> passed-in functions once I start using parallel collections more
>> generally (other than to foreach).
>>
>> I do fairly regularly use for without yield (a for "loop"), because
>> sometimes I need to program side effects, and I want to avoid indexing
>> with while, or avoid recursion, etc. So I use a for loop, and in that
>> case, I'd prefer to not have to say .seq because my guess is that this
>> is the most common thing I'll want to do with a for loop. I.e., if
>> foreach is parallel like the other methods, the way it is in .RC1, I
>> can just do this when I want a sequential foreach on an otherwise
>> parallel collection:
>>
>> for (e<- pList.seq)
>> println(e)
>>
>> That's not so terrible, but if a sequential foreach that's what I want
>> to do most of the time, and a parallel foreach rarely, then I'd rather
>> have a way to get a type that has all higher order functions other
>> than foreach run in parallel, and foreach run sequentially, with a
>> method like .pfor I could call to get a fully parallel variant. And
>> the latter I would use more as a performance optimization:
>>
>> @volatile val flag = false
>>
>> // performance optimization
>> for (i<- (0 until N).par.pfor) {
>> if (check(i)) flag = true
>> }
>>
>> Otherwise, when I'm using parallel collections more generally,
>> whenever I do a for loop I'd need to either call .seq or use
>> java.util.concurrent, or volatile, etc. I think what I'm after as a
>> user is that I can program in the style I have been programming, just
>> being more careful to be sure I don't have side effects in functions
>> passed to non-foreach higher order functions, using parallel
>> collections instead of the regular ones. That also means that I don't
>> have to change how I think about for loops. I.e., I don't have to
>> worry about making those thread safe, unless I say something extra
>> like .pfor.
>>
>> Bill
>> ----
>> Bill Venners
>> Artima, Inc.
>> http://www.artima.com
>>
>
>
Re: Re: REPL: null-values in parallel ranges - Seq.foreach acti
On 07/04/11 16:17, Bill Venners wrote:
> Hi Alex,
>
> One other way you could consider doing it is letting .par mean
> parallel everything, including foreach, then offer an sfor method on
> that to get a parallel collection with sequential foreach. So what I'd
> have to say to get what (I think) I want is:
>
> mySet.par.sfor
>
> Bill
>
Hi Bill,
That's certainly something I believe can be done, though it may require
a bit of plumbing to always have a proper return type of `sfor`.
Alex
Re: Re: REPL: null-values in parallel ranges - Seq.foreach acti
Hi Alex,
On Thu, Apr 7, 2011 at 8:50 AM, Aleksandar Prokopec
wrote:
> On 07/04/11 16:17, Bill Venners wrote:
>>
>> Hi Alex,
>>
>> One other way you could consider doing it is letting .par mean
>> parallel everything, including foreach, then offer an sfor method on
>> that to get a parallel collection with sequential foreach. So what I'd
>> have to say to get what (I think) I want is:
>>
>> mySet.par.sfor
>>
>> Bill
>>
>
> Hi Bill,
>
> That's certainly something I believe can be done, though it may require a
> bit of plumbing to always have a proper return type of `sfor`.
>
I could live with this. I think from the clarity perspective this is
better than pfor, because both .par and .par.sfor would mean exactly
what they look like. Downside is to some extent .par would be like
giving someone a loaded gun with the safety off by default.
Bill
----
Bill Venners
Artima, Inc.
http://www.artima.com
Re: REPL: null-values in parallel ranges?
We're in the process of solving this.
We're considering the following and will begin to experiment with the
necessary refactorings.
This will be the new hierarchy (can't draw 3d in ascii yet, but I'm sure
you'll figure this out):
TraversableOnce -------------> TraversableOnceLike
^ ^
| |
Traversable ---> ParTraversable ---> TraversableLike
^ ^ ^
| | |
Iterable ---> ParIterable ---> IterableLike
^ ^ ^
| | |
Seq ---> ParSeq ---> SeqLike
^ ^
| |
List, ParArray, ---------\
Vector, v
ArrayBuffer, ParVector, ---> ParSeqImpl ---> ParIterableImpl
Stream, ^
Range, ParRange ---------/
...
(only Seqs shown above, but Sets and Maps follow the same pattern).
(ParIterableImpl also extends IterableLike, and ParSeqImpl extends
SeqLike, which is not shown in the figure)
And separately from that:
TraversableOnce -------------> TraversableOnceLike
^
|
Iterator
^
|
Splitter <----------- IterableSplitter
^ ^
| |
PreciseSplitter <----- SeqSplitter
Note that:
There are no more ParIterators nor Par*Iterators, only splitters.
There are no more Par*Like classes, only (^Par)*Like classes. The
parallel implementations come from Par*Impl classes which get mixed in
to concrete parallel collection classes (such as ParArray).
Explanation:
- On the top of it all sits the TraversableOnceLike class. It has
implementations for all the basic accessors (reduceLeft, find, count,
...) based on foreach.
- TraversableOnce has already been used in some of the contracts, so it
can no longer be a supertype of TraversableLike, which could be
parallel. Therefore, it extends TraversableOnceLike, and is only
extended by collections known to be sequential.
- Traversable, Iterable, and Seq are the same things they were before
- ParTraversable, ParIterable and ParSeq are collections with methods
possibly executing in parallel (note that this does not only mean
invoking closures out-of-order, but also invoking them at the same time)
- TraversableLike, IterableLike and SeqLike are also collections with
methods possibly executing in parallel, but they are parametrized on the
representation type of the collection.
- the parallelIterator method is now called splitter and can only be
found in the Par*Impl traits declared abstract and defined in concrete
parallel collections
- there are some other minor refactorings involved we are aware this
might involve, not shown above
If anyone sees a fault with this design, or any serious showstoppers,
please say so.
Cheers,
Alex
Re: REPL: null-values in parallel ranges?
Am Montag, 4. April 2011, 14:46:06 schrieb Aleksandar Prokopec:
> We're in the process of solving this.
...
> If anyone sees a fault with this design, or any serious showstoppers,
> please say so.
>
+1
looks good to me and is conservative (viewed from the "outside" - I can't
comment about scala internals).
Great!
Greetings
Bernd.
Re: Re: REPL: null-values in parallel ranges?
On Mon, Apr 4, 2011 at 1:46 PM, Aleksandar Prokopec
wrote:
> (only Seqs shown above, but Sets and Maps follow the same pattern).
I have a question. If I do seq.par.map(identity), will the resulting
Seq be == to the source Seq (i.e. the order of the Seq will be
maintained even if multiple operations are executed in parallel)?
Best,
Ismael
Re: Re: REPL: null-values in parallel ranges?
> I have a question. If I do seq.par.map(identity), will the resulting
> Seq be == to the source Seq (i.e. the order of the Seq will be
> maintained even if multiple operations are executed in parallel)?
>
Yes, unless there is a bug somewhere I've missed, it most definitely
should. Unit tests were written around this assumption.
In fact, here's a little background - == should pattern match on the
object type to figure out it's a Seq, then call `sameElements`.
Method `sameElements` does the following:
- if the argument is a sequential collection, it can't do anything too
smart, so it matches the elements sequentially
- if the argument is a parallel collection, it can split both
collections in orderly subsets and compare the subsets concurrently
Cheers,
Alex
Re: Re: REPL: null-values in parallel ranges?
On Mon, Apr 4, 2011 at 3:45 PM, Aleksandar Prokopec
wrote:
>> I have a question. If I do seq.par.map(identity), will the resulting
>> Seq be == to the source Seq (i.e. the order of the Seq will be
>> maintained even if multiple operations are executed in parallel)?
>
> Yes, unless there is a bug somewhere I've missed, it most definitely should.
You guessed correctly that the reason for my question was that I had
seen different behaviour. My suspicion started when I saw Paul's fix
for the REPL that mentioned foreach, but looking at the code I was
only saw map, take and mkString being used. While trying to create a
test case, I ended up getting NPEs and AIOOBEs while simply using map
and mkString on a Seq created from calling par on a Range. So the
wrong order I was seeing was actually just a symptom, but the problem
seems to be deeper (unless I am missing something). See the details in
the following ticket:
https://lampsvn.epfl.ch/trac/scala/ticket/4459
Best,
Ismael
Re: Re: REPL: null-values in parallel ranges?
That code won't compile once the hierarchy changes are committed, btw.
-- Daniel C. Sobral
I travel to the future all the time.
Re: Re: REPL: null-values in parallel ranges?
On Mon, Apr 4, 2011 at 4:37 PM, Ismael Juma <ismael [at] juma [dot] me [dot] uk> wrote:
Yes. Orderedness of sequence elements is orthogonal to parallel evaluation.
-- Martin