
starting to optimize the virtualizing pattern matcher

29 replies
adriaanm
Joined: 2010-02-08,
User offline. Last seen 31 weeks 4 days ago.
Dear All,
The time has come to start optimizing the virtualizing pattern matcher.
["What's that," you say? In a nutshell, it compiles pattern matches using a scheme similar to how we compile for-comprehensions: all you need is flatMap and orElse. It's also available in nightly builds of scalac -- unleash it using -Yvirtpatmat. I urgently need to write something up about this, but for now, might I suggest you browse its implementation -- it's actually fairly simple: https://github.com/adriaanm/scala-dev/blob/topic/virtpatmat/src/compiler/scala/tools/nsc/typechecker/PatMatVirtualiser.scala ]
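For readers who haven't seen the scheme, here is a rough sketch of the idea -- invented names, not the compiler's actual output -- showing how a match ends up as a chain of flatMaps closed off by orElse:

```scala
object VirtPatMatSketch {
  // `x match { case (y: Int) :: _ if y > 1 => y * 10; case _ => 0 }`
  // roughly becomes a flatMap chain per case, joined by orElse:
  def sketch(x: Any): Int =
    Some(x).flatMap {                  // outer pattern: is it a cons cell?
      case l: ::[_] => Some(l.head)
      case _        => None
    }.flatMap {                        // inner pattern + guard on the head
      case y: Int if y > 1 => Some(y * 10)
      case _               => None
    }.orElse(Some(0))                  // the default case
     .get

  def main(args: Array[String]): Unit = {
    println(sketch(List(2, 3))) // prints 20
    println(sketch("nope"))     // prints 0
  }
}
```

The Option allocations in this encoding are exactly the overhead discussed below.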

As it stands, a compiler compiled using -Yvirtpatmat, and compiling for -Yvirtpatmat, takes between 37% and 55% longer to compile the library and the compiler (... than a compiler that still compiles for -Yvirtpatmat, but whose own execution uses the code generated by the regular pattern matcher).
So, how do we get this down to an acceptable percentage?
My intuition was to see how we can avoid redundant condition-testing in matches like
// it's instructive to see what this is compiled to by the current pattern matcher -- use scalac -Xprint:explicitouter
def fastMatch() = {
  List(1, 3, 4, 7) match {
    case 1 :: 3 :: 4 :: 5 :: x => println("nope")
    case 1 :: 3 :: 4 :: 6 :: x => println("nope") // only need to test whether the previous case failed due to the fourth element not being 5, and whether the fourth element is 6
    case 1 :: 3 :: 4 :: 7 :: x => 1
  }
}
So, I raced this method against the code currently generated by virtpatmat for the same match (full code: https://gist.github.com/1400910):
import improving.benchmark.Benchmark.Race // https://github.com/adriaanm/scala-improving/blob/master/src/main/scala/Benchmark.scala (code by Paul)
// race fastMatch, which is compiled using the current pattern matcher,
// against virtMatch, the same match compiled using -Yvirtpatmat,
// by running each method 100K times
Race(fastMatch, virtMatch)(100000).converge()
// --> First body 9% faster.
Ok, a 10% slowdown does not explain the 50% we get in "real life", but it seems substantial enough to spend some time chiselling at it.
Hence, the following experiments using hand-optimized versions of virtMatch:
// does it help to optimize orElse by storing the result in a mutable variable?
Race(fastMatch, virtMatch_var_orElse)(100000).converge() // --> First body 8% faster.
// does inlining runOrElse make a difference?
Race(fastMatch, virtMatch_inline_runOrElse)(100000).converge() // --> First body 8% faster.
okay, not much gain here...

// how about sharing? (i.e., avoid re-computing already-tested conditions)
Race(fastMatch, virtMatch_sharing)(100000).converge() // --> First body 3% faster.
That looks more promising! So, I think I'm going to implement this optimization first. (hand-coded version in the gist: https://gist.github.com/1400910)
The idea is to eliminate common subexpressions found in the conditions of the if-statements that make up the match. I'd like to avoid duplicating code/generating jumps (which is what the current pattern matcher does). Hence the idea of sharing the results of the tests rather than the code doing the testing.
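To make the intended transformation concrete, here is a hand-written sketch of sharing applied to the fastMatch example: the tests common to all three cases are hoisted into vals so later cases reuse the results. All names (t1, prefixOk, ...) are invented, and the sketch returns 0 where the original printed "nope", so it is illustrative only:

```scala
object SharingSketch {
  def sharedMatch(xs: List[Int]): Int = {
    // tests common to all three cases, computed once:
    val t1 = xs.nonEmpty && xs.head == 1
    val rest1 = if (t1) xs.tail else Nil
    val t2 = t1 && rest1.nonEmpty && rest1.head == 3
    val rest2 = if (t2) rest1.tail else Nil
    val t3 = t2 && rest2.nonEmpty && rest2.head == 4
    val rest3 = if (t3) rest2.tail else Nil
    val prefixOk = t3 && rest3.nonEmpty

    // only the fourth element distinguishes the cases:
    if (prefixOk && rest3.head == 5) { println("nope"); 0 }
    else if (prefixOk && rest3.head == 6) { println("nope"); 0 }
    else if (prefixOk && rest3.head == 7) 1
    else throw new MatchError(xs)
  }

  def main(args: Array[String]): Unit =
    println(sharedMatch(List(1, 3, 4, 7))) // prints 1
}
```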
The reason I'm posting is of course to hear your thoughts! Suggestions for concrete benchmarks, a better benchmarking strategy, or optimizations most welcome!

cheers,
adriaan
Matthew Pocock 3
Joined: 2010-07-30,
User offline. Last seen 42 years 45 weeks ago.
Re: starting to optimize the virtualizing pattern matcher
Hi,
I may be being naive, but do your optimizations require that the unapply methods are side-effect-less, pure functions? If so, it sounds like your optimizations are those generally applicable to these cases - avoid executing code multiple times where you can remember the value from executing it once, and avoid executing code where the result isn't used. However, you don't have to look back far in the -users list to find examples of people doing stateful things in case statements. Without an effects system I'm not sure how far you can go with this while guaranteeing that you're not borking the semantics up in some exciting way. As usual with rewrite-based optimization, an effects system would be useful, even if it is somewhat conservative in what it declares effect-free.
Matthew
On 28 November 2011 16:18, Adriaan Moors <adriaan [dot] moors [at] epfl [dot] ch> wrote:
--
Dr Matthew PocockIntegrative Bioinformatics Group, School of Computing Science, Newcastle Universitymailto: turingatemyhamster [at] gmail [dot] com gchat: turingatemyhamster [at] gmail [dot] commsn: matthew_pocock [at] yahoo [dot] co [dot] uk irc.freenode.net: drdozerskype: matthew.pococktel: (0191) 2566550mob: +447535664143
adriaanm
Joined: 2010-02-08,
User offline. Last seen 31 weeks 4 days ago.
Re: starting to optimize the virtualizing pattern matcher
Hi,
Indeed, unapplies can't be optimized this way -- for the reasons you state, and besides, even if they are pure, they might not respect the other laws these optimizations rely on. I should have clarified this is only going to optimize matching on case classes (where we know the unapplies do the right thing), and possibly type tests/constant patterns. Luckily, this is still a pretty common scenario.
cheers,
adriaan
ps: We're also devising another unapply scheme that might bring some of the benefit (mostly speed) of case class matching to user-defined unapplies, but that's a 2.0 feature ;-)

On Mon, Nov 28, 2011 at 5:43 PM, Matthew Pocock <turingatemyhamster [at] gmail [dot] com> wrote:

Joshua.Suereth
Joined: 2008-09-02,
User offline. Last seen 32 weeks 5 days ago.
Re: starting to optimize the virtualizing pattern matcher
Now I see why you have the @ThisUnapplyIsPrettyOk annotation in your branch....

On Mon, Nov 28, 2011 at 11:51 AM, Adriaan Moors <adriaan [dot] moors [at] epfl [dot] ch> wrote:

Tiark Rompf
Joined: 2009-02-18,
User offline. Last seen 42 years 45 weeks ago.
Re: starting to optimize the virtualizing pattern matcher
I'd suggest trying to get rid of the Option objects altogether (either encode using null or use a separate bool flag).
- Tiark
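One way to read the suggestion -- a minimal sketch with invented names, not what the compiler generates -- is to replace each Option[T] result with a boolean "matched" flag plus a plain slot for the extracted value, so no Some is allocated per test:

```scala
object NoOptionSketch {
  // mutable match state instead of Option: a flag plus an unwrapped value
  final class MatchState[T] {
    var matched = false
    var value: T = _

    def succeed(v: T): Boolean = { value = v; matched = true; true }
    def fail(): Boolean = { matched = false; false }
  }

  // with Option:    def headOpt(xs: List[Int]) = xs.headOption
  // without Option: write the result into the state, return the flag
  def headInto(xs: List[Int], st: MatchState[Int]): Boolean =
    if (xs.nonEmpty) st.succeed(xs.head) else st.fail()

  def main(args: Array[String]): Unit = {
    val st = new MatchState[Int]
    val r = if (headInto(List(1, 3, 4, 7), st)) st.value * 10 else -1
    println(r) // prints 10
  }
}
```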
adriaanm
Joined: 2010-02-08,
User offline. Last seen 31 weeks 4 days ago.
Re: starting to optimize the virtualizing pattern matcher
On Mon, Nov 28, 2011 at 6:19 PM, Tiark Rompf <tiark [dot] rompf [at] epfl [dot] ch> wrote:
I'd suggest trying to get rid of the Option objects altogether (either encode using null or use a separate bool flag).
good call! I figured "how bad could that last option be," but you were right -- simply eliminating all traces of option brings the overhead down to 4%...  I updated the gist with a version without option -- if you see a more efficient way, please let me know!
cheers,
adriaan
extempore
Joined: 2008-12-17,
User offline. Last seen 35 weeks 3 days ago.
Re: starting to optimize the virtualizing pattern matcher

On Mon, Nov 28, 2011 at 8:18 AM, Adriaan Moors wrote:
> // how about sharing? (i.e., to avoid re-computing already tested
> conditions)
> Race(fastMatch, virtMatch_sharing)(100000).converge() // --> First body 3%
> faster.
> That looks more promising!

Indeed, and type tests aren't that cheap, especially when compared to
a stack boolean.

if (x1.isInstanceOf[collection.immutable.::[Int]]) {
if (x1.isInstanceOf[collection.immutable.::[Int]]) {
if (x1.isInstanceOf[collection.immutable.::[Int]]) {
if (x1.isInstanceOf[collection.immutable.::[Int]]) {
val o47: Option[AnyVal] = if (x1.isInstanceOf[collection.immutable.::[Int]])
if (x1.isInstanceOf[collection.immutable.::[Int]])
if (x1.isInstanceOf[collection.immutable.::[Int]])
if (x1.isInstanceOf[collection.immutable.::[Int]]) {
if (x1.isInstanceOf[collection.immutable.::[Int]]) {
if (x1.isInstanceOf[collection.immutable.::[Int]]) {
val o47: Option[AnyVal] = if (x1.isInstanceOf[collection.immutable.::[Int]])
if (x1.isInstanceOf[collection.immutable.::[Int]])
if (x1.isInstanceOf[collection.immutable.::[Int]])

At least record the result of every type test which must be executed.
It'll get a little fuzzier when you get into non-mandatory code paths;
but you could allocate almost-free "lazy val booleans" in the form of
tri-state ints.
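A sketch of what those "lazy val booleans as tri-state ints" could look like (invented names, illustrative only): 0 means not yet computed, 1 means the test was false, 2 means it was true, so the instanceof runs at most once no matter how many cases consult it:

```scala
object TriStateSketch {
  // returns (selected case, number of instanceof tests actually run)
  def run(x1: Any): (String, Int) = {
    var isCons: Int = 0  // tri-state cache for x1.isInstanceOf[::[_]]
    var testsRun = 0
    def x1IsCons: Boolean = {
      if (isCons == 0) {
        testsRun += 1
        isCons = if (x1.isInstanceOf[::[_]]) 2 else 1
      }
      isCons == 2
    }

    // each case "re-checks" the same condition, but only reads the int:
    val r =
      if (x1IsCons && x1.asInstanceOf[::[Int]].head == 0) "case1"
      else if (x1IsCons && x1.asInstanceOf[::[Int]].head == 1) "case2"
      else "default"
    (r, testsRun)
  }

  def main(args: Array[String]): Unit =
    println(run(List(1, 3, 4, 7))) // prints (case2,1): right case, one type test
}
```

Unlike a real lazy val, this needs no bitmap field or synchronization, since the cache lives on the stack of a single method.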

adriaanm
Joined: 2010-02-08,
User offline. Last seen 31 weeks 4 days ago.
Re: starting to optimize the virtualizing pattern matcher


On Mon, Nov 28, 2011 at 8:02 PM, Paul Phillips <paulp [at] improving [dot] org> wrote:
you could allocate almost-free "lazy val booleans" in the form of tri-state ints.
ah yes, that sounds pretty easy to do -- I'll implement this next and report on the ensuing race.
Tiark Rompf
Joined: 2009-02-18,
User offline. Last seen 42 years 45 weeks ago.
Re: starting to optimize the virtualizing pattern matcher
also, to make sure you are measuring just the pattern matching, it seems like a good idea to move the List(1, 3, 4, 7) object construction out of the benchmark loop and store it as a field of the enclosing object.
It'll be interesting to see how the tri-state boolean version turns out.
- Tiark
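Tiark's point about the benchmark loop can be sketched as follows (names invented): build the scrutinee once as a field, so the timed loop measures only the match, not list allocation:

```scala
object BenchHygiene {
  val subject = List(1, 3, 4, 7) // built once, outside the timed loop

  def matchOnly(): Int = subject match {
    case 1 :: 3 :: 4 :: 7 :: _ => 1
    case _                     => 0
  }

  def main(args: Array[String]): Unit = {
    var acc = 0
    var i = 0
    while (i < 100000) { acc += matchOnly(); i += 1 } // times only the match
    println(acc) // prints 100000
  }
}
```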

Jason Zaugg
Joined: 2009-05-18,
User offline. Last seen 38 weeks 5 days ago.
Re: starting to optimize the virtualizing pattern matcher
On Mon, Nov 28, 2011 at 8:11 PM, Adriaan Moors <adriaan [dot] moors [at] epfl [dot] ch> wrote:
On Mon, Nov 28, 2011 at 8:02 PM, Paul Phillips <paulp [at] improving [dot] org> wrote:
you could allocate almost-free "lazy val booleans" in the form of tri-state ints.
ah yes, that sounds pretty easy to do -- I'll implement this next and report on the ensuing race

That would be a nice general optimization for any lazy val that doesn't escape the scope of a method.
-jason
adriaanm
Joined: 2010-02-08,
User offline. Last seen 31 weeks 4 days ago.
Re: starting to optimize the virtualizing pattern matcher
the results for bootstrapping the compiler with the no-option virtual pattern matcher are in:
[strap.lib.timer: 1:55.194 sec]  115s ==> +16%  (compiler compiled using -Yvirtpatmat compiling the library for -Yvirtpatmat)
[quick.lib.timer: 1:39.240 sec]   99s           (compiler with normal pattern matching compiling the library for -Yvirtpatmat)
[strap.comp.timer: 1:39.152 sec]  99s ==> +26%  (compiler compiled using -Yvirtpatmat compiling the compiler for -Yvirtpatmat)
[quick.comp.timer: 1:18.715 sec]  78s           (compiler with normal pattern matching compiling the compiler for -Yvirtpatmat)
you can try this yourself using `ant strap-vpm` on https://github.com/adriaanm/scala-dev/tree/topic/virtpatmat-bootstrap
I'd be happy to have these results confirmed/refuted independently. If these are accurate, I think we're getting there faster than expected. No-option for the win!
next step: tri-state ints for lazy val boolean of isInstanceOf's

Simon Ochsenreither
Joined: 2011-07-17,
User offline. Last seen 42 years 45 weeks ago.
Re: starting to optimize the virtualizing pattern matcher
Are these numbers also representative for “usual code”? A performance regression by 15-25% looks scary ...

Thanks!
adriaanm
Joined: 2010-02-08,
User offline. Last seen 31 weeks 4 days ago.
Re: starting to optimize the virtualizing pattern matcher


On Wed, Nov 30, 2011 at 3:36 PM, Simon Ochsenreither <simon [dot] ochsenreither [at] googlemail [dot] com> wrote:
Are these numbers also representative for “usual code”? A performance regression by 15-25% looks scary ...
as Martin said, this is work in progress (I started working on optimizing it about a week ago)
that said, I'm happy to provide support to anyone trying it out on "usual code", just add -Yvirtpatmat to the scalac args

cheers
adriaan
dcsobral
Joined: 2009-04-23,
User offline. Last seen 38 weeks 5 days ago.
Re: starting to optimize the virtualizing pattern matcher

On Wed, Nov 30, 2011 at 12:48, Adriaan Moors wrote:
>
>
> On Wed, Nov 30, 2011 at 3:36 PM, Simon Ochsenreither
> wrote:
>>
>> Are these numbers also representative for “usual code”? A performance
>> regression by 15-25% looks scary ...
>
> as Martin said, this is work in progress (I started working on optimizing it
> about a week ago)
> that said, I'm happy to provide support to anyone trying it out on "usual
> code", just add -Yvirtpatmat to the scalac args

Say, do these numbers include the suggested optimization of converting
Option tests into null tests?

adriaanm
Joined: 2010-02-08,
User offline. Last seen 31 weeks 4 days ago.
Re: starting to optimize the virtualizing pattern matcher
Say, do these numbers include the suggested optimization of converting Option tests into null tests?
yep, except that they're not null tests, but boolean tests (to distinguish failure and success) + unwrapped option (a successful null must not be confused with failure)
(the gist of the idea is in: https://github.com/adriaanm/scala-dev/commit/06906e245ad523da6af8185a8e0b582ccffc5747, but has been refined subsequently)
I've also implemented tri-state ints for lazy bools for caching repeated isInstanceOf tests:
[strap.lib.timer: 1:59.914 sec]     120s ==> +19%
[quick.lib.timer: 1:41.313 sec]     101s
[strap.comp.timer: 1:49.832 sec]    110s ==> +35%
[quick.comp.timer: 1:21.208 sec]    81s
so that's a bit of a bummer... the overheads went up since last time i measured (before this optimization)
I may have implemented the optimization poorly (see my first stab for yourself at https://github.com/adriaanm/scala-dev/commit/d5c8b82b6ec7897ac343ac6351eb0dd708fac65b), maybe I got the timing results wrong, or something else is going on -- will look into it... on the other hand, it could be that the JVM is just good at instanceof tests :-)
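To make the tri-state idea concrete, here is a hand-written sketch (all names hypothetical, not the compiler's actual output): an Int encodes "not yet tested" (0), "tested, false" (1), and "tested, true" (2), so a repeated isInstanceOf on the same scrutinee only hits the JVM once.

```scala
class Base
class Sub extends Base { val field = 42 }

def matchTwice(x: Base): Int = {
  var isSub: Int = 0 // tri-state cache: 0 = unknown, 1 = false, 2 = true
  def checkSub: Boolean = {
    if (isSub == 0) isSub = if (x.isInstanceOf[Sub]) 2 else 1
    isSub == 2
  }
  // two "cases" testing the same type; the second reuses the cached result
  if (checkSub && x.asInstanceOf[Sub].field == 0) 1
  else if (checkSub) 2
  else 3
}
```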
adriaan
odersky
Joined: 2008-07-29,
User offline. Last seen 45 weeks 6 days ago.
Re: starting to optimize the virtualizing pattern matcher


On Wed, Nov 30, 2011 at 11:10 PM, Adriaan Moors <adriaan [dot] moors [at] epfl [dot] ch> wrote:
Say, do these numbers include the suggested optimization of converting Option tests into null tests?
yep, except that they're not null tests, but boolean tests (to distinguish failure and success) + unwrapped option (a successful null must not be confused with failure)
(the gist of the idea is in: https://github.com/adriaanm/scala-dev/commit/06906e245ad523da6af8185a8e0b582ccffc5747, but has been refined subsequently)
I've also implemented tri-state ints for lazy bools for caching repeated isInstanceOf tests:
[strap.lib.timer: 1:59.914 sec]     120s ==> +19%
[quick.lib.timer: 1:41.313 sec]     101s
[strap.comp.timer: 1:49.832 sec]    110s ==> +35%
[quick.comp.timer: 1:21.208 sec]    81s
so that's a bit of a bummer... the overheads went up since last time i measured (before this optimization)
I may have implemented the optimization poorly (see my first stab for yourself at https://github.com/adriaanm/scala-dev/commit/d5c8b82b6ec7897ac343ac6351eb0dd708fac65b),  maybe I got the timing results wrong, or something else is going on -- will look into it...on the other hand, it could be that the JVM is just good at instanceof tests :-)
In my experience, the JVM is indeed surprisingly good at instanceof tests. It seems like they feed right into the branch prediction logic of the processor. We tried at some point to "optimize" pattern matching by switching on a tag. It was a loss. A sequence of instanceof tests was faster.
Cheers
 -- Martin 
extempore
Joined: 2008-12-17,
User offline. Last seen 35 weeks 3 days ago.
Re: starting to optimize the virtualizing pattern matcher

On Wed, Nov 30, 2011 at 2:16 PM, martin odersky wrote:
> In my experience, the JVM is indeed surprisingly good at instanceof tests.
> It seems like they feed right into the branch prediction logic of the
> processor. We tried at some point to "optimize" pattern matching by
> switching on a tag. It was a loss. A sequence of instanceof tests was
> faster.

They're fast, but no matter how fast they are they can't be faster
than reading a boolean on the stack without introducing some
additional factor. The use of $tag was trying to avoid the instanceof
tests entirely, and it brought overhead. Here the first instanceof
test is being performed, only successive identical ones are being
avoided, and the additional overhead is in the neighborhood of zero.

My guess is that the jvm is already caching and reusing the result of
the first instanceof test, and their cache is closer to the metal than
ours is.

odersky
Joined: 2008-07-29,
User offline. Last seen 45 weeks 6 days ago.
Re: starting to optimize the virtualizing pattern matcher


On Thu, Dec 1, 2011 at 12:26 AM, Paul Phillips <paulp [at] improving [dot] org> wrote:
On Wed, Nov 30, 2011 at 2:16 PM, martin odersky <martin [dot] odersky [at] epfl [dot] ch> wrote:
> In my experience, the JVM is indeed surprisingly good at instanceof tests.
> It seems like they feed right into the branch prediction logic of the
> processor. We tried at some point to "optimize" pattern matching by
> switching on a tag. It was a loss. A sequence of instanceof tests was
> faster.

They're fast, but no matter how fast they are they can't be faster
than reading  a boolean on the stack without introducing some
additional factor.  The use of $tag was trying to avoid the instanceof
tests entirely, and it brought overhead.  Here the first instanceof
test is being performed, only successive identical ones are being
avoided, and the additional overhead is in the neighborhood of zero.

My guess is that the jvm is already caching and reusing the result of
the first instanceof test, and their cache is closer to the metal than
ours is.

Yes, I think that's plausible. -- Martin

vlad.ureche
Joined: 2011-03-17,
User offline. Last seen 1 year 26 weeks ago.
Re: starting to optimize the virtualizing pattern matcher


On Thu, Dec 1, 2011 at 12:26 AM, Paul Phillips <paulp [at] improving [dot] org> wrote:
On Wed, Nov 30, 2011 at 2:16 PM, martin odersky <martin [dot] odersky [at] epfl [dot] ch> wrote:
> In my experience, the JVM is indeed surprisingly good at instanceof tests.
> It seems like they feed right into the branch prediction logic of the
> processor. We tried at some point to "optimize" pattern matching by
> switching on a tag. It was a loss. A sequence of instanceof tests was
> faster.

They're fast, but no matter how fast they are they can't be faster
than reading  a boolean on the stack without introducing some
additional factor.  The use of $tag was trying to avoid the instanceof
tests entirely, and it brought overhead.  Here the first instanceof
test is being performed, only successive identical ones are being
avoided, and the additional overhead is in the neighborhood of zero.

My guess is that the jvm is already caching and reusing the result of
the first instanceof test, and their cache is closer to the metal than
ours is.


Could you test it with a list of lists? Maybe the JVM inlines the list into the pattern matcher, thus completely eliminating the tests.


Vlad
Ismael Juma 2
Joined: 2011-01-22,
User offline. Last seen 42 years 45 weeks ago.
Re: starting to optimize the virtualizing pattern matcher
On Wed, Nov 30, 2011 at 10:10 PM, Adriaan Moors <adriaan [dot] moors [at] epfl [dot] ch> wrote:
on the other hand, it could be that the JVM is just good at instanceof tests :-)

It is pretty good:
"Profiling is performed at the bytecode level in the interpreter and tier one compiler. The compiler leans heavily on profile data to motivate optimistic optimizations." "There is also a type profile for every checkcast, instanceof, and aastore. (Helps with generics.) "
http://wikis.sun.com/display/HotSpotInternals/PerformanceTechniques
Best,
Ismael
Tiark Rompf
Joined: 2009-02-18,
User offline. Last seen 42 years 45 weeks ago.
Re: starting to optimize the virtualizing pattern matcher
I've also implemented tri-state ints for lazy bools for caching repeated isInstanceOf tests:
[strap.lib.timer: 1:59.914 sec]     120s ==> +19%
[quick.lib.timer: 1:41.313 sec]     101s
[strap.comp.timer: 1:49.832 sec]    110s ==> +35%
[quick.comp.timer: 1:21.208 sec]    81s
so that's a bit of a bummer... the overheads went up since last time i measured (before this optimization)

Is the result of asInstanceOf also cached? if not the jvm will still need to do instance checks as part of asInstanceOf.
In the end i don't think we'll reach the performance baseline without generating jumps, so maybe now is a good opportunity to think about how we can get there in a clean, high-level way.
- Tiark


I may have implemented the optimization poorly (see my first stab for yourself at https://github.com/adriaanm/scala-dev/commit/d5c8b82b6ec7897ac343ac6351eb0dd708fac65b),  maybe I got the timing results wrong, or something else is going on -- will look into it...on the other hand, it could be that the JVM is just good at instanceof tests :-)
adriaan

adriaanm
Joined: 2010-02-08,
User offline. Last seen 31 weeks 4 days ago.
Re: starting to optimize the virtualizing pattern matcher
Could you test it with a list of lists? Maybe the JVM inlines the list into the pattern matcher, thus completely eliminating the tests.
I assume you're referring to my test case? The timing results I quoted come from compiling the compiler and the library with a compiler that was compiled to use this implementation of pattern matching, so I think they represent real-life performance reasonably well.
cheers
adriaan
adriaanm
Joined: 2010-02-08,
User offline. Last seen 31 weeks 4 days ago.
Re: starting to optimize the virtualizing pattern matcher
On Thu, Dec 1, 2011 at 9:43 AM, Ismael Juma <ismael [at] juma [dot] me [dot] uk> wrote:
On Wed, Nov 30, 2011 at 10:10 PM, Adriaan Moors <adriaan [dot] moors [at] epfl [dot] ch> wrote:
on the other hand, it could be that the JVM is just good at instanceof tests :-)

It is pretty good:
"Profiling is performed at the bytecode level in the interpreter and tier one compiler. The compiler leans heavily on profile data to motivate optimistic optimizations." "There is also a type profile for every checkcast, instanceof, and aastore. (Helps with generics.) "
http://wikis.sun.com/display/HotSpotInternals/PerformanceTechniques
interesting! thanks for the pointer!
adriaan
adriaanm
Joined: 2010-02-08,
User offline. Last seen 31 weeks 4 days ago.
Re: starting to optimize the virtualizing pattern matcher
Is the result of asInstanceOf also cached? if not the jvm will still need to do instance checks as part of asInstanceOf.
no, I only cached the isInstanceOf test, so yes, this could very well explain why the "optimization" didn't help. I'm not sure caching both would help, though, since other evidence suggests the JVM is simply good at this kind of caching/profiling.
we could try caching the asInstanceOf anyway, but that brings me to your next point

In the end i don't think we'll reach the performance baseline without generating jumps,
so maybe now is a good opportunity to think about how we can get there in a clean, high-level way.
The isInstanceOf caching experiment was just that, and, after sleeping on it for a night, I think I'll punt on it for now.
I agree we need a more sophisticated analysis that generates clean code -- preferably without jumps, though.
My current plan is to implement the sharing of the outcomes of common tests, which I described earlier. It's different from the isInstanceOf caching in that it is not lazy and that it bundles checks.
Consider a match with cases that share the same outer pattern. It compiles to something that looks like this:
// <vals> abbreviates the local variables that bind the information that follows from the tested condition (e.g., if(<typeTest>) val x1 = <casted>)
// case_1
if (C1) { <vals_1>
  if (C2) { <vals_2>
    if (C3A) { matchRes = ...; keepGoing = false }
  }
}
// case_2
if (keepGoing)
  if (C1) { <vals_1'>
    if (C2) { <vals_2'>
      if (C3B) { matchRes = ...; keepGoing = false }
    }
  }

I propose to optimize this as follows:
// case_1
var C2satisfied = false
<hoisted vals_1 + vals_2, mutable -- only those referenced in later cases>
if (C1) { <vals_1_case_1>
  if (C2) { <vals_2_case_1>
    C2satisfied = true
    <assign values to shared variables: since the tests if(C1) {<vals> ... if(C2) { <vals> ... will be dropped below, we need to preserve any variables bound by them>
    if (C3A) { matchRes = ...; keepGoing = false }
  }
}
// case_2: replace old variable references by references to the shared variables
if (keepGoing && C2satisfied)
  if (C3B) { matchRes = ...; keepGoing = false }

The advantage of this strategy, I think, is that it maintains the code-structure of the encoding (the size of the generated tree is linear in the size of the original matcher's tree)
if it turns out that it is inefficient to encode an automaton (that does not loop or backtrack [*]) by if-testing mutable flags, I think it would be an evolutionary step to use jumps

cheers
adriaan

[*] I still need to read up a bit on automata theory, please correct me if I'm wrong on this
Jason Zaugg
Joined: 2009-05-18,
User offline. Last seen 38 weeks 5 days ago.
Re: starting to optimize the virtualizing pattern matcher
On Thu, Dec 1, 2011 at 11:27 AM, Adriaan Moors <adriaan [dot] moors [at] epfl [dot] ch> wrote:
Is the result of asInstanceOf also cached? if not the jvm will still need to do instance checks as part of asInstanceOf.
no, I only cached the isInstanceOf test, so yes, this could very well explain why the "optimization" didn't help. I'm not sure caching both would help, though, since other evidence suggests the JVM is simply good at this kind of caching/profiling.
we could try caching the asInstanceOf anyway, but that brings me to your next point

In the end i don't think we'll reach the performance baseline without generating jumps,
so maybe now is a good opportunity to think about how we can get there in a clean, high-level way.
The isInstanceOf caching experiment was just that, and, after sleeping on it for a night, I think I'll punt on it for now.
I agree we need a more sophisticated analysis that generates clean code -- preferably without jumps, though.

Comment from the bleachers: it might be interesting to consider ways to split the generated code into smaller methods (as is done for guards at the moment.) AFAIK, large methods are less likely to receive attention from hotspot.
Earlier this year, Alexander converted [1] the pattern matches that implement type conformance in the IDEA plugin to polymorphic dispatch and saw a big performance improvement. My hunch was that breaking it into smaller methods might have been the telling factor, rather than the difference between dispatching based on instanceOf vs invokeVirtual.
-jason
[1] http://git.jetbrains.org/?p=idea/scala-plugin.git;a=commitdiff;h=ad174f0b2466e118f90ad44f305a537bfb22a6e2;hp=e41a010664c62ccacaf567e80a27c29b7538de5e
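The splitting Jason suggests could look roughly like this hand-written sketch (all names hypothetical): each case of a big match becomes its own small Option-returning method, so HotSpot sees several small hot methods instead of one body that may exceed its compilation/inlining size thresholds.

```scala
// Hypothetical sketch: one small method per case, chained by a dispatcher.
def case1(x: Any): Option[String] = x match {
  case i: Int if i > 0 => Some("positive int")
  case _               => None
}

def case2(x: Any): Option[String] = x match {
  case s: String => Some("string: " + s)
  case _         => None
}

// The dispatcher recombines the per-case methods, preserving case order.
def dispatch(x: Any): String =
  case1(x) orElse case2(x) getOrElse "no match"
```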
adriaanm
Joined: 2010-02-08,
User offline. Last seen 31 weeks 4 days ago.
Re: starting to optimize the virtualizing pattern matcher
aha! excellent point! now to get the size heuristics right... :-)
On Thu, Dec 1, 2011 at 11:44 AM, Jason Zaugg <jzaugg [at] gmail [dot] com> wrote:
On Thu, Dec 1, 2011 at 11:27 AM, Adriaan Moors <adriaan [dot] moors [at] epfl [dot] ch> wrote:
Is the result of asInstanceOf also cached? if not the jvm will still need to do instance checks as part of asInstanceOf.
no, I only cached the isInstanceOf test, so yes, this could very well explain why the "optimization" didn't help. I'm not sure caching both would help, though, since other evidence suggests the JVM is simply good at this kind of caching/profiling.
we could try caching the asInstanceOf anyway, but that brings me to your next point

In the end i don't think we'll reach the performance baseline without generating jumps,
so maybe now is a good opportunity to think about how we can get there in a clean, high-level way.
The isInstanceOf caching experiment was just that, and, after sleeping on it for a night, I think I'll punt on it for now.
I agree we need a more sophisticated analysis that generates clean code -- preferably without jumps, though.

Comment from the bleachers: it might be interesting to consider ways to split the generated code into smaller methods (as is done for guards at the moment.) AFAIK, large methods are less likely to receive attention from hotspot.
Earlier this year, Alexander converted [1] the pattern matches that implement type conformance in the IDEA plugin to polymorphic dispatch and saw a big performance improvement. My hunch was that breaking it into smaller methods might have been the telling factor, rather than the difference between dispatching based on instanceOf vs invokeVirtual.
-jason
[1] http://git.jetbrains.org/?p=idea/scala-plugin.git;a=commitdiff;h=ad174f0b2466e118f90ad44f305a537bfb22a6e2;hp=e41a010664c62ccacaf567e80a27c29b7538de5e

adriaanm
Joined: 2010-02-08,
User offline. Last seen 31 weeks 4 days ago.
Re: starting to optimize the virtualizing pattern matcher
following a suggestion by Tiark, I stumbled upon a significant source of overhead in -Yvirtpatmat... alternatives!
who would've thought, right? it turns out the compiler uses quite a few of those
a better implementation brought virtpatmat's overhead (as defined earlier in this thread) down to about 5%, when also using -Ydead-code, which seems essential (otherwise the overhead is closer to 11% when compiling the compiler)
details are here: https://github.com/adriaanm/scala/blob/aa1ede90eaa234fb1099ed589a7ccc9344a66a6e/bootstrapping
so, my plan: 1) optimize simple matches that consist of only constants, type tests and alternatives, probably even emitting a switch when there are only constants and alternatives, as is done by the current pattern matcher...
2) add some simple dead-code elimination to the virtpatmat code gen, so we don't have to run with -Ydead-code, which seems a bit expensive (have a look at the jump from 2min -> 3min when I switched to the current implementation of alternatives)
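For illustration, the kind of match that step (1) targets is one consisting only of constant patterns and alternatives -- a shape the backend can compile to a JVM tableswitch/lookupswitch rather than a chain of equality tests (the example itself is hypothetical):

```scala
// Only constants and alternatives: eligible for compilation to a switch
// instruction instead of an if-else chain of equality tests.
def kind(c: Char): String = c match {
  case ' ' | '\t' | '\n' => "whitespace"
  case '0' | '1' | '2'   => "digit"
  case _                 => "other"
}
```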

cheers
adriaan
ps: http://twitter.com/search?q=#virtpatmat
odersky
Joined: 2008-07-29,
User offline. Last seen 45 weeks 6 days ago.
Re: starting to optimize the virtualizing pattern matcher

On Wed, Dec 14, 2011 at 6:10 PM, Adriaan Moors wrote:
> following a suggestion by Tiark, I stumbled upon a significant source of
> overhead in -Yvirtpatmat... alternatives!
>
> who would've thought, right? it turns out the compiler uses quite a few of
> those
>
> a better implementation brought virtpatmat's overhead (as defined earlier in
> this thread) down to about 5%, when also using -Ydead-code, which seems
> essential (otherwise the overhead is closer to 11% when compiling the
> compiler)
>
> details are
> here: https://github.com/adriaanm/scala/blob/aa1ede90eaa234fb1099ed589a7ccc9344a66a6e/bootstrapping
>
> so, my plan:
> 1) optimize simple matches that consist of only constants, type tests and
> alternatives, probably even emitting a switch when there are only constants
> and alternatives, as is done by the current pattern matcher...
>
So, it was the Scanner that crawled, right? Yes, emitting switches is
essential. You can simply leave the match expression in the trees for
them; the backend will take care of the rest.

Cheers

adriaanm
Joined: 2010-02-08,
User offline. Last seen 31 weeks 4 days ago.
Re: starting to optimize the virtualizing pattern matcher
So, it was the Scanner that crawled, right?
indeed: parser, tailcalls, liftcode and cleanup jump out as having above average per-phase overhead
detailed  per-phase timings of -Yvirtpatmat: https://docs.google.com/spreadsheet/ccc?key=0Avduyz9rVkQ7dFpoZG1yLU9SZGYta2FGYUNlUVBTb3c
 
Yes, emitting switches is essential. You can simply leave the match expression in the trees for them; the backend will take care of the rest.
yep, doing that today, should be easy enough
cheers
adriaan
adriaanm
Joined: 2010-02-08,
User offline. Last seen 31 weeks 4 days ago.
Re: starting to optimize the virtualizing pattern matcher
just a quick status update: with -Ydead-code -Ycloselim, the virtualizing pattern matcher now has the same running times as the regular one under the same switches, or at least within margin of error:
[strap.lib.timer: 1:53.893 sec][quick.lib.timer: 1:53.159 sec]
[strap.comp.timer: 1:45.424 sec][quick.comp.timer: 1:43.849 sec]
below are the strap.comp per-phase timings in ms, with the percentages comparing to a compiler (quick.comp) running with the same switches, so performing the same work, but using normal pattern matching:
         icode 5979     115.49%
     tailcalls 781      113.02%
       uncurry 3231     108.61%
       flatten 548      108.09%
     refchecks 2858     108.01%
    lambdalift 1709     106.95%
  constructors 1112     106.21%
       erasure 9373     103.40%
         mixin 2267     103.37%
        parser 1906     102.03%
         typer 35624    100.69%
 explicitouter 2435      99.75%
superaccessors 1035      99.52%
       pickler 808       99.51%
           jvm 10733     98.46%
      closelim 5200      97.58%
           dce 13183     97.28%
    specialize 2754      95.03%
note the rows with < 100% :-)
details in the doc i linked to earlier 
next week I'll investigate the patterns in tailcalls and icode
cheers
adriaan

ps: I realise these statistics need to be taken with a grain of salt, i'll be performing more benchmarking on other projects, as well as micro-benchmarks, where statistics is easier to apply -- help with this would be greatly appreciated!

Copyright © 2012 École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland