Reflection on Array apply/update/length (was Re: brainstorming on ways to construct a ParallelArray)

18 replies
Jorge Ortiz
Joined: 2008-12-16,
User offline. Last seen 29 weeks 4 days ago.
(Cross-posted to scala-internals).

Disclaimer:
Nothing in this post applies to arrays whose type (Array[Int], Array[Long], etc) is known at compile time. This only applies to generic arrays, Array[T], where T is not known statically.

Background:
(Skip if you know how 2.8 Arrays work.)

Arrays in Scala 2.8 are fully native Java arrays. However, Scala's type system unifies primitive (Int, Long, Double, etc) and reference types, so Scala's fully generic Array[T] could, at run time, mean anything from Java's int[] to long[] to double[] to Object[]. As a result, when Scala code attempts to do "new Array[T]", the compiler has no idea what kind of Java array needs to be instantiated. Therefore, the compiler requires a ClassManifest[T] to be available to instantiate a generic Array[T]. The ClassManifest[T] can, at run time, instantiate the correct Java array.
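A minimal sketch of why the implicit is needed (written against scala.reflect.ClassTag, the modern successor of ClassManifest, with a hypothetical makeArray helper):

```scala
import scala.reflect.ClassTag

// The implicit ClassTag (ClassManifest in 2.8) carries the runtime element
// class, so `new Array[T]` can allocate an int[], long[], or Object[] as needed.
def makeArray[T: ClassTag](n: Int): Array[T] = new Array[T](n)

val ints = makeArray[Int](3)       // backed by a Java int[]
val strs = makeArray[String](3)    // backed by a Java String[]
```

Without the context bound, `new Array[T](n)` simply does not compile, because the compiler cannot pick the array kind.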

So far so good.

A similar problem presents itself when trying to call apply, update, or length on a fully generic Array[T]. Because different bytecodes must be emitted depending on the run-time type of the array, it's impossible to know at compile time what to emit. To work around this, the compiler currently emits code that calls java.lang.reflect.Array.get/set/getLength.
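In other words, the emitted fallback is roughly equivalent to this hand-written reflective sketch (hypothetical helper names, not the compiler's literal output):

```scala
// What the generic fallback amounts to: every element access goes through
// java.lang.reflect.Array, boxing primitives and type-checking at run time.
def reflGet(xs: AnyRef, i: Int): AnyRef = java.lang.reflect.Array.get(xs, i)
def reflSet(xs: AnyRef, i: Int, v: AnyRef): Unit = java.lang.reflect.Array.set(xs, i, v)
def reflLen(xs: AnyRef): Int = java.lang.reflect.Array.getLength(xs)

val a = Array(1, 2, 3)        // an int[] at run time
val boxed = reflGet(a, 0)     // returns a boxed java.lang.Integer
```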

Problem:

Reflective get/set on Arrays is remarkably slower than native get/set. On JDK6u18, when both operations are boxed (unavoidable without @specialized), a reflective get is 2763% slower than a native get. A reflective set is 527% slower than a native set. (On JDK5u19 the difference is even more striking: get is 3562% slower with reflection, set is 1259% slower with reflection.) This is in addition to the penalty for boxing. If the primitive operations are unboxed, the numbers are even worse. (Full numbers and code at the end.)

This is a huge performance trap for anyone who expects arrays to be fast.

Worse, reflection is used even if a ClassManifest[T] is available. If ClassManifest[T] had arrayApply(Array[T], Int)T and arrayUpdate(Array[T], Int, T)Unit methods (overridden in Manifest.Int, Manifest.Long, etc to use native operations, as is already done for newArray(Int)Array[T]) that were used instead of reflection, performance could improve dramatically. If my benchmarks are correct, a ClassManifest get would be 69% slower than a native get, and a ClassManifest set would actually be 3% -faster- than a native set. (On JDK5, the numbers are 116% slower and 25% slower, respectively.) If ClassManifest[T] were @specialized, the numbers look even better (see below).

Suggestion:

The following methods should be added to ClassManifest[T]:
  def arrayApply(array: Array[T], index: Int): T
  def arrayUpdate(array: Array[T], index: Int, value: T): Unit
  def arrayLength(array: Array[T]): Int
These methods should be overridden in Manifest.Int/Long/etc to use operations on native array types.

When apply, update, or length are called on an Array[T], and a ClassManifest[T] is implicitly in scope, the methods on ClassManifest[T] should be used to operate on the array.
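To make the shape of the proposal concrete, here is a sketch of the dispatch as a standalone typeclass (hypothetical names; in the actual proposal the methods would live on ClassManifest[T]):

```scala
// Hypothetical sketch of the proposed methods as a standalone typeclass.
trait ManifestArrayOps[T] {
  def arrayApply(array: Array[T], index: Int): T
  def arrayUpdate(array: Array[T], index: Int, value: T): Unit
  def arrayLength(array: Array[T]): Int
}

// The Int instance operates on a statically-known Array[Int], so the calls
// compile to native int[] loads/stores: no boxing, no reflection.
implicit object IntArrayOps extends ManifestArrayOps[Int] {
  def arrayApply(array: Array[Int], index: Int): Int = array(index)
  def arrayUpdate(array: Array[Int], index: Int, value: Int): Unit = array(index) = value
  def arrayLength(array: Array[Int]): Int = array.length
}

// Generic code dispatches through the instance: one virtual call per access
// instead of a trip through java.lang.reflect.Array.
def reverseInPlace[T](xs: Array[T])(implicit ops: ManifestArrayOps[T]): Unit = {
  var i = 0
  var j = ops.arrayLength(xs) - 1
  while (i < j) {
    val tmp = ops.arrayApply(xs, i)
    ops.arrayUpdate(xs, i, ops.arrayApply(xs, j))
    ops.arrayUpdate(xs, j, tmp)
    i += 1; j -= 1
  }
}
```

The benchmark's "virtual" rows below measure exactly this kind of dispatch.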

When apply, update, or length are called on an Array[T], and a ClassManifest[T] is NOT implicitly in scope, behavior is up for debate. Given the poor performance of reflection, I'm of the opinion that this should be a compile error (just like "new Array[T]" is a compile error if no ClassManifest[T] is around). If users are willing to take the performance penalty, they can use reflection themselves. If, however, it's deemed convenient to let the compiler emit code that uses reflection, at the very least it should issue a warning.

Once this is done, the possibility of @specializing ClassManifest[T] should be explored.

I'll work on the patch for the necessary changes to ClassManifest, and benchmark the code to make sure the performance gains actually materialize, but the changes to the compiler should probably be done by someone more familiar with how the current Array code works.

--j

Code for the benchmark is here:
http://pastebin.com/m341fdcfb

You'll want a simple shell script to run the code:
http://pastebin.com/m67b15141
(warning: that script is very specific to my environment, you'll have to tweak it a bit)

Numbers for JDK6u18:
(Courtesy Ismael Juma, who keeps up to date with JVM releases)
Primitive unboxed apply took: 28.27378ms
Primitive boxed apply took: 41.364976ms
Virtual unboxed apply took: 43.188233ms
Virtual boxed apply took: 69.799574ms
Reflective unboxed apply took: 417.834377ms
Reflective boxed apply took: 1184.316914ms
Primitive unboxed update took: 30.53889ms
Primitive boxed update took: 97.971217ms
Virtual unboxed update took: 26.524369ms
Virtual boxed update took: 95.250312ms
Reflective unboxed update took: 429.913418ms
Reflective boxed update took: 614.674586ms
Primitive length took: 19.568499ms
Virtual length took: 18.571432ms
Reflective length took: 21.16248ms

Numbers for JDK5u19:
Primitive unboxed apply took: 21.984ms
Primitive boxed apply took: 51.213ms
Virtual unboxed apply took: 102.671ms
Virtual boxed apply took: 110.819ms
Reflective unboxed apply took: 940.272ms
Reflective boxed apply took: 1875.599ms
Primitive unboxed update took: 33.751ms
Primitive boxed update took: 104.144ms
Virtual unboxed update took: 31.84ms
Virtual boxed update took: 130.122ms
Reflective unboxed update took: 1177.238ms
Reflective boxed update took: 1415.715ms
Primitive length took: 23.133ms
Virtual length took: 22.298ms
Reflective length took: 863.342ms

Key:
"Primitive unboxed" is the native operative
"Primitive boxed" is the native operation + boxing costs
"Virtual unboxed" is the ClassManifest[@specialized T] approach
"Virtual boxed" is the ClassManifest[T] approach
"Reflective boxed" is the current behavior in 2.8

"Apply" is get
"Update" is set

---------- Forwarded message ----------
From: Tiark Rompf <tiark [dot] rompf [at] epfl [dot] ch>
Date: Thu, Dec 17, 2009 at 8:28 AM
Subject: Re: brainstorming on ways to construct a ParallelArray
To: Scala Parallel Collections <scala-parallel-collections [at] googlegroups [dot] com>


The situation is complex ;-) Imagine you have a function

def fifth[T](xs: Array[T]) = xs(5)

Then what bytecode should be generated so that both of the following
work:

fifth(new Array[Int](10))
fifth(new Array[AnyRef](10))

On the bytecode level, the argument types are int[] and Object[], so
element lookup is either an iaload or an aaload instruction.

Without generating different code for both cases (which is what
@specialize does) reflection is basically the only option.

In principle one could require that ClassManifests be available
everywhere generic arrays are used and then have methods arrayGet and
arraySet in ClassManifest that would contain the appropriate bytecode
instructions. But still they would need to return Object, which means
one allocation per slot access for primitive arrays, and that would
probably have about the same cost as a reflective access.

- Tiark

On Dec 16, 8:56 pm, Jorge Ortiz <jorge [dot] or [dot] [dot] [dot] [at] gmail [dot] com> wrote:
> Tiark,
>
> If the ClassManifest is available, shouldn't the compiler be able to make
> non-reflective calls for array access/update?
>
> --j
>
>
>
> On Wed, Dec 16, 2009 at 11:22 AM, Tiark Rompf <tiark [dot] ro [dot] [dot] [dot] [at] epfl [dot] ch> wrote:
> > In principle yes, but GenericArray[T] adds an unnecessary indirection.
> > I would recommend using Array[AnyRef] for the moment and later seeing
> > how @specialize does.
>
> > Note that only generic arrays have a performance problem, and
> > unfortunately there's not much that can be done about it.
>
> > - Tiark
>
> > On Dec 16, 6:45 pm, Jorge Ortiz <jorge [dot] or [dot] [dot] [dot] [at] gmail [dot] com> wrote:
> > > Yes, GenericArray[T] is the right way to go here, until the reflection
> > > issues with Array[T] are addressed. A GenericArray[T] is backed by an
> > > Array[AnyRef], which means that a GenericArray of primitives will have
> > its
> > > members boxed, but that's also the case with Array[Any], so there should
> > be
> > > no difference there. The advantage of using GenericArray[T] is that casts
> > in
> > > ParallelArray can be avoided.
>
> > > In the long-term, the performance issues with Array will need to be
> > > addressed.
>
> > > --j
>
> > > On Wed, Dec 16, 2009 at 9:37 AM, Daniel Sobral <dcsob [dot] [dot] [dot] [at] gmail [dot] com>
> > wrote:
> > > > Does this problem exist for GenericArray? Martin had a hell of a time
> > to
> > > > get Array in its Scala 2.8 state in order, and had to introduce a host
> > of
> > > > concepts to make it work. And the jury is still out on that. I'm just
> > saying
> > > > leaving Java arrays aside for now and concentrating on GenericArray
> > might be
> > > > a good idea.
>
> > > > On Wed, Dec 16, 2009 at 2:34 PM, Aleksandar Prokopec <
> > > > aleksandar [dot] proko [dot] [dot] [dot] [at] gmail [dot] com> wrote:
>
> > > >>> OK. Any reason not to stick a protected/private on there right now?
>
> > > >> No reason whatsoever as far as I can see :)
>
> > > >> Are you suggesting we add in the manifests to get the api right, then
> > add
> > > >>> casts in the code to fix the performance?
>
> > > >> Well, whatever we decide to do, the important thing is that if the
> > > >> collection uses an array
>
> > > >> I was thinking of something like, say, this (modulo the actual name of
> > the
> > > >> method in question :) ):
>
> > > >> object ParallelArray {
> > > >>  def fromArray[T](array: Array[T]) = new
> > > >> ParallelArray[T](array.asInstanceOf[Array[Any]], 0, array.size)
> > > >> }
>
> > > >> To the best of my knowledge, as soon as you declare a field typed as a
> > > >> generic array using manifests, it gets treated as an object and
> > reflection
> > > >> is used to access its elements.
>
> > > >> The only problem for the approach above is that it doesn't quite work
> > for
> > > >> arrays of primitive types. I'll think about this a little bit
>
> > > > --
> > > > Daniel C. Sobral
>
> > > > I travel to the future all the time.

ijuma
Joined: 2008-08-20,
User offline. Last seen 22 weeks 2 days ago.
Re: Reflection on Array apply/update/length (was Re: brainstorm

Hey Jorge,

Thanks for taking the time to investigate this. A minor note is that
different machines were used to test JDK6u18 and JDK5u19 (I only tested
the former), so the relative numbers for each set of results is
meaningful, but one should not compare the absolute numbers for JDK5
versus JDK6.

Best,
Ismael

Tiark Rompf
Joined: 2009-02-18,
User offline. Last seen 42 years 45 weeks ago.
Re: Reflection on Array apply/update/length (was Re: brainstormi
Hi Jorge,
I knew I would be taken up on the "... would probably have about the same cost as a reflective access" ;-) So you're suggesting we should follow the approach I hinted at in my previous mail? I'm glad you're investigating this further. I'm not a big fan of silently using reflection either, but so far I was not convinced using manifests exclusively is necessarily better. But you are right that we should consider it seriously.
Regarding your benchmarks, I noticed that you are using int arrays exclusively. That makes me suspect the actual speedup will not be quite as dramatic. Since all your methods are small and all call sites are monomorphic, it looks like HotSpot is inlining more or less everything, which would explain why the times for 'virtual' are so close to 'primitive'. I guess you will see a 2x - 5x slowdown for 'virtual' if you modify your setup to take that into account.
Cheers,
- Tiark


Jorge Ortiz
Joined: 2008-12-16,
User offline. Last seen 29 weeks 4 days ago.
Re: Reflection on Array apply/update/length (was Re: brainstormi
Hi Tiark,

I've gone ahead and patched Manifest and ClassManifest with the required methods.

I've also written another benchmark. This one implements an insertion sort algorithm for Array[T], and runs it on reverse-sorted lists (to guarantee worst-case N^2 performance). The same algorithm is implemented using normal array apply/update (which uses reflection) and again using the methods I added to ClassManifest. I use it to sort arrays of Ints, Doubles, and Strings.

You're correct that megamorphic call sites hurt performance, but the penalty is not quite as bad as you feared.

The Manifest approach still vastly outperforms reflection.

--j

Patch for Manifest and ClassManifest:
http://pastebin.com/f160aa3a9

Benchmark Code:
http://pastebin.com/f6daef1ec

Shell script to run benchmark:
(Warning: again, very environment specific)
http://pastebin.com/m47c55b2a

Numbers:

JDK5u19

Monomorphic Manifest Int took: 77.193ms
Monomorphic Manifest Double took: 98.564ms
Monomorphic Manifest String took: 70.486ms
Monomorphic Reflection Int took: 1158.119ms
Monomorphic Reflection Double took: 1201.433ms
Monomorphic Reflection String took: 959.578ms
Megamorphic Manifest Int took: 95.246ms
Megamorphic Manifest Double took: 127.137ms
Megamorphic Manifest String took: 129.797ms
Megamorphic Reflection Int took: 1176.816ms
Megamorphic Reflection Double took: 1199.387ms
Megamorphic Reflection String took: 994.1ms

JDK6u13

Monomorphic Manifest Int took: 77.942ms
Monomorphic Manifest Double took: 95.598ms
Monomorphic Manifest String took: 59.91ms
Monomorphic Reflection Int took: 1191.609ms
Monomorphic Reflection Double took: 1193.703ms
Monomorphic Reflection String took: 789.917ms
Megamorphic Manifest Int took: 97.985ms
Megamorphic Manifest Double took: 115.175ms
Megamorphic Manifest String took: 109.111ms
Megamorphic Reflection Int took: 1236.348ms
Megamorphic Reflection Double took: 1231.131ms
Megamorphic Reflection String took: 811.598ms

On Sat, Dec 19, 2009 at 6:17 PM, Tiark Rompf <tiark [dot] rompf [at] epfl [dot] ch> wrote:
Hi Jorge,
I knew I would be taken up on the "... would probably have about the same cost as a reflective access" ;-) So you're suggesting we should follow the approach I hinted at in my previous mail? I'm glad you're investigating this further. I'm not a big fan of silently using reflection either, but so far I was not convinced using manifests exclusively is necessarily better. But you are right that we should consider it seriously.
Regarding your benchmarks, I noticed that you are using int arrays exclusively. That makes me suspect the actual speedup will not be quite as dramatic. Since all your methods are small and all call sites are monomorphic, it looks like HotSpot is inlining more or less everything, which would explain the times for 'virtual' are so close to 'primitive'. I guess you will see a 2x - 5x slowdown for 'virtual' if you modify your setup to take that into account.
Cheers,- Tiark


On 19.12.2009, at 15:44, Jorge Ortiz wrote:
(Cross-posted to scala-internals).

Disclaimer:
Nothing in this post applies to arrays whose type (Array[Int], Array[Long], etc) is known at compile time. This only applies to generic arrays, Array[T], where T is not known statically.

Background:
(Skip if you know how 2.8 Arrays work.)

Arrays in Scala 2.8 are fully native Java arrays. However, Scala's type system unified primitive (Int, Long, Double, etc) and reference types, so Scala's fully generic Array[T] could, at run time, mean anything from Java's int[] to long[] to double[] to Object[]. As a result, when Scala code attempts to do "new Array[T]", the compiler has no idea what kind of Java array needs to be instantiated. Therefore, the compiler requires a ClassManifest[T] be available to instantiate a generic Array[T]. The ClassManifest[T] can, at run time, instantiate the correct Java array.

So far so good.

A similar problem presents itself when trying to call apply, update, or length on a fully generic Array[T]. Because different bytecodes must be emitted depending on the run-time type of the array, it's impossible to know at compile-time what to emit. To work around this, the compiler currently emits code to call java.lang.reflect.Arrays.get/set/length.

Problem:

Reflective get/set on Arrays in remarkably slower than native get/set. On JDK6u18, when both operations are boxed (unavoidable without @specialized), a reflective get is 2763% slower than a native get. A reflective set is 527% slower than a native set. (On JDK5u19 the difference is even more striking: get is 3562% slower with reflection, set is 1259% slower with reflection.) This is in addition to the penalty for boxing. If the primitive operations are unboxed, the numbers are even worse. (Full numbers and code at the end.)

This is a huge performance trap for anyone who expects arrays to be fast.

Worse, reflection is used even if a ClassManifest[T] is available. If ClassManifest[T] had arrayApply(Array[T], Int)T and arrayUpdate(Array[T], Int, T)Unit methods (overridden in Manifest.Int, Manifest.Long, etc to use native operations, as is already done for newArray(Int)Array[T]) that were used instead of reflection, performance could improve dramatically. If my benchmarks are correct, a ClassManifest get would be 69% slower than a native get, and a ClassManifest set would actually be 3% -faster- than a native set. (On JDK5, the numbers look 116% slower and 25% slower, respectively.) If ClassManifest[T] were @specialized, the numbers look even better (see below).

Suggestion:

The following methods should be added to ClassManifest[T]:
  def arrayApply(array: Array[T], index: Int): T
  def arrayUpdate(array: Array[T], index: Int, value: T): Unit
  def arrayLength(array: Array[T]): Int
These methods should be overridden in Manifest.Int/Long/etc to use operations on native array types.

When apply, update, or length are called on an Array[T], and a ClassManifest[T] is implicitly in scope, the methods on ClassManifest[T] should be used to operate on the array.

When apply, update, or length are called on an Array[T], and a ClassManifest[T] is NOT implicitly in scope, behavior is up for debate. Given the poor performance of reflection, I'm of the opinion that this should be a compile error (just like "new Array[T]" is a compile error if no ClassManifest[T] is around). If users are willing to take the performance penalty, they can use reflection themselves. If, however, it's deemed convenient to let the compiler emit code that uses reflection, at the very least it should issue a warning.

Once this is done, the possibility of @specializing ClassManifest[T] should be explored.

I'll work on the patch for the necessary changes to ClassManifest, and benchmark the code to make sure the performance gains actually materialize, but the changes to the compiler should probably be done by someone more familiar with how the current Array code works.

--j

Code for the benchmark is here:
http://pastebin.com/m341fdcfb

You'll want a simple shell script to run the code:
http://pastebin.com/m67b15141
(warning: that script is very specific to my environment, you'll have to tweak it a bit)

Numbers for JDK6u18:
(Courtesy Ismael Juma, who keeps up to date with JVM releases)
Primitive unboxed apply took: 28.27378ms
Primitive boxed apply took: 41.364976ms
Virtual unboxed apply took: 43.188233ms
Virtual boxed apply took: 69.799574ms
Reflective unboxed apply took: 417.834377ms
Reflective boxed apply took: 1184.316914ms
Primitive unboxed update took: 30.53889ms
Primitive boxed update took: 97.971217ms
Virtual unboxed update took: 26.524369ms
Virtual boxed update took: 95.250312ms
Reflective unboxed update took: 429.913418ms
Reflective boxed update took: 614.674586ms
Primitive length took: 19.568499ms
Virtual length took: 18.571432ms
Reflective length took: 21.16248ms

Numbers for JDK5u19:
Primitive unboxed apply took: 21.984ms
Primitive boxed apply took: 51.213ms
Virtual unboxed apply took: 102.671ms
Virtual boxed apply took: 110.819ms
Reflective unboxed apply took: 940.272ms
Reflective boxed apply took: 1875.599ms
Primitive unboxed update took: 33.751ms
Primitive boxed update took: 104.144ms
Virtual unboxed update took: 31.84ms
Virtual boxed update took: 130.122ms
Reflective unboxed update took: 1177.238ms
Reflective boxed update took: 1415.715ms
Primitive length took: 23.133ms
Virtual length took: 22.298ms
Reflective length took: 863.342ms

Key:
"Primitive unboxed" is the native operative
"Primitive boxed" is the native operation + boxing costs
"Virtual unboxed" is the ClassManifest[@specialized T] approach
"Virtual boxed" is the ClassManifest[T] approach
"Reflective boxed" is the current behavior in 2.8

"Apply" is get
"Update" is set

---------- Forwarded message ----------
From: Tiark Rompf <tiark [dot] rompf [at] epfl [dot] ch>
Date: Thu, Dec 17, 2009 at 8:28 AM
Subject: Re: brainstorming on ways to construct a ParallelArray
To: Scala Parallel Collections <scala-parallel-collections [at] googlegroups [dot] com>


The situation is complex ;-) Imagine you have a function

def fifth[T](xs: Array[T]) = xs(5)

Then what bytecode should be generated so that both of the following
work:

fifth(new Array[Int](10))
fifth(new Array[AnyRef](10))

On the bytecode level, the argument types are int[] and Object[], so
element lookup is either an iaload or an aaload instruction.

Without generating different code for both cases (which is what
@specialize does) reflection is basically the only option.

In principle one could require that ClassManifests be available
everywhere generic arrays are used and then have methods arrayGet and
arraySet in ClassManifest that would contain the appropriate bytecode
instructions. But still they would need to return Object, which means
one allocation per slot access for primitive arrays, and that would
probably have about the same cost as a reflective access.

- Tiark

On Dec 16, 8:56 pm, Jorge Ortiz <jorge [dot] or [dot] [dot] [dot] [at] gmail [dot] com> wrote:
> Tiark,
>
> If the ClassManifest is available, shouldn't the compiler be able to make
> non-reflective calls for array access/update?
>
> --j
>
>
>
> On Wed, Dec 16, 2009 at 11:22 AM, Tiark Rompf <tiark [dot] ro [dot] [dot] [dot] [at] epfl [dot] ch> wrote:
> > In principle yes, but GenericArray[T] adds an unnecessary indirection.
> > I would recommend using Array[AnyRef] for the moment and later seeing
> > how @specialize does.
>
> > Note that only generic arrays have a performance problem, and
> > unfortunately there's not much that can be done about it.
>
> > - Tiark
>
> > On Dec 16, 6:45 pm, Jorge Ortiz <jorge [dot] or [dot] [dot] [dot] [at] gmail [dot] com> wrote:
> > > Yes, GenericArray[T] is the right way to go here, until the reflection
> > > issues with Array[T] are addressed. A GenericArray[T] is backed by an
> > > Array[AnyRef], which means that a GenericArray of primitives will have
> > its
> > > members boxed, but that's also the case with Array[Any], so there should
> > be
> > > no difference there. The advantage of using GenericArray[T] is that casts
> > in
> > > ParallelArray can be avoided.
>
> > > In the long-term, the performance issues with Array will need to be
> > > addressed.
>
> > > --j
>
> > > On Wed, Dec 16, 2009 at 9:37 AM, Daniel Sobral <dcsob [dot] [dot] [dot] [at] gmail [dot] com>
> > wrote:
> > > > Does this problem exist for GenericArray? Martin had a hell of a time
> > to
> > > > get Array in its Scala 2.8 state in order, and had to introduce a host
> > of
> > > > concepts to make it work. And the jury is still out on that. I'm just
> > saying
> > > > leaving Java arrays aside for now and concentrating on GenericArray
> > might be
> > > > a good idea.
>
> > > > On Wed, Dec 16, 2009 at 2:34 PM, Aleksandar Prokopec <
> > > > aleksandar [dot] proko [dot] [dot] [dot] [at] gmail [dot] com> wrote:
>
> > > >>> OK. Any reason not to stick a protected/private on there right now?
>
> > > >> No reason whatsoever as far as I can see :)
>
> > > >> Are you suggesting we add in the manifests to get the api right, then
> > add
> > > >>> casts in the code to fix the performance?
>
> > > >> Well, whatever we decide to do, the important thing is that if the
> > > >> collection uses an array
>
> > > >> I was thinking of something like, say, this (modulo the actual name of
> > the
> > > >> method in question :) ):
>
> > > >> object ParallelArray {
> > > >>  def fromArray[T](array: Array[T]) = new
> > > >> ParallelArray[T](array.asInstanceOf[Array[Any]], 0, array.size)
> > > >> }
>
> > > >> To the best of my knowledge, as soon as you declare a field typed as a
> > > >> generic array using manifests, it gets treated as an object and
> > reflection
> > > >> is used to access its elements.
>
> > > >> The only problem for the approach above is that it doesn't quite work
> > for
> > > >> arrays of primitive types. I'll think about this a little bit
>
> > > > --
> > > > Daniel C. Sobral
>
> > > > I travel to the future all the time.



odersky
Joined: 2008-07-29,
User offline. Last seen 45 weeks 6 days ago.
Re: Re: Reflection on Array apply/update/length (was Re: brain

I think Jorge's approach looks like a winner. Jorge, can you send us
your patches to Manifest, then I can try to change to compiler to make
use of Manifests where available. There will still be a performance
hit, namely when manifests are not available, but at least there's
something one can do about it now. I am a bit surprised that array
reflection is so bad performance-wise. A factor of 10 performance hit
for boxed arguments like strings is not at all impressive. I would
have thought that in that case a reflective array access should be
about as fast as native code, but obviously it's implemented in a very
suboptimal way.

Cheers

Tiark Rompf
Joined: 2009-02-18,
User offline. Last seen 42 years 45 weeks ago.
Re: Reflection on Array apply/update/length (was Re: brainstormi

On 20.12.2009, at 20:47, Jorge Ortiz wrote:

> I've also written another benchmark. This one implements an insertion sort algorithm for Array[T], and runs it on reverse-sorted lists (to guarantee worst-case N^2 performance). The same algorithm is implemented using normal array apply/update (which uses reflection) and again using the methods I added to ClassManifest. I use it to sort arrays of Ints, Doubles, and Strings.
>
> You're correct that megamorphic call sites hurt performance, but the penalty is not quite as bad as you feared.

Your mileage will vary, as always. If you look at the String case, it's almost 2x and in both cases, a lot of time will be spent doing String comparisons. But anyways, the edge over reflection is clearly more than substantial. It's really impressive how slow reflection still is. I knew it was back in Java 1.4, because back then I hacked the Nice compiler to avoid Array.get since it was a serious bottleneck in one of my projects, but I really thought it had improved in the meantime.

Since you already have those benchmarks going, there is one other thing we might want to try, namely using explicit instanceof tests (probably first Object[], then int[], etc) instead of reflection. Could you try that out as well, both monomorphically and megamorphically?

Thanks,
- Tiark
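
The mono- vs megamorphic distinction under discussion can be made concrete with a small sketch (names invented). A call site that only ever sees one array class stays monomorphic and the JIT can inline the dispatch; feeding the same generic code several array classes makes the dispatch polymorphic and defeats that inlining.

```scala
// Invented sketch of a mono- vs megamorphic call site for generic array access.
object CallSiteSketch {
  def swap[T](xs: Array[T], i: Int, j: Int): Unit = {
    val t = xs(i); xs(i) = xs(j); xs(j) = t   // generic apply + update
  }

  def monomorphic(): Array[Int] = {
    val ints = Array(1, 2, 3)
    swap(ints, 0, 2)            // only Array[Int] ever reaches this call site
    ints
  }

  def megamorphic(): Array[String] = {
    val ints    = Array(1, 2, 3)
    val strings = Array("a", "b", "c")
    swap(ints, 0, 2)            // the same generic code now sees int[]...
    swap(strings, 0, 2)         // ...and String[], so dispatch stays polymorphic
    strings
  }
}
```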

ijuma
Joined: 2008-08-20,
User offline. Last seen 22 weeks 2 days ago.
Re: Re: Reflection on Array apply/update/length (was Re: brains

On Mon, 2009-12-21 at 16:28 +0100, Tiark Rompf wrote:
> Your mileage will vary, as always. If you look at the String case, it's
> almost 2x and in both cases, a lot of time will be spent doing String
> comparisons. But anyways, the edge over reflection is clearly more
> than substantial. It's really impressive how slow reflection still is.

There are many aspects of reflection and some of them have received a
substantial performance boost over the years as runtime bytecode
generation was introduced. There is some overhead that is still there
(particularly wrapping the arguments into an array) and the look-up
methods are still slow.

For this particular case (Array.get and set), I believe that the reason
why it hasn't been optimised much is that it's not commonly used in
performance-sensitive Java code.

Best,
Ismael

odersky
Joined: 2008-07-29,
User offline. Last seen 45 weeks 6 days ago.
Re: Re: Reflection on Array apply/update/length (was Re: brain

Ok, I just committed (in r20267) a change to ScalaRunTime which
replaces the reflective calls by instanceof tests. Jorge, can you try
your benchmark again with that?
It's currently in the build queue.

Thanks

Jorge Ortiz
Joined: 2008-12-16,
User offline. Last seen 29 weeks 4 days ago.
Re: Re: Reflection on Array apply/update/length (was Re: brain
Now we're talking!

That made a HUGE difference. So much so that I had to work on the benchmarks some more to really tease out the differences between the two approaches.

First I dropped the Double benchmarks, because the GC overhead from all the boxing was killing them. (The Ints are all <=100, so they're cached by the runtime, but the Doubles were getting allocated/collected at a dizzying rate.)

Next I dropped sorting as a benchmark, since just doing the comparisons was becoming pretty significant (also, each comparison on primitives adds another round of boxing/unboxing).

I settled on a "shuffling" benchmark. It just takes an array and shuffles its elements in place by swapping two elements at a time.

The manifest approach is lightning fast with monomorphic call sites. But it takes a HUGE hit (almost 20x with String!) when the call site is megamorphic. Such a big hit that it performs worse than the instanceof approach (which, if anything, seems to get slightly faster with megamorphic call sites).

I don't know what to recommend now, because neither approach is a clear winner. Perhaps Manifests + @specialized could provide a clear win, since it would avoid boxing altogether and perhaps even avoid megamorphic call sites.

--j

Code:
http://pastebin.com/m31a50b96

Bash:
http://pastebin.com/m7f61f36

Numbers on JDK6u14:
Monomorphic Manifest Shuffle Int Took: 35.74033333333333ms
Monomorphic Instanceof Shuffle Int Took: 148.65333333333334ms
Monomorphic Manifest Shuffle String Took: 9.428333333333335ms
Monomorphic Instanceof Shuffle String Took: 175.306ms
Megamorphic Manifest Shuffle Int Took: 178.27033333333335ms
Megamorphic Instanceof Shuffle Int Took: 146.33466666666666ms
Megamorphic Manifest Shuffle String Took: 194.65733333333333ms
Megamorphic Instanceof Shuffle String Took: 128.22866666666667ms

On Mon, Dec 21, 2009 at 11:49 AM, martin odersky <martin [dot] odersky [at] epfl [dot] ch> wrote:
Ok, I just committed (in r20267) a change to ScalaRunTime which
replaces the reflective calls by instanceof tests. Jorge, can you try
your benchmark again with that?
It's currently in the build queue.

Thanks

Jorge Ortiz
Joined: 2008-12-16,
User offline. Last seen 29 weeks 4 days ago.
Re: Re: Reflection on Array apply/update/length (was Re: brain
Open mouth, insert foot.

I had added some dubious command-line options (-Xbatch -XX:CICompilerCount=1) at one point and forgot to remove them.

With those options gone (and only -server -Xms2048m -Xmx2048m remaining), the manifest approach ekes out a small improvement on megamorphic call sites, making it a clearer winner overall.

I'll do some benchmarking on array allocations and array length (which I've neglected until now) and then submit a patch for ClassManifest and Manifest.

--j

Monomorphic Manifest Shuffle Int Took: 36.275666666666666ms
Monomorphic Instanceof Shuffle Int Took: 142.18233333333333ms
Monomorphic Manifest Shuffle String Took: 9.397333333333334ms
Monomorphic Instanceof Shuffle String Took: 165.287ms
Megamorphic Manifest Shuffle Int Took: 141.081ms
Megamorphic Instanceof Shuffle Int Took: 146.20066666666665ms
Megamorphic Manifest Shuffle String Took: 155.03233333333333ms
Megamorphic Instanceof Shuffle String Took: 158.02166666666665ms


On Mon, Dec 21, 2009 at 9:28 PM, Jorge Ortiz <jorge [dot] ortiz [at] gmail [dot] com> wrote:
Now we're talking!

That made a HUGE difference. So much so that I had to work on the benchmarks some more to really tease out the differences between the two approaches.

First I dropped the Double benchmarks, because the GC overhead from all the boxing was killing them. (The Ints are all <=100, so they're cached by the runtime, but the Doubles were getting allocated/collected at a dizzying rate.)

Next I dropped sorting as a benchmark, since just doing the comparisons was becoming pretty significant (also, each comparison on primitives adds another round of boxing/unboxing).

I settled on a "shuffling" benchmark. It just takes an array and shuffles its elements in place by swapping two elements at a time.

The manifest approach is lightning fast with monomorphic call sites. But it takes a HUGE hit (almost 20x with String!) when the call site is megamorphic. Such a big hit that it performs worse than the instanceof approach (which, if anything, seems to get slightly faster with megamorphic call sites).

I don't know what to recommend now, because neither approach is a clear winner. Perhaps Manifests + @specialized could provide a clear win, since it would avoid boxing altogether and perhaps even avoid megamorphic call sites.

--j

Code:
http://pastebin.com/m31a50b96

Bash:
http://pastebin.com/m7f61f36

Numbers on JDK6u14:
Monomorphic Manifest Shuffle Int Took: 35.74033333333333ms
Monomorphic Instanceof Shuffle Int Took: 148.65333333333334ms
Monomorphic Manifest Shuffle String Took: 9.428333333333335ms
Monomorphic Instanceof Shuffle String Took: 175.306ms
Megamorphic Manifest Shuffle Int Took: 178.27033333333335ms
Megamorphic Instanceof Shuffle Int Took: 146.33466666666666ms
Megamorphic Manifest Shuffle String Took: 194.65733333333333ms
Megamorphic Instanceof Shuffle String Took: 128.22866666666667ms

On Mon, Dec 21, 2009 at 11:49 AM, martin odersky <martin [dot] odersky [at] epfl [dot] ch> wrote:
Ok, I just committed (in r20267) a change to ScalaRunTime which
replaces the reflective calls by instanceof tests. Jorge, can you try
your benchmark again with that?
It's currently in the build queue.

Thanks

ijuma
Joined: 2008-08-20,
User offline. Last seen 22 weeks 2 days ago.
Re: Re: Reflection on Array apply/update/length (was Re: brains

On Mon, 2009-12-21 at 21:39 -0600, Jorge Ortiz wrote:
> With those options gone (and only -server -Xms2048m -Xmx2048m
> remaining), the manifest approach ekes out a small improvement on
> megamorphic call sites, making it a clearer winner overall.

Looks that way, although the fallback when a manifest is not available
should be the instanceof approach instead of Array.get/set.

> I'll do some benchmarking on array allocations and array length (which
> I've neglected until now) and then submit a patch for ClassManifest
> and Manifest.

Makes sense to avoid any surprises.

Best,
Ismael

odersky
Joined: 2008-07-29,
User offline. Last seen 45 weeks 6 days ago.
Re: Re: Reflection on Array apply/update/length (was Re: brain

On Tue, Dec 22, 2009 at 8:56 AM, Ismael Juma wrote:
> On Mon, 2009-12-21 at 21:39 -0600, Jorge Ortiz wrote:
>> With those options gone (and only -server -Xms2048m -Xmx2048m
>> remaining), the manifest approach ekes out a small improvement on
>> megamorphic call sites, making it a clearer winner overall.

Thanks for the analysis Jorge!

> Looks that way, although the fallback when a manifest is not available
> should be the instanceof approach instead of Array.get/set.
>
Yes, of course. Array.get/set are already gone for good.
I am really surprised that they were so inefficient. I would have
thought the trivial implementation of Array.get/set in terms of
isInstanceof would be the baseline case on which they improve by means
of clever VM hackery. It's a mystery to me what these implementations
do that makes them so slow, and why they do it that way.

Cheers

ijuma
Joined: 2008-08-20,
User offline. Last seen 22 weeks 2 days ago.
Re: Re: Reflection on Array apply/update/length (was Re: brain

On Tue, 2009-12-22 at 09:46 +0100, martin odersky wrote:
> Yes, of course. Array.get/set are already gone for good.

Great.

> I am really surprised that they were so inefficient. I would have
> thought the trivial implementation of Array.get/set in terms of
> isInstanceof would be the baseline case on which they improve by means
> of clever VM hackery. It's a mystery to me what these implementations
> do that makes them so slow, and why they do it that way.

Yes, if they didn't have the time to have an optimised and
"intrinsified" method, they should have just done it as a Java method
with instanceof as you suggest. A possibility is that plain JNI is used
in the native methods, which involves some overhead. Maybe it was done
that way when instanceof was still slow and no-one updated it when that
stopped being true.

Best,
Ismael

Tiark Rompf
Joined: 2009-02-18,
User offline. Last seen 42 years 45 weeks ago.
Re: Re: Reflection on Array apply/update/length (was Re: brains
Hi Jorge,
On 22.12.2009, at 04:39, Jorge Ortiz wrote:
the manifest approach ekes out a small improvement on megamorphic call sites, making it a clearer winner overall.

Hmm. Is it that clear a winner? The monomorphic String shuffle performance is sweet, but 9 ms compared to 155 says pretty clearly that everything is inlined and optimized en bloc. Now a fundamental question seems to be what use cases we want to perform really well, and I'm not sure whether realistic uses of generic arrays will optimize as well as the simple shuffling. 
That said, I think there is room for further improvement in the instanceof approach. For example, if I replace the instanceOfSwap method with the following
  def accessBoxed(array: AnyRef, i: Int): AnyRef = {
    if (array.isInstanceOf[Array[Int]]) Integer.valueOf(array.asInstanceOf[Array[Int]](i)) else null
  }

  def updateBoxed(array: AnyRef, i: Int, value: AnyRef): Unit = {
    if (array.isInstanceOf[Array[Int]]) array.asInstanceOf[Array[Int]](i) = value.asInstanceOf[java.lang.Integer].intValue
  }

  def instanceofSwap[T](array: Array[T], i: Int, j: Int): Unit = {
    val a: AnyRef =
      if (array.isInstanceOf[Array[AnyRef]]) array.asInstanceOf[Array[AnyRef]](i)
      else accessBoxed(array, i)
    val b: AnyRef =
      if (array.isInstanceOf[Array[AnyRef]]) array.asInstanceOf[Array[AnyRef]](j)
      else accessBoxed(array, j)

    if (array.isInstanceOf[Array[AnyRef]]) array.asInstanceOf[Array[AnyRef]](i) = b
    else updateBoxed(array, i, b)
    if (array.isInstanceOf[Array[AnyRef]]) array.asInstanceOf[Array[AnyRef]](j) = a
    else updateBoxed(array, j, a)
  }
then I get somewhat different numbers (6u17 64bit):
Monomorphic Manifest Shuffle Int Took: 40.19266666666667ms
Monomorphic Instanceof Shuffle Int Took: 130.00066666666666ms
Monomorphic Manifest Shuffle String Took: 11.000333333333332ms
Monomorphic Instanceof Shuffle String Took: 63.836333333333336ms
Megamorphic Manifest Shuffle Int Took: 192.18633333333332ms
Megamorphic Instanceof Shuffle Int Took: 114.82333333333334ms
Megamorphic Manifest Shuffle String Took: 188.84699999999998ms
Megamorphic Instanceof Shuffle String Took: 93.54466666666667ms
So for the megamorphic cases, instanceof now has a clear advantage. I'm not saying we shouldn't use manifests, but I think the alternatives have to be weighed carefully, and there are a lot of issues to consider (like possible cache misses caused by indirection, amenability to inlining by -optimize, etc).
- Tiark



I'll do some benchmarking on array allocations and array length (which I've neglected until now) and then submit a patch for ClassManifest and Manifest.

--j

Monomorphic Manifest Shuffle Int Took: 36.275666666666666ms
Monomorphic Instanceof Shuffle Int Took: 142.18233333333333ms
Monomorphic Manifest Shuffle String Took: 9.397333333333334ms
Monomorphic Instanceof Shuffle String Took: 165.287ms
Megamorphic Manifest Shuffle Int Took: 141.081ms
Megamorphic Instanceof Shuffle Int Took: 146.20066666666665ms
Megamorphic Manifest Shuffle String Took: 155.03233333333333ms
Megamorphic Instanceof Shuffle String Took: 158.02166666666665ms


On Mon, Dec 21, 2009 at 9:28 PM, Jorge Ortiz <jorge [dot] ortiz [at] gmail [dot] com> wrote:
Now we're talking!

That made a HUGE difference. So much so that I had to work on the benchmarks some more to really tease out the differences between the two approaches.

First I dropped the Double benchmarks, because the GC overhead from all the boxing was killing them. (The Ints are all <=100, so they're cached by the runtime, but the Doubles were getting allocated/collected at a dizzying rate.)

Next I dropped sorting as a benchmark, since just doing the comparisons was becoming pretty significant (also, each comparison on primitives adds another round of boxing/unboxing).

I settled on a "shuffling" benchmark. It just takes an array and shuffles its elements in place by swapping two elements at a time.

The manifest approach is lightning fast with monomorphic call sites. But it takes a HUGE hit (almost 20x with String!) when the call site is megamorphic. Such a big hit that it performs worse than the instanceof approach (which, if anything, seems to get slightly faster with megamorphic call sites).

I don't know what to recommend now, because neither approach is a clear winner. Perhaps Manifests + @specialized could provide a clear win, since it would avoid boxing altogether and perhaps even avoid megamorphic call sites.

--j

Code:
http://pastebin.com/m31a50b96

Bash:
http://pastebin.com/m7f61f36

Numbers on JDK6u14:
Monomorphic Manifest Shuffle Int Took: 35.74033333333333ms
Monomorphic Instanceof Shuffle Int Took: 148.65333333333334ms
Monomorphic Manifest Shuffle String Took: 9.428333333333335ms
Monomorphic Instanceof Shuffle String Took: 175.306ms
Megamorphic Manifest Shuffle Int Took: 178.27033333333335ms
Megamorphic Instanceof Shuffle Int Took: 146.33466666666666ms
Megamorphic Manifest Shuffle String Took: 194.65733333333333ms
Megamorphic Instanceof Shuffle String Took: 128.22866666666667ms

On Mon, Dec 21, 2009 at 11:49 AM, martin odersky <martin [dot] odersky [at] epfl [dot] ch> wrote:
Ok, I just committed (in r20267) a change to ScalaRunTime which
replaces the reflective calls by instanceof tests. Jorge, can you try
your benchmark again with that?
It's currently in the build queue.

Thanks

 -- Martin

On Mon, Dec 21, 2009 at 4:42 PM, Ismael Juma <mlists [at] juma [dot] me [dot] uk> wrote:
> On Mon, 2009-12-21 at 16:28 +0100, Tiark Rompf wrote:
>> Your mileage will vary, as always. If you look at the String case, it's
>> almost 2x and in both cases, a lot of time will be spent doing String
>> comparisons. But anyways, the edge over reflection is clearly more
>> than substantial. It's really impressive how slow reflection still is.
>
> There are many aspects of reflection and some of them have received a
> substantial performance boost over the years as runtime bytecode
> generation was introduced. There is some overhead that is still there
> (particularly wrapping the arguments into an array) and the look-up
> methods are still slow.
>
> For this particular case (Array.get and set), I believe that the reason
> why it hasn't been optimised much is that it's not commonly used in
> performance-sensitive Java code.
>
> Best,
> Ismael
>
>



Jorge Ortiz
Joined: 2008-12-16,
User offline. Last seen 29 weeks 4 days ago.
Re: Re: Reflection on Array apply/update/length (was Re: brain
What's the proposed compiler desugaring for array get/set? Or is this an optimization the user can apply manually if they know only Arrays of Ints and AnyRefs will be used?

--j

On Tue, Dec 22, 2009 at 7:01 PM, Tiark Rompf <tiark [dot] rompf [at] epfl [dot] ch> wrote:
Hi Jorge,
On 22.12.2009, at 04:39, Jorge Ortiz wrote:
the manifest approach ekes out a small improvement on megamorphic call sites, making it a clearer winner overall.

Hmm. Is it that clear a winner? The monomorphic String shuffle performance is sweet, but 9 ms compared to 155 says pretty clearly that everything is inlined and optimized en bloc. Now a fundamental question seems to be what use cases we want to perform really well, and I'm not sure whether realistic uses of generic arrays will optimize as well as the simple shuffling. 
That said, I think there is room for further improvement in the instanceof approach. For example, if I replace the instanceOfSwap method with the following
  def accessBoxed(array: AnyRef, i: Int): AnyRef = {
    if (array.isInstanceOf[Array[Int]]) Integer.valueOf(array.asInstanceOf[Array[Int]](i)) else null
  }

  def updateBoxed(array: AnyRef, i: Int, value: AnyRef): Unit = {
    if (array.isInstanceOf[Array[Int]]) array.asInstanceOf[Array[Int]](i) = value.asInstanceOf[java.lang.Integer].intValue
  }

  def instanceofSwap[T](array: Array[T], i: Int, j: Int): Unit = {
    val a: AnyRef =
      if (array.isInstanceOf[Array[AnyRef]]) array.asInstanceOf[Array[AnyRef]](i)
      else accessBoxed(array, i)
    val b: AnyRef =
      if (array.isInstanceOf[Array[AnyRef]]) array.asInstanceOf[Array[AnyRef]](j)
      else accessBoxed(array, j)

    if (array.isInstanceOf[Array[AnyRef]]) array.asInstanceOf[Array[AnyRef]](i) = b
    else updateBoxed(array, i, b)
    if (array.isInstanceOf[Array[AnyRef]]) array.asInstanceOf[Array[AnyRef]](j) = a
    else updateBoxed(array, j, a)
  }
then I get somewhat different numbers (6u17 64bit):
Monomorphic Manifest Shuffle Int Took: 40.19266666666667ms
Monomorphic Instanceof Shuffle Int Took: 130.00066666666666ms
Monomorphic Manifest Shuffle String Took: 11.000333333333332ms
Monomorphic Instanceof Shuffle String Took: 63.836333333333336ms
Megamorphic Manifest Shuffle Int Took: 192.18633333333332ms
Megamorphic Instanceof Shuffle Int Took: 114.82333333333334ms
Megamorphic Manifest Shuffle String Took: 188.84699999999998ms
Megamorphic Instanceof Shuffle String Took: 93.54466666666667ms
So for the megamorphic cases, instanceof now has a clear advantage. I'm not saying we shouldn't use manifests, but I think the alternatives have to be weighed carefully, and there are a lot of issues to consider (like possible cache misses caused by indirection, amenability to inlining by -optimize, etc).
- Tiark



I'll do some benchmarking on array allocations and array length (which I've neglected until now) and then submit a patch for ClassManifest and Manifest.

--j

Monomorphic Manifest Shuffle Int Took: 36.275666666666666ms
Monomorphic Instanceof Shuffle Int Took: 142.18233333333333ms
Monomorphic Manifest Shuffle String Took: 9.397333333333334ms
Monomorphic Instanceof Shuffle String Took: 165.287ms
Megamorphic Manifest Shuffle Int Took: 141.081ms
Megamorphic Instanceof Shuffle Int Took: 146.20066666666665ms
Megamorphic Manifest Shuffle String Took: 155.03233333333333ms
Megamorphic Instanceof Shuffle String Took: 158.02166666666665ms


On Mon, Dec 21, 2009 at 9:28 PM, Jorge Ortiz <jorge [dot] ortiz [at] gmail [dot] com> wrote:
Now we're talking!

That made a HUGE difference. So much so that I had to work on the benchmarks some more to really tease out the differences between the two approaches.

First I dropped the Double benchmarks, because the GC overhead from all the boxing was killing them. (The Ints are all <=100, so they're cached by the runtime, but the Doubles were getting allocated/collected at a dizzying rate.)

Next I dropped sorting as a benchmark, since just doing the comparisons was becoming pretty significant (also, each comparison on primitives adds another round of boxing/unboxing).

I settled on a "shuffling" benchmark. It just takes an array and shuffles its elements in place by swapping two elements at a time.

The manifest approach is lightning fast with monomorphic call sites. But it takes a HUGE hit (almost 20x with String!) when the call site is megamorphic. Such a big hit that it performs worse than the instanceof approach (which, if anything, seems to get slightly faster with megamorphic call sites).

I don't know what to recommend now, because neither approach is a clear winner. Perhaps Manifests + @specialized could provide a clear win, since it would avoid boxing altogether and perhaps even avoid megamorphic call sites.

--j

Code:
http://pastebin.com/m31a50b96

Bash:
http://pastebin.com/m7f61f36

Numbers on JDK6u14:
Monomorphic Manifest Shuffle Int Took: 35.74033333333333ms
Monomorphic Instanceof Shuffle Int Took: 148.65333333333334ms
Monomorphic Manifest Shuffle String Took: 9.428333333333335ms
Monomorphic Instanceof Shuffle String Took: 175.306ms
Megamorphic Manifest Shuffle Int Took: 178.27033333333335ms
Megamorphic Instanceof Shuffle Int Took: 146.33466666666666ms
Megamorphic Manifest Shuffle String Took: 194.65733333333333ms
Megamorphic Instanceof Shuffle String Took: 128.22866666666667ms

On Mon, Dec 21, 2009 at 11:49 AM, martin odersky <martin [dot] odersky [at] epfl [dot] ch> wrote:
Ok, I just committed (in r20267) a change to ScalaRunTime which
replaces the reflective calls by instanceof tests. Jorge, can you try
your benchmark again with that?
It's currently in the build queue.

Thanks

Tiark Rompf
Joined: 2009-02-18,
User offline. Last seen 42 years 45 weeks ago.
Re: Re: Reflection on Array apply/update/length (was Re: brains
For example like this:
  @inline def array_apply(xs: AnyRef, idx: Int): Any = xs match {
    case x: Array[AnyRef]  => x(idx).asInstanceOf[Any]
    case _ => array_apply_primitive(xs, idx)
  }

  @noinline def array_apply_primitive(xs: AnyRef, idx: Int): Any = xs match {
    case x: Array[Int]     => x(idx).asInstanceOf[Any]
    case x: Array[Double]  => x(idx).asInstanceOf[Any]
    case x: Array[Long]    => x(idx).asInstanceOf[Any]
    case x: Array[Float]   => x(idx).asInstanceOf[Any]
    case x: Array[Char]    => x(idx).asInstanceOf[Any]
    case x: Array[Byte]    => x(idx).asInstanceOf[Any]
    case x: Array[Short]   => x(idx).asInstanceOf[Any]
    case x: Array[Boolean] => x(idx).asInstanceOf[Any]
    case x: Array[Unit]    => x(idx).asInstanceOf[Any]
    case _ => throw new IllegalArgumentException
  }
Or some variation, like splitting off the Int case into its own method as well (I haven't tried that). 
I think it would be good to give priority to the AnyRef case because it is likely that for primitives, boxing will preclude any efficient implementation in the end. Do you agree on that?
BTW, I've had a look at the HotSpot implementation of Array.get. It is indeed not an intrinsic but a plain JNI call to a method that does a bounds check, checks whether the argument is a reference array and if not does a switch/case on the primitive type tag. It does not look particularly inefficient, but I suppose the JNI overhead, the fact that it goes through several function calls and the various checks just add up. And of course HotSpot can't 'look inside' the code, so the usual bytecode/intrinsics optimizations don't apply.
Best wishes for the holidays!
- Tiark

On 23.12.2009, at 04:56, Jorge Ortiz wrote:
What's the proposed compiler desugaring for array get/set? Or is this an optimization the user can apply manually if they know only Arrays of Ints and AnyRefs will be used?

--j

On Tue, Dec 22, 2009 at 7:01 PM, Tiark Rompf <tiark [dot] rompf [at] epfl [dot] ch> wrote:
Hi Jorge,
On 22.12.2009, at 04:39, Jorge Ortiz wrote:
the manifest approach ekes out a small improvement on megamorphic call sites, making it a clearer winner overall.

Hmm. Is it that clear a winner? The monomorphic String shuffle performance is sweet, but 9 ms compared to 155 says pretty clearly that everything is inlined and optimized en bloc. Now a fundamental question seems to be what use cases we want to perform really well, and I'm not sure whether realistic uses of generic arrays will optimize as well as the simple shuffling. 
That said, I think there is room for further improvement in the instanceof approach. For example, if I replace the instanceOfSwap method with the following
  def accessBoxed(array: AnyRef, i: Int): AnyRef = {
    if (array.isInstanceOf[Array[Int]]) Integer.valueOf(array.asInstanceOf[Array[Int]](i)) else null
  }

  def updateBoxed(array: AnyRef, i: Int, value: AnyRef): Unit = {
    if (array.isInstanceOf[Array[Int]]) array.asInstanceOf[Array[Int]](i) = value.asInstanceOf[java.lang.Integer].intValue
  }

  def instanceofSwap[T](array: Array[T], i: Int, j: Int): Unit = {
    val a: AnyRef =
      if (array.isInstanceOf[Array[AnyRef]]) array.asInstanceOf[Array[AnyRef]](i)
      else accessBoxed(array, i)
    val b: AnyRef =
      if (array.isInstanceOf[Array[AnyRef]]) array.asInstanceOf[Array[AnyRef]](j)
      else accessBoxed(array, j)

    if (array.isInstanceOf[Array[AnyRef]]) array.asInstanceOf[Array[AnyRef]](i) = b
    else updateBoxed(array, i, b)
    if (array.isInstanceOf[Array[AnyRef]]) array.asInstanceOf[Array[AnyRef]](j) = a
    else updateBoxed(array, j, a)
  }
then I get somewhat different numbers (6u17 64bit):
Monomorphic Manifest Shuffle Int Took: 40.19266666666667ms
Monomorphic Instanceof Shuffle Int Took: 130.00066666666666ms
Monomorphic Manifest Shuffle String Took: 11.000333333333332ms
Monomorphic Instanceof Shuffle String Took: 63.836333333333336ms
Megamorphic Manifest Shuffle Int Took: 192.18633333333332ms
Megamorphic Instanceof Shuffle Int Took: 114.82333333333334ms
Megamorphic Manifest Shuffle String Took: 188.84699999999998ms
Megamorphic Instanceof Shuffle String Took: 93.54466666666667ms
So for the megamorphic cases, instanceof now has a clear advantage. I'm not saying we shouldn't use manifests, but I think the alternatives have to be weighed carefully, and there are a lot of issues to consider (like possible cache misses caused by indirection, amenability to inlining by -optimize, etc).
- Tiark



I'll do some benchmarking on array allocations and array length (which I've neglected until now) and then submit a patch for ClassManifest and Manifest.

--j

Monomorphic Manifest Shuffle Int Took: 36.275666666666666ms
Monomorphic Instanceof Shuffle Int Took: 142.18233333333333ms
Monomorphic Manifest Shuffle String Took: 9.397333333333334ms
Monomorphic Instanceof Shuffle String Took: 165.287ms
Megamorphic Manifest Shuffle Int Took: 141.081ms
Megamorphic Instanceof Shuffle Int Took: 146.20066666666665ms
Megamorphic Manifest Shuffle String Took: 155.03233333333333ms
Megamorphic Instanceof Shuffle String Took: 158.02166666666665ms


On Mon, Dec 21, 2009 at 9:28 PM, Jorge Ortiz <jorge [dot] ortiz [at] gmail [dot] com> wrote:
Now we're talking!

That made a HUGE difference. So much so that I had to work on the benchmarks some more to really tease out the differences between the two approaches.

First I dropped the Double benchmarks, because the GC overhead from all the boxing was killing them. (The Ints are all <=100, so they're cached by the runtime, but the Doubles were getting allocated/collected at a dizzying rate.)

Next I dropped sorting as a benchmark, since just doing the comparisons was becoming pretty significant (also, each comparison on primitives adds another round of boxing/unboxing).

I settled on a "shuffling" benchmark. It just takes an array and shuffles its elements in place by swapping two elements at a time.
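In outline, the shuffle driver does something like the following (a sketch of the benchmark's shape; the real code is in the pastebin linked at the end of this mail):

```scala
// Sketch of the shuffle benchmark's inner loop: walk the array from the end,
// swapping each element with a randomly chosen earlier one (Fisher-Yates
// style). The swap function under test is passed in, so the same driver can
// exercise the manifest-based and the instanceof-based variants.
def shuffle[T](arr: Array[T], rng: java.util.Random,
               swap: (Array[T], Int, Int) => Unit): Unit = {
  var i = arr.length - 1
  while (i > 0) {
    swap(arr, i, rng.nextInt(i + 1)) // pick a partner at or below i
    i -= 1
  }
}
```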

The manifest approach is lightning fast with monomorphic call sites. But it takes a HUGE hit (almost 20x with String!) when the call site is megamorphic: so big a hit that it performs worse than the instanceof approach (which, if anything, seems to get slightly faster at megamorphic call sites).

I don't know what to recommend now, because neither approach is a clear winner. Perhaps Manifests + @specialized could provide a clear win, since it would avoid boxing altogether and perhaps even avoid megamorphic call sites.
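Roughly, the specialized variant could look like this (a sketch assuming @specialized on the element type; the annotation targets and the surrounding design would be up to the library author):

```scala
// With @specialized(Int) the compiler emits an extra Int variant of this
// method in which the array accesses compile to direct int[] loads and
// stores, so primitive elements are never boxed. (Sketch only.)
def specializedSwap[@specialized(Int) T](arr: Array[T], i: Int, j: Int): Unit = {
  val tmp = arr(i)
  arr(i) = arr(j)
  arr(j) = tmp
}
```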

--j

Code:
http://pastebin.com/m31a50b96

Bash:
http://pastebin.com/m7f61f36

Numbers on JDK6u14:
Monomorphic Manifest Shuffle Int Took: 35.74033333333333ms
Monomorphic Instanceof Shuffle Int Took: 148.65333333333334ms
Monomorphic Manifest Shuffle String Took: 9.428333333333335ms
Monomorphic Instanceof Shuffle String Took: 175.306ms
Megamorphic Manifest Shuffle Int Took: 178.27033333333335ms
Megamorphic Instanceof Shuffle Int Took: 146.33466666666666ms
Megamorphic Manifest Shuffle String Took: 194.65733333333333ms
Megamorphic Instanceof Shuffle String Took: 128.22866666666667ms

On Mon, Dec 21, 2009 at 11:49 AM, martin odersky <martin [dot] odersky [at] epfl [dot] ch> wrote:
Ok, I just committed (in r20267) a change to ScalaRunTime which
replaces the reflective calls by instanceOf tests. Jorge, can you try
your benchmark again with that?
It's currently in the build queue.

Thanks

 -- Martin

On Mon, Dec 21, 2009 at 4:42 PM, Ismael Juma <mlists [at] juma [dot] me [dot] uk> wrote:
> On Mon, 2009-12-21 at 16:28 +0100, Tiark Rompf wrote:
>> Your mileage will vary, as always. If you look at the String case, it's
>> almost 2x and in both cases, a lot of time will be spent doing String
>> comparisons. But anyways, the edge over reflection is clearly more
>> than substantial. It's really impressive how slow reflection still is.
>
> There are many aspects of reflection and some of them have received a
> substantial performance boost over the years as runtime bytecode
> generation was introduced. There is some overhead that is still there
> (particularly wrapping the arguments into an array) and the look-up
> methods are still slow.
>
> For this particular case (Array.get and set), I believe that the reason
> why it hasn't been optimised much is that it's not commonly used in
> performance-sensitive Java code.
>
> Best,
> Ismael
>
>





David Hall 4
Joined: 2009-08-21,
Re: Re: Reflection on Array apply/update/length (was Re: brain

Just to follow up on all this, I have a generic sparse array backed by
an int array and an Array[T]. In some tight loops where I'm accessing
this array, calls to Array.reflect.get and set are eating 60% of my
cpu time according to hprof. When I was using a dense array backed by
an ArrayBuffer, access to the buffer wasn't even showing up.
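The shape of such a structure is roughly this (a minimal hypothetical sketch, not the actual code):

```scala
// Hypothetical sketch of a sparse array backed by a sorted Array[Int] of
// occupied indices and a parallel generic Array[T] of values. Each read of
// values(k) on the generic Array[T] is exactly the kind of access that went
// through reflective Array.get before the fix.
class SparseArray[T](indices: Array[Int], values: Array[T], default: T) {
  def apply(i: Int): T = {
    val k = java.util.Arrays.binarySearch(indices, i) // indices must be sorted
    if (k >= 0) values(k) else default
  }
}
```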

This is using Beta1.RC8.

David Hall 4
Joined: 2009-08-21,
Re: Re: Reflection on Array apply/update/length (was Re: brain

My bad, accidentally using RC5. Everything is amazing.

joshcough
Joined: 2009-05-08,
Re: Re: Reflection on Array apply/update/length (was Re: brain
Paul asked me to bring this back up. Unfortunately I've been super busy lately and haven't had time for parallel collections at all. Anyway, now that things appear to be improved, should we switch ParallelArray away from using Array[Any]?

Alex, would it make sense to maintain two versions, one that uses Array[Any] and the other using manifests? It would at least be nice to know whether you still notice performance differences. I'm really not up to date with all this stuff, but I did notice we're still using Array[Any], so I think it's worth continuing the discussion:

class ParallelArray[T] private[mutable] (val arrcreator: () => Array[Any], sz: Int)


-Josh


On Wed, Jan 27, 2010 at 1:00 PM, David Hall <dlwh [at] cs [dot] berkeley [dot] edu> wrote:
My bad, accidentally using RC5. Everything is amazing.

Copyright © 2012 École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland