
How to speed up this Akka app?

6 replies
Vikas Hazrati
Joined: 2012-01-04

We have a simple benching app which you can access at
https://github.com/knoldus/AkkaFibonacciBenchmark

The idea is simple: we have a producer which sends messages to a listener, which in turn has a number of processors available to work on the messages. As soon as a processor is done, a diagnostics actor records the timings and throughput.

Producer->Listener->Workers in parallel -> Diagnostics
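For reference, the same four-stage flow can be sketched without Akka at all, using a plain JDK thread pool. This is purely illustrative; none of these names come from the actual AkkaFibonacciBenchmark code:

```scala
import java.util.concurrent.{Executors, TimeUnit}
import java.util.concurrent.atomic.AtomicLong

// Illustrative sketch of Producer -> Listener -> Workers -> Diagnostics.
object PipelineSketch {
  def fib(n: Int): Long = if (n < 2) n else fib(n - 1) + fib(n - 2)

  // Returns how many messages the "diagnostics" stage saw completed.
  def run(noOfWorkers: Int, noOfMessages: Int): Long = {
    val workers   = Executors.newFixedThreadPool(noOfWorkers) // the workers
    val completed = new AtomicLong(0)                         // diagnostics counter
    // the producer: hand each message to the pool (the "listener")
    (1 to noOfMessages).foreach { _ =>
      workers.execute(new Runnable {
        def run(): Unit = { fib(20); completed.incrementAndGet() }
      })
    }
    workers.shutdown()
    workers.awaitTermination(1, TimeUnit.MINUTES)
    completed.get()
  }
}
```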

The levers that we played with were

val NO_OF_WORKERS = 8 in FibonacciBench.scala

Dispatcher settings in Processor.scala

object Processor {
  val dispatcher =
    Dispatchers.newExecutorBasedEventDrivenWorkStealingDispatcher("dispatcher")
      .setCorePoolSize(16)
      .setMaxPoolSize(32)
      .build
}
We changed the core pool size.
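For what it's worth, the core/max pool sizes above map onto the underlying JDK `ThreadPoolExecutor` semantics. A hedged sketch of the plain JDK pool (not the Akka dispatcher itself):

```scala
import java.util.concurrent.{LinkedBlockingQueue, ThreadPoolExecutor, TimeUnit}

object DispatcherPoolSketch {
  // Plain JDK equivalent of corePoolSize=16 / maxPoolSize=32. A caveat worth
  // knowing when tuning: with an unbounded queue like this one, a
  // ThreadPoolExecutor never grows beyond its core size, so the max size
  // only takes effect with a bounded queue.
  val pool = new ThreadPoolExecutor(
    16,                                  // corePoolSize
    32,                                  // maxPoolSize
    60L, TimeUnit.SECONDS,               // keep-alive for non-core threads
    new LinkedBlockingQueue[Runnable]()  // unbounded task queue
  )
}
```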

We could peak at around 30K messages per second (MPS).

The tests were executed on a 4-core 2.4 GHz Dell machine running Ubuntu 11.10.

What other options would you recommend?

Regards | Vikas

www.knoldus.com
knoldus.wordpress.com

Graeme Smith
Joined: 2010-03-21
Re: How to speed up this Akka app?

You say you have a Dell, but what are the cache sizes of the processor it is running? What is the FSB speed, or, if it doesn't have an FSB, is it running RIMM (Rambus) memory?

Most people think that the speed of their machine is determined only by the speed of the processor. Not true, I am afraid: the speed of the memory is also a factor, as is the size of the cache.

Essentially, even DDR3, although it runs at the bus clock of the machine, has a latency that can slow down the processor if the working set isn't cached well.

Every time you take a cache miss, you incur a waiting period while the computer fetches that information from main memory.

The larger the cache, the faster the processor can effectively run, whatever its clock speed. Note also that the memory bus is actually much slower than the posted CPU speed: a clock multiplier inside the CPU multiplies the bus clock up to the core clock, so memory runs at a fraction of the core speed, and every cache miss therefore costs that many more core cycles.
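A back-of-envelope number for that miss cost, assuming a typical ~60 ns main-memory latency for DDR3-era hardware (an assumed figure, not measured on the machine in question) against the 2.4 GHz clock from the original post:

```scala
object CacheMissCost {
  val cpuHz          = 2.4e9   // the 2.4 GHz machine from the original post
  val missLatencySec = 60e-9   // assumed ~60 ns main-memory round trip
  // Stalled core cycles per cache miss at these (assumed) numbers: ~144
  val cyclesPerMiss  = cpuHz * missLatencySec
}
```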

Once you have dealt with that, there is still the latency in the microkernel and per actor.

Finally, there is the latency in the application you are using as a benchmark. Is it perhaps sending more messages than it has to, or using too long a back-off when it has to randomize after a network collision? Most people don't realize that just because the NIC claims a gigabit rate, the rest of the path might not sustain it. The protocols slow it down somewhat, as do any collisions. When you have multiple processors each shipping out regular network updates, you have to stagger them somehow, or they will all try to send at the same time and set up a collision, which triggers the time-intensive collision-recovery mechanism.

Are you using a timer to trigger the "Processor Modules"?

I hope this helps. Note that I didn't troubleshoot your application; I will leave that to someone more familiar with Akka. As a microkernel, can't Akka run without the overhead of Ubuntu?

Good luck!


Viktor Klang
Joined: 2008-12-17
Re: How to speed up this Akka app?
I recommend searching the akka-user mailing list for the countless tips and tricks for tuning Akka posted there.
Cheers,
√
--
Viktor Klang

Akka Tech Lead
Typesafe - Enterprise-Grade Scala from the Experts

Twitter: @viktorklang
Derek Williams
Joined: 2011-08-12
Re: How to speed up this Akka app?
On Wed, Jan 4, 2012 at 6:30 AM, Vikas Hazrati <vikas [at] knoldus [dot] com> wrote:
We could peak at around 30K MPS.

The tests were executed on a 4-core 2.4 GHz Dell machine running Ubuntu 11.10.

What other options would you recommend?


Performance on my machine improved 4x just by commenting out the println.
Also, there is no warmup period in this test, so increasing the number of messages helps (I bumped it up to 1000000). A proper warmup run should give you better numbers.
With both of those changes, running with a single worker gives me 36k; with 8 I get 137k. This is on an i7-2600K (4 cores), so that seems like a pretty linear speedup. All cores were maxed while running, so there is probably not much room for improvement on my machine.
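Derek's warmup point can be sketched as a minimal warmup-then-measure harness (illustrative and single-threaded, not the benchmark's actual code): the first call just lets the JIT compile the hot path, and only the second run is reported.

```scala
object WarmupBench {
  def fib(n: Int): Long = if (n < 2) n else fib(n - 1) + fib(n - 2)

  // Messages per second for a given message count (one fib(20) per message).
  def throughput(messages: Int): Double = {
    val start = System.nanoTime()
    var i = 0
    while (i < messages) { fib(20); i += 1 }
    messages * 1e9 / (System.nanoTime() - start)
  }

  def main(args: Array[String]): Unit = {
    throughput(100000)              // warmup run; result discarded
    val mps = throughput(1000000)   // measured run, post-JIT
    println(f"throughput: $mps%.0f messages/sec")
  }
}
```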
--
Derek Williams
Vikas Hazrati
Joined: 2012-01-04
Re: How to speed up this Akka app?
Derek,

The results that I see on my machine mirror your experience in terms of the CPU maxing out with 4/8/16/... processors. I have not disabled the println, so the results are on the lower side. As you did, we tried varying loads as well to get benchmarks. Here are the results with val NO_OF_MESSAGES = 1000000. The code for the bench is at https://github.com/knoldus/AkkaFibonacciBenchmark






Processor   time    max cpu
1           12367   57
2           20656   68
3           23730   85
4           23568   96
8           24968   98
16          24119   98
64          24113   98
128         23799   98
254         22808   97





However, the trend is clear: once the number of processors is >= the number of cores, we max out the CPUs, and after a while the curve flattens, with more processors actually decreasing performance slightly. So it looks like the best I can get on this box is with 8 processors.

Which brings me to a tangent question. When we say that we can run millions of actors in 4 GB of RAM, in which practical scenario would that help me, when I pretty much max out with 8 actors in this case? I understand that they could hold messages in their mailboxes, but they would only get to process once a core is available to them, right?

Regards | Vikas






Vikas Hazrati
Joined: 2012-01-04
Re: How to speed up this Akka app?
>I have not disabled the println so the results are on the lower side.
Sorry, read this as: the sysout had been disabled ...



Vikas Hazrati
Joined: 2012-01-04
Re: How to speed up this Akka app?
Graeme,

What you are suggesting probably makes a lot of sense, and I would have to find out many of the things you mentioned; however, for the current benchmarks I was trying to look at something that could be done better in the code, to see a relative improvement with the same h/w configuration.

This is the CPU info that I have

vikas@vikas-laptop:~/w/knoldus/AkkaFibonacciBenchmark$ cat /proc/cpuinfo
processor    : 0
vendor_id    : GenuineIntel
cpu family    : 6
model        : 37
model name    : Intel(R) Core(TM) i5 CPU       M 520  @ 2.40GHz
stepping    : 2
cpu MHz        : 1197.000
cache size    : 3072 KB
physical id    : 0
siblings    : 4
core id        : 0
cpu cores    : 2
apicid        : 0
initial apicid    : 0
fdiv_bug    : no
hlt_bug        : no
f00f_bug    : no
coma_bug    : no
fpu        : yes
fpu_exception    : yes
cpuid level    : 11
wp        : yes
flags        : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx rdtscp lm constant_tsc arch_perfmon pebs bts xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm sse4_1 sse4_2 popcnt aes lahf_lm ida arat dts tpr_shadow vnmi flexpriority ept vpid
bogomips    : 4788.27
clflush size    : 64
cache_alignment    : 64
address sizes    : 36 bits physical, 48 bits virtual
power management:




Copyright © 2012 École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland