Profiling MRuby Code
I loathe inefficient, slow, or bloated software. Unfortunately, I’ve written plenty of code that’s terrible in this sense and if you program frequently, then I’d wager you have written quite a few inefficient lines of code as well. Fortunately code isn’t written in stone after the first pass and we as programmers can optimize it. Most of the time the speed of a single line of code doesn’t matter or contribute to the overall inefficiency in an application. Sometimes, though, a change to a few lines can make all the difference.
Profilers
Finding those lines of code however is a challenging task unless you have access to some means of profiling the program and finding out which sections of code are run the most frequently and the lines which use up the most time. Each programming language tends to have it’s own tools, though if you’re tackling a mixed ruby/C codebase then you might be familiar with: RubyProf, callgrind, gprof, or the poor man’s profiler.
Based upon the title of this article I’m guessing that you may be interested in profiling the embeddable ruby implementation 'mruby'. While I was developing my own application using mruby, I tended to profile ruby code with timers:
before = Time.new() run_routine() after = Time.new() puts("run_routine took #{1000*(after-before)} ms")
This provided limited information, but it helped direct the development to a few hotspots and then additional profiling of the C portion of the codebase was done with callgrind. Callgrind for me is a holy grail of profiling CPU bound programs. Through kcachegrind it provides an easy to explore callgraph, relative and absolute timing information, source level information, and information at the assembly level. Callgrind combined with kcachgrind make it easy to understand where the hotspots for a program are, what is the callgraph that generates these issues, and what compiled assembly creates the issues. If you’re trying to optimize a C/C++ codebase just use callgrind. It slows down the overall execution by at least an order of magnitude, but the information provided is superb.
MRuby-Profiler
Back to mruby, the initial timer based profiling doesn’t have many of the advantages of callgrind, or other profiler tools. Adding timers was a manual process, it provided very coarse information, and it provided only a limited slice of the whole program. As such, a full mruby VM profiler should be preferred. The mruby-profiler gem is one such tool.
MRuby-profiler works by observing the progression of the mruby virtual machine via a callback which is invoked every time the VM executes a new ruby instruction. This allows for the profiler to have an exact count for how many times an instruction is invoked, a reasonably accurate time for the execution of each instruction, information about the call graph, and often detailed line-by-line source level times.
Example
Now that the mruby-profiler gem is introduced, let’s take a look at a simple example run. In the top-left you’ll see a simple ruby program, on the right mruby-profiler’s source annotated output, and in the bottom-left mruby-profiler’s no-source VM instruction only output.
def test_code
var1 = []
100.times do |x|
var1 << x
end
var1.each_with_index do |x, ind|
var1[ind] = 0.1 + 2.0*x + 4.3*x**2.0
end
var1
end
test_code
0000 0.00011 def test_code
1 0.00002 OP_TCLASS R1
1 0.00003 OP_LAMBDA R2 I(+1) 1
1 0.00003 OP_METHOD R1 :test_code
1 0.00003 OP_ENTER 0:0:0:0:0:0:0
0001 0.00004 var1 = []
1 0.00004 OP_ARRAY R2 R3 0
0002 0.00186 100.times do |x|
1 0.00002 OP_LOADI R3 100
1 0.00005 OP_LAMBDA R4 I(+1) 2
1 0.00007 OP_SENDB R3 :times 0
100 0.00172 OP_ENTER 1:0:0:0:0:0:0
0003 0.00957 var1 << x
100 0.00142 OP_GETUPVAR R3 2 0
100 0.00128 OP_MOVE R4 R1
100 0.00479 OP_SEND R3 :<< 1
100 0.00208 OP_RETURN R3 return
0004 0.00000 end
0005 0.00000
0006 0.00186 var1.each_with_index do |x, ind|
1 0.00002 OP_MOVE R3 R2
1 0.00004 OP_LAMBDA R4 I(+2) 2
1 0.00006 OP_SENDB R3 :each_with_index 0
100 0.00175 OP_ENTER 2:0:0:0:0:0:0
0007 0.03609 var1[ind] = 0.1 + 2.0*x + 4.3*x**2.0
100 0.00142 OP_LOADL R4 L(0) ; 0.1
100 0.00139 OP_LOADL R5 L(1) ; 2
100 0.00142 OP_MOVE R6 R1
100 0.00167 OP_MUL R5 :* 1
100 0.00145 OP_ADD R4 :+ 1
100 0.00148 OP_LOADL R5 L(2) ; 4.3
100 0.00141 OP_MOVE R6 R1
100 0.00143 OP_LOADL R7 L(1) ; 2
100 0.00850 OP_SEND R6 :** 1
100 0.00157 OP_MUL R5 :* 1
100 0.00152 OP_ADD R4 :+ 1
100 0.00158 OP_GETUPVAR R5 2 0
100 0.00128 OP_MOVE R6 R2
100 0.00141 OP_MOVE R7 R4
100 0.00647 OP_SEND R5 :[]= 2
100 0.00209 OP_RETURN R4 return
0008 0.00000 end
0009 0.00000
0010 0.00004 var1
1 0.00004 OP_RETURN R2 return
0011 0.00000 end
0012 0.00000
0013 0.00225 test_code
1 0.00001 OP_LOADSELF R1
1 0.00005 OP_SEND R1 :test_code 0
1 0.00218 OP_STOP
Fixnum#times 0.01822
1 0.00002 OP_ENTER 0:0:0:0:0:0:1
1 0.00001 OP_LOADSELF R3
1 0.00011 OP_SEND R3 :block_given? 0
1 0.00001 OP_JMPNOT R3 002
1 0.00001 OP_JMP 005
0 0.00000 OP_LOADSELF R3
0 0.00000 OP_LOADSYM R4 :times
0 0.00000 OP_SEND R3 :to_enum 1
0 0.00000 OP_RETURN R3 return
1 0.00001 OP_LOADI R2 0
1 0.00001 OP_JMP 007
100 0.00128 OP_MOVE R3 R1
100 0.00142 OP_MOVE R4 R2
100 0.00546 OP_SEND R3 :call 1
100 0.00128 OP_MOVE R3 R2
100 0.00164 OP_ADDI R3 :+ 1
100 0.00128 OP_MOVE R2 R3
101 0.00144 OP_MOVE R3 R2
101 0.00138 OP_LOADSELF R4
101 0.00143 OP_LT R3 :< 1
101 0.00138 OP_JMPIF R3 -09
1 0.00001 OP_LOADSELF R3
1 0.00002 OP_RETURN R3 return
From these outputs, we can see that the most expensive line in the program is line 7 which takes roughly 36 ms to execute over the course of this entire program. The '**' operator takes a large portion of that time (8.5 ms) and is executed 100 times as expected. The first '100.times' loop takes 1.8 ms + 9.5 ms + 18.2 ms (from the overhead of Fixnum#times itself). Within the output, line numbers, source code, VM instruction call counts, and VM instruction self times can be seen fairly clearly.
Complications In Profiling
The mruby-profiler gem makes profiling much easier, however there are a few limits with it’s current implementation.
Entering and Exiting The MRuby VM
The timers within mruby-profiler are relatively simple and in most code they tend to work very well. They do fail when interacting with any program which ends up doing a fair amount of work within C which ends up calling ruby methods from C. To elaborate on that, let’s first look at how mruby-profiler calculates how much time is spent at a given mruby opcode.
The mruby-profiler gem uses the code fetch hook within the mruby VM. Every time an opcode is executed by the mruby VM the code fetch hook is called. MRuby-profiler records the time when an instruction is fetched. When the next fetch occurs the difference in time is assumed to be the amount of time spent in the previous instruction.
Generally this model works well. For normal ruby code it’s completely accurate, for ruby calling simple C routines the appropriate amount of time is given to the OP_SEND instruction which lead to the C call, but it fails with a mixture of C/ruby calls. Consider the below sequence of events:
-
C code calls one ruby method
-
C code works on a long running task
-
C code calls another ruby method
During step 2 no ruby opcodes will be observed by mruby-profiler. Thus, when step 3 occurs and a new VM opcode is fetched, then all the time that was spent in step 2 is attributed to the last instruction in step 1. Typically the last instruction of a mruby method would be OP_RETURN. So, if you spot an OP_RETURN which is taking much more time than it should, be aware that it may be counting time spent in a calling C function.
A lack of cumulative child function time
In general I’d say having method/instruction time presented as 'self-time' is preferable in a reasonably well architected system. Self-time presents how much time could effectively be saved by optimizing the method by itself without considering the rest of the code it ends up calling.
Self-time, however, can create a few problems with interpreting the results of mruby-profiler. If a hotspot function is called a large number of times, it can be tricky to backtrack which functions called it a significant number of times or with a particular type of data which took the hotspot function longer to evaluate. The lack of cumulative times also make it hard to evaluate if a particular function is 'expensive to call'. It is possible to have an 'expensive to call' function which does not use a significant amount of time with called functions which also do not use much time. If, however, there is a sufficiently deep call stack, then an innocuous function can still become very expensive in the cumulative sense (i.e. "death by a million cuts")
The last issue is one that’s more ruby specific. I’d say it’s fair to say that if you enjoy ruby you use blocks… You use them a lot. For those of you unfamiliar with what a block is here’s a simple example:
object.method do |optional_args|
work_on_block
end
irep 0x9b433e0 nregs=3 nlocals=1 pools=0 syms=2 reps=1
file: tmp.rb
1 000 OP_LOADSELF R1
1 001 OP_SEND R1 :object 0
1 002 OP_LAMBDA R2 I(+1) 2
1 003 OP_SENDB R1 :method 0
1 004 OP_STOP
irep 0x9b494a8 nregs=5 nlocals=3 pools=0 syms=1 reps=0
file: tmp.rb
1 000 OP_ENTER 1:0:0:0:0:0:0
2 001 OP_LOADSELF R3
2 002 OP_SEND R3 :work_on_block 0
2 003 OP_RETURN R3 return
In the ruby code :method is called with the block (do..end). The block is encoded as a lambda function (the second irep) and method ends up calling the lambda. Just like in the very first example, with Fixnum#times, the cost involved with :method is associated with :method’s implementation and not the object.method call. When the block accepting method adds a significant amount of overhead it’s very easy to overlook using self-time and you should be aware of it when profiling. In the case of iteration, we’ll revisit block-method overhead later in this article.
While at the moment mruby-profiler doesn’t present cumulative time information it does capture it from the MRuby VM (or at least it appeared to when I looked at the source). It’s not in the most convenient format, but at least the data is there, it’s just the analysis which needs to be updated (and that’s all ruby). The mruby-profiler gem is still missing the functionality, but I imagine it could be added reasonably easily in the future.
Bugs
Of course another limitation of using both mruby and mruby-profiler is that both of them are going to have more bugs than other tools with more widespread use. When I initially found mruby-profiler it tended to crash in a variety of ways for the codebase I was profiling (I’d recommend using the fixes I’ve proposed to mruby-profiler via a pull request if it hasn’t yet been merged). When I initially began using MRuby I encountered a few bugs, though MRuby has become a fair bit more stable over the past year. Lastly, while mruby-profiler does provide an output for kcachgrind, be aware that it is incomplete and there are some bugs in the output (though it is entirely unclear if they are from mruby’s debug info or a bug within mruby-profiler).
Deciding when to move from ruby to C
One of the great things about MRuby (and one of the major reasons why I’ve used it for a sizable user interface project) is that it’s extremely easy to move a routine from ruby to C. Of course it’s still easier to leave code as ruby (otherwise I would have just written everything in C), so what tasks does ruby struggle with?
Heavy numerical tasks
Like most interpreted languages, heavy mathematical operations aren’t 'fast'. Of course, they may be fast enough, but major gains can be made by using a compiled and heavily optimized language like C.
Consider the below ruby code:
def func(array)
array.map do |x|
Math.abs(Math.sin(x + 2 + 7 * 8 + 3))
end
end
void func(float *f, int len)
{
for(int i=0; i<len; ++i)
f[i] = fabsf(sinf(f[i] + 2 + 7 * 8 + 3));
}
irep 0x8d24b40 nregs=5 nlocals=3 pools=0 syms=1 reps=1
file: tmp.rb
1 000 OP_ENTER 1:0:0:0:0:0:0
2 001 OP_MOVE R3 R1 ; R1:array
2 002 OP_LAMBDA R4 I(+1) 2
2 003 OP_SENDB R3 :map 0
2 004 OP_RETURN R3 return
irep 0x8d24b90 nregs=9 nlocals=3 pools=0 syms=5 reps=0
file: tmp.rb
2 000 OP_ENTER 1:0:0:0:0:0:0
3 001 OP_GETCONST R3 :Math
3 002 OP_GETCONST R4 :Math
3 003 OP_MOVE R5 R1 ; R1:x
3 004 OP_ADDI R5 :+ 2
3 005 OP_LOADI R6 7
3 006 OP_LOADI R7 8
3 007 OP_MUL R6 :* 1
3 008 OP_ADD R5 :+ 1
3 009 OP_ADDI R5 :+ 3
3 010 OP_SEND R4 :sin 1
3 011 OP_SEND R3 :abs 1
3 012 OP_RETURN R3 return
func:
.LFB9:
.cfi_startproc
movl 4(%esp), %ecx
movl 8(%esp), %edx
testl %edx, %edx
jle .L1
movl %ecx, %eax
leal (%ecx,%edx,4), %edx
flds .LC0
.L3:
fld %st(0)
fadds (%eax)
fsin
fabs
fstps (%eax)
addl $4, %eax
cmpl %edx, %eax
jne .L3
fstp %st(0)
.L1:
ret
The ruby VM instructions are reasonably quick, but when comparing 12 ruby opcodes to 4 assembly instructions (fld..fstps), it’s rather obvious that there’s going to be a pretty reasonable difference in speed. MRuby isn’t going to simplify any of the math that you supply it with and each opcode (simple or not) is going to take a fair bit longer than a single assembly instruction.
Heavy Member Access
Accessing data stored in mruby classes isn’t all that cheap even for simple attributes. Each member access in idiomatic ruby results in a method call via OP_SEND. Evaluating each method call is relatively expensive compared to other opcodes and each call tends to involve a setup phase for the arguments of each method. In comparison for C to access member variables it’s as simple as fetching memory at an offset to the base of the structure.
class Y
attr_accessor :a, :b, :c, :d
end
def func(array, y)
array.map do |x|
x + y.a + y.b * y.c + y.d
end
end
struct Y {
float a, b, c, d;
};
void func(float *f, int len, struct Y y)
{
for(int i=0; i<len; ++i)
f[i] = f[i] + y.a + y.b * y.c + y.d;
}
irep 0x94feb90 nregs=6 nlocals=4 pools=0 syms=1 reps=1
file: tmp.rb
5 000 OP_ENTER 2:0:0:0:0:0:0
6 001 OP_MOVE R4 R1 ; R1:array
6 002 OP_LAMBDA R5 I(+1) 2
6 003 OP_SENDB R4 :map 0
6 004 OP_RETURN R4 return
irep 0x9510068 nregs=7 nlocals=3 pools=0 syms=6 reps=0
file: tmp.rb
6 000 OP_ENTER 1:0:0:0:0:0:0
7 001 OP_MOVE R3 R1 ; R1:x
7 002 OP_GETUPVAR R4 2 0
7 003 OP_SEND R4 :a 0
7 004 OP_ADD R3 :+ 1
7 005 OP_GETUPVAR R4 2 0
7 006 OP_SEND R4 :b 0
7 007 OP_GETUPVAR R5 2 0
7 008 OP_SEND R5 :c 0
7 009 OP_MUL R4 :* 1
7 010 OP_ADD R3 :+ 1
7 011 OP_GETUPVAR R4 2 0
7 012 OP_SEND R4 :d 0
7 013 OP_ADD R3 :+ 1
7 014 OP_RETURN R3 return
func:
.LFB0:
.cfi_startproc
movl 4(%esp), %ecx
movl 8(%esp), %edx
testl %edx, %edx
jle .L1
flds 24(%esp)
fadds 12(%esp)
flds 20(%esp)
fmuls 16(%esp)
faddp %st, %st(1)
movl %ecx, %eax
leal (%ecx,%edx,4), %edx
.L3:
fld %st(0)
fadds (%eax)
fstps (%eax)
addl $4, %eax
cmpl %edx, %eax
jne .L3
fstp %st(0)
.L1:
ret
The mruby VM is fast, but when dealing with these member variable references the overhead adds up. The relatively basic loop results in a 15 opcode body. Of those opcodes, 4 are method calls, and 4 involve setting up the method calls. C doesn’t even need a separate instruction to fetch the values due to the addressing modes that x86 provides. Additionally, C can recognize that the member variables are constant and calculate their affect once outside the loop body. That leaves 3 (fld..fstps) instructions in the loop body for a tight C loop.
Loops of any sort
Actually, going one step beyond, loops over large amounts of data are just bad for performance in MRuby due to the overhead introduced by Array#each/Array#map/etc.
$dummy = 0
def func(array)
array.each do |x|
$dummy = x
end
end
volatile int dummy;
void func(int *x, int len)
{
for(int i=0; i<len; ++i)
dummy = x[i];
}
irep 0x8fa1b40 nregs=5 nlocals=3 pools=0 syms=1 reps=1
file: tmp.rb
2 000 OP_ENTER 1:0:0:0:0:0:0
3 001 OP_MOVE R3 R1 ; R1:array
3 002 OP_LAMBDA R4 I(+1) 2
3 003 OP_SENDB R3 :each 0
3 004 OP_RETURN R3 return
irep 0x8fa1b90 nregs=4 nlocals=3 pools=0 syms=1 reps=0
file: tmp.rb
3 000 OP_ENTER 1:0:0:0:0:0:0
4 001 OP_SETGLOBAL :$dummy R1 ; R1:x
4 002 OP_RETURN R1 return ; R1:x
func:
.cfi_startproc
testl %esi, %esi
jle .L1
movl $0, %eax
.L3:
movl (%rdi,%rax,4), %edx
movl %edx, dummy(%rip)
addq $1, %rax
cmpl %eax, %esi
jg .L3
.L1:
rep ret
In previous examples the loop overhead for C was neglected. For this example, 3 instructions: addq, cmpl, and jg are the overhead per loop iteration. Ruby’s is hidden in the :each method of the container. If you’re in an optimizing mindset, you might think that since :each is the idiomatic way to build loops in ruby it would be built to limit overhead.
Nope:
class Array
def each(&block)
return to_enum :each unless block_given?
idx, length = -1, self.length-1
while idx < length and length <= self.length and length = self.length-1
elm = self[idx += 1]
unless elm
if elm.nil? and length >= self.length
break
end
end
block.call(elm)
end
self
end
end
Translating that to VM instructions results in 25 opcodes overhead per loop with 4 method calls (:[], :call, :length, :length). Ouch…
So, Array#each/Array#map/etc have a lot of overhead when you get to optimizing. What about other types of loops? A standard for loop is just an alias to :each. A while loop however avoids much of the setup and per iteration cost.
$dummy = 0
def func(array)
for x in 0...array.length do
$dummy = array[x]
end
end
irep 0x904db40 nregs=7 nlocals=4 pools=0 syms=2 reps=1
file: tmp.rb
2 000 OP_ENTER 1:0:0:0:0:0:0
3 001 OP_LOADI R4 0
3 002 OP_MOVE R5 R1 ; R1:array
3 003 OP_SEND R5 :length 0
3 004 OP_RANGE R4 R4 1
3 005 OP_LAMBDA R5 I(+1) 2
3 006 OP_SENDB R4 :each 0
3 007 OP_RETURN R4 return
irep 0x904db90 nregs=5 nlocals=1 pools=0 syms=2 reps=0
file: tmp.rb
3 000 OP_ENTER 1:0:0:0:0:0:0
3 001 OP_SETUPVAR R1 3 0
4 002 OP_GETUPVAR R2 1 0
4 003 OP_GETUPVAR R3 3 0
4 004 OP_SEND R2 :[] 1
4 005 OP_SETGLOBAL :$dummy R2
4 006 OP_RETURN R2 return
$dummy = 0
def func2(array)
itr = 0
n = array.length
while itr < n
$dummy = array[itr]
itr += 1
end
end
irep 0x90fbb40 nregs=8 nlocals=5 pools=0 syms=5 reps=0
file: tmp.rb
2 000 OP_ENTER 1:0:0:0:0:0:0
3 001 OP_LOADI R3 0 ; R3:itr
4 002 OP_MOVE R5 R1 ; R1:array
4 003 OP_SEND R5 :length 0
4 004 OP_MOVE R4 R5 ; R4:n
5 005 OP_JMP 013
6 006 OP_MOVE R5 R1 ; R1:array
6 007 OP_MOVE R6 R3 ; R3:itr
6 008 OP_SEND R5 :[] 1
6 009 OP_SETGLOBAL :$dummy R5
7 010 OP_MOVE R5 R3 ; R3:itr
7 011 OP_ADDI R5 :+ 1
7 012 OP_MOVE R3 R5 ; R3:itr
5 013 OP_MOVE R5 R3 ; R3:itr
5 014 OP_MOVE R6 R4 ; R4:n
5 015 OP_LT R5 :< 1
5 016 OP_JMPIF R5 006
5 017 OP_LOADNIL R5
5 018 OP_RETURN R5 return
There’s still plenty of register shuffling with the while loop, but 6-7 opcodes of overhead per iteration and only one method call (:[]) sure beats 25 extra opcodes per loop. So, if you want to keep a hotspot in pure ruby, you might have more luck with inlining a while loop or using :each_index which is a middle ground in terms of cost.
Conclusions
MRuby is reasonably fast, though once you’re ploughing through enough data and code, then hotspots slowing things down are basically inevitable. MRuby makes it pretty easy to patch up these hotspots by seamlessly implementing the methods in C. To keep a codebase quick you still need to find hotspots in code and tools like mruby-prof make this task a lot easier. I hope this ramble about using mruby-profiler is helpful.