I loathe inefficient, slow, or bloated software. Unfortunately, I’ve written plenty of code that’s terrible in this sense and if you program frequently, then I’d wager you have written quite a few inefficient lines of code as well. Fortunately code isn’t written in stone after the first pass and we as programmers can optimize it. Most of the time the speed of a single line of code doesn’t matter or contribute to the overall inefficiency in an application. Sometimes, though, a change to a few lines can make all the difference.

Profilers

Finding those lines of code however is a challenging task unless you have access to some means of profiling the program and finding out which sections of code are run the most frequently and the lines which use up the most time. Each programming language tends to have it’s own tools, though if you’re tackling a mixed ruby/C codebase then you might be familiar with: RubyProf, callgrind, gprof, or the poor man’s profiler.

Based upon the title of this article I’m guessing that you may be interested in profiling the embeddable ruby implementation 'mruby'. While I was developing my own application using mruby, I tended to profile ruby code with timers:

before = Time.new()
run_routine()
after  = Time.new()
puts("run_routine took #{1000*(after-before)} ms")

This provided limited information, but it helped direct the development to a few hotspots and then additional profiling of the C portion of the codebase was done with callgrind. Callgrind for me is a holy grail of profiling CPU bound programs. Through kcachegrind it provides an easy to explore callgraph, relative and absolute timing information, source level information, and information at the assembly level. Callgrind combined with kcachgrind make it easy to understand where the hotspots for a program are, what is the callgraph that generates these issues, and what compiled assembly creates the issues. If you’re trying to optimize a C/C++ codebase just use callgrind. It slows down the overall execution by at least an order of magnitude, but the information provided is superb.

MRuby-Profiler

Back to mruby, the initial timer based profiling doesn’t have many of the advantages of callgrind, or other profiler tools. Adding timers was a manual process, it provided very coarse information, and it provided only a limited slice of the whole program. As such, a full mruby VM profiler should be preferred. The mruby-profiler gem is one such tool.

MRuby-profiler works by observing the progression of the mruby virtual machine via a callback which is invoked every time the VM executes a new ruby instruction. This allows for the profiler to have an exact count for how many times an instruction is invoked, a reasonably accurate time for the execution of each instruction, information about the call graph, and often detailed line-by-line source level times.

Example

Now that the mruby-profiler gem is introduced, let’s take a look at a simple example run. In the top-left you’ll see a simple ruby program, on the right mruby-profiler’s source annotated output, and in the bottom-left mruby-profiler’s no-source VM instruction only output.

def test_code
var1 = []
100.times do |x|
var1 << x
end

var1.each_with_index do |x, ind|
var1[ind] = 0.1 + 2.0*x + 4.3*x**2.0
end

var1
end

test_code
0000 0.00011 def test_code
1 0.00002    OP_TCLASS     R1
1 0.00003    OP_LAMBDA     R2      I(+1)   1
1 0.00003    OP_METHOD     R1      :test_code
1 0.00003    OP_ENTER      0:0:0:0:0:0:0
0001 0.00004     var1 = []
1 0.00004    OP_ARRAY      R2      R3      0
0002 0.00186     100.times do |x|
1 0.00005    OP_LAMBDA     R4      I(+1)   2
1 0.00007    OP_SENDB      R3      :times  0
100 0.00172    OP_ENTER      1:0:0:0:0:0:0
0003 0.00957         var1 << x
100 0.00142    OP_GETUPVAR   R3      2       0
100 0.00128    OP_MOVE       R4      R1
100 0.00479    OP_SEND       R3      :<<     1
100 0.00208    OP_RETURN     R3      return
0004 0.00000     end
0005 0.00000
0006 0.00186     var1.each_with_index do |x, ind|
1 0.00002    OP_MOVE       R3      R2
1 0.00004    OP_LAMBDA     R4      I(+2)   2
1 0.00006    OP_SENDB      R3      :each_with_index 0
100 0.00175    OP_ENTER      2:0:0:0:0:0:0
0007 0.03609         var1[ind] = 0.1 + 2.0*x + 4.3*x**2.0
100 0.00142    OP_LOADL      R4      L(0)    ; 0.1
100 0.00139    OP_LOADL      R5      L(1)    ; 2
100 0.00142    OP_MOVE       R6      R1
100 0.00167    OP_MUL        R5      :*      1
100 0.00145    OP_ADD        R4      :+      1
100 0.00148    OP_LOADL      R5      L(2)    ; 4.3
100 0.00141    OP_MOVE       R6      R1
100 0.00143    OP_LOADL      R7      L(1)    ; 2
100 0.00850    OP_SEND       R6      :**     1
100 0.00157    OP_MUL        R5      :*      1
100 0.00152    OP_ADD        R4      :+      1
100 0.00158    OP_GETUPVAR   R5      2       0
100 0.00128    OP_MOVE       R6      R2
100 0.00141    OP_MOVE       R7      R4
100 0.00647    OP_SEND       R5      :[]=    2
100 0.00209    OP_RETURN     R4      return
0008 0.00000     end
0009 0.00000
0010 0.00004     var1
1 0.00004    OP_RETURN     R2      return
0011 0.00000 end
0012 0.00000
0013 0.00225 test_code
1 0.00005    OP_SEND       R1      :test_code      0
1 0.00218    OP_STOP
Fixnum#times 0.01822
1 0.00002    OP_ENTER      0:0:0:0:0:0:1
1 0.00011    OP_SEND       R3      :block_given?   0
1 0.00001    OP_JMPNOT     R3      002
1 0.00001    OP_JMP                005
0 0.00000    OP_SEND       R3      :to_enum        1
0 0.00000    OP_RETURN     R3      return
1 0.00001    OP_JMP                007
100 0.00128    OP_MOVE       R3      R1
100 0.00142    OP_MOVE       R4      R2
100 0.00546    OP_SEND       R3      :call   1
100 0.00128    OP_MOVE       R3      R2
100 0.00164    OP_ADDI       R3      :+      1
100 0.00128    OP_MOVE       R2      R3
101 0.00144    OP_MOVE       R3      R2
101 0.00143    OP_LT R3      :<      1
101 0.00138    OP_JMPIF      R3      -09
1 0.00002    OP_RETURN     R3      return

From these outputs, we can see that the most expensive line in the program is line 7 which takes roughly 36 ms to execute over the course of this entire program. The '**' operator takes a large portion of that time (8.5 ms) and is executed 100 times as expected. The first '100.times' loop takes 1.8 ms + 9.5 ms + 18.2 ms (from the overhead of Fixnum#times itself). Within the output, line numbers, source code, VM instruction call counts, and VM instruction self times can be seen fairly clearly.

Complications In Profiling

The mruby-profiler gem makes profiling much easier, however there are a few limits with it’s current implementation.

Entering and Exiting The MRuby VM

The timers within mruby-profiler are relatively simple and in most code they tend to work very well. They do fail when interacting with any program which ends up doing a fair amount of work within C which ends up calling ruby methods from C. To elaborate on that, let’s first look at how mruby-profiler calculates how much time is spent at a given mruby opcode.

The mruby-profiler gem uses the code fetch hook within the mruby VM. Every time an opcode is executed by the mruby VM the code fetch hook is called. MRuby-profiler records the time when an instruction is fetched. When the next fetch occurs the difference in time is assumed to be the amount of time spent in the previous instruction.

Generally this model works well. For normal ruby code it’s completely accurate, for ruby calling simple C routines the appropriate amount of time is given to the OP_SEND instruction which lead to the C call, but it fails with a mixture of C/ruby calls. Consider the below sequence of events:

1. C code calls one ruby method

2. C code works on a long running task

3. C code calls another ruby method

During step 2 no ruby opcodes will be observed by mruby-profiler. Thus, when step 3 occurs and a new VM opcode is fetched, then all the time that was spent in step 2 is attributed to the last instruction in step 1. Typically the last instruction of a mruby method would be OP_RETURN. So, if you spot an OP_RETURN which is taking much more time than it should, be aware that it may be counting time spent in a calling C function.

A lack of cumulative child function time

In general I’d say having method/instruction time presented as 'self-time' is preferable in a reasonably well architected system. Self-time presents how much time could effectively be saved by optimizing the method by itself without considering the rest of the code it ends up calling.

Self-time, however, can create a few problems with interpreting the results of mruby-profiler. If a hotspot function is called a large number of times, it can be tricky to backtrack which functions called it a significant number of times or with a particular type of data which took the hotspot function longer to evaluate. The lack of cumulative times also make it hard to evaluate if a particular function is 'expensive to call'. It is possible to have an 'expensive to call' function which does not use a significant amount of time with called functions which also do not use much time. If, however, there is a sufficiently deep call stack, then an innocuous function can still become very expensive in the cumulative sense (i.e. "death by a million cuts")

The last issue is one that’s more ruby specific. I’d say it’s fair to say that if you enjoy ruby you use blocks…​ You use them a lot. For those of you unfamiliar with what a block is here’s a simple example:

object.method do |optional_args|
work_on_block
end
irep 0x9b433e0 nregs=3 nlocals=1 pools=0 syms=2 reps=1
file: tmp.rb
1 001 OP_SEND       R1      :object 0
1 002 OP_LAMBDA     R2      I(+1)   2
1 003 OP_SENDB      R1      :method 0
1 004 OP_STOP

irep 0x9b494a8 nregs=5 nlocals=3 pools=0 syms=1 reps=0
file: tmp.rb
1 000 OP_ENTER      1:0:0:0:0:0:0
2 002 OP_SEND       R3      :work_on_block  0
2 003 OP_RETURN     R3      return

In the ruby code :method is called with the block (do..end). The block is encoded as a lambda function (the second irep) and method ends up calling the lambda. Just like in the very first example, with Fixnum#times, the cost involved with :method is associated with :method’s implementation and not the object.method call. When the block accepting method adds a significant amount of overhead it’s very easy to overlook using self-time and you should be aware of it when profiling. In the case of iteration, we’ll revisit block-method overhead later in this article.

While at the moment mruby-profiler doesn’t present cumulative time information it does capture it from the MRuby VM (or at least it appeared to when I looked at the source). It’s not in the most convenient format, but at least the data is there, it’s just the analysis which needs to be updated (and that’s all ruby). The mruby-profiler gem is still missing the functionality, but I imagine it could be added reasonably easily in the future.

Bugs

Of course another limitation of using both mruby and mruby-profiler is that both of them are going to have more bugs than other tools with more widespread use. When I initially found mruby-profiler it tended to crash in a variety of ways for the codebase I was profiling (I’d recommend using the fixes I’ve proposed to mruby-profiler via a pull request if it hasn’t yet been merged). When I initially began using MRuby I encountered a few bugs, though MRuby has become a fair bit more stable over the past year. Lastly, while mruby-profiler does provide an output for kcachgrind, be aware that it is incomplete and there are some bugs in the output (though it is entirely unclear if they are from mruby’s debug info or a bug within mruby-profiler).

Deciding when to move from ruby to C

One of the great things about MRuby (and one of the major reasons why I’ve used it for a sizable user interface project) is that it’s extremely easy to move a routine from ruby to C. Of course it’s still easier to leave code as ruby (otherwise I would have just written everything in C), so what tasks does ruby struggle with?

Like most interpreted languages, heavy mathematical operations aren’t 'fast'. Of course, they may be fast enough, but major gains can be made by using a compiled and heavily optimized language like C.

Consider the below ruby code:

def func(array)
array.map do |x|
Math.abs(Math.sin(x + 2 + 7 * 8 + 3))
end
end
void func(float *f, int len)
{
for(int i=0; i<len; ++i)
f[i] = fabsf(sinf(f[i] + 2 + 7 * 8 + 3));
}
irep 0x8d24b40 nregs=5 nlocals=3 pools=0 syms=1 reps=1
file: tmp.rb
1 000 OP_ENTER      1:0:0:0:0:0:0
2 001 OP_MOVE       R3      R1              ; R1:array
2 002 OP_LAMBDA     R4      I(+1)   2
2 003 OP_SENDB      R3      :map    0
2 004 OP_RETURN     R3      return

irep 0x8d24b90 nregs=9 nlocals=3 pools=0 syms=5 reps=0
file: tmp.rb
2 000 OP_ENTER      1:0:0:0:0:0:0
3 001 OP_GETCONST   R3      :Math
3 002 OP_GETCONST   R4      :Math
3 003 OP_MOVE       R5      R1              ; R1:x
3 004 OP_ADDI       R5      :+      2
3 007 OP_MUL        R6      :*      1
3 008 OP_ADD        R5      :+      1
3 009 OP_ADDI       R5      :+      3
3 010 OP_SEND       R4      :sin    1
3 011 OP_SEND       R3      :abs    1
3 012 OP_RETURN     R3      return
func:
.LFB9:
.cfi_startproc
movl    4(%esp), %ecx
movl    8(%esp), %edx
testl   %edx, %edx
jle .L1
movl    %ecx, %eax
leal    (%ecx,%edx,4), %edx
flds    .LC0
.L3:
fld %st(0)
fsin
fabs
fstps   (%eax)
addl    $4, %eax cmpl %edx, %eax jne .L3 fstp %st(0) .L1: ret The ruby VM instructions are reasonably quick, but when comparing 12 ruby opcodes to 4 assembly instructions (fld..fstps), it’s rather obvious that there’s going to be a pretty reasonable difference in speed. MRuby isn’t going to simplify any of the math that you supply it with and each opcode (simple or not) is going to take a fair bit longer than a single assembly instruction. Heavy Member Access Accessing data stored in mruby classes isn’t all that cheap even for simple attributes. Each member access in idiomatic ruby results in a method call via OP_SEND. Evaluating each method call is relatively expensive compared to other opcodes and each call tends to involve a setup phase for the arguments of each method. In comparison for C to access member variables it’s as simple as fetching memory at an offset to the base of the structure. class Y attr_accessor :a, :b, :c, :d end def func(array, y) array.map do |x| x + y.a + y.b * y.c + y.d end end struct Y { float a, b, c, d; }; void func(float *f, int len, struct Y y) { for(int i=0; i<len; ++i) f[i] = f[i] + y.a + y.b * y.c + y.d; } irep 0x94feb90 nregs=6 nlocals=4 pools=0 syms=1 reps=1 file: tmp.rb 5 000 OP_ENTER 2:0:0:0:0:0:0 6 001 OP_MOVE R4 R1 ; R1:array 6 002 OP_LAMBDA R5 I(+1) 2 6 003 OP_SENDB R4 :map 0 6 004 OP_RETURN R4 return irep 0x9510068 nregs=7 nlocals=3 pools=0 syms=6 reps=0 file: tmp.rb 6 000 OP_ENTER 1:0:0:0:0:0:0 7 001 OP_MOVE R3 R1 ; R1:x 7 002 OP_GETUPVAR R4 2 0 7 003 OP_SEND R4 :a 0 7 004 OP_ADD R3 :+ 1 7 005 OP_GETUPVAR R4 2 0 7 006 OP_SEND R4 :b 0 7 007 OP_GETUPVAR R5 2 0 7 008 OP_SEND R5 :c 0 7 009 OP_MUL R4 :* 1 7 010 OP_ADD R3 :+ 1 7 011 OP_GETUPVAR R4 2 0 7 012 OP_SEND R4 :d 0 7 013 OP_ADD R3 :+ 1 7 014 OP_RETURN R3 return func: .LFB0: .cfi_startproc movl 4(%esp), %ecx movl 8(%esp), %edx testl %edx, %edx jle .L1 flds 24(%esp) fadds 12(%esp) flds 20(%esp) fmuls 16(%esp) faddp %st, %st(1) movl %ecx, %eax leal (%ecx,%edx,4), %edx .L3: fld %st(0) fadds (%eax) fstps (%eax) addl$4, %eax
cmpl    %edx, %eax
jne .L3
fstp    %st(0)
.L1:
ret

The mruby VM is fast, but when dealing with these member variable references the overhead adds up. The relatively basic loop results in a 15 opcode body. Of those opcodes, 4 are method calls, and 4 involve setting up the method calls. C doesn’t even need a separate instruction to fetch the values due to the addressing modes that x86 provides. Additionally, C can recognize that the member variables are constant and calculate their affect once outside the loop body. That leaves 3 (fld..fstps) instructions in the loop body for a tight C loop.

Loops of any sort

Actually, going one step beyond, loops over large amounts of data are just bad for performance in MRuby due to the overhead introduced by Array#each/Array#map/etc.

$dummy = 0 def func(array) array.each do |x|$dummy = x
end
end
volatile int dummy;
void func(int *x, int len)
{
for(int i=0; i<len; ++i)
dummy = x[i];
}
irep 0x8fa1b40 nregs=5 nlocals=3 pools=0 syms=1 reps=1
file: tmp.rb
2 000 OP_ENTER      1:0:0:0:0:0:0
3 001 OP_MOVE       R3      R1              ; R1:array
3 002 OP_LAMBDA     R4      I(+1)   2
3 003 OP_SENDB      R3      :each   0
3 004 OP_RETURN     R3      return

irep 0x8fa1b90 nregs=4 nlocals=3 pools=0 syms=1 reps=0
file: tmp.rb
3 000 OP_ENTER      1:0:0:0:0:0:0
4 001 OP_SETGLOBAL  :$dummy R1 ; R1:x 4 002 OP_RETURN R1 return ; R1:x func: .cfi_startproc testl %esi, %esi jle .L1 movl$0, %eax
.L3:
movl    (%rdi,%rax,4), %edx
movl    %edx, dummy(%rip)
addq    $1, %rax cmpl %eax, %esi jg .L3 .L1: rep ret In previous examples the loop overhead for C was neglected. For this example, 3 instructions: addq, cmpl, and jg are the overhead per loop iteration. Ruby’s is hidden in the :each method of the container. If you’re in an optimizing mindset, you might think that since :each is the idiomatic way to build loops in ruby it would be built to limit overhead. Nope: class Array def each(&block) return to_enum :each unless block_given? idx, length = -1, self.length-1 while idx < length and length <= self.length and length = self.length-1 elm = self[idx += 1] unless elm if elm.nil? and length >= self.length break end end block.call(elm) end self end end Translating that to VM instructions results in 25 opcodes overhead per loop with 4 method calls (:[], :call, :length, :length). Ouch…​ So, Array#each/Array#map/etc have a lot of overhead when you get to optimizing. What about other types of loops? A standard for loop is just an alias to :each. A while loop however avoids much of the setup and per iteration cost. $dummy = 0
def func(array)
for x in 0...array.length do
$dummy = array[x] end end irep 0x904db40 nregs=7 nlocals=4 pools=0 syms=2 reps=1 file: tmp.rb 2 000 OP_ENTER 1:0:0:0:0:0:0 3 001 OP_LOADI R4 0 3 002 OP_MOVE R5 R1 ; R1:array 3 003 OP_SEND R5 :length 0 3 004 OP_RANGE R4 R4 1 3 005 OP_LAMBDA R5 I(+1) 2 3 006 OP_SENDB R4 :each 0 3 007 OP_RETURN R4 return irep 0x904db90 nregs=5 nlocals=1 pools=0 syms=2 reps=0 file: tmp.rb 3 000 OP_ENTER 1:0:0:0:0:0:0 3 001 OP_SETUPVAR R1 3 0 4 002 OP_GETUPVAR R2 1 0 4 003 OP_GETUPVAR R3 3 0 4 004 OP_SEND R2 :[] 1 4 005 OP_SETGLOBAL :$dummy R2
4 006 OP_RETURN     R2      return
$dummy = 0 def func2(array) itr = 0 n = array.length while itr < n$dummy = array[itr]
itr += 1
end
end
irep 0x90fbb40 nregs=8 nlocals=5 pools=0 syms=5 reps=0
file: tmp.rb
2 000 OP_ENTER      1:0:0:0:0:0:0
3 001 OP_LOADI      R3      0               ; R3:itr
4 002 OP_MOVE       R5      R1              ; R1:array
4 003 OP_SEND       R5      :length 0
4 004 OP_MOVE       R4      R5              ; R4:n
5 005 OP_JMP                013
6 006 OP_MOVE       R5      R1              ; R1:array
6 007 OP_MOVE       R6      R3              ; R3:itr
6 008 OP_SEND       R5      :[]     1
6 009 OP_SETGLOBAL  :\$dummy R5
7 010 OP_MOVE       R5      R3              ; R3:itr
7 011 OP_ADDI       R5      :+      1
7 012 OP_MOVE       R3      R5              ; R3:itr
5 013 OP_MOVE       R5      R3              ; R3:itr
5 014 OP_MOVE       R6      R4              ; R4:n
5 015 OP_LT R5      :<      1
5 016 OP_JMPIF      R5      006
5 018 OP_RETURN     R5      return