Expression Templates and Vector Math In C++

If you have done a lot of vector math in C++ then you’re likely used to a lot of for loops for each operation that you would want to code up. With all of the operator overloading in C++ it would seem natural to be able to translate something like

for(int i=0; i<N; ++i)
    out[i] = 2*in1[i] + 4*in2[i]

into

out = 2*in1 + 4*in2

However the most direct approach would lead to each subexpression (2*in1), (4*in2), and (2*in1 + 4*in2) returning some sort of copy, however this does not need to be the case. If the AST can be captured, then you can effectively transform the above code into

out = generator<+<*<2,in1>, *<4,in2>>>

From here, with an overloaded operator=, the code can be transformed into the original code. So, how exactly does one capture the AST of some vector math in order to acomplish this transformation? Expression Templates are the solution to this task. So, without further adu, let’s take a look at some of the code needed to do based off of the wiki page.

First, a base class for a vector expression is defined. Even though this is a base class, no virtual methods will be seen here and instead any call to cast the object, find its size, or get an element result in calling a static method of the subclass.

template <typename E>
class VecExpr {
    public:
    size_t  size()             const { return static_cast<const E&>(*this).size(); }
    float operator[](size_t i) const { return static_cast<const E&>(*this)[i];     }

    void dump(void) const
    {
        const size_t s = size();
        printf("[");
        for(size_t i=0; i<s-1; ++i) printf("%f, ", (*this)[i]);
        if(s) printf("%f]\n", (*this)[s-1]);
        else printf("]\n");
    }

    operator E&()             { return static_cast<      E&>(*this); }
    operator E const&() const { return static_cast<const E&>(*this); }
};

To define a normal vector, the operators of the superclass are redefined and additionally an operator= is defined which extracts the result of a right hand side vector expression element by element.

class Vec : public VecExpr<Vec> {
    float *data_;
    size_t size_;
    public:
    float operator[](size_t i)       { return data_[i]; }
    float operator[](size_t i) const { return data_[i]; }
    size_t  size()             const { return size_; }

    Vec(float *f, size_t n) :data_(f), size_(n)  {}

    void clear(void) {memset(data_, 0, sizeof(float)*size_);}

    template<typename E>
    Vec &operator=(const VecExpr<E> &vec)
    {
        assert(size_ == vec.size());
        for(size_t i=0; i!=size_; ++i)
            data_[i] = vec[i];
        return *this;
    }

};

Now for an example constant vector expression, let’s consider a range like object. If the range varies over a to b given the domain 0 to N, then the output of each element is simply a + i*(b-a)/(N-1) where i is th index.

class LinearVec:public VecExpr<LinearVec>
{
    float low, high;
    size_t  size_;

    public:
    LinearVec(float low_, float high_, size_t size__)
        :low(low_), high(high_), size_(size__) {}
    float operator[](size_t i) const { return low + (high-low)*(i*1.0f/(size_-1)); }
    size_t size() const {return size_;}
};

Codifying a simple range class is simple enough, now for the somewhat messier system, all combinations of <Scalar OP Vector>, <Vector OP Vector>, and <Vector OP Scalar>. Since it is simple enough to take this repeating pattern and just generate out a bunch of similar classes that’s what we’ll do.

#define BinSV(OP,y) template<typename E> class Vec##y##SV:public VecExpr<Vec##y##SV<E>> {\
    float scalar;\
    const E &vec; \
    public:\
           Vec##y##SV(float s, const VecExpr<E> &v) : scalar(s), vec(v) {} \
    inline size_t size() const { return vec.size(); }\
    inline float operator[](size_t i) const { return OP(scalar, vec[i]); }\
};

#define BinVV(OP,y) template<typename E1, typename E2> \
    class Vec##y##VV:public VecExpr<Vec##y##VV<E1,E2>> {\
        const E1 &vec1; \
        const E2 &vec2; \
        public:\
               Vec##y##VV(const VecExpr<E1> &v1, const VecExpr<E1> &v2)\
               : vec1(v1), vec2(v2) {} \
        inline size_t size() const { return vec1.size(); }\
        inline float operator[](size_t i) const { return OP(vec1[i], vec2[i]); }\
    };

#define BinVS(OP,y) template<typename E> class Vec##y##VS:public VecExpr<Vec##y##VS<E>> {\
    float scalar;\
    const E &vec; \
    public:\
           Vec##y##VS(const VecExpr<E> &v, float s) : scalar(s), vec(v) {} \
    inline size_t size() const { return vec.size(); }\
    inline float operator[](size_t i) const { return OP(vec[i],scalar); }\
};

#define OpSV(OP,y) template <typename E> inline const Vec##y##SV<E> \
    operator OP(float s, VecExpr<E> const& v) { return Vec##y##SV<E>(s,v); }
#define OpVV(OP,y) template <typename E1, typename E2> inline const Vec##y##VV<E1,E2> \
    operator OP(const VecExpr<E1> &v1, const VecExpr<E2> &v2) \
 { return Vec##y##VV<E1,E2>(v1,v2); }
#define OpVS(OP,y) template <typename E> inline const Vec##y##VS<E> \
    operator OP(const VecExpr<E> & v, float s) { return Vec##y##VS<E>(v, s); }
#define BinOp(x,y,z) BinSV(z,y) BinVV(z,y) BinVS(z,y) OpSV(x,y) OpVV(x,y) OpVS(x,y)
static inline float s_add(float x, float y) {return x+y;}
static inline float s_sub(float x, float y) {return x-y;}
static inline float s_div(float x, float y) {return x/y;}
static inline float s_mul(float x, float y) {return x*y;}
static inline float s_pow(float x, float y) {return powf(x,y);}
BinOp(+,Add, s_add);
BinOp(-,Sub, s_sub);
BinOp(/,Div, s_div);
BinOp(*,Mul, s_mul);
BinOp(^,Pow, s_pow);

Now, all of the basic vector math should work just fine. An example of using this new set of classes can be seen below.

unsigned data_size = 16;
float *input_  = new float[data_size];
float *mix_    = new float[data_size];
float *output_ = new float[data_size];
Vec input{input_,   data_size};
Vec mix{mix_,       data_size};
Vec output{output_, data_size};
LinearVec lerp{0,1, data_size};

input.clear();
input = lerp;
mix   = 4*lerp;

output = (input+mix)*(input+mix);
output.dump();

The output is

[0.000000, 0.111111, 0.444444, 1.000000, 1.777778, 2.777778, 4.000000, 5.444444,
7.111112, 9.000000, 11.111113, 13.444445, 16.000000, 18.777779, 21.777777,
25.000000]

Now you might wonder how much this syntactic sugar costs. For lower levels of optimization it is certainly expensive, but high levels of optimization seem to be able to handle this template based hierarchy better than some of the direct forms of code. Looking at the output assembly, it looks like the compiler’s vectorizer had more success on the templated code perhaps due to some thresholds on code size, though I’m not too sure what triggered the disparity.

All below benchmarks are run on a 1,000,000 element rendition of the above code executed 100 times.

Table 1. gcc 4.8.2/clang 3.3
Compiler	Fused For Loop (s)	Split For Loops (s)	Template Code (s)
g++ -std=c++11 -O0	4.752	4.763	11.298
clang++ -std=c++11 -O0	2.702	2.574	10.585
g++ -std=c++11 -O1	1.344	1.484	4.102
clang++ -std=c++11 -O1	1.033	1.135	4.368
g++ -std=c++11 -O2	1.369	1.482	1.433
clang++ -std=c++11 -O2	1.034	1.149	0.630
g++ -std=c++11 -O3	1.362	1.477	1.427
clang++ -std=c++11 -O3	1.034	1.133	0.243
g++ -std=c++11 -O3 -march=native	1.338	1.451	0.209
clang++ -std=c++11 -O3 -march=native	1.033	1.125	0.241

Table 2. gcc 4.8.2/clang 3.6
Compiler	Fused For Loop (s)	Split For Loops (s)	Template Code (s)
g++ -std=c++11 -O0	5.365	5.420	13.409
clang++ -std=c++11 -O0	3.103	3.094	17.279
g++ -std=c++11 -O1	1.584	1.875	4.319
clang++ -std=c++11 -O1	1.313	1.532	9.013
g++ -std=c++11 -O2	1.578	1.872	1.864
clang++ -std=c++11 -O2	1.301	1.610	0.709
g++ -std=c++11 -O3	1.576	1.864	1.823
clang++ -std=c++11 -O3	1.292	1.590	0.704
g++ -std=c++11 -O3 -march=native	1.572	1.787	0.583
clang++ -std=c++11 -O3 -march=native	1.296	1.583	0.699

So, while this might be taking some syntax sugar a bit far within C++, it shows that there’s either virtually no runtime overhead in doing so or a distinct advantage in doing so.