It is often said that numerical computing depends heavily on the compiler: most algorithms are already well established and implementations look similar from language to language, so choosing a low-level language like C/C++ and a fast compiler like icc/icpc is the most efficient route for most people, especially with the -O3 flag turned on so that loops are vectorized automatically.

ispc is another C-like compiler, one that supports SPMD (single program, multiple data) and speeds things up by using SIMD instructions whenever possible. It is better than hand-optimizing with unrolling or other hardware-specific tricks, and in certain cases it even beats mkl.

Numerical computing functions usually rely heavily on vectorization, as in MATLAB. Unfortunately, MATLAB does a lot of single-core computing; only a few common routines take advantage of multiple cores. Combining ispc tasks with OpenMP or tbb might provide another boost from multi-core.

Note: the following code is compiled with g++/gcc. clang++ or icpc also work, but may produce very different results.

  • vectorization from ispc

    When loop iterations do not communicate with each other, as in bsxfun-style element-wise operations, we can observe a 5x-6x speedup if AVX/AVX2 is available.

    // naive C code (needs <math.h>)
    #include <math.h>

    inline void vec_sqrt(int n, float * a, float * b) {
        for (int i = 0; i < n; i++) {
          // sqrtf keeps the loop in single precision, matching the ispc and mkl versions
          b[i] = sqrtf(a[i]);
        }
    }
    // ispc code: inside foreach, i is varying, so each program instance
    // handles one element and the loads/stores are vectorized
    inline void _ssqrt(uniform float * uniform A, uniform float * uniform B, int idx) {
        B[idx] = sqrt(A[idx]);
    }

    export
    void ispc_ssqrt(uniform float * uniform A, uniform float * uniform B,
      uniform int num) {
        foreach (i = 0 ... num) {
          _ssqrt(A, B, i);
        }
    }
    // mkl equivalent, using the single-precision vml routine
    vsSqrt(n, a, b);

    Compiled with g++, the ispc version is about 5% slower than mkl when n is large; for small n (less than 256), ispc is faster. It is also worth noting that simply compiling the naive code with icpc, without any further hand optimization, is 20% faster than ispc. A host-side driver for running the comparison is sketched below.
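
    To call the exported ispc function from C++, the kernel above first has to be compiled by ispc; the generated header then exposes it like a plain C function. The following is a minimal, hypothetical host-side sketch: the file names sqrt.ispc / sqrt_ispc.h and the exact compile flags are my assumptions, not part of the original benchmark.

    // host.cpp -- hypothetical driver; assumes the ispc kernel above lives in
    // sqrt.ispc and was compiled with something like
    //   ispc sqrt.ispc -O2 --target=avx2-i32x8 -h sqrt_ispc.h -o sqrt_ispc.o
    // then linked:  g++ host.cpp sqrt_ispc.o
    #include <cstdio>
    #include <vector>
    #include "sqrt_ispc.h"   // header generated by ispc -h; exports live in namespace ispc

    int main() {
        const int n = 1 << 20;
        std::vector<float> a(n, 2.0f), b(n);

        // the exported kernel is called exactly like a C function
        ispc::ispc_ssqrt(a.data(), b.data(), n);

        std::printf("b[0] = %f\n", b[0]);
        return 0;
    }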

  • parallel tasks inside ispc

    OpenMP allows us to parallelize the code with minimal changes. Typically we can see a 4x speedup on a quad-core machine; a sketch of combining it with the ispc kernel follows.
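
    As a rough sketch (not from the original benchmark), the host can split the array into chunks, let OpenMP distribute the chunks across cores, and call the ispc kernel on each chunk for the SIMD work; the chunk size and the function name par_sqrt below are illustrative assumptions.

    // hypothetical ispc + OpenMP combination: threads over chunks,
    // SIMD inside each chunk via the ispc_ssqrt kernel defined above.
    // build:  g++ -fopenmp par_sqrt.cpp sqrt_ispc.o
    #include <algorithm>
    #include "sqrt_ispc.h"

    void par_sqrt(int n, float * a, float * b) {
        const int chunk = 1 << 16;                // elements per chunk (tuning knob)
        #pragma omp parallel for schedule(static)
        for (int start = 0; start < n; start += chunk) {
            int len = std::min(chunk, n - start);
            // each thread runs the vectorized ispc kernel on its own slice
            ispc::ispc_ssqrt(a + start, b + start, len);
        }
    }

    The same outer structure should also work with tbb or with ispc's own launch/sync tasks; only the chunk loop changes.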