It is often said that numerical computing depends heavily on the capability of the compiler. Most algorithms are already well established and their implementations look similar from language to language, so for most people the most efficient choice is a low-level language such as C/C++ with an optimizing compiler such as icc/icpc, especially with the -O3 flag turned on, which enables automatic vectorization.
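As a rough illustration of what automatic vectorization works well on, here is a minimal sketch of a loop whose iterations are independent. The function name saxpy and its signature are hypothetical and not taken from the original post; __restrict is a compiler extension (accepted by g++, clang++ and icpc), not standard C++.

```cpp
#include <cstddef>

// Hypothetical example: y = alpha * x + y, element by element.
// Every iteration is independent of the others, and __restrict promises
// the compiler that x and y do not overlap, so the loop is an easy
// candidate for the auto-vectorizer at -O3.
void saxpy(std::size_t n, float alpha,
           const float* __restrict x, float* __restrict y) {
    for (std::size_t i = 0; i < n; i++) {
        y[i] = alpha * x[i] + y[i];
    }
}
```

With GCC, building with something like g++ -O3 -march=native -fopt-info-vec should report which loops were vectorized; other compilers have their own reporting options.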
ispc is a compiler for a C-like language that supports SPMD (single program, multiple data) programming, making things faster by using SIMD whenever possible. It is a better option than optimizing by hand with loop unrolling or other hardware-specific tricks, and in certain cases it is even faster than mkl.
Numerical computing routines usually depend heavily on vectorization, as in MATLAB. Unfortunately, MATLAB does much of its work on a single core; only a few common routines take advantage of multiple cores. ispc tasks, together with OpenMP or tbb, might provide another boost from multi-core.
Note: the following code is compiled with g++/gcc. clang++ or icpc also work, but may give very different results.
- vectorization from ispc
When loop iterations do not depend on each other, as in bsxfun, we can observe a 5x-6x speedup if AVX/AVX2 is available.

    // naive c code
    #include <math.h>

    inline void vec_sqrt(int n, float * a, float * b) {
        for (int i = 0; i < n; i++) {
            b[i] = sqrt(a[i]);  // in plain C, use sqrtf to stay in single precision
        }
    }

    // ispc code
    inline void _ssqrt(uniform float * uniform A, uniform float * uniform B, int idx) {
        B[idx] = sqrt(A[idx]);
    }

    export void ispc_ssqrt(uniform float * uniform A, uniform float * uniform B, uniform int num) {
        // foreach spreads the iterations across the SIMD lanes
        foreach (i = 0 ... num) {
            _ssqrt(A, B, i);
        }
    }

    // mkl using vml
    vsSqrt(n, a, b);
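The comparisons in this section depend on how each kernel is timed. The original post does not show its measurement code, so the following is only a sketch of one possible harness. The helper time_kernel and its parameters are hypothetical, and the naive kernel is repeated so that the block is self-contained.

```cpp
#include <chrono>
#include <cmath>
#include <vector>

// Naive kernel, repeated here so the sketch compiles on its own.
void vec_sqrt(int n, const float* a, float* b) {
    for (int i = 0; i < n; i++) {
        b[i] = std::sqrt(a[i]);
    }
}

// Hypothetical helper: average wall-clock seconds per call of `kernel`
// on an n-element input, measured over `reps` repetitions.
template <typename Kernel>
double time_kernel(Kernel kernel, int n, int reps) {
    std::vector<float> a(n, 2.0f), b(n, 0.0f);
    kernel(n, a.data(), b.data());  // warm-up call, not timed
    auto start = std::chrono::steady_clock::now();
    for (int r = 0; r < reps; r++) {
        kernel(n, a.data(), b.data());
    }
    std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    // Read one result through a volatile so the output is observably used.
    volatile float sink = b[0];
    (void)sink;
    return elapsed.count() / reps;
}
```

A call such as time_kernel(vec_sqrt, 1 << 20, 100) gives one number per kernel and problem size; any other kernel with the same argument list can be passed in the same way. An aggressive optimizer may still simplify repeated identical calls, so the numbers are best treated as indicative.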
Compiled with g++, ispc is about 5% slower than mkl when n is large; for small n (less than 256), ispc is faster. It is also worth noting that the naive code compiled with icpc, without further optimization, is 20% faster than ispc.
- parallel tasks inside ispc
OpenMP allows us to parallelize our code with minimal changes. Typically we can see a speedup of about 4x on a quad-core machine.
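To make the "minimal changes" point concrete, here is a sketch of the element-wise square root loop parallelized with a single OpenMP pragma. The function name omp_vec_sqrt is hypothetical and not from the original post, and the speedup actually obtained will depend on the problem size and the machine.

```cpp
#include <cmath>

// Hypothetical helper: element-wise square root with the loop iterations
// divided among threads by OpenMP. Build with -fopenmp (g++); without
// that flag the pragma is ignored and the loop simply runs serially.
void omp_vec_sqrt(int n, const float* a, float* b) {
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        b[i] = std::sqrt(a[i]);
    }
}
```

The only difference from a serial loop is the pragma line, which is why this approach needs so few changes. For small n the cost of starting the threads can outweigh the gain.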