It is often said that numerical computing depends heavily on the capabilities of the compiler: most algorithms are already well established and their implementations are similar from language to language, so choosing a low-level language like C/C++ and a fast compiler like icc/icpc is the most efficient path for most people, especially with the -O3 flag turned on, which enables automatic vectorization.
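For instance, a loop whose iterations are independent is a typical candidate for automatic vectorization at -O3. A minimal sketch (the function and file name are illustrative, not part of the benchmarks below):

```c
/* saxpy.c -- a loop gcc/icc can auto-vectorize at -O3.
 * Compile with e.g. `gcc -O3 -march=native -fopt-info-vec -c saxpy.c`
 * to have gcc report which loops were vectorized. */
void saxpy(int n, float alpha, const float *x, float *y) {
    for (int i = 0; i < n; i++) {
        y[i] = alpha * x[i] + y[i];   /* independent iterations -> SIMD friendly */
    }
}
```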
ispc is another C-like compiler that supports SPMD (single program, multiple data), speeding things up by exploiting SIMD whenever possible. It is better than optimizing by hand with unrolling or other hardware-specific tricks, and in certain cases even better than mkl.
Numerical computing functions, as in MATLAB, usually depend heavily on vectorization. Unfortunately, MATLAB does most of its computing on a single core; only a few common routines take advantage of multiple cores. Combining ispc tasks with OpenMP or tbb might provide another boost from multi-core parallelism.
Note: the following code is compiled with g++/gcc. clang++ or icpc also work, but may produce very different results.
- **vectorization from `ispc`**

  When loop iterations do not communicate with each other, as in `bsxfun`-style element-wise operations, we can observe a `5x`-`6x` speedup if `AVX/AVX2` is available.

  ```c
  // naive c code
  #include <math.h>

  inline void vec_sqrt(int n, float * a, float * b) {
      for (int i = 0; i < n; i++) {
          b[i] = sqrt(a[i]);
      }
  }
  ```

  ```c
  // ispc code
  inline void _ssqrt(uniform float * uniform A, uniform float * uniform B, int idx) {
      B[idx] = sqrt(A[idx]);
  }

  export void ispc_ssqrt(uniform float * uniform A, uniform float * uniform B, uniform int num) {
      foreach (i = 0 ... num) {
          _ssqrt(A, B, i);
      }
  }
  ```

  ```c
  // mkl using vml
  vsSqrt(n, a, b);
  ```

  Compiled with `g++`, `ispc` is about 5% slower than `mkl` when `n` is large; for small `n` (less than 256), `ispc` is faster. It is also worth noting that the naive version compiled with `icpc`, without further optimization, is 20% faster than `ispc`.
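  To reproduce such a comparison, a minimal C++ timing driver might look like the sketch below. It assumes the ispc kernel above was compiled separately (e.g. `ispc --target=avx2 -h ssqrt_ispc.h -o ssqrt_ispc.o ssqrt.ispc`; the file names are illustrative) and linked with the driver; the mkl call can be timed the same way after adding `mkl.h` and the mkl link flags.

  ```cpp
  // Hypothetical benchmark driver: compares the naive loop with the ispc kernel.
  #include <chrono>
  #include <cmath>
  #include <cstdio>
  #include <vector>
  #include "ssqrt_ispc.h"   // ispc-generated header (name assumed), declares ispc::ispc_ssqrt

  static void vec_sqrt(int n, float *a, float *b) {
      for (int i = 0; i < n; i++) b[i] = std::sqrt(a[i]);
  }

  template <class F>
  static double time_ms(F &&fn, int reps = 100) {
      auto t0 = std::chrono::high_resolution_clock::now();
      for (int r = 0; r < reps; r++) fn();
      auto t1 = std::chrono::high_resolution_clock::now();
      return std::chrono::duration<double, std::milli>(t1 - t0).count();
  }

  int main() {
      const int n = 1 << 20;
      std::vector<float> a(n, 2.0f), b(n);
      std::printf("naive: %8.3f ms\n", time_ms([&] { vec_sqrt(n, a.data(), b.data()); }));
      std::printf("ispc : %8.3f ms\n", time_ms([&] { ispc::ispc_ssqrt(a.data(), b.data(), n); }));
      return 0;
  }
  ```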
- **parallel tasks inside `ispc`**

  `OpenMP` allows us to parallelize our code with minimal changes. Typically we can see a `4x` speedup on a quad-core machine.
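  A minimal sketch of the idea, assuming the same `ssqrt_ispc.h` header as above (the chunk size and file names are illustrative); compile with something like `g++ -O3 -fopenmp driver.cpp ssqrt_ispc.o`:

  ```cpp
  // Split the array into chunks and let OpenMP distribute them across cores,
  // while the ispc kernel handles SIMD within each chunk.
  #include <algorithm>
  #include <omp.h>
  #include "ssqrt_ispc.h"   // ispc-generated header (name assumed)

  void ssqrt_parallel(float *a, float *b, int n) {
      const int chunk = 1 << 16;                 // chunk size chosen arbitrarily here
      #pragma omp parallel for schedule(static)
      for (int start = 0; start < n; start += chunk) {
          int len = std::min(chunk, n - start);  // last chunk may be shorter
          ispc::ispc_ssqrt(a + start, b + start, len);
      }
  }
  ```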