I was coding a particle system for my home engine. I wanted it to be quite flexible and after few iterations I had literally a dozens of parameters and transform modes. And this was not only a problem about design, it was a major performance issue.
I gave a thought about using expression parser. In some other project I've been involved, I've seen muParser library. But this was a no-go, as I had thousands of particles every frame and muParser was just too slow. And no, I couldn't offload particle computations onto GPU as GPU is heavily loaded with rendering.
But on muParser webpage I've found information about muParserSSE -- muParser fork, using JIT compilation (with AsmJit library) and SSE unit. Sounds great. I've made some tests and muParserSSE in fact was much faster than original muParser. But it had 3 blocking issues:
- only supported scalars -- no float4 datatype which is essential for particles (color, velocity, etc)
- even while it supported only scalars, it used scalar SSE instructions (like ADDSS)
- didn't support streams of data, only single evaluations
So I've decided to code a tool called QDT: Quick Data Transformation. My goals were:
- support float and float4
- for scalar-only expressions, use vector SSE ops (like ADDPS) to process 4 items at a time
- support streams (like process(float*in, float*out, size_t count))
- for code using <= 8 regs, don't use stack (muParserSSE uses stack also for 7 and 8 regs expressions)
- for code using > 8 regs, store highest regs in memory stack (because for deep expressions, most of the work is done near leaves -- muParserSSE stores root in regs and leaves in memory)
So I've grabbed Flex and Bison to build a compiler and AsmJit to generate machine code. I've spent some time on AST and machine code optimizations. And I've managed to obtain quite pleasing results.
Let's try with following expression:
2+min(x+x+x,1/5)*(2+x*3)
So first let's code a simple C++ version:
float fun(float x)
{
return 2+min(x+x+x,1.0f/5.0f)*(2.0f+x*3.0f);
}
void test_cpp(float *in, float *out, size_t count)
{
for (size_t i = 0; i < count; i++)
{
out[i] = fun(in[i]);
}
}
Now, SSE assembly using intrinsics (I post only vectorized version, scalar is similar but uses SS and advances only by 1 element at a time)
void test_asm(float *in, float *out, size_t count)
{
__m128 c1 = _mm_set1_ps(3);
__m128 c2 = _mm_set1_ps(2);
__m128 c3 = _mm_set1_ps(0.2f);
for (size_t i = 0; i < count; i+=4)
{
__m128 r1 = _mm_load_ps(in + i);
__m128 r2 = _mm_mul_ps(r1, c1);
__m128 r3 = _mm_min_ps(c3, r2);
__m128 r4 = _mm_add_ps(r2, c2);
__m128 r5 = _mm_mul_ps(r3, r4);
__m128 r6 = _mm_add_ps(c2, r5);
_mm_store_ps(out + i, r6);
}
}
OK, so now is time for resulting assembly code.
C++ (MSVC 2008)
mov edx,dword ptr [esp+10h]
test edx,edx
jbe L1
mov ecx,dword ptr [esp+0Ch]
mov eax,dword ptr [esp+8]
movss xmm1, {0.2}
movss xmm2, {2.0}
sub eax,ecx
L4: movss xmm0,dword ptr [eax+ecx]
mulss xmm0, {3.0}
comiss xmm0,xmm1
jbe L2
movaps xmm3,xmm1
jmp L3
L2: movaps xmm3,xmm0
L3: addss xmm0,xmm2
mulss xmm0,xmm3
addss xmm0,xmm2
movss dword ptr [ecx],xmm0
add ecx,4
dec edx
jne L4
L1: ret
Quite good, however we have branching instead of MINSS, and of course -- no vectorization.
Assembly
push ebp
mov ebp,esp
and esp,0FFFFFFF0h
mov eax,dword ptr [count]
movss xmm2, {3.0}
movss xmm1, {2.0}
movss xmm3, {0.2}
mov edx,dword ptr [in]
mov ecx,dword ptr [out]
shufps xmm2,xmm2,0
shufps xmm1,xmm1,0
shufps xmm3,xmm3,0
test eax,eax
jbe L1
dec eax
shr eax,2
sub edx,ecx
inc eax
mov edi,edi
L2: movaps xmm0,xmmword ptr [edx+ecx]
mulps xmm0,xmm2
movaps xmm4,xmm3
minps xmm4,xmm0
addps xmm0,xmm1
mulps xmm4,xmm0
addps xmm4,xmm1
movaps xmmword ptr [ecx],xmm4
add ecx,10h
sub eax,1
jne L2
L1: mov esp,ebp
pop ebp
ret
Well, compiler didn't break anything.
muParserSSE
As muParserSSE has no streaming evaluation mode, this is just single evaluation func. mov ecx,4EBE90h
movss xmm0,dword ptr [ecx] ; {2.0}
mov ecx,14F9C8h
movss xmm1,dword ptr [ecx] ; x
mov ecx,14F9C8h
movss xmm2,dword ptr [ecx] ; x
addss xmm1,xmm2
mov ecx,14F9C8h
movss xmm2,dword ptr [ecx] ; x
addss xmm1,xmm2
mov ecx,4EBED0h
movss xmm2,dword ptr [ecx] ; {0.2}
minss xmm1,xmm2
mov ecx,4EBEE4h
movss xmm2,dword ptr [ecx] ; {2.0}
mov ecx,14F9C8h
movss xmm3,dword ptr [ecx] ; x
mov ecx,4EBEFCh
movss xmm4,dword ptr [ecx] ; {3.0}
mulss xmm3,xmm4
addss xmm2,xmm3
mulss xmm1,xmm2
addss xmm0,xmm1
mov ecx,4E72BCh
movss dword ptr [ecx],xmm0
fld dword ptr [ecx]
ret
Pretty bad, no optimizations at all (except for constant folding). A lot of redundant memory accesses. And moving result with FPU hurts.
QDT
push ebp
mov ebp,esp
push esi
push edi
mov eax,610B40h
movaps xmm0,xmmword ptr [eax] ; {0.2}
mov eax,610B50h
movaps xmm1,xmmword ptr [eax] ; {3.0}
mov eax,610B60h
movaps xmm2,xmmword ptr [eax] ; {2.0}
mov esi,dword ptr [ebp+8] ; in
mov edi,dword ptr [ebp+0Ch]; out
mov ecx,dword ptr [ebp+10h]; count
L1: movaps xmm3,xmmword ptr [esi]
movaps xmm4,xmm2
movaps xmm5,xmm4
movaps xmm6,xmm3
movaps xmm7,xmm1
mulps xmm6,xmm1
addps xmm5,xmm6
movaps xmm6,xmm0
movaps xmm7,xmm3
mulps xmm7,xmm1
minps xmm6,xmm7
mulps xmm5,xmm6
addps xmm4,xmm5
movaps xmmword ptr [edi],xmm4
add esi,10h
add edi,10h
sub ecx,4
jne L1
pop edi
pop esi
mov esp,ebp
pop ebp
ret
Well... I think it's not that bad for starters. However inner loop could be greatly simplified (I hate you, 2-operand X86 instruction set!).
And now it's time for results...

(percentage of execution time compared to C++ code)
Those don't really surprise me. However why vectorized SSE code is only 2x and not 4x faster? Am I hitting memory bandwidth? Nevermind. The conclusion is simple: using JITed expression parser can greatly speed up data transformations. And now I'm going to try using LLVM for optimizations and/or assembly phases, because coding all optimizations is not easy...
PS. Original muParser was over 35x slower than C++ version so I didn't even include it in results ;)
Links
Comments:
-
Ben:
Would the c+ version benefit from the __restrict keyword?
Also, couldn't you use opencl for something like this?(Targetted at the cpu) If I recall correctly, you have to use the explicit vector types to get any automatic vectorization, but it would be interesting to see it in comparison to the other methods mentioned.18.11.2010 20:16:46
-
__restrict doesn't change anything in MSVC-generated code. However is makes a big change with GCC.
OpenCL for CPU? AFAIK Intel has released OpenCL SDK very recently and it's an alpha version for testing (non-commercial) purposes only.18.11.2010 21:31:41
-
Ben:
The amd opencl sdk supports cpu(intel included) with the restriction of at least 1 amd gpu or amd cpu in the system.
18.11.2010 22:14:05
-
So this is also a no-go for Intel/Nvidia fanboy ;)
18.11.2010 22:15:12