Quick Data Transformation

18.11.2010 01:54 in c++, tool

I was coding a particle system for my home engine. I wanted it to be quite flexible and after few iterations I had literally a dozens of parameters and transform modes. And this was not only a problem about design, it was a major performance issue.

I gave a thought about using expression parser. In some other project I've been involved, I've seen muParser library. But this was a no-go, as I had thousands of particles every frame and muParser was just too slow. And no, I couldn't offload particle computations onto GPU as GPU is heavily loaded with rendering.

But on muParser webpage I've found information about muParserSSE -- muParser fork, using JIT compilation (with AsmJit library) and SSE unit. Sounds great. I've made some tests and muParserSSE in fact was much faster than original muParser. But it had 3 blocking issues:

  1. only supported scalars -- no float4 datatype which is essential for particles (color, velocity, etc)
  2. even while it supported only scalars, it used scalar SSE instructions (like ADDSS)
  3. didn't support streams of data, only single evaluations

So I've decided to code a tool called QDT: Quick Data Transformation. My goals were:

  1. support float and float4
  2. for scalar-only expressions, use vector SSE ops (like ADDPS) to process 4 items at a time
  3. support streams (like process(float*in, float*out, size_t count))
  4. for code using <= 8 regs, don't use stack (muParserSSE uses stack also for 7 and 8 regs expressions)
  5. for code using > 8 regs, store highest regs in memory stack (because for deep expressions, most of the work is done near leaves -- muParserSSE stores root in regs and leaves in memory)

So I've grabbed Flex and Bison to build a compiler and AsmJit to generate machine code. I've spent some time on AST and machine code optimizations. And I've managed to obtain quite pleasing results.

Let's try with following expression:

2+min(x+x+x,1/5)*(2+x*3)

So first let's code a simple C++ version:

float fun(float x)
{
  return 2+min(x+x+x,1.0f/5.0f)*(2.0f+x*3.0f);
}
void test_cpp(float *in, float *out, size_t count)
{
  for (size_t i = 0; i < count; i++)
  {
    out[i] = fun(in[i]);
  }
}

Now, SSE assembly using intrinsics (I post only vectorized version, scalar is similar but uses SS and advances only by 1 element at a time)

void test_asm(float *in, float *out, size_t count)
{
  __m128 c1 = _mm_set1_ps(3);
  __m128 c2 = _mm_set1_ps(2);
  __m128 c3 = _mm_set1_ps(0.2f);

  for (size_t i = 0; i < count; i+=4)
  {
    __m128 r1 = _mm_load_ps(in + i);
    __m128 r2 = _mm_mul_ps(r1, c1);		
    __m128 r3 = _mm_min_ps(c3, r2);
    __m128 r4 = _mm_add_ps(r2, c2);
    __m128 r5 = _mm_mul_ps(r3, r4);
    __m128 r6 = _mm_add_ps(c2, r5);
    _mm_store_ps(out + i, r6);
  }
}

OK, so now is time for resulting assembly code.

C++ (MSVC 2008)

      mov         edx,dword ptr [esp+10h] 
      test        edx,edx 
      jbe         L1
      mov         ecx,dword ptr [esp+0Ch] 
      mov         eax,dword ptr [esp+8] 
      movss       xmm1, {0.2}
      movss       xmm2, {2.0}
      sub         eax,ecx 
 L4:  movss       xmm0,dword ptr [eax+ecx]
      mulss       xmm0, {3.0}
      comiss      xmm0,xmm1 
      jbe         L2
      movaps      xmm3,xmm1 
      jmp         L3
 L2:  movaps      xmm3,xmm0 
 L3:  addss       xmm0,xmm2 
      mulss       xmm0,xmm3 
      addss       xmm0,xmm2 
      movss       dword ptr [ecx],xmm0 
      add         ecx,4 
      dec         edx  
      jne         L4
 L1:  ret     

Quite good, however we have branching instead of MINSS, and of course -- no vectorization.

Assembly

      push        ebp  
      mov         ebp,esp 
      and         esp,0FFFFFFF0h 
      mov         eax,dword ptr [count] 
      movss       xmm2, {3.0}
      movss       xmm1, {2.0}
      movss       xmm3, {0.2}
      mov         edx,dword ptr [in] 
      mov         ecx,dword ptr [out] 
      shufps      xmm2,xmm2,0 
      shufps      xmm1,xmm1,0 
      shufps      xmm3,xmm3,0 
      test        eax,eax 
      jbe         L1
      dec         eax  
      shr         eax,2 
      sub         edx,ecx 
      inc         eax  
      mov         edi,edi 
 L2:  movaps      xmm0,xmmword ptr [edx+ecx] 
      mulps       xmm0,xmm2 
      movaps      xmm4,xmm3 
      minps       xmm4,xmm0 
      addps       xmm0,xmm1 
      mulps       xmm4,xmm0 
      addps       xmm4,xmm1 
      movaps      xmmword ptr [ecx],xmm4 
      add         ecx,10h 
      sub         eax,1 
      jne         L2
 L1:  mov         esp,ebp 
      pop         ebp  
      ret 

Well, compiler didn't break anything.

muParserSSE

As muParserSSE has no streaming evaluation mode, this is just single evaluation func.
      mov         ecx,4EBE90h 
      movss       xmm0,dword ptr [ecx] ; {2.0}
      mov         ecx,14F9C8h 
      movss       xmm1,dword ptr [ecx] ; x
      mov         ecx,14F9C8h 
      movss       xmm2,dword ptr [ecx] ; x
      addss       xmm1,xmm2 
      mov         ecx,14F9C8h 
      movss       xmm2,dword ptr [ecx] ; x
      addss       xmm1,xmm2 
      mov         ecx,4EBED0h 
      movss       xmm2,dword ptr [ecx] ; {0.2}
      minss       xmm1,xmm2 
      mov         ecx,4EBEE4h 
      movss       xmm2,dword ptr [ecx] ; {2.0}
      mov         ecx,14F9C8h 
      movss       xmm3,dword ptr [ecx] ; x
      mov         ecx,4EBEFCh 
      movss       xmm4,dword ptr [ecx] ; {3.0}
      mulss       xmm3,xmm4 
      addss       xmm2,xmm3 
      mulss       xmm1,xmm2 
      addss       xmm0,xmm1 
      mov         ecx,4E72BCh 
      movss       dword ptr [ecx],xmm0 
      fld         dword ptr [ecx] 
      ret        

Pretty bad, no optimizations at all (except for constant folding). A lot of redundant memory accesses. And moving result with FPU hurts.

QDT

      push        ebp  
      mov         ebp,esp 
      push        esi  
      push        edi  
      mov         eax,610B40h 
      movaps      xmm0,xmmword ptr [eax] ; {0.2}
      mov         eax,610B50h 
      movaps      xmm1,xmmword ptr [eax] ; {3.0}
      mov         eax,610B60h 
      movaps      xmm2,xmmword ptr [eax] ; {2.0}
      mov         esi,dword ptr [ebp+8]  ; in
      mov         edi,dword ptr [ebp+0Ch]; out
      mov         ecx,dword ptr [ebp+10h]; count
 L1:  movaps      xmm3,xmmword ptr [esi] 
      movaps      xmm4,xmm2 
      movaps      xmm5,xmm4 
      movaps      xmm6,xmm3 
      movaps      xmm7,xmm1 
      mulps       xmm6,xmm1 
      addps       xmm5,xmm6 
      movaps      xmm6,xmm0 
      movaps      xmm7,xmm3 
      mulps       xmm7,xmm1 
      minps       xmm6,xmm7 
      mulps       xmm5,xmm6 
      addps       xmm4,xmm5 
      movaps      xmmword ptr [edi],xmm4 
      add         esi,10h 
      add         edi,10h 
      sub         ecx,4 
      jne         L1
      pop         edi  
      pop         esi  
      mov         esp,ebp 
      pop         ebp  
      ret      

Well... I think it's not that bad for starters. However inner loop could be greatly simplified (I hate you, 2-operand X86 instruction set!).

And now it's time for results...


(percentage of execution time compared to C++ code)

Those don't really surprise me. However why vectorized SSE code is only 2x and not 4x faster? Am I hitting memory bandwidth? Nevermind. The conclusion is simple: using JITed expression parser can greatly speed up data transformations. And now I'm going to try using LLVM for optimizations and/or assembly phases, because coding all optimizations is not easy...

PS. Original muParser was over 35x slower than C++ version so I didn't even include it in results ;)

Links

Comments:

  1. Ben

    Ben:

    Would the c+ version benefit from the __restrict keyword?

    Also, couldn't you use opencl for something like this?(Targetted at the cpu) If I recall correctly, you have to use the explicit vector types to get any automatic vectorization, but it would be interesting to see it in comparison to the other methods mentioned.

    18.11.2010 20:16:46

  2. Tomasz Dabrowski

    Tomasz Dabrowski:

    __restrict doesn't change anything in MSVC-generated code. However is makes a big change with GCC.

    OpenCL for CPU? AFAIK Intel has released OpenCL SDK very recently and it's an alpha version for testing (non-commercial) purposes only.

    18.11.2010 21:31:41

  3. Ben

    Ben:

    The amd opencl sdk supports cpu(intel included) with the restriction of at least 1 amd gpu or amd cpu in the system.

    18.11.2010 22:14:05

  4. Tomasz Dabrowski

    Tomasz Dabrowski:

    So this is also a no-go for Intel/Nvidia fanboy ;)

    18.11.2010 22:15:12

Leave comment: