LLVM -- sadly not the Ultimate Solution

24.11.2010 02:50 in c++, compilers

When I was coding optimization stuff in QDT I was quite sure that I'm re-inventing the (low level virtual) wheel. So it was quite natural to plug it in and test-drive LLVM-ed QDT. Results were somehow unexpected and somewhat sad.

LLVM is a wonderful tool -- without a question. However as they say, from performance, versatility and readibilty you may choose *at most* two. LLVM is actually trying to choose e.

To cut long story short: LLVM built-in floating point is implemented in very cautious manners. Emphasis is placed on precision and safety, not performance. And for this reason almost no optimizations on floating-point operations are performed, even if you enable so-called ,,fast-math'' mode. As LLVM's live demo is available, you can try it yourself.

int foo(int x) { return 1+x+1+x+1+x+1+x+1+x+1+x+1+x; }
float bar(float x) { return 1+x+1+x+1+x+1+x+1+x+1+x+1+x; }

Results (LLVM IR representation):

foo
{
  %tmp = mul i32 %x, 7                            ;  [#uses=1]
  %0 = add nsw i32 %tmp, 7                        ;  [#uses=1]
  ret i32 %0
}

bar
{
  %0 = fadd float %x, 1.000000e+00                ;  [#uses=1]
  %1 = fadd float %0, 1.000000e+00                ;  [#uses=1]
  %2 = fadd float %1, %x                          ;  [#uses=1]
  %3 = fadd float %2, 1.000000e+00                ;  [#uses=1]
  %4 = fadd float %3, %x                          ;  [#uses=1]
  %5 = fadd float %4, 1.000000e+00                ;  [#uses=1]
  %6 = fadd float %5, %x                          ;  [#uses=1]
  %7 = fadd float %6, 1.000000e+00                ;  [#uses=1]
  %8 = fadd float %7, %x                          ;  [#uses=1]
  %9 = fadd float %8, 1.000000e+00                ;  [#uses=1]
  %10 = fadd float %9, %x                         ;  [#uses=1]
  %11 = fadd float %10, 1.000000e+00              ;  [#uses=1]
  %12 = fadd float %11, %x                        ;  [#uses=1]
  ret float %12
}

As you can see, LLVM is capable of optimizations -- but only integer, not floating point.

At least LLVM code is very easy to use and extend (for example, adding optimization passes). After some tweaks I've added two FP optimization passes, one removing no-ops and another compressing additions. Here are results for (slightly modified) expression: 2+min(2+x+x+2*x,1/5)*(1+x*4+1):

Default (maximum) optimizations:

  movss       xmm0,{2.0}
  movss       xmm1,dword ptr [esp+8] 
  movaps      xmm2,xmm1 
  addss       xmm2,xmm0 
  addss       xmm2,xmm1 
  movaps      xmm3,xmm1 
  addss       xmm3,xmm3 
  mulss       xmm1,{4.0}
  addss       xmm1,xmm0 
  addss       xmm3,xmm2 
  minss       xmm3,{0.2}
  mulss       xmm3,xmm1 
  addss       xmm3,xmm0 
  movss       dword ptr [esp],xmm3 

Additional FP passes:

  movss       xmm0,dword ptr [esp+8] 
  mulss       xmm0,{4.0}
  movss       xmm1,{2.0}
  addss       xmm0,xmm1 
  movaps      xmm2,xmm0 
  minss       xmm2,{0.2}
  mulss       xmm2,xmm0 
  addss       xmm2,xmm1 
  movss       dword ptr [esp],xmm2

No loop and vectorization code yet -- working on it.

Comments:

  1. Arthur

    Arthur:

    Interesting post. On a very related note, it's interesting and surprising to see that "unnecessary" parentheses can have an effect on the generated FP code:

    float bar1(float x) {
    return (x + x) + (x + x) + (x + x);
    }
    float bar2(float x) {
    return x + x + x + x + x + x;
    }


    define float @_Z4bar1f(float %x) nounwind readnone {
    entry:
    %0 = fmul float %x, 2.000000e 00 ; <float> [#uses=2]
    %1 = fmul float %0, 2.000000e 00 ; <float> [#uses=1]
    %2 = fadd float %1, %0 ; <float> [#uses=1]
    ret float %2
    }

    define float @_Z4bar2f(float %x) nounwind readnone {
    entry:
    %0 = fmul float %x, 2.000000e 00 ; <float> [#uses=1]
    %1 = fadd float %0, %x ; <float> [#uses=1]
    %2 = fadd float %1, %x ; <float> [#uses=1]
    %3 = fadd float %2, %x ; <float> [#uses=1]
    %4 = fadd float %3, %x ; <float> [#uses=1]
    ret float %4
    }

    28.11.2010 13:07:49

  2. Arthur

    Arthur:

    Not sure why the plus symbols where removed from my prior comment - but I hope the example is still clear, nonetheless.

    28.11.2010 13:09:36

  3. Tomasz Dabrowski

    Tomasz Dabrowski:

    Sorry, it's a problem with my poor javascript skills. I'll try to fix that soon and meanwhile I've adjusted your comment.

    28.11.2010 14:56:28

  4. Tomasz Dabrowski

    Tomasz Dabrowski:

    Fixed that. And I completely don't understand why the first x+x is parsed as fmul(x, 2). How does it comply with "precision" issues? Does IEEE floating point guarantee that x*2 is bitwise equal to x+x? Anyway it's just plain crazy because multiplication is usually slower than addition, so why convert x+x -> 2*x at all? Machine code pass will convert it back, but it seems horrible anyway.

    28.11.2010 15:30:22

  5. Arthur

    Arthur:

    BTW, here's a surprisingly recent mailing list discussion on the issue:

    http://old.nabble.com/Poor-floating-point-optimizations--td30268408.html

    28.11.2010 20:14:11

Leave comment: