When I was coding optimization stuff in QDT I was quite sure that I'm re-inventing the (low level virtual) wheel. So it was quite natural to plug it in and test-drive LLVM-ed QDT. Results were somehow unexpected and somewhat sad.
LLVM is a wonderful tool -- without a question. However as they say, from performance, versatility and readibilty you may choose *at most* two. LLVM is actually trying to choose e.
To cut long story short: LLVM built-in floating point is implemented in very cautious manners. Emphasis is placed on precision and safety, not performance. And for this reason almost no optimizations on floating-point operations are performed, even if you enable so-called ,,fast-math'' mode. As LLVM's live demo is available, you can try it yourself.
int foo(int x) { return 1+x+1+x+1+x+1+x+1+x+1+x+1+x; }
float bar(float x) { return 1+x+1+x+1+x+1+x+1+x+1+x+1+x; }
Results (LLVM IR representation):
foo
{
%tmp = mul i32 %x, 7 ; [#uses=1]
%0 = add nsw i32 %tmp, 7 ; [#uses=1]
ret i32 %0
}
bar
{
%0 = fadd float %x, 1.000000e+00 ; [#uses=1]
%1 = fadd float %0, 1.000000e+00 ; [#uses=1]
%2 = fadd float %1, %x ; [#uses=1]
%3 = fadd float %2, 1.000000e+00 ; [#uses=1]
%4 = fadd float %3, %x ; [#uses=1]
%5 = fadd float %4, 1.000000e+00 ; [#uses=1]
%6 = fadd float %5, %x ; [#uses=1]
%7 = fadd float %6, 1.000000e+00 ; [#uses=1]
%8 = fadd float %7, %x ; [#uses=1]
%9 = fadd float %8, 1.000000e+00 ; [#uses=1]
%10 = fadd float %9, %x ; [#uses=1]
%11 = fadd float %10, 1.000000e+00 ; [#uses=1]
%12 = fadd float %11, %x ; [#uses=1]
ret float %12
}
As you can see, LLVM is capable of optimizations -- but only integer, not floating point.
At least LLVM code is very easy to use and extend (for example, adding optimization passes). After some tweaks I've added two FP optimization passes, one removing no-ops and another compressing additions. Here are results for (slightly modified) expression: 2+min(2+x+x+2*x,1/5)*(1+x*4+1):
Default (maximum) optimizations:
movss xmm0,{2.0}
movss xmm1,dword ptr [esp+8]
movaps xmm2,xmm1
addss xmm2,xmm0
addss xmm2,xmm1
movaps xmm3,xmm1
addss xmm3,xmm3
mulss xmm1,{4.0}
addss xmm1,xmm0
addss xmm3,xmm2
minss xmm3,{0.2}
mulss xmm3,xmm1
addss xmm3,xmm0
movss dword ptr [esp],xmm3
Additional FP passes:
movss xmm0,dword ptr [esp+8]
mulss xmm0,{4.0}
movss xmm1,{2.0}
addss xmm0,xmm1
movaps xmm2,xmm0
minss xmm2,{0.2}
mulss xmm2,xmm0
addss xmm2,xmm1
movss dword ptr [esp],xmm2
No loop and vectorization code yet -- working on it.
Comments:
-
Arthur:
Interesting post. On a very related note, it's interesting and surprising to see that "unnecessary" parentheses can have an effect on the generated FP code:
float bar1(float x) {
return (x + x) + (x + x) + (x + x);
}
float bar2(float x) {
return x + x + x + x + x + x;
}
define float @_Z4bar1f(float %x) nounwind readnone {
entry:
%0 = fmul float %x, 2.000000e 00 ; <float> [#uses=2]
%1 = fmul float %0, 2.000000e 00 ; <float> [#uses=1]
%2 = fadd float %1, %0 ; <float> [#uses=1]
ret float %2
}
define float @_Z4bar2f(float %x) nounwind readnone {
entry:
%0 = fmul float %x, 2.000000e 00 ; <float> [#uses=1]
%1 = fadd float %0, %x ; <float> [#uses=1]
%2 = fadd float %1, %x ; <float> [#uses=1]
%3 = fadd float %2, %x ; <float> [#uses=1]
%4 = fadd float %3, %x ; <float> [#uses=1]
ret float %4
}28.11.2010 13:07:49
-
Arthur:
Not sure why the plus symbols where removed from my prior comment - but I hope the example is still clear, nonetheless.
28.11.2010 13:09:36
-
Sorry, it's a problem with my poor javascript skills. I'll try to fix that soon and meanwhile I've adjusted your comment.
28.11.2010 14:56:28
-
Fixed that. And I completely don't understand why the first x+x is parsed as fmul(x, 2). How does it comply with "precision" issues? Does IEEE floating point guarantee that x*2 is bitwise equal to x+x? Anyway it's just plain crazy because multiplication is usually slower than addition, so why convert x+x -> 2*x at all? Machine code pass will convert it back, but it seems horrible anyway.
28.11.2010 15:30:22
-
BTW, here's a surprisingly recent mailing list discussion on the issue:
http://old.nabble.com/Poor-floating-point-optimizations--td30268408.html28.11.2010 20:14:11