Since the start of this year I'm employed as a postdoc on a project for designing a new domain specific language for writing dsp algorithms. The language is called Feldspar and if you want to check it out you can look at the official homepage or download it from hackage. As you can see on these pages Feldspar is a collaboration between Chalmers (where I'm employed), Ericsson and Elte University in Budapest.
There are two main goals for the design of Feldspar: first of all it should be possible to program at a very high level, close to how dsp algorithms are normally specified. Secondly, the generated code needs to be very efficient as the kind of applications that Ericsson has for dsp applications are performance critical.
I'm involved in various parts of Feldspar but one particular thing on my plate is making sure that the generated code is fast.
Recently the language has been used in a pilot project within Ericsson to implement part of the 3gpp standard (a mobile broadband standard, unsurprisingly, given Ericssons involvement). Having people using the language is tremendously useful for us language implementors as we really get a chance to see how well the language works in its intended environment.
The Feldspar code that was written in this pilot project was kept very close to the standard and was very similar to the mathematical specification of the algorithms. It's a very nice feature of Feldspar that this is possible, but it poses a challenge for us language implementers. While eyeballing some of the code I notice a little piece of code that I would like to discuss a little:
(-1) ^ v30
Just to clarify, in Feldspar this means minus one to the power of v30
and it has nothing to do with xor. v30
is just a variable name.
Powers of minus one is a common idiom in mathematics for saying that a value should change sign. If the exponent is even the result will be positive and if the exponent is odd the result is negative. This kind of thing is useful in various places and apparently also in the 3gpp standard.
However, actually performing exponentiation would in this case be ridiculously inefficient. But the question is what kind of code we should generate for this? Currently our compiler generates C so the code examples I will show from here on will be written in C.
Remember what I wrote above that the result only depends on whether the exponent is even or not. That is very easy to check, it's just the least significant bit! So we might generate the following code (assuming that the exponent is v30
as in the example above):
v30 & 1 ? -1 : 1;
This is very short and nice and most likely as good as we can hope for, at least for the kind of processors we are targeting.
However, I started programming in the 80's and I still instinctively flinch when I see branches in performance critical code. So I got curious to see if I could write the above as straight line code.
A straight line code solution which computes the above function would most likely involve some bit twiddling. I spent some time trying to come up with a solution on my own but wasn't very happy with what I managed to produce. I was aiming for a three instruction solution and my solutions were nowhere near that. So I decided to look around, maybe someone else had solved the problem before me.
Bit Twiddling Hacks to the rescue! This wonderful list of various bit twiddling tricks doesn't have anything which solves my particular problem but there is one little nugget which is close enough that I could make good use of it: "Conditionally negate a value without branching". The value that we will be negating is 1
, we want to negate it depending on the least significant bit of our input value (v30
in our example above). I will not reproduce the code from Bit Twiddling Hacks, you can check it out yourself via the link. Instead I'm just going to present the final result of using that code on my particular problem (using C code):
int v; //The exponent
int r; //Will contain -1 to the power of
int isOdd = v & 1; //Is the exponent odd?
r = (1 ^ (-isOdd)) + isOdd;
This is a fairly clever piece of code and I'm quite happy with it. However, it will compile to four instructions on typical architectures and I was really hoping for a three instruction solution. Anyone out there who knows of a shorter solution?