Bitwise conversion of doubles using only floating-point multiplication and addition

In the words of Tom Lehrer, “this is completely pointless, but may prove useful to some of you some day, perhaps in a somewhat bizarre set of circumstances.”

The problem is as follows: suppose you’re working in a programming environment that provides only an IEEE-754 double-precision floating point (“double”) type, and no operations that can access that type’s representation (such as C++ bitwise cast operations, or Javascript’s DataView object). You have a double and you want to convert it to its bitwise representation as two unsigned 32-bit integers (stored as doubles), or vice versa. This problem comes up from time to time, but I was curious about a different question: how restricted can your programming environment be? Could you do it with just floating point multiplication and addition?

Bitwise conversion using floating point operations can be useful in situations like limited interpreted languages, or C++ constexpr contexts. Generally double to int conversion can be done using a binary search, comparing with powers of two to figure out the bits of the exponent. From there the fraction bits can be extracted, either by binary searching more, or using the knowledge of the exponent to scale the fraction bits into the integer range.

But can it be done without bitwise operations, branches, exponentiation, division, or floating point comparisons?

It seemed improbable at first, but I’ve discovered the answer is yes, multiplication and addition are mostly sufficient, although with a few notable caveats. Even without these restrictions different NaN values cannot be distinguished or generated (without bitwise conversion) in most environments, but using only multiplication and addition it is impossible to convert NaN, Infinity or -Infinity into an unsigned 32-bit value. The other problematic value is “negative zero”, which cannot be differentiated from “positive zero” using addition and multiplication. All my code uses subtraction, although it could be removed by substituting a - b with a + (b * -1). And finally, this relies on IEEE-754 operations (in the usual rounding mode, “round to nearest, ties to even”), so it wouldn’t work in environments that use unsafe maths optimisations (the default in shader compilers, or enabled by a flag such as /fp:fast in many other compilers).

So, if you just need a solution, here it is, but otherwise stick around for an explanation:

function double_as_uint32s(double) {
  // Doesn't handle NaN, Infinity or -Infinity. Treats -0 as 0.

  var a = double, b, c, d, e, f, g, h, i, j, k, l, m, n, low, high;

  f=2.2250738585072014e-308+a; j=5e-324; b=j+f; b-=f; m=-5e-324; d=m+b; b=4.4989137945431964e+161; d=b*d; d=b*d; g=d*d;
  d=1.0; g=d-g; h=m+f; f=h-f; f=j+f; f=b*f; f=b*f; f*=f; f=d-f; f*=g; g=-2.2250738585072014e-308+a; h=j+g; h-=g; h=m+h;
  h=b*h; h=b*h; h*=h; h=d-h; c=m+g; c-=g; c=j+c; c=b*c; c=b*c; c*=c; c=d-c; c*=h; k=c*f; c=5.562684646268003e-309*a;
  g=j+c; g-=c; g=m+g; g=b*g; g=b*g; g*=g; g=d-g; h=m+c; h-=c; h=j+h; h=b*h; h=b*h; h*=h; h=d-h; g=h*g; h=a*g; g=d-g;
  c=g*c; g=1024.0*g; f=2.0+g; c+=h; h=7.458340731200207e-155*c; l=1.0000000000000002; g=l*h; g=m+g; e=j+g; e-=g; e=b*e;
  e=b*e; c=e*c; e=d-e; g=e*h; c=g+c; e=512.0*e; g=8.636168555094445e-78*c; e+=f; f=l*g; f=m+f; h=j+f; f=h-f; f=b*f;
  f=b*f; c=f*c; f=d-f; g=f*g; f=256.0*f; c=g+c; e=f+e; f=2.938735877055719e-39*c; g=l*f; g=m+g; h=j+g; g=h-g; g=b*g;
  g=b*g; c=g*c; g=d-g; f=g*f; c=f+c; f=128.0*g; g=5.421010862427522e-20*c; e=f+e; f=l*g; f=m+f; h=j+f; f=h-f; f=b*f;
  f=b*f; c=f*c; f=d-f; g=f*g; f=64.0*f; c=g+c; e=f+e; i=2.3283064365386963e-10; f=i*c; g=l*f; g=m+g; h=j+g; g=h-g; g=b*g;
  g=b*g; c=g*c; g=d-g; f=g*f; c=f+c; f=32.0*g; g=1.52587890625e-05*c; e=f+e; f=l*g; f=m+f; h=j+f; f=h-f; f=b*f; f=b*f;
  c=f*c; f=d-f; g=f*g; f=16.0*f; c=g+c; e=f+e; f=0.00390625*c; g=l*f; g=m+g; h=j+g; g=h-g; g=b*g; g=b*g; c=g*c; g=d-g;
  f=g*f; c=f+c; f=8.0*g; g=0.0625*c; e=f+e; f=l*g; f=m+f; h=j+f; f=h-f; f=b*f; f=b*f; c=f*c; f=d-f; g=f*g; f=4.0*f;
  c=g+c; e=f+e; f=0.25*c; g=l*f; g=m+g; h=j+g; g=h-g; g=b*g; g=b*g; c=g*c; g=d-g; f=g*f; c=f+c; f=g+g; e=f+e; n=0.5;
  f=n*c; g=l*f; g=m+g; h=j+g; g=h-g; g=b*g; g=b*g; c=g*c; g=d-g; f=g*f; c=f+c; e=g+e; f=d-k; g=j+a; g-=a; g=m+g; g=b*g;
  g=b*g; g*=g; g=d-g; h=m+a; a=h-a; a=j+a; a=b*a; a=b*a; a*=a; a=d-a; a*=g; g=f*a; a=d-a; a=e*a; a+=g; e=l*c; e=m+e;
  g=j+e; e=g-e; e=b*e; e=b*e; g=n*c; c=e*c; e=d-e; e*=g; c=e+c; e=4.450147717014403e-308+c; g=j+e; g-=e; g=m+g; g=b*g;
  g=b*g; g*=g; g=d-g; h=m+e; e=h-e; e=j+e; e=b*e; e=b*e; e*=e; e=d-e; e*=g; g=e+e; d-=g; c=d*c; c=b*c; b*=c;
  c=-4503599627370496.0*f; c+=b; b=i*c; b=-0.4999999998835847+b; b=4503599627370497.0+b; d=-4503599627370497.0+b;
  b=2147483648.0*e; a=1048576.0*a; a=b+a; b=d+a; a=-4294967296.0*d; a+=c; low=a; high=b;

  return [low, high];
}

function uint32s_as_double(low, high) {
  var a = low, b = high, c, d, e, f, g, h, i, j, k, l, m;

  b=9.5367431640625e-07*b; f=-0.4999999998835847; c=f+b; g=4503599627370497.0; c=g+c; e=-4503599627370497.0; c=e+c;
  d=b-c; c=0.00048828125*c; b=f+c; b=g+b; k=e+b; l=c-k; j=2.2250738585072014e-308; c=j+l; c-=l; i=4.49423283715579e+307;
  b=i*c; c=1.0; b=c-b; a=2.220446049250313e-16*a; h=-0.00048828125+l; a=d+a; d=b*h; d+=d; h=f+d; h=g+h; h=e+h; d-=h;
  b+=a; b=j*b; m=1.3407807929942597e+154; h=m*h; h=c+h; b=h*b; b*=h; d+=d; h=f+d; h=g+h; h=e+h; d-=h; h=m*h; h=c+h;
  b=h*b; d+=d; h=f+d; h=g+h; h=e+h; d-=h; h=1.157920892373162e+77*h; h=c+h; b=h*b; d+=d; h=f+d; h=g+h; h=e+h; d-=h;
  h=3.402823669209385e+38*h; h=c+h; b=h*b; d+=d; h=f+d; h=g+h; h=e+h; d-=h; h=1.8446744073709552e+19*h; h=c+h; b=h*b;
  d+=d; h=f+d; h=g+h; h=e+h; d-=h; h=4294967295.0*h; h=c+h; b=h*b; d+=d; h=f+d; h=g+h; h=e+h; d-=h; h=65535.0*h; h=c+h;
  b=h*b; d+=d; h=f+d; h=g+h; h=e+h; d-=h; h=255.0*h; h=c+h; b=h*b; d+=d; h=f+d; h=g+h; h=e+h; d-=h; h=15.0*h; h=c+h;
  b=h*b; d+=d; f+=d; f=g+f; e+=f; d-=e; e=3.0*e; e=c+e; b=e*b; d+=d; d=c+d; b=d*b; d=-0.99951171875+l; e=j+d; d=e-d;
  d=i*d; e=j+a; a=e-a; a=i*a; a=c-a; a=d*a; a=m*a; a=m*a; a-=a; a=b+a; b=k+k; b=c-b; a*=b;

  return a;
}

(I’m mostly joking, but it would be pretty funny to find that code buried in a library someday, and it should be pretty easy to port to just about any language.)

Background: IEEE-754 Doubles

This aims to be a concise explanation of everything you need to know about doubles to understand the rest of the article. Skip it if you know it already.

A double is a 64-bit value. Going from most-significant-bit to least-significant-bit, it is comprised of a 1-bit sign, an 11-bit exponent and a 52-bit fraction. These bits are interpreted as either a special value, or a numerical value as described in the following pseudocode. The operations and values in the pseudocode have infinite precision, and ** is the exponentiation operation.

if (sign == 1)
  s = -1;
else
  s = 1;

if (exponent == 0x7FF) {
  // Maximum exponent means a special value
  if (fraction != 0)
    return NaN;       // Not a Number
  else if (sign == 1)
    return -Infinity;
  else
    return Infinity;
} else if (exponent == 0) {
  // Zero exponent means a subnormal value.
  return s * (0.0 + fraction * (2 ** -52)) * (2 ** -1022);
} else {
  // Everything else is a normal value.
  return s * (1.0 + fraction * (2 ** -52)) * (2 ** (exponent-1023));
}

Normal values have an implicit leading 1, and can be thought of as “1.fraction”. Subnormals do not, so can be thought of as “0.fraction”, but they are otherwise the same as exponent == 1.

This has been carefully designed, and gives a few interesting properties:

The implicit leading one ensures that each value has a unique representation (except for 0/-0 and NaN).
The subnormals ensure that distance between representable numbers only ever decreases as you get closer to zero, so the difference between two sequential values (also known as a “unit in last place” or ULP) is always exactly representable.
For positive numbers, the floating point value increases with its 64-bit integer representation, so they could be compared as integers, or you can find the next representable value by adding 1 to its int64 representation.
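
To make the layout concrete, here is a minimal JavaScript sketch that transcribes the pseudocode and checks it against real bit patterns. It uses DataView (exactly the facility the rest of the article assumes is unavailable) purely as a reference. The names p2, fieldsOf and decode are made up for this illustration.

```javascript
// Hypothetical helper: an exact power of two, built by repeated doubling or
// halving (each step is exactly representable for -1074 <= e <= 1023).
function p2(e) {
  let r = 1;
  for (let i = 0; i < Math.abs(e); i++) r *= e < 0 ? 0.5 : 2;
  return r;
}

// Reference only: read the sign, exponent and fraction fields with DataView.
function fieldsOf(x) {
  const view = new DataView(new ArrayBuffer(8));
  view.setFloat64(0, x); // big-endian by default, so the high word comes first
  const high = view.getUint32(0);
  const low = view.getUint32(4);
  return {
    sign: high >>> 31,
    exponent: (high >>> 20) & 0x7ff,
    fraction: (high & 0xfffff) * 4294967296 + low, // 52 bits, exact in a double
  };
}

// The pseudocode, transcribed. Every product here happens to be exact.
function decode({ sign, exponent, fraction }) {
  const s = sign === 1 ? -1 : 1;
  if (exponent === 0x7ff) {
    return fraction !== 0 ? NaN : s * Infinity;
  }
  if (exponent === 0) {
    return s * (0.0 + fraction * p2(-52)) * p2(-1022); // subnormal
  }
  return s * (1.0 + fraction * p2(-52)) * p2(exponent - 1023); // normal
}

console.log(fieldsOf(1.0)); // { sign: 0, exponent: 1023, fraction: 0 }
console.log(decode(fieldsOf(-123.456)) === -123.456); // true
```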

Addition and multiplication of doubles is defined as exact, infinitely-precise, mathematical addition and multiplication. If the exact result can be represented by a double, that double is the result, otherwise rounding occurs. IEEE-754 specifies several rounding modes that can be used, but I’ll focus on the most widely used one “round to nearest, ties to even”. This means that the nearest representable value is used, or if the exact result is half way between two representable values, the value with zero in its least significant fraction bit is used. If the infinitely precise result gets too large or too small, it will be rounded to Infinity or -Infinity (see IEEE-754-2008 section 4.3.1 for a formal definition).
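
A small JavaScript illustration of "round to nearest, ties to even" (JavaScript numbers are IEEE-754 doubles, so these results follow directly from the rule described above):

```javascript
const two53 = 9007199254740992; // 2^53: from here up to 2^54, doubles are 2 apart

// 2^53 + 1 lies exactly half way between 2^53 and 2^53 + 2. The tie goes to
// the neighbour whose least significant fraction bit is zero, which is 2^53.
console.log(two53 + 1 === two53); // true

// 2^53 + 3 is half way between 2^53 + 2 and 2^53 + 4. This time the "even"
// neighbour is the upper one.
console.log(two53 + 3 === two53 + 4); // true

// Results too large to represent round to Infinity.
console.log(Number.MAX_VALUE * 2); // Infinity
```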

Finally, we should consider the special values. If NaN is an input to an addition or multiplication, the result will always be NaN. Multiplication and addition with Infinity or -Infinity will result in other Infinity or -Infinity values, with the exceptions of multiplying Infinity by zero, or subtracting Infinity from Infinity, both of which will result in NaN.

Notation

From this point onward, this is an attempt at something like literate programming, presented in essentially the order I created it, starting with just multiply, add and subtract, then building progressively more powerful functions. The code was written as C++, and has been refactored to simplify the explanation. I do make use of loops and functions, but only where they can be completely unrolled or inlined by the compiler.

I’ve omitted the function double p2(int e), which provides a power of two – everywhere it is used it gets inlined as a constant, but the easiest way to ensure this was to use a lookup table with 2098 values.

The macro CONSTEXPR is defined as follows, mostly to allow adjustments to inlining, or removing the constexpr keyword from everything easily:

#define CONSTEXPR \
  constexpr static inline __attribute__((always_inline))

Throughout this text I’ve used exponent to mean the encoded exponent bits in a double, as opposed to the unbiased/decoded exponent (exponent - 1023). Hopefully that’s not too confusing.

Logic Operations

I started by investigating what you can do with only addition and multiplication. Supposing “true” is 1.0 and “false” is 0.0, I implemented some logic operations:

CONSTEXPR double double_and(double a, double b) {
  return a * b;
}

CONSTEXPR double double_not(double a) {
  return 1 - a;
}

CONSTEXPR double double_or(double a, double b) {
  return a + b - a * b;
}

CONSTEXPR double select(
    double condition, double if_true, double if_false) {
  return condition * if_true + double_not(condition) * if_false;
}

These are mostly presented without further comment, as they can be tested exhaustively. However select is where things get a bit tricky. Because Infinity * 0 = NaN and NaN + anything = NaN, we can never ignore Infinity values and must be meticulous about never performing operations that could create them.
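
As a rough illustration, the same operations ported to JavaScript, with the exhaustive check and the Infinity hazard spelled out (the arrow-function form is just a convenience of this sketch):

```javascript
const double_and = (a, b) => a * b;
const double_not = (a) => 1 - a;
const double_or = (a, b) => a + b - a * b;
const select = (condition, if_true, if_false) =>
  condition * if_true + double_not(condition) * if_false;

// Exhaustive check against the built-in boolean operators.
for (const a of [0, 1]) {
  for (const b of [0, 1]) {
    console.log(a, b,
      double_and(a, b) === Number(a && b),
      double_or(a, b) === Number(a || b),
      double_not(a) === Number(!a));
  }
}

console.log(select(1, 5, 7)); // 5
console.log(select(0, 5, 7)); // 7

// The unselected operand is still multiplied by zero, so an Infinity there
// poisons the result: 1 * 5 + 0 * Infinity is 5 + NaN.
console.log(select(1, 5, Infinity)); // NaN
```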

Avoiding Infinities

Given I want to convert an arbitrary floating point number to its bitwise representation, I had to start by figuring out what operations I could do on any floating point number without risking creating an Infinity.

One option here is multiplying by values between 1.0 and -1.0 inclusive as the result will never increase in magnitude. This works in any rounding mode.

We can also add any constant value between p2(969) and -p2(969) exclusive, as this will not round to infinity when added to the positive or negative values of greatest magnitude. However, this only works in round-to-nearest or round-toward-zero modes, as round-toward-positive and round-toward-negative may round to Infinity when adding even the smallest non-zero value.
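
Both rules can be observed at the extremes of the range. This is a JavaScript sketch under round-to-nearest, with p2 as a hypothetical stand-in for the article's omitted helper:

```javascript
// Hypothetical helper: an exact power of two, built by repeated doubling or
// halving (each step is exactly representable for -1074 <= e <= 1023).
function p2(e) {
  let r = 1;
  for (let i = 0; i < Math.abs(e); i++) r *= e < 0 ? 0.5 : 2;
  return r;
}

const MAX = Number.MAX_VALUE; // (2 - p2(-52)) * p2(1023)

// Multiplying by anything in [-1, 1] cannot increase the magnitude.
console.log(MAX * 1 === MAX, MAX * -1 === -MAX); // true true

// A constant inside the stated range is absorbed by rounding...
console.log(MAX + p2(968) === MAX);   // true
console.log(-MAX - p2(968) === -MAX); // true

// ...whereas a larger constant such as p2(970) lands on a tie at the top of
// the range, which rounds away to Infinity.
console.log(MAX + p2(970)); // Infinity
```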

An Initial Comparison

I figured I would need to construct (x == y) and (x < y) comparisons – something that would give me a boolean 0.0 or 1.0 that I could use with my logic functions. But I couldn’t even come up with a way to compute (x == 0). So I instead started with the question: what boolean value can I create?

Consider floating point addition of the smallest positive value (p2(-1074)) to a number. If exponent (the value of the encoded bits) is zero or one, this value is the ULP (distance between subsequent representable floating point values), so the result will be exact. When exponent is two, the ULP is doubled, so the exact result will land half way between two representable values, so it will “round to even”, and either round up (adding p2(-1073) instead) or round down (leaving the value unchanged). Finally, if the exponent is three or above, the exact result of the addition will never reach the midpoint between representable values, so rounding to nearest will leave the value unchanged.

That explanation doesn’t completely cover the boundaries between exponent values. Importantly, when adding p2(-1074) to the negative number with exponent two and fraction zero, the result will have exponent one, and therefore is exactly representable (although the same is not true for the corresponding positive number).

So, supposing we compute x + p2(-1074) - x we will get either p2(-1074) * 2 or 0 if there was rounding, or p2(-1074) if the result of the addition is accurate.
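
The three possible outcomes can be seen directly in JavaScript. In this sketch, p2 is a hypothetical stand-in for the article's omitted helper and add_error is a name chosen for illustration:

```javascript
// Hypothetical helper: an exact power of two, built by repeated doubling or
// halving (each step is exactly representable for -1074 <= e <= 1023).
function p2(e) {
  let r = 1;
  for (let i = 0; i < Math.abs(e); i++) r *= e < 0 ? 0.5 : 2;
  return r;
}

const tiny = p2(-1074); // the smallest positive double, 5e-324
const add_error = (x) => x + tiny - x;

// Exponent 1: the addition is exact.
console.log(add_error(p2(-1022)) === tiny); // true
// Negative, exponent 2, fraction 0: the sum drops to exponent 1, still exact.
console.log(add_error(-p2(-1021)) === tiny); // true
// Exponent 2, even fraction: the tie rounds down, nothing is added.
console.log(add_error(p2(-1021)) === 0); // true
// Exponent 2, odd fraction: the tie rounds up, twice as much is added.
console.log(add_error(p2(-1021) + p2(-1073)) === 2 * tiny); // true
// Larger exponents: the value is unchanged.
console.log(add_error(1.0) === 0); // true
```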

This can be turned into a boolean like so:

CONSTEXPR double adding_smallest_is_precise(double x) {
  double add_error = x + p2(-1074) - x;

  // add_error is in {0, p2(-1074), 2 * p2(-1074)}
  add_error -= p2(-1074);

  // add_error is in {-p2(-1074), 0, p2(-1074)}
  // divide by p2(-1074), by multiplying by p2(1074). p2(1074) is
  // out of range, so multiply by its square root twice instead.
  add_error = add_error * p2(1074/2) * p2(1074/2);

  // add_error is in {-1, 0, 1}
  add_error *= add_error;

  // add_error is in {1, 0, 1}
  return double_not(add_error);
}

This function computes -p2(-1021) <= d < p2(-1021), which is enough to start constructing other comparisons.

However, this comparison is frustratingly asymmetric, so we’ll compute -p2(-1021) < d < p2(-1021) as follows. This is equivalent to checking if the exponent is zero or one.

CONSTEXPR double is_exp_0_or_1(double x) {
  double precise_add = adding_smallest_is_precise(x);
  double precise_sub = adding_smallest_is_precise(-x);
  return double_and(precise_add, precise_sub);
}
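
A JavaScript port of these two functions, with the boundary cases checked, might look like the following. It is a sketch: p2, TINY, ROOT and edge are names introduced here, not part of the original code.

```javascript
// Hypothetical helper: an exact power of two, built by repeated doubling or
// halving (each step is exactly representable for -1074 <= e <= 1023).
function p2(e) {
  let r = 1;
  for (let i = 0; i < Math.abs(e); i++) r *= e < 0 ? 0.5 : 2;
  return r;
}

const TINY = p2(-1074);
const ROOT = p2(537); // p2(1074 / 2)

function adding_smallest_is_precise(x) {
  let add_error = x + TINY - x;        // 0, TINY or 2 * TINY
  add_error -= TINY;                   // -TINY, 0 or TINY
  add_error = add_error * ROOT * ROOT; // -1, 0 or 1
  add_error *= add_error;              // 1 or 0
  return 1 - add_error;
}

function is_exp_0_or_1(x) {
  return adding_smallest_is_precise(x) * adding_smallest_is_precise(-x);
}

const edge = p2(-1021); // smallest positive value with exponent 2
console.log(adding_smallest_is_precise(-edge)); // 1 (the asymmetric case)
console.log(is_exp_0_or_1(-edge));              // 0
console.log(is_exp_0_or_1(edge - TINY));        // 1 (largest value with exponent 1)
console.log(is_exp_0_or_1(0));                  // 1
console.log(is_exp_0_or_1(1));                  // 0
```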

Equality Comparisons

To start with, it’d be good to compute x == 0. We can now do that by taking the minimum and maximum values that satisfy is_exp_0_or_1 and checking that x + v still satisfies is_exp_0_or_1 for both:

CONSTEXPR double is_zero(double x) {
  double magic = p2(-1021) - p2(-1074);
  return double_and(is_exp_0_or_1(x + magic),
                    is_exp_0_or_1(x - magic));
}

This works, and is Infinity-safe, as the magic number is nowhere near the limit of p2(969). It also gives us a way to implement x == y, by checking is_zero(x - y). However, x - y may be Infinity, so we must first implement a safe subtraction operation for comparisons:

CONSTEXPR double cmp_sub(double x, double y) {
  // return a number with the same sign as x-y (or zero
  // if x-y==0), while avoiding returning infinity.

  double small = double_or(is_exp_0_or_1(x),
                           is_exp_0_or_1(y));
  double multiplier = (small + 1) * p2(-1);
  return (x * multiplier) - (y * multiplier);
}

If either value has a tiny exponent, then x - y cannot become infinite. However, if both values have an exponent >= 2, multiplying by p2(-1) will be lossless (it just subtracts 1 from the exponent). As such, the result will be zero when x == y, will be positive when x > y and will be negative when x < y. So we can test equality like so:

CONSTEXPR double is_equal(double x, double y) {
  return is_zero(cmp_sub(x, y));
}
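
The equality test can be tried out with a JavaScript port along these lines. It is a sketch that uses the is_exp_0_or_1 test from the previous section for the small-value check; p2, TINY, ROOT and MAX are names introduced here.

```javascript
// Hypothetical helper: an exact power of two, built by repeated doubling or
// halving (each step is exactly representable for -1074 <= e <= 1023).
function p2(e) {
  let r = 1;
  for (let i = 0; i < Math.abs(e); i++) r *= e < 0 ? 0.5 : 2;
  return r;
}

const TINY = p2(-1074);
const ROOT = p2(537);
const double_or = (a, b) => a + b - a * b;

function adding_smallest_is_precise(x) {
  let e = (x + TINY - x - TINY) * ROOT * ROOT; // -1, 0 or 1
  return 1 - e * e;
}

const is_exp_0_or_1 = (x) =>
  adding_smallest_is_precise(x) * adding_smallest_is_precise(-x);

function is_zero(x) {
  const magic = p2(-1021) - TINY; // largest value that passes is_exp_0_or_1
  return is_exp_0_or_1(x + magic) * is_exp_0_or_1(x - magic);
}

function cmp_sub(x, y) {
  const small = double_or(is_exp_0_or_1(x), is_exp_0_or_1(y));
  const multiplier = (small + 1) * 0.5; // 1 if either is tiny, otherwise 0.5
  return x * multiplier - y * multiplier;
}

const is_equal = (x, y) => is_zero(cmp_sub(x, y));

const MAX = Number.MAX_VALUE;
console.log(MAX - -MAX);                 // Infinity: plain subtraction overflows
console.log(cmp_sub(MAX, -MAX) === MAX); // true: halving both operands keeps it finite
console.log(is_equal(0.1, 0.1), is_equal(0.1, 0.2)); // 1 0
console.log(is_equal(TINY, 0), is_equal(MAX, MAX));  // 0 1
```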

Unfortunately, we still don’t have a way to calculate x < 0 (which would give us x < y), but we’ll get back to that later.

Getting the Exponent

If we want to convert a double to its bitwise representation, we’ll need to extract its encoded exponent. So far, we can check if the exponent is zero or one.

We can use that to build a test for if the exponent is zero (i.e. the value is a subnormal), by adding constants that shift values with exponent one outside of the range:

CONSTEXPR double is_exp_0(double x) {
  return double_and(is_exp_0_or_1(x + p2(-1022)),
                    is_exp_0_or_1(x - p2(-1022)));
}

The other thing we want is to multiply by negative powers of two. This will subtract a constant from the exponent (leaving the fraction unchanged), unless the exponent reaches zero, in which case rounding will occur (possibly rounding up to a value with exponent one). This can be used to build tests for if the exponent is less than a given value. For example, is_exp_0_or_1(v * p2(-1024)) will be true if the exponent is less than 1024 + 2.

This can be used to binary search the value of the exponent:

CONSTEXPR double get_encoded_exponent(double v) {
  double tmp = v;
  double e = 0;

  #pragma unroll
  for (int test = 1024; test >= 1; test /= 2) {
    double trial = tmp * p2(-test);
    double too_small = is_exp_0_or_1(trial);

    tmp = select(too_small, tmp, trial);
    e += select(too_small, 0, test);
  }
  
  return select(is_exp_0_or_1(v), double_not(is_exp_0(v)), e + 2);
}

This will check if the encoded exponent is less than 2 + 1024, and if not, it’ll subtract 1024 from the encoded exponent (by multiplying by p2(-1024)), and add 1024.0 to our exponent value. This is repeated with smaller powers of two, until we know that the remaining encoded exponent is 0, 1, or 2, and the e variable will contain the amount subtracted. Finally, it uses the is_exp_0_or_1 and is_exp_0 functions to handle the zero and one cases explicitly.
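
To see the binary search at work, here is a JavaScript port checked against the real exponent bits. It is a sketch: the loop is left rolled for brevity, DataView appears only in the reference function, and p2, TINY, ROOT and reference_exponent are names introduced here.

```javascript
// Hypothetical helper: an exact power of two, built by repeated doubling or
// halving (each step is exactly representable for -1074 <= e <= 1023).
function p2(e) {
  let r = 1;
  for (let i = 0; i < Math.abs(e); i++) r *= e < 0 ? 0.5 : 2;
  return r;
}

const TINY = p2(-1074);
const ROOT = p2(537);
const double_not = (a) => 1 - a;
const select = (c, t, f) => c * t + double_not(c) * f;

function adding_smallest_is_precise(x) {
  let e = (x + TINY - x - TINY) * ROOT * ROOT; // -1, 0 or 1
  return 1 - e * e;
}

const is_exp_0_or_1 = (x) =>
  adding_smallest_is_precise(x) * adding_smallest_is_precise(-x);

const is_exp_0 = (x) =>
  is_exp_0_or_1(x + p2(-1022)) * is_exp_0_or_1(x - p2(-1022));

function get_encoded_exponent(v) {
  let tmp = v;
  let e = 0;
  for (let test = 1024; test >= 1; test /= 2) {
    const trial = tmp * p2(-test);
    const too_small = is_exp_0_or_1(trial);
    tmp = select(too_small, tmp, trial);
    e += select(too_small, 0, test);
  }
  return select(is_exp_0_or_1(v), double_not(is_exp_0(v)), e + 2);
}

// Reference only: read the 11 exponent bits directly.
function reference_exponent(x) {
  const view = new DataView(new ArrayBuffer(8));
  view.setFloat64(0, x);
  return (view.getUint32(0) >>> 20) & 0x7ff;
}

for (const x of [0, 5e-324, p2(-1022), 0.1, 1, -1, 3.5e100, Number.MAX_VALUE]) {
  console.log(x, get_encoded_exponent(x), reference_exponent(x));
}
```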

Complete Comparisons

This is a great step towards bitwise casts, but tmp in get_encoded_exponent is interesting. By the end of the function, we’ve preserved its sign and fraction bits, but its exponent has been converted to only 0, 1, or 2. This makes the challenge of testing x < 0 much simpler.

We can easily define a make_exp_0_or_1 function, that does the same thing, but also halves values that were left with exponent two:

CONSTEXPR double make_exp_0_or_1(double v) {
  double res = v;

  #pragma unroll
  for (int test = 1024; test >= 1; test /= 2) {
    double trial = res * p2(-test);
    res = select(is_exp_0_or_1(trial), res, trial);
  }

  return select(is_exp_0_or_1(res), res, res * p2(-1));
}

Now we can add a constant that shifts every non-negative value out of the subnormal (zero exponent) range, such that only values less than zero pass the is_exp_0 test.

CONSTEXPR double is_less_than_zero(double v) {
  return is_exp_0(make_exp_0_or_1(v) + p2(-1022));
}

And, using our cmp_sub from earlier, we can compute (x < y):

CONSTEXPR double is_less_than(double a, double b) {
  return is_less_than_zero(cmp_sub(a, b));
}
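
Putting the pieces so far together, a JavaScript port of the ordering comparison could be sketched as below. The helper definitions are repeated so the block runs on its own; p2, TINY, ROOT and MAX are names introduced here, and cmp_sub uses is_exp_0_or_1 for its small-value test.

```javascript
// Hypothetical helper: an exact power of two, built by repeated doubling or
// halving (each step is exactly representable for -1074 <= e <= 1023).
function p2(e) {
  let r = 1;
  for (let i = 0; i < Math.abs(e); i++) r *= e < 0 ? 0.5 : 2;
  return r;
}

const TINY = p2(-1074);
const ROOT = p2(537);
const double_not = (a) => 1 - a;
const double_or = (a, b) => a + b - a * b;
const select = (c, t, f) => c * t + double_not(c) * f;

function adding_smallest_is_precise(x) {
  let e = (x + TINY - x - TINY) * ROOT * ROOT; // -1, 0 or 1
  return 1 - e * e;
}

const is_exp_0_or_1 = (x) =>
  adding_smallest_is_precise(x) * adding_smallest_is_precise(-x);

const is_exp_0 = (x) =>
  is_exp_0_or_1(x + p2(-1022)) * is_exp_0_or_1(x - p2(-1022));

function make_exp_0_or_1(v) {
  let res = v;
  for (let test = 1024; test >= 1; test /= 2) {
    const trial = res * p2(-test);
    res = select(is_exp_0_or_1(trial), res, trial);
  }
  return select(is_exp_0_or_1(res), res, res * 0.5);
}

function cmp_sub(x, y) {
  const small = double_or(is_exp_0_or_1(x), is_exp_0_or_1(y));
  const multiplier = (small + 1) * 0.5;
  return x * multiplier - y * multiplier;
}

const is_less_than_zero = (v) => is_exp_0(make_exp_0_or_1(v) + p2(-1022));
const is_less_than = (a, b) => is_less_than_zero(cmp_sub(a, b));

const MAX = Number.MAX_VALUE;
console.log(is_less_than(1, 2), is_less_than(2, 1), is_less_than(1, 1)); // 1 0 0
console.log(is_less_than(-MAX, MAX), is_less_than(0, TINY));             // 1 1
```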

Floor

The final tool we need before we can put together our bitwise casts is floor. For this, we’ll consider only numbers between zero and p2(52), and we’ll use a trick I’ve seen in the past (e.g. in musl libc’s floor.c). The trick is to add and subtract p2(52). Within the range p2(52) to p2(53), the ULP is exactly 1, so x + p2(52) - p2(52) performs a round-to-nearest-integer operation. From here, we can simply check if it rounded up, and subtract 1 if it did:

CONSTEXPR double small_positive_floor(double v) {
  // WARNING: incorrect for negative numbers and some
  // values over p2(52)
  // (but works for zero despite the name)
  double r = v + p2(52) - p2(52);
  return select(is_less_than(v, r), r - 1, r);
}

This lets us extract specific bits from a floating point integer. Specifically, I use the following idiom to split n low bits from an integer x: high_part = floor(x * p2(-n)); low_part = x - high_part * p2(n);
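
The rounding trick and the bit-splitting idiom can be tried in JavaScript. In this sketch an ordinary comparison stands in for is_less_than and select to keep it short, and p2, P52 and split_low_bits are names introduced for illustration.

```javascript
// Hypothetical helper: an exact power of two, built by repeated doubling or
// halving (each step is exactly representable for -1074 <= e <= 1023).
function p2(e) {
  let r = 1;
  for (let i = 0; i < Math.abs(e); i++) r *= e < 0 ? 0.5 : 2;
  return r;
}

const P52 = p2(52);

function small_positive_floor(v) {
  const r = v + P52 - P52; // rounds v to the nearest integer, ties to even
  // The article uses select(is_less_than(v, r), r - 1, r) here. An ordinary
  // comparison is used instead for brevity.
  return v < r ? r - 1 : r;
}

// Split the n low bits off a non-negative integer x.
function split_low_bits(x, n) {
  const high_part = small_positive_floor(x * p2(-n));
  const low_part = x - high_part * p2(n);
  return { high_part, low_part };
}

console.log(2.5 + P52 - P52, 3.5 + P52 - P52); // 2 4
console.log(small_positive_floor(2.5), small_positive_floor(3.5)); // 2 3
console.log(split_low_bits(0xabcd, 8)); // { high_part: 171, low_part: 205 }
```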

Double to bits

So, how close are we to converting a double to its bits? get_encoded_exponent gives us the exponent bits. is_less_than_zero gives us the sign bit.

For the fraction, make_exp_0_or_1 has given us all the fraction bits, but preserved the sign, and the implicit leading 1 if the number isn’t subnormal.

We can clear the sign bit by multiplying by -1 if the value is negative. We can subtract the implicit leading 1 if the value isn’t subnormal to be left with only the fraction bits, and then scale it up by p2(1074) so that a fraction of 1 is 1.0:

CONSTEXPR double get_fraction(double v) {
  double result = make_exp_0_or_1(v) *
                  select(is_less_than_zero(v), -1, 1);
  result -= select(is_exp_0(v), 0, p2(-1022));
  result = result * p2(1074 / 2) * p2(1074 / 2); 
  return result;
}

This gives us a 1-bit sign value, an 11-bit exponent value, and a 52-bit fraction value (all stored as integers within doubles), so we just need to split that into two 32-bit values.

These traditionally bitwise ops are written using multiplication by powers of two as a constant shift (with floor to truncate the result), addition to set bits (instead of bitwise “or”), and subtraction to clear bits (instead of bitwise “and”):

struct low_high_doubles {
  double low;
  double high;
};

CONSTEXPR struct low_high_doubles constexpr_double_as_ints(double v){
  double sign = is_less_than_zero(v);
  double exponent = get_encoded_exponent(v);
  double fraction = get_fraction(v);

  double high_fraction = small_positive_floor(fraction * p2(-32));
  double high = sign * p2(31) + exponent * p2(20) + high_fraction;
  double low = fraction - high_fraction * p2(32);
  return { low, high };
}
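
The final packing step can be looked at in isolation. In this JavaScript sketch the three fields are supplied directly, Math.floor stands in for small_positive_floor, and DataView is used only to show the expected answer. The names p2 and fields_to_uint32s are made up for this illustration.

```javascript
// Hypothetical helper: an exact power of two, built by repeated doubling or
// halving (each step is exactly representable for -1074 <= e <= 1023).
function p2(e) {
  let r = 1;
  for (let i = 0; i < Math.abs(e); i++) r *= e < 0 ? 0.5 : 2;
  return r;
}

function fields_to_uint32s(sign, exponent, fraction) {
  // Math.floor stands in for small_positive_floor.
  const high_fraction = Math.floor(fraction * p2(-32)); // top 20 fraction bits
  const high = sign * p2(31) + exponent * p2(20) + high_fraction;
  const low = fraction - high_fraction * p2(32);        // bottom 32 fraction bits
  return { low, high };
}

// 1.5 is binary 1.1 times 2^0: sign 0, encoded exponent 1023, and only the
// top fraction bit set.
const packed = fields_to_uint32s(0, 1023, p2(51));
console.log(packed.high.toString(16), packed.low.toString(16)); // 3ff80000 0

// Reference only: the same words read through DataView.
const view = new DataView(new ArrayBuffer(8));
view.setFloat64(0, 1.5);
console.log(view.getUint32(0).toString(16), view.getUint32(4).toString(16)); // 3ff80000 0
```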

Bits to double

To convert bits to double, we can roughly follow the inverse. This is conceptually a bit simpler, so it’s only explained lightly in the comments:

CONSTEXPR double double_from_sign_exp_fraction(
    double sign, double exponent, double fraction) {
  double exp_is_non_zero = double_not(is_zero(exponent));

  // scale fraction down to exponent 0
  double v = fraction * p2(-1074);

  // add the implicit leading one if needed (so exponent = 1)
  v += select(exp_is_non_zero, p2(-1022), 0);

  // compute how much we need to increment the exponent by
  double e = select(exp_is_non_zero, exponent - 1, 0);

  // shift it so that all but the first bit is after the point
  e *= p2(-10);

  #pragma unroll
  for (int test = 1024; test >= 1; test >>= 1) {
    // cond will be 1 if the relevant bit is set, otherwise 0
    double cond = small_positive_floor(e);

    // clear the current bit and shift the next bit into the
    // ones place
    e = (e - cond) * 2;
    if (test == 1024) {
      // p2(1024) is unrepresentable, so multiply by its
      // square root twice
      v *= select(cond, p2(512), 1.0);
      v *= select(cond, p2(512), 1.0);
    } else {
      v *= select(cond, p2(test), 1.0);
    }
  }

  // generate a NaN value if one is expected.
  double is_nan = double_and(is_equal(exponent, 2047),
                             double_not(is_zero(fraction)));

  // if it's a NaN, "v" will already be Infinity, so multiply by
  // zero to make it NaN, otherwise multiply by one to leave it
  // as-is.
  v *= double_not(is_nan);

  // set the sign bit
  v *= select(sign, -1, 1);

  return v;
}

Finally, we just need to extract the sign, exponent and fraction fields from the high and low unsigned 32-bit integers:

CONSTEXPR double constexpr_ints_as_double(double l, double h) {
  double exp_and_sign = small_positive_floor(h * p2(-20));

  double sign = small_positive_floor(h * p2(-31));
  double exponent = exp_and_sign - sign * p2(11);

  double fraction = (h - exp_and_sign * p2(20)) * p2(32) + l;

  return double_from_sign_exp_fraction(sign, exponent, fraction);
}

The code presented above is true to my initial implementation, but ends up quite bloated, compiling to around 5000 add, subtract or multiply operations (assuming it’s all inlined and unrolled). You can see it on Compiler Explorer or gist.
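
For experimentation, the bits-to-double direction can be ported to JavaScript roughly as follows. This is a sketch rather than the author's optimised code: Math.floor stands in for small_positive_floor, ordinary comparisons stand in for is_zero and is_equal, and p2 is a hypothetical stand-in for the omitted helper.

```javascript
// Hypothetical helper: an exact power of two, built by repeated doubling or
// halving (each step is exactly representable for -1074 <= e <= 1023).
function p2(e) {
  let r = 1;
  for (let i = 0; i < Math.abs(e); i++) r *= e < 0 ? 0.5 : 2;
  return r;
}

const double_not = (a) => 1 - a;
const select = (c, t, f) => c * t + double_not(c) * f;

function double_from_sign_exp_fraction(sign, exponent, fraction) {
  const exp_is_non_zero = exponent !== 0 ? 1 : 0; // stand-in for is_zero
  let v = fraction * p2(-1074);                   // fraction at exponent 0
  v += select(exp_is_non_zero, p2(-1022), 0);     // implicit leading one
  let e = select(exp_is_non_zero, exponent - 1, 0) * p2(-10);
  for (let test = 1024; test >= 1; test /= 2) {
    const cond = Math.floor(e); // 1 if this exponent bit is set, otherwise 0
    e = (e - cond) * 2;
    if (test === 1024) {
      // p2(1024) is unrepresentable, so multiply by its square root twice
      v *= select(cond, p2(512), 1);
      v *= select(cond, p2(512), 1);
    } else {
      v *= select(cond, p2(test), 1);
    }
  }
  const is_nan = exponent === 2047 && fraction !== 0 ? 1 : 0;
  v *= double_not(is_nan); // Infinity * 0 gives NaN
  v *= select(sign, -1, 1);
  return v;
}

function constexpr_ints_as_double(l, h) {
  const exp_and_sign = Math.floor(h * p2(-20));
  const sign = Math.floor(h * p2(-31));
  const exponent = exp_and_sign - sign * p2(11);
  const fraction = (h - exp_and_sign * p2(20)) * p2(32) + l;
  return double_from_sign_exp_fraction(sign, exponent, fraction);
}

// The bit pattern of Math.PI is 0x400921FB54442D18.
console.log(constexpr_ints_as_double(0x54442d18, 0x400921fb) === Math.PI); // true
```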

“Dirty” floor trick

Perhaps that would be a good place to leave it, but I tried to optimise the number of operations a little. To decrease the size to something comparable to what’s shown in the initial Javascript (around 368 operations), a number of less safe or less clear functions and techniques are used.

The biggest problem is floor, which requires the make_exp_0_or_1 operation every time (binary searching the exponent takes a fair number of instructions). In every situation we use “floor” we know a lot about the range of the value, and the number of bits present after the point. This lets us implement floor without a comparison, by just biasing the input numbers such that round-to-nearest-ties-to-even will round down.

CONSTEXPR double dirty_floor(double v) {
  // for values between 0 and 0x100000 with up to 32 significant bits
  // after the "decimal" point.
  return v - (0.5 - p2(-33)) + (p2(52)+1) - (p2(52)+1);
}

This might be the most complex trick, so to explain a little more: ignoring edge cases we could say that floor(x) == roundToNearestEven(x - 0.5). But the “edge case” here is integers, which will end up exactly between two integers, so round-to-even will round half of all integers down, giving the wrong result.

We can get the right result by subtracting slightly less than 0.5 instead. How much less? Well, it can’t make any other value land on 0.5, so it must be smaller than the smallest distance between possible inputs. But it also can’t get rounded off, so it must be at least the ULP for the biggest possible input.

This is impossible to solve if you have 53 significant bits, but fortunately we don’t. The constant chosen works out exactly for our 52-bit fraction being shifted right by 32, and happens to work everywhere else, as there are both fewer significant bits and no larger values.
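
A JavaScript sketch makes the difference between the naive bias and the chosen constant visible. The names BIAS, BIG and naive_floor are made up here, p2 is a hypothetical stand-in for the omitted helper, and the inputs are restricted to the range and precision described in the comment on dirty_floor.

```javascript
// Hypothetical helper: an exact power of two, built by repeated doubling or
// halving (each step is exactly representable for -1074 <= e <= 1023).
function p2(e) {
  let r = 1;
  for (let i = 0; i < Math.abs(e); i++) r *= e < 0 ? 0.5 : 2;
  return r;
}

const BIAS = 0.5 - p2(-33); // slightly less than one half
const BIG = p2(52) + 1;     // doubles from p2(52) to p2(53) are exactly 1 apart

const dirty_floor = (v) => v - BIAS + BIG - BIG;

// For contrast: subtracting exactly 0.5 leaves every integer on a tie.
const naive_floor = (v) => v - 0.5 + p2(52) - p2(52);

console.log(naive_floor(3), naive_floor(4)); // 2 4 (odd integers go the wrong way)
console.log(dirty_floor(3), dirty_floor(4)); // 3 4
console.log(dirty_floor(3.5), dirty_floor(3 - p2(-32))); // 3 2
console.log(dirty_floor(0)); // 0
```

Adding p2(52) + 1 rather than p2(52) appears to matter for small inputs: with v near zero, v - BIAS is negative, and the extra 1 keeps the sum above p2(52), where the spacing between doubles is exactly 1.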

More tweaks

Revisiting the initial comparison, a cheaper symmetrical boolean test was found. This computes -p2(-1021) <= d <= p2(-1021) (i.e. the same as is_exp_0_or_1 but including one value on either side).

CONSTEXPR double is_exp_around_0_or_1(double v) {
  double biased = v - p2(-1074);
  return (biased + p2(-1074) - biased) * p2(1074 / 2) * p2(1074 / 2);
}

(This can be analysed case-by-case, but essentially the initial bias both makes it symmetrical, and prevents a subsequent round-to-even from ever rounding away from the biased value, simplifying the conversion to boolean.)
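
The boundary behaviour can be spot-checked with a JavaScript port. This is a sketch, and p2, TINY, ROOT, edge and next are names introduced here.

```javascript
// Hypothetical helper: an exact power of two, built by repeated doubling or
// halving (each step is exactly representable for -1074 <= e <= 1023).
function p2(e) {
  let r = 1;
  for (let i = 0; i < Math.abs(e); i++) r *= e < 0 ? 0.5 : 2;
  return r;
}

const TINY = p2(-1074);
const ROOT = p2(537); // p2(1074 / 2)

function is_exp_around_0_or_1(v) {
  const biased = v - TINY;
  return (biased + TINY - biased) * ROOT * ROOT;
}

const edge = p2(-1021);        // smallest positive value with exponent 2
const next = edge + p2(-1073); // the next double above edge

console.log(is_exp_around_0_or_1(edge), is_exp_around_0_or_1(-edge)); // 1 1
console.log(is_exp_around_0_or_1(next), is_exp_around_0_or_1(-next)); // 0 0
console.log(is_exp_around_0_or_1(0), is_exp_around_0_or_1(1));        // 1 0
```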

We can go a bit further to try to replace is_exp_0_or_1 by multiplying the input by the smallest double greater than one. Unfortunately, this can generate Infinity when called on arbitrary values, but we can use it on all but the first iteration of our exponent decreasing loops.

CONSTEXPR double unsafe_is_exp_0_or_1(double v) {
  // only works for non-huge numbers
  return is_exp_around_0_or_1(v * (p2(0) + p2(-52)));
}

We can use much coarser comparisons when we know a number is either zero or a long way from zero, as we do when comparing the “exponent” or “fraction” values:

CONSTEXPR double is_integer_zero(double v) {
  return (v + p2(-1022) - v) * p2(1022);
}

CONSTEXPR double is_small_integer_equal(double a, double b) {
  return is_integer_zero(a - b);
}

Despite the names, I can and did use these methods on non-integers without worrying, when I knew they were well above roughly p2(-900) (around which we might have to worry about the addition being accurate for a non-zero value).
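
Ported to JavaScript, the coarse tests behave as follows on values that are either zero or far from zero (a sketch, with p2 as a hypothetical stand-in for the omitted helper):

```javascript
// Hypothetical helper: an exact power of two, built by repeated doubling or
// halving (each step is exactly representable for -1074 <= e <= 1023).
function p2(e) {
  let r = 1;
  for (let i = 0; i < Math.abs(e); i++) r *= e < 0 ? 0.5 : 2;
  return r;
}

function is_integer_zero(v) {
  // For v == 0 the sum is p2(-1022), which scales up to 1. For any value far
  // from zero the addend is rounded away entirely, leaving 0.
  return (v + p2(-1022) - v) * p2(1022);
}

const is_small_integer_equal = (a, b) => is_integer_zero(a - b);

console.log(is_integer_zero(0), is_integer_zero(1), is_integer_zero(-2047)); // 1 0 0
console.log(is_small_integer_equal(2047, 2047), is_small_integer_equal(2047, 2046)); // 1 0
```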

Finally, there were just a lot of little simplifications that the compiler cannot perform. A lot of duplicate work was removed by computing sign, exponent and fraction at the same time in one big function. Throughout the code, select(cond, x, y) with constant x and y could often be written as (x - y) * cond + y, which simplifies even further if y is zero. And there were plenty of other algebraic simplifications of little note.
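
The select rewrite is easy to check for boolean conditions. In this JavaScript sketch select_folded is a name made up for illustration; constants such as 65535.0 and 255.0 followed by adding 1.0 in uint32s_as_double at the top of the article look consistent with this form (65536 - 1 and 256 - 1).

```javascript
const select = (c, x, y) => c * x + (1 - c) * y;

// With constant x and y, (x - y) can be folded into a single constant,
// leaving one multiplication and one addition.
const select_folded = (c, x, y) => (x - y) * c + y;

for (const c of [0, 1]) {
  console.log(select(c, 65536, 1) === select_folded(c, 65536, 1)); // true
  console.log(select(c, 1024, 0) === 1024 * c); // true: with y == 0 only a multiply is left
}
```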

You can find my optimised code on Compiler Explorer or gist. (Although this doesn’t quite match the code in this post, it should match the Javascript at the top closely.)

The Javascript was generated by compiling the optimised code with clang++ -O2 -fno-slp-vectorize -march=skylake -std=c++14 -fomit-frame-pointer -S, which generated an assembly file containing a stream of vaddsd, vsubsd and vmulsd instructions, as well as vmovsd instructions to load constants. These instructions were translated into Javascript using a terrible Python script.

Future work

As noted, this was a completely pointless exercise, but it does open up some avenues for further pointless exercises:

Can it be generalised to work for float as well (splitting to two 16-bit values)?
Can it be extended to other rounding modes? All other rounding modes?
Are there simpler or smaller implementations of the various operations used?
Could it be turned into a reasonable expression, with no variables, just nested additions and subtractions? Doing so naively gives multi-gigabyte results, but no effort was made to optimise for this.
This roughly shows that any function from finite doubles to finite doubles can be implemented. How hard is it to approximate division? How many operations would it take to implement correctly rounded division?

I’d also like to generate a version of the Javascript where all the constants are synthesised from “2.0” and “0.5”, so as to try to be portable to restricted environments with potentially inaccurate floating-point constant parsing.

3 thoughts on “Bitwise conversion of doubles using only floating-point multiplication and addition”

diffuse on November 25, 2020 at 6:21 pm said:

Minor nitpick – shouldn’t 0x7FFF be 0x7FF?

- dougallj on March 16, 2021 at 11:56 pm said:
  
  Ah, just saw this, fixed, thanks!
  
diffuse on November 25, 2020 at 6:24 pm said:

Awsome!


dougallj
