@neshume
Created May 21, 2017 22:05
Fast half-precision to single-precision floating point conversion
#include <stdint.h>

// float32
// Martin Kallman
//
// Fast half-precision to single-precision floating point conversion
// - Supports signed zero and denormals-as-zero (DAZ)
// - Does not support infinities or NaN
// - Few, partially pipelinable, non-branching instructions
// - Core operations ~6 clock cycles on modern x86-64
void float32(float* __restrict out, const uint16_t in) {
    uint32_t t1;
    uint32_t t2;
    uint32_t t3;

    t1 = in & 0x7fff;  // Non-sign bits
    t2 = in & 0x8000;  // Sign bit
    t3 = in & 0x7c00;  // Exponent

    t1 <<= 13;  // Align mantissa on MSB
    t2 <<= 16;  // Shift sign bit into position

    t1 += 0x38000000;  // Adjust bias

    t1 = (t3 == 0 ? 0 : t1);  // Denormals-as-zero

    t1 |= t2;  // Re-insert sign bit

    *((uint32_t*)out) = t1;
}
// float16
// Martin Kallman
//
// Fast single-precision to half-precision floating point conversion
// - Supports signed zero, denormals-as-zero (DAZ), flush-to-zero (FTZ),
// clamp-to-max
// - Does not support infinities or NaN
// - Few, partially pipelinable, non-branching instructions
// - Core operations ~10 clock cycles on modern x86-64
void float16(uint16_t* __restrict out, const float in) {
    uint32_t inu = *((uint32_t*)&in);
    uint32_t t1;
    uint32_t t2;
    uint32_t t3;

    t1 = inu & 0x7fffffff;  // Non-sign bits
    t2 = inu & 0x80000000;  // Sign bit
    t3 = inu & 0x7f800000;  // Exponent

    t1 >>= 13;  // Align mantissa on MSB
    t2 >>= 16;  // Shift sign bit into position

    t1 -= 0x1c000;  // Adjust bias

    t1 = (t3 > 0x38800000) ? 0 : t1;  // Flush-to-zero
    t1 = (t3 < 0x8e000000) ? 0x7bff : t1;  // Clamp-to-max
    t1 = (t3 == 0 ? 0 : t1);  // Denormals-as-zero

    t1 |= t2;  // Re-insert sign bit

    *((uint16_t*)out) = t1;
}
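
As a worked check of the half-to-single steps in float32 above, here is a minimal sketch that returns the result instead of writing through a pointer. The names half_to_float_bits and half_to_float are hypothetical and not part of the gist, and memcpy is swapped in for the pointer cast as an assumption about what a strict-aliasing-safe variant could look like. The bias constant 0x38000000 is (127 - 15) << 23, the difference between the two exponent biases moved into the float32 exponent field.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not in the gist): float32 bit pattern for a half input.
   Follows the same steps as float32 above. */
static uint32_t half_to_float_bits(uint16_t h) {
    uint32_t mag  = ((uint32_t)h & 0x7fffu) << 13;  /* exponent + mantissa, aligned to float32 */
    uint32_t sign = ((uint32_t)h & 0x8000u) << 16;  /* sign bit moved to bit 31 */
    uint32_t expo = (uint32_t)h & 0x7c00u;          /* half exponent field */

    mag += (uint32_t)(127 - 15) << 23;              /* adjust bias: adds 0x38000000 */
    mag  = (expo == 0) ? 0u : mag;                  /* zero and denormals become zero */

    return sign | mag;
}

/* Hypothetical wrapper: reinterpret the bits as a float via memcpy
   (assumption: used here instead of the gist's pointer cast). */
static float half_to_float(uint16_t h) {
    uint32_t bits = half_to_float_bits(h);
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}
```

For example, the half 0x3c00 (1.0) becomes (0x3c00 << 13) + 0x38000000 = 0x3f800000, which is 1.0 in single precision. As in the original, infinities and NaN are not handled.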
@TriceHelix

For future reference, the float16 function is broken and produces incorrect results. These two lines cause the problem:
https://gist.github.com/neshume/0edc6ae1c5ad332bb4c62026be68a2fb#file-float16-c-L54-L55
The comparison operators are flipped. The correct code should be:
t1 = (t3 < 0x38800000) ? 0 : t1; // Flush to zero if the exponent is below the minimum allowed
t1 = (t3 > 0x8e000000) ? 0x7bff : t1; // Clamp to the max value if the exponent is above the maximum allowed

Cheers!

@TriceHelix

TriceHelix commented Nov 12, 2022

Another fix!
Clamping to the max value was behaving incorrectly for numbers much larger than the maximum. These are the final corrected lines:
t1 = (t3 < 0x38800000) ? 0 : t1;
t1 = (t1 > 0x7bff) ? 0x7bff : t1; // Compare all non-sign bits (the absolute value) against the maximum possible value, i.e. an ordinary clamp
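
Putting both corrections together, the single-to-half conversion might look like the sketch below. The name float_to_half_clamped is hypothetical, returning by value and using memcpy instead of the pointer cast are assumptions, and the separate denormals-as-zero line is dropped because the flush test already covers a zero exponent. The flush threshold 0x38800000 is (127 - 15 + 1) << 23, the float32 exponent of the smallest normal half. Since t3 is masked with 0x7f800000 it can never exceed 0x8e000000, which is why the clamp compares t1 against 0x7bff instead.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not in the gist): float16 with both fixes from the comments applied. */
static uint16_t float_to_half_clamped(float in) {
    uint32_t inu;
    memcpy(&inu, &in, sizeof inu);        /* assumption: memcpy instead of a pointer cast */

    uint32_t t1 = inu & 0x7fffffffu;      /* non-sign bits */
    uint32_t t2 = inu & 0x80000000u;      /* sign bit */
    uint32_t t3 = inu & 0x7f800000u;      /* exponent */

    t1 >>= 13;                            /* align mantissa (truncates, no rounding) */
    t2 >>= 16;                            /* shift sign bit into position */

    t1 -= 0x1c000u;                       /* adjust bias; wraps for tiny inputs, fixed by the flush */

    t1 = (t3 < 0x38800000u) ? 0u : t1;    /* flush-to-zero, also covers zero and denormal inputs */
    t1 = (t1 > 0x7bffu) ? 0x7bffu : t1;   /* clamp-to-max on the whole magnitude */

    return (uint16_t)(t1 | t2);           /* re-insert sign bit */
}
```

With these lines, 1.0f maps to 0x3c00 and 65504.0f to 0x7bff, while anything larger in magnitude (including infinities and NaN, which remain unsupported as values) is clamped to 0x7bff with the sign preserved.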
