Created
May 21, 2017 22:05
Fast half-precision to single-precision floating point conversion
#include <stdint.h>

// float32
// Martin Kallman
//
// Fast half-precision to single-precision floating point conversion
//  - Supports signed zero and denormals-as-zero (DAZ)
//  - Does not support infinities or NaN
//  - Few, partially pipelinable, non-branching instructions
//  - Core operations ~6 clock cycles on modern x86-64
void float32(float* __restrict out, const uint16_t in) {
    uint32_t t1;
    uint32_t t2;
    uint32_t t3;

    t1 = in & 0x7fff;                       // Non-sign bits
    t2 = in & 0x8000;                       // Sign bit
    t3 = in & 0x7c00;                       // Exponent

    t1 <<= 13;                              // Align mantissa on MSB
    t2 <<= 16;                              // Shift sign bit into position

    t1 += 0x38000000;                       // Adjust bias

    t1 = (t3 == 0 ? 0 : t1);                // Denormals-as-zero

    t1 |= t2;                               // Re-insert sign bit

    *((uint32_t*)out) = t1;
};

// float16
// Martin Kallman
//
// Fast single-precision to half-precision floating point conversion
//  - Supports signed zero, denormals-as-zero (DAZ), flush-to-zero (FTZ),
//    clamp-to-max
//  - Does not support infinities or NaN
//  - Few, partially pipelinable, non-branching instructions
//  - Core operations ~10 clock cycles on modern x86-64
void float16(uint16_t* __restrict out, const float in) {
    uint32_t inu = *((uint32_t*)&in);
    uint32_t t1;
    uint32_t t2;
    uint32_t t3;

    t1 = inu & 0x7fffffff;                  // Non-sign bits
    t2 = inu & 0x80000000;                  // Sign bit
    t3 = inu & 0x7f800000;                  // Exponent

    t1 >>= 13;                              // Align mantissa on MSB
    t2 >>= 16;                              // Shift sign bit into position

    t1 -= 0x1c000;                          // Adjust bias

    // NOTE: the next two comparisons are reversed as published;
    // see the comments below for the corrected lines.
    t1 = (t3 > 0x38800000) ? 0 : t1;        // Flush-to-zero
    t1 = (t3 < 0x8e000000) ? 0x7bff : t1;   // Clamp-to-max
    t1 = (t3 == 0 ? 0 : t1);                // Denormals-as-zero

    t1 |= t2;                               // Re-insert sign bit

    *((uint16_t*)out) = t1;
};
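
As a quick way to follow the bit steps in float32, here is a minimal sketch that repeats the same masks, shifts, and bias adjustment but returns the value and copies the bits with memcpy instead of a pointer cast. The names half_to_float and half_to_float_selfcheck are hypothetical (not part of the gist), and the sketch assumes float is a 32-bit IEEE 754 type.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: same bit steps as float32() in the gist,
   assuming a 32-bit IEEE 754 float. */
static float half_to_float(uint16_t in) {
    uint32_t t1 = in & 0x7fffu;   /* exponent + mantissa */
    uint32_t t2 = in & 0x8000u;   /* sign bit */
    uint32_t t3 = in & 0x7c00u;   /* exponent only */
    float out;

    t1 <<= 13;                    /* 10-bit mantissa moves to the 23-bit field */
    t2 <<= 16;                    /* sign moves from bit 15 to bit 31 */
    t1 += 0x38000000u;            /* rebias exponent: (127 - 15) << 23 */
    t1 = (t3 == 0) ? 0 : t1;      /* zero and denormals become zero */
    t1 |= t2;                     /* re-insert sign */

    memcpy(&out, &t1, sizeof out);
    return out;
}

/* Hypothetical self-check with a few hand-verified bit patterns. */
static void half_to_float_selfcheck(void) {
    assert(half_to_float(0x3c00) == 1.0f);      /* 0x3f800000 */
    assert(half_to_float(0xc000) == -2.0f);     /* 0xc0000000 */
    assert(half_to_float(0x7bff) == 65504.0f);  /* largest finite half */
    assert(half_to_float(0x0001) == 0.0f);      /* denormal flushed to zero */
}
```

For example, 0x3c00 (1.0 in half precision) becomes 0x07800000 after the shift and 0x3f800000 after the bias adjustment, which is 1.0f.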
Another fix!
Clamping to the max value was behaving incorrectly for numbers much larger than the maximum. These are the final corrected lines:
t1 = (t3 < 0x38800000) ? 0 : t1;
t1 = (t1 > 0x7bff) ? 0x7bff : t1;
<- notice how we are now comparing all non-sign bits (aka the absolute value) to the maximum possible value, essentially performing a simple clamping operation like with traditional numbers
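
Putting the two corrected lines into the rest of the function gives something like the sketch below. It is an assembled version under stated assumptions, not the author's published code: the name float_to_half is hypothetical, the value is returned rather than written through a pointer, memcpy replaces the pointer cast, and float is assumed to be a 32-bit IEEE 754 type. As in the original, the mantissa is truncated rather than rounded.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: float16() from the gist with the two corrected
   comparisons from this comment applied. */
static uint16_t float_to_half(float in) {
    uint32_t inu, t1, t2, t3;
    memcpy(&inu, &in, sizeof inu);

    t1 = inu & 0x7fffffffu;              /* non-sign bits */
    t2 = inu & 0x80000000u;              /* sign bit */
    t3 = inu & 0x7f800000u;              /* exponent */

    t1 >>= 13;                           /* align mantissa */
    t2 >>= 16;                           /* sign moves from bit 31 to bit 15 */
    t1 -= 0x1c000u;                      /* adjust bias; may wrap for tiny
                                            inputs, which are flushed next */

    t1 = (t3 < 0x38800000u) ? 0 : t1;    /* flush-to-zero below 2^-14 */
    t1 = (t1 > 0x7bffu) ? 0x7bffu : t1;  /* clamp-to-max on all non-sign bits */
    t1 = (t3 == 0) ? 0 : t1;             /* kept from the original; already
                                            covered by the flush above */
    t1 |= t2;                            /* re-insert sign */
    return (uint16_t)t1;
}

/* Hypothetical self-check with hand-verified bit patterns. */
static void float_to_half_selfcheck(void) {
    assert(float_to_half(1.0f) == 0x3c00);
    assert(float_to_half(-2.0f) == 0xc000);
    assert(float_to_half(65504.0f) == 0x7bff);  /* largest finite half */
    assert(float_to_half(1e10f) == 0x7bff);     /* far above max: clamped */
    assert(float_to_half(1e-10f) == 0x0000);    /* far below min: flushed */
}
```

The flush has to run before the clamp: for inputs below 2^-14 the bias subtraction wraps around to a large unsigned value, and the clamp would otherwise turn it into 0x7bff.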
For future reference, the float16 function is broken and produces incorrect results. These two lines cause the problem: https://gist.github.com/neshume/0edc6ae1c5ad332bb4c62026be68a2fb#file-float16-c-L54-L55
The comparison operators are flipped. The correct code should be:
t1 = (t3 < 0x38800000) ? 0 : t1;
<- flushing to zero if the exponent is less than the minimum allowed
t1 = (t3 > 0x8e000000) ? 0x7bff : t1;
<- clamping to the max value if the exponent is greater than the maximum allowed
Cheers!
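
To see where these thresholds come from, the sketch below computes the float32 exponent field for the smallest and largest half-precision exponents. The helper name f32_exp_field is hypothetical. It also shows that 0x8e000000 is larger than anything the 0x7f800000 exponent mask can produce, so a test of the form t3 > 0x8e000000 can never be true; the constant appears to be the biased exponent 0x8e shifted by 24 bits rather than 23.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: float32 exponent field (as masked by 0x7f800000)
   for an unbiased exponent e. */
static uint32_t f32_exp_field(int e) {
    return (uint32_t)(e + 127) << 23;
}

/* Hypothetical self-check of the constants discussed in the comment. */
static void thresholds_selfcheck(void) {
    /* Smallest normal half is 2^-14: the flush-to-zero threshold. */
    assert(f32_exp_field(-14) == 0x38800000u);

    /* Largest finite half exponent is 2^15: biased exponent 142 = 0x8e. */
    assert(f32_exp_field(15) == 0x47000000u);
    assert((0x8eu << 23) == 0x47000000u);

    /* The exponent mask tops out at 0x7f800000, below 0x8e000000. */
    assert(0x8e000000u > 0x7f800000u);
}
```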