rfloat is a header-only library that makes your floating point code reproducible with zero overhead.
- Features
- Quick Start
- Why is Reproducibility Important
- Examples
- Design
- Supported Platforms
- Goals
- Limitations
- Issues
- Benchmarks
- Reproducibility Guarantees
- License
- Credit
- Additional Resources
## Features

- Zero unnecessary overhead
- Header-only
- Convert existing code with an `r` prefix
- Supports all major compilers
- Supports all major platforms
- Zero dependencies
- Full `<cmath>` support
- Compatible with C++17 and newer
## Quick Start

- Include the `<rfloat>` header.
- Replace usages of `float` & `double` with `rfloat` & `rdouble`.
- Optional: Replace `<cmath>` includes with `<rcmath>`.
- Optional: Replace `cmath` functions with their `rstd` equivalents: `std::sqrt` → `rstd::sqrt`.
## Why is Reproducibility Important

IEEE-754 floating point arithmetic is reproducible under certain conditions. However, floating point code is rarely reproducible in real programs due to:
- compiler flags & optimizations
- non-compliant hardware
- non-compliant implementations
- ambiguous standard semantics
Floating point non-reproducibility creates problems for many applications, especially games, robotics, and high-assurance code. Other libraries like dmath implement reproducible floating point through a variety of strategies, each of which has its own significant tradeoffs and performance implications.
rfloat uses a different approach focused on preventing the compiler from performing dangerous optimizations while still allowing it to generate the same instructions it otherwise would. This isn't perfect, but it's an excellent tradeoff that provides reproducibility in practice for almost all applications.
Note
It's impossible for any hard float library to guarantee complete reproducibility. Users who need such guarantees should consider soft float alternatives like Berkeley SoftFloat and GNU MPFR. See the section on Reproducibility Guarantees for why this is difficult.
## Examples

```cpp
#include <rfloat>

rfloat sum(rfloat a, rfloat b) {
    return a + b;
}
```
You can mix regular floating point types and reproducible types. The result will be a reproducible type.
```cpp
float a;
rfloat b;
auto c = a + b;
static_assert(std::is_same<decltype(c), rfloat>::value);
```
Reproducible types prohibit usage with types of differing sizes, as well as types with different rounding modes if the results would be ambiguous.
```cpp
rfloat a;
rdouble b;
rfloat c = a + b; // compile error
```
Reproducible types also allow unwrapping values to interact with existing code.
```cpp
rfloat a;
float b = a.underlying_value();

rfloat c;
rdouble d = c.fp64(); // Casts are allowed as long as they don't lose precision

rdouble e;
rfloat f = e.fp32(); // Compile error to prevent accidental narrowing
float g = e.underlying_value(); // Escape hatches exist
```
Expressions that compile with reproducible types should return the same results under any combination of compiler flags. There should be little to no performance cost beyond the operations themselves.
Code requiring specific rounding modes should call rstd::SetRoundingMode<T>() manually to ensure the environment is initialized to the correct rounding mode. Refer to compiler documentation for how this interacts with the runtime environment.
Note
Ensuring the environment has the correct rounding mode at runtime is left to the user. This is a rare requirement and can introduce significant overhead.
`<stdfloat>` is supported by defining the `ENABLE_STDFLOAT` macro.
rfloat also provides overloads for all of the `<cmath>` functions. Only reproducible overloads are enabled by default. This encompasses `abs`, `fma`, `sqrt`, and other basic operations on most platforms. Certain platforms do not implement all operations in a reproducible way. When this occurs, the affected functions can be enabled by defining `RSTD_NONDETERMINISM`.
```cpp
rdouble loan_cost(rdouble principal, rdouble annual_rate) {
    constexpr rdouble term = 5;
    rdouble rate = annual_rate / 100.0;
    return principal * rstd::pow(1 + rate, term);
}
```
Overloads are only as reproducible as the underlying standard library implementation. Users who need guaranteed reproducibility should evaluate dedicated soft float implementations like Berkeley SoftFloat and GNU MPFR, or other libraries with custom implementations like dmath, crlibm, and rlibm.
Note
If you want to evaluate standard library reproducibility on your platform, the reproducibility tests can check them by defining `RSTD_NONDETERMINISM` and `ENABLE_NONDETERMINISTIC_TESTS` when building the reproducibility tests, or by using `--preset debug` during CMake configuration with Clang or GCC.
Platform reproducibility issues are documented in the Issues section.
## Design

Inspiration for this library comes from Sherry Ignatchenko's talk on floating point reproducibility, which observed that C++ can be made practically reproducible if we can ensure sequencing between subsequent expressions separated by semicolons (`;`). In practice, Clang and GCC may optimize across lines, for example converting:
```cpp
float foo = a * b;
float bar = foo + c;
```
This may compile into something equivalent to
```cpp
float bar = fma(a, b, c);
```
Compilers do this because it produces a result faster with less rounding error. Compilers are imperfect though and don't apply this optimization consistently. The same code built in different translation units, built by different compilers, or even built by the same compiler for another platform may produce a different result for the same inputs.
rfloat prevents Clang and GCC from optimizing between expressions using two different strategies. The default strategy inserts a compiler intrinsic between potentially problematic expressions, which blocks codegen choices that might harm reproducibility, such as contraction, without introducing additional side effects. This method has negligible overhead at both compile time and runtime.
The second strategy uses inline assembly to prevent the compiler from reordering or fusing operations that could lead to reproducibility issues, by forcing the compiler to spill intermediate results into registers. This is a no-op on GCC and comes at the cost of an additional memory store on Clang. This approach may also lead to increased compile times. This strategy can be manually enabled by defining `BARRIER_IMPL_ASM` at compile time.
GCC provides a functioning barrier intrinsic (__builtin_assoc_barrier) that is used by default.
MSVC does not support inline assembly blocks or optimization barriers. Instead, `/fp:fast` is simply disabled for the implementation class. This has no overhead in most cases, but does produce an additional call per operation when reproducible types are mixed with non-reproducible types within translation units where `/fp:fast` is enabled. Code that does not mix non-reproducible types does not incur additional overhead.
rfloat on Clang resorts to a combination of barrier intrinsics and inline assembly to balance reproducibility and performance. The barrier intrinsic available in Clang (`__arithmetic_fence`) is broken on x86 and x64 platforms, but it can be combined with the inline assembly approach to recover nearly full performance. If `-ffast-math` is set, rfloat has to fall back to the full cost of the inline assembly approach discussed above to ensure reproducibility.
## Supported Platforms

| Support | Windows x64 | MacOS M1 | Linux x64 |
|---|---|---|---|
| Clang 14/15/16/18 | Untested | ✔️ | ✔️ |
| GCC 11 | Untested | Untested | ✔️ |
| MSVC | ✔️ | Untested | Untested |
The following platforms are all continuously tested via QEMU.
| Support | GCC | Clang |
|---|---|---|
| arm32 | ✔️ | ✔️ |
| aarch64 | ✔️ | ✔️ |
| ppc64el | ✔️ | ✔️* |
| s390x | ✔️ | ✔️ |
| mips64el | ✔️ | ✔️ |
Note
Platform combinations with an asterisk have documented issues
## Goals

rfloat aims to provide the best tradeoff between performance, reproducibility, and ease of use for most applications.
- Zero unnecessary overhead
- Easy to integrate with existing code
  - Converting most code involves adding a single `r` prefix
- Supports the full range of standard library functionality, including `<cmath>` and `std::numeric_limits`
- Reproducible by default
  - If it compiles, it is reproducible unless the user explicitly opts out
- Safely supports dangerous compiler flags like `-ffast-math` and `-funsafe-math-optimizations`
- Supports all modern, IEEE-754 compliant architectures
Note
Although supporting non-IEEE-754 platforms and runtimes is explicitly not a goal, using rfloat will improve reproducibility even on platforms that are not fully compliant.
## Limitations

- rfloat prohibits some compiler optimizations on code using reproducible types. Code incrementing a variable with `tmp += a;` 10 times will result in 10 floating point additions at runtime.
- Ensuring the environment has the correct rounding mode is left up to the user
- rfloat inherently cannot prevent all possible instances of compiler non-reproducibility.
- rfloat does not eliminate reproducibility issues caused by buggy or incomplete hardware implementations
- rfloat is only as reproducible as the inputs provided
  - The user is responsible for ensuring the same inputs are passed to the same operations in the same order
  - Float serialization can lead to reproducibility issues
- MSVC with `/fp:fast` results in additional overhead because the compiler is forced to convert every operation into a function call. Suggestions for improvement are welcome.
- Clang incurs unnecessary overhead on x64.
  - This overhead results from the need to work around Clang's broken `__arithmetic_fence()` intrinsic. The compiler frontend generates the correct IR for this intrinsic and tells the backend that contraction is disabled, but the backend will occasionally ignore this information and generate contractions regardless. One additional implication of this bug is that Clang's `-fprotect-parens` flag is broken in the same situations, as it shares the same implementation internally. Source level pragmas cannot be used to work around this problem either, as Clang ignores source code directives for floating point accuracy that conflict with build flags. Additionally, while the `__FAST_MATH__` macro is defined when `-ffast-math` is set, there is no compiler macro to detect `-ffp-contract=fast` and generate a build error. Since there is no reliable way to detect dangerous build flags, rfloat is forced to implement operations in ways that incur minor overhead on Clang. The overhead is increased for code used in translation units that set `-ffast-math`.
- Due to GCC Issue #71246, code that sets `BARRIER_IMPL_ASM` may have issues with certain combinations of compiler flags and platforms that have not been detected despite extensive testing.
- Test code uses infinities and NaNs even under `-ffast-math`, causing potential undefined behavior and generating diagnostic warnings on Clang 18+.
  - rfloat is explicitly intended to work even in this scenario. The compiler diagnostics are simply unhelpful here.
## Issues

The following is a list of known issues that affect reproducibility:
- LLVM does not propagate NaN payloads according to IEEE-754.
- This is documented, intended behavior by the LLVM project
- Clang targeting ppc64el with `-ffast-math` generates a non-IEEE-754 compliant `sqrt()` when inlined into unrolled loops. See the `ppc64el.SqrtRoundingBug` test case for details.
  - rfloat disables inlining for `std::sqrt` overloads for ppc64.
Unidentified reproducibility issues are considered bugs, please report them.
## Benchmarks

A whetstone benchmark is provided as a basic example and can be built by enabling the `RFLOAT_BENCHMARKS` option in CMake.
Note
rfloat performance is inherently sensitive to the source code, toolchain, and platform involved. Measurements are indicative only, and may not be valid for your source code, with your toolchain, on your hardware.
No performance differential observed on -O2 at 1000000 iterations.
| Compiler | double | rdouble |
|---|---|---|
| Clang 16 | 5882.4 MWIPS | 5882.4 MWIPS |
| GCC 11 | 5882.4 MWIPS | 5882.4 MWIPS |
A 6% performance differential was observed on `-O3 -ffast-math -funsafe-math-optimizations` at 1000000 iterations. Note that the performance of rfloat has not decreased; rather, additional optimizations have benefited the baseline implementation, at the cost of non-reproducible outputs.
| Compiler | double | rdouble |
|---|---|---|
| Clang 16 | 6250 MWIPS | 5882.4 MWIPS |
| GCC 11 | 6250 MWIPS | 5882.4 MWIPS |
All numbers in GFLOPS, rounded to 2 decimal digits.
| Size | double | rdouble | Slowdown relative to double |
|---|---|---|---|
| 32k | 5.48 | 5.47 | 0.18% |
| 64k | 5.47 | 5.48 | -0.10% |
| 128k | 5.48 | 5.52 | -0.64% |
| 256k | 5.47 | 5.51 | -0.66% |
| Size | double | rdouble | Slowdown relative to double |
|---|---|---|---|
| 32k | 7.69 | 6.29 | 22.25% |
| 64k | 7.71 | 6.30 | 22.38% |
| 128k | 7.68 | 6.30 | 21.90% |
| 256k | 7.70 | 6.18 | 24.59% |
The Clang slowdown results from Clang emitting unnecessary memory stores after some floating point operations and entirely eliding some functions in the benchmark. This is not representative of typical overheads, but is included to illustrate what may occur.
## Reproducibility Guarantees

Floating point reproducibility is extremely challenging to guarantee because it requires many different layers to work together.
- The IEEE-754 standard must specify a required behavior
- The floating point hardware must perfectly implement that behavior
- The compiler must provide a runtime that uses the hardware properly
- The compiler must not break the floating point semantics of your code when translating it to an executable
All of these layers have reproducibility issues. At the lowest level, the IEEE-754 standard does not specify required behaviors in all situations. For example:
- operations taking multiple inputs do not have a specified result when more than one input is NaN
- bounds for transcendental math functions are difficult to specify as a result of the Tablemaker's Dilemma
This leads to implementations choosing their own behaviors or worse, providing indeterminate results. Hardware vendors also do not perfectly implement the parts of IEEE-754 which are well specified for reasons of efficiency and cost. For example:
- ARMv7 FPUs do not support subnormal numbers for performance reasons
- x86 SSE does not implement the `maxss` and `minss` instructions (the max and min functions, respectively) according to IEEE-754 semantics
  - IEEE-754 requires returning the non-NaN argument if one argument is NaN
  - x86 returns the second argument unconditionally. This allows compilers to implement ternary operators like the `MAX` and `MIN` macros in a single instruction.
Occasionally, compiler runtimes also do not implement the IEEE-754 standard correctly. rfloat does not define `rstd::sqrt` when built with Clang for the ppc64el target because the optimizer incorrectly optimizes the runtime implementation in certain situations. These situations are rare and generally not encountered in real programs, although rfloat maintains compiler bug tests to document discovered examples.
Compilers sometimes simply fail to maintain IEEE-754 semantics while translating code into executables. This can have significant optimization benefits, but in many cases breakages are simply not noticeable for typical applications. For example:
- LLVM does not correctly propagate NaN values between operations, even in `strictfp` mode. LLVM also does not commit to a single strategy, choosing between a number of NaN strategies depending on the situation.
A more common reproducibility issue is an intentional breakage that often benefits application code: floating point contraction. Compilers are not able to guarantee that all possible contractions are optimized across all possible targets, which means that different programs accumulate rounding errors slightly differently when built for new targets, or even for the same target with a new compiler version. This behavior can usually be disabled with appropriate compiler flags, but disabling it at a more fine-grained level, or tracking down where all the differences have occurred, is not generally possible.
Many compilers also implement configuration flags that explicitly allow them to break IEEE-754 semantics when translating code. These flags are often specified at the level of an entire translation unit or even an entire project, making it difficult to build applications where only part of the code must ensure reproducibility.
Compiler breakages introduce the vast majority of floating point reproducibility issues. This is also the layer that rfloat targets. By resolving these breakages, most programs that anyone would intentionally write can be made reproducible without additional overhead, or significant changes.
## License

This project is licensed under the MIT License.
## Credit

This library was inspired by Guy Davidson's P3375R0 proposal and Sherry Ignatchenko's talk on Cross-Platform Floating-Point Determinism Out of the Box.