Tracking floating point errors with NaNs
Posted: 2024-06-14, 10:08:30
Most current computer systems use traps (software interrupts) or a global status flag to indicate floating point errors, such as overflow and division by zero. These methods are inefficient for vector processing, out-of-order processing, and speculative execution, as I have explained in this document.
The ForwardCom design proposes a new, efficient way of tracking floating point errors: a floating-point error generates a NaN (not-a-number) result with an error code embedded as a payload in the NaN. This NaN, including its error code, propagates through subsequent calculations. The error can then be observed in the final result of a series of calculations, or the code can check for errors at convenient places, for example as indicated by try/catch blocks.
The error code includes information about the type of error and possibly additional diagnostic information, such as the code address where the error occurred. This error tracking mechanism can be enabled or disabled for each type of error.
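To make the proposal concrete, here is a minimal sketch in C of how software could inspect a final result for a propagated error NaN. The make_error_nan helper and its payload layout (error code in the low bits) are hypothetical illustrations for this discussion, not the ForwardCom encoding; on ForwardCom the hardware itself would generate such a NaN.

#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <math.h>

#define QNAN_BITS 0x7FF8000000000000ULL   // exponent all ones + quiet bit

// Build a quiet double NaN carrying an error code in the low payload bits.
// (Illustration only: ForwardCom hardware would generate this NaN itself.)
double make_error_nan(uint64_t error_code) {
    uint64_t bits = QNAN_BITS | (error_code & 0x0007FFFFFFFFFFFFULL);
    double d;
    memcpy(&d, &bits, sizeof d);
    return d;
}

// Read the 51-bit payload (below the quiet bit) back out of a NaN result.
uint64_t nan_payload(double x) {
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);
    return bits & 0x0007FFFFFFFFFFFFULL;
}

int main(void) {
    double x = make_error_nan(5);   // pretend the hardware flagged error #5
    double y = (x + 1.0) * 2.0;     // the NaN propagates through the calculations
    if (isnan(y)) {
        printf("error code: %llu\n", (unsigned long long)nan_payload(y));
    }
    return 0;
}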
I want to discuss some details of this mechanism and hear your opinions on the best implementation. There are three issues:
1. Which part of the NaN payload is preserved when converting from double precision to single precision?
2. What happens when two different NaNs are combined, i.e. NaN1 + NaN2 = ?
3. Which error codes should have highest priority when two NaNs are combined?
Ad 1.
Current computers with binary floating point preserve the most significant bits of the payload when converting a NaN from double precision to single precision. This behavior is undocumented and not specified in the IEEE 754 standard for floating-point arithmetic. (Computers with decimal floating point do the opposite, but such systems are rarely used today anyway.)
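This behavior is easy to observe with a small C experiment on mainstream binary hardware (the output is platform dependent, precisely because the standard does not specify it):

#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void) {
    // Quiet double NaN with a recognizable 51-bit payload
    uint64_t dbits = 0x7FF8000000000000ULL | 0x0007EDCBA9876543ULL;
    double d;
    memcpy(&d, &dbits, sizeof d);

    float f = (float)d;             // hardware double-to-single conversion
    uint32_t fbits;
    memcpy(&fbits, &f, sizeof fbits);

    printf("double: %016llX\nfloat:  %08X\n", (unsigned long long)dbits, fbits);
    // On x86 and ARM I would expect float = 7FFF6E5D: the float mantissa is
    // the top 23 bits of the double mantissa, i.e. the most significant
    // payload bits are kept and the low 29 bits are discarded.
    return 0;
}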
Should ForwardCom preserve the most significant bits or the least significant bits?
An argument for preserving the least significant bits is that the NaN payload can then be interpreted as a simple integer. The most significant bits will just be zero when converting from low to high precision.
Arguments for preserving the most significant bits are: compatibility with current systems; simplicity of hardware implementation; and the fact that the ‘quiet’ bit is normally the most significant bit and has to be preserved anyway.
The current ForwardCom specification says that the least significant bits and the quiet bit are preserved. I am now considering changing this to preserve the most significant bits instead.
Ad 2.
ForwardCom specifies that if two different NaNs are combined, then the one with the highest payload is preserved in the result. This ensures that NaN1 + NaN2 = NaN2 + NaN1. Current systems just propagate the first NaN operand.
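As a software model of this rule (the real implementation would of course be in hardware), the combination can be sketched in C like this:

#include <stdint.h>

// Model of the ForwardCom combining rule for two double NaN operands,
// given as raw bit patterns: propagate the one with the highest payload.
uint64_t combine_nans(uint64_t a, uint64_t b) {
    uint64_t pa = a & 0x000FFFFFFFFFFFFFULL;  // mantissa incl. quiet bit
    uint64_t pb = b & 0x000FFFFFFFFFFFFFULL;
    // Symmetric in a and b: equal payloads represent the same error.
    return pa >= pb ? a : b;
}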
Ad 3.
If the NaN with the highest payload is propagated, then we can define a priority for cases where more than one error has occurred. We don’t want to use a timestamp for this, because that would not work with out-of-order processing. We may base the priority on the code address, to make it easier to find the error that occurred first, or we may base it on the type of error.
The current specification for ForwardCom prioritizes low addresses in order to find the error that occurred first. This requires that the code address be placed in the most significant bits of the NaN payload, while the error type is placed in the least significant bits. The address bits are inverted so that low addresses get the highest payload and thereby the highest priority.
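A C sketch of this layout, with hypothetical field widths (the widths are my own choice for illustration, not taken from the specification):

#include <stdint.h>

// Current layout: inverted code address in the most significant payload
// bits, error type in the least significant bits. The inversion makes the
// lowest address win under the highest-payload rule.
// Assumed widths: 36 address bits + 15 error type bits = 51 payload bits.
uint64_t encode_payload_addr_priority(uint64_t address, uint32_t error_type) {
    uint64_t inv_addr = (~address) & 0xFFFFFFFFFULL;   // invert 36 address bits
    return (inv_addr << 15) | (error_type & 0x7FFF);
}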
This method has a problem with non-linear code that jumps back and forth: if a debugger is to find the first of multiple errors, it has to check all variables for NaNs every time the code jumps backwards.
It would be useful to be able to find the first error, but I am now considering whether this is too complicated. It would be easier to prioritize errors based on the type of error. This requires that the error code be placed in the most significant bits of the payload, while the code address or other diagnostic information can go in the least significant bits. Unfortunately, the low part of the address will then be lost when converting from high to low precision, unless the address bits are placed in a different order.
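The alternative layout would look like this, again with hypothetical field widths. Note that a conversion to single precision that keeps the most significant payload bits would preserve the error type but discard most of the address:

#include <stdint.h>

// Alternative layout: error type in the most significant payload bits, so
// priority is decided by error type; the code address or other diagnostics
// go in the low bits, where they may be truncated by a conversion to
// lower precision. Assumed widths: 15 type bits + 36 address bits.
uint64_t encode_payload_type_priority(uint32_t error_type, uint64_t address) {
    return ((uint64_t)(error_type & 0x7FFF) << 36) | (address & 0xFFFFFFFFFULL);
}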
Now, we have to discuss which kinds of errors should have the highest priority.
The errors ‘inexact’ and ‘underflow’ should perhaps have low priority because these error types are rarely used. Tracking of ‘inexact’ and ‘underflow’ will be disabled in most cases, so their priority matters little.
‘Division by zero’ and ‘overflow’ can probably have high priority. We may distinguish between overflow in addition, multiplication, division, conversion, and other functions.
Invalid operations such as 0/0, ∞/∞, ∞ - ∞, 0*∞ may also have high priority.
Invalid inputs to mathematical functions such as sqrt, log, pow, etc. may each have their own error code, perhaps with intermediate priority.
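Putting the above together, a hypothetical priority ordering could look like the following, where a higher value wins under the highest-payload rule. These codes are placeholders for discussion, not part of the specification:

// Hypothetical error codes ordered by priority (higher value wins).
enum error_priority {
    ERR_INEXACT     = 1,   // rarely enabled, lowest priority
    ERR_UNDERFLOW   = 2,   // rarely enabled
    ERR_FUNC_DOMAIN = 3,   // invalid input to sqrt, log, pow, ...
    ERR_OVERFLOW    = 4,   // could be split by operation type
    ERR_DIV_ZERO    = 5,
    ERR_INVALID     = 6    // 0/0, inf/inf, inf - inf, 0*inf
};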
We may prioritize the kinds of errors that are most useful when debugging. Which ones would those be?