More efficient ways of detecting exceptions
Posted: 2020-02-02, 7:46:04
Floating point errors are traditionally detected in two ways: with a global status register or by traps (software interrupts). Both methods are problematic with out-of-order execution and vector processing (SIMD) for the following reasons.
A global status register has to be updated after every floating point operation. If multiple instructions are executed simultaneously or out of order, they all have to modify the status register. They may have to do so in order. Reading the status register is a serializing event: all preceding instructions have to retire before the status register can be read. The status register does not tell which instruction caused an exception. A vector instruction may generate multiple exceptions but set the status register only once.
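The last two points can be illustrated with a toy software model in Python. This is only a sketch, not a description of any real hardware; the names `StatusRegister` and `vector_mul` are invented for the illustration. A single sticky flag records that some overflow happened, but not which element caused it or how many elements overflowed:

```python
import math

class StatusRegister:
    """Toy model of a global, sticky floating point status register."""
    def __init__(self):
        self.overflow = False

def vector_mul(a, b, status):
    """Multiply element by element, recording overflow only in the shared flag."""
    out = []
    for x, y in zip(a, b):
        r = x * y
        # A product of two finite values that comes out infinite has overflowed.
        if math.isinf(r) and math.isfinite(x) and math.isfinite(y):
            status.overflow = True  # sticky: set once, never cleared here
        out.append(r)
    return out

status = StatusRegister()
result = vector_mul([1e300, 2.0, 1e300], [1e10, 3.0, 1e10], status)
# Two elements overflowed, but the flag can only say "at least one did".
print(status.overflow, result)
```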
Exception trapping is even more inefficient because exceptions must happen in order. All instructions must be executed speculatively so that they can be rolled back in case a preceding instruction that has not yet finished causes an exception. A single instruction may be delayed for hundreds of clock cycles by a cache miss. The out-of-order scheduler may find many subsequent independent instructions that can be executed in the meantime, and all of them must execute speculatively. This is an awful lot of bookkeeping.
A more efficient solution would be to propagate status information through a chain of calculations to the end result, as discussed in a previous thread viewtopic.php?f=1&t=91 and in the document https://www.agner.org/optimize/nan_propagation.pdf.
A propagation method would save a lot of silicon and power.
I will discuss three possible ways of propagating error status:
Method 1.
Floating point overflow is propagated as INF. Invalid operations are propagated as NAN. The error is detected in the end result.
Advantages:
- This works with existing systems. Nothing new has to be introduced.
- The result of each element of a vector is reported separately. Scalar code can be vectorized without changing the result.
Disadvantages:
- INF does not propagate through division: 1/INF = 0.
- Underflow and inexact exceptions cannot be detected. These exceptions are rarely used, but they are required by the IEEE-754 floating point standard.
- Overflow in a float-to-int conversion cannot be detected with this method.
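A small Python sketch of Method 1 shows both sides. It relies on the assumption that Python floats are IEEE-754 doubles and that plain multiplication overflows silently to INF; the function names are made up for the example. The NAN from an invalid operation survives to the end result, while the INF from an overflow is wiped out by a later division:

```python
import math

def scaled_reciprocal(x):
    big = x * 1e10          # overflows to inf when x is large enough
    return 1.0 / big        # 1/inf is 0.0, so the overflow leaves no trace

def invalid_chain(x):
    bad = x - x             # inf - inf is an invalid operation and gives nan
    return bad * 2.0 + 1.0  # nan propagates through the following arithmetic

inf = float("inf")
print(scaled_reciprocal(1e300))        # a perfectly normal-looking result
print(math.isnan(invalid_chain(inf)))  # the error is still visible at the end
```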
Method 2.
Certain bits in a control register or mask register indicate what exceptions you want to detect. An enabled exception will generate a NAN result with a payload indicating the kind of exception and where it occurred.
Advantages:
- Same advantages as method 1.
- NANs can be detected with existing methods, including standard compare instructions.
- Underflow and inexact exceptions can be detected.
- Overflow generates NAN rather than INF to make sure it propagates through division.
- It is possible to detect where the error occurred. This is useful for debugging.
Disadvantages:
- Legacy code that relies on overflow generating INF may fail.
- Overflow in a float-to-int conversion cannot be detected with this method.
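The payload idea can be sketched in Python by building the NAN bit pattern by hand. Everything here is an assumption made for illustration: the payload layout (low 8 bits for the kind of exception, the remaining bits for a location id), the constant `KIND_OVERFLOW`, and the helper `checked_mul`, which only imitates in software what the hardware would do when overflow detection is enabled. Whether real hardware arithmetic preserves the payload of a NAN operand is not guaranteed on every platform, so the sketch only decodes a NAN it has just created:

```python
import math
import struct

QNAN_BITS = 0x7FF8000000000000  # exponent all ones plus the quiet bit
PAYLOAD_MASK = (1 << 51) - 1    # the 51 mantissa bits below the quiet bit

# Hypothetical payload layout, invented for this sketch:
# low 8 bits = kind of exception, remaining bits = id of the code location.
KIND_OVERFLOW = 1

def make_nan(kind, site):
    """Build a quiet NAN (double precision) carrying kind and site."""
    payload = ((site << 8) | kind) & PAYLOAD_MASK
    return struct.unpack("<d", struct.pack("<Q", QNAN_BITS | payload))[0]

def decode_nan(x):
    """Return (kind, site) from the payload of a NAN."""
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    payload = bits & PAYLOAD_MASK
    return payload & 0xFF, payload >> 8

def checked_mul(a, b, site):
    """Software imitation of a multiply with overflow detection enabled."""
    r = a * b
    if math.isinf(r) and math.isfinite(a) and math.isfinite(b):
        return make_nan(KIND_OVERFLOW, site)  # NAN instead of INF
    return r

err = checked_mul(1e300, 1e10, site=42)
print(decode_nan(err))        # kind and location of the error
print(math.isnan(1.0 / err))  # unlike INF, the NAN survives the division
```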
Method 3.
All floating point/vector registers should have some extra status bits that are set in case of exceptions. The status bits are propagated through a series of calculations in the following way. An operation like C = A + B will set the status bits of C as the OR combination of the status bits of A, the status bits of B, and the status resulting from the + operation. ForwardCom has special instructions for saving a variable-length vector register in a system-dependent compressed format. These instructions can include the status bits.
Advantages:
- Works for integer overflow as well.
Disadvantages:
- All vector registers must have extra bits.
- Vector registers can contain elements of 1, 2, 4, 8, or 16 bytes. Do we want status bits for all possible element sizes?
- The status bits are lost when saving values in standard form.
- The status bits are difficult to access from high level language code.
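The OR rule for C = A + B can be modeled in a few lines of Python. This is a software sketch only: the class `Tagged`, the flag values, and the way `op_status` guesses the status of a single operation from its result are all assumptions made for the example, not part of ForwardCom.

```python
import math
import operator
from dataclasses import dataclass

OVERFLOW = 1  # hypothetical flag values for this sketch
INVALID = 2

@dataclass(frozen=True)
class Tagged:
    """A value together with the extra status bits of its register."""
    value: float
    status: int = 0

def op_status(result, a, b):
    """Status raised by one operation, inferred from operands and result."""
    if math.isnan(result) and not (math.isnan(a) or math.isnan(b)):
        return INVALID
    if math.isinf(result) and math.isfinite(a) and math.isfinite(b):
        return OVERFLOW
    return 0

def apply(op, a, b):
    """C = A op B: OR the status of A, of B, and of the operation itself.
    Note that Python raises ZeroDivisionError for a zero divisor."""
    result = op(a.value, b.value)
    return Tagged(result, a.status | b.status | op_status(result, a.value, b.value))

big = apply(operator.mul, Tagged(1e300), Tagged(1e10))  # overflows to inf
end = apply(operator.truediv, Tagged(1.0), big)         # value 0.0, flag kept
plain = end.value  # saved in standard form: the status bits are gone
print(end, plain)
```

The division turns the INF back into an ordinary 0.0, yet the overflow flag is still attached to the end result. Storing only `end.value` shows the disadvantage listed above: the status is lost as soon as the value is saved in standard form.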
The IEEE-754 floating point standard distinguishes between immediate and delayed exception handling. The methods described here are perfect for delayed exception handling. You can simply check the result after a chain of calculations. The situation is more difficult if you want immediate exception handling. Immediate exception handling means that, in principle, you have to stop the series of calculations immediately in case of an exception. A high-level language may detect exceptions either by reading a status register or with a try/catch block. The status flag is the most common method of detecting floating point errors in C/C++, but try/catch is possible at least in some cases. Other languages like Java and C# are unable to raise and catch floating point exceptions, AFAIK.
Checking the end result with any of the above methods will work as a useful replacement for a status register. The try/catch method is more difficult because it presupposes immediate exception handling. We may think of different scenarios with try/catch blocks:
- The 'catch' block aborts the program with an error message. This situation is easy. All data are lost anyway, so it does not matter at what time the exception is detected.
- The 'catch' block tries to recover from the error. The code assumes that all calculations before the exception are correct. We must roll back any calculations done after the point of the exception. Vectorizing a loop in this situation can be complicated, but in simple cases we may simply save the part of the result vector that precedes the element that indicates an error.
- The 'catch' block tries to fix the error. The 'catch' block may access intermediate variables, including the value of a loop counter at the time of the exception. It may be very difficult to vectorize such a loop. The code may restore data to the state before the 'try' block and redo all the calculations without vectors.
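The simple case in the second scenario, where we keep the part of the result vector that precedes the first error, might look like this in Python. It is only a sketch: a list stands in for the vector register, INF or NAN in an element is taken as the error indication (as in Method 1), and `compute` and `run_vectorized` are made-up names.

```python
import math

def compute(x):
    return x * 1e308  # overflows to inf when x is large enough

def run_vectorized(inputs):
    """Return (valid results before the first error, index of the error)."""
    # Compute every element without stopping, as a vector unit would.
    results = [compute(x) for x in inputs]
    # Afterwards, look for the first element that signals an error.
    for i, r in enumerate(results):
        if math.isinf(r) or math.isnan(r):
            return results[:i], i  # keep only the part before the error
    return results, None

valid, error_index = run_vectorized([0.5, 1.0, 10.0, 0.25])
print(len(valid), error_index)
```

Elements after the failing one are computed and then thrown away, which corresponds to rolling back the calculations done after the point of the exception.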
I would like to hear your opinions on which method of error detection to prefer for ForwardCom and any problems it may involve. Can we avoid speculative execution completely if traps are replaced by error propagation, and hardware interrupts are handled in an in-order front end?