Error handling in ForwardCom

ForwardCom implements a new, efficient way of handling numerical errors. Traditional systems have two methods for detecting errors in floating-point calculations. One is a global status register that is set when an error occurs. The other is traps (software interrupts). Both methods were introduced before parallel processing, vector processing, and out-of-order processing became common. They are based on a linear conception of program flow and do not work well with out-of-order processing and vector processing. If two errors occur simultaneously in two different elements of the same vector, then they are detected as a single error. If the vectors are shorter, so that the two elements fall in different vectors, then the errors do not occur simultaneously and two errors are detected. This is a problem if we want a program to give exactly the same result regardless of vector length.
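The dependence on vector length can be illustrated with a small simulation. This is only a sketch under a simplifying assumption: each vector instruction raises at most one trap event, however many of its elements fail. The function name count_trap_events and the sample data are hypothetical and chosen for illustration.

```python
def count_trap_events(values, vector_length):
    """Count trap events when computing 1.0 / v for each value, one vector at a time.

    Assumed model (not a real hardware API): a vector instruction raises
    at most one trap, no matter how many of its elements divide by zero.
    """
    events = 0
    for start in range(0, len(values), vector_length):
        chunk = values[start:start + vector_length]
        # Any number of failing elements in this chunk counts as one event.
        if any(v == 0.0 for v in chunk):
            events += 1
    return events


# Two zero divisors, at positions 1 and 3.
data = [4.0, 0.0, 2.0, 0.0, 1.0, 5.0, 3.0, 8.0]

for vlen in (1, 2, 4, 8):
    print("vector length", vlen, "->", count_trap_events(data, vlen), "trap event(s)")
```

With short vectors the two failing elements land in different vectors and are counted separately. With long vectors they share one vector and are merged into a single event, so the observable behavior changes with the vector length.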

A further problem with out-of-order processors is that traps have to occur in program order. All instructions must be executed speculatively until it can be ascertained that no preceding instruction will generate a trap. Speculative execution is complicated and costly.

The problems with error handling are explained in more detail in Agner Fog's report “Floating point exception tracking and NAN propagation”.

The most efficient solution to the problems of error detection is to make the error detection mechanism follow the same information flow as the program instructions we want to monitor. ForwardCom does this in the following way. The detection of each type of floating-point error can be enabled or disabled. An enabled error will generate a NAN (not a number) that propagates through the subsequent calculations. Such a NAN can be detected in the end result or at any desired intermediate point in the calculations. Each NAN includes a payload, which is a bit pattern indicating the type of error and information about where it occurred. This method is sure to give the same result regardless of whether instructions are executed in order or out of order, and whether calculations are carried out sequentially or in parallel.
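The idea of a NAN that carries an error code can be sketched in ordinary double-precision arithmetic. The example below is an illustration only and does not reproduce ForwardCom's actual payload layout. The error code ERR_DIV_BY_ZERO, the helper names, and the use of the low mantissa bits as the payload are all assumptions made for this sketch.

```python
import math
import struct

QUIET_NAN_BITS = 0x7FF8000000000000  # exponent all ones, quiet bit set
PAYLOAD_MASK = 0x0007FFFFFFFFFFFF    # remaining 51 mantissa bits

ERR_DIV_BY_ZERO = 1  # hypothetical error code, not a ForwardCom constant


def make_error_nan(code):
    """Build a quiet NaN whose mantissa bits hold an error code."""
    bits = QUIET_NAN_BITS | (code & PAYLOAD_MASK)
    return struct.unpack("<d", struct.pack("<Q", bits))[0]


def nan_payload(x):
    """Extract the payload bits from a double."""
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    return bits & PAYLOAD_MASK


def checked_div(a, b):
    """Return a / b, or an error-carrying NaN instead of raising an exception."""
    if b == 0.0:
        return make_error_nan(ERR_DIV_BY_ZERO)
    return a / b


# The error is not checked at the point where it occurs.
# The NaN flows through the later operations and is tested once at the end.
result = checked_div(1.0, 0.0) * 2.0 + 3.0
if math.isnan(result):
    print("error detected in final result")
```

The check happens once, on the final result, instead of after every operation. Whether the payload bits survive the intermediate multiplication and addition depends on the hardware that runs this Python code, so the sketch only relies on the result still being a NAN. ForwardCom specifies the payload propagation itself.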

Integer overflow and other integer errors cannot be detected in this way because integer variables cannot have NANs. ForwardCom offers a way of detecting integer errors that is similar to the NAN propagation method. The integer method stores the operands in vector registers and uses extra vector elements to store and propagate error information. This is useful for programming languages that check for integer overflow.
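The integer method can be modeled in a few lines. The sketch below assumes a two-element layout, [value, error_flags], with 32-bit wrapping arithmetic. This layout, the flag ERR_OVERFLOW, and the function names are hypothetical and are not taken from the ForwardCom specification.

```python
INT32_MIN = -2**31
INT32_MAX = 2**31 - 1

ERR_OVERFLOW = 1  # hypothetical flag bit


def wrap32(x):
    """Wrap an unbounded Python int to a signed 32-bit value."""
    return (x + 2**31) % 2**32 - 2**31


def checked_add(a, b):
    """Add two 'vectors' of the assumed form [value, error_flags].

    The extra element accumulates error flags, so an overflow in any
    earlier step is still visible in the final result.
    """
    total = a[0] + b[0]
    flags = a[1] | b[1]  # propagate errors from both operands
    if not INT32_MIN <= total <= INT32_MAX:
        flags |= ERR_OVERFLOW
    return [wrap32(total), flags]


x = checked_add([INT32_MAX, 0], [1, 0])  # overflows and sets the flag
y = checked_add(x, [5, 0])               # no new overflow, flag is carried along
print(y)
```

As with NAN propagation, the program only needs to inspect the error element of the final result rather than test for overflow after every addition.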

ForwardCom seeks to avoid “undefined” behavior and make sure that every error condition has a predictable outcome. Providing reliable ways to detect numerical errors is important for this goal.

ForwardCom defines a standardized way of sending error messages to the user, independent of the user interface framework. A graphical user interface requires a pop-up message box. A character-oriented user interface needs a message sent to the standard error output. A server application requires an error message to be saved in a log file or sent to the administrator. A standard error message function serves the purpose of sending an error message independent of the user interface. This is useful for programs and library functions that need to work regardless of which user interface framework is used.
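A user-interface-independent error message function could look roughly like the following. This is a minimal sketch of the general idea, not ForwardCom's actual interface. The names error_message and set_error_handler are hypothetical.

```python
import sys


def _stderr_handler(message):
    """Default handler for a character-oriented user interface."""
    print(message, file=sys.stderr)


_current_handler = _stderr_handler


def set_error_handler(handler):
    """Install the handler that suits the current user interface framework."""
    global _current_handler
    _current_handler = handler


def error_message(message):
    """Report an error without knowing which user interface is in use."""
    _current_handler(message)


# A server application might collect messages in a log instead.
server_log = []
set_error_handler(server_log.append)
error_message("numerical error in solver")
print(server_log)
```

A library function calls only error_message. The application decides once, at startup, whether messages go to a message box, to the standard error output, or to a log file.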