Hardware exceptions - How your machine deals with errors

When working with computers, a lot can go wrong in many places. In this chapter, we will look at how your hardware deals with errors. In the later chapters, we will look at how to deal with errors in code.

Types of errors

First things first: What can go wrong? It makes sense to start from the perspective of an application and then work our way down to see how this relates to your hardware. Here is a (non-exhaustive) list of things that can go wrong in an application:

A division by zero is encountered
A file cannot be found
The program does not have the permissions required to access a file
The program runs out of memory
The program is accessing memory that is read/write protected
The program is accessing memory that is not paged in (page fault)
An integer overflow occurs
The program is dereferencing a null pointer (which, as it turns out, is the same as accessing read/write protected memory)

To make sense of all these errors, we can classify them into programmer errors, user errors and system errors. Programmer errors are errors that the programmer makes in their logic, for example dereferencing a null pointer without checking whether the pointer is null. User errors are things that the user of the software does that do not match the expected usage of the software. Wrong user inputs are a good example of this: The software expects a birthday in the format DD.MM.YYYY and the user inputs their name instead. Lastly, system errors are errors which relate to the current state of the system that the program is running on. A missing (or inaccessible) file is an example of a system error, as is a dropped network connection. These are neither the programmer's nor the user's fault. However, we as programmers still have to be aware of system errors and have to take precautions to handle them.
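As a small illustration of handling a user error, the following Python sketch validates the birthday input from the example above. The function name `parse_birthday` and the decision to return `None` for malformed input are assumptions made for this sketch; a real application might report the problem differently.

```python
from datetime import date, datetime
from typing import Optional


def parse_birthday(text: str) -> Optional[date]:
    """Parse a birthday given as DD.MM.YYYY; return None for malformed input."""
    try:
        # strptime raises ValueError if the text does not match the format
        # or does not describe a real calendar date (for example 31.02.2000).
        return datetime.strptime(text.strip(), "%d.%m.%Y").date()
    except ValueError:
        # A user error: report it to the caller instead of crashing.
        return None


print(parse_birthday("24.12.1990"))  # a valid date
print(parse_birthday("Alice"))       # a name instead of a date
```

The caller can check for `None` and ask the user to try again, which keeps a user error from turning into a program crash.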

In addition to these three error categories, we can also classify errors into recoverable and non-recoverable errors. In an ideal world, our software can always recover from any possible error condition, but that is rarely the case. Whether a particular error is recoverable or non-recoverable depends largely on the context of the software. A missing file error might be recovered from by providing a default file path or even default data, or it might be non-recoverable because the file is crucial to the workings of the program. For most programs, running out of heap memory will be a non-recoverable condition resulting in program termination, but there are programs that have to be able to recover even from such a critical situation. Generally speaking, we as programmers can develop our application with a specific set of recoverable errors in mind and take precautions to either prevent or handle these errors gracefully. At the most extreme end of recovering from errors are things like the Mars Rovers, which have to work autonomously for many years and thus have to be able to deal with all sorts of crazy errors. (There is a great video on how the people at the NASA Jet Propulsion Laboratory (JPL) had to write their C++ code for the Mars Rovers to deal with all sorts of errors.)
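The missing-file case can be sketched in a few lines of Python. The function `load_settings`, the file name, and the contents of `DEFAULT_SETTINGS` are hypothetical and exist only for this example. The sketch treats a missing file as recoverable and everything else as non-recoverable, which is one possible choice among many.

```python
import json
from pathlib import Path

# Hypothetical default settings, used when no settings file exists.
DEFAULT_SETTINGS = {"language": "en", "autosave": True}


def load_settings(path: str) -> dict:
    """Load settings from a JSON file, recovering from a missing file."""
    try:
        text = Path(path).read_text(encoding="utf-8")
    except FileNotFoundError:
        # Recoverable in this program: fall back to a copy of the defaults.
        return dict(DEFAULT_SETTINGS)
    # Other errors (PermissionError, malformed JSON) are treated as
    # non-recoverable here and propagate to the caller.
    return json.loads(text)


# Assuming this file does not exist, the defaults are returned.
settings = load_settings("no-such-settings-file.json")
print(settings)
```

Which errors end up in the `except` clause is exactly the design decision described above: it encodes which conditions this particular program considers recoverable.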

Notice that some errors are closer to the hardware than others: Wrong user input can be handled at a much higher level than an access to memory in a page that is not currently loaded into working memory.

The general term for errors in computer science is exceptions, as they signify an exceptional condition. This term is very broad, referring not only to the errors that we have seen so far, but also to signals from I/O devices or a hardware timer going off. When talking about errors that can occur in your code, the more specific term software exception is used to distinguish them from these other kinds of hardware-related exceptional conditions.

How your hardware deals with exceptions

So how does your hardware deal with these exceptional conditions? Typically, this is done by your processor, which has built-in mechanisms for detecting all sorts of exceptional conditions. When the processor detects such an exception, it can transfer the control flow of your program to a special function that deals with the exception. Such functions are called exception handlers, and the operating system manages them in a special table called the exception table. Every exception type has a unique ID through which it can be identified, and for each such ID the operating system registers a function in the exception table on system startup. This way, once your system is up and running, the operating system guarantees that all exceptions that the processor might detect will get handled appropriately. The exception table itself resides somewhere in memory, and a special CPU register, called the exception table base register, points to the start of the exception table in memory.
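The lookup described above can be modelled as a table of functions indexed by exception ID. The following Python sketch is a toy model only: the IDs, the handler names and the strings they return are made up for illustration, and list indexing stands in for the address computation that a real processor performs using the exception table base register.

```python
# Hypothetical exception IDs, chosen for this sketch only.
DIVIDE_ERROR = 0
PAGE_FAULT = 1
SYSTEM_CALL = 2


def handle_divide_error() -> str:
    return "send SIGFPE to the process"


def handle_page_fault() -> str:
    return "load the page, then retry the instruction"


def handle_system_call() -> str:
    return "perform the requested service, then continue"


# The "operating system" fills the table once at startup: the exception
# ID is the index, the entry is the handler registered for that ID.
EXCEPTION_TABLE = [handle_divide_error, handle_page_fault, handle_system_call]


def raise_exception(exception_id: int) -> str:
    """Model the processor: look up the handler by ID and call it."""
    handler = EXCEPTION_TABLE[exception_id]
    return handler()


print(raise_exception(DIVIDE_ERROR))
```

Because every ID has an entry, the lookup never fails for a known exception type, which mirrors the guarantee that the operating system gives after startup.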

Let's look at how this works in practice. Imagine that you wrote some code that accidentally divides a number by zero. Your CPU is running your code and encounters the instruction for dividing two numbers, which results in the division by zero. Your CPU detects this condition at the point of executing the division instruction. On an x86-64 system running Linux, the ID of the divide error exception is 0, so the CPU transfers control to the exception handler for exception type 0. We call this situation exceptional control flow, to distinguish it from the regular control flow of your program. Exception handlers run in kernel mode and thus have direct access to all of the system's resources, even if the program in which the exception occurred was running in user mode. Depending on the type of exception, the exception handler either retries the failed instruction (after doing some additional work), moves on to the next instruction, or terminates the current process. For our division error, the exception handler will send a special signal to the process (SIGFPE), which, if unhandled, will terminate the process. If the process is running in a debugger, the SIGFPE signal gets translated into a different signal which causes the debugger to display that a division error occurred in the process. Exceptions and signals are thus an important part of what makes debuggers work.
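The last step, an unhandled SIGFPE terminating the process, can be observed from a parent process. The following Python sketch assumes a POSIX system such as Linux. Python checks for a zero divisor in software and raises `ZeroDivisionError` before the CPU ever sees the division, so the child process here sends SIGFPE to itself to imitate what the kernel would do after a real divide error. The names `CHILD_CODE` and `run_child_with_sigfpe` are hypothetical.

```python
import signal
import subprocess
import sys

# Code for a child process that sends SIGFPE to itself. This stands in for
# the kernel's exception handler delivering the signal after a divide error.
CHILD_CODE = "import os, signal; os.kill(os.getpid(), signal.SIGFPE)"


def run_child_with_sigfpe() -> int:
    """Start the child and return its exit status as reported by the OS."""
    result = subprocess.run([sys.executable, "-c", CHILD_CODE])
    return result.returncode


status = run_child_with_sigfpe()
# On POSIX systems, a negative return code -N means "terminated by signal N".
if status == -signal.SIGFPE:
    print("child was terminated by SIGFPE")
else:
    print(f"child exited with status {status}")
```

The child installs no handler for SIGFPE, so the default action applies and the process is terminated, which the parent sees as a negative return code.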

On the hardware level, we distinguish between 4 different types of exceptions, based on the default behaviour of their exception handlers:

Interrupts: These are things like signals from I/O devices to notify that data can be read from them
Traps: These are intentional exceptions which require some action from the operating system, for example system calls
Faults: These are potentially recoverable errors, for example page faults (accessing memory in a page that is not cached in working memory)
Aborts: These are unrecoverable hardware errors, for example memory failures
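The four types differ mainly in where execution continues once the handler is done. The following Python sketch models this with instructions identified by a simple index, which is a simplification because real instructions have varying lengths. The names `ExceptionClass` and `resume_point` and the `recovered` flag are assumptions made for this illustration.

```python
from enum import Enum
from typing import Optional


class ExceptionClass(Enum):
    INTERRUPT = "interrupt"
    TRAP = "trap"
    FAULT = "fault"
    ABORT = "abort"


def resume_point(kind: ExceptionClass, current: int,
                 recovered: bool = True) -> Optional[int]:
    """Return the index of the instruction to run after the handler
    finishes, or None if the process is terminated instead."""
    if kind in (ExceptionClass.INTERRUPT, ExceptionClass.TRAP):
        # The handler returns to the instruction after the current one.
        return current + 1
    if kind is ExceptionClass.FAULT and recovered:
        # The handler fixed the problem, so the faulting instruction
        # is executed again.
        return current
    # Unrecovered faults and all aborts end the process.
    return None


print(resume_point(ExceptionClass.TRAP, 10))                    # 11
print(resume_point(ExceptionClass.FAULT, 10))                   # 10
print(resume_point(ExceptionClass.FAULT, 10, recovered=False))  # None
print(resume_point(ExceptionClass.ABORT, 10))                   # None
```

A page fault is the typical recovered fault: once the page has been loaded, the same memory access is retried and now succeeds.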

The operating system and hardware closely interact when dealing with exceptions, with some exception types (such as divide-by-zero and page fault) being defined by the hardware, and others (such as system call) being defined by the operating system.

Hardware errors are a big topic that we could delve much deeper into. However, much of this material is covered in an operating systems course, so we will leave it at that for now.