Error handling - How to make systems software robust

When learning to program, one of the biggest struggles is getting your code to behave the way you want it to. Be it that nasty off-by-one error in a binary search or wrong pointers in a linked list, a lot of time is spent fixing mistakes that we as programmers make. An equally important part of software development is how our software deals with errors that are not in our own code, but that arise from conditions we have no control over. Files might be missing, a network connection might be dropped, we can run out of memory, users might input data in the wrong format; the list goes on. Handling all these error conditions gracefully is an important part of software development. Little is more frustrating to users than software that behaves unexpectedly or even crashes due to a small mistake the user made, or even just unlucky circumstances.

While error handling is certainly important for applications software, it is much more critical for systems software. In an application, you have direct interaction with the user, who can just retry an operation, close an error popup, or, in the worst case, quickly restart the application. Systems software is meant to communicate with other systems, so closing a popup (or even showing one, for that matter) or quickly restarting the system manually is often not an option.

In this chapter, we will learn about error handling in systems software and how to write robust systems that can recover from errors. We will learn about the different ways of communicating error conditions in code, especially in C++ and Rust, and how to react to these conditions. We will also learn about the role that the operating system plays in error handling.

Here is the roadmap for this chapter: