6.4. Command line arguments, environment variables and program exit codes

In this chapter, we will look at how you can control processes as a developer. This includes how to run multiple processes, how to react to process results and how to configure processes for different environments.

The standard input and output streams: stdin, stdout, and stderr

When starting a new process on Linux, the operating system automatically associates three files with the process called stdin, stdout, and stderr. These files can be used to communicate with the process. stdin (for standard input) is read-only from the process and is meant to feed data into the process. stdout (for standard output) and stderr (for standard error) are both write-only and are meant to move data out of the process. For output, there is a distinction made between 'regular' data, for which stdout is meant, and error information (or diagnostics), for which stderr is meant.

We are used to starting processes from a command line (which is itself a process). Command lines launch new processes by forking themselves and overwriting the forked process with the new executable. The new process inherits stdin, stdout, and stderr from the command line, which is why you see the output of a process invoked from the command line in the command line itself.

Redirecting these streams is a common operation that command lines use to combine processes. If we have two processes A and B, process A can feed information to process B simply by redirecting stdout of A to stdin of B.

So what can we use these input and output streams for? Here are some examples:

  • Reading user input from the command line using stdin
  • Passing data (text or binary) to a process using stdin
  • Outputting text to the user through the command line using stdout
  • Outputting diagnostic and error information using stderr. It is very common that stderr (and stdout) are redirected into named files in the filesystem, for example on servers to store logging information.
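To make these streams concrete, here is a minimal sketch in Rust that echoes every line from stdin to stdout and reports a line count on stderr. The helper function `echo_lines` is a made-up name; it is written against generic `Read`/`Write` traits so the logic works with any streams, not just the standard ones:

```rust
use std::io::{self, BufRead, Write};

// Copy lines from any reader to any writer, returning the line count.
// Using generic BufRead/Write makes the logic testable without a real terminal.
fn echo_lines<R: BufRead, W: Write>(input: R, mut output: W) -> io::Result<usize> {
    let mut count = 0;
    for line in input.lines() {
        writeln!(output, "{}", line?)?;
        count += 1;
    }
    Ok(count)
}

fn main() {
    let stdin = io::stdin();
    let stdout = io::stdout();
    match echo_lines(stdin.lock(), stdout.lock()) {
        // Diagnostics go to stderr, so they don't mix with the regular output
        Ok(count) => eprintln!("processed {} lines", count),
        Err(e) => eprintln!("I/O error: {}", e),
    }
}
```

If you redirect stdout of this program into a file, the line count printed to stderr still appears on the terminal, which is exactly the separation of regular output and diagnostics described above.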

Command line arguments

stdin is a very flexible way to pass data into a process, however it is also completely unstructured. We can pass any byte sequence to the process. Often, what we want is a more controlled manner of passing data into a process for configuration. This could be a URL to connect to (like in the curl tool) or a credentials file for an SSH connection or the desired logging level of an application.

There are two ways for these 'configuration' parameters to be passed into a process: Command line arguments and environment variables. Both are related in a way but serve different purposes. We will look at command line arguments first.

You will probably know command line arguments as the arguments to the main function in C/C++: int main(int argc, char** argv). Each command line argument is a string, typically a null-terminated C-string. (As a quick reminder: C-strings, or null-terminated strings, are arrays of characters representing a text string, where the end of the string is indicated by a special value called the null-terminator, which has the numeric value 0.) The list of command line arguments (argv) is represented by an array of pointers to these C-strings. This explains the somewhat unusual char** type: Each argument is a C-string, which is represented by a single char*, and an array of these equals char**. By convention, the first command line argument usually equals the name of the executable of the current process.

Command line arguments are typically passed to a process when the process is being launched from the command line (or terminal or shell), hence their name. In a command line, the arguments come after the name of the executable: ls -a -l. Here, -a and -l are two command line arguments for the ls executable.

Since the command line is simply a convenience program which simplifies running processes, it itself needs some way to launch new processes. The way to do this depends on the operating system. On Linux, you use the execve system call. Looking at the signature of execve, we see where the command line arguments (or simply program arguments) come into play: int execve(const char *pathname, char *const argv[], char *const envp[])

execve accepts the list of arguments and passes them on to the main function of the new process. This is how the arguments get into main!

Using command line arguments

Since command line arguments are strings, we have to come up with some convention to make them more usable. You already saw that many command line arguments use some sort of prefix (like the -a parameter). Often, this will be a single dash (-) for single-letter arguments, and two dashes (--) for longer arguments. (Windows tools tend to prefer /option over --option, which wouldn't work on Unix systems because they use / as the root of the filesystem.)

Command line arguments are unique to every program because they depend on the functionality that the program is trying to achieve. Generally, we can distinguish between several types of command line arguments:

  • Flags: These are boolean conditions that indicate the presence or absence of a specific feature. For example: The ls program prints a list of the entries in the current directory. By default, it ignores hidden entries (those whose names start with ., including . and ..). If you want to print those as well, you can enable this feature by passing -a as a command line argument to ls.
  • Parameters: These are arguments that represent a value. For example: The cp command can be used to copy the contents of a source file into a destination file. These two files have to be passed as parameters to cp, like this: cp ./source ./destination
  • Named parameters: Parameters are typically identified by their position in the list of command line arguments. Sometimes it is more convenient to give certain parameters a name, which results in named parameters. For example: The curl tool can be used to make network requests on the command line, for example HTTP requests. HTTP requests have different types (GET, POST etc.) which can be specified with a named parameter: curl -X POST http://localhost:1234/test
    • Named parameters are tricky because they are made up of more than one command line argument: The parameter name (e.g. -X) followed by one (or more!) arguments (e.g. POST)

All this is just convention; the operating system simply passes an array of strings to the process. Interpreting the command line arguments has to be done by the process itself and is usually one of the first things that happens in main. This process is called command line argument parsing. You can implement this yourself using string manipulation functions, but since this is such a common thing to do, there are many libraries out there that do it (e.g. boost program_options in C++).
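To illustrate what such hand-rolled parsing involves, here is a minimal sketch in Rust (the language used throughout this book). The specific arguments (-v as a flag, -o as a named parameter) and the Config structure are made up for illustration:

```rust
// A minimal hand-rolled parser distinguishing flags, named parameters and
// positional parameters. The argument names (-v, -o) are hypothetical.
#[derive(Debug, Default, PartialEq)]
struct Config {
    verbose: bool,          // flag: -v
    output: Option<String>, // named parameter: -o <file>
    inputs: Vec<String>,    // positional parameters
}

fn parse_args<I: Iterator<Item = String>>(mut args: I) -> Result<Config, String> {
    let mut config = Config::default();
    while let Some(arg) = args.next() {
        if arg == "-v" {
            config.verbose = true;
        } else if arg == "-o" {
            // A named parameter consumes the *next* argument as its value
            config.output = Some(args.next().ok_or("-o requires a value")?);
        } else if arg.starts_with('-') {
            return Err(format!("unknown argument: {}", arg));
        } else {
            config.inputs.push(arg);
        }
    }
    Ok(config)
}

fn main() {
    // Skip the first argument, which by convention is the executable name
    match parse_args(std::env::args().skip(1)) {
        Ok(config) => println!("{:?}", config),
        Err(msg) => eprintln!("error: {}", msg),
    }
}
```

Even this tiny parser has to deal with the trickiness of named parameters consuming multiple arguments, which is a good argument for reaching for a library instead.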

Command line arguments in Rust

In Rust, the main function looks a bit different from C++: fn main() {}. Notice that no command line arguments are passed to main. Why is that, and how do we get access to the command line arguments in Rust?

Passing C-strings to main would be a bad idea, because C-strings are very unsafe. Instead, Rust goes a different route and exposes the command line arguments through the std::env module, namely the function std::env::args(). It returns an iterator over all command line arguments passed to the program upon execution in the form of Rust String values.

This is a bit more convenient than what C/C++ does, because the command line arguments are accessible from any function within a Rust program this way. Built on top of this mechanism, there are great Rust crates for dealing with command line arguments, for example the widely used clap crate.
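A minimal sketch of std::env::args() in action might look like this (note that args() panics if an argument is not valid UTF-8; std::env::args_os is the lossless alternative):

```rust
fn main() {
    // std::env::args() can be called from anywhere, not just from main
    let args: Vec<String> = std::env::args().collect();

    // By convention, the first entry is the name of the executable
    println!("executable: {}", args[0]);
    for (index, arg) in args.iter().enumerate().skip(1) {
        println!("argument {}: {}", index, arg);
    }
}
```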

Writing good command line interfaces

Applications that are controlled solely through command line arguments and output data not to a graphical user interface (GUI) but instead to the command line are called command line applications. Just as with GUIs, command line applications also need a form of interface that the user can work with. For a command line application, this interface is the set of command line arguments that the application accepts. There are common patterns for writing good command line interfaces that have been proven to work well in software. Let's have a look at some best practices for writing good command line interfaces:

1. Always support the --help argument

The first thing that a user typically wants to do with a command line application is to figure out how it works. For this, the command line argument --help (or -h) has been established as a good starting point. Many applications print information about the supported parameters and the expected usage of the tool to the standard output when invoked with the --help option. How this help information looks is up to you as a developer, though libraries such as clap in Rust or boost program_options in C++ typically handle this automatically.

Here is what the git command line client prints when invoked with --help:

usage: git [--version] [--help] [-C <path>] [-c <name>=<value>]
           [--exec-path[=<path>]] [--html-path] [--man-path] [--info-path]
           [-p | --paginate | -P | --no-pager] [--no-replace-objects] [--bare]
           [--git-dir=<path>] [--work-tree=<path>] [--namespace=<name>]
           <command> [<args>]

These are common Git commands used in various situations:

start a working area (see also: git help tutorial)
   clone             Clone a repository into a new directory
   init              Create an empty Git repository or reinitialize an existing one

[...]

Here is a small command line application written in Rust using the clap crate:

use clap::{App, Arg};

fn main() {
    let matches = App::new("timetravel")
        .version("0.1")
        .author("Pascal Bormann")
        .about("Energizes the flux capacitor")
        .arg(
            Arg::with_name("year")
                .short("y")
                .long("year")
                .help("Which year to travel to?")
                .takes_value(true),
        )
        .get_matches();

    let year = matches.value_of("year").unwrap();

    println!("Marty, we're going back to {}!!", year);
}

Building and running this application with the --help argument gives the following output:

timetravel 0.1
Pascal Bormann
Energizes the flux capacitor

USAGE:
    timetravel [OPTIONS]

FLAGS:
    -h, --help       Prints help information
    -V, --version    Prints version information

OPTIONS:
    -y, --year <year>    Which year to travel to?

2. Provide both short (-o) and long (--option) versions of the same argument

This is really a convenience to the user, but often-used arguments should get shorthand versions. A good example of this can again be found in git: git commit -m is used to create a new commit with the given message. Here, the -m option is a shorthand for --message. It is simpler to type, but still has some resemblance to the longer argument, since m is the first letter of message. For single-letter arguments, make sure that they are reasonably expressive and don't lead to confusion.

In Rust using clap, we can use the short and long methods of the Arg type to specify short and long versions of an argument specifier.

3. Use the multi-command approach for complex tools

Some command line applications have so many different functionalities that providing a single command line argument for each would lead to unnecessary confusion. In such a situation, it makes sense to convert your command line application to a multi-command tool. Again, git is a prime example of this. Its primary functions can be accessed through named commands after git, such as git pull or git commit, where pull and commit are commands of their own with their own unique sets of arguments. Here is what git commit --help prints:

NAME
       git-commit - Record changes to the repository

SYNOPSIS
       git commit [-a | --interactive | --patch] [-s] [-v] [-u<mode>] [--amend]
                  [--dry-run] [(-c | -C | --fixup | --squash) <commit>]
                  [-F <file> | -m <msg>] [--reset-author] [--allow-empty]
                  [--allow-empty-message] [--no-verify] [-e] [--author=<author>]
                  [--date=<date>] [--cleanup=<mode>] [--[no-]status]

[...]

As we can see, the commit sub-command has a lot of command line arguments that only apply to this sub-command. Structuring a complex command line application in this way can make it easier for users to work with it.

Environment variables

Besides command line arguments, there is another set of string parameters to the exec family of functions. Recall the signature of execve: int execve(const char *pathname, char *const argv[], char *const envp[]). After the command line arguments, a second array of strings is passed to execve, which contains the environment variables. Where command line arguments are meant to describe the current invocation of a program, environment variables are used to describe the environment that the program is running in.

Environment variables are strings that are key-value pairs with the structure KEY=value. Since they are named, they are easier to use from an application than command line arguments.

Environment variables are inherited from the parent process. This means that you can set environment variables for your current terminal session, and all programs launched in this session will have access to these environment variables.

If you want to see the value of an environment variable in your (Linux) terminal, you can simply write echo $VARIABLE, where VARIABLE is the name of the environment variable.

There are a bunch of predefined environment variables in Linux that are pretty useful. Here are some examples:

  • $PATH: A list of directories - separated by colons - in which your terminal looks for commands (executables). Notice that when you write something like ls -a, you never specify where the ls executable is located on your computer. With the $PATH environment variable, your terminal can find the ls executable. On the author's MacOS system, it is located under /bin/ls, and /bin is part of $PATH
  • $HOME: The path of the current user's home directory
  • $USER: The name of the current user
  • $PWD: The path to the current directory in the terminal. This is the same as calling the pwd command without arguments

If you want to see all environment variables in your terminal, you can run printenv on most Linux shells.

In Rust, we can get access to all environment variables in the current process using the std::env::vars function, which returns an iterator over key-value pairs.
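A quick sketch of both ways to access the environment from Rust, using std::env::var for a single lookup and std::env::vars for iteration ($HOME is assumed to be set, as it is on typical Linux/MacOS systems):

```rust
use std::env;

fn main() {
    // Look up a single variable; var() returns a Result because the
    // variable might not be set (or might not be valid UTF-8)
    match env::var("HOME") {
        Ok(home) => println!("home directory: {}", home),
        Err(e) => eprintln!("could not read $HOME: {}", e),
    }

    // Iterate over all environment variables as (key, value) pairs,
    // similar to what the printenv command does
    for (key, value) in env::vars() {
        println!("{}={}", key, value);
    }
}
```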

An interesting question is whether to use command line arguments or environment variables for program configuration. There is no definitive answer, but it is pretty well established in the programming community that command line arguments are for things that change frequently between executions of the program (like the arguments to cp for example), whereas environment variables are for parameters that are more static (like a logging level for a server process). If you answer the question 'Is this parameter part of the environment that my program is running in?' with yes, then it is a good candidate for an environment variable.

Configuration files

If your process requires a lot of configuration, supporting a configuration file can be a better idea than providing dozens of command line arguments. Command line arguments are only one way to get information into a process; nothing stops you from reading all the configuration parameters your program requires from a file. We call such a file a configuration file. What such a configuration file looks like is a decision that each program has to make, however there are some standardized formats established today which are frequently used for configuration files:

  • Linux traditionally uses mainly text-based key-value formats with file extensions such as .conf or .ini. Some tools also require some commands to be run at initialization, which are often specified in a special type of configuration file whose name ends in rc (such as .bashrc). On Linux, go check out your /etc directory, it contains lots of configuration files
  • For more complex configuration parameters, key-value pairs are often insufficient and instead, some hierarchical data structures are required. Here, common serialization formats such as JSON, XML, or the simpler YAML format are often used.

It is a good idea to make the path to the configuration file configurable as well, using either a command line argument or an environment variable.
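Putting both ideas together, here is a sketch of a parser for the simple key=value format, with the file path made configurable through an environment variable. The variable name APP_CONFIG and the default file name app.conf are made up for illustration:

```rust
use std::collections::HashMap;

// Parse a simple key=value configuration format, as is common in
// .conf/.ini style files. Blank lines and lines starting with '#'
// (comments) are skipped.
fn parse_config(contents: &str) -> HashMap<String, String> {
    let mut config = HashMap::new();
    for line in contents.lines() {
        let line = line.trim();
        if line.is_empty() || line.starts_with('#') {
            continue;
        }
        if let Some((key, value)) = line.split_once('=') {
            config.insert(key.trim().to_string(), value.trim().to_string());
        }
    }
    config
}

fn main() {
    // The path to the configuration file is itself configurable, with a
    // fallback default (APP_CONFIG and app.conf are hypothetical names)
    let path = std::env::var("APP_CONFIG").unwrap_or_else(|_| "app.conf".to_string());
    match std::fs::read_to_string(&path) {
        Ok(contents) => println!("{:?}", parse_config(&contents)),
        Err(e) => eprintln!("could not read {}: {}", path, e),
    }
}
```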

Program exit codes

Up until now we only talked about getting data in and out of processes at startup or during the execution. It is often also helpful to know how a process has terminated, in particular whether an error occurred or the process exited successfully. The simplest way to do this is to make use of the program exit code. This is typically an 8-bit integer that represents the exit status of the process.

In Linux, we can use the waitpid function to wait for a child process to terminate and then inspect the status variable that waitpid sets to see how the child process terminated. This is how your shell can figure out whether a process exited successfully or not.

By convention, an exit code of 0 represents successful program termination, and any non-zero exit code indicates a failure. In C and C++, there are two constants that can be used: EXIT_SUCCESS to indicate successful termination, and EXIT_FAILURE to indicate abnormal process termination.
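On the Rust side, a program can set its own exit code by returning std::process::ExitCode from main (std::process::exit is an alternative that terminates immediately). A small sketch, with the exit-code decision factored into a hypothetical run function:

```rust
use std::process::ExitCode;

// Decide the exit status; returning the code as a u8 keeps the logic
// separate from process termination and easy to test
fn run(args: &[String]) -> u8 {
    if args.len() < 2 {
        // Errors and usage information go to stderr
        eprintln!("usage: {} <argument>", args[0]);
        return 1; // non-zero: failure
    }
    println!("got argument: {}", args[1]);
    0 // zero: success, like EXIT_SUCCESS in C/C++
}

fn main() -> ExitCode {
    let args: Vec<String> = std::env::args().collect();
    ExitCode::from(run(&args))
}
```

A shell can then branch on this code, e.g. with `./program foo && echo "it worked"`.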

Running processes in Rust

Let's put all our knowledge together and work with processes in Rust. Here, we can use the std::process module, which contains functions to execute and manage processes from a Rust program.

The main type that is used for executing other processes is Command. We can launch an executable named program with a call to Command::new("program").spawn() or Command::new("program").output(). The first variant (spawn()) starts the new process and immediately returns a handle to it, without waiting for it to finish. The second variant (output()) waits for the process to finish and returns its result. This includes the program exit code, as well as all data that the program wrote to the output streams stdout and stderr. Here is the signature of output:

#![allow(unused)]
fn main() {
pub fn output(&mut self) -> Result<Output>
}

It returns a Result because spawning a process might fail. If it succeeds, the relevant information is stored in the Output structure:

#![allow(unused)]
fn main() {
pub struct Output {
    pub status: ExitStatus,
    pub stdout: Vec<u8>,
    pub stderr: Vec<u8>,
}
}

Notice that the output of stdout and stderr is represented not as a String but as a Vec<u8>, so a vector of bytes. This emphasizes the fact that the output streams can be used to output data in any format from a process. Even though we might be used to printing text to the standard output (e.g. by using println!), it is perfectly valid to output binary data to the standard output.

Putting all our previous knowledge together, we can write a small program that executes another program and processes its output, like so:

use std::process::Command;

use anyhow::{bail, Result};

fn main() -> Result<()> {
    let output = Command::new("ls").arg("-a").output()?;
    if !output.status.success() {
        bail!("Process 'ls' failed with exit code {}", output.status);
    }

    let stdout_data = output.stdout;
    let stdout_as_string = String::from_utf8(stdout_data)?;
    let files = stdout_as_string.trim().split("\n");
    println!(
        "There are {} files/directories in the current directory",
        files.count()
    );

    Ok(())
}

In this program, we are using Command to launch a new process, even supplying it with command line arguments of its own. In this case, we run ls -a, which will print a list of all files and directories in the directory it was executed from. How does ls know from which directory it was executed? Every process has a current working directory, which it inherits from its parent: ls inherits it from our Rust program, which itself inherits it from whatever process called the Rust program. Shells additionally expose this directory through the PWD environment variable. You can try this from your command line (on Linux or MacOS) by typing echo $PWD.

Back to our Rust program. We configure a new process, launch it and immediately wait for its output by calling output(). We are using the ? operator and the anyhow crate to deal with any errors by immediately exiting main in an error case. Even if we successfully launched the ls program, it might still fail, so we have to check the program exit code using output.status.success(). If it succeeded, we have access to the data it wrote to the standard output. We know that ls prints textual data, so we can take the bytes that ls wrote to stdout and convert them to a Rust String using String::from_utf8. Lastly, we use some algorithms to split this string into its lines and count the number of lines, which gives us the number of files/directories in the current directory.

While this program does not do much that you couldn't achieve on the command line alone fairly easily (e.g. using ls -a | wc -l), it illustrates process control in Rust, and shows off some of the other features that we learned about, like iterator algorithms (count) and error handling.
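Incidentally, the shell pipeline ls -a | wc -l itself can be rebuilt in Rust by connecting the output stream of one child process to the input stream of another. A sketch, assuming ls and wc are available on the system:

```rust
use std::process::{Command, Stdio};

fn main() -> std::io::Result<()> {
    // Launch `ls -a` with its stdout captured instead of inherited
    let ls = Command::new("ls").arg("-a").stdout(Stdio::piped()).spawn()?;

    // Feed ls's stdout directly into wc's stdin, mimicking `ls -a | wc -l`
    let wc = Command::new("wc")
        .arg("-l")
        .stdin(Stdio::from(ls.stdout.expect("ls stdout was piped")))
        .output()?;

    // wc prints the number of lines it read, i.e. the number of entries
    print!("{}", String::from_utf8_lossy(&wc.stdout));
    Ok(())
}
```

This is exactly the stream redirection described at the beginning of this chapter, only performed by our program instead of by the shell.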

Recap

This concludes the section on process control, and with it the chapter on systems level I/O. We learned a lot about how processes communicate with each other and how we can interact with other devices such as the disk or the network in Rust. The important things to take away from this chapter are:

  • The Unix file abstraction (files are just sequences of bytes) and how it translates to the Rust I/O traits (Read and Write)
  • The difference between a file and the file system. The latter gives access to files on the disk through file paths and often supports hierarchical grouping of files into directories
  • Network communication using the IP protocol (processes on remote machines are identified by the machine IP address and a socket address) and how network connections behave similar to files in Rust (by using the same Read and Write traits)
  • Processes communicate simple information with each other through signals. If we want to share memory, we can do that by sharing virtual pages using shared memory
  • Processes have default input and output channels called stdin, stdout, and stderr, which are simply files that we can write to and read from
  • For process configuration, we use command line arguments and environment variables