6.4. Command line arguments, environment variables and program exit codes
In this chapter, we will look at how you can control processes as a developer. This includes how to run multiple processes, how to react to process results and how to configure processes for different environments.
The standard input and output streams: `stdin`, `stdout`, and `stderr`

When starting a new process on Linux, the operating system automatically associates three files with the process, called `stdin`, `stdout`, and `stderr`. These files can be used to communicate with the process. `stdin` (for standard input) can only be read by the process and is meant to feed data into it. `stdout` (for standard output) and `stderr` (for standard error) are both write-only and are meant to move data out of the process. For output, a distinction is made between 'regular' data, for which `stdout` is meant, and error information (or diagnostics), for which `stderr` is meant.
We are used to starting processes from a command line (which is itself a process). Command lines launch new processes by forking themselves and overwriting the forked process with the new executable. Both `stdout` and `stderr` are then automatically redirected to the corresponding files of the command line, which is why the output of a process invoked from the command line appears in the command line itself.
Rerouting these files is a common operation that command lines use to combine processes. If we have two processes A and B, process A can feed information to process B simply by rerouting `stdout` of A to `stdin` of B.
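As a sketch of what this rerouting looks like in code, the following Rust program pipes the standard output of one child process into the standard input of another, mimicking the shell pipeline `echo hello pipes | wc -w`. This is an illustrative example that assumes a Unix-like system where `echo` and `wc` are available:

```rust
use std::process::{Command, Stdio};

// Pipe the stdout of `echo` (process A) into the stdin of `wc -w` (process B).
fn count_words_via_pipe() -> String {
    // Process A: capture its stdout instead of letting it inherit ours.
    let a = Command::new("echo")
        .arg("hello pipes")
        .stdout(Stdio::piped())
        .spawn()
        .expect("failed to start echo");

    // Process B: reroute A's stdout into B's stdin, then collect B's output.
    let b = Command::new("wc")
        .arg("-w")
        .stdin(Stdio::from(a.stdout.expect("echo has no captured stdout")))
        .output()
        .expect("failed to run wc");

    String::from_utf8_lossy(&b.stdout).trim().to_string()
}

fn main() {
    // `echo` writes "hello pipes", so `wc -w` counts 2 words.
    println!("word count: {}", count_words_via_pipe());
}
```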
So what can we use these input and output streams for? Here are some examples:

- Reading user input from the command line using `stdin`
- Passing data (text or binary) to a process using `stdin`
- Outputting text to the user through the command line using `stdout`
- Outputting diagnostic and error information using `stderr`. It is very common that `stderr` (and `stdout`) are redirected into named files in the filesystem, for example on servers to store logging information.
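The following small Rust sketch touches all three streams: it reads a line from `stdin`, writes regular output to `stdout`, and writes a diagnostic message to `stderr`. The helper function is generic only so the reading logic is easy to exercise with any buffered reader; the split is purely for illustration:

```rust
use std::io::{self, BufRead, Write};

// Read a single line from any buffered reader (stdin in main) and trim the newline.
fn read_line_from<R: BufRead>(mut reader: R) -> io::Result<String> {
    let mut line = String::new();
    reader.read_line(&mut line)?;
    Ok(line.trim_end().to_string())
}

fn main() -> io::Result<()> {
    // stdin feeds data into the process...
    let name = read_line_from(io::stdin().lock())?;
    // ...stdout carries the regular output...
    writeln!(io::stdout(), "Hello, {}!", name)?;
    // ...and stderr carries diagnostics, which can be redirected separately.
    writeln!(io::stderr(), "[debug] read {} characters", name.len())?;
    Ok(())
}
```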
Command line arguments

`stdin` is a very flexible way to pass data into a process, but it is also completely unstructured: we can pass any byte sequence to the process. Often, what we want is a more controlled way of passing data into a process for configuration. This could be a URL to connect to (as in the `curl` tool), a credentials file for an SSH connection, or the desired logging level of an application.

There are two ways for such 'configuration' parameters to be passed into a process: command line arguments and environment variables. The two are related but serve different purposes. We will look at command line arguments first.
You will probably know command line arguments as the arguments to the `main` function in C/C++: `int main(int argc, char** argv)`. Each command line argument is a string (typically a null-terminated C-string; as a quick reminder, C-strings are arrays of characters representing a text string, where the end of the string is indicated by a special value called the null-terminator, which has the numeric value 0), and the list of command line arguments (`argv`) is represented by an array of pointers to these C-strings. This explains the somewhat unusual `char**` type: each argument is a C-string, which is represented by a single `char*`, and an array of these yields `char**`. By convention, the first command line argument usually equals the name of the executable of the current process.
Command line arguments are typically passed to a process when it is launched from the command line (or terminal or shell), hence their name. In a command line, the arguments come after the name of the executable: `ls -a -l`. Here, `-a` and `-l` are two command line arguments for the `ls` executable.
Since the command line is simply a convenience program that simplifies running processes, it itself needs some way to launch new processes. How this works depends on the operating system; on Linux, you use the `execve` system call. Looking at the signature of `execve`, we see where the command line arguments (or simply program arguments) come into play: `int execve(const char *pathname, char *const argv[], char *const envp[])`. `execve` accepts the list of arguments and passes them on to the `main` function of the new process. This is how the arguments get into `main`!
Using command line arguments

Since command line arguments are strings, we have to come up with some convention to make them more usable. You already saw that many command line arguments use some sort of prefix, like the `-a` parameter (Windows tools tend to prefer `/option` over `--option`, which wouldn't work on Unix systems because they use `/` as the root of the filesystem). Often, this will be a single dash (`-`) for single-letter arguments, and two dashes (`--`) for longer arguments.
Command line arguments are unique to every program because they depend on the functionality that the program is trying to achieve. Generally, we can distinguish between several types of command line arguments:

- Flags: These are boolean conditions that indicate the presence or absence of a specific feature. For example: The `ls` program prints a list of the entries in the current directory. By default, it ignores the `.` and `..` directory entries. If you want to print those as well, you can enable this feature by passing `-a` as a command line argument to `ls`.
- Parameters: These are arguments that represent a value. For example: The `cp` command can be used to copy the contents of a source file into a destination file. These two files have to be passed as parameters to `cp`, like this: `cp ./source ./destination`
- Named parameters: Parameters are typically identified by their position in the list of command line arguments. Sometimes it is more convenient to give certain parameters a name, which results in named parameters. For example: The `curl` tool can be used to make network requests on the command line, for example HTTP requests. HTTP requests have different types (`GET`, `POST` etc.), which can be specified with a named parameter: `curl -X POST http://localhost:1234/test`. Named parameters are tricky because they are comprised of more than one command line argument: the parameter name (e.g. `-X`) followed by one (or more!) arguments (e.g. `POST`).
All this is just convention; the operating system just passes an array of strings to the process. Interpreting the command line arguments has to be done by the process itself and is usually one of the first things that happens in `main`. This is called command line argument parsing. You can implement it yourself using string manipulation functions, but since this is such a common task, there are many libraries out there that do it for you (e.g. `boost::program_options` in C++).
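To illustrate what such hand-rolled parsing might look like, here is a minimal sketch (not the approach of any particular library) that recognizes the three argument kinds described above: a flag (`-a`), a named parameter (`-X`, which consumes the following argument as its value), and positional parameters:

```rust
// A tiny hand-rolled command line parser, purely for illustration.
#[derive(Debug, Default)]
struct ParsedArgs {
    all: bool,               // flag: -a
    request: Option<String>, // named parameter: -X <value>
    positional: Vec<String>, // everything else, identified by position
}

fn parse(args: &[String]) -> Result<ParsedArgs, String> {
    let mut parsed = ParsedArgs::default();
    let mut iter = args.iter().skip(1); // by convention, args[0] is the executable name
    while let Some(arg) = iter.next() {
        match arg.as_str() {
            "-a" => parsed.all = true,
            "-X" => {
                // A named parameter consumes the *next* argument as its value.
                let value = iter.next().ok_or("-X requires a value")?;
                parsed.request = Some(value.clone());
            }
            _ => parsed.positional.push(arg.clone()),
        }
    }
    Ok(parsed)
}

fn main() {
    let args: Vec<String> = std::env::args().collect();
    println!("{:?}", parse(&args));
}
```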
Command line arguments in Rust

In Rust, the `main` function looks a bit different from C++: `fn main() {}`. Notice that there are no command line arguments passed to `main`. Why is that, and how do we get access to the command line arguments in Rust?

Passing C-strings to `main` would be a bad idea, because C-strings are very unsafe. Instead, Rust goes a different route and exposes the command line arguments through the `std::env` module, namely the function `std::env::args()`. It returns an iterator over all command line arguments passed to the program upon execution, in the form of Rust `String` values.
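As a short example, a program that inspects its own arguments could look like this:

```rust
fn main() {
    // std::env::args() yields the arguments as owned Strings; by convention,
    // the first entry is the executable name, just like argv[0] in C.
    let args: Vec<String> = std::env::args().collect();
    println!("executable: {}", args[0]);
    for (index, argument) in args.iter().enumerate().skip(1) {
        println!("argument {}: {}", index, argument);
    }
}
```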
This is a bit more convenient than what C/C++ does, because this way the command line arguments are accessible from any function within a Rust program. Built on top of this mechanism, there are great Rust crates for dealing with command line arguments, for example the widely used `clap` crate.
Writing good command line interfaces

Applications that are controlled solely through command line arguments and output data not to a graphical user interface (GUI) but to the command line are called command line applications. Just as with GUIs, command line applications need a form of interface that the user can work with. For a command line application, this interface is the set of command line arguments that the application accepts. There are common patterns for command line interfaces that have proven to work well in practice. Let's have a look at some best practices for writing good command line interfaces:
1. Always support the `--help` argument

The first thing that a user typically wants to do with a command line application is to figure out how it works. For this, the command line argument `--help` (or `-h`) has been established as a good starting point. Many applications print information about the supported parameters and the expected usage of the tool to the standard output when invoked with the `--help` option. How this help information looks is up to you as a developer, though libraries such as `clap` in Rust or `boost::program_options` in C++ typically handle this automatically.

Here is what the `git` command line client prints when invoked with `--help`:
```
usage: git [--version] [--help] [-C <path>] [-c <name>=<value>]
           [--exec-path[=<path>]] [--html-path] [--man-path] [--info-path]
           [-p | --paginate | -P | --no-pager] [--no-replace-objects] [--bare]
           [--git-dir=<path>] [--work-tree=<path>] [--namespace=<name>]
           <command> [<args>]

These are common Git commands used in various situations:

start a working area (see also: git help tutorial)
   clone     Clone a repository into a new directory
   init      Create an empty Git repository or reinitialize an existing one
[...]
```
Here is a small command line application written in Rust using the `clap` crate:

```rust
use clap::{App, Arg};

fn main() {
    let matches = App::new("timetravel")
        .version("0.1")
        .author("Pascal Bormann")
        .about("Energizes the flux capacitor")
        .arg(
            Arg::with_name("year")
                .short("y")
                .long("year")
                .help("Which year to travel to?")
                .takes_value(true),
        )
        .get_matches();

    let year = matches.value_of("year").unwrap();
    println!("Marty, we're going back to {}!!", year);
}
```
Building and running this application with the `--help` argument gives the following output:

```
timetravel 0.1
Pascal Bormann
Energizes the flux capacitor

USAGE:
    timetravel [OPTIONS]

FLAGS:
    -h, --help       Prints help information
    -V, --version    Prints version information

OPTIONS:
    -y, --year <year>    Which year to travel to?
```
2. Provide both short (`-o`) and long (`--option`) versions of the same argument

This is really a convenience for the user, but often-used arguments should get shorthand versions. A good example of this can again be found in `git`: `git commit -m` is used to create a new commit with the given message. Here, the `-m` option is a shorthand for `--message`. It is simpler to type, but still resembles the longer argument, since `m` is the first letter of `message`. For single-letter arguments, make sure that they are reasonably expressive and don't lead to confusion.

In Rust with `clap`, we can use the `short` and `long` methods of the `Arg` type to specify short and long versions of an argument.
3. Use the multi-command approach for complex tools

Some command line applications have so many different functionalities that providing a single command line argument for each would lead to unnecessary confusion. In such a situation, it makes sense to convert your command line application into a multi-command tool. Again, `git` is a prime example of this. Its primary functions can be accessed through named arguments after `git`, such as `git pull` or `git commit`, where `pull` and `commit` are commands of their own with their own unique sets of arguments. Here is what `git commit --help` prints:
```
NAME
       git-commit - Record changes to the repository

SYNOPSIS
       git commit [-a | --interactive | --patch] [-s] [-v] [-u<mode>] [--amend]
                  [--dry-run] [(-c | -C | --fixup | --squash) <commit>]
                  [-F <file> | -m <msg>] [--reset-author] [--allow-empty]
                  [--allow-empty-message] [--no-verify] [-e] [--author=<author>]
                  [--date=<date>] [--cleanup=<mode>] [--[no-]status]
[...]
```
As we can see, the `commit` sub-command has a lot of command line arguments that only apply to this sub-command. Structuring a complex command line application in this way can make it easier for users to work with it.
Environment variables

Besides command line arguments, there is another set of string parameters to the `exec` family of functions. Recall the signature of `execve`: `int execve(const char *pathname, char *const argv[], char *const envp[])`. After the command line arguments, a second array of strings is passed to `execve`, which contains the environment variables. Where command line arguments are meant to describe the current invocation of a program, environment variables are used to describe the environment that the program is running in.

Environment variables are key-value pairs of strings with the structure `KEY=value`. Since they are named, they are easier to use from an application than command line arguments.

Environment variables are inherited from the parent process. This means that you can set environment variables for your current terminal session, and all programs launched in this session will have access to them.
If you want to see the value of an environment variable in your (Linux) terminal, you can simply write `echo $VARIABLE`, where `VARIABLE` is the name of the environment variable.
There are a bunch of predefined environment variables on Linux that are pretty useful. Here are some examples:

- `$PATH`: A list of directories, separated by colons, in which your terminal looks for commands (executables). Notice that when you write something like `ls -a`, you never specify where the `ls` executable is located on your computer. With the `$PATH` environment variable, your terminal can find the `ls` executable. On the author's macOS system, it is located under `/bin/ls`, and `/bin` is part of `$PATH`.
- `$HOME`: The path of the current user's home directory
- `$USER`: The name of the current user
- `$PWD`: The path to the current directory in the terminal. This is the same as calling the `pwd` command without arguments
If you want to see all environment variables in your terminal, you can run `printenv` in most Linux shells.

In Rust, we can access all environment variables of the current process using the `std::env::vars` function, which returns an iterator over key-value pairs.
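A short sketch of reading the environment from Rust, using `std::env::var` for a single variable and `std::env::vars` to list all of them (`HOME` is just an example of a variable that is usually set on Unix systems):

```rust
use std::env;

fn main() {
    // Read a single variable; var() returns an Err if the variable is not set
    // (or its value is not valid UTF-8).
    match env::var("HOME") {
        Ok(home) => println!("home directory: {}", home),
        Err(_) => eprintln!("HOME is not set"),
    }

    // Iterate over all variables of the current process, like `printenv` does.
    for (key, value) in env::vars() {
        println!("{}={}", key, value);
    }
}
```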
An interesting question is whether to use command line arguments or environment variables for program configuration. There is no definitive answer, but it is pretty well established in the programming community that command line arguments are for things that change frequently between executions of the program (like the arguments to `cp`, for example), whereas environment variables are for parameters that are more static (like the logging level of a server process). If you answer the question 'Is this parameter part of the environment that my program is running in?' with yes, then it is a good candidate for an environment variable.
Configuration files

If your process requires a lot of configuration, a better idea than providing dozens of command line arguments can be to support configuration files. Command line arguments are only one way to get information into a process; no one is stopping you from implementing some file reading and pulling all the configuration parameters your program requires from a file. We call such a file a configuration file. How such a configuration file looks is a decision that each process has to make, but there are some standardized formats established today which are frequently used for configuration files:

- Linux traditionally uses mainly text-based key-value formats with file extensions such as `.conf` or `.ini`. Some tools also require commands to be run at initialization, which are often specified in a special type of configuration file with an `rc` suffix (like `.bashrc`). On Linux, go check out your `/etc` directory; it contains lots of configuration files.
- For more complex configuration parameters, key-value pairs are often insufficient and some hierarchical data structure is required instead. Here, common serialization formats such as `JSON`, `XML`, or the simpler `YAML` format are often used.
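To show how little code a simple key-value format requires, here is a minimal sketch of a parser for `.conf`-style files. The format here (lines of `KEY=value`, with `#` comments) is an illustration, not a standardized grammar; in a real program the contents would come from `std::fs::read_to_string`:

```rust
use std::collections::HashMap;

// Parse a simple `KEY=value` configuration format.
// Blank lines and lines starting with `#` are ignored.
fn parse_config(contents: &str) -> HashMap<String, String> {
    let mut config = HashMap::new();
    for line in contents.lines() {
        let line = line.trim();
        if line.is_empty() || line.starts_with('#') {
            continue; // skip blanks and comments
        }
        if let Some((key, value)) = line.split_once('=') {
            config.insert(key.trim().to_string(), value.trim().to_string());
        }
    }
    config
}

fn main() {
    let contents = "# server settings\nlog_level = debug\nport = 8080\n";
    let config = parse_config(contents);
    println!("{:?}", config);
}
```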
It is a good idea to make the path to the configuration file configurable as well, using either a command line argument or an environment variable.
Program exit codes

Up until now, we only talked about getting data into and out of processes at startup or during execution. It is often also helpful to know how a process terminated, in particular whether an error occurred or the process exited successfully. The simplest way to do this is to make use of the program exit code. This is typically an 8-bit integer that represents the exit status of the process.

On Linux, we can use the `waitpid` function to wait for a child process to terminate and then inspect the `status` variable that `waitpid` sets to see how the child process terminated. This is how your shell can figure out whether a process exited successfully or not.
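The Rust standard library wraps this mechanism: `Child::wait` blocks until the child terminates and returns an `ExitStatus`, much like `waitpid` fills in its `status` variable. A small sketch, assuming a Unix-like system where `ls` is available:

```rust
use std::process::Command;

fn main() {
    // Spawn a child process and wait for it to terminate, the Rust analogue
    // of fork/exec followed by waitpid.
    let mut child = Command::new("ls").spawn().expect("failed to start ls");
    let status = child.wait().expect("failed to wait on ls");

    // status.code() yields the exit code; on Unix it is None if the process
    // was terminated by a signal instead of exiting on its own.
    match status.code() {
        Some(0) => println!("ls exited successfully"),
        Some(code) => println!("ls failed with exit code {}", code),
        None => println!("ls was terminated by a signal"),
    }
}
```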
By convention, an exit code of `0` represents successful program termination, and any non-zero exit code indicates a failure. In C and C++, there are two constants that can be used: `EXIT_SUCCESS` to indicate successful termination, and `EXIT_FAILURE` to indicate abnormal process termination.
Running processes in Rust

Let's put all our knowledge together and work with processes in Rust. Here, we can use the `std::process` module, which contains functions to execute and manage processes from a Rust program.

The main type used for executing other processes is `Command`. We can launch the program `program` with a call to `Command::new("program").spawn()` or `Command::new("program").output()`. The first variant (`spawn()`) detaches the spawned process from the current program and only returns a handle to the child process. The second variant (`output()`) waits for the process to finish and returns its result. This includes the program exit code, as well as all data that the program wrote to the output streams `stdout` and `stderr`. Here is the signature of `output`:
```rust
pub fn output(&mut self) -> Result<Output>
```
It returns a `Result` because spawning a process might fail. If it succeeds, the relevant information is stored in the `Output` structure:
```rust
pub struct Output {
    pub status: ExitStatus,
    pub stdout: Vec<u8>,
    pub stderr: Vec<u8>,
}
```
Notice that the output of `stdout` and `stderr` is represented not as a `String` but as a `Vec<u8>`, so a vector of bytes. This emphasizes the fact that the output streams can be used to output data in any format from a process. Even though we might be used to printing text to the standard output (e.g. by using `println!`), it is perfectly valid to output binary data to the standard output.
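To make this concrete, here is a small sketch that writes raw, non-UTF-8 bytes to the standard output. The byte values are arbitrary and chosen purely for illustration; the generic helper exists only so the writing logic works against any `Write` target:

```rust
use std::io::{self, Write};

// Write raw bytes to any writer; in main, this is the locked standard output.
fn write_magic<W: Write>(mut out: W) -> io::Result<()> {
    // These four bytes are not valid UTF-8 text; stdout is a byte stream,
    // not a text stream, so this is perfectly fine.
    out.write_all(&[0xDE, 0xAD, 0xBE, 0xEF])
}

fn main() -> io::Result<()> {
    let mut stdout = io::stdout().lock();
    write_magic(&mut stdout)?;
    stdout.flush()
}
```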
Putting all our previous knowledge together, we can write a small program that executes another program and processes its output, like so:

```rust
use std::process::Command;
use anyhow::{bail, Result};

fn main() -> Result<()> {
    let output = Command::new("ls").arg("-a").output()?;
    if !output.status.success() {
        bail!("Process 'ls' failed with exit code {}", output.status);
    }
    let stdout_data = output.stdout;
    let stdout_as_string = String::from_utf8(stdout_data)?;
    let files = stdout_as_string.trim().split("\n");
    println!(
        "There are {} files/directories in the current directory",
        files.count()
    );
    Ok(())
}
```
In this program, we are using `Command` to launch a new process, even supplying this process with command line arguments of its own. In this case, we run `ls -a`, which prints a list of all files and directories in the directory it was executed from. How does `ls` know from which directory it was executed? Every process has a current working directory, which it inherits from its parent, in this case from our Rust program, which itself inherited it from whatever process launched the Rust program. Most shells additionally mirror this directory in the environment variable `PWD`; you can inspect it from your command line (on Linux or macOS) by typing `echo $PWD`.
Back to our Rust program. We configure a new process, launch it, and immediately wait for its output by calling `output()`. We use the `?` operator and the `anyhow` crate to deal with any errors by immediately exiting `main` in an error case. Even if we successfully launched the `ls` program, it might still fail, so we have to check the program exit code using `output.status.success()`. If it succeeded, we have access to the data it wrote to the standard output. We know that `ls` prints textual data, so we can take the bytes that `ls` wrote to `stdout` and convert them to a Rust `String` using `String::from_utf8`. Lastly, we use some algorithms to split this string into its lines and count them, which gives us the number of files/directories in the current directory.
While this program does not do much that you couldn't achieve on the command line alone fairly easily (e.g. using `ls -a | wc -l`), it illustrates process control in Rust and shows off some of the other features that we learned about, like iterator algorithms (`count`) and error handling.
Recap

This concludes the section on process control, and with it the chapter on systems-level I/O. We learned a lot about how processes communicate with each other and how we can interact with other devices such as the disk or the network in Rust. The important things to take away from this chapter are:

- The Unix file abstraction (files are just sequences of bytes) and how it translates to the Rust I/O traits (`Read` and `Write`)
- The difference between a file and the file system. The latter gives access to files on the disk through file paths and often supports hierarchical grouping of files into directories
- Network communication using the IP protocol (processes on remote machines are identified by the machine's IP address and a socket address) and how network connections behave similarly to files in Rust (by using the same `Read` and `Write` traits)
- Processes communicate simple information with each other through signals. If we want to share memory between processes, we can do that by sharing virtual pages using shared memory
- Processes have default input and output channels called `stdin`, `stdout`, and `stderr`, which are simply files that we can write to and read from
- For process configuration, we use command line arguments and environment variables