Concurrency Part 2

Applied parallelism

Patterns for parallel programming

Sharing Memory

  • Requires synchronization, typically provided by the OS in the form of Mutexes, Semaphores, Condition Variables etc.
  • Arc<Mutex<T>> is the default way in Rust; atomics are another option
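
A minimal sketch of the shared-memory approach (thread count and workload are illustrative): several threads increment a shared counter behind an Arc<Mutex<T>>.

use std::sync::{Arc, Mutex};
use std::thread;

fn main() {
    // Shared state: the Arc shares ownership, the Mutex synchronizes access.
    let counter = Arc::new(Mutex::new(0u64));

    let handles: Vec<_> = (0..4)
        .map(|_| {
            let counter = Arc::clone(&counter);
            thread::spawn(move || {
                for _ in 0..1_000 {
                    // Lock, mutate, unlock (the guard is dropped at the end of the statement).
                    *counter.lock().unwrap() += 1;
                }
            })
        })
        .collect();

    for handle in handles {
        handle.join().unwrap();
    }

    println!("total = {}", counter.lock().unwrap());
}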

Message passing

  • Threads exchange messages through a synchronized queue (called a channel)
  • The corresponding problems are called producer-consumer problems
    • 1 to N threads produce pieces of work and push them into the queue
    • 1 to M threads pull pieces of work from the queue
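
A minimal sketch of the producer-consumer pattern using the standard library channel (std::sync::mpsc); here several producers and a single consumer, purely for illustration (an std receiver has a single owner, so M consumers would need a Mutex around it or a multi-consumer channel crate).

use std::sync::mpsc;
use std::thread;

fn main() {
    let (tx, rx) = mpsc::channel();

    // N producer threads push work items into the channel.
    let producers: Vec<_> = (0..3)
        .map(|id| {
            let tx = tx.clone();
            thread::spawn(move || {
                for item in 0..5 {
                    tx.send((id, item)).unwrap();
                }
            })
        })
        .collect();
    // Drop the original sender so the channel closes once all producers finish.
    drop(tx);

    // A single consumer pulls items until every sender is gone.
    for (id, item) in rx {
        println!("got item {item} from producer {id}");
    }

    for p in producers {
        p.join().unwrap();
    }
}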

Functional programming patterns

  • In Rust, things like parallel iterators are often used (e.g. through the rayon crate)
  • Iterator algorithms typically use pure functions (functions without side effects), which are easy to parallelize
  • Still requires message passing or sharing memory under the hood
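
A minimal sketch with rayon's parallel iterators (assuming rayon is added as a dependency; the workload is made up): a pure per-element function is mapped and reduced in parallel.

use rayon::prelude::*;

fn main() {
    let input: Vec<u64> = (0..1_000_000).collect();

    // par_iter() splits the work across rayon's thread pool;
    // the closure is a pure function, so no explicit synchronization is needed.
    let sum_of_squares: u64 = input.par_iter().map(|x| x * x).sum();

    println!("sum of squares = {sum_of_squares}");
}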

One problem - Three solutions

  • We will look at one specific problem solved in three different ways using the three parallel programming patterns
  • The problem: Image Convolutions

https://en.wikipedia.org/wiki/Kernel_(image_processing)

Convolution Example: Edge Detection

Convolution by sharing memory

[Figure: Thread 1 to Thread N each write a disjoint region of the target image while reading overlapping regions of the source image]
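
A minimal sketch of this shared-memory variant (image layout, band size, and the 3x3 box blur standing in for a real kernel are all illustrative assumptions): the target image is split into disjoint row bands with chunks_mut, while every thread reads the whole source image through a shared reference.

use std::thread;

// Grayscale image in row-major order; width * height == data.len().
struct Image {
    width: usize,
    height: usize,
    data: Vec<f32>,
}

// 3x3 box blur as a stand-in for an arbitrary convolution kernel.
fn convolve_rows(src: &Image, dst_rows: &mut [f32], first_row: usize) {
    let rows = dst_rows.len() / src.width;
    for r in 0..rows {
        for x in 0..src.width {
            let mut sum = 0.0;
            for dy in -1i64..=1 {
                for dx in -1i64..=1 {
                    // Clamp reads at the image border.
                    let sy = ((first_row + r) as i64 + dy).clamp(0, src.height as i64 - 1) as usize;
                    let sx = (x as i64 + dx).clamp(0, src.width as i64 - 1) as usize;
                    sum += src.data[sy * src.width + sx];
                }
            }
            dst_rows[r * src.width + x] = sum / 9.0;
        }
    }
}

fn convolve_parallel(src: &Image, dst: &mut Image, num_threads: usize) {
    let rows_per_band = (src.height + num_threads - 1) / num_threads;
    let band_len = rows_per_band * src.width;
    // Disjoint write regions: chunks_mut hands each thread its own &mut band of the target.
    // Overlapping read regions: all threads share &src.
    thread::scope(|s| {
        for (i, band) in dst.data.chunks_mut(band_len).enumerate() {
            s.spawn(move || convolve_rows(src, band, i * rows_per_band));
        }
    });
}

fn main() {
    let (w, h) = (64, 64);
    let src = Image { width: w, height: h, data: vec![1.0; w * h] };
    let mut dst = Image { width: w, height: h, data: vec![0.0; w * h] };
    convolve_parallel(&src, &mut dst, 4);
    println!("dst[0] = {}", dst.data[0]); // 1.0 for a constant input image
}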


Convolution by message passing

[Figure: worker threads pulling work items from a shared message queue]

  • Split the image into small work items, wrapped in some type of message
  • Put the messages into the message queue
  • Each thread pulls the next available item from the message queue
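
A minimal sketch of this variant using only the standard library (all names and sizes are illustrative; the source image is shared read-only through an Arc for brevity, and since the std mpsc receiver has a single owner, the workers share it behind a Mutex; a multi-consumer channel crate such as crossbeam-channel would avoid that): work items are row ranges, results travel back over a second channel.

use std::sync::mpsc;
use std::sync::{Arc, Mutex};
use std::thread;

type WorkItem = std::ops::Range<usize>; // rows of the image to process
type ResultMsg = (usize, Vec<f32>);     // (first row, computed pixels)

fn main() {
    let (width, height) = (256usize, 256usize);
    let src = Arc::new(vec![1.0f32; width * height]); // shared read-only source image
    let num_workers = 4;

    let (work_tx, work_rx) = mpsc::channel::<WorkItem>();
    let (result_tx, result_rx) = mpsc::channel::<ResultMsg>();
    let work_rx = Arc::new(Mutex::new(work_rx)); // let all workers pull from one queue

    // Worker threads: pull the next available item, convolve those rows, send the result back.
    let workers: Vec<_> = (0..num_workers)
        .map(|_| {
            let (src, work_rx, result_tx) =
                (Arc::clone(&src), Arc::clone(&work_rx), result_tx.clone());
            thread::spawn(move || loop {
                // Hold the lock only long enough to take one item off the queue.
                let rows = match work_rx.lock().unwrap().recv() {
                    Ok(rows) => rows,
                    Err(_) => break, // queue closed and drained: this worker is done
                };
                let mut out = vec![0.0f32; rows.len() * width];
                for (i, y) in rows.clone().enumerate() {
                    for x in 0..width {
                        // 3-tap horizontal blur as a stand-in for a real kernel.
                        let xm = x.saturating_sub(1);
                        let xp = (x + 1).min(width - 1);
                        out[i * width + x] =
                            (src[y * width + xm] + src[y * width + x] + src[y * width + xp]) / 3.0;
                    }
                }
                result_tx.send((rows.start, out)).unwrap();
            })
        })
        .collect();
    drop(result_tx); // keep only the workers' clones of the result sender alive

    // Split the image into small work items and put them into the message queue.
    let rows_per_item = 16;
    let mut num_items = 0;
    for start in (0..height).step_by(rows_per_item) {
        work_tx.send(start..(start + rows_per_item).min(height)).unwrap();
        num_items += 1;
    }
    drop(work_tx); // close the queue so the workers stop once it runs dry

    // Assemble the target image from the result messages (copying out of each message).
    let mut dst = vec![0.0f32; width * height];
    for _ in 0..num_items {
        let (first_row, pixels) = result_rx.recv().unwrap();
        dst[first_row * width..first_row * width + pixels.len()].copy_from_slice(&pixels);
    }
    for worker in workers {
        worker.join().unwrap();
    }
    println!("dst[0] = {}", dst[0]); // 1.0 for a constant input image
}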

Convolution using Functional Programming

Live code examples
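
A minimal sketch of the functional variant (assuming the rayon crate; image size and the 3-tap blur are illustrative): a parallel iterator over disjoint output rows replaces the hand-rolled thread management.

use rayon::prelude::*;

fn main() {
    let (width, height) = (256usize, 256usize);
    let src = vec![1.0f32; width * height];
    let mut dst = vec![0.0f32; width * height];

    // Parallel iterator over disjoint output rows; rayon manages the thread pool
    // and the work splitting, and each closure only writes into its own row.
    dst.par_chunks_mut(width)
        .enumerate()
        .for_each(|(y, row)| {
            for x in 0..width {
                // 3-tap horizontal blur as a stand-in for a real kernel.
                let xm = x.saturating_sub(1);
                let xp = (x + 1).min(width - 1);
                row[x] = (src[y * width + xm] + src[y * width + x] + src[y * width + xp]) / 3.0;
            }
        });

    println!("dst[0] = {}", dst[0]);
}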

Observations - Threads

  • Uses the fork-join pattern, which requires knowing the data upfront
  • Works well if the problem can be split into disjoint subsets
  • Competitive performance

Observations - rayon

  • The easiest to write, but uses a sophisticated library
  • Works well if the problem can be formulated in terms of iterator algorithms
  • Competitive performance
  • But beware: Iterators can hide allocations!

Observations - Message Passing

  • The hardest to write
  • Performance overhead if messages need to be self-contained (i.e. they don’t borrow memory)
    • Copying to/from messages
  • Works with unknown/dynamic amounts of data
  • Completely decouples producer and consumer

Splitting up data for parallel processing

  • Proper parallel processing often hinges on identifying disjoint subsets of the data that can be processed in parallel
  • Use the Rust compiler and its Rule Of One to guide you:
    • Mixing mutable and immutable borrows on the same data -> Not allowed!
    • Access disjoint subsets of a slice: Use chunks and chunks_mut (see the sketch below)
  • Consider write-buffers if:
    • Processing is compute-heavy
    • And you write to the same output data
    • And copying is acceptable (measure!)
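
A small illustration of how the borrow rules guide the split (buffer contents are made up): a slice cannot be mutably borrowed twice at once, but split_at_mut and chunks_mut hand out non-overlapping mutable sub-slices that can safely go to different threads.

use std::thread;

fn main() {
    let mut data = vec![0u32; 8];

    // let a = &mut data[..4];
    // let b = &mut data[4..]; // rejected: two mutable borrows of `data` at the same time

    // split_at_mut (like chunks_mut) proves to the compiler that the halves are disjoint.
    let (left, right) = data.split_at_mut(4);
    thread::scope(|s| {
        s.spawn(move || left.iter_mut().for_each(|x| *x += 1));
        s.spawn(move || right.iter_mut().for_each(|x| *x += 2));
    });

    println!("{data:?}"); // [1, 1, 1, 1, 2, 2, 2, 2]
}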

A non-trivial parallelization problem

  • Matrix multiplication (a core part of almost all modern ML approaches)
  • When done on a GPU, we want high parallelization, i.e. thousands of parallel threads
  • We also care about cache locality
  • One way: Tiled Matrix Multiplication
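
A sequential CPU sketch of the tiled loop structure (the names MTILE, NTILE, KTILE and their values are illustrative placeholders; a GPU version would map the output tiles onto thread blocks): each innermost block of work touches only small, cache-friendly pieces of A, B, and C.

const MTILE: usize = 64;
const NTILE: usize = 64;
const KTILE: usize = 64;

// C (m x n) += A (m x k) * B (k x n), all row-major; m, n, k are assumed to be
// divisible by the tile sizes to keep the sketch short.
fn matmul_tiled(a: &[f32], b: &[f32], c: &mut [f32], m: usize, n: usize, k: usize) {
    for i0 in (0..m).step_by(MTILE) {
        for j0 in (0..n).step_by(NTILE) {
            for k0 in (0..k).step_by(KTILE) {
                // One MTILE x NTILE block of C, updated with a KTILE-deep slice of A and B.
                for i in i0..i0 + MTILE {
                    for j in j0..j0 + NTILE {
                        let mut acc = c[i * n + j];
                        for kk in k0..k0 + KTILE {
                            acc += a[i * k + kk] * b[kk * n + j];
                        }
                        c[i * n + j] = acc;
                    }
                }
            }
        }
    }
}

fn main() {
    let (m, n, k) = (128, 128, 128);
    let a = vec![1.0f32; m * k];
    let b = vec![1.0f32; k * n];
    let mut c = vec![0.0f32; m * n];
    matmul_tiled(&a, &b, &mut c, m, n, k);
    println!("c[0] = {}", c[0]); // 128 for all-ones inputs
}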

https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html

Tiled Matrix Multiplication

  • Which parallelization strategy to use?
  • What are good values for Mtile, Ntile, Ktile?
  • How would we know?
    • We need to measure and will do so in a later lecture, stay tuned ;)

https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html