Concurrency Part 2

Applied parallelism

Patterns for parallel programming

Sharing Memory

  • Requires synchronization, typically provided by the OS in the form of Mutexes, Semaphores, Condition Variables etc.
  • Arc<Mutex<T>> is the default way in Rust; atomics are another option
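
A minimal sketch of the shared-memory approach (thread count and workload are illustrative): several threads increment a shared counter behind an Arc<Mutex<T>>.

use std::sync::{Arc, Mutex};
use std::thread;

fn main() {
    // Shared state: the Arc shares ownership, the Mutex synchronizes access.
    let counter = Arc::new(Mutex::new(0u64));

    let handles: Vec<_> = (0..4)
        .map(|_| {
            let counter = Arc::clone(&counter);
            thread::spawn(move || {
                for _ in 0..1_000 {
                    // Lock, mutate, unlock (the guard is dropped at the end of the statement).
                    *counter.lock().unwrap() += 1;
                }
            })
        })
        .collect();

    for handle in handles {
        handle.join().unwrap();
    }

    println!("total = {}", counter.lock().unwrap());
}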

Message passing

  • Threads exchange messages through a synchronized queue (called a channel)
  • The corresponding problems are called producer-consumer problems
    • 1 to N threads produce pieces of work and push them into the queue
    • 1 to M threads pull pieces of work from the queue
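
A minimal sketch of the producer-consumer pattern using the standard library channel (std::sync::mpsc); here several producers and a single consumer, purely for illustration (an std receiver has a single owner, so M consumers would need a Mutex around it or a multi-consumer channel crate).

use std::sync::mpsc;
use std::thread;

fn main() {
    let (tx, rx) = mpsc::channel();

    // N producer threads push work items into the channel.
    let producers: Vec<_> = (0..3)
        .map(|id| {
            let tx = tx.clone();
            thread::spawn(move || {
                for item in 0..5 {
                    tx.send((id, item)).unwrap();
                }
            })
        })
        .collect();
    // Drop the original sender so the channel closes once all producers finish.
    drop(tx);

    // A single consumer pulls items until every sender is gone.
    for (id, item) in rx {
        println!("got item {item} from producer {id}");
    }

    for p in producers {
        p.join().unwrap();
    }
}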

Functional programming patterns

  • In Rust, things like parallel iterators are often used (e.g. through the rayon crate)
  • Iterator algorithms typically use pure functions (functions without side effects), which are easy to parallelize
  • Still requires message passing or sharing memory under the hood
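
A minimal sketch with rayon's parallel iterators (assuming rayon is added as a dependency; the workload is made up): a pure per-element function is mapped and reduced in parallel.

use rayon::prelude::*;

fn main() {
    let input: Vec<u64> = (0..1_000_000).collect();

    // par_iter() splits the work across rayon's thread pool;
    // the closure is a pure function, so no explicit synchronization is needed.
    let sum_of_squares: u64 = input.par_iter().map(|x| x * x).sum();

    println!("sum of squares = {sum_of_squares}");
}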

One problem - Three solutions

  • We will look at one specific problem solved in three different ways using the three parallel programming patterns
  • The problem: Image Convolutions

https://en.wikipedia.org/wiki/Kernel_(image_processing)

Convolution Example: Edge Detection

Convolution by sharing memory

[Figure: Thread 1 to Thread N each write a disjoint region of the target image while reading overlapping regions of the source image]
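
A minimal sketch of this shared-memory variant (image layout, band size, and the 3x3 box blur standing in for a real kernel are all illustrative assumptions): the target image is split into disjoint row bands with chunks_mut, while every thread reads the whole source image through a shared reference.

use std::thread;

// Grayscale image in row-major order; width * height == data.len().
struct Image {
    width: usize,
    height: usize,
    data: Vec<f32>,
}

// 3x3 box blur as a stand-in for an arbitrary convolution kernel.
fn convolve_rows(src: &Image, dst_rows: &mut [f32], first_row: usize) {
    let rows = dst_rows.len() / src.width;
    for r in 0..rows {
        for x in 0..src.width {
            let mut sum = 0.0;
            for dy in -1i64..=1 {
                for dx in -1i64..=1 {
                    // Clamp reads at the image border.
                    let sy = ((first_row + r) as i64 + dy).clamp(0, src.height as i64 - 1) as usize;
                    let sx = (x as i64 + dx).clamp(0, src.width as i64 - 1) as usize;
                    sum += src.data[sy * src.width + sx];
                }
            }
            dst_rows[r * src.width + x] = sum / 9.0;
        }
    }
}

fn convolve_parallel(src: &Image, dst: &mut Image, num_threads: usize) {
    let rows_per_band = (src.height + num_threads - 1) / num_threads;
    let band_len = rows_per_band * src.width;
    // Disjoint write regions: chunks_mut hands each thread its own &mut band of the target.
    // Overlapping read regions: all threads share &src.
    thread::scope(|s| {
        for (i, band) in dst.data.chunks_mut(band_len).enumerate() {
            s.spawn(move || convolve_rows(src, band, i * rows_per_band));
        }
    });
}

fn main() {
    let (w, h) = (64, 64);
    let src = Image { width: w, height: h, data: vec![1.0; w * h] };
    let mut dst = Image { width: w, height: h, data: vec![0.0; w * h] };
    convolve_parallel(&src, &mut dst, 4);
    println!("dst[0] = {}", dst.data[0]); // 1.0 for a constant input image
}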


Convolution by message passing

[Figure: worker threads pulling work items from a shared message queue]

  • Split the image into small work items, wrapped in some type of message
  • Put the messages into the message queue
  • Each thread pulls the next available item from the message queue
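
A minimal sketch of this variant using only the standard library (all names and sizes are illustrative; the source image is shared read-only through an Arc for brevity, and since the std mpsc receiver has a single owner, the workers share it behind a Mutex; a multi-consumer channel crate such as crossbeam-channel would avoid that): work items are row ranges, results travel back over a second channel.

use std::sync::mpsc;
use std::sync::{Arc, Mutex};
use std::thread;

type WorkItem = std::ops::Range<usize>; // rows of the image to process
type ResultMsg = (usize, Vec<f32>);     // (first row, computed pixels)

fn main() {
    let (width, height) = (256usize, 256usize);
    let src = Arc::new(vec![1.0f32; width * height]); // shared read-only source image
    let num_workers = 4;

    let (work_tx, work_rx) = mpsc::channel::<WorkItem>();
    let (result_tx, result_rx) = mpsc::channel::<ResultMsg>();
    let work_rx = Arc::new(Mutex::new(work_rx)); // let all workers pull from one queue

    // Worker threads: pull the next available item, convolve those rows, send the result back.
    let workers: Vec<_> = (0..num_workers)
        .map(|_| {
            let (src, work_rx, result_tx) =
                (Arc::clone(&src), Arc::clone(&work_rx), result_tx.clone());
            thread::spawn(move || loop {
                // Hold the lock only long enough to take one item off the queue.
                let rows = match work_rx.lock().unwrap().recv() {
                    Ok(rows) => rows,
                    Err(_) => break, // queue closed and drained: this worker is done
                };
                let mut out = vec![0.0f32; rows.len() * width];
                for (i, y) in rows.clone().enumerate() {
                    for x in 0..width {
                        // 3-tap horizontal blur as a stand-in for a real kernel.
                        let xm = x.saturating_sub(1);
                        let xp = (x + 1).min(width - 1);
                        out[i * width + x] =
                            (src[y * width + xm] + src[y * width + x] + src[y * width + xp]) / 3.0;
                    }
                }
                result_tx.send((rows.start, out)).unwrap();
            })
        })
        .collect();
    drop(result_tx); // keep only the workers' clones of the result sender alive

    // Split the image into small work items and put them into the message queue.
    let rows_per_item = 16;
    let mut num_items = 0;
    for start in (0..height).step_by(rows_per_item) {
        work_tx.send(start..(start + rows_per_item).min(height)).unwrap();
        num_items += 1;
    }
    drop(work_tx); // close the queue so the workers stop once it runs dry

    // Assemble the target image from the result messages (copying out of each message).
    let mut dst = vec![0.0f32; width * height];
    for _ in 0..num_items {
        let (first_row, pixels) = result_rx.recv().unwrap();
        dst[first_row * width..first_row * width + pixels.len()].copy_from_slice(&pixels);
    }
    for worker in workers {
        worker.join().unwrap();
    }
    println!("dst[0] = {}", dst[0]); // 1.0 for a constant input image
}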

Convolution using Functional Programming

Live code examples
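
A minimal sketch of the functional variant (assuming the rayon crate; image size and the 3-tap blur are illustrative): a parallel iterator over disjoint output rows replaces the hand-rolled thread management.

use rayon::prelude::*;

fn main() {
    let (width, height) = (256usize, 256usize);
    let src = vec![1.0f32; width * height];
    let mut dst = vec![0.0f32; width * height];

    // Parallel iterator over disjoint output rows; rayon manages the thread pool
    // and the work splitting, and each closure only writes into its own row.
    dst.par_chunks_mut(width)
        .enumerate()
        .for_each(|(y, row)| {
            for x in 0..width {
                // 3-tap horizontal blur as a stand-in for a real kernel.
                let xm = x.saturating_sub(1);
                let xp = (x + 1).min(width - 1);
                row[x] = (src[y * width + xm] + src[y * width + x] + src[y * width + xp]) / 3.0;
            }
        });

    println!("dst[0] = {}", dst[0]);
}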

Observations - Threads

  • Uses the fork-join pattern, which requires knowing the data upfront
  • Works well if the problem can be split into disjoint subsets
  • Competitive performance

Observations - rayon

  • The easiest to write, but uses a sophisticated library
  • Works well if the problem can be formulated in terms of iterator algorithms
  • Competitive performance
  • But beware: Iterators can hide allocations!

Observations - Message Passing

  • The hardest to write
  • Performance overhead if messages need to be self-contained (i.e. they don’t borrow memory)
    • Copying to/from messages
  • Works with unknown/dynamic amounts of data
  • Completely decouples producer and consumer

Splitting up data for parallel processing

  • Proper parallel processing often hinges on identifying disjoint subsets of the data that can be processed in parallel
  • Use the Rust compiler and its Rule Of One to guide you:
    • Mixing mutable and immutable borrows on the same data -> Not allowed!
    • Access disjoint subsets of a slice: Use chunks and chunks_mut (see the sketch below)
  • Consider write-buffers if:
    • Processing is compute-heavy
    • And you write to the same output data
    • And copying is acceptable (measure!)
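
A small illustration of how the borrow rules guide the split (buffer contents are made up): a slice cannot be mutably borrowed twice at once, but split_at_mut and chunks_mut hand out non-overlapping mutable sub-slices that can safely go to different threads.

use std::thread;

fn main() {
    let mut data = vec![0u32; 8];

    // let a = &mut data[..4];
    // let b = &mut data[4..]; // rejected: two mutable borrows of `data` at the same time

    // split_at_mut (like chunks_mut) proves to the compiler that the halves are disjoint.
    let (left, right) = data.split_at_mut(4);
    thread::scope(|s| {
        s.spawn(move || left.iter_mut().for_each(|x| *x += 1));
        s.spawn(move || right.iter_mut().for_each(|x| *x += 2));
    });

    println!("{data:?}"); // [1, 1, 1, 1, 2, 2, 2, 2]
}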

A non-trivial parallelization problem

  • Matrix multiplication (a core part of almost all modern ML approaches)
  • When done on a GPU, we want high parallelization, i.e. thousands of parallel threads
  • We also care about cache locality
  • One way: Tiled Matrix Multiplication
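
A sequential CPU sketch of the tiled loop structure (the names MTILE, NTILE, KTILE and their values are illustrative placeholders; a GPU version would map the output tiles onto thread blocks): each innermost block of work touches only small, cache-friendly pieces of A, B, and C.

const MTILE: usize = 64;
const NTILE: usize = 64;
const KTILE: usize = 64;

// C (m x n) += A (m x k) * B (k x n), all row-major; m, n, k are assumed to be
// divisible by the tile sizes to keep the sketch short.
fn matmul_tiled(a: &[f32], b: &[f32], c: &mut [f32], m: usize, n: usize, k: usize) {
    for i0 in (0..m).step_by(MTILE) {
        for j0 in (0..n).step_by(NTILE) {
            for k0 in (0..k).step_by(KTILE) {
                // One MTILE x NTILE block of C, updated with a KTILE-deep slice of A and B.
                for i in i0..i0 + MTILE {
                    for j in j0..j0 + NTILE {
                        let mut acc = c[i * n + j];
                        for kk in k0..k0 + KTILE {
                            acc += a[i * k + kk] * b[kk * n + j];
                        }
                        c[i * n + j] = acc;
                    }
                }
            }
        }
    }
}

fn main() {
    let (m, n, k) = (128, 128, 128);
    let a = vec![1.0f32; m * k];
    let b = vec![1.0f32; k * n];
    let mut c = vec![0.0f32; m * n];
    matmul_tiled(&a, &b, &mut c, m, n, k);
    println!("c[0] = {}", c[0]); // 128 for all-ones inputs
}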

https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html

Tiled Matrix Multiplication

  • Which parallelization strategy to use?
  • What are good values for Mtile, Ntile, Ktile?
  • How would we know?
    • We need to measure and will do so in a later lecture, stay tuned ;)

https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html