Mechanical Sympathy
Many programming languages try to be as “high-level” as possible by shielding the developer from the details of the hardware and the inner workings of the compiler and runtime. What a shame! Users of these languages gain some ease of coding at the price of needlessly slow apps.
What if a language laid only a thin abstraction over the hardware it is supposed to run on? And what if developers knew enough about the hardware they write code for?
Why you should have (or develop) mechanical sympathy
In an ideal world, algorithms would always execute without any friction—no CPU or RAM limits, no I/O delay. Reality, however, confronts the developer with hard and soft limits as well as more or less subtle hardware peculiarities. Mechanical sympathy means knowing about these factors and using them to your code's advantage.
The term “mechanical sympathy” originates from Jackie Stewart, a racing driver, who said:
You don’t have to be an engineer to be a racing driver, but you do have to have mechanical sympathy.
– Jackie Stewart
What he meant is that if you know your car, you become a better driver. A better driver can drive his car faster, which, after all, is the point of car races, right? The same applies to software development: Know your hardware architecture—the CPU, RAM, I/O, and so forth—and you can become a better developer. A better developer writes software that's more efficient, in terms of both speed and memory consumption. Have you noticed that hardware has become tremendously faster over the last decades while software feels as sluggish as ever? Performance optimization has been neglected for too long.
A common meme in our industry is “avoid premature optimization.” I used to preach this myself, and now I’m very sorry that I did.
I want to extend the notion of mechanical sympathy from hardware to software mechanics. Compilers and runtimes can contain complex algorithms that are worth developing some mechanical sympathy for. Think of:
- memory allocation and garbage collection
- concurrency as a “mechanical” system
to name only two. My point is: Certain kinds of software abstraction layers can expose hardware-like behavior on their own.
Mechanical sympathy for hardware
Now let's look at some features of modern CPUs that can affect your code in unforeseen ways, but that you can also leverage to make your code perform better.
Local caches
From a software point of view, RAM is linear. Available memory is a long row of memory cells numbered from 0 to 2^x-1 (where x is the width of the address bus in bits).
In reality, RAM is hierarchical: A series of caches sits between RAM and a CPU core, each one smaller but faster than the previous one. (As these caches are local to the CPU core, they are typically called “L1”, “L2”, and “L3”.) When the core fetches data from RAM, the data remains cached as long as available cache space permits, and future requests for the same data become much faster.
Algorithms can be designed to take advantage of the caches by operating on amounts of data that fit into a local CPU cache—ideally into the L1 cache that sits closest to the CPU and is the fastest but also the smallest one in the cache hierarchy.
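As a sketch of this effect (the size n and the function names are made up for illustration), consider summing a large matrix stored row by row in one contiguous slice. Iterating in row order walks memory sequentially and keeps the caches warm; iterating in column order strides across the whole matrix between reads and defeats the caches:

package main

import (
    "fmt"
    "time"
)

const n = 2048 // arbitrary size for illustration

// sumRowMajor reads the flat backing slice sequentially,
// so consecutive reads hit the same cache lines.
func sumRowMajor(m []int64) (sum int64) {
    for i := 0; i < n; i++ {
        for j := 0; j < n; j++ {
            sum += m[i*n+j]
        }
    }
    return sum
}

// sumColMajor strides n*8 bytes (16 KiB for n = 2048) between
// consecutive reads, so almost every read misses the cache.
func sumColMajor(m []int64) (sum int64) {
    for j := 0; j < n; j++ {
        for i := 0; i < n; i++ {
            sum += m[i*n+j]
        }
    }
    return sum
}

func main() {
    m := make([]int64, n*n)
    start := time.Now()
    sumRowMajor(m)
    fmt.Println("row major:   ", time.Since(start))
    start = time.Now()
    sumColMajor(m)
    fmt.Println("column major:", time.Since(start))
}

Both functions compute exactly the same sum; only the memory access pattern differs.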
On the flip side, algorithms that are unaware of the cache architecture can become slower than necessary, even to an extent that cancels out the performance gains from concurrent work.
How so? Imagine a concurrent app with goroutines running on multiple cores at once. Goroutine a on core A reads a variable and updates it. Now the variable's current value sits in core A’s L1 cache. Then, goroutine b on core B wants to read and update the same variable. To make this possible, the processor must first invalidate the value in A’s L1 cache, move the new value back to main RAM, then let B fetch and process the value and store it in B’s L1 cache. Now imagine if goroutine a wants to update the value again; the slow cache invalidation process repeats.
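The same invalidation happens even when the goroutines touch different variables that merely share a cache line, an effect known as false sharing. Here is a minimal sketch (the struct names and the iteration count are mine, and the 64-byte cache-line size is an assumption that holds for most current CPUs):

package main

import (
    "fmt"
    "sync"
    "sync/atomic"
    "time"
)

const iterations = 10_000_000

// shared places both counters in the same cache line,
// so the cores keep invalidating each other's L1 cache.
type shared struct {
    a, b atomic.Int64
}

// padded pushes b into its own cache line (assuming 64-byte lines).
type padded struct {
    a atomic.Int64
    _ [56]byte
    b atomic.Int64
}

// run lets two goroutines hammer on the two counters concurrently.
func run(a, b *atomic.Int64) time.Duration {
    start := time.Now()
    var wg sync.WaitGroup
    wg.Add(2)
    go func() {
        defer wg.Done()
        for i := 0; i < iterations; i++ {
            a.Add(1)
        }
    }()
    go func() {
        defer wg.Done()
        for i := 0; i < iterations; i++ {
            b.Add(1)
        }
    }()
    wg.Wait()
    return time.Since(start)
}

func main() {
    var s shared
    var p padded
    fmt.Println("same cache line:", run(&s.a, &s.b))
    fmt.Println("padded apart:   ", run(&p.a, &p.b))
}

On a typical multicore machine, the padded variant runs noticeably faster, even though both variants do exactly the same amount of work.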
Structure alignment
Terms like “32-bit processor” or “64-bit architecture” refer to the size of one unit of information that the processor can read from RAM. A 64-bit processor reads data in chunks of 64 bits or eight bytes at once.
But what about smaller data, such as an int16 or a boolean? The CPU still must fetch a full 64-bit chunk of data to read a 16-bit value from RAM.
For this reason, data must be aligned to 64-bit boundaries.
Consider this struct:
var s struct {
    a int8
    b int64
    c bool
}
If it were stored in memory as one contiguous data block, field b would spread across two 64-bit memory cells:
0 8            64
+-+-------------+
|a|b            |
+-+-------------+
|b|c|###########|
+-+-------------+
(The hash signs ### denote unused space within a memory cell.)
To fetch field b, the CPU would have to read both memory cells, extract the two fragments of b, and combine them into one value.
As this would be highly inefficient, the compiler maps the struct to memory so that no value spans two memory cells:
0 8            64
+-+-------------+
|a|#############|
+-+-------------+
|b              |
+-+-------------+
|c|#############|
+-+-------------+
Clearly, our particular struct wastes quite some space by storing two 8-bit values in two separate 64-bit memory cells.
Luckily, as Gophers, we are empowered to change this. The compiler maps struct fields in the sequence given in the source code, so by rearranging the struct a bit, we can have the compiler map the two short fields, a and c, to the same memory cell:
var s struct {
    b int64
    a int8
    c bool
}
The compiler happily packs a and c into a single cell, thus reducing the required memory by one third!
0 8            64
+-+-------------+
|b              |
+-+-------------+
|a|c|###########|
+-+-------------+
This might not sound like much, but for large slices of structs, it can make a real difference.
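If you want to verify this on your machine, unsafe.Sizeof reveals the layout the compiler chose. (The struct names are mine, and the sizes in the comments assume a 64-bit platform.)

package main

import (
    "fmt"
    "unsafe"
)

type wasteful struct {
    a int8
    b int64
    c bool
}

type compact struct {
    b int64
    a int8
    c bool
}

func main() {
    fmt.Println(unsafe.Sizeof(wasteful{})) // 24: three 64-bit cells
    fmt.Println(unsafe.Sizeof(compact{}))  // 16: two 64-bit cells
}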
These were just two examples of using knowledge about hardware behavior to improve efficiency. Next, I'll transpose this principle to software mechanics.
Mechanical sympathy for Go
Programming languages, as simple as some of them may seem, are in fact quite complex systems, composed of a compiler and runtime libraries that directly influence how code performs. Languages often build their data types around specific paradigms.
Two examples of this in Go are slices and goroutine scheduling.
Slices
Every Go newcomer quickly learns that slices are quite special: While other languages treat dynamic arrays as an opaque concept that “just works” (somehow), Go makes the internal structure explicit. Slices live on top of underlying fixed-size arrays, and once a Go developer learns how the mechanics of slicing, copying, and appending work, they can design algorithms that perform better than those that have to rely on an impenetrable abstraction of dynamic arrays.
Slice header
+-----+-----+-----+
| len | cap | ptr |
+-----+-----+--+--+
               |
  +------------+
  |    Array
+-v---------------------------------+
|   |   |   |   |   |   |   |   |   |
+-----------------------------------+
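This header structure explains why slicing is so cheap: a slice expression creates a new header but no new data. It also means that slices can alias each other, which is worth internalizing before designing slice-based algorithms. A tiny, made-up demonstration:

package main

import "fmt"

func main() {
    base := make([]int, 5, 8) // len 5, cap 8
    sub := base[1:3]          // new header, same backing array

    sub[0] = 42 // writes through to the shared array

    fmt.Println(base[1])            // 42
    fmt.Println(len(sub), cap(sub)) // 2 7
}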
Algorithms that operate on a known (maximum) amount of data can pre-allocate underlying arrays, by either explicitly allocating an array and slicing it or calling make with a capacity parameter: s := make([]int, 100, 10000). Without the need to resize the underlying array, allocations are reduced to a minimum. Memory-intensive operations become faster, and the garbage collector has much less to do.
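For example, here is a sketch of the two allocation strategies (the function names and sizes are arbitrary):

package main

import "fmt"

// grow starts with a nil slice; append has to allocate a larger
// backing array and copy all elements over every time the
// capacity is exhausted.
func grow(n int) []int {
    var s []int
    for i := 0; i < n; i++ {
        s = append(s, i)
    }
    return s
}

// preallocated reserves the full capacity up front,
// so append never triggers a reallocation.
func preallocated(n int) []int {
    s := make([]int, 0, n)
    for i := 0; i < n; i++ {
        s = append(s, i)
    }
    return s
}

func main() {
    fmt.Println(len(grow(10_000)), len(preallocated(10_000)))
}

Run both variants under a memory profiler or a benchmark, and the difference in allocation counts becomes obvious.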
Another aspect of slices worth mentioning: As contiguous blocks of data, they are much more cache-friendly than pointer-based structures like linked lists, whose elements are necessarily allocated on the heap and can therefore be anywhere in RAM. So if you're about to implement an algorithm that would seem to require connecting individual data elements through pointers, see if you can find a slice-based implementation instead that can take advantage of cache locality.
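To make the contrast concrete, here is a contrived pointer-based list next to its slice-based equivalent (all names are mine):

package main

import "fmt"

// node is a classic linked-list element. Every node is a separate
// heap allocation and can end up anywhere in RAM.
type node struct {
    value int
    next  *node
}

// sumList chases pointers; each hop can be a cache miss.
func sumList(head *node) (sum int) {
    for n := head; n != nil; n = n.next {
        sum += n.value
    }
    return sum
}

// sumSlice streams through one contiguous block of memory;
// the CPU's prefetcher can keep the caches filled.
func sumSlice(s []int) (sum int) {
    for _, v := range s {
        sum += v
    }
    return sum
}

func main() {
    var head *node
    s := make([]int, 1000)
    for i := range s {
        s[i] = i
        head = &node{value: i, next: head}
    }
    fmt.Println(sumList(head), sumSlice(s)) // same result, different speed
}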
The Go scheduler
You probably have come across demo apps that spawn bazillions of goroutines to demonstrate the amazingly low overhead per goroutine. Yet, knowing and adapting to the way goroutines are scheduled onto system threads can pay off.
Only one system thread can run on a CPU core at any given time, and swapping out a thread to run another one is rather expensive. That's why the Go inventors designed goroutines so that scheduling them is dirt cheap.
The scheduler isn't exactly a simple algorithm, so I attempted to make the logic more accessible by comparing the scheduler to a restaurant.
The most important part of designing concurrent work is to know whether the goroutines you plan to spawn are I/O-bound or CPU-bound.
- I/O-bound goroutines are frequently blocked by slow data exchange with storage or network connections. You can spawn a lot more of them than the number of cores in your target system, as they will mostly be in a blocked state, leaving room for OS activity or other goroutines to run.
- CPU-bound goroutines max out the CPU by doing calculations on data that's readily available in RAM. Here, you'd want to schedule no more goroutines to run simultaneously than there are CPU cores available; any goroutines beyond that number only generate additional scheduling overhead without any further speedup (see the sketch below).
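For the CPU-bound case, a common pattern is to cap the number of workers at runtime.NumCPU(). A sketch, where the int jobs and the squaring “computation” are mere placeholders:

package main

import (
    "fmt"
    "runtime"
    "sync"
)

func main() {
    jobs := make(chan int)
    results := make(chan int)

    // One worker per core: for purely CPU-bound work, more
    // goroutines would only add scheduling overhead.
    var wg sync.WaitGroup
    for w := 0; w < runtime.NumCPU(); w++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for j := range jobs {
                results <- j * j // placeholder for a heavy computation
            }
        }()
    }

    // Feed the workers, then close the channels in order.
    go func() {
        for i := 0; i < 100; i++ {
            jobs <- i
        }
        close(jobs)
    }()
    go func() {
        wg.Wait()
        close(results)
    }()

    sum := 0
    for r := range results {
        sum += r
    }
    fmt.Println(sum)
}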
Concurrency is inherently difficult, even though Go makes it more accessible through concurrency primitives that are easy to reason about (goroutines and channels, that is). So always keep an eye on how concurrent flows unfold in your app: Monitor CPU and scheduling profiles to spot contention, memory use, and goroutine leaks, and adjust your concurrent flows accordingly.
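The cheapest way to get at those profiles is the net/http/pprof package. A minimal sketch, assuming your app doesn't already run an HTTP server:

package main

import (
    "log"
    "net/http"
    _ "net/http/pprof" // registers the /debug/pprof/ handlers
)

func main() {
    // Expose the profiling endpoints while the app does its work.
    go func() {
        log.Println(http.ListenAndServe("localhost:6060", nil))
    }()

    // ... the actual application runs here ...
    select {} // placeholder: keep the app alive
}

With the server running, go tool pprof http://localhost:6060/debug/pprof/profile captures a CPU profile, and http://localhost:6060/debug/pprof/goroutine lists all live goroutines, which is handy for spotting leaks.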
Conclusion: To write good software, learn hardware
As a universal rule, if you want to get better in a particular domain of life, lean out of the window, peek over the fence, and explore the domain adjacent to yours. It is always interesting and sometimes revealing when you take on a new perspective.
Exploring modern hardware is fascinating, and if modern CPU architecture feels overwhelming, start with programming a microcontroller.
Whatever you decide to explore outside the domain you want to get better at, it'll pay off in many ways.
Further reading
Within the boundaries of a Spotlight, I can barely scratch the surface of this topic. No worries! There's a lot of great stuff about mechanical sympathy just one click away:
About cache misses and concurrent access to data
Slow down your code with goroutines
The Go scheduler explained with chefs, pans, and stoves
About mechanical sympathy
Mechanical Sympathy: Understanding the Hardware Makes You a Better Developer
Hardware-Aware Coding: CPU Architecture Concepts Every Developer Should Know
Mechanical sympathy and Go
Writing Go Code with Mechanical Sympathy: Optimizing for Hardware Efficiency
