Distributed Computing With Dried, Salted Cod Fish, WASM, And (Tiny)Go
You don't need monstrous software orchestration systems for collecting information from distributed data sources. Here is an easy way of sending a Go binary to where the data is.
Data is everywhere… but never where you need it
Heaps of gold.
That's what your data is.
The constant streams of log data. The scientific data collected from sensors, cameras, or the particle collider in your backyard. The activity data your security systems accumulate continuously in all your company's subsidiaries around the world.
There is only a tiny problem. The data is scattered across remote data centers. Moving huge amounts of data is difficult and expensive. How can you process data that's distributed around the world?
Compute over data
Instead of sending all the data to one central point for processing, send the processing jobs to where the data is.
Data processing results are typically many orders of magnitude smaller than the original data and can be collected far more easily and quickly than moving all the raw data across the globe to one location.
Compute over data, or CoD, embraces this idea. A CoD system sends compute jobs to remote nodes, where the jobs process the data and send the results back to the originating node (or store the data somewhere to collect it later).
Software orchestration can be overkill
If you do not already run a distributed Kubernetes supercluster, you'll want to avoid the huge overhead of setting up complex software orchestration systems with all their strings attached.
You need a distributed data processing system that is easy to set up and maintain while being efficient and secure.
Bacalhau specializes in orchestrating distributed batch jobs
Bacalhau is a CoD system that makes distributed data processing ridiculously easy. In a nutshell, you –
- set up a swarm of Bacalhau nodes,
- create data processing jobs as Docker containers or WASM binaries,
- send those jobs to every node that hosts the data to process,
- collect the results.
Side note 1: Bacalhau is written in Go.
Side note 2: Bacalhau is the Portuguese word for dried, salted codfish.
How to run a Go WASM binary on a Bacalhau network
The Bacalhau documentation contains examples of running Docker containers or Rust programs as WASM, but here I want to focus on writing a WASM binary for Bacalhau in Go.
For the sake of simplicity, I will use a single, local Bacalhau server, so you need nothing but a computer, the Bacalhau CLI, and TinyGo to follow the steps and create your own mini data processing system.
Step 1: Write a Go program that counts words in files
The data processing job is quite simple: Read all files in a given directory and count the words. Return the per-file results in an output file, and send the total count of all files to stdout.
The Go code is unspectacular. It does not need to know anything about Bacalhau. There are no Bacalhau-specific packages to import. Bacalhau provides the job transparently with input and output directories and also collects everything the job writes to stdout and stderr.
Here is the full code in all its boringness.
package main

import (
    "bufio"
    "fmt"
    "io"
    "log"
    "os"
    "path/filepath"
)

func main() {
    // Bacalhau mounts the input data at /inputs.
    inputDir := "/inputs"
    // The job writes its results to /outputs, to stdout, and to stderr.
    outputDir := "/outputs"

    dir, err := os.Open(inputDir)
    if err != nil {
        // log.Fatal writes to stderr; Bacalhau collects the stderr output as well.
        log.Fatal(err)
    }
    defer dir.Close()

    // Read the names of all files in /inputs. For simplicity, let's assume
    // all of them are plain text files.
    entries, err := dir.Readdirnames(-1)
    if err != nil {
        log.Fatal(err)
    }
    if len(entries) == 0 {
        log.Fatal("No files found")
    }

    out, err := os.Create(filepath.Join(outputDir, "count.txt"))
    if err != nil {
        log.Fatal(err)
    }
    defer out.Close()

    // The total word count of all files goes to stdout.
    total := 0

    // Walk through the files in /inputs and count the words in each file.
    for _, entry := range entries {
        f, err := os.Open(filepath.Join(inputDir, entry))
        if err != nil {
            log.Fatal(err)
        }
        r := bufio.NewReader(f)
        words, err := countWords(r)
        f.Close()
        if err != nil {
            log.Fatal(err)
        }
        total += words
        fmt.Fprintf(out, "%s has %d words\n", entry, words)
    }

    // Write the total to stdout.
    fmt.Println("Total word count: ", total)
}

// countWords scans words from an input stream and counts them. The algorithm
// is simple and counts everything, including comment symbols, markdown
// heading markers, etc.
func countWords(r *bufio.Reader) (int, error) {
    scanner := bufio.NewScanner(r)
    scanner.Split(bufio.ScanWords)
    wordCount := 0
    for scanner.Scan() {
        wordCount++
    }
    if err := scanner.Err(); err != nil {
        if err == io.EOF {
            return wordCount, nil
        }
        return wordCount, fmt.Errorf("countWords: %w", err)
    }
    return wordCount, nil
}
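Since the code has no Bacalhau or WASM dependencies, you can sanity-check the counting logic locally before compiling. Here is a minimal test sketch (the file name main_test.go and the sample input are my own choices):

// main_test.go — a quick local sanity check for countWords.
package main

import (
    "bufio"
    "strings"
    "testing"
)

// TestCountWords feeds a known string into countWords and checks the count.
func TestCountWords(t *testing.T) {
    r := bufio.NewReader(strings.NewReader("one two three\nfour"))
    got, err := countWords(r)
    if err != nil {
        t.Fatal(err)
    }
    if got != 4 {
        t.Errorf("want 4 words, got %d", got)
    }
}

Run it with go test before building the WASM binary.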
Step 2: Compile the program to WASM with TinyGo
Now it gets more interesting. We want to compile the code to a WASM binary. To save space, I'll use TinyGo, the “Go compiler for small places”. TinyGo produces compact WebAssembly binaries, with a few restrictions on Go features.
With TinyGo, compiling Go code to WASM is as easy as calling
> tinygo build -o main.wasm -target=wasi main.go
The resulting main.wasm binary is ready to be sent to data nodes.
Troubleshooting: Go version mismatch
TinyGo might exit with this error message (the actual version number may differ):
error: requires go version 1.18 through 1.21, got go1.22
TinyGo is not in a hurry to catch up with the latest Go version. To fix this error, you need to install a Go version supported by TinyGo.
The easiest way is to download an older Go and set a local alias that is valid in the current shell:
> go install golang.org/dl/go1.21.7@latest
> go1.21.7 download
> alias go=go1.21.7
Then repeat the tinygo build command.
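Alternatively, if binary size is not a concern, you can sidestep the version dance entirely: standard Go 1.21 and later can compile to WASI directly. (A caveat: I haven't verified that Bacalhau's WASM executor accepts binaries built this way, and they are considerably larger than TinyGo's output.)

> GOOS=wasip1 GOARCH=wasm go build -o main.wasm main.go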
Step 3: Start a local Bacalhau server
This would be the time to set up a Bacalhau network of data nodes. I'll simulate this network by starting a single Bacalhau server locally.
> bacalhau serve \
--node-type requester,compute \
--allow-listed-local-paths '/path/to/local/directory/with/inputs' \
--web-ui
This server starts with some special settings:
- The node type is both requester and compute. Bacalhau distinguishes between requester nodes that orchestrate job requests and compute nodes that run the jobs. This minimal server setup combines both types into one hybrid node.
- As a security measure, a specific local path is whitelisted for serving as the input directory for jobs.
- Finally, we ask the server to run a web UI that lists active nodes and a history of job runs.
For my test, I created three files in the inputs directory, with the creative names file1.txt, file2.txt, and file3.txt.
A few seconds after starting, the server lists a few environment variables to set, so that the CLI command knows how to reach the server.
For convenience, the environment variables are also written to ~/.bacalhau/bacalhau.run. In a new shell, set these variables by running
> source ~/.bacalhau/bacalhau.run
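To verify that the CLI can now reach the server, a quick smoke test is the version subcommand, which, as far as I recall from the v1.x CLI, reports both the client and the server version:

> bacalhau version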
Now that the server is running, we can send our first job.
Step 4: Send the job to the “network”
In the shell where you sourced the Bacalhau environment settings, cd into the directory containing the WASM binary and call:
> bacalhau wasm run -i file:///path/to/input/directory:/inputs main.wasm
Here, /path/to/input/directory is the full path to the directory containing the input files. After the colon, specify the path as the job would see it. The Go code looks into /inputs, and so the actual input path is mapped to /inputs.
(Note that /inputs happens to be the default input directory, so unless your job binary expects a different input path, you can omit the :/inputs part from the -i parameter.)
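Another side note: the -i flag is not limited to file:// sources. Jobs can also read from URLs, S3 buckets, or IPFS (more on that below). Assuming the same source:target syntax applies, an IPFS input mapping would look like this, with <CID> as a placeholder for a real content identifier:

> bacalhau wasm run -i ipfs://<CID>:/inputs main.wasm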
If all goes well, the command should return an output similar to the following:
Job successfully submitted. Job ID: 113c9bc7-7f54-474d-bfe1-7839b32984cf
Checking job status... (Enter Ctrl+C to exit at any time, your job will continue running):
Communicating with the network ................ done ✅ 0.0s
Creating job for submission ................ done ✅ 0.5s
Job in progress ................ done ✅ 0.0s
To download the results, execute:
bacalhau get 113c9bc7-7f54-474d-bfe1-7839b32984cf
To get more details about the run, execute:
bacalhau describe 113c9bc7-7f54-474d-bfe1-7839b32984cf
Step 5: Collect the data
Each job run gets a unique ID assigned. To “download” the results (let's pretend the server runs on a remote node), run the provided get subcommand:
> bacalhau get dc3b187f-de18-4b38-a1e1-e2f20172d8ef
Fetching results of job 'dc3b187f-de18-4b38-a1e1-e2f20172d8ef'...
Results for job 'dc3b187f-de18-4b38-a1e1-e2f20172d8ef' have been written to...
/path/to/job/results/job-dc3b187f
Each job has a unique results directory that contains
- the output directory,
- and the files stdout, stderr, and exitCode.
Yes, Bacalhau collects everything that the job would return if it ran in a normal shell: file output, terminal output, and the exit code.
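For the word-count job, the downloaded results directory therefore looks like this:

job-dc3b187f/
├── exitCode
├── outputs/
│   └── count.txt
├── stderr
└── stdout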
Read the files to verify that the job did its job (apologies for the pun):
> cat job-dc3b187f/stdout
Total word count: 414
> cat job-dc3b187f/outputs/count.txt
file2.txt has 138 words
file3.txt has 207 words
file1.txt has 69 words
Summing up: Bacalhau is distributed computing without the orchestration overhead
A few commands were sufficient to set up a single-node Compute-over-Data network. With a few more commands, you can set up a real, distributed network of Bacalhau nodes and send them compute jobs.
You can be specific about the nodes to use, such as: “Send this job only to nodes that have a GPU”. Jobs can fetch data from local directories, URLs, S3 buckets, or IPFS networks, and send output to any of these locations except URLs. Nodes can join and leave the network, and the client will always know which nodes are available and which nodes to send a given job to.
Bacalhau is a powerful system for distributed data processing, yet easy to set up and use. And Go is the perfect language for compute job payloads. Go apps can live in small containers or WebAssembly modules, saving bandwidth when distributing jobs.
Maybe I'll set up a real Bacalhau cluster next.
Links
- Bacalhau: https://docs.bacalhau.org
- TinyGo: https://tinygo.org
Happy coding!
Cod photo by Ricardo Resende on Unsplash