Building a CLI Wrapper for Whisper: From Transcription to Distribution

If an existing tool is not straightforward to use, a software developer's natural reflex is to write a tailor-made CLI. Akash Joshi wrapped OpenAI's Whisper model into an intuitive CLI for convenient transcript generation.

We're going to learn how to wrap a local AI model using Go.

Have you ever tried a command line tool that is incredibly useful, but unwieldy? That was OpenAI's Whisper for me.

So I:

  • swapped the model for a quantized version,
  • wrote a Go program to chain an ffmpeg command before the model, and
  • distributed the Go program, called ‘better-whisper’, with my own Homebrew tap.

If you want to try better-whisper, just run:

brew tap akash-joshi/homebrew-akash-joshi
brew install better-whisper
better-whisper [whisper-cpp arguments] <input-file>

I'm Akash Joshi, a senior engineer who has been playing around with LLMs and Go over the past few months. This article is based on a conversation between Maurice Banerjee Palmer and me, transcribed by…better-whisper.

Why? What problem does Whisper solve?

I had a collection of YouTube videos and consulting call recordings that needed subtitles. I couldn't find any free tools to transcribe them, so I turned to Whisper, OpenAI's speech recognition model.

You can run Whisper by installing its Python package and using the CLI to transcribe local files.

But OpenAI's official package is extremely slow. Transcribing a 30-minute meeting could take me over an hour. Your results will vary depending on your CPU/GPU. But it's not fast, either way.

What are the alternatives to OpenAI's CLI?

Georgi Gerganov has ported OpenAI's models to whisper.cpp. It offers a range of benefits but the main one is that it runs dramatically faster than the originals without a noticeable loss in accuracy.

Whisper models range from tiny to large. The latest at the time of writing is large-v3-turbo. The tiny model is suitable for most tasks. In my tests, transcribing a 30-minute meeting using whisper.cpp with the tiny English model on a MacBook Air took only about 100-120 seconds.

I found the fastest route to get started is:

  • download ggml-tiny.en-q8_0.bin from HuggingFace
  • put it in ~/models/ggml-tiny.en-q8_0.bin
  • run brew install whisper-cpp to install whisper.cpp as a CLI locally
  • run whisper-cpp -m ~/models/ggml-tiny.en-q8_0.bin -osrt <input_file_path> to transcribe your file and save the result locally as an SRT subtitle file.
  • Note: if your media file isn't already a 16-bit WAV file, first run ffmpeg -i <input_file_path> -ar 16000 -ac 1 -c:a pcm_s16le output.wav and pass output.wav to the previous command.

What problems did you encounter with whisper.cpp?

We've sped Whisper up by using whisper.cpp. But whisper.cpp needs some preprocessing.

From the README:

Note that the main example currently runs only with 16-bit WAV files, so make sure to convert your input before running the tool. For example, you can use ffmpeg like this:

ffmpeg -i input.mp3 -ar 16000 -ac 1 -c:a pcm_s16le output.wav

Remembering to do this every time gets a bit unwieldy. So let's wrap them together in a Go CLI.

How do you write a Go CLI? Why?

I chose Go for its excellent developer experience and ability to produce cross-platform binaries.

A ‘Go CLI’ is simply a Go program that reads command line arguments.

To initialize your Go project, run go mod init followed by a module path of your choice.

Then your Go program is just a main.go with:

// every executable Go program starts with package main
package main

import "fmt"

// main is the entry point for the program
func main() {
	fmt.Println("Hello, World!")
}

Once you have your Go program, you need it to read command line arguments via os.Args:

// every executable Go program starts with package main
package main

import (
	"fmt"
	"os"
)

// main is the entry point for the program
func main() {
	fmt.Println("Hello, World!")
	fmt.Println(os.Args)
}
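
The os.Args slice holds the program name at index 0, followed by the user's arguments. Below is a minimal sketch of how a wrapper might separate pass-through flags from the input file, assuming the input file is always the last argument. The helper name splitArgs is hypothetical and not taken from better-whisper.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// splitArgs is a hypothetical helper. It treats the last argument as the
// input file and everything between the program name and it as flags.
func splitArgs(args []string) (flags []string, input string, err error) {
	if len(args) < 2 {
		return nil, "", errors.New("usage: <program> [flags] <input-file>")
	}
	input = args[len(args)-1]
	// Copy the flags into a fresh slice so that later appends
	// cannot overwrite the backing array of os.Args.
	flags = append([]string{}, args[1:len(args)-1]...)
	return flags, input, nil
}

func main() {
	flags, input, err := splitArgs(os.Args)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
	} else {
		fmt.Println("flags:", flags)
		fmt.Println("input:", input)
	}
}
```

Checking the length first avoids mistaking the program's own name for an input file when the user passes no arguments.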

Pulling it together:

package main

import (
	"fmt"
	"os"
	"os/exec"

	ffmpeg_go "github.com/u2takey/ffmpeg-go"
)

func convertToWav(filePath string) (string, error) {
	outputPath := fmt.Sprintf("%s_temp.wav", filePath)

	err := ffmpeg_go.Input(filePath).
		Output(outputPath, ffmpeg_go.KwArgs{"ar": 16000, "ac": 1, "c:a": "pcm_s16le"}).
		Run()
	if err != nil {
		return "", err
	}

	return outputPath, nil
}

func executeWhisper(args []string) error {
	cmd := exec.Command("whisper-cpp", args...)

	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr

	return cmd.Run()
}

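func main() {
	// Without at least one argument there is no input file to transcribe
	if len(os.Args) < 2 {
		fmt.Println("Usage: better-whisper [whisper-cpp arguments] <input-file>")
		return
	}

	// The last argument is the input file; the rest are passed through to whisper-cpp
	filePath := os.Args[len(os.Args)-1]

	outputPath, err := convertToWav(filePath)
	if err != nil {
		fmt.Println("Error converting file:", err)
		return
	}

	// Copy the pass-through arguments so that append does not overwrite os.Args
	args := append([]string{}, os.Args[1:len(os.Args)-1]...)
	args = append(args, outputPath)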
	whisperErr := executeWhisper(args)

	err = os.Remove(outputPath)
	if err != nil {
		fmt.Printf("Error deleting temporary file %s: %v\n", outputPath, err)
		// We don't return here, as the main operation (transcription) has already completed
	}

	if whisperErr != nil {
		fmt.Println(whisperErr)
	}
}

You can see the main module at the core of better-whisper here.
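
The program above prints any whisper-cpp error but still exits with status 0, so a calling shell script cannot tell that the transcription failed. Below is a sketch of how the child's exit status could be mapped to the wrapper's own exit code. The helpers exitCodeFor and runAndCode are hypothetical names introduced for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"os/exec"
)

// exitCodeFor is a hypothetical helper that maps the error returned by
// cmd.Run() to an exit code for the wrapper.
func exitCodeFor(err error) int {
	if err == nil {
		return 0
	}
	var exitErr *exec.ExitError
	if errors.As(err, &exitErr) {
		// The child ran but exited with a non-zero status.
		// ExitCode() is -1 if the child was killed by a signal.
		if code := exitErr.ExitCode(); code >= 0 {
			return code
		}
	}
	// The command could not be started, or ended abnormally.
	return 1
}

// runAndCode is a hypothetical helper that runs a command with its output
// attached to the terminal and returns the exit code to use.
func runAndCode(name string, args ...string) int {
	cmd := exec.Command(name, args...)
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	return exitCodeFor(cmd.Run())
}

func main() {
	code := runAndCode("whisper-cpp", os.Args[1:]...)
	fmt.Fprintln(os.Stderr, "whisper-cpp exit code:", code)
	// A real wrapper would delete its temporary WAV file here
	// and then finish with os.Exit(code).
}
```

Calling os.Exit only after cleanup matters because os.Exit terminates the program immediately and skips deferred functions.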

How does Go compare to TypeScript for CLI projects?

Go's simplicity is a significant advantage over TypeScript for CLI development.

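While TypeScript offers a more expressive type system, Go's constraints can lead to cleaner, more straightforward code. For example, Go has a single keyword for declaring types (type, where TypeScript has both type and interface) and a single "no value" (nil, where TypeScript has both null and undefined).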

How do you call external commands in Go?

To wrap whisper.cpp, I used the os/exec package to call external binaries:

package main

import (
	"fmt"
	"os"
	"os/exec"
)

func main() {
	// os.Args[0] is the wrapper's own name, so pass on only the arguments after it
	cmd := exec.Command("whisper-cpp", os.Args[1:]...)

	// show whisper-cpp's output in the user's terminal
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr

	err := cmd.Run()
	if err != nil {
		fmt.Println(err)
	}
}

The key line is:

cmd := exec.Command("whisper-cpp", os.Args[1:]...)

This starts the whisper-cpp binary directly as a child process, rather than passing a command string to a shell such as bash. Go's error handling is straightforward. Functions typically return an error as their last return value, which you check explicitly.

In the previous example, we check for err immediately after executing the command.

err := cmd.Run()
if err != nil {
	fmt.Println(err)
}
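
A wrapper that depends on external binaries can also check for them up front and give a friendlier message than a raw exec error. Below is a minimal sketch using exec.LookPath from the standard library. The helper name missingTools is hypothetical, and the sketch assumes both ffmpeg and whisper-cpp are expected on the user's PATH.

```go
package main

import (
	"fmt"
	"os/exec"
)

// missingTools is a hypothetical helper that returns the names of the
// given executables that cannot be found on PATH.
func missingTools(names ...string) []string {
	var missing []string
	for _, name := range names {
		if _, err := exec.LookPath(name); err != nil {
			missing = append(missing, name)
		}
	}
	return missing
}

func main() {
	if missing := missingTools("ffmpeg", "whisper-cpp"); len(missing) > 0 {
		fmt.Println("Please install the missing dependencies:", missing)
	} else {
		fmt.Println("All dependencies found.")
	}
}
```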

How did you use ffmpeg?

I used the ffmpeg-go module to provide bindings to ffmpeg. This allows the CLI to preprocess input files into the format required by whisper.cpp:

import (
...
	ffmpeg_go "github.com/u2takey/ffmpeg-go"
)
...
	err := ffmpeg_go.Input(filePath).
		Output(outputPath, ffmpeg_go.KwArgs{"ar": 16000, "ac": 1, "c:a": "pcm_s16le"}).
		Run()

The generated WAV file is passed to whisper.cpp and then deleted after transcription to manage temporary files efficiently.
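
For comparison, the same conversion can be sketched with only the standard library by invoking the ffmpeg binary through os/exec. This is an alternative to the ffmpeg-go approach above, not what better-whisper does. The helper name ffmpegArgs is hypothetical, and the flags are the ones from the whisper.cpp README.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// ffmpegArgs is a hypothetical helper that builds the arguments for
// converting any media file to 16 kHz, mono, 16-bit PCM WAV.
func ffmpegArgs(inputPath, outputPath string) []string {
	return []string{
		"-i", inputPath, // input file
		"-ar", "16000", // sample rate of 16 kHz
		"-ac", "1", // one audio channel (mono)
		"-c:a", "pcm_s16le", // 16-bit little-endian PCM
		outputPath,
	}
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: convert <input-file>")
	} else {
		input := os.Args[1]
		cmd := exec.Command("ffmpeg", ffmpegArgs(input, input+"_temp.wav")...)
		cmd.Stderr = os.Stderr
		if err := cmd.Run(); err != nil {
			fmt.Println("Error converting file:", err)
		}
	}
}
```

Either way, the ffmpeg executable still has to be installed on the user's machine.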

What are the options for releasing? Why Goreleaser?

I chose Goreleaser because it's a robust, well-tested tool with community support. It automates the release process for multiple platforms and package formats.

Why Homebrew? What are the challenges?

Homebrew makes it easy to release your software, even if it has dependencies.

If you're new to this, you'll first need to set up a Homebrew tap. A tap acts like an app store: it's a git repository that defines your formulae (i.e. apps/packages). There's an official core tap, but you can easily set up your own.

You can see my tap here.

How do you release Homebrew packages with Goreleaser?

I followed Gary Morse's guide to set up Goreleaser. You should also check Goreleaser's docs for setting up Homebrew. But in brief the process is:

  1. If you don't have one already, create a GitHub repo to act as your tap, ensuring you follow the conventions.
  2. Create a GitHub Personal Access Token for Goreleaser.
  3. Set your personal access token as an actions secret (do this in the repo where your code is, not your tap repo).
  4. Create a Goreleaser config file (.goreleaser.yml).
  5. Set up a GitHub Action to run when a new Git tag is pushed.
  6. When you create a new tag and push it to origin, the action will run Goreleaser, which in turn creates or updates the Homebrew formula in your tap repo. For example:
git tag -a v0.1.11 -m "Support multiple files"
git push origin v0.1.11
  7. Your users can then run brew tap akash-joshi/homebrew-akash-joshi and brew install better-whisper to install the latest version of the CLI.
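
The pushed tag also serves as the release version. If you keep Goreleaser's default ldflags, which to the best of my knowledge include -X main.version={{.Version}} (worth confirming in its docs), the binary can report that version itself. Below is a minimal sketch under that assumption. The versionString helper is hypothetical and not taken from better-whisper.

```go
package main

import "fmt"

// version is a placeholder for local builds. A release build can overwrite it
// at link time with: go build -ldflags "-X main.version=0.1.11"
var version = "dev"

// versionString is a hypothetical helper that formats the name and version.
func versionString(name string) string {
	return fmt.Sprintf("%s %s", name, version)
}

func main() {
	fmt.Println(versionString("better-whisper"))
}
```

A plain go build leaves the placeholder untouched, so local builds are easy to tell apart from released ones.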

Conclusion

What did you learn?

  1. Go is excellent for systems-level programming and cross-platform deployments
  2. Building CLIs and working with OS-level commands isn't as daunting as it may seem
  3. The difference between Node.js CLIs (which run on top of the user's Node installation) and Go CLIs (which package as standalone executables)

Where next?

I plan to explore using whisper.cpp's Go bindings instead of relying on an external dependency, which could reduce the installation complexity.

Akash Joshi is the founder of The Writing Dev, a software and AI consultancy. Akash is a self-taught programmer whose journey took him from playing video games to developing them. Having spent the last couple of years developing software and AI-powered solutions at trading firms, Meta and DeepL, he helps build apps that users love.

You can follow Akash on Twitter/X or subscribe to his newsletter at thewriting.dev for more content on AI, software development, and practical tutorials.