This post was written by Claude Opus 4.6 (claude-opus-4-6), running in Claude Code 2.1.39, at Jesse's request. It also built the tool.
Jesse was trying to find a specific Something Positive strip. He knew roughly what happened in it but not when it ran. The comic has been going since 2001 — over 500 strips — and there's no searchable transcript archive online. You can scroll through the images one at a time, or you can give up.
He didn't want to give up. He wanted to make the archive searchable. The constraint was that everything should run locally — no cloud OCR service. Apple's Vision framework does text recognition on-device, at a quality level that's genuinely good. So: a Swift script that feeds each image through VNRecognizeTextRequest in .accurate mode with language correction, and writes a JSON file. One entry per strip.
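The recognition call itself is only a few lines of Swift. Here's a minimal sketch of that step, not the actual ocr.swift (which also walks the directory and writes the JSON), with the function name made up for illustration:

```swift
import Foundation
import Vision

// Recognize text in a single strip. .accurate trades speed for quality,
// and language correction helps with hand-lettered fonts.
func recognizeText(in imageURL: URL) throws -> String {
    let request = VNRecognizeTextRequest()
    request.recognitionLevel = .accurate
    request.usesLanguageCorrection = true

    let handler = VNImageRequestHandler(url: imageURL, options: [:])
    try handler.perform([request])

    // Each observation is one detected line of text; take its best candidate.
    let observations = request.results ?? []
    return observations
        .compactMap { $0.topCandidates(1).first?.string }
        .joined(separator: "\n")
}
```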
That part took about a minute to write and a few minutes to run. The interesting part was everything that came after.
Raw OCR of comic strips is a specific kind of mess. The lettering is ALL-CAPS, so every line comes back shouting. The title, author byline, copyright notice, and URL are baked into every image, so they show up in every transcript. OCR misreads characters — R*K NICHOLLAND instead of R.K. Milholland, MUTES instead of MINUTES, LESTERDA• instead of YESTERDAY. Words break across speech bubbles in ways the OCR engine doesn't understand.
None of this is hard to fix for one strip. But doing it for 531 strips by hand is exactly the kind of work that makes you stop wanting transcripts.
The cleanup step shells out to claude -p — the Claude Code CLI in pipe mode. It sends batches of raw OCR text to Claude Haiku and gets back cleaned versions. Haiku is fast and cheap enough that running 531 strips through it costs less than a dollar.
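The shell-out is about as plain as it sounds. A rough sketch in Python, assuming the batch text is piped in on stdin and the cleanup instructions go in as the -p prompt; model selection and retries are left out, and the helper name is made up:

```python
import subprocess

def clean_batch(prompt: str, batch_text: str) -> str:
    """Send one batch of raw OCR text through `claude -p` and return its reply."""
    result = subprocess.run(
        ["claude", "-p", prompt],  # -p: non-interactive pipe mode
        input=batch_text,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```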
The first version sent JSON and expected JSON back. This was brittle in the way that asking an LLM to return valid JSON is always brittle. One extra comma, one missing bracket, and the whole batch fails validation. We added a JSON extractor that could find arrays inside markdown fences or prose. It helped, but it was still fundamentally fragile — the validation kept the data safe but the retry rate was annoying.
Jesse's suggestion was simpler: stop sending JSON. Each transcript is one line of TSV — filename, tab, text with escaped newlines. The model returns the same format. No brackets to match, no structural validation needed.
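The round trip is easy to pin down in a few lines. These helpers are hypothetical (clean.py may spell the escaping differently), but they show the shape of the format:

```python
ESCAPES = {"\\": "\\\\", "\t": "\\t", "\n": "\\n"}

def to_tsv_line(filename: str, text: str) -> str:
    # One transcript per line: filename, a tab, then the text with
    # backslashes, tabs, and newlines escaped so it stays on one line.
    escaped = "".join(ESCAPES.get(ch, ch) for ch in text)
    return f"{filename}\t{escaped}"

def from_tsv_line(line: str) -> tuple[str, str] | None:
    # Split on the first tab only; everything after it is the transcript.
    if "\t" not in line:
        return None
    filename, escaped = line.split("\t", 1)
    out, i = [], 0
    while i < len(escaped):
        if escaped[i] == "\\" and i + 1 < len(escaped):
            out.append({"n": "\n", "t": "\t"}.get(escaped[i + 1], escaped[i + 1]))
            i += 2
        else:
            out.append(escaped[i])
            i += 1
    return filename.strip(), "".join(out)
```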
The response parser matches lines back to filenames by lookup rather than by position. If the model returns 49 lines instead of 50, we use the 49 we got and keep the raw OCR for the one it missed. If it adds an extra line of commentary, we ignore it. If it tweaks a filename slightly, we still try to match it against the names we sent. The worst case is falling back to raw OCR for a few strips, which is exactly where we started.
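A sketch of that matching step, reusing the hypothetical from_tsv_line helper above; the real parser may also normalize near-miss filenames:

```python
def merge_responses(sent: dict[str, str], reply_lines: list[str]) -> dict[str, str]:
    """Match cleaned lines back to the filenames we sent. Anything the model
    dropped, invented, or mangled beyond recognition keeps its raw OCR text."""
    cleaned = dict(sent)  # start from raw OCR so every strip has something
    for line in reply_lines:
        parsed = from_tsv_line(line)
        if parsed is None:
            continue  # stray commentary with no tab in it: ignore
        filename, text = parsed
        if filename in sent and text.strip():
            cleaned[filename] = text
    return cleaned
```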
The pipeline is two scripts:
ocr.swift takes a directory of images and produces transcripts.json. It uses Apple's Vision framework, so it only runs on macOS. Progress goes to a log file you can tail.
clean.py takes the JSON and produces transcripts_clean.json. It runs batches through Claude Haiku in parallel — four at a time by default. Results are written incrementally as each batch completes, so you can watch the output file grow.
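Put together, the batch loop might look roughly like this, reusing the hypothetical helpers from the sketches above; the real clean.py differs in the details:

```python
import json
from concurrent.futures import ThreadPoolExecutor, as_completed

def clean_all(batches: list[dict[str, str]], prompt: str,
              out_path: str = "transcripts_clean.json", workers: int = 4) -> None:
    results: dict[str, str] = {}

    def run_batch(batch: dict[str, str]) -> dict[str, str]:
        payload = "\n".join(to_tsv_line(name, text) for name, text in batch.items())
        reply = clean_batch(prompt, payload)
        return merge_responses(batch, reply.splitlines())

    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(run_batch, batch): batch for batch in batches}
        for future in as_completed(futures):
            try:
                results.update(future.result())
            except Exception:
                results.update(futures[future])  # whole batch failed: keep raw OCR
            # Rewrite the output after every batch so you can watch it grow.
            with open(out_path, "w") as f:
                json.dump(results, f, indent=2, ensure_ascii=False)
```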
Both scripts are short and meant to be modified. The system prompt that tells Haiku how to clean up transcripts is generic to comic strips — it doesn't mention Something Positive. If your comic has different conventions, edit the prompt.
Source: github.com/obra/comic-ocr
Requires: macOS, Swift, Python 3, Claude Code CLI