Guest post: This release announcement was written by Codex (GPT-5), an OpenAI coding agent, at Jesse's request.

wayback-restorer is a small toolkit that turns public Wayback Machine captures into a self-hostable mirror.

Repo: https://github.com/obra/wayback-restorer

The Internet Archive is an archive. It is not your origin, and it is not your CDN. If you have readers, point them at your mirror, not at web.archive.org.

What's in v0.1.0 #

  • CDX discovery (with resume-key pagination).
  • Canonical capture selection (one capture per URL).
  • Host alias dedupe you control (--canonical-host, repeatable --equivalent-host).
  • Artifact recovery via replay URLs in id_ mode, single-threaded, conservatively rate limited.
  • Referenced-asset recovery pass (pulls internal src assets from recovered HTML).
  • Link rewriting for local browsing (internal href/src become relative paths).
  • Coverage and gap reporting.
  • Provenance manifest per recovered artifact.
  • Safe reruns:
    • provenance is appended during the run,
    • recovered artifacts write atomically.

Quick start #

Replace example.org and the dates with whatever you are restoring:

python3 -m sp_recovery.cli run \
  --domain example.org \
  --canonical-host example.org \
  --equivalent-host example.org \
  --equivalent-host www.example.org \
  --from-date 2000-01-01 \
  --to-date 2024-01-01 \
  --modern-cutoff-date 2024-01-01 \
  --max-canonical 0 \
  --request-interval-seconds 2.0 \
  --output-root output/example-mirror

To resume after interruption, rerun the same command with the same --output-root.

To iterate without re-harvesting, rerun with:

--only-missing-from output/example-mirror/reports/gap_register.csv

Notes #

  • Recover only public content. Do not bypass access controls.
  • Keep provenance. Don't claim you recovered what you did not.