Packing a site into one file

Bundle a whole crawl into one SQLite database or one ZIM archive with yomi pack, and keep it current with a resumable, incremental crawl.

yomi site writes a folder of Markdown files. yomi pack writes one file: a crawl of the whole site bundled into a single SQLite database, or a single ZIM archive you can open in Kiwix. Both formats are backed by the same kind of SQLite store, so a pack resumes where it left off, and a later run fetches only the pages it has not stored yet or, with --max-age, the ones that have gone stale.

# A SQLite database of the whole site (the default format)
yomi pack paulgraham.com -o pg.db

# A ZIM offline archive, browsable in Kiwix
yomi pack paulgraham.com -o pg.zim

The output extension picks the format: -o pg.zim builds a ZIM and -o pg.db builds a database, so you do not need to pass --format as well. With no -o, pack writes <host>.db (or <host>.zim when --format zim is set). An explicit --format always wins over the extension.
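The precedence between -o and --format can be written down as a small function. This is an illustrative sketch of the rules as described here, not yomi's own code: resolve_output is a hypothetical name, "sqlite" is an assumed label for the default format (only zim appears as a --format value above), and the fallback for an unrecognised extension is a guess.

```python
from pathlib import PurePosixPath

# Hypothetical labels for the two formats. Only "zim" is named in the text;
# "sqlite" is an assumed name for the default.
EXTENSION_FORMATS = {".db": "sqlite", ".zim": "zim"}
FORMAT_EXTENSIONS = {"sqlite": ".db", "zim": ".zim"}


def resolve_output(host, output=None, fmt=None):
    """Return (path, format) following the precedence rules described above."""
    if output is None:
        # No -o: write <host>.db, or <host>.zim when --format zim is set.
        chosen = fmt or "sqlite"
        return host + FORMAT_EXTENSIONS[chosen], chosen
    if fmt is not None:
        # An explicit --format always wins over the extension.
        return output, fmt
    # Otherwise the extension picks the format. Falling back to SQLite for an
    # unrecognised extension is an assumption, not documented behaviour.
    suffix = PurePosixPath(output).suffix.lower()
    return output, EXTENSION_FORMATS.get(suffix, "sqlite")


if __name__ == "__main__":
    print(resolve_output("paulgraham.com"))
    print(resolve_output("paulgraham.com", output="pg.zim"))
    print(resolve_output("paulgraham.com", output="pg.db", fmt="zim"))
```

The last call is the interesting one: the path keeps its .db extension, but the explicit format decides what gets written.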

SQLite: a site you can query

The default format is a SQLite database with clean, structured tables. Every page is a row in pages, and its links and images live in links and images tables that join back by page_id.

yomi pack paulgraham.com -o pg.db

The pages table carries one row per page: its url, title, byline, site_name, excerpt, lang, published, fetched, word_count, reading_time, depth, and the markdown body. So a query over the site is one line of SQL:

# The five longest essays
sqlite3 pg.db "select title, word_count, reading_time from pages order by word_count desc limit 5;"

# Every outbound link the author wrote, across the whole site
sqlite3 pg.db "select p.title, l.url from links l join pages p on p.id = l.page_id where l.internal = 0;"
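The same queries work from any language with a SQLite driver. Below is a minimal sketch using Python's standard sqlite3 module. The helper names are hypothetical, and demo_db builds a tiny in-memory stand-in with only the columns the queries touch, with assumed column types, so the example runs without a real pack.

```python
import sqlite3


def longest_pages(conn, limit=5):
    """Return (title, word_count) for the longest pages, longest first."""
    return conn.execute(
        "select title, word_count from pages order by word_count desc limit ?",
        (limit,),
    ).fetchall()


def outbound_links(conn):
    """Return (page title, link url) for every link not marked internal."""
    return conn.execute(
        "select p.title, l.url from links l "
        "join pages p on p.id = l.page_id "
        "where l.internal = 0 order by p.title, l.url"
    ).fetchall()


def demo_db():
    """Build a small in-memory stand-in; a real pack has more columns."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        create table pages (id integer primary key, url text, title text, word_count integer);
        create table links (page_id integer, url text, internal integer);
        insert into pages values (1, 'https://example.com/a', 'Short', 120);
        insert into pages values (2, 'https://example.com/b', 'Long', 4800);
        insert into links values (1, 'https://example.com/b', 1);
        insert into links values (2, 'https://example.org/ref', 0);
        """
    )
    return conn


if __name__ == "__main__":
    conn = demo_db()  # swap for sqlite3.connect("pg.db") to query a real pack
    print(longest_pages(conn))
    print(outbound_links(conn))
```

Passing the limit as a bound parameter rather than formatting it into the SQL string keeps the helper safe to call with user-supplied values.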

A meta table records the crawl itself: the seed, the host, when it was created and last updated, the page count, and the yomi version that built it.

ZIM: a site you can read offline

A ZIM archive is the format to reach for when you want to read the site offline. pack renders each page to a self-contained HTML document, rewires the in-scope links to point at the sibling entries, generates a contents page as the landing page, and writes one OpenZIM file.

yomi pack paulgraham.com -o pg.zim

Open the result in Kiwix on any device, or serve it over HTTP:

kiwix-serve --port 8080 pg.zim

A ZIM build keeps its SQLite store next to the archive as a sidecar (pg.db for pg.zim), so the next run is incremental too. Point --state somewhere else to keep the store apart from the archive.
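As a rough illustration of where the store ends up, the sidecar rule amounts to swapping the extension. The function below is a hypothetical sketch of that rule, not yomi's implementation; state_path and its arguments are names invented for the example.

```python
from pathlib import PurePosixPath


def state_path(output, state=None):
    """Return where the SQLite store lives for a given output file."""
    if state is not None:
        # --state puts the store wherever the caller asks.
        return state
    path = PurePosixPath(output)
    if path.suffix.lower() == ".zim":
        # A ZIM keeps its store as a sidecar next to it: pg.zim -> pg.db.
        return str(path.with_suffix(".db"))
    # For a SQLite pack the output file is itself the store.
    return output


if __name__ == "__main__":
    print(state_path("pg.zim"))
    print(state_path("pg.zim", state="/var/cache/pg-state.db"))
```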

The ZIM metadata flags set what a reader sees in Kiwix:

yomi pack paulgraham.com -o pg.zim \
  --title "Paul Graham's Essays" \
  --description "An offline archive of paulgraham.com" \
  --language eng \
  --date 2026-06-18 \
  --icon pg.png

--title defaults to the home page title, --language to eng, and --date to today. The archive carries the Title, Description, Creator, Publisher, Date and Counter metadata Kiwix shows in its library, plus a 48x48 icon for the book tile. yomi draws a built-in reading icon by default; pass --icon with a PNG to use the site's own logo instead. Pass --no-compress to store every entry raw, which makes a larger file that opens without decompression.

The crawl resumes

A pack is resumable because the database is the crawl's own backing store. Run pack again over the same output and it keeps every page already stored, fetching only pages it has not seen:

yomi pack paulgraham.com -o pg.db   # first run: reads the whole site
yomi pack paulgraham.com -o pg.db   # again: new 0, every page kept

The summary line reports new (pages fetched this run) and kept (pages already stored and skipped without re-fetching). On a settled site the second run reads nothing. If a run is interrupted, the pages it had already written stay in the store, so re-running it picks up the rest rather than starting over.
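The new and kept counts fall out naturally when the store doubles as the crawl's memory. Here is a minimal sketch of that idea in Python, under simplifying assumptions: a two-column pages table, a fixed list of URLs instead of link discovery, and a caller-supplied fetch function. None of these names come from yomi itself.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open a store with a minimal, assumed schema (url and body only)."""
    conn = sqlite3.connect(path)
    conn.execute(
        "create table if not exists pages (url text primary key, markdown text)"
    )
    return conn


def pack(conn, urls, fetch):
    """Fetch every URL not yet stored; return (new, kept) counts."""
    new = kept = 0
    for url in urls:
        stored = conn.execute(
            "select 1 from pages where url = ?", (url,)
        ).fetchone()
        if stored:
            kept += 1  # already in the store: skip without re-fetching
            continue
        body = fetch(url)
        conn.execute(
            "insert into pages (url, markdown) values (?, ?)", (url, body)
        )
        conn.commit()  # commit per page so an interrupted run keeps its progress
        new += 1
    return new, kept


if __name__ == "__main__":
    store = open_store()
    urls = ["https://example.com/", "https://example.com/a"]
    print(pack(store, urls, lambda u: "body of " + u))  # first run
    print(pack(store, urls, lambda u: "body of " + u))  # second run
```

Committing after each page is what makes an interrupted run cheap to resume in this sketch: whatever was written before the interruption is found by the next run's lookup and counted as kept.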

Keeping a pack current

Two flags drive a refresh.

--refresh re-fetches every page, ignoring what is stored. Reach for it when the whole site has changed and you want a clean rebuild:

yomi pack paulgraham.com -o pg.db --refresh

--max-age re-fetches only the pages older than a cutoff, leaving fresher ones untouched. A daily mirror stays current without reading the whole site each time:

yomi pack paulgraham.com -o pg.db --max-age 24h

Any page stored longer ago than the given duration is re-fetched; everything newer is kept. Without either flag, a stored page is never re-fetched.
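Put together, the per-page decision comes down to three checks. The function below sketches that decision as described here; should_fetch is a hypothetical name, and treating a page exactly at the cutoff as still fresh is an assumption.

```python
from datetime import datetime, timedelta, timezone


def should_fetch(fetched, now, refresh=False, max_age=None):
    """Decide whether a page needs fetching on this run.

    fetched is when the page was last stored, or None if it was never seen.
    """
    if fetched is None:
        return True  # never seen: always fetch
    if refresh:
        return True  # --refresh ignores what is stored
    if max_age is not None:
        # Strictly older than the cutoff is re-fetched (boundary is assumed).
        return now - fetched > max_age
    return False  # no flag: a stored page is never re-fetched


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    stale = now - timedelta(hours=30)
    print(should_fetch(stale, now, max_age=timedelta(hours=24)))
```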

Scope, limits, and politeness

pack takes the same scope and crawl controls as yomi site, and they mean the same thing:

# Just one section, two hundred pages at most, ignoring a subtree
yomi pack go.dev -o go.db --scope-prefix /doc --max-pages 200 --exclude /blog

# Pull in subdomains, eight workers
yomi pack example.com -o example.zim --subdomains --workers 8

--scope-prefix, --max-pages, --max-depth, --subdomains, --exclude, --workers, and --no-robots all behave exactly as they do for a folder crawl. A pack honours robots.txt by default.
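To make the scope flags concrete, here is one plausible reading of them as a predicate over URLs. It is a sketch under stated assumptions, not yomi's matching logic: in_scope is a hypothetical name, and both --scope-prefix and --exclude are assumed to be plain prefix matches on the URL path.

```python
from urllib.parse import urlsplit


def in_scope(url, host, scope_prefix="/", exclude=(), subdomains=False):
    """Return True if url falls inside the crawl scope (assumed semantics)."""
    parts = urlsplit(url)
    page_host = (parts.hostname or "").lower()
    if page_host != host:
        # Other hosts are only allowed when they are subdomains and the
        # subdomains option is on.
        if not (subdomains and page_host.endswith("." + host)):
            return False
    path = parts.path or "/"
    # Naive prefix match: "/doc" would also match "/docs". A real crawler
    # may well match on path segments instead.
    if not path.startswith(scope_prefix):
        return False
    return not any(path.startswith(prefix) for prefix in exclude)


if __name__ == "__main__":
    print(in_scope("https://go.dev/doc/effective_go", "go.dev", scope_prefix="/doc"))
    print(in_scope("https://go.dev/blog/intro", "go.dev", scope_prefix="/doc"))
```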

Which format

Reach for SQLite when you want to query the site, feed it to a tool, or keep a structured record you can diff and join. Reach for ZIM when you want to read the site offline in Kiwix on a phone, a laptop, or a server. Either way the crawl is the same, and the SQLite store is always there to resume from.