# Methodology
Status: v1 draft, decisions locked. Becomes active when the first benchmark cycle launches and is frozen for six months from that date.
This document defines how the benchmarks on this site are produced. The goal is reproducibility and accountability. Anyone reading a result should be able to come back here and verify what the numbers do and do not represent.
## Goal
Provide a versioned, reproducible measurement of cold-cache and warm-cache wall-clock check times for major Python type checkers across a fixed set of public Python codebases. The benchmarks are one input into tool selection. They do not capture false-positive rate, ergonomics, IDE responsiveness, or error-message quality.
## What’s measured

- Wall-clock time for the type checker’s `check` command from invocation to exit, measured by `hyperfine` (warm scenario) or single-shot timing (cold scenario).
- Two scenarios per (tool, target) pair:
  - Cold cache: fresh container, no tool cache, no target build artifacts. One timed invocation per container, sampled across 10+ independent container starts.
  - Warm cache: prior run’s cache present, no source changes since that run.
- Tool exit code per run. The page cell shows “OK” if every timed run in the cell exited 0, “error” otherwise.
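For concreteness, a minimal sketch of the cold-scenario timing wrapper (the warm scenario delegates to `hyperfine`); the command shown is illustrative, not a pinned incantation:

```python
import subprocess
import time

def time_single_shot(cmd: list[str], cwd: str) -> tuple[float, int]:
    """One cold-cache sample: wall-clock seconds and exit code for a single
    invocation. Output is captured so terminal rendering doesn't skew timing."""
    start = time.monotonic()
    proc = subprocess.run(cmd, cwd=cwd, capture_output=True)
    return time.monotonic() - start, proc.returncode

# Illustrative only; the real (tool, target) commands are pinned in the manifest.
seconds, code = time_single_shot(["mypy", "--strict", "src/"], cwd="/targets/rich")
```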
## What’s not measured
- IDE / LSP responsiveness. Editor feedback latency depends on debouncing, partial reanalysis, and editor protocol implementation. It cannot be scripted in CI without becoming an apples-to-oranges comparison.
- Incremental check time after small source edits. Each tool defines “incremental” differently. Worth measuring eventually, but not in v1.
- Peak memory / RSS. Deferred to a later methodology version. `/usr/bin/time -v` wrapped around `hyperfine` is ambiguous about which process’s RSS is being captured, and a separate single-shot memory pass per cell doubles container cost. The v1 page omits memory rather than publishing a number that can’t be defended.
- Diagnostic counts. The number of diagnostics each tool reports varies for reasons unrelated to performance (rule-set breadth, strict-mode defaults, false-positive rate) and would be misread as a quality comparison if surfaced in the headline table. The exit code is the only diagnostic-related signal exposed.
- Memory or timing on Windows or macOS. Linux only in v1.
- Network-dependent steps. Package install, stub fetching, and module resolution against PyPI all run before the timed region.
## Hardware and environment
- Modal containers, machine class `cpu=4.0, memory=8192`, region `us-east`. The class is encoded in `benchmarks/modal_app.py` and changes go through the methodology-change process below.
- Each run snapshots `/proc/cpuinfo`, `lscpu`, the kernel version, and the resolved Modal image hash into the JSON manifest. Runs whose machine class drifts from the pin abort before the timed region rather than producing comparable-looking numbers from different hardware.
- CPython interpreter pinned per target in the manifest. The interpreter version moves independently of the target’s source tag because not every pinned target tag installs on the latest interpreter. Each target’s manifest records the exact `python --version` used.
- All tools and targets installed into the container’s image layer before timing begins.
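A sketch of the snapshot-and-guard step, with illustrative field names rather than the frozen manifest schema:

```python
import platform
import subprocess

# Mirrors the pin in benchmarks/modal_app.py; field names are illustrative.
PINNED = {"cpu": 4.0, "memory": 8192, "region": "us-east"}

def machine_snapshot() -> dict:
    """The environment facts written into each run's JSON manifest."""
    return {
        "cpuinfo": open("/proc/cpuinfo").read(),
        "lscpu": subprocess.run(["lscpu"], capture_output=True, text=True).stdout,
        "kernel": platform.release(),
        "python": platform.python_version(),
    }

def guard_machine_class(resolved: dict) -> None:
    """Abort before the timed region on drift. How `resolved` is obtained from
    Modal's runtime is elided here; the point is that drift fails loudly, early."""
    if resolved != PINNED:
        raise SystemExit(f"machine class drift: {resolved!r} != {PINNED!r}")
```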
## Run protocol

### Cold-cache scenario
1. Fresh container start.
2. Install tool and target with versions pinned per the manifest. Verify no tool cache exists.
3. Run the timed command once. Record wall-clock time and exit code.
4. Repeat steps 1-3 across at least 10 independent container starts per (tool, target) cell.
N=1 per container is required because every iteration within a single container is warm by definition: the first run primes the OS page cache, populates the tool’s on-disk cache, and resolves imports. Only the first invocation in a fresh container sees true cold paths, so cold-cache samples must come from container starts, not from `hyperfine --runs N` inside one container.
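A sketch of the cold sampling loop under that constraint, assuming a Modal function deployed from `benchmarks/modal_app.py`; `max_inputs=1` is an assumption about how container reuse is disabled, to be verified against current Modal docs:

```python
import modal

app = modal.App("typecheck-bench")  # app name illustrative

# max_inputs=1 is assumed to recycle the container after every call, so each
# sample sees a genuinely fresh start.
@app.function(cpu=4.0, memory=8192, max_inputs=1)
def run_cold_once(tool: str, target: str) -> dict:
    ...  # steps 1-3: install pinned versions, verify no cache, time one run

@app.local_entrypoint()
def cold_cell(tool: str, target: str, n: int = 10) -> None:
    # n containers, one timed invocation each -- never `--runs n` in one container.
    samples = list(run_cold_once.map([tool] * n, [target] * n))
    print(samples)
```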
### Warm-cache scenario
1. Fresh container start.
2. Install tool and target with versions pinned per the manifest.
3. Run the tool once to populate the cache. Discard the timing.
4. Run `hyperfine --warmup 3 --runs 10` on the timed command. Record per-iteration timings and exit code.
5. Repeat steps 1-4 across three independent container restarts.

Single-container repeats hide cache effects in tmpfs and the OS page cache, which is why multi-container sampling is the unit of trust rather than a single `hyperfine` run.
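A sketch of steps 3-4 using `hyperfine`’s JSON export; the output path is illustrative, and `exit_codes` availability in the export depends on the hyperfine version:

```python
import json
import subprocess

def warm_samples(check_cmd: str, cwd: str) -> tuple[list[float], list[int]]:
    """Steps 3-4: one discarded cache-priming run, then the timed hyperfine pass."""
    subprocess.run(check_cmd, shell=True, cwd=cwd, capture_output=True)  # prime cache
    subprocess.run(
        ["hyperfine", "--warmup", "3", "--runs", "10", "--ignore-failure",
         "--export-json", "/tmp/warm.json", check_cmd],
        cwd=cwd, check=True,
    )
    # --ignore-failure keeps timing alive when a checker exits non-zero on
    # diagnostics; the exit codes still feed the OK/error cell.
    result = json.load(open("/tmp/warm.json"))["results"][0]
    return result["times"], result.get("exit_codes", [])
```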
## Reporting
All samples from all containers in a (tool, target, scenario) cell are pooled into a single distribution. The page reports the median as the headline number and the IQR (25th to 75th percentile) as a band. Cells link to the cycle’s manifest in R2 for the full per-sample raw data.
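The headline statistics need nothing beyond the standard library; a sketch with invented sample data:

```python
from statistics import median, quantiles

def cell_summary(pooled: list[float]) -> dict:
    """Pool all samples from all containers in a cell; report median + IQR band."""
    q1, _, q3 = quantiles(pooled, n=4)  # 25th, 50th, 75th percentiles
    return {"median": median(pooled), "iqr": (q1, q3)}

print(cell_summary([1.91, 2.02, 1.98, 2.10, 1.95, 2.40, 2.01, 1.99, 2.05, 2.00]))
```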
- The page renders one table per scenario with median wall-clock time, IQR band, and an “OK / error” indicator for each (tool, target) pair.
- “Last successful run” timestamp shown prominently. The page banner turns red after 14 days without a fresh run (see the staleness sketch after this list).
- Each cell links to the JSON manifest for the underlying (tool, target, scenario, cycle) tuple. The manifest holds raw per-sample timings, exit codes, the machine snapshot, and the pinned tool, target, and interpreter versions. The table is a rendering of it.
- Methodology version pinned in the page footer. The page links back to this document.
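The staleness check behind the red banner reduces to a timestamp comparison; a minimal sketch, assuming a timezone-aware ISO-8601 timestamp from the per-cycle summary file (field name illustrative):

```python
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(days=14)

def banner_is_red(last_success_iso: str) -> bool:
    """True once no fresh run has landed within the 14-day threshold."""
    last_success = datetime.fromisoformat(last_success_iso)
    return datetime.now(timezone.utc) - last_success > STALE_AFTER

banner_is_red("2025-01-01T00:00:00+00:00")  # illustrative timestamp
```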
## Tool configurations
Each tool runs with two configurations:
- Defaults: no flags. What a user gets on first install.
- Strict: a named, exact incantation pinned per tool in the manifest. Initial set:

| Tool | Strict incantation |
|---|---|
| mypy | `mypy --strict` |
| Pyright | `pyright --strict` |
| basedpyright | `basedpyright --strict` |
| ty | TODO: to be confirmed with ty maintainers during the pre-freeze engagement window |
| pyrefly | TODO: to be confirmed with pyrefly maintainers during the pre-freeze engagement window |
| Zuban | TODO: to be confirmed with Zuban maintainers during the pre-freeze engagement window |
Each entry links to the tool’s documentation page for the flag. Maintainers are explicitly invited to ratify or correct their entry during the engagement window before each freeze. The entry used in a cycle is whatever the manifest records at cycle start.
Both columns are reported. Defaults favor tools that flag less; strict rewards conformance and breadth. Neither answers “which tool is best.”
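A hypothetical manifest fragment showing how the configuration entries might be pinned per cycle (the frozen schema may differ; the point is that each cycle records the exact strings it ran, so a cell stays reproducible even if a tool’s docs change later):

```python
# Hypothetical manifest fragment -- key names and shape are illustrative.
configurations = {
    "mypy": {"defaults": "mypy", "strict": "mypy --strict"},
    "pyright": {"defaults": "pyright", "strict": "pyright --strict"},
    "basedpyright": {"defaults": "basedpyright", "strict": "basedpyright --strict"},
    # ty, pyrefly, Zuban: strict entries pending maintainer ratification.
}
```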
## Targets
Five public Python codebases, pinned to specific tags. Each target’s CPython interpreter is pinned independently in the manifest.

| Target | Reason for inclusion |
|---|---|
| `requests` | Small and quick to check on every tool; sanity baseline. |
| `rich` | Mid-sized, well-typed; the existing benchmark target from prior posts. |
| `FastAPI` | Web framework; protocol-heavy code. |
| `django` | Large; metaclass and runtime-introspection patterns. |
| `pandas` | Dynamic-typing stress test; large, slow on every tool. |
The “latest stable” tag is resolved at the start of each benchmark cycle and pinned for the cycle. Adding or removing a target is a methodology change (see below). Targets rotate at most once per calendar year.
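A sketch of the cycle-start tag resolution, assuming tags follow an X.Y.Z scheme; the repo URL is illustrative, and real tag schemes vary per target:

```python
import re
import subprocess

def latest_stable_tag(repo_url: str) -> str:
    """Resolve 'latest stable' at cycle start, then pin it for the cycle.
    The X.Y.Z filter is a simplification; pre-release tags need per-target rules."""
    out = subprocess.run(
        ["git", "ls-remote", "--tags", repo_url],
        capture_output=True, text=True, check=True,
    ).stdout
    tags = re.findall(r"refs/tags/v?(\d+\.\d+\.\d+)$", out, flags=re.MULTILINE)
    return max(tags, key=lambda t: tuple(map(int, t.split("."))))

# Illustrative call; the canonical repo URLs live in the manifest.
print(latest_stable_tag("https://github.com/psf/requests"))
```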
## Tools included
ty, pyrefly, mypy, Pyright, basedpyright, Zuban. Every tool runs every cycle. Adding a tool requires it to install from PyPI and check a Python project from a single CLI invocation. Pre-release tools (version below 0.1.0) are not included until they ship a 0.1 release. The exact version pinned in any given cycle is recorded in that cycle’s manifest.
Removing a tool requires that it be archived or formally deprecated by its maintainers.
## Cycle cadence

Benchmarks run every two weeks. Tool releases across this set ship on roughly the same cadence, so each cycle picks up about one release per tool. A cycle that fails to complete leaves the previous cycle’s numbers visible on the page until the red-banner threshold (14 days) is reached.
## Manifest hosting
Per-run JSON manifests live in a Cloudflare R2 bucket, one object per (tool, target, scenario, cycle) tuple. Each manifest holds raw per-sample timings, the machine snapshot, pinned tool / target / interpreter versions, exit codes, and the methodology version under which it was produced.
A small per-cycle summary file is committed to the site’s git repo. It lists the manifest object keys, headline median + IQR per cell, and the tool / target versions for the cycle. This split keeps git history clean while preserving full provenance of every number.
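R2 speaks the S3 API, so the upload path can be plain `boto3`; a sketch with placeholder endpoint, credentials, bucket name, and key scheme:

```python
import json
import boto3

# R2 is S3-compatible; the account-specific endpoint and bucket are placeholders.
r2 = boto3.client(
    "s3",
    endpoint_url="https://<account-id>.r2.cloudflarestorage.com",
    aws_access_key_id="...",
    aws_secret_access_key="...",
)

def put_manifest(manifest: dict, tool: str, target: str, scenario: str, cycle: str) -> str:
    """One object per (tool, target, scenario, cycle) tuple; key scheme illustrative."""
    key = f"cycles/{cycle}/{tool}/{target}/{scenario}.json"
    r2.put_object(
        Bucket="benchmark-manifests",
        Key=key,
        Body=json.dumps(manifest).encode(),
        ContentType="application/json",
    )
    return key  # recorded in the per-cycle summary committed to git
```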
## Methodology changes
This methodology is frozen for six months at a time. Changes during a freeze period are limited to:
- Bug fixes that restore documented behavior.
- Adding a tool that meets the inclusion criteria.
- Removing a tool that has been archived or deprecated by its maintainers.
All other changes (machine class, target list, configuration choices, scenario definitions, run count) go into the next freeze period via a public PR.
## Engagement with tool maintainers
Before each freeze period begins, the proposed methodology is sent to the maintainers of every included tool for comment. The comment window stays open until seven consecutive days pass without a new objection, at which point the freeze proceeds. Maintainer responses are recorded under `docs/benchmark-methodology-feedback/`, and disagreements that don’t result in a methodology change are noted on the public page.
The benchmarks are not a marketing comparison. Their authority depends on tool teams trusting the methodology even when the numbers don’t favor them.
## Failure modes
When the benchmark breaks, the page must show that it’s broken rather than serving stale numbers as if they were fresh.
- Modal job fails: GitHub Action exits non-zero, “last successful run” timestamp does not advance, page shows a red banner after 14 days.
- Tool fails to install or check exits non-zero: cell shows “error” linked to the failed run’s logs, and the methodology document records the known incompatibility.
- Machine class drifts: run aborts before the timed region.
- Target’s pinned tag is no longer fetchable: cycle aborts and the on-call author is paged.
## Open questions for v1
These are decisions that still need to be made before the methodology can be marked active:
- CPython interpreter versions per target. The pin is per-target (see “Hardware and environment”), but the actual values need to be chosen and recorded in the manifest before the first cycle.
- Strict incantations for ty, pyrefly, and Zuban. The mypy / Pyright / basedpyright entries are known; the other three need maintainer ratification during the first engagement window.
- Whether `pandas` belongs in the canonical target set or moves to a separate stress-test tier. Some tools time out or require special config on pandas; isolating it may be more honest than letting one target dominate every column.