Python Type Checker Benchmarks

Reproducible, versioned wall-clock benchmarks for the major Python type checkers across a fixed set of public Python codebases.

Vertical slice in progress. The methodology is in its pre-freeze comment window with tool maintainers, and the harness currently runs a single tool-target pair (mypy on rich) to prove the pipeline end to end. The full matrix (six tools, five targets, cold and warm scenarios) ships once the methodology is frozen. See the methodology for what will be measured and why.

Latest cycle

Cycle 20260513T030824Z · methodology v1-draft-2026-05-12 · written 2026-05-13T03:08:52.422536+00:00

Tool	Target	Scenario	Median	IQR	N	Status
`mypy` mypy 2.1.0 (compiled: yes)	`rich` v15.0.0	warm	0.3494s	0.3356s – 0.3831s	30	OK
`mypy` mypy 2.1.0 (compiled: yes)	`rich` v15.0.0	cold	3.03s	2.4225s – 3.5548s	10	OK

What this is

Every benchmark cycle measures cold-cache and warm-cache wall-clock check times for the major Python type checkers against five public Python codebases (requests, rich, FastAPI, Django, pandas). Cycles run biweekly. Each cell's median and IQR are reported here; full per-run manifests live in R2 and are linked from each cell.
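As a rough illustration of how one cell could be produced, the sketch below times a command repeatedly and reduces the samples to a median and quartiles. It is not the project's harness: the function names `time_once` and `summarize`, the cache-deletion approach to a cold run, and the choice of the inclusive quartile method are all assumptions made for this example.

```python
import shutil
import statistics
import subprocess
import sys
import time


def time_once(cmd, cache_dir=None):
    """Return wall-clock seconds for one run of cmd (hypothetical helper).

    If cache_dir is given, it is deleted first to approximate a cold run.
    This is an assumed definition of "cold", not the project's.
    """
    if cache_dir is not None:
        shutil.rmtree(cache_dir, ignore_errors=True)
    start = time.perf_counter()
    # Type checkers exit non-zero when they report errors, so do not raise.
    subprocess.run(
        cmd,
        check=False,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return time.perf_counter() - start


def summarize(samples):
    """Reduce raw timings to the median and the quartiles bounding the IQR."""
    q1, _, q3 = statistics.quantiles(samples, n=4, method="inclusive")
    return {
        "median": statistics.median(samples),
        "q1": q1,
        "q3": q3,
        "n": len(samples),
    }


if __name__ == "__main__":
    # Placeholder command; a real run would invoke a type checker on a target.
    cmd = [sys.executable, "-c", "pass"]
    samples = [time_once(cmd) for _ in range(5)]
    print(summarize(samples))
```

Reporting the quartiles as a range, rather than a single IQR width, matches the layout of the results table above. Wall-clock timing with `time.perf_counter` includes process startup, which is usually what a user waiting on a check experiences.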

What this is not

A marketing comparison. The benchmarks do not measure false-positive rate, ergonomics, IDE responsiveness, or error-message quality. They are one input into tool selection.

Methodology

The methodology is versioned and frozen for six months at a time. Read it in full at /methodology. Tool maintainers are invited to comment before each freeze; their responses are recorded publicly.