The contract behind every cell on /performance.
CRUD (C/R/U/D mix at steady state) and Search (five queries every server in the roster supports). Ingest is the setup tax paid to reach each checkpoint — not a published metric, because the vendor-recommended bulk path is $import, not transaction POSTs.
The patient population grows incrementally between checkpoints; the server is not wiped. p50 (median) latency at each checkpoint is the headline metric, plotted on log-log axes so scaling behavior is visible at a glance. p95 and p99 are captured in every evidence row as tail evidence, but aren't the headline — a 2-minute run is too sample-starved on slow cells for the tail to be stable.
Search uses only 2xx responses. Including fast 4xx rejections would reward vendors that fail fast on unsupported queries — the opposite signal.
Anyone can rerun a round locally with make benchmark. The runner, the queries, the schema, and the hardware config are all open source.