The FHIR Search Cliff Is Plural

Five of six open-source FHIR servers show some flavor of search-workload cliff at 64K patients. Aidbox climbs latency; Blaze collapses throughput; Spark fails 13% of queries; MS FHIR jumps 8× at the last checkpoint; Medplum errors at a 14% baseline. Only HAPI scales cleanly. Here is what to look for.

mock.health · 9 min read · 2026-05-02

If you are building a clinical AI agent, an RCM bot, or anything else that hammers FHIR at read-heavy load, the bug you cannot see coming is the one where your server performs beautifully in eval and then falls off a cliff in production.

I ran six open-source FHIR servers through a blended CRUD + search workload across four checkpoints: 1K, 4K, 16K, 64K Synthea patients. The full performance matrix tells the whole story. The finding worth pulling out is that the search cliff is plural. Five of the six servers in the matrix degrade meaningfully on the search workload as the corpus grows, in five different shapes. Only one — HAPI — scales cleanly. Knowing which shape your candidate server has is the difference between picking it and being surprised by it.

This is the post you read before you commit to one of them.

Five cliffs, one survivor

These numbers come from the ramp round, 1K → 64K patients. The headline metric is ok-only p50 search latency (the median over successful queries only, so failed queries are excluded); throughput (ops/sec) and error rate are the other two axes on which a cliff can show up:

Server	p50 1K → 64K	ops/s 1K → 64K	Errors
HAPI	26 → 35 ms	1,567 → 1,023	0% throughout
Aidbox	264 → 1,501 ms	219 → 40	0% throughout
Blaze	0 → 2 ms	2,027 → 34	0% throughout
Spark	312 → 3,186 ms	16 → 8	10–16% throughout
MS FHIR	27 → 286 ms	116 → 55	6.7% throughout
Medplum	10 → 79 ms	474 → 131	14% throughout

Each row is a different cliff. Worth understanding what each one means.

Aidbox — the latency cliff

Aidbox grows the median from 264 ms to 1,501 ms across the ramp — a 5.7× latency growth for a 64× corpus growth. Throughput drops from 219 to 40 ops/sec, a 5.5× collapse. Zero errors at any checkpoint. Aidbox is honest under load: it answers every query, slowly.
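The growth factors quoted in this post are plain ratios of the 1K and 64K columns in the headline table. A minimal sketch of that arithmetic follows; the dictionary layout and function names are illustrative, not part of the benchmark harness.

```python
# Headline numbers copied from the table above: (value at 1K, value at 64K).
HEADLINE = {
    "HAPI":    {"p50_ms": (26, 35),    "ops_s": (1567, 1023)},
    "Aidbox":  {"p50_ms": (264, 1501), "ops_s": (219, 40)},
    "Blaze":   {"p50_ms": (0, 2),      "ops_s": (2027, 34)},
    "Spark":   {"p50_ms": (312, 3186), "ops_s": (16, 8)},
    "MS FHIR": {"p50_ms": (27, 286),   "ops_s": (116, 55)},
    "Medplum": {"p50_ms": (10, 79),    "ops_s": (474, 131)},
}


def ratio(numerator, denominator):
    """Return numerator/denominator, or None when the denominator is 0."""
    return None if denominator == 0 else numerator / denominator


def cliff_factors(server):
    """Return (latency growth, throughput collapse) across the 1K -> 64K ramp."""
    p50_start, p50_end = HEADLINE[server]["p50_ms"]
    ops_start, ops_end = HEADLINE[server]["ops_s"]
    return ratio(p50_end, p50_start), ratio(ops_start, ops_end)


for name in HEADLINE:
    latency_growth, throughput_collapse = cliff_factors(name)
    print(name, latency_growth, throughput_collapse)
```

Blaze's latency ratio is undefined because its 1K median rounds to 0 ms, which is one more reason to read its row through throughput rather than the median.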

These numbers are with the vendor-recommended GIN-index bootstrap applied: Health Samurai's own benchmark indexes, installed before ingest. Default-config Aidbox (no operator setup) collapses much harder; published numbers show p90 at 64K around 56 seconds with 94% errors. The bootstrap makes search at least 1,000× faster and is documented in the methodology doc. Both regimes are published on /performance so operators can see what their day-one experience looks like versus what a tuned production deployment looks like.

The cliff persists even with the bootstrap because the workload includes shapes Postgres cannot fully cover with GIN indexes — observation_recent (most-recent observations per patient, joining date + patient), reference-chain searches, compound multi-parameter queries. By 16K patients the median has crossed the 1-second "amber" band and stays there.

Blaze — the throughput cliff

Blaze's median is the best-looking number in the table — 0 ms at 1K, 2 ms at 64K — and that is the trap. The headline median hides what is happening underneath:

N patients	Search p50 (ms)	Search ops/s	Search p99 (ms)
1,000	0	2,027	512
4,000	1	2,161	411
16,000	1	4,109	4,109
64,000	2	34	23,091

Throughput drops 60×, from 2,027 to 34 ops/sec across the ramp, while p99 explodes from half a second to twenty-three seconds. The bulk of queries stay fast (the median barely moves), but a steadily growing fraction take at least an order of magnitude longer, and throughput under concurrent load collapses. By 64K, Blaze can serve 34 search ops/sec, less than 2% of its 1K rate.
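To see how a median can stay flat while throughput collapses, here is a small sketch using synthetic latencies (invented for illustration, not measured values). It assumes a closed-loop load generator with a fixed number of workers, in which case throughput is roughly workers divided by mean latency (Little's law); whether the benchmark harness works that way is an assumption.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def closed_loop_ops_per_s(latencies_ms, workers):
    """Approximate throughput of `workers` clients that each issue one request at a time."""
    mean_s = sum(latencies_ms) / len(latencies_ms) / 1000
    return workers / mean_s


# Synthetic samples: every query takes 2 ms, versus 3% of queries taking 20 s.
healthy = [2] * 100
degraded = [2] * 97 + [20_000] * 3

for label, sample in (("healthy", healthy), ("degraded", degraded)):
    print(label,
          "p50:", percentile(sample, 50),
          "p99:", percentile(sample, 99),
          "ops/s:", closed_loop_ops_per_s(sample, workers=8))
```

The median is identical in both samples. The mean, which is what the workers actually spend their time on, is dominated by the slow 3%, so throughput falls by more than two orders of magnitude.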

The cause is Blaze's Datomic-style immutable storage. History queries are cheap; complex search at scale pays for the immutable design in throughput. If your app's read pattern is mostly simple reads (Patient/123), Blaze's CRUD is the fastest in the matrix. If your app fires complex searches at high concurrency against a 64K-patient corpus, the throughput floor is below what most apps can tolerate.

Spark — the error-rate cliff

Spark's latency line is bad — p50 climbs from 312 ms to 3,186 ms across the ramp — but the worse number is the error rate:

N patients	Search errors
1,000	10.2%
4,000	14.7%
16,000	15.9%
64,000	13.3%

A 10–16% error rate at every checkpoint, including 1K. This is not a scaling cliff in the Aidbox sense; it is a baseline failure mode. Spark cannot answer roughly one in eight search queries in the blended pool, and certain queries (wildcard _revinclude, deep back-references) time out at 60 seconds even with the read-path Mongo bootstrap applied.

Throughput is very low: 16 ops/sec at 1K, 8 ops/sec at 64K. By comparison, HAPI serves 1,023 ops/sec at 64K. The matrix disqualifies Spark's search cell at every checkpoint from 1K up; the bootstrapped numbers are published next to the default-config numbers so operators can see both.

MS FHIR — the late-checkpoint cliff plus a 6.7% error baseline

Microsoft FHIR Server is the most stable in the matrix on most metrics — until the last checkpoint:

N patients	Search p50 (ms)	Search ops/s	Errors
1,000	27	116	6.7%
4,000	41	123	6.7%
16,000	35	114	6.8%
64,000	286	55	6.8%

At 1K–16K the median is in HAPI's neighborhood (27–41 ms). At 64K it jumps 8× over the 16K value, to 286 ms, while throughput drops by half (114 → 55 ops/sec). The error rate is flat at 6.7–6.8%: a baseline the server posts at every scale, not a scaling cliff. If you are testing at 4K patients, MS FHIR looks fine. If you are running at 64K, it has crossed the amber band.
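A late-checkpoint cliff is easiest to spot by comparing each checkpoint to the one before it rather than the first to the last. The sketch below does that for the p50 columns in this post; the 3× threshold is an arbitrary illustrative choice, not a number from the methodology doc.

```python
CHECKPOINTS = [1_000, 4_000, 16_000, 64_000]
MS_FHIR_P50_MS = [27, 41, 35, 286]   # from the MS FHIR table above
HAPI_P50_MS = [26, 27, 31, 35]       # from the HAPI table


def step_ratios(series):
    """Ratio of each value to the one at the previous checkpoint."""
    return [later / earlier for earlier, later in zip(series, series[1:])]


def first_cliff(checkpoints, series, threshold=3.0):
    """Return the first checkpoint whose value grew more than `threshold`x
    over the previous checkpoint, or None if no step is that steep."""
    for n_patients, step in zip(checkpoints[1:], step_ratios(series)):
        if step > threshold:
            return n_patients
    return None


print(step_ratios(MS_FHIR_P50_MS))
print(first_cliff(CHECKPOINTS, MS_FHIR_P50_MS))
print(first_cliff(CHECKPOINTS, HAPI_P50_MS))
```

An evaluation that stops at 4K or 16K never reaches the step where the MS FHIR ratio spikes, which is the practical argument for running the full ramp.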

The OSS Docker container is what is being measured here. An Azure-hosted MS FHIR running against Azure SQL Hyperscale would look different. For the open-source evaluation, this is what you get.

Medplum — the constant 14% error baseline

Medplum's latency scales reasonably — p50 climbs from 10 ms to 79 ms across the ramp, an 8× growth that holds steady from 16K onward. Throughput drops 3.6× (474 → 131 ops/sec). Both are within "amber-but-livable" territory.

The number to interrogate is the error rate: 14% across every checkpoint. The headline median is ok-only — it excludes the failed queries entirely — so the 79 ms p50 reads as "Medplum is fast on the queries it can answer" while the 14% errors read as "Medplum cannot answer roughly one in seven queries in the blended pool, regardless of corpus size." Both numbers matter for an app that does not get to pick its query shapes in advance.
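The relationship between an ok-only median and an error rate can be sketched in a few lines. The result format here, a list of (ok, latency_ms) pairs, is a hypothetical one chosen for illustration; the sample values are synthetic and only shaped to echo Medplum's 64K numbers.

```python
from statistics import median


def summarize_search(results):
    """results: list of (ok, latency_ms) tuples, one per search request.

    Returns the ok-only p50 (failed requests excluded) and the error rate.
    """
    ok_latencies = [latency for ok, latency in results if ok]
    failures = len(results) - len(ok_latencies)
    return {
        "ok_only_p50_ms": median(ok_latencies) if ok_latencies else None,
        "error_rate": failures / len(results),
    }


# Synthetic sample: 86 successful searches at 79 ms, 14 fast-failing 4xx responses.
sample = [(True, 79)] * 86 + [(False, 5)] * 14
print(summarize_search(sample))
```

The two numbers are computed from disjoint sets of requests, so neither one can stand in for the other when reading a results table.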

This is a baseline behavior, not a cliff. Specific query shapes in the pool fail fast with a 4xx because the operation is not implemented. Run the per-query breakdown at /performance/servers/medplum and look at which queries fail; if the failures are queries your app does not fire, Medplum is fine. If the failures intersect your app's workload, plan for a different server or a different query design.

HAPI — the comparison case

For context on what "no cliff" looks like, here is HAPI's row:

N patients	Search p50 (ms)	Search ops/s	Errors
1,000	26	1,567	0%
4,000	27	1,561	0%
16,000	31	1,484	0%
64,000	35	1,023	0%

35 ms median at 64× corpus growth. Throughput drops 35%. Zero errors throughout. This is what a server scaling cleanly on the search profile looks like, and HAPI is the only one in the matrix that does it.

(HAPI has its own caveat — CRUD p99 is in the 10–24 second range across the ramp, which is a different story covered in HAPI FHIR at 64,000 Patients.)

Why this matters if you are building clinical AI

An agent doing prior auth, coding, denial review, or chart extraction fires dozens of FHIR search queries per decision. The queries are not single-resource reads — those are the things CRUD numbers measure. The queries are things like:

"Give me all observations for this patient in the last 90 days"
"Give me all encounters for this patient at this facility type"
"Give me all medication requests for this patient that are not yet dispensed"

Every one of those is a search, not a read. Every one of those is exactly the shape that hits a cliff on the wrong server.
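As a concrete instance, the first question above maps onto a FHIR search roughly as follows. This is a sketch: the `patient`, `date`, `_sort`, and `_count` parameters are standard FHIR search parameters, but the function name, the page size, and the choice to sort newest-first are illustrative assumptions, not queries taken from the benchmark pool.

```python
from datetime import date, timedelta
from urllib.parse import urlencode


def recent_observations_query(patient_id, today, days=90, page_size=100):
    """Build a relative search URL for a patient's observations in the last `days` days."""
    since = today - timedelta(days=days)
    params = [
        ("patient", patient_id),
        ("date", f"ge{since.isoformat()}"),  # "ge" prefix: on or after this date
        ("_sort", "-date"),                  # newest first
        ("_count", str(page_size)),          # page size hint to the server
    ]
    return "Observation?" + urlencode(params)


print(recent_observations_query("123", today=date(2026, 5, 2)))
```

Unlike a read of Patient/123, this request makes the server filter by patient and date, sort, and page, which is the work that separates the servers in the matrix.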

The failure mode in production is cruel. The agent works fine in dev against 100 test patients. It works fine in staging against 1K patients. It works fine in prod for the first week while usage ramps. Then the corpus grows, specific queries start falling off, and you see tail latency creep up from 200ms to 2s to 20s to timeout — exactly in the queries that drive the most-used feature. By the time the on-call alert fires, the agent has already returned wrong answers because it interpreted a timeout as "no data found."
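One defensive pattern is to make the agent's FHIR client distinguish "the server said there is nothing" from "the server did not answer." The sketch below is a minimal illustration under assumptions: the `SearchOutcome` type and `interpret_search` function are hypothetical names, and the handling of Bundle entries is simplified (it does not separate matches from included resources or OperationOutcome entries).

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SearchOutcome:
    kind: str                      # "found", "empty", or "unknown"
    resources: list = field(default_factory=list)


def interpret_search(status_code: Optional[int], body: Optional[dict]) -> SearchOutcome:
    """Classify a search response.

    `status_code` is None when the request timed out or the connection failed.
    Only a 200 response carrying a Bundle is treated as an answer.
    """
    if status_code != 200:
        return SearchOutcome("unknown")
    if not body or body.get("resourceType") != "Bundle":
        return SearchOutcome("unknown")
    resources = [e["resource"] for e in body.get("entry", []) if "resource" in e]
    return SearchOutcome("found" if resources else "empty", resources)


# A timeout must never be reported to the agent as "no data found".
print(interpret_search(None, None).kind)
print(interpret_search(200, {"resourceType": "Bundle", "type": "searchset"}).kind)
```

With this split, the agent can retry, degrade, or escalate on "unknown" instead of reasoning over an empty result it never actually received.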

If you are building this, you care about blended search behavior at 16K+ patients on three axes — median latency, throughput, error rate — not CRUD latency at 1K. The performance matrix has per-query breakdowns per server so you can pull the three queries that match your agent's dominant shape and look at them across the ramp.

What to check before you commit to a server

The short version: any FHIR server evaluation that does not include a blended search workload on a 16K-patient corpus, measured on three axes (p50 latency, throughput, error rate), is giving you the wrong answer.

The reason is that the two cheapest things to test — single-resource read-by-id, single-record ingest — are the two that least resemble your workload. Agents do searches. Searches are what separate the matrix.
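A three-axis check can be written down as a simple gate. In the sketch below, the 1-second latency limit is the "amber" band mentioned earlier; the 100 ops/sec throughput floor and the 1% error ceiling are placeholder assumptions you would replace with your own requirements. The rows are the 64K values from the tables in this post.

```python
# (p50 ms, ops/sec, error %) at 64K patients, copied from the tables above.
AT_64K = {
    "HAPI":    (35, 1023, 0.0),
    "Aidbox":  (1501, 40, 0.0),
    "Blaze":   (2, 34, 0.0),
    "Spark":   (3186, 8, 13.3),
    "MS FHIR": (286, 55, 6.8),
    "Medplum": (79, 131, 14.0),
}


def gate_failures(p50_ms, ops_per_s, error_pct,
                  max_p50_ms=1000, min_ops_per_s=100, max_error_pct=1.0):
    """Return the list of axes on which a server misses the (assumed) thresholds."""
    failures = []
    if p50_ms >= max_p50_ms:
        failures.append("latency")
    if ops_per_s < min_ops_per_s:
        failures.append("throughput")
    if error_pct > max_error_pct:
        failures.append("errors")
    return failures


for server, row in AT_64K.items():
    print(server, gate_failures(*row) or "passes")
```

Checking any single axis lets at least one server through that the other two would reject, which is why the gate needs all three.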

If you are picking a new server, the honest rank for a read-heavy search workload is:

1. HAPI: 35 ms p50 at 64K, 0% errors, 1,023 ops/sec. The search-profile leader. (Caveat: CRUD p99 tail.)
2. Medplum: fast on what it implements, but plan for the constant 14% search-error baseline.
3. Aidbox: operator-grade indexing brings it from "completely broken" to "amber and predictable," but the median crosses 1 second by 16K and stays there.
4. MS FHIR: fine at 4K, problematic at 64K (8× latency jump + 6.7% error baseline).
5. Blaze: fast median, throughput cliff at scale. Pick it for ingest, not concurrent search.
6. Spark: disqualified at every checkpoint by the 10–16% error baseline.

The numbers are reproducible. The companion repo runs the ramp on your box. Every chart on /performance is generated from results/rounds/2026-q2-r000/benchmark.json; the file has a sha256 in MANIFEST.json.
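Checking a downloaded results file against a published digest takes only the standard library. The sketch below deliberately does not assume anything about the layout of MANIFEST.json; it takes the expected hex digest as an argument, and the function names are illustrative.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Stream the file in chunks so large result files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_digest(path, expected_hex):
    """Compare the file's SHA-256 against an expected hex string (case-insensitive)."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

Usage would look like `matches_digest("results/rounds/2026-q2-r000/benchmark.json", expected)`, with `expected` copied by hand from the manifest entry for that file.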

A note on fairness

Five of these servers are serious, well-built FHIR servers used in production by real companies. The teams have been public and responsive about indexing tradeoffs and tuning workflows. Nothing in this post should be read as "do not use server X." It should be read as: each server has a search-profile shape you need to see before you pick it.

The Aidbox and Spark numbers in particular are with the vendor-recommended index bootstraps applied, per the methodology doc. The default-config numbers are catastrophically worse for both — Aidbox at 94% search errors at 64K with no indexes, Spark at full COLLSCAN. Both regimes are published. The bootstrapped numbers are "the engine's real capability"; the default numbers are "what you experience on day one." Both matter; they answer different questions.

The broader point is that "fast" is a workload-dependent adjective for FHIR servers. The matrix exists because vendors cannot give you an answer that is not colored by what they sell. Run the ramp. The cliff your candidate server has, if any, will show up.

If you need synthetic patient data that actually exercises search — panel-coded observations, deep longitudinal histories, Coverage and RelatedPerson for complex reference chains — that is what we built mock.health for. Free tier, no sales call. For the conformance side of the same matrix, see the companion post.

HAPI FHIR at 64,000 Patients: A Median-vs-Tail Story — I loaded 64,000 Synthea patients into HAPI and ran a production-shaped workload against it. Search median climbed from 26ms to 35ms with zero errors. CRUD median was rock-flat at 1ms — but the p99 tail was ten seconds. Here is what scales cleanly and what the tail tells you.
FHIR, USCDI, and US Core: What They Are, How They Fit — FHIR is HL7's standard for exchanging health data over RESTful APIs. USCDI defines what data; US Core defines how to format it. Here's how the three fit.
Your Clinical AI Agent Needs More Than 5 Patients — Your prior auth agent works in testing. Then it meets a 68-year-old with CKD, hypertension, and a specialist referral — and crashes.
