6 Open FHIR Servers Until They Break

Six FHIR servers. 64,000 Synthea patients. Here is what happened under loads far below what your average hospital sees.

the nighthawk · 13 min read · 2026-05-02


The conformance post showed that six open-source FHIR servers loaded with the same single patient will disagree about a surprising amount. This post is about what happens when you scale that same set up to sixty-four thousand patients and start hitting them with workloads.

The goal was a fair fight. Same box, same data generator, same workload definition, same resource budget per container, image digests pinned by sha256. Four checkpoints: 1K → 4K → 16K → 64K patients. At each checkpoint: load patients, then run a blended CRUD workload, a 30-query search workload, and a transaction-Bundle ingest workload. Measure throughput, median latency, error rate, CPU, and memory. Full methodology is at /performance/methodology; the per-server detail pages are at /performance.
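
To make that concrete, here is roughly what one workload phase of a checkpoint looks like as a driver loop — a minimal sketch, not the harness from the companion repo. The base URL, the run window, and the three queries standing in for the 30-query blended pool are all illustrative assumptions.

```python
# Minimal sketch of one workload phase: fire a fixed query mix at a FHIR base
# URL for the run window, then report p50, throughput, and error rate.
# BASE and QUERY_MIX are illustrative; the real driver draws queries from the
# loaded corpus and runs phases for CRUD, search, and ingest in turn.
import time
import statistics
import requests

BASE = "http://localhost:8080/fhir"   # assumption: server under test
QUERY_MIX = [                          # stand-ins for the blended search pool
    "/Patient?name=smith",
    "/Observation?code=85354-9&_sort=-date&_count=10",
    "/Condition?clinical-status=active",
]
RUN_WINDOW_S = 120

latencies_ms, errors, i = [], 0, 0
deadline = time.monotonic() + RUN_WINDOW_S
while time.monotonic() < deadline:
    url = BASE + QUERY_MIX[i % len(QUERY_MIX)]
    i += 1
    t0 = time.monotonic()
    try:
        ok = requests.get(url, timeout=60).status_code < 400
    except requests.RequestException:
        ok = False
    if ok:
        latencies_ms.append((time.monotonic() - t0) * 1000)
    else:
        errors += 1

total = len(latencies_ms) + errors
p50 = statistics.median(latencies_ms) if latencies_ms else float("nan")
print(f"p50={p50:.0f}ms  ops/s={total / RUN_WINDOW_S:.1f}  errors={errors / total:.1%}")
```

A single-threaded loop like this understates throughput; the real harness runs concurrent workers, but the bookkeeping is the same.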

Six servers: HAPI FHIR 8.8.0-1, Microsoft FHIR Server 4.0.728, Medplum 5.1.8, Aidbox 2603, Blaze 1.6.2, Spark 2.4.1-r4. Six findings.

The headline metric for CRUD and search is p50 (median) latency — see the methodology post for why p99 is unstable on a 2-minute run window. p95 and p99 ride along as tail evidence and surface in two of the stories below where the divergence matters. The ingest headline is bundles per second of successful transaction Bundle POSTs.
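
If you want to see the instability argument rather than take it on faith, here is a toy nearest-rank percentile calculation with made-up numbers: on a slow cell, the p99 of a short run rests on a handful of samples, so one outlier more or less swings it by seconds.

```python
# Toy illustration of why p99 is footnoted on short runs: with a few hundred
# samples, the p99 cell is decided by the top handful of observations.
def percentile(sorted_ms, q):
    # nearest-rank percentile over an already-sorted latency list
    idx = max(0, round(q / 100 * len(sorted_ms)) - 1)
    return sorted_ms[idx]

samples = sorted([12.0] * 300 + [9_500.0] * 5)   # 305 samples, 5 slow outliers (made up)
for q in (50, 95, 99):
    print(f"p{q} = {percentile(samples, q):.0f} ms")
# p50 and p95 sit at 12 ms; p99 lands on the outliers and would move by
# seconds if a couple of them had or had not happened during the window.
```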

Before the numbers, the punchline: every server I tested is good at something and broken at something else. The story is which workload profile matches which engine. The mismatches are loud.

Aidbox is the steadiest CRUD server. Aidbox is also the steadiest search-cliff.

Start here because it is the result that made me rerun the experiment to make sure I had not miscabled something.

Aidbox CRUD p50 latency — single-resource reads, writes, updates, deletes:

| N patients | CRUD p50 (ms) | CRUD ops/s |
| --- | --- | --- |
| 1,000 | 13 | 4,572 |
| 4,000 | 12 | 4,807 |
| 16,000 | 13 | 4,729 |
| 64,000 | 13 | 4,804 |

Rock-flat. Across a 64× corpus growth Aidbox's median CRUD latency does not move and throughput does not drop. The JSONB-on-Postgres design is doing what it is designed to do.

Aidbox search p50 latency — same server, same corpus, blended search workload drawn live from the loaded data:

| N patients | Search p50 (ms) | Search ops/s | Errors |
| --- | --- | --- | --- |
| 1,000 | 264 | 219 | 0% |
| 4,000 | 509 | 117 | 0% |
| 16,000 | 949 | 64 | 0% |
| 64,000 | 1,501 | 40 | 0% |

That is a 5.7× latency growth across a 64× corpus growth, and a 5.5× throughput collapse from 219 to 40 ops/sec. Zero errors throughout — but the median crosses the 1-second "amber" band at 16K and stays there. At 64K, half of all search requests take longer than a second.

Worth knowing: the Aidbox numbers above are measured with the vendor-recommended GIN-index bootstrap applied — Health Samurai's own benchmark indexes, created before ingest. Default-config Aidbox (which is what you get from docker compose up aidbox with no setup) collapses much harder: search p90 at 64K is reportedly ~56 seconds with 94% errors. The bootstrap makes search ≥1,000× faster. Both numbers are published on /performance. The number above is "Aidbox configured the way an operator would run it in prod"; the default-config number is "Aidbox configured the way you experience it on day one."
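
For readers who have not met the technique: the bootstrap amounts to creating Postgres indexes over the JSONB resource column before ingest. The sketch below shows the general shape only — the table and column names and the connection string are illustrative assumptions, and the actual index set is Health Samurai's published list referenced in the methodology, not these two statements.

```python
# Hedged sketch of a JSONB index bootstrap on a Postgres-backed FHIR store.
# Table/column names and the DSN are assumptions for illustration; the
# benchmark applied the vendor's published index set before ingest.
import psycopg2

DDL = [
    # GIN over the whole resource: accelerates containment (@>) style lookups
    "CREATE INDEX IF NOT EXISTS observation_resource_gin "
    "ON observation USING gin (resource jsonb_path_ops);",
    # Expression index over an extracted date: helps date-sorted shapes
    "CREATE INDEX IF NOT EXISTS observation_effective_idx "
    "ON observation ((resource #>> '{effectiveDateTime}'));",
]

conn = psycopg2.connect("postgresql://postgres:postgres@localhost:5432/aidbox")
with conn, conn.cursor() as cur:   # the connection context commits on exit
    for stmt in DDL:
        cur.execute(stmt)
conn.close()
```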

The cliff is still real even with the bootstrap. The queries in the blended workload include several that hit shapes Postgres still has to range-scan even with GIN indexes — observation_recent (most-recent observations per patient, joining date + patient) and reference-chain searches in particular. Median latency growth from 264ms to 1,501ms across the ramp is visible at every checkpoint.
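
To make the problem shape concrete, observation_recent is roughly "the newest Observation for each patient" — easy to express at the API, expensive to answer with a date-ordered scan per patient. A sketch of the request shape, with illustrative patient ids and base URL that are not from the benchmark pool:

```python
# Rough shape of the observation_recent query: newest Observation per patient.
# Patient ids and BASE are illustrative; the blended pool draws real ids from
# the loaded corpus.
import requests

BASE = "http://localhost:8080/fhir"
for patient_id in ["example-1", "example-2"]:
    bundle = requests.get(
        f"{BASE}/Observation",
        params={"patient": patient_id, "_sort": "-date", "_count": 1},
        timeout=60,
    ).json()
    print(patient_id, len(bundle.get("entry", [])))
```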

Takeaway for you: if you are building an agent that hammers FHIR search at load (clinical AI, RCM, patient access), put whatever server you are considering through a blended search workload on a 16K+ corpus before you commit. Aidbox shows up here as an excellent CRUD store with a search profile that needs operator-grade indexing and still degrades under scale. If your eval was "I wrote some queries, they ran fast" against a hot cache and a small corpus, you are missing the regime where the system actually runs.

Blaze ingests fastest, and quietly chokes search throughput at scale

Blaze is the ingest standout. At 16K patients it sustains 17.46 transaction Bundles per second — the only server in the matrix above 10 B/s at any checkpoint. At 64K it is still 14.30 B/s. No other server is remotely close on ingest.
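
For reference, the ingest metric counts POSTs of transaction Bundles like the one sketched below. The benchmark posts full Synthea patient bundles, which carry hundreds of resources; this two-resource bundle just shows the envelope, and the base URL is an assumption.

```python
# Minimal transaction Bundle POST -- the unit behind the bundles-per-second
# ingest metric. Real Synthea bundles are far larger; this one shows the
# envelope and an intra-bundle reference.
import uuid
import requests

BASE = "http://localhost:8080/fhir"   # assumption: server under test
patient_urn = f"urn:uuid:{uuid.uuid4()}"
bundle = {
    "resourceType": "Bundle",
    "type": "transaction",
    "entry": [
        {
            "fullUrl": patient_urn,
            "resource": {"resourceType": "Patient", "name": [{"family": "Example"}]},
            "request": {"method": "POST", "url": "Patient"},
        },
        {
            "resource": {
                "resourceType": "Observation",
                "status": "final",
                "code": {"coding": [{"system": "http://loinc.org", "code": "8867-4"}]},
                "subject": {"reference": patient_urn},   # resolved inside the transaction
            },
            "request": {"method": "POST", "url": "Observation"},
        },
    ],
}

resp = requests.post(BASE, json=bundle, timeout=60)
print(resp.status_code, resp.json().get("type"))   # expect 200, "transaction-response"
```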

CRUD on Blaze is fast — p50 latency is 0–1 ms across every checkpoint, throughput holds 570–585 ops/s. The median read-by-id is faster than every other server in the matrix, full stop.

Search is where the picture changes. The headline median stays excellent — 0–2 ms p50 across the ramp — but the per-checkpoint throughput and tail tell a different story:

| N patients | Search p50 (ms) | Search ops/s | Search p99 (ms) |
| --- | --- | --- | --- |
| 1,000 | 0 | 2,027 | 512 |
| 4,000 | 1 | 2,161 | 411 |
| 16,000 | 1 | 284 | 4,109 |
| 64,000 | 2 | 34 | 23,091 |

Throughput drops 60× from 2,027 to 34 ops/sec across the ramp. The p99 tail explodes from half a second to twenty-three seconds. The median is a great number that hides a much worse story underneath: the bulk of queries are fast, but a steadily growing fraction take an order of magnitude longer, and at 64K the throughput floor has collapsed to under 2% of the 1K rate.

The cause is architectural. Blaze is built on Datomic-style immutable time-travel storage — every resource version persists, every read pays a small walk of the history graph. It is a beautiful design for history queries (Patient/123/_history is cheap on Blaze), and it is fast on simple reads. But complex search at scale is paying for that immutable design in throughput.

Takeaway: Blaze is an excellent choice if your workload is ingest-heavy and your reads are simple-shaped (Patient/123, Observation/456). The matrix numbers back that up unambiguously. It is a poor choice for high-concurrency complex search at 64K-patient scale; the throughput cliff is real and the p99 tail proves it. If your eval of Blaze was "ingest was amazing, reads were instant," run it under load against a 16K+ corpus before committing.

Spark's search workload is disqualified at every checkpoint

Spark (https://github.com/FirelyTeam/spark) posts CRUD numbers that look great on paper — p50 latency 2–3 ms, 8,598 ops/s at 64K — better than every other server in the matrix on simple read-by-id throughput. The issue is what happens when you ask Spark anything more complex.

| N patients | Search p50 (ms) | Search ops/s | Errors |
| --- | --- | --- | --- |
| 1,000 | 312 | 16 | 10.2% |
| 4,000 | 886 | 9 | 14.7% |
| 16,000 | 2,218 | 6 | 15.9% |
| 64,000 | 3,186 | 8 | 13.3% |

A 10% error rate is the entry point. By 4K patients it is 15%. The median climbs from a third of a second at 1K to over three seconds at 64K. The throughput floor is in the single digits — Spark serves about 8 search ops/sec at 64K patients, against HAPI's 1,023.

The Spark numbers above are with the read-path Mongo bootstrap applied — 14 compound indexes auto-created on first MongoDB startup. Without them, every search is a full collection scan and p50 sits in the seven-second range at 1K patients. With them applied, certain queries (wildcard _revinclude, deep back-references) still time out at 60 seconds regardless of indexing. Spark's search cell remains honestly disqualified at 1K and beyond. The bootstrapped numbers are published next to the default-config numbers on /performance so operators can see both regimes.
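
The idea behind that bootstrap, in sketch form: compound MongoDB indexes created once so common search shapes stop being collection scans. The collection and field names below are illustrative, not Spark's actual schema — the real 14-index set is the one listed in the methodology post.

```python
# Sketch of a Mongo read-path bootstrap: compound indexes over the fields the
# search workload filters and sorts on. Names are illustrative, not Spark's
# real schema; the benchmark applied the 14-index set from the methodology.
from pymongo import MongoClient, ASCENDING, DESCENDING

client = MongoClient("mongodb://localhost:27017")   # assumption
resources = client["spark"]["resources"]            # hypothetical db/collection

resources.create_index([("resourceType", ASCENDING), ("id", ASCENDING)])
resources.create_index([("resourceType", ASCENDING),
                        ("patient", ASCENDING),
                        ("date", DESCENDING)])

print(sorted(resources.index_information()))
```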

The charitable reading is that Spark is a reference implementation — built to prove a spec, not to run a production search workload. The project's own roadmap notes it as a teaching tool. That is fine; that is a legitimate reason for a server to exist. But every web search for the phrase "open-source FHIR server" surfaces Spark, and I have talked to startups who picked it for the license and learned this the hard way.

Takeaway: if your evaluation is "does this scale," include a 16K-patient checkpoint with a blended search workload. Spark fails it. Several other servers wobble. The point of the matrix is that they wobble in different ways.

HAPI's median is great, the long tail is not

This is the section where the simple narrative — "HAPI is the boring winner, everyone is running it for a reason, the matrix proves it" — turns out to claim more than the numbers can support.

HAPI's blended search holds up beautifully across the ramp:

| N patients | Search p50 (ms) | Search ops/s | Errors |
| --- | --- | --- | --- |
| 1,000 | 26 | 1,567 | 0% |
| 4,000 | 27 | 1,561 | 0% |
| 16,000 | 31 | 1,484 | 0% |
| 64,000 | 35 | 1,023 | 0% |

Twenty-six milliseconds at 1K patients, thirty-five milliseconds at 64K, zero errors across the entire ramp, and throughput drops only 35%, from 1,567 to 1,023 ops/sec. On the search profile that drives most real FHIR apps, HAPI is the leader.

The CRUD picture is more complex. HAPI's CRUD p50 is 1–2 ms across every checkpoint — the median is excellent. But the p99 tail tells a different story:

| N patients | CRUD p50 (ms) | CRUD p99 (ms) | CRUD ops/s |
| --- | --- | --- | --- |
| 1,000 | 1 | 10,340 | 63 |
| 4,000 | 1 | 11,099 | 59 |
| 16,000 | 1 | 19,483 | 35 |
| 64,000 | 2 | 23,868 | 26 |

p99 is in the 10–24 second range across the entire ramp. Half the requests take a millisecond; the worst 1% take ten or more seconds. Whatever JPA write path is being exercised here has a tail that stays bad regardless of corpus size, and CRUD throughput is in the tens of ops/sec — orders of magnitude below Aidbox / Medplum / Spark on the same workload.

This is consistent with what large HAPI deployments report — the median write feels instant, but the long tail under contention is real and operators size for it. The methodology post explains why p99 is footnoted rather than headlined (a 2-minute run is not enough samples for stable p99 on slow cells), but in HAPI's case the p99 numbers are stable enough to publish: every checkpoint produces thousands of CRUD samples.

For SMART-on-FHIR patient access at the search profile, HAPI is excellent. For high-concurrency single-resource writes where p99 matters, the tail is something to plan for. The number to watch alongside the median is the per-verb breakdown at /performance/servers/hapi.

The HAPI numbers above are measured with the Lucene knobs the HAPI docs recommend for full-text support enabled (advanced_lucene_indexing=true, advanced_hsearch_indexing=true, lastn_enabled=true). Without them, vanilla HAPI returns HTTP 400 on _content and code:text queries. The cost is ~5–10% slower ingest. The knobs and their cost are documented in the methodology — the HAPI in this matrix is "HAPI 8.8 with the full-text knobs the docs recommend," not "vanilla HAPI."

There is also one specific HAPI gotcha — the _total=accurate query — that is dramatic enough to warrant its own post. See HAPI FHIR at 64,000 Patients.

Microsoft FHIR Server is steady on CRUD, slow and error-prone on search

Microsoft FHIR Server's CRUD line is the cleanest in the matrix:

| N patients | CRUD p50 (ms) | CRUD ops/s |
| --- | --- | --- |
| 1,000 | 8 | 1,171 |
| 4,000 | 8 | 1,195 |
| 16,000 | 8 | 1,207 |
| 64,000 | 7 | 1,165 |

Eight milliseconds, flat, across the ramp. ~1,170 ops/sec, also flat. If your workload is CRUD-shaped and you need predictability, MS FHIR is a reasonable pick.

Search is where it falls behind:

| N patients | Search p50 (ms) | Search ops/s | Errors |
| --- | --- | --- | --- |
| 1,000 | 27 | 116 | 6.7% |
| 4,000 | 41 | 123 | 6.7% |
| 16,000 | 35 | 114 | 6.8% |
| 64,000 | 286 | 55 | 6.8% |

The 6.7% error rate is constant across the entire ramp — not a scaling cliff but a baseline. A persistent fraction of queries fail regardless of corpus size, suggesting specific query shapes the server cannot handle. At 1K–16K the median is comparable to HAPI; at 64K it jumps 8× to 286 ms while throughput drops to 55 ops/sec.
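
A constant error baseline is easy to interrogate before committing: fire each query shape in your pool once and see which ones fail outright, independent of corpus size. The shapes below are illustrative stand-ins, not the benchmark's blended pool.

```python
# Quick probe for a constant error baseline: one request per query shape,
# printed with its status code. Shapes and BASE are illustrative assumptions.
import requests

BASE = "http://localhost:8080/fhir"
SHAPES = {
    "name search":  "/Patient?name=smith",
    "revinclude":   "/Patient?_id=example&_revinclude=Observation:patient",
    "chained":      "/Observation?subject.name=smith",
    "full text":    "/Condition?_content=diabetes",
}

for label, path in SHAPES.items():
    status = requests.get(BASE + path, timeout=60).status_code
    print(f"{label:12s} {status}")
# A shape that 4xxes here will 4xx at every corpus size: a baseline, not a cliff.
```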

The MS FHIR Server image being measured here is the OSS Docker container, not Azure Health Data Services. An Azure-hosted MS FHIR running against Azure SQL Hyperscale would look different. For the OSS evaluation, this is what you get.

Takeaway: If your architecture is Azure-native and you need HIPAA BAA coverage under Microsoft's compliance umbrella, MS FHIR running on the managed service is a reasonable pick for operational reasons. If you are evaluating it in Docker as a peer of HAPI, the constant 6.7% search error rate is the number to interrogate before committing.

Medplum is the fair middle path with a search-error caveat

Medplum's CRUD line is rock-solid: p50 7 ms at every checkpoint, 5,400–5,700 ops/sec, near-zero errors. On CRUD throughput it is in the top tier — second only to Spark.

Search is where the caveat lives:

| N patients | Search p50 (ms) | Search ops/s | Errors |
| --- | --- | --- | --- |
| 1,000 | 10 | 474 | 14.1% |
| 4,000 | 17 | 230 | 14.9% |
| 16,000 | 79 | 160 | 14.9% |
| 64,000 | 79 | 131 | 14.6% |

The median grows ~8× across the ramp (10 → 79 ms) and stabilizes — that is a respectable scaling curve. But Medplum returns errors on roughly one in seven search queries at every checkpoint. This is a constant-baseline behavior, not a scaling cliff: specific query shapes in the blended pool are not implemented, and the unsuccessful queries fail fast with 4xx rather than dragging the latency tail.

Per the methodology, the headline median is ok-only — the latency stream excludes the 14% of failed queries. So the 79 ms median is "Medplum is fast on the queries it can answer" while the error rate is "Medplum cannot answer roughly 14% of the queries in the blended pool." Both numbers matter.
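
In code terms, the two numbers come from two different streams — a toy illustration with made-up samples:

```python
# Toy illustration of the ok-only convention: p50 is computed over successful
# responses only, the error rate over everything. Sample values are made up.
import statistics

samples = [(79, 200)] * 86 + [(5, 400)] * 14   # (latency_ms, http_status)

ok_latencies = [ms for ms, status in samples if status < 400]
p50_ok = statistics.median(ok_latencies)
error_rate = sum(status >= 400 for _, status in samples) / len(samples)

print(f"ok-only p50 = {p50_ok} ms, error rate = {error_rate:.0%}")
# -> ok-only p50 = 79 ms, error rate = 14%: fast on what it answers, and the
#    refusals only show up if you read the error column too.
```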

Takeaway: Medplum's positioning — managed-HAPI-with-good-DX — is what the matrix shows. It is fast and steady on what it implements. Before committing, run a blended search workload on a corpus shaped like your data, look at the per-query error breakdown at /performance/servers/medplum, and verify the queries your app fires are in the implemented set.

How to read the matrix

The full per-server, per-query breakdown lives at /performance. Each server has a page at /performance/servers/{server} with every query's p50/p95/p99, error rate, and CPU/memory timeline as the workload ran. The methodology document at /performance/methodology describes the workload generator, the resource budget per container, the checkpoint reset protocol, the vendor-recommended configurations, and the index bootstraps for Aidbox and Spark.

The artifact at results/rounds/2026-q2-r000/benchmark.json has a sha256 in MANIFEST.json. Every number above is pulled directly from that JSON. Every chart on the page is generated from that JSON. If a number in the post does not match the page, trust the page.
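
If you want to check the artifact yourself before trusting a chart, the verification is one hash comparison. The sketch below assumes MANIFEST.json maps file names to sha256 strings — check the file for the actual layout.

```python
# Verify the published artifact against its manifest before trusting a number.
# Assumes MANIFEST.json maps file names to sha256 hex strings; adjust to the
# real schema if it differs.
import hashlib
import json
from pathlib import Path

round_dir = Path("results/rounds/2026-q2-r000")
manifest = json.loads((round_dir / "MANIFEST.json").read_text())
digest = hashlib.sha256((round_dir / "benchmark.json").read_bytes()).hexdigest()

assert digest == manifest["benchmark.json"], "artifact does not match manifest"
print("sha256 ok:", digest[:16], "…")
```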

What to do with all this

If you have a FHIR server in production, pull the three numbers that matter for your workload and see where your server lands: blended search ok-p50 at 16K patients, CRUD p50 at 16K patients, search error rate at 16K patients. 16K is the scale at which the matrix starts to separate the contenders from the "fails silently" group. If you are way off the median for your workload profile, it is worth a conversation.

If you are picking a server, run the ramp yourself. The companion repo at fhir-server-compare ships the load-test driver. The full ramp takes 12–16 hours elapsed. The smoke test takes minutes and already surfaces the Spark error baseline and the Medplum search-error baseline.

If you are building a clinical AI agent, the bottleneck is almost certainly search at load, not CRUD. That shifts the ranking: HAPI is the obvious first pick (35 ms p50 at 64K, zero errors); Medplum and Aidbox are reasonable seconds with the per-query caveats above; Blaze is a poor fit (throughput cliff); Spark is disqualified. MS FHIR is reasonable on Azure with the managed service.

If you are building patient access (SMART apps against logged-in patients, modest concurrency, lots of resource reads), HAPI and Medplum are both reasonable. The conformance matrix matters more for you than the performance matrix — see the conformance post.

If you are building a write-heavy ingest pipeline (HL7v2-to-FHIR conversion, batch uploads), Blaze is the standout — 14 B/s at 64K vs. ~2 B/s for everyone else. The read trade-off is real but predictable.

The thing I did not expect going into this, and the thing I think is the real finding, is how little correlation there is between a server's reputation and its performance on a fair benchmark. Every server in the matrix has devoted users. Every server has a workload profile on which it is the right pick. The problem is picking wrong and finding out six months into the build. Run the ramp.

If you need synthetic patient data that exercises every one of these regimes — longitudinal chronic patients that bloat a bundle, panel-coded observations that stress search, Coverage and RelatedPerson resources that trip conformance — mock.health is what we built for exactly this. Free tier, no sales call.

