How to Benchmark a FHIR Server: Methodology Notes

Most FHIR server benchmarks are unreliable. They run on different hardware, warm caches, hand-pick queries, and headline on percentile noise. Here is what fair looks like — median latency, ok-only filtering, transaction-Bundle ingest, and vendor-recommended config — and why every choice is the way it is.

mock.health · 11 min read · 2026-05-02


If you look for "FHIR server benchmark" today, you find two kinds of results: vendor-published comparisons where the vendor's server always wins, and academic papers from 2019 testing three servers on a tiny corpus. Neither is what a team making a real server-selection decision needs.

Last month I ran a benchmark of six open-source FHIR servers at 1K, 4K, 16K, and 64K patients. The full matrix is public, the methodology doc is public, the companion repo reproduces every number on your box. This post is about the methodology choices under that matrix — and the ones I changed once the data made it clear that the original choices were lying to me.

The three ways FHIR benchmarks lie

Before the methodology, the failure modes worth naming.

Warm cache. You load a server, run the workload immediately, and every query hits pages Postgres already has in RAM. Latency looks amazing. Production never sees that. The fix: cold-DB restart between checkpoints, and runtime-sampled parameter values drawn from the freshly loaded corpus so nothing is cached at the application layer.

Hand-picked queries. "Our server is fast at Patient?_id=123." Sure. That is not what a FHIR app does. The fix: a blended workload that fires every query shape a real app runs, uniformly at random, with runtime-sampled parameters. The matrix runs 30 query shapes covering seven FHIR routes (Patient, Observation, Condition, Procedure, Encounter, MedicationRequest, Metadata) and four search classes — see Search workload below.

Different hardware. Server A was tested on bare metal; Server B was tested in Docker with a memory limit. Server A wins. The fix: same box, same resource budget per container, same image digests. Every server runs in a container with a 32 GiB memory cap, 12 physical CPU cores via cpuset_cpus: "0-11,16-27" (cores 12–15 are reserved for the loader and OS so loader↔server SMT contention can't taint results), and a 100 GB SSD volume. The comparison is boring on purpose.

These three fixes are the price of entry. They are necessary and not sufficient. The interesting methodology choices are what you measure once the field is level — and the matrix changed two of those choices between rounds because the data showed me the original picks were wrong.

What I measure: three workloads, three headlines

| Profile | Measures | Headline |
|---|---|---|
| CRUD | Mixed C/R/U/D over five FHIR resource types (Patient / Observation / Condition / Encounter / MedicationRequest), 64 concurrent workers at steady state | p50 latency across all verbs |
| Search | 30 queries spanning seven routes and four classes, drawn uniformly at random | ok-only p50 latency across 2xx responses |
| Ingest | Transaction Bundle POST against the base URL, 32 workers | bundles per second (ok-only throughput) |

p95 and p99 are captured in every evidence row as tail evidence, but the headline — the number that colors the heatmap cell and drives the scaling curve — is the median for latency profiles, and ok-only throughput for ingest. This is the part of the methodology that changed.

Median, not p99

The first version of this matrix headlined on p99. It looked authoritative. The numbers were noise.

Each cell is a 2-minute run. A reliable tail quantile needs a lot of samples. Rule of thumb: a quantile q needs at least ~10/(1−q) samples to be stable, and ideally ten times that. For p99 that means at least ~1,000 samples and ideally ~10,000; for p95, ~200 and ideally ~2,000; for p50 (the median), ~20 and ideally ~200.
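As a rough illustration, the rule of thumb can be written as a small helper. This is a sketch, not code from the benchmark harness; the names `min_samples` and `safety` are hypothetical.

```python
import math


def min_samples(q: float, safety: float = 1.0) -> int:
    """Rule-of-thumb sample count for a stable estimate of quantile q.

    Implements ~10 / (1 - q); pass safety=10 for the "ideally ten times that" target.
    """
    if not 0.0 < q < 1.0:
        raise ValueError("q must be strictly between 0 and 1")
    # round() absorbs floating-point dust such as 999.9999999999991
    return math.ceil(round(safety * 10.0 / (1.0 - q), 6))


for label, q in (("p50", 0.50), ("p95", 0.95), ("p99", 0.99)):
    print(label, "minimum:", min_samples(q), "ideal:", min_samples(q, safety=10))
```

A cell that produced 30 samples clears the minimum for p50 (20) but falls far short of the minimum for p99 (1,000).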

For a fast cell — HAPI CRUD at 1K — 2 minutes easily produces tens of thousands of samples and any percentile is fine. But for a slow cell — Medplum search at 1K, where a single request can take multiple seconds — the same window produces only tens of samples. A p99 computed from 30 requests is essentially max(30 numbers): it moves wildly run-to-run. That is how you get the apparent anomaly where a server's p99 went down when the population grew 4×. It didn't. The estimator just got noisier.
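A small simulation makes the instability concrete. This is a sketch under stated assumptions: latencies are drawn from a lognormal distribution (a stand-in, not measured data), quantiles use the nearest-rank definition, and every function name is hypothetical.

```python
import math
import random
import statistics


def nearest_rank(samples, q):
    """Nearest-rank quantile: smallest sample with at least a fraction q of the data at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


def relative_spread(estimates):
    """Coefficient of variation: run-to-run standard deviation divided by the mean."""
    return statistics.stdev(estimates) / statistics.mean(estimates)


def simulate(samples_per_run=30, runs=200, seed=42):
    """Repeat a 30-sample 'cell' many times and compare how much p50 and p99 wander."""
    rng = random.Random(seed)
    p50s, p99s = [], []
    for _ in range(runs):
        latencies = [rng.lognormvariate(0.0, 1.0) for _ in range(samples_per_run)]
        p50s.append(nearest_rank(latencies, 0.50))
        p99s.append(nearest_rank(latencies, 0.99))  # with 30 samples this is the maximum
    return relative_spread(p50s), relative_spread(p99s)


p50_spread, p99_spread = simulate()
print(f"run-to-run spread: p50 {p50_spread:.2f}, p99 {p99_spread:.2f}")
```

With 30 samples the nearest-rank p99 is the largest value in the run, so its run-to-run spread comes out noticeably larger than the median's.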

We don't want to hide the tail, but we don't want the headline to be a number we can't honestly defend. Solution: headline on the median (stable even in small samples), and keep p95 and p99 in every evidence row as tail evidence — visible in the round JSON and the per-server detail pages, just not leading the story.

For the Search profile the headline uses the ok-only percentile stream: latencies across 2xx responses only. Including 4xx responses would reward vendors that fail fast on unsupported queries, which is the opposite of the signal we want. CRUD expects every op to succeed, so it headlines on the all-responses median. Ingest expects the same, so its per-Bundle latencies also cover all responses while its headline stays ok-only throughput; a 4xx ingest is a real failure, not a feature.
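The difference between the two percentile streams is easy to see on a toy sample. This is a minimal sketch; the `Sample` type, the function names, and the latency values are made up for illustration.

```python
import statistics
from typing import NamedTuple, Optional, Sequence


class Sample(NamedTuple):
    status: int        # HTTP status code of the response
    latency_ms: float  # wall-clock latency of the request


def all_responses_p50(samples: Sequence[Sample]) -> Optional[float]:
    """Median latency over every response, regardless of status."""
    latencies = [s.latency_ms for s in samples]
    return statistics.median(latencies) if latencies else None


def ok_only_p50(samples: Sequence[Sample]) -> Optional[float]:
    """Median latency over 2xx responses only."""
    latencies = [s.latency_ms for s in samples if 200 <= s.status < 300]
    return statistics.median(latencies) if latencies else None


# A server that rejects half the queries in 2 ms and answers the rest slowly.
samples = [Sample(400, 2.0)] * 3 + [Sample(200, 100.0), Sample(200, 120.0), Sample(200, 140.0)]
print(all_responses_p50(samples))  # 51.0
print(ok_only_p50(samples))        # 120.0
```

The fast rejections pull the all-responses median down to 51 ms, while the queries that were actually answered have a median of 120 ms.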

Ingest as transaction Bundles, not single resources

The original matrix headlined ingest on resources/sec from individual resource POSTs. That was wrong for the same reason hand-picked queries are wrong: it is not what real FHIR pipelines hit.

Transaction Bundles are the interactive write path that real pipelines use. HL7v2→FHIR conversion in particular turns each inbound HL7v2 message into one transaction Bundle of correlated mutations (Patient + Encounter + Observation + …) — bundles per second is what those pipelines actually feel. This benchmark was added at the suggestion of Ralph at haste.health, who asked for an open transaction-Bundle benchmark all vendors can measure against.
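For readers who have not built one by hand, here is a minimal sketch of the shape of a transaction Bundle: each entry carries a `urn:uuid:` fullUrl, the resource, and a `request` telling the server what to do, and references between entries point at those fullUrls. The resources are stripped-down R4-style illustrations, not Synthea output and not validated against any profile; the helper name is hypothetical.

```python
import json
import uuid


def transaction_bundle(resources_by_urn):
    """Wrap resources into a FHIR transaction Bundle, one POST entry per resource."""
    return {
        "resourceType": "Bundle",
        "type": "transaction",
        "entry": [
            {
                "fullUrl": urn,
                "resource": resource,
                "request": {"method": "POST", "url": resource["resourceType"]},
            }
            for urn, resource in resources_by_urn.items()
        ],
    }


patient_urn = f"urn:uuid:{uuid.uuid4()}"
encounter_urn = f"urn:uuid:{uuid.uuid4()}"
observation_urn = f"urn:uuid:{uuid.uuid4()}"

bundle = transaction_bundle({
    patient_urn: {"resourceType": "Patient", "gender": "female"},
    encounter_urn: {
        "resourceType": "Encounter",
        "status": "finished",
        "class": {"system": "http://terminology.hl7.org/CodeSystem/v3-ActCode", "code": "AMB"},
        "subject": {"reference": patient_urn},  # resolved by the server within the transaction
    },
    observation_urn: {
        "resourceType": "Observation",
        "status": "final",
        "code": {"text": "Heart rate"},
        "subject": {"reference": patient_urn},
        "encounter": {"reference": encounter_urn},
    },
})

print(json.dumps(bundle, indent=2)[:200])
```

POSTing this document to the server's base URL asks it to apply all three writes atomically and rewrite the `urn:uuid:` references to the ids it assigns.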

Why throughput, not latency, as the headline: Synthea bundles vary roughly 10× in entry count (50–500 entries), so per-bundle latency is partly a function of bundle size and not just server speed. Throughput — bundles applied per second under the loader's 32-worker pool — normalizes that. The per-Bundle p50/p95/p99 latencies remain available as tail evidence in each evidence row.
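The headline number itself is simple arithmetic. A sketch, assuming the loader records one HTTP status per submitted Bundle plus the wall-clock duration of the run; the function name and the figures below are illustrative, not results from the matrix.

```python
def ok_bundles_per_second(statuses, wall_clock_s):
    """Ok-only throughput: Bundles answered with a 2xx status, per second of wall-clock time."""
    if wall_clock_s <= 0:
        raise ValueError("wall_clock_s must be positive")
    ok = sum(1 for status in statuses if 200 <= status < 300)
    return ok / wall_clock_s


# Made-up run: 5,400 Bundles accepted and 60 rejected during a 120-second window.
statuses = [200] * 5400 + [400] * 60
print(ok_bundles_per_second(statuses, 120.0))  # 45.0
```

Rejected Bundles contribute nothing to the numerator, so a server cannot raise its headline by failing quickly.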

This is not a replacement for $import. $import is the spec-blessed bulk-load path for cold-start migrations and is correctly recommended for one-shot population. Transaction Bundles are the production write API. They are different workloads; the matrix publishes the production one because that is what an integration engineer feels every day.

Search workload: 30 queries, four classes

The search workload picks uniformly at random per request from a 30-query pool spanning seven FHIR routes and four search classes:

| Class | Count | Examples |
|---|---|---|
| SIMPLE: single-parameter token / string / date / reference | 18 | patient_by_gender, observation_by_code, condition_by_code, observations_for_patient, observation_recent (date range) |
| COMPLEX: multi-parameter, compound AND, _include / _revinclude | 2 | patient_by_gender_family, q1_uscore_observation_combo |
| FULL_TEXT: _content or :text modifier | 7 | patient_content_search, observation_content_search, condition_code_text_modifier, patient_name_text_modifier, patient_text_narrative |
| OPERATION: FHIR ops (/metadata, _history, $expand, $lookup, $export) | 1 | capability_statement |

Of the 30 load-pool queries, 13 use static parameter values; the other 17 are runtime-sampled — their {{placeholder}} values are drawn fresh per request from pools harvested against the target server at workload-start (patient ids, family / given names, condition / procedure / medication codes, practitioner and location ids). This measures cache-miss behavior on a live corpus rather than a hot 5-patient set. Per-query p50 / p95 / p99 are preserved in evidence[].per_verb[] and surfaced in the per-server pages.
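The sampling step can be sketched in a few lines. The query names come from the table above, but the URL templates, pool contents, and function names here are assumptions for illustration; the real harness may structure this differently.

```python
import random
import re
from urllib.parse import quote

PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")

# Hypothetical templates: one static query, two runtime-sampled ones.
QUERY_POOL = {
    "patient_by_gender": "Patient?gender=female",
    "observations_for_patient": "Observation?patient={{patient_id}}",
    "condition_by_code": "Condition?code={{condition_code}}",
}

# Hypothetical pools, standing in for values harvested from the target server at workload start.
VALUE_POOLS = {
    "patient_id": ["p-001", "p-002", "p-003"],
    "condition_code": ["44054006", "38341003"],
}


def render(template, pools, rng):
    """Replace each {{name}} with a URL-encoded value drawn fresh from pools[name]."""
    return PLACEHOLDER.sub(lambda m: quote(rng.choice(pools[m.group(1)]), safe=""), template)


def next_request(query_pool, pools, rng):
    """Pick a query shape uniformly at random, then fill in its placeholders."""
    name = rng.choice(sorted(query_pool))
    return name, render(query_pool[name], pools, rng)


rng = random.Random()
for _ in range(3):
    print(next_request(QUERY_POOL, VALUE_POOLS, rng))
```

Because the values are drawn per request, two consecutive `observations_for_patient` calls usually hit different patients, which is what keeps the application-layer cache from doing the server's work.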

Splitting search by class matters because the previous flat list mixed Lucene-required full-text queries with cheap token searches into one weighted average. Servers without a full-text path silently dragged the headline down; servers with Lucene silently bore the cost. With the 4-class taxonomy, the FULL_TEXT cell on a server without Lucene is the headline story, not a footnote.

Vendor-recommended configuration

Each server runs with the configuration its own vendor documents (or benchmarks against). Leaving a vendor on a default that their documentation explicitly recommends against would measure "did the image ship the right knob flipped?" rather than "how fast is the engine?" — so we flip the knobs the vendor tells us to. Every such knob is in the published methodology doc and in the compose file:

| Server | Setting | Source |
|---|---|---|
| MS FHIR | x-bundle-processing-logic: parallel request header on ingest | Azure FHIR best practices |
| Aidbox | BOX_FHIR_SEARCH_DEFAULT_PARAMS_TOTAL=none (disable implicit _total=accurate) | Health Samurai benchmark config |
| Aidbox | Full GIN index set on JSONB resource columns (Health Samurai's own benchmark indexes, applied via loadtest/aidbox_bootstrap.py) | Health Samurai initbundle.json |
| HAPI | Embedded Lucene full-text indexing (advanced_lucene_indexing=true, advanced_hsearch_indexing=true, lastn_enabled=true) | HAPI FHIR docs, Lucene/Elasticsearch |
| Spark | Write-path + read-path Mongo indexes (auto-created via bind-mounted spark-mongo-init/01-create-indexes.js) | Authored locally by matching Spark's query patterns; no vendor doc exists |

The HAPI Lucene knobs were added on 2026-04-30 after a shadow run showed _content and code:text queries returning HTTP 400 against vanilla HAPI. The methodology promised "Lucene-backed full-text"; the embedded backend (no Elasticsearch sidecar) delivers it. Cost: ~5–10% slower ingest from per-resource Lucene index writes. Both the knobs and their cost are documented, so the HAPI numbers are not "vanilla HAPI 8.8" but "HAPI 8.8 with the full-text knobs the docs recommend." Servers that don't ship Lucene (Medplum, Blaze, Spark) remain on stock full-text behavior. The FULL_TEXT class will surface that asymmetry in the cells, which is the point.

Server-specific index bootstrap (Aidbox, Spark)

Two servers require an operator to create the backing search indexes manually. Leaving them unindexed would measure "did the vendor ship indexes?" rather than "how fast is the engine once it is configured like an operator would run it in prod?"

Hardware

Captured per run in meta.json and shown on the round page:

Server stacks pinned via cpuset_cpus: "0-11,16-27" (12 physical cores per server). Cores 12–15 are reserved for the loader / OS / sampler so loader↔server SMT contention cannot taint results. Each server gets 12 CPU + 32 GiB RAM; its backing DB (where separate) gets 6 CPU + 16 GiB.
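To sanity-check a pin like this on your own box, it helps to expand the cpuset string and map logical CPUs back to physical cores. The sketch below assumes a 16-core, 32-thread host where logical CPU n and n+16 are SMT siblings; that numbering is an assumption inferred from the pin above, and on Linux it can be confirmed in /sys/devices/system/cpu/cpuN/topology/thread_siblings_list. The function names are hypothetical.

```python
def parse_cpuset(spec):
    """Expand a cpuset string such as "0-11,16-27" into a set of logical CPU ids."""
    cpus = set()
    for part in spec.split(","):
        low, _, high = part.strip().partition("-")
        cpus.update(range(int(low), int(high or low) + 1))
    return cpus


def physical_cores(cpus, sibling_offset):
    """Collapse logical CPUs onto physical cores, assuming siblings are n and n + sibling_offset."""
    return {cpu % sibling_offset for cpu in cpus}


server_cpus = parse_cpuset("0-11,16-27")
server_cores = physical_cores(server_cpus, sibling_offset=16)  # assumed topology
reserved_cores = set(range(12, 16))                            # loader / OS / sampler

print(len(server_cpus), "logical CPUs on", len(server_cores), "physical cores")
print("overlap with reserved cores:", server_cores & reserved_cores)
```

Under that assumed topology the pin covers 24 logical CPUs on 12 physical cores, and none of them shares a core with the reserved range.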

Size ladder

The ramp ladder steps through four checkpoints of cumulative patient count:

1,000 → 4,000 → 16,000 → 64,000

Between checkpoints the population grows incrementally (not wiped). At each checkpoint the warm server runs all three workloads. The resulting per-(server, profile, checkpoint) p50 series is what gets plotted on a log-log axis — a true power-law scaling shows up as a straight line.
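The slope of that line is the scaling exponent, and it can be estimated with an ordinary least-squares fit in log space. A minimal sketch: the function name is hypothetical and the data points are synthetic (generated from p50 = 2·√n), not numbers from the matrix.

```python
import math


def scaling_exponent(points):
    """Least-squares slope of log(p50) against log(patients).

    For p50 = c * n**k the points fall on a straight line with slope k.
    """
    xs = [math.log(n) for n, _ in points]
    ys = [math.log(p50) for _, p50 in points]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    return numerator / denominator


# Synthetic series on the 1K / 4K / 16K / 64K ladder: latency grows with the square root of n.
synthetic = [(n, 2.0 * math.sqrt(n)) for n in (1_000, 4_000, 16_000, 64_000)]
print(round(scaling_exponent(synthetic), 3))
```

An exponent near 0 means latency is flat as the corpus grows, near 1 means it grows linearly, and a series that bends away from a straight line on the log-log plot is not following a single power law at all.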

Not every cell is filled. Several servers have only completed the 1K checkpoint so far on Ingest; higher checkpoints show grey. This is deliberate honesty, not a bug.

The checkpoint spacing is intentional. 1K is the size a developer evaluates against in a sandbox. 4K is a dev environment. 16K is where the matrix starts to separate servers — the regime where unindexed search queries start seq-scanning and memory starts mattering. 64K is a small production corpus; large enough to expose scaling weaknesses, small enough that the full ramp fits in 12–16 hours.

Cold-plan caveat

30 seconds of warmup is not enough for cold-plan statistics on some Postgres-backed servers. If the ingest that populated the checkpoint just finished, ANALYZE/autovacuum may not have built good stats yet, and queries that plan differently with vs. without stats (classically _total=accurate, which forces a COUNT(*)) can take 100× longer on the first checkpoint than on later ones. In one round, Medplum's observation_search_total_accurate median dropped from 1,790 ms at 1K to 17 ms at 4K — not a scaling anomaly, just the planner catching up. The per-verb breakdown in evidence[].per_verb[] makes this visible at a glance.

What "fair" does and does not mean

Fair is defined narrowly here. Every server runs on the same box with the same resource budget. Every server ingests the same Synthea data generated with the same deterministic seed. Every server gets the same blended workload with the same 30 queries drawn uniformly at random. Every server has its image pinned by sha256 so the test is reproducible.

Fair includes running each server with the vendor-recommended configuration documented above. That is a deliberate move from the original methodology, which ran out-of-box defaults. Defaults a vendor explicitly recommends against measure shipping decisions, not engine capability. The default-config numbers (e.g., default-Aidbox at 94% search error) are still published, separately, on /performance so the operational reality is visible.

Anyone who wants to argue "you should have tuned us differently" is welcome to submit a re-run request with the tuning configuration they recommend. Re-runs happen quarterly. The original numbers stay in the round artifact for historical comparison.

Reproducibility is the whole point

The reason to publish a benchmark is not to settle an argument. It is to make the argument settleable. Every number in /performance links back to a file in the round artifact. Every file has a sha256 in MANIFEST.json. The entire artifact is in results/rounds/2026-q2-r000/.
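Checking an artifact against its manifest takes only the standard library. The sketch below assumes MANIFEST.json is a flat JSON object mapping each relative file path to its hex sha256; the actual manifest layout may differ, so treat the format and the function names as hypothetical.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hex sha256 of a file, read in chunks so large artifacts do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(round_dir: Path) -> list:
    """Return the manifest entries that are missing on disk or whose digest does not match."""
    # Assumed format: {"relative/path.json": "<hex sha256>", ...}
    manifest = json.loads((round_dir / "MANIFEST.json").read_text())
    return [
        relative
        for relative, expected in manifest.items()
        if not (round_dir / relative).is_file() or sha256_of(round_dir / relative) != expected
    ]


# Usage, from a checkout of the companion repo:
#   verify_manifest(Path("results/rounds/2026-q2-r000"))
# An empty list means every listed file matches its recorded digest.
```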

The full ramp takes 12–16 hours. The local reproducer:

```shell
cd ~/repo/fhir-server-compare
sudo bash scripts/setup-host.sh         # governor + THP + swappiness + ulimit
set -a; source .env; set +a             # Aidbox license, Medplum creds
make loadtest-ramp-50k                  # full ramp
make benchmark                          # parse JSONL → benchmark.json
make benchmark-publish                  # copy into ../fhir-studio/
```

The round JSON is deterministic given the same inputs.

What this methodology is for

Three audiences.

Teams picking a FHIR server. Open /performance, find the three queries that dominate your app's workload, and compare the p50 latency across servers at your target scale. The rank order for read-heavy apps is different from write-heavy apps; the rank order for CRUD-heavy apps is different from search-heavy apps. The matrix lets you find your ranking without running the benchmark yourself.

Teams running a FHIR server in production. Find your server's row, find the number at your corpus size, and sanity-check it against your observed production latency. If your production tail is 10× worse than the matrix number, something is wrong in your deployment — probably indexing, probably heap sizing, probably a config default. The matrix gives you a ceiling.

Vendors and open-source maintainers. The round artifact is public, the methodology is public, the harness is public. Disagree? Submit a re-run request. Run the benchmark against your next release and compare. This is the infrastructure that makes "our new version is 30% faster" a defensible claim rather than a marketing line.

For the pillar posts that interpret the numbers see fhir-server-compare (conformance) and fhir-server-performance (performance). The cell-level rules — ramp ladder, ok-only median, CPU pinning, warmup window, vendor knobs, index bootstraps — live at /performance/methodology. For the specific findings that surprised me most, see the multi-server search cliff and HAPI at 64K patients.

If you need realistic synthetic patient data to run your own benchmarks against any of these servers — longitudinal records with panel-coded observations, cross-resource references that actually resolve, Coverage and RelatedPerson for the full US Core stack — mock.health has a free tier.

