Same FHIR Specification, Different Answers
I loaded the same Synthea patient into six open-source FHIR servers and ran the same conformance probes against each.
the nighthawk · 11 min read · 2026-05-02
Recently I wrote a comparison of HAPI and the GCP Healthcare API, and the most common reaction I got was "what about the other ones?" Fair. So I ran the experiment at scale.
I took a single Synthea patient — 171 resources, one complete longitudinal record — and loaded it into six open-source FHIR R4 servers, each in its own Docker container, each pinned by sha256 digest so the whole thing is byte-for-byte reproducible:
| Server | Version | License |
|---|---|---|
| HAPI FHIR | 8.8.0-1 | Apache-2.0 |
| Microsoft FHIR Server | 4.0.728 | MIT |
| Medplum | 5.1.8 | Apache-2.0 |
| Aidbox | 2603 (dev tier) | Proprietary, free dev license |
| Blaze | 1.6.2 | Apache-2.0 |
| Spark | 2.4.1-r4 | BSD-3 |
Then I ran a conformance probe suite against each server: 23 TestScript-driven checks covering the FHIR R4 base spec (silent-ignore behavior, search semantics, terminology operations) plus 6 checks for the Bulk Data Access IG v2 kickoff surface. The full matrix lives at /conformance as a heatmap with one row per check and one column per server, each cell pass / fail / N/A with a spec citation. The companion repo that reproduces every cell is open at fhir-server-compare.
What I want to do in this post is walk through the findings that surprised me. If you are about to pick a FHIR server, or have already picked one and are starting to regret it, these are the things I did not know I should be looking for.
The silent-ignore will eat your filter
This is the one I will open with because it is the one that kept me up.
I sent each server this query:
GET Observation?this-is-not-a-real-param=garbage&_count=1
A spec-compliant server looks at this-is-not-a-real-param, finds it is not in the CapabilityStatement for Observation, and returns a non-200 status with an OperationOutcome describing the error. A misbehaving server silently drops the unknown parameter and returns HTTP 200 with the full unfiltered result set, as if you had never passed a filter at all.
Half the servers do the right thing. Half do not.
Per the conformance check:
- HAPI — pass. Rejects the unknown parameter.
- Microsoft FHIR Server — fail. Returns 200 with the unfiltered bundle. Silent-ignore.
- Medplum — pass. Rejects with an error body.
- Aidbox — pass. Rejects.
- Blaze — fail. Returns 200. Silent-ignore.
- Spark — fail. Returns 200. Silent-ignore.
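The pass/fail distinction above boils down to one classification: did the server reject the bogus parameter, or return 200 with a Bundle as if nothing were wrong? A minimal sketch of that probe logic (the function name and shape are mine; the real checks in the repo are TestScript-driven):

```python
def classify_unknown_param_handling(status_code: int, body: dict) -> str:
    """Classify a server's response to a search containing a bogus parameter.

    A strict server rejects the request outright; a lenient server returns
    200 and quietly drops the parameter. Sketch of the probe logic only.
    """
    if status_code >= 400:
        return "strict"  # rejected the unknown parameter: pass
    if body.get("resourceType") == "OperationOutcome":
        return "strict"  # some servers signal the error in the body
    return "silent-ignore"  # 200 with an unfiltered Bundle: fail

# A lenient server's answer to the garbage-param query:
lenient = classify_unknown_param_handling(
    200, {"resourceType": "Bundle", "total": 900})

# A strict server's answer:
strict = classify_unknown_param_handling(
    400, {"resourceType": "OperationOutcome"})
```

Sending one deliberately-broken query like this at deploy time is a cheap canary: it tells you which camp your server is in before a real typo does.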
Three out of six servers will, by default, lie to your code when you typo a search parameter. The failure mode is exactly the one you cannot test around: your code asks for Patient?familyname=Smith (the real parameter is family, no "name"), the server returns every patient it has, and your UI displays the unfiltered list as if it had been filtered. A clinician opens a patient search expecting to see one patient and sees nine hundred. If they trust the filter, they click into the wrong record.
The spec is not silent on this. FHIR R4 §3.1.1.4 says "a server SHALL ... return an error when the client has specified a search parameter that the server does not support" — but the word "SHALL" is not universally obeyed, and the R4 text elsewhere allows handling=lenient as an explicit opt-in to silent-ignore. The problem is that the three permissive servers default to lenient without requiring the client to ask for it. Clients can override the behavior per request with a Prefer: handling=strict header; HAPI's matrix entry is "pass" because its server-side default has been hardened to strict.
Worth flagging: the prior round of this matrix had different membership in the silent-ignore camp — HAPI used to default to lenient, Blaze used to default to strict. Both flipped between rounds. This is a behavior that changes with releases and configuration; do not bake it into your code by name. Either set Prefer: handling=strict on every request, or validate every search parameter name against the CapabilityStatement of the target server before you trust a result.
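The second mitigation above — validating parameter names against the server's CapabilityStatement — can be done client-side with a few lines. A sketch, assuming a parsed CapabilityStatement as a dict; the function name and the whitelist of special parameters are mine:

```python
def unknown_search_params(capability: dict, resource_type: str,
                          params: dict) -> set:
    """Return the query parameters NOT declared for resource_type in the
    server's CapabilityStatement. Common framework parameters are
    whitelisted separately since servers rarely declare them per-resource.
    """
    SPECIAL = {"_count", "_total", "_sort", "_include", "_revinclude",
               "_summary", "_elements", "_id", "_lastUpdated"}
    declared = set()
    for rest in capability.get("rest", []):
        for res in rest.get("resource", []):
            if res.get("type") == resource_type:
                declared |= {p["name"] for p in res.get("searchParam", [])}
    return {p for p in params if p not in declared and p not in SPECIAL}

# The familyname typo from the example above gets caught before the
# request ever leaves your client:
capability = {"rest": [{"resource": [{"type": "Patient",
              "searchParam": [{"name": "family"}, {"name": "given"}]}]}]}
bad = unknown_search_params(capability, "Patient",
                            {"familyname": "Smith", "_count": "1"})
```

Reject the request locally when the returned set is non-empty, and the lenient-server failure mode never reaches your UI.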
_revinclude=* returns 500 on two servers
_revinclude=* is FHIR shorthand for "give me this resource plus every resource that points at it, of any type." It is the atomic patient summary query — Patient/123?_revinclude=* means "give me patient 123 plus everything that references patient 123 anywhere in the store."
The conformance check FhirR4BaseRevincludeWildcardNo5xx asserts that no server should return a 5xx status on the wildcard form. Returning 4xx is fine (the wildcard is optional in the spec); returning 200 is fine; returning 500 is not.
Two servers fail it:
- HAPI — pass. Returns 200 with a bundle.
- Microsoft FHIR Server — pass. Returns 200.
- Medplum — pass. Returns a 4xx with "_revinclude must specify a resource type and parameter."
- Aidbox — fail. Returns 500 Internal Server Error.
- Blaze — pass. Returns 200.
- Spark — fail. Returns 500.
The FHIR R4 spec does not require _revinclude=*. Section 3.1.1.6 explicitly lists the wildcard as optional. Returning 4xx with an OperationOutcome (Medplum's response) is the right answer for a server that does not implement the wildcard. Returning 500 with a stack trace (Aidbox, Spark) is what you get when the server's parser accepts the wildcard but the query plan blows up downstream.
If you build your patient summary on _revinclude=* against HAPI and then try to port it to Aidbox or Spark, you get a stack trace in production the first time a customer opens a patient record. And the rewrite is not trivial — you have to enumerate every resource type in your data model that can reference Patient and issue one _revinclude=ResourceType:patient per type, then stitch the results together client-side.
If your app needs "give me this resource and everything that points at it," you cannot treat it as a portable query. Either stay on a server that accepts the wildcard, or enumerate up front and keep the enumeration in sync as your schema grows.
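The enumerate-up-front workaround can be kept in one place as a query builder. A sketch, assuming you maintain a map of which resource types reference Patient and via which search parameter (the function name and the map are mine, and the map must be kept in sync with your schema by hand):

```python
def patient_everything_query(base_url: str, patient_id: str,
                             referencing: dict) -> str:
    """Build a portable replacement for Patient/<id>?_revinclude=*:
    one search with an explicit _revinclude per referencing type.

    `referencing` maps resource type -> the search parameter on that
    type that points at Patient (schema knowledge you must maintain).
    """
    parts = [f"_id={patient_id}"]
    for rtype, param in sorted(referencing.items()):
        parts.append(f"_revinclude={rtype}:{param}")
    return f"{base_url}/Patient?" + "&".join(parts)

# Example with two referencing types (extend for your full data model):
q = patient_everything_query(
    "https://fhir.example.org", "123",
    {"Observation": "patient", "Condition": "patient"})
```

The resulting single request works on servers that reject the wildcard, at the cost of the map growing with your schema.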
_total=accurate works on five of six (and one fails it loudly)
The conformance check FhirR4BaseTotalAccurate sends Observation?_count=1&_total=accurate and asserts the request succeeds with a 2xx — it probes whether the server honors the _total hint at all, not whether the count value is correct.
- HAPI — pass.
- Microsoft FHIR Server — pass.
- Medplum — pass.
- Aidbox — pass.
- Blaze — pass.
- Spark — fail. Returns 500.
So _total=accurate is portable enough — five of six servers will return something. The cost varies wildly: HAPI in particular pays a separate COUNT(*) query that on a 64K-patient corpus pushes p99 latency for that specific query to 32 seconds (covered in detail in HAPI FHIR at 64,000 Patients). Aidbox uses a different default in this matrix (BOX_FHIR_SEARCH_DEFAULT_PARAMS_TOTAL=none per the vendor recommendation) so the explicit _total=accurate request is a separate, cheap counting code path. Blaze and Medplum populate it without surprises.
The practical implication: Bundle.total is not safe to assume in your code. Some servers populate it by default, some require the explicit _total=accurate hint, and one returns 500 on the hint. If your pagination UI needs "page X of Y," either always pass _total=accurate and accept HAPI's tail latency, page by cursor and skip the denominator, or keep a separately-maintained cached count.
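The cursor option above means following the Bundle's own next link instead of computing a denominator. A sketch of the helper (name is mine):

```python
def next_page_url(bundle: dict):
    """Return the Bundle's `next` link URL, or None on the last page.

    Paging by cursor this way never touches Bundle.total, so it works
    even on servers where _total=accurate is slow or returns 500.
    """
    for link in bundle.get("link", []):
        if link.get("relation") == "next":
            return link.get("url")
    return None

# Typical loop shape (fetch() is your HTTP client, not shown):
#   url = f"{base}/Observation?_count=100"
#   while url:
#       bundle = fetch(url)
#       process(bundle.get("entry", []))
#       url = next_page_url(bundle)
```

Your pagination UI then shows "next page" rather than "page X of Y" — a smaller promise, but one every server in the matrix can keep.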
Blood pressure returns zero on every server (and that is correct)
This one is a clinical-data shape gotcha rather than a conformance disagreement, but it shows up in every team's first FHIR query log so it is worth surfacing here.
If you run GET Observation?code=http://loinc.org|8480-6 (systolic blood pressure) against any server in the matrix, every one of them returns zero results. Same query, same data, every server consistent.
I thought I had found a real indexing divergence. I had not. Every server was behaving identically and correctly.
Synthea encodes blood pressure as a panel: Observation.code = 85354-9 ("Blood pressure panel with all children optional"), with the systolic value living in component[0] and the diastolic in component[1]. The standard R4 code search parameter on Observation only matches the top-level code field. It does not descend into component. Every server returned zero because there are zero standalone systolic Observations in the data. The code is buried in the panel.
The fix is the combo-code search parameter (defined in the base R4 spec for Observation), which matches code or component.code:

GET Observation?combo-code=http://loinc.org|8480-6

This works on HAPI, Medplum, Aidbox, Microsoft FHIR Server, and Blaze. Spark's matrix entry on the underlying Observation search test (FhirR4BaseObservationSearchBasic) fails — among Spark's other issues, the extended Observation search parameters are not all implemented.
If your patient summary runs code=8480-6 to fetch a patient's systolic readings against any server, you ship a feature that displays no blood pressure data. Combined with the silent-ignore problem above on the three permissive servers, the worst-case version of this is genuinely dangerous: the typo that returned the unfiltered result set, combined with the panel-structure gotcha that returned zero, produces a clinical app that shows all-or-nothing depending on which bug fires first.
Test on real panel-shaped data. Synthea emits it, which is one of the reasons we use Synthea as the bundle source in the reproducer.
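If you are post-processing panel-shaped Observations client-side rather than relying on combo-code, the component descent looks like this. A sketch (function name is mine; the data shape matches Synthea's 85354-9 panel described above):

```python
def component_value(observation: dict, loinc_code: str):
    """Pull a value out of an Observation by LOINC code, checking the
    top-level code first and then descending into component[] — the
    step the standard `code` search parameter does NOT do for you."""
    def has_code(codeable: dict) -> bool:
        return any(c.get("system") == "http://loinc.org"
                   and c.get("code") == loinc_code
                   for c in codeable.get("coding", []))

    if has_code(observation.get("code", {})):
        return observation.get("valueQuantity", {}).get("value")
    for comp in observation.get("component", []):
        if has_code(comp.get("code", {})):
            return comp.get("valueQuantity", {}).get("value")
    return None

# Synthea-style blood pressure panel:
bp_panel = {
    "resourceType": "Observation",
    "code": {"coding": [{"system": "http://loinc.org", "code": "85354-9"}]},
    "component": [
        {"code": {"coding": [{"system": "http://loinc.org", "code": "8480-6"}]},
         "valueQuantity": {"value": 120, "unit": "mm[Hg]"}},
        {"code": {"coding": [{"system": "http://loinc.org", "code": "8462-4"}]},
         "valueQuantity": {"value": 80, "unit": "mm[Hg]"}},
    ],
}
systolic = component_value(bp_panel, "8480-6")
```

Either way — combo-code server-side or component descent client-side — the point is the same: the reading lives one level down from where code search looks.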
Patient-compartment Bulk Export: half the servers, half the field
Patient/$export is the patient-compartment variant of the Bulk Data Access IG v2 — the standard way to extract a cohort of patients as NDJSON files for analytics, ML, or migration. The conformance suite probes the kickoff surface (six checks: capability declaration, group/system endpoint discovery, Patient/$export returns 202, Content-Location header, NDJSON output format).
Per the bulk-data-v2 profile at the kickoff layer:
- HAPI — green. All 6 checks pass.
- Microsoft FHIR Server — green. All 6 checks pass.
- Medplum — amber. 5 of 6 pass; Group/[id]/$export returns 404 (Patient and System exports work).
- Aidbox — N/A. The aidboxone dev image requires a cloud storage backend (GCP/Azure/AWS) for Bulk Data; without one, the operation returns 500 with "storage-type not specified". Aidbox's hosted edition supports Bulk Data; the locally-run dev image is not configured for it.
- Blaze — N/A. Patient/$export is not implemented; the probe returns 400.
- Spark — N/A. Same as Blaze; not implemented.
The full async lifecycle (status URL polling, NDJSON file output, JWT-signed Backend Services auth) is not yet tested in this round — that ships separately when the Inferno integration lands. The kickoff probe is what is published.
If you are building anything that needs to move patients between stores — ETL, multi-tenant migration, backup, ML training-set extraction — three of six open-source servers in this matrix will do it (HAPI, MS FHIR fully; Medplum for Patient and System scope). The other three require either operational setup (Aidbox + a cloud bucket) or are not in the implementation at all (Blaze, Spark). This is one of the cleaner findings in the matrix: the spec has landed, two servers ship full kickoff support, and the outliers are easy to see.
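The kickoff-layer distinctions in the list above reduce to interpreting one response. A sketch of that classification, following the Bulk Data v2 kickoff contract (202 plus a Content-Location header to poll; the function name and labels are mine):

```python
def classify_kickoff(status: int, headers: dict) -> str:
    """Interpret a Bulk Data $export kickoff response.

    Per the v2 IG, a successful kickoff is HTTP 202 with a
    Content-Location header pointing at the status-polling URL.
    4xx generally means the operation is not implemented/allowed;
    5xx is a server-side failure (e.g. Aidbox without a storage
    backend in this matrix). Sketch of the probe logic only.
    """
    if status == 202 and "Content-Location" in headers:
        return "accepted"        # poll headers["Content-Location"]
    if 400 <= status < 500:
        return "not-implemented"
    if status >= 500:
        return "server-error"
    return "unexpected"
```

Against this matrix: HAPI and MS FHIR classify as "accepted", Blaze and Spark as "not-implemented", and an unconfigured Aidbox as "server-error".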
What about terminology operations?
Both the ValueSet/$expand probe (against the LOINC LL715-4 PHQ-9 answer list) and the CodeSystem/$lookup probe (against LOINC code 8480-6, systolic blood pressure) pass on all six servers in the current round.
This is a meaningful improvement from where the matrix sat in earlier rounds. Worth flagging: the conformance probe asserts the operation returns 200 with an expanded ValueSet or a populated Parameters resource. It does not assert what the content of that response should be — different servers ship with different default terminology backings (HAPI's preloaded Lucene index, Aidbox's optional CodeSystem import, Blaze and Spark using bundled minimal terminology subsets). Two servers can both pass $expand and return different code counts for the same value set.
In other words: terminology operations are now portable at the HTTP-200 layer, but the content portability is a separate question that the published conformance suite does not yet probe. If your app's correctness depends on which specific codes come back from $expand, do not assume cross-server equivalence based on the matrix alone — pin to a server, verify the content shape against your test fixtures, and retest when you change servers. (A future post may dig into the content-divergence question once a probe set is published.)
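Checking content equivalence yourself is cheap once you have two $expand responses in hand: extract the (system, code) pairs and diff the sets. A sketch (function name is mine; the ValueSet expansion shape is the standard R4 one):

```python
def expansion_codes(value_set: dict) -> set:
    """Extract (system, code) pairs from a ValueSet $expand response so
    two servers' expansions can be diffed. Both servers can return 200
    and still disagree on content."""
    return {(c.get("system"), c.get("code"))
            for c in value_set.get("expansion", {}).get("contains", [])}

# Two hypothetical servers' expansions of the same value set:
vs_a = {"expansion": {"contains": [
    {"system": "http://loinc.org", "code": "LA6568-5"},
    {"system": "http://loinc.org", "code": "LA6569-3"}]}}
vs_b = {"expansion": {"contains": [
    {"system": "http://loinc.org", "code": "LA6568-5"}]}}

divergence = expansion_codes(vs_a) ^ expansion_codes(vs_b)
```

An empty symmetric difference against your test fixtures is the cross-server equivalence check the published matrix does not yet give you.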
What to actually do with all this
There is no "best" server in the matrix. There are servers that are strict on the things you care about and lenient on the things you do not, and the question to ask before you pick one is which of these disagreements are load-bearing in your app.
If you are building patient access (SMART-on-FHIR apps that read longitudinal data for a logged-in patient), you live and die on _revinclude behavior, $export, and the silent-ignore default. HAPI and Microsoft FHIR Server are the correct picks on conformance grounds.
If you are building clinical AI (agents that read FHIR at read-heavy load), you need the silent-ignore problem to be loud in staging (test against Medplum, Aidbox, or HAPI — the three that reject unknown parameters by default) and you need fast search at production scale (see the performance post).
If you are building anything that exports cohorts, the Bulk Data axis matters most. HAPI and MS FHIR are the obvious picks; Medplum is reasonable for Patient-scope exports.
If you are building against a single server and telling yourself it is probably fine: run the reproducer. docker compose up takes about a minute. One Synthea patient loads in another minute. The conformance probe runs in seconds. The silent-ignore row alone is worth the ten minutes.
Run it yourself
The full matrix lives at /conformance, one row per check, one column per server, each cell with a spec citation. The companion repo at fhir-server-compare reproduces every number in this post on your own machine.
```shell
git clone https://github.com/mock-health/fhir-server-compare
cd fhir-server-compare
sudo bash scripts/setup-host.sh
cp .env.example .env
# paste AIDBOX_LICENSE from aidbox.app/signup
set -a; source .env; set +a
docker compose up -d
make conformance   # ~1-2 minutes for the full probe suite
```
The make conformance output writes results/conformance/<round>/<server>/*.testreport.json and rolls up to conformance.json. Every row of the published matrix maps to a check in this output.
If you want synthetic patient data that exercises every one of these behaviors — panel-coded Observations, longitudinal Encounter chains, cross-resource references that Provenance can actually resolve — that is what we built mock.health for. Free tier, no sales call.
The companion post on performance covers what happens when you scale the same six servers up to 64K patients and run workloads against them. Short version: the fastest server on CRUD median has a search-throughput cliff at 64K. The most consistent server on every metric is running with a ten-second p99 tail. Different story.
Related posts
- Your Clinical AI Agent Needs More Than 5 Patients — Your prior auth agent works in testing. Then it meets a 68-year-old with CKD, hypertension, and a specialist referral — and crashes.
- Building a FHIR API Gateway: What HAPI Won't Do for You — HAPI stores FHIR and runs queries. It doesn't auth users, enforce access, or fix URLs behind a load balancer. Here's the gateway layer.
- The FHIR Sandbox Problem: Why Open Epic Isn't Enough — You opened a Patient resource and found TEST TEST. The sandbox is built for certification, not demos. Here's what's missing and the fix.