Your FHIR Architecture Determines Your Test Data Strategy
Facade, hybrid, or FHIR-native — each architecture breaks differently. Here's what to test for each model and what breaks when you don't.
mock.health · 10 min read · 2026-04-08
Darren Devitt's FHIR Architecture Decisions is the best guide available on choosing between facade, hybrid, and FHIR-native architectures. If you haven't read it, you should — it's the only vendor-neutral treatment of the three models, with a 9-question decision framework and honest assessments of where each one fails. We're not going to rehash his framework here. Go read his book.
What his book doesn't cover is what happens after you choose. Specifically: how do you test the architecture you picked before you have production data? Each model breaks in different ways, which means each one needs different test data to surface those breaks. A facade that passes integration tests against a single clean database will fall apart when it has to join across three legacy systems. A hybrid that syncs beautifully against simple patients will drift silently when the source system has mixed terminology. A FHIR-native system that works with structurally valid but clinically empty patients will crumble the moment a real chart walks in.
Your architecture determines your test data strategy. Here's what that means in practice.
Facade: Test the Mapping Layer
A facade constructs FHIR resources on the fly from existing databases. There is no FHIR storage — every request is a live translation, and the bulk of the work is mapping legacy schemas to FHIR resources.
The thing that breaks: the mapping layer. And it breaks in ways that only show up with messy data.
What to test
Terminology mapping. Your source system stores conditions in ICD-9. FHIR wants SNOMED CT. Your mapper handles the common codes — diabetes, hypertension, COPD. But what about ICD-9 799.9 (Other unknown and unspecified cause of morbidity)? Does your mapper drop it, pass it through with the wrong system URI, or map it to a SNOMED "unknown" concept? You need test data with edge-case codes to find out.
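The drop-mislabel-or-passthrough question is easy to pin down with a unit test around the mapper's fallback path. A minimal sketch in plain Python — the crosswalk entries are illustrative, not a real ICD-9-to-SNOMED table, though the two system URIs are the standard ones:

```python
SNOMED_SYSTEM = "http://snomed.info/sct"
ICD9_SYSTEM = "http://hl7.org/fhir/sid/icd-9-cm"

# Illustrative crosswalk -- a real one comes from a terminology service.
ICD9_TO_SNOMED = {
    "250.00": "44054006",  # type 2 diabetes (illustrative entry)
    "401.9": "38341003",   # essential hypertension (illustrative entry)
}

def map_condition_coding(icd9_code: str) -> dict:
    """Return a FHIR Coding dict. Unmapped codes pass through with the
    correct ICD-9 system URI -- not dropped, not mislabeled as SNOMED --
    so consumers can tell that translation never happened."""
    snomed = ICD9_TO_SNOMED.get(icd9_code)
    if snomed is not None:
        return {"system": SNOMED_SYSTEM, "code": snomed}
    return {"system": ICD9_SYSTEM, "code": icd9_code}
```

A test case for the edge-case code makes the chosen policy explicit: `map_condition_coding("799.9")` must come back with the ICD-9 system URI, never a SNOMED one.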
Multi-source joins. Devitt warns that facades pulling from multiple databases can take minutes per request. But the performance problem is secondary to the correctness problem: when Patient demographics come from System A and lab results come from System B, and the two systems use different patient identifiers, your facade has to join them. Test data needs to exercise that join — not just single-source happy paths.
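The shape of the join — and its silent failure mode — can be sketched with two in-memory "systems" and an MPI-style identifier crosswalk. All names and shapes here are hypothetical stand-ins:

```python
# System A: demographics, keyed by A's patient ids.
system_a_patients = {
    "A-100": {"resourceType": "Patient", "id": "A-100", "birthDate": "1956-03-02"},
    "A-200": {"resourceType": "Patient", "id": "A-200", "birthDate": "1983-11-19"},
}

# System B: lab results, keyed by B's patient ids.
system_b_labs = {
    "B-77": [{"resourceType": "Observation", "code": {"text": "creatinine"},
              "valueQuantity": {"value": 1.4, "unit": "mg/dL"}}],
}

# Identifier crosswalk between the two systems. Note A-200 has no entry.
crosswalk = {"A-100": "B-77"}

def assemble(a_id: str) -> dict:
    patient = system_a_patients[a_id]
    b_id = crosswalk.get(a_id)
    # A missing crosswalk entry silently yields zero labs -- exactly the
    # failure mode that single-source test data never exercises.
    labs = system_b_labs.get(b_id, []) if b_id else []
    return {"patient": patient, "observations": labs}
```

The dangerous case is `assemble("A-200")`: it returns a perfectly valid response with an empty observation list, which is indistinguishable from a patient who genuinely has no labs unless your tests assert otherwise.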
Search parameter translation. A consumer sends GET /Observation?category=laboratory&date=gt2024-01-01. Your facade translates that to a SQL query against a legacy schema that doesn't have a "category" column — lab results live in a different table than vitals. The translation logic is where bugs hide, and it's specific to every combination of FHIR search parameter and legacy schema.
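A stripped-down sketch of that translation, where "category" is implicit in which legacy table the data lives in. Table and column names are hypothetical, and a real implementation would use parameterized queries rather than string interpolation:

```python
# Hypothetical legacy schema: labs and vitals live in separate tables,
# so the FHIR "category" parameter becomes a table selection.
CATEGORY_TABLES = {"laboratory": "lab_results", "vital-signs": "vitals"}

def translate_observation_search(params: dict) -> str:
    table = CATEGORY_TABLES[params["category"]]  # KeyError = unmapped category
    clauses = []
    date = params.get("date", "")
    if date.startswith("gt"):  # only the gt prefix handled in this sketch
        clauses.append(f"result_date > '{date[2:]}'")
    where = f" WHERE {' AND '.join(clauses)}" if clauses else ""
    return f"SELECT * FROM {table}{where}"
```

Every unhandled prefix (`ge`, `lt`, `sa`, `eb`...) and every category your table map doesn't know about is a bug waiting for the right request — which is why the test matrix here is FHIR parameters × legacy schema, not either one alone.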
_include and _revinclude across sources. GET /Patient?_revinclude=Condition:patient&_revinclude=MedicationRequest:patient asks for a patient plus all their conditions and medications in one response. If conditions and medications live in different source systems, your facade is making three queries, merging the results, and constructing a coherent Bundle. Test this with patients who have 15+ conditions and 10+ medications — the kind of complexity that exposes pagination bugs and timeout issues.
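The merge step can be sketched as a pure function over the three query results. This shows only Bundle assembly — pagination and timeout handling, where the real bugs live, are omitted:

```python
def assemble_revinclude_bundle(patient: dict, conditions: list, meds: list) -> dict:
    """Merge three source-system query results into one searchset Bundle.
    Per the FHIR search spec, only the matched Patient counts toward
    'total'; _revinclude resources carry search.mode 'include'."""
    entries = [{"resource": patient, "search": {"mode": "match"}}]
    for res in conditions + meds:
        entries.append({"resource": res, "search": {"mode": "include"}})
    return {"resourceType": "Bundle", "type": "searchset",
            "total": 1, "entry": entries}
```

For the 15-condition, 10-medication patient described above, the Bundle carries 26 entries but a `total` of 1 — consumers that conflate the two are another class of bug this test data surfaces.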
What test data you need
Data that exercises the ugly joins — patients with records spread across what would be multiple source systems. Mixed code systems (ICD-9, ICD-10, SNOMED in the same patient). Edge-case codes that test your terminology mapper. High-volume patients (many conditions, many meds, many encounters) that stress the multi-source assembly.
The classic mistake: testing your facade against a single clean PostgreSQL database when production will pull from three legacy systems with inconsistent schemas. Your tests pass. Your demo fails.
Hybrid: Test the Sync
A hybrid syncs data from legacy systems into a FHIR server. The FHIR server holds a copy. Legacy stays as source of truth.
The thing that breaks: data drift. The sync fails silently, the FHIR server gets stale, and nobody notices until a consumer reports missing data two weeks later.
What to test
Sync completeness. After a sync run, does the FHIR server contain everything the source system has? For every Patient in the source, is there a corresponding Patient in FHIR with all required fields? This sounds basic, but schema changes in the source system can silently break the translation layer. A column gets renamed, a code system changes, a new nullable field appears — and the sync process either fails or produces incomplete resources.
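A completeness check is straightforward to express as a comparison between a source extract and the FHIR server's contents. A sketch over plain dicts, with an illustrative required-field set — yours comes from your profiles:

```python
REQUIRED_FIELDS = ("id", "gender", "birthDate")  # illustrative required set

def completeness_report(source_patients: list, fhir_patients: list) -> dict:
    """Compare a source-system extract against the FHIR server's Patients:
    which ids are missing entirely, and which synced but lost required
    fields along the way (the silent schema-change failure mode)."""
    fhir_by_id = {p["id"]: p for p in fhir_patients}
    missing, incomplete = [], []
    for sp in source_patients:
        fp = fhir_by_id.get(sp["id"])
        if fp is None:
            missing.append(sp["id"])
        elif any(f not in fp for f in REQUIRED_FIELDS):
            incomplete.append(sp["id"])
    return {"missing": missing, "incomplete": incomplete}
```

Run it after every sync in CI, not once at go-live — the renamed column shows up months later.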
Terminology consistency post-sync. The source system sends a Condition coded in ICD-10. Your sync pipeline maps it to SNOMED for the FHIR server. But the source system has a mix of ICD-10, ICD-9, and free-text conditions. Your mapper handles ICD-10 correctly but drops the ICD-9 codes. Now your FHIR server is missing 30% of the patient's conditions. Test data with mixed code systems catches this.
Temporal consistency. Source system updates a Patient's address at 2:00 PM. Sync runs at 2:30 PM. A consumer queries the FHIR server at 2:15 PM and gets the old address. That's expected — eventual consistency. But what if the sync fails at 2:30? The FHIR server stays stale indefinitely. Your integration tests should verify behavior when sync is delayed or partially failed, not just when everything works.
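One way to separate expected lag from a failed sync is a staleness check with an explicit tolerance. The SLA value here is an illustrative assumption (one sync cycle plus slack):

```python
from datetime import datetime, timedelta

SYNC_SLA = timedelta(minutes=45)  # illustrative: one sync cycle plus slack

def stale_ids(rows: list, now: datetime) -> list:
    """rows: (id, source_updated, fhir_last_updated) datetime triples.
    Stale = the source changed and the FHIR copy hasn't caught up within
    the SLA. A short lag is eventual consistency working as designed;
    a long one means the 2:30 sync never ran."""
    return [rid for rid, src, fhir in rows
            if src > fhir and now - src > SYNC_SLA]
```

Your integration tests can then feed this deliberately delayed timestamps and assert the alarm actually fires — rather than discovering at 2:00 AM that it doesn't.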
Data correction after sync failure. Devitt writes that he's "seen systems released without [data correction capabilities] that came apart within weeks." Your test suite should include scenarios where specific resources are intentionally out of sync, and verify that your reconciliation process detects and fixes the drift.
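That scenario — start in sync, break it on purpose, verify detection — fits in a few lines. A naive reconciler sketch over `{id: {"version": n}}` maps; a production one would also compare content hashes, not just versions:

```python
import copy

def detect_drift(source: dict, fhir: dict) -> dict:
    """Naive reconciliation: ids present in the source but missing from
    FHIR, and ids whose versions diverged."""
    drift = {"missing": [], "version_mismatch": []}
    for rid, src in source.items():
        if rid not in fhir:
            drift["missing"].append(rid)
        elif fhir[rid]["version"] != src["version"]:
            drift["version_mismatch"].append(rid)
    return drift

# Start in sync, then intentionally inject both kinds of damage.
source = {"pat-1": {"version": 3}, "pat-2": {"version": 1}}
fhir = copy.deepcopy(source)
del fhir["pat-2"]                # simulate a resource lost in a failed sync
source["pat-1"]["version"] = 4   # source updated, FHIR copy not yet
```

The assertion that matters is that `detect_drift` reports both `pat-2` (missing) and `pat-1` (stale) — and that your actual reconciliation job, pointed at the same injected drift, fixes them.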
What test data you need
A "source" dataset and a "synced" dataset with intentional differences — missing resources, stale timestamps, mixed terminology. Patients where the source has updated records but the FHIR server hasn't caught up. Edge cases: a source Patient that was deleted but still exists in FHIR, a Condition that changed codes between sync runs, an Observation that was corrected in the source but not propagated.
The classic mistake: testing your sync against a static source and verifying it once. Production data changes continuously. Your sync tests need to simulate that.
FHIR-Native: Test the Clinical Model
FHIR-native means the FHIR server is the source of truth. All reads and writes flow through FHIR. No legacy translation.
The thing that breaks: clinical realism. The FHIR API works perfectly. The data model handles every resource type. Then real clinical data arrives and you discover your system has no concept of comorbidity patterns, no longitudinal lab trends, and no encounter complexity.
What to test
Write validation. Your FHIR-native system accepts POST requests for new resources. What does it do with an Observation that has no category? A Condition with a code from the wrong ValueSet? A MedicationRequest with a dosage that's syntactically valid but clinically nonsensical (metformin 50,000mg)? If your validation is only structural (does it parse?), you'll accept garbage. If it's too strict, you'll reject valid resources from external data providers who code things slightly differently than you expect.
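The missing layer is usually plausibility checks on top of structural parsing. A sketch of both checks mentioned above — the dose ceiling is an illustrative number for testing, not clinical guidance:

```python
# Illustrative plausibility ceiling, not clinical guidance.
MAX_DAILY_MG = {"metformin": 2550}

def validate_observation(obs: dict) -> list:
    """Structural parsers accept an Observation with no category;
    a plausibility layer should flag it."""
    issues = []
    if not obs.get("category"):
        issues.append("Observation.category missing")
    return issues

def validate_medication_request(drug: str, daily_mg: float) -> list:
    """Syntactically valid dosages can still be clinically nonsensical."""
    issues = []
    ceiling = MAX_DAILY_MG.get(drug.lower())
    if ceiling is not None and daily_mg > ceiling:
        issues.append(f"{drug} {daily_mg}mg/day exceeds plausible ceiling")
    return issues
```

The harder design question is what to do with the flags: hard-reject and you bounce valid resources from external partners; warn-and-accept and you need a queue for humans to review. Test data with both kinds of edge case forces you to pick.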
Query performance at clinical density. FHIR-native systems often start with structurally valid but clinically sparse test patients — a Patient with 1 Condition, 1 Encounter, 0 Observations. Performance is great. Then a real patient shows up with 15 conditions, 47 encounters over 5 years, 200+ observations, 12 active medications, and 3 imaging studies. $everything returns a 2MB Bundle. Your search indexes, pagination logic, and client rendering all need to handle that density.
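You don't need real patients to exercise density — a generator that matches those counts surfaces pagination and payload-size problems early. A sketch with placeholder field content:

```python
import json

def dense_patient_bundle(n_conditions=15, n_encounters=47, n_observations=200):
    """Build a deliberately dense $everything-style Bundle to exercise
    pagination and payload handling. Resource content is placeholder."""
    entries = [{"resource": {"resourceType": "Patient", "id": "dense-1"}}]
    for kind, count in (("Condition", n_conditions),
                        ("Encounter", n_encounters),
                        ("Observation", n_observations)):
        for i in range(count):
            entries.append({"resource": {
                "resourceType": kind, "id": f"{kind.lower()}-{i}",
                "subject": {"reference": "Patient/dense-1"}}})
    return {"resourceType": "Bundle", "type": "searchset", "entry": entries}

bundle = dense_patient_bundle()
payload_bytes = len(json.dumps(bundle))  # rough proxy for response size
```

With real clinical content (codings, values, narratives) each entry is far heavier than these stubs, which is how a dense chart reaches a multi-megabyte `$everything` response.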
Cross-resource integrity. A FHIR-native system that owns its data needs to ensure referential consistency. An Observation references a Patient that doesn't exist. A MedicationRequest references a Medication with an RxNorm code that's been deprecated. A ServiceRequest references a Practitioner in a different partition. These are the kinds of issues that data governance processes catch in legacy systems — processes you now need to replicate.
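A referential-integrity sweep over a dataset is a good standing test. A sketch that walks plain FHIR JSON for `reference` strings and flags any pointing outside the dataset:

```python
def collect_references(node):
    """Recursively collect every 'reference' string in a FHIR resource
    represented as plain JSON (dicts and lists)."""
    refs = []
    if isinstance(node, dict):
        for key, val in node.items():
            if key == "reference" and isinstance(val, str):
                refs.append(val)
            else:
                refs.extend(collect_references(val))
    elif isinstance(node, list):
        for item in node:
            refs.extend(collect_references(item))
    return refs

def dangling_references(resources: list) -> list:
    """Return (resource id, reference) pairs that point at resources
    not present in the dataset."""
    local = {f"{r['resourceType']}/{r['id']}" for r in resources}
    return [(r["id"], ref) for r in resources
            for ref in collect_references(r) if ref not in local]
```

This catches the Observation-to-missing-Patient case; deprecated RxNorm codes and cross-partition references need a terminology service and partition-aware checks on top — two of the components from the next paragraph.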
The components you forgot to build. Devitt lists 22 components beyond the FHIR server that production systems typically need — MPI, MDM, terminology services, data correction interfaces, bulk data processing, audit logging. If your test data is simple enough that you never need any of these, your tests aren't representative. A 68-year-old with CKD stage 3, diabetes, hypertension, and 4 years of declining eGFR will stress your data model in ways that a healthy 30-year-old with a single encounter never will.
What test data you need
Clinically realistic patients. Not structurally valid empty shells — patients with correlated comorbidities, longitudinal lab trends, medication progressions, encounter variety across ambulatory, inpatient, and emergency settings. The kind of patients your system will see in production but that no sandbox provides.
The classic mistake: testing your FHIR-native system against Synthea defaults and concluding that everything works. Synthea's built-in modules produce structurally correct patients with single conditions and minimal clinical depth. Your system handles them beautifully. The first complex patient from a real data feed breaks three assumptions.
Match Your Test Data to Your Architecture
| Architecture | Primary failure mode | What test data must exercise |
|---|---|---|
| Facade | Mapping layer — terminology, multi-source joins, search translation | Mixed code systems, multi-source patients, edge-case codes, high-volume _include queries |
| Hybrid | Data drift — sync failures, stale data, inconsistent terminology | Source-vs-FHIR discrepancies, partial sync scenarios, temporal consistency gaps |
| FHIR-native | Clinical model — sparse data hides workflow gaps | Clinically dense patients, comorbidity patterns, longitudinal data, write validation edge cases |
The common thread: if your test data is cleaner than your production data, your tests are lying to you. A facade tested against one clean database won't survive three messy ones. A hybrid tested with perfect sync won't survive a Tuesday night failure. A FHIR-native system tested with single-condition patients won't survive a geriatric chart.
Your test environment needs to match the failure mode of your architecture, not just the happy path.
mock.health generates FHIR patients with the clinical density and terminology variance that production actually sends — correlated comorbidities, mixed code systems, longitudinal data. Free tier, no sales call →
Related posts
- FHIR, USCDI & US Core: How They Fit Together — FHIR says how to send data. USCDI says what data. US Core says exactly how to format it. Here's how the three standards fit together.
- FHIR Isn't Enough: Real-Time EHR Still Needs HL7v2 — FHIR answers 'what does this patient look like now?' It has no production answer for 'tell me when something changes.' HL7v2 still does.
- 90% of Hospitals Share Data. Most of It Isn't FHIR. — A federal survey of 2,253 hospitals: read-side FHIR is table stakes. The real gap — and opportunity — is write-side clinical integrations.