Your Clinical AI Agent Needs More Than 5 Patients

Your prior auth agent works in testing. Then it meets a 68-year-old with CKD, hypertension, and a specialist referral — and crashes.

mock.health · 9 min read · 2026-04-08


Your prior authorization agent works. You've tested it against 5 synthetic patients. It reads the patient's conditions, checks the payer's coverage rules, and submits the auth request. Clean approvals every time. Ship it.

Then it meets a 68-year-old with type 2 diabetes, CKD stage 3, hypertension, hyperlipidemia, and a pending nephrology referral. The payer's rules require step therapy documentation for the GLP-1 agonist. The agent has never seen a step therapy requirement because none of your test patients had enough medication history to trigger one.

The auth request is denied. Your agent doesn't know what to do with a denial because it's never received one. The clinician finds out three days later when the patient calls asking why their medication wasn't filled.

This is the shallow-data problem: simple patients don't get denied. And if your test data only contains simple patients, your agent has never practiced handling the hard cases — which are most of the cases that actually matter.

AI Is Now a Standalone Reason to Adopt FHIR

Darren Devitt's FHIR Architecture Decisions notes that 2025 "marked the point where AI use became a common standalone reason for adopting FHIR." He gives it one paragraph and moves on. That single paragraph undersells what's actually happening.

A new category of software is emerging: AI agents that consume FHIR data as their primary input. Prior authorization agents read patient records to build auth submissions. Care coordination platforms ingest longitudinal histories to generate care plans. Clinical decision support systems analyze conditions, medications, and lab trends to surface recommendations. Denial management agents parse EOBs and remittance data to draft appeals.

These agents don't just read FHIR resources — they reason over them. They make decisions based on clinical relationships between resources: the connection between a declining eGFR trend and a CKD diagnosis, the progression from metformin to insulin in a diabetic patient, the gap between a specialist referral and the follow-up encounter that should have happened.

The test data requirements for these agents are fundamentally different from traditional FHIR integration testing. A patient portal needs to display resources correctly. An AI agent needs to reason over them correctly. The difference shows up in what breaks when the data is too simple.

What Breaks with Shallow Data

Prior authorization agents

A prior auth agent reads patient history, checks payer rules, and submits requests. The rules are conditional: step therapy requirements depend on medication history, medical necessity depends on documented comorbidities, specialist referrals require documented primary care visits.

With simple test patients (one condition, one medication, one encounter), every auth request is straightforward. The agent learns the happy path: read condition, find matching rule, submit. Approved.

With realistic patients, the conditional rules actually fire: step therapy requirements trigger on medication history, medical necessity checks pull in documented comorbidities, and some requests come back denied. The agent has to handle denials and requests for additional documentation, not just the happy path.
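The conditional nature of those rules is easy to sketch. Below is a minimal step-therapy check in Python; the rule table and the flattened medication history are hypothetical stand-ins for real payer rules and FHIR MedicationRequest resources:

```python
# Hypothetical, simplified payer rule: a GLP-1 agonist requires a documented
# trial of metformin (step therapy) before it can be approved.
STEP_THERAPY = {"GLP-1 agonist": ["metformin"]}

def check_step_therapy(requested_drug: str, med_history: list[dict]) -> dict:
    """Return an auth decision for one drug against one patient's history.

    med_history items look like {"drug": "metformin"} — a flattened
    stand-in for FHIR MedicationRequest resources.
    """
    required = STEP_THERAPY.get(requested_drug, [])
    tried = {m["drug"] for m in med_history}
    missing = [drug for drug in required if drug not in tried]
    if missing:
        return {"status": "denied",
                "reason": "step therapy: no documented trial of " + ", ".join(missing)}
    return {"status": "approved"}

# A shallow test patient with no medication history sails through, but only
# because the requested drug happens to have no step-therapy rule attached:
print(check_step_therapy("lisinopril", []))      # {'status': 'approved'}

# A realistic request exercises the denial path the shallow data never reaches:
print(check_step_therapy("GLP-1 agonist", [])["status"])   # denied
```

An agent tested only against the first call never learns that the second outcome exists.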

Care coordination agents

A care coordination agent identifies gaps in care: missed follow-ups, overdue screenings, medication conflicts, unaddressed referrals.

With a single-encounter patient, there are no gaps to find. The agent reports "no issues" and you conclude it works.

With a multi-year patient history, the gaps are there to be found: missed follow-ups, overdue screenings, conflicting medications, a specialist referral with no corresponding encounter. If the agent still reports "no issues" against that history, you've found a real bug — one a single-encounter patient could never surface.
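One of those gap checks — a referral with no follow-up encounter — can be sketched in a few lines. The dicts below are hypothetical flattened stand-ins for FHIR ServiceRequest and Encounter resources, and the 90-day window is an assumed policy:

```python
from datetime import date, timedelta

def find_unresolved_referrals(referrals, encounters, window_days=90):
    """Flag referrals with no encounter of the referred specialty within the window."""
    gaps = []
    for ref in referrals:
        deadline = ref["date"] + timedelta(days=window_days)
        followed_up = any(
            enc["specialty"] == ref["specialty"] and ref["date"] <= enc["date"] <= deadline
            for enc in encounters
        )
        if not followed_up:
            gaps.append(ref)
    return gaps

referrals = [{"specialty": "nephrology", "date": date(2025, 1, 10)}]
encounters = [{"specialty": "primary care", "date": date(2025, 2, 1)}]

# The nephrology referral is flagged: no nephrology encounter within 90 days.
print(find_unresolved_referrals(referrals, encounters))
```

With a single-encounter test patient, `referrals` is empty and this logic is never executed — which is exactly how the "no issues" false positive survives testing.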

Clinical decision support

A CDS agent analyzes patient data and surfaces recommendations: adjust medication dosage based on lab trends, flag high-risk patients for proactive outreach, suggest diagnostic workups based on symptom patterns.

With clinically empty patients, the agent has nothing to analyze. No trends, no patterns, no risk factors.

With realistic patients, there is signal to work with: a declining eGFR trend alongside a CKD diagnosis, a medication progression from metformin to insulin, a comorbidity cluster that should raise a risk score. Whether the agent surfaces the right recommendations from that signal is the test that actually matters.
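The trend analysis itself is not exotic. A minimal least-squares slope over longitudinal eGFR values — a stand-in for a series of FHIR Observation resources — shows why a single-encounter patient gives a CDS agent nothing to do:

```python
def egfr_slope(values_by_year: list[tuple[int, float]]) -> float:
    """Least-squares slope of eGFR per year (negative = declining renal function)."""
    n = len(values_by_year)
    mean_x = sum(x for x, _ in values_by_year) / n
    mean_y = sum(y for _, y in values_by_year) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in values_by_year)
    den = sum((x - mean_x) ** 2 for x, _ in values_by_year)
    return num / den

# A single-encounter patient has one lab value: no trend, nothing to flag.
# A three-year history reveals a decline consistent with progressing CKD:
slope = egfr_slope([(2023, 62.0), (2024, 55.0), (2025, 48.0)])
print(round(slope, 1))  # -7.0 (mL/min/1.73m² per year)
```

A test population without multi-year labs leaves this entire code path, and every recommendation downstream of it, unexercised.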

What "Enough" Test Data Looks Like

The minimum depends on what your agent does. But the principle is the same: your test population needs to contain the clinical scenarios your agent will encounter in production.

Volume

Five patients isn't enough for any agent. You need enough patients to cover the distribution of cases your agent will handle:

For a prior auth agent, we'd recommend at minimum 50-100 patients spanning the condition categories your payer rules cover. For a CDS agent analyzing lab trends, you need patients with 3+ years of longitudinal data.

Clinical density

Each patient needs enough clinical depth to exercise your agent's reasoning:

| Agent type | What the data needs |
| --- | --- |
| Prior auth | Medication history with start/stop dates, condition-medication correlations, prior auth requests and outcomes |
| Care coordination | Multi-year encounters, referral chains (ServiceRequest → Encounter), screening schedules |
| Clinical decision support | Longitudinal labs with reference ranges and interpretation flags, trending vital signs, comorbidity clusters |
| Denial management | EOB resources with denial reason codes, remittance data, appeal documentation |

Comorbidity correlation

This is the one most people miss. Real patients don't have random conditions. A 65-year-old with type 2 diabetes is likely to also have hypertension (75% co-occurrence), hyperlipidemia (70%), and some degree of CKD (30-40%). These conditions travel together because they share pathophysiology.

If your test patients have conditions assigned randomly — diabetes on Patient A, hypertension on Patient B, CKD on Patient C, never together — your agent never sees the comorbidity patterns it will encounter in production. The prior auth rules that trigger on condition combinations go untested. The CDS risk stratification that depends on comorbidity clusters produces meaningless scores.
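Correlated condition assignment is cheap to implement. Here's a toy sampler using the co-occurrence rates quoted above as conditional probabilities; a real pipeline would derive these from population statistics by age and sex rather than hard-coding them:

```python
import random

# Illustrative conditional probabilities given type 2 diabetes (rates from the
# co-occurrence figures above; hard-coded here purely for the sketch).
P_GIVEN_T2D = {"hypertension": 0.75, "hyperlipidemia": 0.70, "ckd_stage_3": 0.35}

def sample_comorbidities(has_t2d: bool, rng: random.Random) -> set[str]:
    """Sample a condition set where comorbidities travel together, not independently."""
    conditions = set()
    if has_t2d:
        conditions.add("type_2_diabetes")
        for cond, p in P_GIVEN_T2D.items():
            if rng.random() < p:
                conditions.add(cond)
    return conditions

rng = random.Random(42)
cohort = [sample_comorbidities(True, rng) for _ in range(1000)]
htn_rate = sum("hypertension" in c for c in cohort) / len(cohort)
print(f"hypertension co-occurrence in diabetic cohort: {htn_rate:.0%}")  # ≈ 75%
```

The point is the joint distribution: assigning each condition to a different patient independently produces a cohort where the condition-combination code paths in your agent never run.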

We built this into our data generation pipeline. Patient conditions are derived from 4.4 million real CMS patient journeys — the comorbidity patterns, medication progressions, and encounter frequencies match real population statistics by age and sex. A 68-year-old diabetic in our data has the same probability of concurrent CKD and hypertension as a 68-year-old diabetic in a Medicare claims database.

The Feedback Loop Problem

There's a subtler issue beyond missing test scenarios: agents trained or evaluated on shallow data develop false confidence.

If your agent achieves 98% accuracy on simple patients, you might conclude it's production-ready. But that 98% reflects the difficulty of the test set, not the capability of the agent. Simple patients are easy cases. The agent hasn't been tested on the cases where clinical judgment matters — the ambiguous ones, the multi-condition ones, the ones where the right answer depends on context that isn't in a single resource.

This is the healthcare version of a well-known ML problem: evaluating a model on data that doesn't represent the production distribution. In healthcare, the consequences are specific: missed denials, delayed care, incorrect risk scores, and clinicians who stop trusting the tool after it fails on the first complex patient they test it on.
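One cheap guard against this false confidence is to stratify evaluation metrics by patient complexity instead of reporting a single accuracy number. A sketch, assuming a simple per-patient evaluation log (the record shape and the one-condition threshold are both assumptions for illustration):

```python
from collections import defaultdict

def accuracy_by_complexity(results: list[dict]) -> dict[str, float]:
    """Bucket accuracy by patient complexity.

    results items look like {"n_conditions": int, "correct": bool}.
    """
    buckets = defaultdict(list)
    for r in results:
        bucket = "simple" if r["n_conditions"] <= 1 else "complex"
        buckets[bucket].append(r["correct"])
    return {b: sum(v) / len(v) for b, v in buckets.items()}

# A test set dominated by simple patients: the headline number hides the failure.
results = (
    [{"n_conditions": 1, "correct": True}] * 98 + [{"n_conditions": 1, "correct": False}] * 2
    + [{"n_conditions": 4, "correct": True}] * 6 + [{"n_conditions": 4, "correct": False}] * 4
)
print(accuracy_by_complexity(results))  # {'simple': 0.98, 'complex': 0.6}
```

If the "complex" bucket is empty or tiny, that absence is itself the finding: the evaluation set doesn't cover the production distribution.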

The fix isn't more sophisticated AI. It's more representative test data.

What We'd Recommend

If you're building a clinical AI agent that consumes FHIR data, you've probably already got a handful of Synthea patients loaded into a local HAPI instance. Maybe you wrote a script to generate 20 bundles and it's been good enough so far. That's how every team starts — and at some point, the in-house test data becomes its own maintenance burden: configuring Synthea modules, tuning parameters, adding conditions your agent needs to handle, keeping it all in sync as your product evolves. The test data pipeline quietly becomes a second project.

Here's what we'd suggest instead:

Start with realistic synthetic data, not production PHI. You can iterate faster without IRB approvals, BAAs, and de-identification pipelines. Synthetic data with realistic clinical distributions lets you build and test your agent's reasoning before you have production access. Here's how to set up a test environment.

Test against the hard cases first, not last. Generate patients that specifically exercise your agent's edge cases: complex comorbidity combinations, long medication histories, multi-year encounter timelines. If your agent handles a 68-year-old with 5 chronic conditions and 12 medications, it'll handle the simple cases too. The reverse is not true.

Measure accuracy on clinical density, not volume. 100 clinically realistic patients will reveal more bugs than 10,000 structurally valid but clinically empty ones. The metric that matters is whether your agent produces correct outputs on patients that look like production charts — not whether it can parse a large number of simple resources.

Build regression tests around the patients that broke your agent. When a complex patient reveals a bug, that patient becomes a permanent test fixture. Over time, your regression suite accumulates the clinical scenarios that actually matter for your specific agent.
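A minimal pattern for that, assuming fixtures live under a hypothetical `tests/fixtures/patients` directory: freeze the exact FHIR bundle that exposed the bug, then point a regression test at the saved file.

```python
import json
from pathlib import Path

FIXTURES = Path("tests/fixtures/patients")

def capture_regression_fixture(bundle: dict, name: str) -> Path:
    """Freeze the FHIR bundle that exposed a bug as a permanent test fixture."""
    FIXTURES.mkdir(parents=True, exist_ok=True)
    path = FIXTURES / f"{name}.json"
    # sort_keys keeps the file diff-stable across captures
    path.write_text(json.dumps(bundle, indent=2, sort_keys=True))
    return path

# When the 68-year-old T2D + CKD patient breaks the agent, freeze the chart.
# (Placeholder bundle here; in practice you'd save the real resource bundle.)
bundle = {"resourceType": "Bundle", "type": "collection", "entry": []}
path = capture_regression_fixture(bundle, "t2d_ckd_glp1_step_therapy")
print(path)
```

Each saved fixture then gets its own test asserting the agent's expected behavior on that chart, so the bug can never silently return.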

The agents that survive first contact with production data are the ones that were tested against data that looked like production before they got there.


mock.health generates FHIR patients from 4.4M real patient journeys — correlated comorbidities, longitudinal labs, medication progressions. The kind of clinical complexity your agent needs to practice on. Free tier →

