Your Clinical AI Agent Needs More Than 5 Patients

Your prior auth agent works in testing. Then it meets a 68-year-old with CKD, hypertension, and a specialist referral — and crashes.

mock.health · 9 min read · 2026-04-08


Your prior authorization agent works. You've tested it against 5 synthetic patients. It reads the patient's conditions, checks the payer's coverage rules, and submits the auth request. Clean approvals every time. Ship it.

Then it meets a 68-year-old with type 2 diabetes, CKD stage 3, hypertension, hyperlipidemia, and a pending nephrology referral. The payer's rules require step therapy documentation for the GLP-1 agonist. The agent has never seen a step therapy requirement because none of your test patients had enough medication history to trigger one.

The auth request is denied. Your agent doesn't know what to do with a denial because it's never received one. The clinician finds out three days later when the patient calls asking why their medication wasn't filled.

This is the shallow-data problem: simple patients don't get denied. And if your test data only contains simple patients, your agent has never practiced handling the hard cases — which are most of the cases that actually matter.

AI Is Now a Standalone Reason to Adopt FHIR

Darren Devitt's FHIR Architecture Decisions notes that 2025 "marked the point where AI use became a common standalone reason for adopting FHIR." He gives it one paragraph and moves on. That single paragraph undersells what's actually happening.

A new category of software is emerging: AI agents that consume FHIR data as their primary input. Prior authorization agents read patient records to build auth submissions. Care coordination platforms ingest longitudinal histories to generate care plans. Clinical decision support systems analyze conditions, medications, and lab trends to surface recommendations. Denial management agents parse EOBs and remittance data to draft appeals.

These agents don't just read FHIR resources — they reason over them. They make decisions based on clinical relationships between resources: the connection between a declining eGFR trend and a CKD diagnosis, the progression from metformin to insulin in a diabetic patient, the gap between a specialist referral and the follow-up encounter that should have happened.

The test data requirements for these agents are fundamentally different from traditional FHIR integration testing. A patient portal needs to display resources correctly. An AI agent needs to reason over them correctly. The difference shows up in what breaks when the data is too simple.

What Breaks with Shallow Data

Prior authorization agents

A prior auth agent reads patient history, checks payer rules, and submits requests. The rules are conditional: step therapy requirements depend on medication history, medical necessity depends on documented comorbidities, specialist referrals require documented primary care visits.

With simple test patients (one condition, one medication, one encounter), every auth request is straightforward. The agent learns the happy path: read condition, find matching rule, submit. Approved.

With realistic patients, the conditional rules actually fire: step therapy requirements trigger on medication history, medical necessity checks pull in documented comorbidities, and some requests come back denied. The agent has to handle denials and requests for additional documentation, not just the happy path.
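The conditional nature of those rules is easy to sketch. Below is a minimal step-therapy check in Python; the rule table and the flattened medication history are hypothetical stand-ins for real payer rules and FHIR MedicationRequest resources:

```python
# Hypothetical, simplified payer rule: a GLP-1 agonist requires a documented
# trial of metformin (step therapy) before it can be approved.
STEP_THERAPY = {"GLP-1 agonist": ["metformin"]}

def check_step_therapy(requested_drug: str, med_history: list[dict]) -> dict:
    """Return an auth decision for one drug against one patient's history.

    med_history items look like {"drug": "metformin"} — a flattened
    stand-in for FHIR MedicationRequest resources.
    """
    required = STEP_THERAPY.get(requested_drug, [])
    tried = {m["drug"] for m in med_history}
    missing = [drug for drug in required if drug not in tried]
    if missing:
        return {"status": "denied",
                "reason": "step therapy: no documented trial of " + ", ".join(missing)}
    return {"status": "approved"}

# A shallow test patient with no medication history sails through, but only
# because the requested drug happens to have no step-therapy rule attached:
print(check_step_therapy("lisinopril", []))      # {'status': 'approved'}

# A realistic request exercises the denial path the shallow data never reaches:
print(check_step_therapy("GLP-1 agonist", [])["status"])   # denied
```

An agent tested only against the first call never learns that the second outcome exists.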

Care coordination agents

A care coordination agent identifies gaps in care: missed follow-ups, overdue screenings, medication conflicts, unaddressed referrals.

With a single-encounter patient, there are no gaps to find. The agent reports "no issues" and you conclude it works.

With a multi-year patient history, the gaps are there to be found: missed follow-ups, overdue screenings, conflicting medications, a specialist referral with no corresponding encounter. If the agent still reports "no issues" against that history, you've found a real bug — one a single-encounter patient could never surface.
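One of those gap checks — a referral with no follow-up encounter — can be sketched in a few lines. The dicts below are hypothetical flattened stand-ins for FHIR ServiceRequest and Encounter resources, and the 90-day window is an assumed policy:

```python
from datetime import date, timedelta

def find_unresolved_referrals(referrals, encounters, window_days=90):
    """Flag referrals with no encounter of the referred specialty within the window."""
    gaps = []
    for ref in referrals:
        deadline = ref["date"] + timedelta(days=window_days)
        followed_up = any(
            enc["specialty"] == ref["specialty"] and ref["date"] <= enc["date"] <= deadline
            for enc in encounters
        )
        if not followed_up:
            gaps.append(ref)
    return gaps

referrals = [{"specialty": "nephrology", "date": date(2025, 1, 10)}]
encounters = [{"specialty": "primary care", "date": date(2025, 2, 1)}]

# The nephrology referral is flagged: no nephrology encounter within 90 days.
print(find_unresolved_referrals(referrals, encounters))
```

With a single-encounter test patient, `referrals` is empty and this logic is never executed — which is exactly how the "no issues" false positive survives testing.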

Clinical decision support

A CDS agent analyzes patient data and surfaces recommendations: adjust medication dosage based on lab trends, flag high-risk patients for proactive outreach, suggest diagnostic workups based on symptom patterns.

With clinically empty patients, the agent has nothing to analyze. No trends, no patterns, no risk factors.

With realistic patients, there is signal to work with: a declining eGFR trend alongside a CKD diagnosis, a medication progression from metformin to insulin, a comorbidity cluster that should raise a risk score. Whether the agent surfaces the right recommendations from that signal is the test that actually matters.
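The trend analysis itself is not exotic. A minimal least-squares slope over longitudinal eGFR values — a stand-in for a series of FHIR Observation resources — shows why a single-encounter patient gives a CDS agent nothing to do:

```python
def egfr_slope(values_by_year: list[tuple[int, float]]) -> float:
    """Least-squares slope of eGFR per year (negative = declining renal function)."""
    n = len(values_by_year)
    mean_x = sum(x for x, _ in values_by_year) / n
    mean_y = sum(y for _, y in values_by_year) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in values_by_year)
    den = sum((x - mean_x) ** 2 for x, _ in values_by_year)
    return num / den

# A single-encounter patient has one lab value: no trend, nothing to flag.
# A three-year history reveals a decline consistent with progressing CKD:
slope = egfr_slope([(2023, 62.0), (2024, 55.0), (2025, 48.0)])
print(round(slope, 1))  # -7.0 (mL/min/1.73m² per year)
```

A test population without multi-year labs leaves this entire code path, and every recommendation downstream of it, unexercised.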

What "Enough" Test Data Looks Like

The minimum depends on what your agent does. But the principle is the same: your test population needs to contain the clinical scenarios your agent will encounter in production.

Volume

Five patients isn't enough for any agent. You need enough patients to cover the distribution of cases your agent will handle:

For a prior auth agent, we'd recommend at minimum 50-100 patients spanning the condition categories your payer rules cover. For a CDS agent analyzing lab trends, you need patients with 3+ years of longitudinal data.

Clinical density

Each patient needs enough clinical depth to exercise your agent's reasoning:

| Agent type | What the data needs |
| --- | --- |
| Prior auth | Medication history with start/stop dates, condition-medication correlations, prior auth requests and outcomes |
| Care coordination | Multi-year encounters, referral chains (ServiceRequest → Encounter), screening schedules |
| Clinical decision support | Longitudinal labs with reference ranges and interpretation flags, trending vital signs, comorbidity clusters |
| Denial management | EOB resources with denial reason codes, remittance data, appeal documentation |

Comorbidity correlation

This is the one most people miss. Real patients don't have random conditions. A 65-year-old with type 2 diabetes is likely to also have hypertension (75% co-occurrence), hyperlipidemia (70%), and some degree of CKD (30-40%). These conditions travel together because they share pathophysiology.

If your test patients have conditions assigned randomly — diabetes on Patient A, hypertension on Patient B, CKD on Patient C, never together — your agent never sees the comorbidity patterns it will encounter in production. The prior auth rules that trigger on condition combinations go untested. The CDS risk stratification that depends on comorbidity clusters produces meaningless scores.
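Correlated condition assignment is cheap to implement. Here's a toy sampler using the co-occurrence rates quoted above as conditional probabilities; a real pipeline would derive these from population statistics by age and sex rather than hard-coding them:

```python
import random

# Illustrative conditional probabilities given type 2 diabetes (rates from the
# co-occurrence figures above; hard-coded here purely for the sketch).
P_GIVEN_T2D = {"hypertension": 0.75, "hyperlipidemia": 0.70, "ckd_stage_3": 0.35}

def sample_comorbidities(has_t2d: bool, rng: random.Random) -> set[str]:
    """Sample a condition set where comorbidities travel together, not independently."""
    conditions = set()
    if has_t2d:
        conditions.add("type_2_diabetes")
        for cond, p in P_GIVEN_T2D.items():
            if rng.random() < p:
                conditions.add(cond)
    return conditions

rng = random.Random(42)
cohort = [sample_comorbidities(True, rng) for _ in range(1000)]
htn_rate = sum("hypertension" in c for c in cohort) / len(cohort)
print(f"hypertension co-occurrence in diabetic cohort: {htn_rate:.0%}")  # ≈ 75%
```

The point is the joint distribution: assigning each condition to a different patient independently produces a cohort where the condition-combination code paths in your agent never run.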

We built this into our data generation pipeline. Patient conditions are derived from 4.4 million real CMS patient journeys — the comorbidity patterns, medication progressions, and encounter frequencies match real population statistics by age and sex. A 68-year-old diabetic in our data has the same probability of concurrent CKD and hypertension as a 68-year-old diabetic in a Medicare claims database.

The Feedback Loop Problem

There's a subtler issue beyond missing test scenarios: agents trained or evaluated on shallow data develop false confidence.

If your agent achieves 98% accuracy on simple patients, you might conclude it's production-ready. But that 98% reflects the difficulty of the test set, not the capability of the agent. Simple patients are easy cases. The agent hasn't been tested on the cases where clinical judgment matters — the ambiguous ones, the multi-condition ones, the ones where the right answer depends on context that isn't in a single resource.

This is the healthcare version of a well-known ML problem: evaluating a model on data that doesn't represent the production distribution. In healthcare, the consequences are specific: missed denials, delayed care, incorrect risk scores, and clinicians who stop trusting the tool after it fails on the first complex patient they test it on.
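One cheap guard against this false confidence is to stratify evaluation metrics by patient complexity instead of reporting a single accuracy number. A sketch, assuming a simple per-patient evaluation log (the record shape and the one-condition threshold are both assumptions for illustration):

```python
from collections import defaultdict

def accuracy_by_complexity(results: list[dict]) -> dict[str, float]:
    """Bucket accuracy by patient complexity.

    results items look like {"n_conditions": int, "correct": bool}.
    """
    buckets = defaultdict(list)
    for r in results:
        bucket = "simple" if r["n_conditions"] <= 1 else "complex"
        buckets[bucket].append(r["correct"])
    return {b: sum(v) / len(v) for b, v in buckets.items()}

# A test set dominated by simple patients: the headline number hides the failure.
results = (
    [{"n_conditions": 1, "correct": True}] * 98 + [{"n_conditions": 1, "correct": False}] * 2
    + [{"n_conditions": 4, "correct": True}] * 6 + [{"n_conditions": 4, "correct": False}] * 4
)
print(accuracy_by_complexity(results))  # {'simple': 0.98, 'complex': 0.6}
```

If the "complex" bucket is empty or tiny, that absence is itself the finding: the evaluation set doesn't cover the production distribution.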

The fix isn't more sophisticated AI. It's more representative test data.

What We'd Recommend

If you're building a clinical AI agent that consumes FHIR data, you've probably already got a handful of Synthea patients loaded into a local HAPI instance. Maybe you wrote a script to generate 20 bundles and it's been good enough so far. That's how every team starts — and at some point, the in-house test data becomes its own maintenance burden: configuring Synthea modules, tuning parameters, adding conditions your agent needs to handle, keeping it all in sync as your product evolves. The test data pipeline quietly becomes a second project.

Here's what we'd suggest instead:

Start with realistic synthetic data, not production PHI. You can iterate faster without IRB approvals, BAAs, and de-identification pipelines. Synthetic data with realistic clinical distributions lets you build and test your agent's reasoning before you have production access. Here's how to set up a test environment.

Test against the hard cases first, not last. Generate patients that specifically exercise your agent's edge cases: complex comorbidity combinations, long medication histories, multi-year encounter timelines. If your agent handles a 68-year-old with 5 chronic conditions and 12 medications, it'll handle the simple cases too. The reverse is not true.

Measure accuracy on clinical density, not volume. 100 clinically realistic patients will reveal more bugs than 10,000 structurally valid but clinically empty ones. The metric that matters is whether your agent produces correct outputs on patients that look like production charts — not whether it can parse a large number of simple resources.

Build regression tests around the patients that broke your agent. When a complex patient reveals a bug, that patient becomes a permanent test fixture. Over time, your regression suite accumulates the clinical scenarios that actually matter for your specific agent.
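A minimal pattern for that, assuming fixtures live under a hypothetical `tests/fixtures/patients` directory: freeze the exact FHIR bundle that exposed the bug, then point a regression test at the saved file.

```python
import json
from pathlib import Path

FIXTURES = Path("tests/fixtures/patients")

def capture_regression_fixture(bundle: dict, name: str) -> Path:
    """Freeze the FHIR bundle that exposed a bug as a permanent test fixture."""
    FIXTURES.mkdir(parents=True, exist_ok=True)
    path = FIXTURES / f"{name}.json"
    # sort_keys keeps the file diff-stable across captures
    path.write_text(json.dumps(bundle, indent=2, sort_keys=True))
    return path

# When the 68-year-old T2D + CKD patient breaks the agent, freeze the chart.
# (Placeholder bundle here; in practice you'd save the real resource bundle.)
bundle = {"resourceType": "Bundle", "type": "collection", "entry": []}
path = capture_regression_fixture(bundle, "t2d_ckd_glp1_step_therapy")
print(path)
```

Each saved fixture then gets its own test asserting the agent's expected behavior on that chart, so the bug can never silently return.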

The agents that survive first contact with production data are the ones that were tested against data that looked like production before they got there.


mock.health generates FHIR patients from 4.4M real patient journeys — correlated comorbidities, longitudinal labs, medication progressions. The kind of clinical complexity your agent needs to practice on. Free tier →

