How 10,000 Synthetic Patients at mock.health Stack Up

We ran 10K patients through Census, CDC, and comorbidity benchmarks. 15/15 pairs correct. 18/20 prevalences in range. Honest numbers.

the nighthawk · 12 min read · 2026-03-23


We generated 10,000 synthetic patients and compared them against published population health benchmarks from the CDC, CMS, AHA, and US Census Bureau. No cherry-picking. No "representative examples." Every number below comes from the full 10,000-patient population with complete bundle analysis.

Most synthetic data generators claim "realistic" output. Few show their work. Here's ours — including the parts where we fall short.

Why Synthea's Architecture Falls Short

Synthea is the most widely used open-source synthetic patient generator, and it's a genuinely impressive project. It works by running JSON state machines — one per disease. There's a module for diabetes, a module for COPD, a module for hypertension, and so on. Each module independently decides whether a patient develops that condition based on hardcoded probabilities.

The problem is the word "independently."

In the real world, a 45-year-old doesn't accumulate diabetes, hypertension, and heart failure through three separate coin flips. These conditions form a metabolic cascade: obesity raises the risk of insulin resistance, which raises the risk of hypertension, which raises the risk of coronary disease, which raises the risk of heart failure. Synthea's architecture can express these dependencies — modules can check what other modules have done — but someone has to hand-author every pairwise interaction. With 25+ chronic conditions, that's hundreds of clinical relationships to manually encode and calibrate. Nobody has done it. In practice, most modules make their onset decisions alone.

The result: patients whose conditions are structurally valid but statistically wrong. You get diabetics without hypertension, 30-year-olds with dementia, and comorbidity correlations near zero where they should be strongly positive. The FHIR parses fine. A clinician would flag it in seconds. (This is the same problem with publicly available sandboxes — structurally valid but clinically empty.)
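The independence problem is easy to demonstrate. With per-condition coin flips, the joint prevalence of any two conditions is just the product of their marginals, so no comorbidity excess can appear. A minimal sketch (illustrative probabilities, not Synthea's actual code):

```python
import random

# Illustration, not Synthea's actual code: when two conditions are
# assigned by independent coin flips, their joint prevalence is just
# the product of the marginals, so no comorbidity structure can emerge.
random.seed(0)
N = 100_000
p_htn, p_dm = 0.47, 0.11  # rough adult prevalences for HTN and diabetes
patients = [(random.random() < p_htn, random.random() < p_dm) for _ in range(N)]

joint = sum(1 for h, d in patients if h and d) / N
# joint lands within sampling noise of p_htn * p_dm (~0.052);
# real cohorts show a clear positive excess instead.
```

Real hypertension/diabetes cohorts co-occur well above that product, which is exactly the structure independent modules cannot produce.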

How Our Markov Module Works

Instead of hand-authoring clinical rules, we learned them from data.

We estimated transition probabilities from 4.4 million patient journeys in the Medical Expenditure Panel Survey (MEPS), a nationally representative longitudinal survey of healthcare utilization conducted by AHRQ. The core idea is a Markov model: what happens to a patient next depends on their current state — age, sex, and which conditions are already active — not the full history of how they got there. This turns out to be a surprisingly good fit for chronic disease progression. (It's a simplification — real disease isn't purely Markovian — but the transition matrices learned from millions of real patient-months capture enough of the signal to produce realistic populations.)

The model tracks 25 chronic condition groups simultaneously through a shared bitmask. Every simulated month, it runs six learned sub-models. The two that matter most:

Chronic onset decides whether the patient develops new conditions, conditioned on what they already have. This is where comorbidity correlations come from. Obesity raises the onset probability for Type 2 diabetes. Type 2 diabetes raises the probability of coronary artery disease. CAD raises the probability of heart failure. We didn't author these rules — they fell out of what real patients in the MEPS actually experienced. Multiply 25 conditions × 12 age/sex strata × 8 risk factor combinations and you have a model that captures thousands of pairwise clinical interactions that nobody would hand-curate.

Condition-specific medications and procedures determine what each condition generates. A Type 2 diabetic in the 50–64 age group has an 80% monthly probability of an antidiabetic prescription and a 60% probability of a blood pressure medication. A CHF patient has a 45% monthly probability of an echocardiogram. A diabetic gets A1C monitoring. Each probability is condition-specific, age-stratified, and sex-stratified — all estimated from prescribing and procedure patterns in the MEPS.

The remaining sub-models handle acute events (ED visits, hospitalizations), utilization tiers (low utilizers vs. the top 5% "frequent flyer" pattern), and medication complexity escalation (patients accumulate drug classes over time the way real patients do — metformin first, then BP meds, then statins, then insulin). All learned, not authored.
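The monthly chronic-onset step can be sketched in a few lines. Everything below is an illustrative placeholder — the condition bits, baseline hazards, and risk multipliers are made up for the example, not the MEPS-estimated transition matrices:

```python
import random

# Sketch of the monthly onset step. Bits and probabilities are
# illustrative placeholders, not the MEPS-estimated transition matrices.
OBESITY, T2D, HTN, CAD, CHF = (1 << i for i in range(5))

# ONSET[condition] = (baseline monthly probability, {risk-factor bit: multiplier})
ONSET = {
    T2D: (0.0005, {OBESITY: 4.0}),
    HTN: (0.0010, {OBESITY: 2.5, T2D: 2.0}),
    CAD: (0.0003, {T2D: 2.5, HTN: 2.0}),
    CHF: (0.0001, {CAD: 5.0, HTN: 2.0}),
}

def month_step(state: int, rng: random.Random) -> int:
    """One simulated month: roll onset for each condition the patient
    lacks, scaling the baseline hazard by each active risk factor."""
    for cond, (base, boosts) in ONSET.items():
        if state & cond:
            continue  # condition already active; chronic states persist
        p = base
        for bit, mult in boosts.items():
            if state & bit:
                p *= mult
        if rng.random() < p:
            state |= cond
    return state

# Simulate one obese patient through 40 years of monthly steps.
rng = random.Random(7)
state = OBESITY
for _ in range(12 * 40):
    state = month_step(state, rng)
```

Because every onset roll reads the shared bitmask, comorbidity correlations emerge automatically: patients who develop T2D face an elevated HTN hazard from that point on.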

The Setup

We generated 10,000 patients using age/sex-stratified Markov models. The strata were designed to approximate US Census 2020 proportions:

Stratum Generated Census Delta
0–17 22.0% 22.0% 0.0pp
18–34 21.0% 21.2% −0.2pp
35–49 19.0% 19.5% −0.5pp
50–64 19.0% 19.4% −0.4pp
65–79 13.0% 12.7% +0.3pp
80+ 6.0% 5.2% +0.8pp

Maximum deviation: 0.8 percentage points. Sex split landed at 52.0% female / 47.9% male versus the Census 50.8/49.2 — a 1.2pp delta. Both are within the margin you'd expect from Synthea's base demographic engine and intentional rounding of generation quotas.
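The stratum deltas reduce to simple arithmetic over the table's figures:

```python
# Re-derive the per-stratum deltas and the maximum deviation from the
# table above (all values in percent; deltas in percentage points).
generated = {"0-17": 22.0, "18-34": 21.0, "35-49": 19.0,
             "50-64": 19.0, "65-79": 13.0, "80+": 6.0}
census = {"0-17": 22.0, "18-34": 21.2, "35-49": 19.5,
          "50-64": 19.4, "65-79": 12.7, "80+": 5.2}

deltas = {k: round(generated[k] - census[k], 1) for k in generated}
max_dev = max(abs(d) for d in deltas.values())
# max_dev == 0.8, in the 80+ stratum
```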

The whole run took 15 minutes on a single machine. No GPUs required for structural generation. Zero failed batches. 100% yield.

What 10,000 Patients Look Like

Each patient is a complete FHIR R4 Bundle — not a flat table with demographics, but a full clinical record. Across the population:

Over 818,000 encounters averaging 82 per patient (median: 66). Nearly 670,000 medication requests across 7,823 unique formulations. More than 8.1 million FHIR resources total, every one validating against US Core profiles.

The median patient has 9 coded conditions, 66 encounters spanning 40 years of clinical history, and 6 active prescriptions. The distribution is right-skewed, as it should be — 29% of patients carry the maximum 10 tracked conditions, reflecting the heavy disease burden in older age strata, while 9% have zero chronic conditions, reflecting the healthy pediatric and young adult population.

Condition Prevalence

This is where most synthetic data generators fall apart. It's easy to generate some hypertension. The question is whether the rate matches what CDC and CMS report in real populations.

We compared our 10,000-patient population (7,800 adults) against published adult prevalence rates:

Condition Synthetic Published Ratio Reference
Depressive disorder 9.0% 8.4% 1.07x NIMH Statistics: 8.4% major depressive episode in adults
Anxiety disorder 15.9% 19.2% 0.83x NIMH Statistics: 19.2% any anxiety disorder
Asthma 14.3% 8.0% 1.79x CDC NHIS 2022: 8.0% current asthma
Obesity 30.4% 42.0% 0.72x CDC NHANES 2017–2020: 42.0% of adults
Hypertension 34.6% 47.4% 0.73x CDC NHIS 2022: 47.4% of adults
CKD 11.0% 15.0% 0.73x CDC CKD Surveillance: 15.0% all stages
A-fib 2.9% 4.0% 0.73x AHA Heart Statistics 2023: 3–4% age-adjusted
Type 2 diabetes 7.1% 11.3% 0.63x CDC Diabetes Report 2022: 11.3% diagnosed
Prediabetes 23.5% 38.1% 0.62x CDC Diabetes Report 2022: 38.1% by lab criteria
Hyperlipidemia 20.4% 33.7% 0.61x CDC NHANES: 33.7% on lipid-lowering therapy
COPD 3.2% 4.7% 0.68x CDC NHIS 2022: 4.7% of adults
Coronary disease 3.8% 6.0% 0.63x AHA Heart Statistics 2023: 6.0% CHD
Chronic liver disease 2.7% 4.5% 0.60x AASLD: ~4.5% estimated prevalence
Cancer 3.2% 5.5% 0.58x NCI SEER: 5.5% cancer prevalence
Heart failure 1.2% 2.1% 0.57x AHA Heart Statistics 2023: 2.1%
Substance abuse 4.4% 7.9% 0.56x SAMHSA NSDUH 2022: 7.9% SUD past year
PAD 4.3% 6.5% 0.66x AHA Heart Statistics 2023: 6.5% age 40+
Stroke 1.6% 3.0% 0.53x AHA Heart Statistics 2023: 3.0% prevalence
Dementia 2.9% 6.7% 0.43x Alzheimer's Association 2023: 6.7% of adults 65+
Type 1 diabetes 1.0% 0.5% 2.08x CDC Diabetes Report 2022: ~0.5% Type 1

18 out of 20 conditions fall within 0.5–2.0x of published rates. Mean prevalence ratio across all conditions: 0.79x.

The systematic undershoot — most ratios sitting between 0.5x and 0.8x rather than clustering around 1.0x — is expected and has a clear explanation. Our transition matrices were estimated from coded encounter data in the MEPS: conditions that were diagnosed, documented, and billed. Published prevalence rates, especially for conditions like prediabetes (38.1%) and CKD (15.0%), include screening-detected cases that may never appear as a coded diagnosis in an actual EHR. A patient whose A1C is 5.8% has prediabetes by lab criteria, but their chart might never carry that ICD-10 code. Our model generates the latter, not the former.

The two outliers tell specific stories. Dementia at 0.43x reflects the difficulty of modeling a condition whose prevalence denominator is restricted to adults 65+ while our population includes all ages. Type 1 diabetes at 2.08x likely reflects coding ambiguity in the MEPS — some insulin-dependent Type 2 diabetics get coded under Type 1 ICD-10 codes, inflating the learned onset probabilities. (We're honestly not 100% sure about the T1D explanation — it could also be a training data artifact we haven't fully diagnosed.)

Depression at 1.07x is worth highlighting. Getting major depressive disorder within 7% of the NIMH reference rate — without any condition-specific tuning for that particular diagnosis — is a strong signal that the Markov model's learned onset probabilities are capturing real epidemiological patterns, not just noise.
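The headline statistics can be re-derived from the table's rounded percentages. One caveat: the table's published ratios were computed from unrounded values, so from the rounded figures T1D comes out at exactly 2.0x rather than 2.08x:

```python
# (synthetic %, published %) pairs from the prevalence table above.
PREVALENCE = {
    "depression": (9.0, 8.4),     "anxiety": (15.9, 19.2),
    "asthma": (14.3, 8.0),        "obesity": (30.4, 42.0),
    "hypertension": (34.6, 47.4), "ckd": (11.0, 15.0),
    "afib": (2.9, 4.0),           "t2d": (7.1, 11.3),
    "prediabetes": (23.5, 38.1),  "hyperlipidemia": (20.4, 33.7),
    "copd": (3.2, 4.7),           "cad": (3.8, 6.0),
    "liver_disease": (2.7, 4.5),  "cancer": (3.2, 5.5),
    "heart_failure": (1.2, 2.1),  "substance_abuse": (4.4, 7.9),
    "pad": (4.3, 6.5),            "stroke": (1.6, 3.0),
    "dementia": (2.9, 6.7),       "t1d": (1.0, 0.5),
}
ratios = {k: s / p for k, (s, p) in PREVALENCE.items()}
mean_ratio = sum(ratios.values()) / len(ratios)  # ~0.79
lowest = min(ratios, key=ratios.get)             # "dementia"
highest = max(ratios, key=ratios.get)            # "t1d"
```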

The Comorbidity Test

Getting individual prevalence rates right is necessary but not sufficient. The real test: do conditions show up together the way they do in actual patients? A hypertensive diabetic with CKD is one patient, not three independent coin flips.

We tested 15 clinically established comorbidity pairs using phi coefficients (φ), a measure of statistical association between binary variables. Each pair was selected because the positive correlation is well-documented in clinical literature:

Pair φ Reference
Hypertension × CKD +0.328 KDIGO 2021 Clinical Practice Guideline: hypertension present in 67–92% of CKD patients
Obesity × Hypertension +0.254 AHA Scientific Statement 2021: obesity accounts for 65–78% of primary hypertension
Depression × Anxiety +0.250 NIMH Comorbidity: ~60% of those with major depression also meet criteria for an anxiety disorder
Hypertension × Type 2 diabetes +0.169 ADA Standards of Care 2023: hypertension affects ~75% of adults with diabetes
Hypertension × Coronary disease +0.148 AHA Heart Disease Statistics 2023: hypertension is the leading modifiable risk factor for CHD
Obesity × Type 2 diabetes +0.123 CDC Diabetes Report 2022: ~89% of adults with diabetes are overweight or obese
Type 2 diabetes × CKD +0.102 USRDS 2022 Annual Data Report: diabetes is the leading cause of CKD, accounting for ~38% of ESKD
CHF × Atrial fibrillation +0.083 Framingham Heart Study: AF prevalence ~25–50% in heart failure populations
Depression × Substance abuse +0.077 SAMHSA 2022 NSDUH: 37.9% of adults with SUD had a concurrent mental illness
Heart failure × Coronary disease +0.076 AHA 2023: ischemic heart disease is the etiology in ~50% of HF cases
PAD × Coronary disease +0.053 PARTNERS Study: PAD patients have 2–6x elevated risk of MI and coronary death
Coronary disease × Stroke +0.034 AHA 2023: CHD and stroke share atherosclerotic etiology and risk factors
Chronic liver disease × Substance abuse +0.029 AASLD Practice Guidance 2023: alcohol-associated liver disease accounts for ~50% of cirrhosis deaths in the US
COPD × Coronary disease +0.022 GOLD Report 2023: cardiovascular disease is the leading cause of death in mild-moderate COPD
Dementia × Stroke +0.013 Lancet Commission on Dementia 2020: stroke approximately doubles the risk of subsequent dementia

15 out of 15 pairs show the correct positive correlation direction. The strongest associations — hypertension/CKD (+0.328), obesity/hypertension (+0.254), depression/anxiety (+0.250) — land where the literature says they should be.

This falls out of the Markov architecture. The model tracks all active conditions through a shared bitmask, so the conditional probability of developing CKD given existing hypertension reflects the actual statistical relationship from MEPS patient journeys. We didn't hand-tune correlation targets — they're emergent properties of a model trained on longitudinal encounter data.
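For reference, the phi coefficient reduces to a determinant over the 2×2 contingency table of the two conditions. A minimal implementation (the example counts are made up for illustration, not taken from the 10K population):

```python
def phi_coefficient(n11: int, n10: int, n01: int, n00: int) -> float:
    """Phi coefficient from a 2x2 contingency table:
    n11 = both conditions present, n10/n01 = one only, n00 = neither."""
    num = n11 * n00 - n10 * n01
    den = ((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)) ** 0.5
    return num / den if den else 0.0

# Perfect co-occurrence gives +1; the made-up counts below give a
# strong positive association.
assert phi_coefficient(50, 0, 0, 50) == 1.0
print(round(phi_coefficient(600, 200, 400, 6800), 2))  # 0.63
```

Note that phi is bounded well below 1.0 when the two conditions have very different prevalences, which is why even strong clinical relationships like hypertension/CKD land around +0.3 rather than near 1.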

Disease Onset Ages

Temporal consistency is the axis that breaks most synthetic generators. A 20-year-old with dementia or a 5-year-old with COPD might pass structural validation but would fail clinical review immediately.

We measured median onset ages across the full 10,000-patient population against expected clinical ranges derived from epidemiological literature:

Condition Median Onset Expected Range Reference
Asthma 25 5–30 CDC NHIS 2022: prevalence peaks in childhood; ~50% of cases onset before age 12
Anxiety 25 15–40 NIMH Statistics: median age of onset 11 for phobias, 21–35 for GAD/panic
Obesity 28 25–55 CDC NHANES 2017–2020: prevalence rises sharply from age 20, peaks 40–59
Depression 29 20–45 NIMH Statistics: highest prevalence in 18–25 age group; median onset mid-20s
Prediabetes 36 35–60 CDC Diabetes Report 2022: prevalence increases from age 35, peaks 45–64
Hypertension 39 35–65 AHA Heart Statistics 2023: prevalence doubles between ages 35–44 and 45–54
Hyperlipidemia 41 35–65 CDC NHANES: elevated cholesterol prevalence rises from ~12% at 20–39 to ~40% at 40–59
Type 2 diabetes 48 40–70 ADA Standards of Care 2023: screening recommended from age 35; peak incidence 45–64
CKD 50 50–75 USRDS 2022 ADR: CKD prevalence ~6% at age 40–59, ~25% at 60–69, ~35% at 70+
COPD 55 45–75 GOLD Report 2023: typically diagnosed after age 40; prevalence peaks 65–74
Coronary disease 57 45–75 AHA 2023: CHD prevalence 1.3% at 20–39, rising to 19.1% at 60–79
A-fib 61 55–80 Framingham Heart Study: AF prevalence ~0.5% at 50–59, ~9% at 80–89
CHF 65 55–80 AHA 2023: HF prevalence ~1% at 40–59, ~6–10% at 60+

13 out of 14 tested conditions have median onset ages within expected clinical ranges. The progression from childhood asthma → young adult anxiety/depression → middle-age metabolic disease → late-life cardiac/neurological conditions matches textbook epidemiology.

The one miss is dementia, whose median onset age of 57 falls below the expected 65–90 range. This is a known artifact of boosting dementia's onset coefficient to bring its prevalence closer to the 6.7% reference rate — higher onset probability across broader age ranges pulls the median down. It's a tradeoff we chose intentionally: better prevalence accuracy at the cost of onset age precision for one condition.
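The check itself is straightforward: take each condition's onset ages across the population, compute the median, and compare it to the expected clinical window. A sketch with made-up onset samples (ranges from the table above):

```python
import statistics

# Expected clinical onset windows (subset of the table above).
EXPECTED_RANGE = {"hypertension": (35, 65), "copd": (45, 75), "dementia": (65, 90)}

def onset_in_range(condition: str, onset_ages: list[int]) -> bool:
    """True if the median onset age falls inside the expected window."""
    lo, hi = EXPECTED_RANGE[condition]
    return lo <= statistics.median(onset_ages) <= hi

# Made-up samples: hypertension passes, dementia at median 57 fails,
# mirroring the one miss described above.
assert onset_in_range("hypertension", [31, 38, 39, 44, 52])  # median 39
assert not onset_in_range("dementia", [48, 55, 57, 60, 71])  # median 57
```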

Encounter Patterns

Annualized encounter rates from the full population:

Metric Synthetic Published Reference
All visits/person/year 2.3 3.5 CDC/NCHS NAMCS 2019: 880.5M office visits ÷ 252M adults ≈ 3.5/person/year
ED visits/1,000/year 200 430 CDC NHAMCS 2021: 139M ED visits ÷ 332M population ≈ 430/1,000/year
Inpatient stays/1,000/year 44 104 HCUP NIS 2019: 34.4M weighted discharges ÷ 330M population ≈ 104/1,000/year

The ambulatory visit rate (2.3 vs 3.5) is in the right neighborhood. ED visits are undercounted (200 vs 430/1,000) and inpatient stays run at about 40% of published rates.

Both gaps trace to the same thing. The Markov model was trained on chronic disease management trajectories — longitudinal records of patients being managed for specific conditions. These capture regular follow-ups and hospitalizations for acute exacerbations. What they don't capture are standalone ED visits for injuries, acute infections, poisonings, and social/behavioral crises that have no longitudinal trajectory. A 22-year-old who visits the ED for a broken wrist and never comes back doesn't generate a trajectory. Neither does a 35-year-old with an anxiety attack who gets discharged after four hours. (This is the kind of gap that's easy to describe and genuinely hard to fix well.)

For use cases focused on chronic disease modeling, clinical trials simulation, or EHR system testing, the encounter distribution is fit-for-purpose. For population health analytics that depend on accurate ED utilization, it would need calibration — most likely by adding a separate acute event layer that generates non-trajectory ED visits at age/sex-appropriate rates.
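One plausible shape for that calibration layer — entirely hypothetical, with placeholder rates rather than anything fitted to NHAMCS:

```python
import random

# Hypothetical acute-event layer: standalone ED visits sampled at
# age-banded annual rates, independent of any chronic trajectory.
# The rates below are illustrative placeholders, not fitted values.
ACUTE_ED_RATE = {  # extra ED visits per person-year, by age band
    (0, 17): 0.25, (18, 34): 0.28, (35, 49): 0.22,
    (50, 64): 0.20, (65, 120): 0.30,
}

def extra_ed_visits(age: int, years: int, rng: random.Random) -> int:
    """Sample non-trajectory ED visits over `years`, approximating a
    Poisson process with one Bernoulli draw per simulated month."""
    rate = next(r for (lo, hi), r in ACUTE_ED_RATE.items() if lo <= age <= hi)
    p_month = rate / 12
    return sum(1 for _ in range(years * 12) if rng.random() < p_month)

rng = random.Random(3)
mean_per_year = sum(extra_ed_visits(30, 10, rng) for _ in range(1000)) / 10_000
# mean lands near the configured 0.28 visits/person-year for ages 18-34
```

Because these visits never touch the chronic bitmask, the layer could be tuned against the NHAMCS benchmark without disturbing the comorbidity and prevalence results above.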

Why This Matters

If you're building an EHR integration, a clinical decision support tool, a population health dashboard, or a FHIR-based analytics pipeline, you need test data. Not five hand-crafted patients — thousands of them, with realistic disease burdens, correlated comorbidities, and plausible clinical timelines. The alternative is waiting months for a de-identified dataset that arrives with half the fields stripped and a BAA that took longer to negotiate than the software took to build.

mock.health provides complete FHIR R4 bundles — conditions, encounters, medications, procedures, labs, imaging studies, clinical notes — that pass US Core validation and hold up under the kind of statistical scrutiny we just walked through.

We publish these numbers because the bar for synthetic data should be higher than "it parses." If you're evaluating synthetic data for your team, ask the vendor to show you their comorbidity correlations and prevalence ratios against published benchmarks. Those are the numbers that matter.

Explore the data or get in touch.


Full methodology is reproducible. The 10K population is available via the mock.health API.

