Why the Best and Worst Cardiac AI Look Identical in a Pitch Deck
This is the final installment in our HRSTV2026 blog series, written by Pacemate's Head of AI, Sean Shoffstall. We close with a practical guide for EP physicians and nurses on how to see past the pitch deck and evaluate cardiac AI claims with the right questions.
One of the sharpest moments of our HRS2026 panel came when Dr. Jonathan Piccini drew a contrast that should make every EP buyer pause. On one hand, he said, some of the AI prediction work on the ECG, such as identifying atrial fibrillation, has reached genuinely impressive discrimination, up in the high 90s. On the other hand, AI prediction of hospitalization is, in his words, a hot mess that he doesn’t even know how to begin using.
Same category. “Cardiac AI.” Wildly different reality. And here’s the uncomfortable part: from a pitch deck, you usually can’t tell which one you’re looking at. Both show you a confident number and a clean chart. The gap between the best and the worst is enormous, and it’s nearly invisible at the point of sale.
So I want to give EP physicians and nurses a practical way to see through the deck — because the single number on the slide is almost never enough.
Start with what the number on the slide usually is: discrimination, often reported as an AUC or C-index. It answers one narrow question: how well does the model separate the patients who will have the event from those who won’t? An AUC of 0.99 sounds like near-certainty, but it is a ranking score, not the probability that any one prediction is correct. And discrimination alone tells you nothing about whether the model holds up outside the lab.
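To make "discrimination" concrete, here is a minimal sketch of what an AUC measures: across every pairing of a patient who had the event with a patient who didn't, how often did the model give the event patient the higher score? The scores and outcomes below are made up for illustration, and the function name `auc` is my own, not any vendor's or library's.

```python
def auc(scores, labels):
    """Fraction of (event, non-event) patient pairs in which the event
    patient received the higher score. Ties count as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one event and one non-event")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical risk scores for eight patients (1 = had the event).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 0, 0, 0]

example_auc = auc(scores, labels)
print(f"AUC = {example_auc:.3f}")  # 14 of 15 pairs ranked correctly
```

Notice what the calculation never looks at: whether a score of 0.9 means a 90% chance of the event. Only the ordering matters, which is why a strong AUC can coexist with badly wrong probabilities.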
Dr. Steinberg gave a humbling example. Predicting potassium from the ECG — an outcome we know is physiologically linked to the tracing — turned out to be far harder than expected, even with sample sizes he’d have thought more than sufficient. And it surfaced a brutal question from Dr. Piccini that has no textbook answer: how many false positives will you tolerate to catch the one patient who’s dangerously hyperkalemic? There’s no chapter in Harrison’s for that trade-off.
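That trade-off can at least be put into numbers. The sketch below is plain arithmetic, not anything from the panel: given a model's sensitivity and specificity and how common the condition is in your population, it estimates how many false alarms you accept for each true case caught. The 90% sensitivity, 90% specificity, and 2% prevalence are hypothetical values chosen only to show the effect.

```python
def false_positives_per_true_positive(sensitivity, specificity, prevalence):
    """Expected false alarms for every true case flagged."""
    true_pos_rate = sensitivity * prevalence                 # share of all patients correctly flagged
    false_pos_rate = (1 - specificity) * (1 - prevalence)    # share of all patients wrongly flagged
    if true_pos_rate == 0:
        raise ValueError("model catches no true cases at these settings")
    return false_pos_rate / true_pos_rate


# Hypothetical values, for illustration only.
ratio = false_positives_per_true_positive(
    sensitivity=0.90, specificity=0.90, prevalence=0.02
)
print(f"About {ratio:.1f} false positives per true positive")
```

With these assumed inputs, a model that sounds excellent still produces more than five false alarms for each real case, simply because the condition is rare. Whether that is tolerable is a clinical judgment, not a statistical one.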
The lesson both physicians kept returning to is that a model is only as good as the population it was trained on and the population you’re pointing it at. A model can post a beautiful number in one cohort and quietly fall apart in another that doesn’t look like the first.
There’s a striking illustration of this in the literature. A Circulation study trained an ECG-based deep learning model to predict five-year atrial fibrillation risk and asked a question most vendors don’t: does the AI actually beat the simple clinical risk factors we already have? The honest answer was nuanced — the model was useful and generalizable across samples, but the greatest predictive accuracy came from combining AI with clinical risk factors, not from the AI alone. The pure black box wasn’t the winner. The hybrid was.
So here’s the buyer’s checklist I’d carry into every vendor conversation. When you see the headline number, ask four questions the deck won’t volunteer:
- Power. How big was the cohort, and is the sample actually large enough to support this claim?
- External validation. Has this been tested at sites other than the one that built it?
- Calibration. Not just whether the model ranks patients correctly, but whether its predicted probabilities match what actually happens. Of the patients it assigns a 20% risk, do roughly 20% go on to have the event?
- Population. Who was it trained on, and do my patients look like them?
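The calibration question is the easiest of the four to check yourself if a vendor will share predictions alongside outcomes. Here is a minimal sketch, assuming you have a list of predicted probabilities and the observed results: group patients into risk bins and compare the average predicted risk with the observed event rate in each. The data are invented, and `calibration_table` is a hypothetical helper, not a function from any particular library.

```python
def calibration_table(probs, outcomes, n_bins=5):
    """Group patients into equal-width risk bins and return, for each
    non-empty bin, (mean predicted risk, observed event rate, count)."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # keep p == 1.0 in the top bin
        bins[idx].append((p, y))
    rows = []
    for b in bins:
        if b:
            mean_pred = sum(p for p, _ in b) / len(b)
            observed = sum(y for _, y in b) / len(b)
            rows.append((mean_pred, observed, len(b)))
    return rows


# Hypothetical cohort: ten patients told "90% risk", ten told "10% risk".
probs = [0.9] * 10 + [0.1] * 10
outcomes = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0] + [0] * 10

table = calibration_table(probs, outcomes)
for mean_pred, observed, count in table:
    print(f"predicted {mean_pred:.0%}  observed {observed:.0%}  (n={count})")
```

In this made-up cohort the high-risk group really is higher risk, so the model still sorts patients in a useful direction. But it tells them 90% when the observed rate is 30%. A well-calibrated model shows predicted and observed values close together in every bin, and it should do so in your population, not only in the one it was built on.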
A vendor with a genuinely strong tool will welcome those questions, because the answers are their advantage. A vendor counting on the headline number to do the work will get uncomfortable. That discomfort is the signal.
The pitch deck shows you one number. Demand the other four.
Watch the full HRS-TV panel with Dr. Steinberg and Dr. Piccini