Evidence over explanations: put medical AI to the test – npj Artificial Intelligence

Medical AI often optimizes end-to-end performance while offering post-hoc explanations, widening an epistemic gap between behavior and justified clinical claims. Interpretable models and XAI help, but are insufficient. We advocate governance by testability: preregistered empirical trials of causal alignment (avoiding shortcuts) and invariance across scanners, sites, and subgroups, plus monitoring and rollback. We propose Institutional AI—auditable provenance, prospective validation, external audits, and analogous adversarial evaluations for LLMs.

A persistent friction in the deployment of artificial intelligence (AI) in medicine is the mismatch between how modern systems are built and how medicine draws reliable inferences. Clinical research advances by articulated steps that mirror the canonical scientific method: posing questions, proposing hypotheses, designing experiments, analyzing results, and communicating successes and failures with bounded claims. Contemporary medical AI, spanning imaging, clinical prediction/risk models, decision support, and, increasingly, generative systems such as large language models (LLMs), compresses this sequence. It optimizes end-to-end performance and offers explanations only post hoc, creating what we define as the “epistemic gap” between what a system does and what we can justifiably conclude about why it works, for whom, and under which conditions1.

As the terminology in this area is inconsistent across communities, we use intrinsic (ante-hoc) interpretability to denote models whose decision logic is transparent by design, such that the mapping from inputs to outputs can be directly inspected at the level of the model form, and post-hoc explainability to denote methods applied after training to produce explanatory artifacts about an otherwise opaque model’s behavior, such as saliency/attribution maps, feature importance, counterfactuals, and surrogate models2.

The limits of explanation are well documented. Strong critiques have argued that post-hoc explanations can distract from a more fundamental requirement: if decisions are consequential, use models whose workings are intrinsically interpretable or, at a minimum, whose claims can be put on trial through rigorous tests2. Health-care-specific analyses have highlighted the false hope of current XAI practices, noting that popular saliency and attribution methods can fail basic sanity checks, change under small input perturbations, or remain unchanged even when the model is randomized3,4. In radiology, quantitative evaluations in mammography confirm that saliency maps can be inconsistent and poorly calibrated to clinically relevant locations5, while broader discussions emphasize that explanations are heterogeneous objects: how, why, and when to explain are distinct questions that technical toolkits often conflate6. Systematic reviews suggest that, despite widespread invocation, XAI is rarely used in a way that meaningfully supports clinical decision-making or safety claims3. These findings do not delegitimize explanation, but they restrict what explanatory artifacts can credibly guarantee7,8,9.

What must be guaranteed in medicine is not introspective access to a model’s internal states but grounds for believing that its outputs rest on meaningful signals and will remain reliable when the environment changes. Two dimensions are central.

First, causal alignment. Are predictions driven by features plausibly related to disease biology or clinically relevant intermediates? In imaging, this means distinguishing decision rules based on lesion morphology or tissue texture from rules keyed to scanner logos, image borders, or view markers. The literature offers cautionary tales: high internal accuracy on one hospital’s chest radiographs collapsed on external data because models exploited site-specific confounders rather than pathology10. In breast imaging, acquisition signatures, vendor differences, and demographic imbalances can serve as shortcuts that inflate in-sample metrics while undermining transportability5,11. Because shortcut signals in imaging can be site- and workflow-specific, external validation is necessary but often insufficient on its own; it should be paired with explicit shortcut tests that manipulate suspected nuisance cues12. Causal alignment can be tested without peering inside the network through intervention-style probes. In tabular data this may involve manipulating or withholding candidate predictors and re-testing on held-out cohorts. In high-dimensional imaging, features need not be discrete variables: they can be spatial regions, acquisition and processing signatures, borders/overlays/markers, or latent concepts (e.g., organ or lesion compartments). Accordingly, tests can include targeted occlusion or ROI-restriction (e.g., lesion or organ masks), removal of view markers/borders, counterfactual swaps of background/context, and stress tests that preserve pathology while perturbing nuisance factors (scanner/vendor/site texture or preprocessing). Negative controls remain central: if predictions change when only irrelevant image factors are altered, the model is not causally aligned.
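The negative-control idea above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: images are toy lists of pixel intensities, `blank_corner` stands in for removing a view marker, and `marker_model` and `lesion_model` are hypothetical scoring functions rather than real classifiers. The probe simply measures how much the output moves when only a nuisance cue is altered.

```python
from statistics import mean
from typing import Callable, List

Image = List[List[float]]  # toy grayscale image: rows of pixel intensities


def blank_corner(image: Image, size: int = 2) -> Image:
    """Return a copy with the top-left corner zeroed (stand-in for removing a view marker)."""
    out = [row[:] for row in image]
    for r in range(size):
        for c in range(size):
            out[r][c] = 0.0
    return out


def nuisance_sensitivity(model: Callable[[Image], float],
                         images: List[Image],
                         perturb: Callable[[Image], Image]) -> float:
    """Mean absolute change in model output when only a nuisance cue is altered."""
    return mean(abs(model(img) - model(perturb(img))) for img in images)


# Two hypothetical toy "models", used only to exercise the probe.
def marker_model(image: Image) -> float:
    # Shortcut behavior: the score depends only on the corner marker.
    return image[0][0]


def lesion_model(image: Image) -> float:
    # Aligned behavior: the score depends only on the central "lesion" pixel.
    mid = len(image) // 2
    return image[mid][mid]


def make_image(marker: float, lesion: float, n: int = 5) -> Image:
    """Build an n x n toy image with a corner marker and a central lesion value."""
    img = [[0.1] * n for _ in range(n)]
    img[0][0] = marker
    img[n // 2][n // 2] = lesion
    return img


images = [make_image(1.0, 0.9), make_image(1.0, 0.2), make_image(0.0, 0.8)]
shortcut_score = nuisance_sensitivity(marker_model, images, blank_corner)
aligned_score = nuisance_sensitivity(lesion_model, images, blank_corner)
print(f"marker-driven model: {shortcut_score:.3f}, lesion-driven model: {aligned_score:.3f}")
```

In a real evaluation the perturbation would be a validated marker-removal or background-swap procedure, and the tolerated sensitivity would be fixed in advance rather than judged after the fact.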

Second, invariance. Do performance and clinically relevant aspects of behavior remain stable across plausible shifts, such as changes in scanner models, acquisition protocols, sites, and demographic subgroups? Shortcut learning is a known failure mode whereby models rely on correlations that maximize training performance but are brittle under distribution shift13. In practice, invariance is assessed by stratified external validation designed to stress the model along clinically salient axes. Instability is not automatically disqualifying, but it obliges either scope restriction (narrow intended use) or targeted remediation (re-training, re-calibration).
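As a rough sketch of what environment-stratified validation might compute, the example below reports discrimination (AUC) per environment and the largest gap between environments. The environment names, labels, and scores are invented toy data, and the gap statistic is one illustrative summary among many; a real audit would add uncertainty intervals and calibration per stratum.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a positive case outranks a negative one (ties count one half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def stratified_auc(records: List[Tuple[str, int, float]]) -> Dict[str, float]:
    """Group (environment, label, score) records by environment and compute AUC in each."""
    by_env = defaultdict(lambda: ([], []))
    for env, label, score in records:
        by_env[env][0].append(label)
        by_env[env][1].append(score)
    return {env: auc(ys, ss) for env, (ys, ss) in by_env.items()}


def invariance_gap(per_env: Dict[str, float]) -> float:
    """Largest difference in AUC between any two environments."""
    return max(per_env.values()) - min(per_env.values())


# Toy records: (environment, true label, model score). Values are illustrative only.
records = [
    ("scanner_A", 1, 0.9), ("scanner_A", 1, 0.8), ("scanner_A", 0, 0.3), ("scanner_A", 0, 0.2),
    ("scanner_B", 1, 0.6), ("scanner_B", 1, 0.4), ("scanner_B", 0, 0.5), ("scanner_B", 0, 0.3),
]
per_env = stratified_auc(records)
gap = invariance_gap(per_env)
print(per_env, f"gap={gap:.2f}")
```

A large gap does not by itself say which remedy applies; as noted above, it triggers either a narrower intended use or targeted remediation.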

Generative LLMs provide a clear worked example of the same governance problem in a different modality. They are vulnerable to confident, unsupported statements and fabricated premises. Retrieval-augmented generation (RAG) can constrain outputs to cited sources and conservative refusal behaviors can reduce speculative answers, but these mitigations do not create causal understanding14. Here, too, governance should emphasize tests. For example: Does the system consistently ground each clinical recommendation in evidence? Can it be induced to invent references or contraindications? Do guardrails hold when prompts contain adversarial mixtures of correct and incorrect statements? The goal is not to philosophically dissolve opacity; it is to expose claims to disciplined interrogation.

The objection most often raised is pragmatic: even if testability is the right principle, who will run the tests, and with what data? One answer is to bring development closer to where care occurs, with what we call “Institutional AI”: hospital-embedded pipelines that offer a practical route to testability. Local data reflect the population, scanners, protocols, and reporting culture of deployment. When models are trained or calibrated on such data and validated prospectively, ecological validity improves and calibration error typically narrows15,16. Yet local does not mean unbiased. Site-specific labeling conventions, measurement errors, and under-representation of rare subgroups can be amplified, and portability across a regional network can degrade13,15. The remedy is not to retreat from local development but to pair it with environment-stratified external audits. Internal fit and external generalization must be treated as co-equal design constraints.

Institutional AI is a program more than a product. It starts with auditable provenance. Data, labels, code, and model weights require immutable versioning so that each change has the status of a protocol amendment rather than an informal tweak. It continues with preregistered acceptance tests, including:

  a. prespecified hypotheses about the information pathways and context of use;

  b. preregistered endpoints and acceptance thresholds (e.g., discrimination with uncertainty, calibration on the local case-mix and external cohorts, and invariance across scanners/sites/subgroups);

  c. pre-defined stressors (ablations, negative controls, and environment-stratified shift tests) that directly target plausible shortcut mechanisms;

  d. explicit go/no-go and rollback rules for deployment, update, or recalibration.
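One way to make such acceptance tests concrete is to encode the preregistered thresholds as an immutable record and evaluate measured results against it. The sketch below is hypothetical throughout: the field names, the metric keys, and every numeric threshold are placeholders chosen for illustration, not recommended values.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class AcceptanceCriteria:
    """Preregistered thresholds; frozen so they cannot be edited after results are seen."""
    min_auc_lower_bound: float        # lower confidence bound on discrimination
    max_calibration_error: float      # e.g. expected calibration error on the local case-mix
    max_invariance_gap: float         # largest tolerated AUC gap across scanners/sites/subgroups
    max_nuisance_sensitivity: float   # tolerated output change under negative-control perturbation


def evaluate_gate(criteria: AcceptanceCriteria,
                  results: Dict[str, float]) -> Tuple[str, List[str]]:
    """Return ('go' or 'no-go', list of failed checks) for a candidate model or update."""
    failures: List[str] = []
    if results["auc_lower_bound"] < criteria.min_auc_lower_bound:
        failures.append("discrimination")
    if results["calibration_error"] > criteria.max_calibration_error:
        failures.append("calibration")
    if results["invariance_gap"] > criteria.max_invariance_gap:
        failures.append("invariance_gap")
    if results["nuisance_sensitivity"] > criteria.max_nuisance_sensitivity:
        failures.append("negative_control")
    return ("go" if not failures else "no-go", failures)


# Placeholder thresholds and measurements, for illustration only.
criteria = AcceptanceCriteria(
    min_auc_lower_bound=0.85,
    max_calibration_error=0.05,
    max_invariance_gap=0.05,
    max_nuisance_sensitivity=0.02,
)
results = {
    "auc_lower_bound": 0.88,
    "calibration_error": 0.03,
    "invariance_gap": 0.09,       # fails: performance differs too much across environments
    "nuisance_sensitivity": 0.01,
}
decision, failures = evaluate_gate(criteria, results)
print(decision, failures)
```

Keeping the criteria in a versioned, frozen object mirrors the idea that a change to a threshold is a protocol amendment rather than an informal tweak.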

Before deployment or update, institutions should define the minimum acceptable discrimination (with uncertainty), calibration on the local case-mix and on external cohorts, and invariance across scanners, sites, and key subgroups. When explanations are used for user-facing tasks (for example, lesion localization), tests must quantify explanation fidelity and stability rather than assuming saliency maps are accurate by default. After deployment, drift surveillance should monitor both input data and performance relative to preregistered baselines, with prespecified rollback or recalibration triggers. Crucially, local development must be linked to external, environment-stratified audits that can surface shortcut behavior and subgroup harms early.
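Drift surveillance on input data can be sketched with a simple distribution-shift statistic. The example below uses the population stability index (PSI), which is a swapped-in illustrative choice and not a method named in this article. The bin edges, the toy feature values, and the 0.1 and 0.25 cut-offs (a common rule of thumb) are assumptions; in practice the triggers would be preregistered and paired with performance monitoring.

```python
import math
from typing import List, Sequence


def bin_fractions(values: Sequence[float], edges: Sequence[float],
                  eps: float = 1e-6) -> List[float]:
    """Fraction of values per bin defined by sorted interior edges, floored at eps to avoid log(0)."""
    counts = [0] * (len(edges) + 1)
    for v in values:
        idx = sum(v >= e for e in edges)  # number of edges at or below v gives the bin index
        counts[idx] += 1
    total = len(values)
    return [max(c / total, eps) for c in counts]


def psi(baseline: Sequence[float], current: Sequence[float],
        edges: Sequence[float]) -> float:
    """Population stability index between a baseline sample and a current sample."""
    b = bin_fractions(baseline, edges)
    c = bin_fractions(current, edges)
    return sum((ci - bi) * math.log(ci / bi) for bi, ci in zip(b, c))


def drift_action(psi_value: float, warn: float = 0.1, rollback: float = 0.25) -> str:
    """Map a PSI value to a prespecified action (thresholds here are placeholders)."""
    if psi_value >= rollback:
        return "rollback"
    if psi_value >= warn:
        return "review"
    return "continue"


# Toy feature values, e.g. a normalized acquisition parameter. Illustrative only.
edges = [0.25, 0.5, 0.75]
baseline = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
current = [0.6, 0.7, 0.8, 0.8, 0.9, 0.9, 0.9, 1.0, 1.0, 1.0]

drift = psi(baseline, current, edges)
action = drift_action(drift)
print(f"PSI={drift:.2f} -> {action}")
```

The epsilon floor inflates the statistic when a bin empties completely, which is a conservative bias for a monitoring trigger but should be kept in mind when setting thresholds.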

This program re-embeds the logic of the scientific method into the lifecycle of medical AI. Hypotheses become explicit claims about information pathways and clinician interaction17. Experiments become preregistered quantitative and human-factors evaluations that a system must pass to progress, aligned with established protocol and trial reporting guidance (e.g., SPIRIT-AI/CONSORT-AI where applicable)18,19. Conclusions become bounded intended uses with declared failure modes. Communication becomes the publication of full results, including degraded external performance and negative ablation studies, aligned with established reporting standards—TRIPOD for prediction models and the CONSORT-AI and SPIRIT-AI extensions for interventional trials18,19,20,21. Transparency remains desirable, but testability is the stronger guarantee when systems resist full interpretation.

There are trade-offs. Institutional pipelines require data engineering, MLOps, statistical audit, and governance, which risks confining their feasibility to well-resourced centers: a real inequity that must be addressed22. Two mitigations are feasible. First, standards and tooling can be shared. Components such as provenance schemas, preregistration templates, invariance batteries, and drift dashboards can be packaged and disseminated as open protocols, lowering the fixed costs for smaller hospitals. Second, networks of centers can coordinate distributed validation: a system locally calibrated on one site can be audited across a consortium to quantify generalization and equity before widespread use15,16.

Overall, where interpretable models can meet clinical performance requirements, they should be preferred. But medicine often confronts problems where fully transparent-by-design models are not yet available at clinically acceptable performance or breadth of use—particularly in free-text understanding, many high-dimensional imaging tasks, and multi-modal fusion. In these domains, the practical choice is often not between full transparency and reckless opacity, but between restricting modeling to transparent forms or partial/indirect interpretability aids (with narrower scope or degraded performance), and using higher-capacity opaque models under a testability-first regime with preregistered acceptance thresholds, shortcut-focused stress tests, and continuous monitoring. Testability provides a principled middle path. It does not ask clinicians to accept stories that might be unfaithful representations of what a model is doing; it asks institutions to gather empirical evidence that a system’s behavior is causally sound and robust for the intended population and use.

A conceptual point frequently overlooked in the literature is that not all non-human decision criteria are spurious23,24. Medicine is already comfortable with signals that clinicians cannot directly perceive but accept because they can be validated and mechanistically situated: a distinction articulated in the black-box decision debate and echoed across the broader explainability ethics literature23,24. In imaging AI, models may identify higher-order parenchymal patterns that correlate with stromal biology without being reducible to existing lexicons13. The challenge is to separate such non-human-but-causally-valid signals from genuinely spurious shortcuts. That separation cannot be achieved by saliency maps that decorate predictions after the fact; it requires the causal and invariance tests described above3,13.

Language models merit a similar reframing. Calls for explainability often reduce to better confidence measures or traceable citations, both useful but insufficient. A clinically acceptable system should be able to demonstrate, under adversarial evaluation, that fabricated citation events and unsupported factual claims are below prespecified acceptability thresholds for the intended use, that the model reliably abstains when evidence is absent or insufficient, and that retrieval constraints (when used) measurably bind outputs. RAG architectures help14, but the guarantee rests on how they are tested and monitored in the clinical setting, not on architectural labels. Because hallucinations cannot be eliminated entirely, institutions should treat acceptability as a managed error budget: define task-specific metrics (e.g., citation fidelity, unsupported-claim rate, abstention calibration), set go/no-go thresholds and rollback triggers, and restrict higher-risk clinical decision support uses to regimes where the residual error budget is extremely small and abstention is frequent. For example, tolerable error rates are very different for drafting administrative text, summarizing retrieved guideline passages with verifiable citations, and generating patient- or clinician-facing recommendations: only the last of these demands near-zero tolerance for source fabrication and should default to ‘no answer’ unless the evidence base is explicitly retrieved and consistent.
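The error-budget framing can be sketched as a comparison between an upper confidence bound on an observed failure rate and a tier-specific budget. In the example below the tier names, the budget values, and the evaluation counts are hypothetical, and the Wilson score interval is an illustrative statistical choice rather than a method prescribed here. Using an upper bound rather than the raw rate means a small evaluation set cannot by itself justify a high-risk use.

```python
import math
from typing import Dict


def wilson_upper(failures: int, trials: int, z: float = 1.96) -> float:
    """Upper limit of the Wilson score interval for a failure proportion (z=1.96 gives about 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = failures / trials
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2.0 * trials)) / denom
    half = z * math.sqrt(p * (1.0 - p) / trials + z * z / (4.0 * trials * trials)) / denom
    return center + half


# Hypothetical budgets for the fabricated-citation rate, per intended use.
ERROR_BUDGET: Dict[str, float] = {
    "administrative_draft": 0.05,
    "guideline_summary": 0.01,
    "clinical_recommendation": 0.001,
}


def within_budget(failures: int, trials: int, use_case: str) -> bool:
    """True only if the upper confidence bound on the failure rate fits the budget for this use."""
    return wilson_upper(failures, trials) <= ERROR_BUDGET[use_case]


# Illustrative adversarial evaluation: 2 fabricated citations in 1000 prompts.
fabricated, prompts = 2, 1000
decisions = {use: within_budget(fabricated, prompts, use) for use in ERROR_BUDGET}
print(f"upper bound={wilson_upper(fabricated, prompts):.4f}", decisions)
```

With these toy numbers the same system would clear the lower-risk tiers but not the recommendation tier, which matches the point that the acceptable residual error depends on the intended use.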

Medicine does not need mysticism about machine understanding, nor does it need to reduce every model to a set of human concepts to be safe. It needs disciplined claims, adversarial tests, and bounded conclusions. Interpretable models should be used where they suffice. Where they do not, opaque models can be responsibly deployed if, and only if, they pass tests that establish causal alignment and invariance for the stated context of use. Institutional AI offers a practical framework to do so. The result is not perfect transparency, but something closer to medicine’s traditional strengths: explicit hypotheses, disciplined experiments, robust analyses, and full communication, including informative failures. That is how the epistemic gap narrows: by putting models on trial, not by asking them to tell better stories.

Data availability

No datasets were generated or analyzed during the current study.

References

  1. Pesapane, F. & Sardanelli, F. Keeping AI in medicine and radiology within the framework of scientific method: measuring to close the epistemic gap. Insights Imaging 16, 287 (2025).

  2. Rudin, C. Stop Explaining Black Box Machine Learning Models for High Stakes Decisions and Use Interpretable Models Instead. Nat. Mach. Intell. 1, 206–215 (2019).

  3. Ghassemi, M., Oakden-Rayner, L. & Beam, A. L. The false hope of current approaches to explainable artificial intelligence in health care. Lancet Digit. Health 3, e745–e750 (2021).

  4. Chen, H., Gomez, C., Huang, C. M. & Unberath, M. Explainable medical imaging AI needs human-centered design: guidelines and evidence from a systematic review. NPJ Digit. Med. 5, 156 (2022).

  5. Cerekci, E. et al. Quantitative evaluation of Saliency-Based Explainable artificial intelligence (XAI) methods in Deep Learning-Based Mammogram Analysis. Eur. J. Radiol. 173, 111356 (2024).

  6. Marcus, E. & Teuwen, J. Artificial intelligence and explanation: How, why, and when to explain black boxes. Eur. J. Radiol. 173, 111393 (2024).

  7. Behzad, S., Tabatabaei, S. M. H., Lu, M. Y., Eibschutz, L. S. & Gholamrezanezhad, A. Pitfalls in interpretive applications of artificial intelligence in radiology. Am. J. Roentgenol. 1–12 (2024). https://doi.org/10.2214/AJR.24.31493

  8. Groen, A. M., Kraan, R., Amirkhan, S. F., Daams, J. G. & Maas, M. A systematic review on the use of explainability in deep learning systems for computer aided diagnosis in radiology: Limited use of explainable AI? Eur. J. Radiol. 157, 110592 (2022).

  9. Ghasemi, A., Hashtarkhani, S., Schwartz, D. L. & Shaban-Nejad, A. Explainable artificial intelligence in breast cancer detection and risk prediction: A systematic scoping review. Cancer Innov. 3, e136 (2024).

  10. Zech, J. R. et al. Variable generalization performance of a deep learning model to detect pneumonia in chest radiographs: A cross-sectional study. PLoS Med. 15, e1002683 (2018).

  11. Pesapane, F. et al. Recent radiomics advancements in breast cancer: lessons and pitfalls for the next future. Curr. Oncol. 28, 2351–2372 (2021).

  12. Brown, A. et al. Detecting shortcut learning for fair medical AI using shortcut testing. Nat. Commun. 14, 4314 (2023).

  13. Geirhos, R. et al. Shortcut learning in deep neural networks. Nat. Mach. Intell. 2, 665–673 (2020).

  14. Weinert, D. A. & Rauschecker, A. M. Enhancing large language models with retrieval-augmented generation: a radiology-specific approach. Radiol. Artif. Intell. 7, e240313 (2025).

  15. de Hond, A. A. H. et al. Perspectives on validation of clinical predictive algorithms. NPJ Digit. Med. 6, 86 (2023). https://doi.org/10.1038/s41746-023-00832-9

  16. Pesapane, F. et al. The translation of in-house imaging AI research into a medical device ensuring ethical and regulatory integrity. Eur. J. Radiol. 182, 111852 (2024).

  17. Zhang, T., Mosier, J., Campbell, E. S. & Subbian, V. To NIRS or not: understanding clinical decision-making of respiratory support management related to acute respiratory failure using Critical Decision Method. IISE Trans. Healthc. Syst. Eng. 14, 277–288 (2024).

  18. Cruz Rivera, S. et al. Guidelines for clinical trial protocols for interventions involving artificial intelligence: the SPIRIT-AI extension. Lancet Digit. Health 2, e549–e560 (2020).

  19. Liu, X. et al. Reporting guidelines for clinical trial reports for interventions involving artificial intelligence: the CONSORT-AI extension. Lancet Digit. Health 2, e537–e548 (2020).

  20. Collins, G. S., Reitsma, J. B., Altman, D. G. & Moons, K. G. Transparent Reporting of a multivariable prediction model for Individual Prognosis or Diagnosis (TRIPOD): the TRIPOD statement. Ann. Intern. Med. 162, 55–63 (2015).

  21. Nagendran, M. et al. Artificial intelligence versus clinicians: systematic review of design, reporting standards, and claims of deep learning studies. BMJ 368, m689 (2020).

  22. Celi, L. A. et al. Sources of bias in artificial intelligence that perpetuate healthcare disparities-A global review. PLOS Digit. Health 1, e0000022 (2022).

  23. Freyer, N., Gross, D. & Lipprandt, M. The ethical requirement of explainability for AI-DSS in healthcare: a systematic review of reasons. BMC Med. Ethics 25, 104 (2024).

  24. London, A. J. Artificial intelligence and black-box medical decisions: accuracy versus explainability. Hastings Cent. Rep. 49, 15–21 (2019).

Acknowledgements

This work was partially supported by the Italian Ministry of Health with Ricerca Corrente 5 × 1000 funds.

Author information

Authors and Affiliations

  1. Breast Imaging Division, Radiology Department, IEO European Institute of Oncology IRCCS, Milan, Italy

    Filippo Pesapane, Anna Rotili, Silvia Penco, Luca Nicosia & Enrico Cassano

  2. University of Milan, Università degli Studi di Milano, Milan, Italy

    Filippo Pesapane

Authors

  1. Filippo Pesapane
  2. Anna Rotili
  3. Silvia Penco
  4. Luca Nicosia
  5. Enrico Cassano

Contributions

F.P. conceptualized the commentary, drafted the original manuscript, and coordinated the overall development of the work. A.R., S.P., and L.N. contributed to the intellectual content of the manuscript and critically revised it for important scholarly content. E.C. provided senior oversight, contributed to the conceptual framing of the commentary, and critically revised the manuscript. All authors have read and approved the final version of the manuscript and agree to be accountable for all aspects of the work.

Corresponding author

Correspondence to Filippo Pesapane.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Pesapane, F., Rotili, A., Penco, S. et al. Evidence over explanations: put medical AI to the test. npj Artif. Intell. 2, 53 (2026). https://doi.org/10.1038/s44387-026-00092-4

  • DOI: https://doi.org/10.1038/s44387-026-00092-4
