JMIRx Med

2563-6316

JMIR Publications

Toronto, Canada

v7i1e96225

10.2196/96225

Peer-Review Report

Peer Review of “The Performance of DeepSeek R1 and Gemini 3 in Complex Medical Scenarios: Comparative Study”

Jacqueline Guan-Ting You

Mass General Brigham, Boston, United States

Amy Schwartz

2026

27 April 2026

e96225

26 March 2026

2026

This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIRx Med, is properly cited. The complete bibliographic information, a link to the original publication on https://med.jmirx.org/, as well as this copyright and license information must be included.

https://www.medrxiv.org/content/10.1101/2025.04.29.25326666v1

https://med.jmirx.org/2026/1/e96220

https://med.jmirx.org/2026/1/e76822

large reasoning model; LRM; large language model; LLM; accuracy; medical scenario; DeepSeek R1; Gemini 3

This is a peer-review report for “The Performance of DeepSeek R1 and Gemini 3 in Complex Medical Scenarios: Comparative Study.”

Round 1 Review

General Comments

This paper [1] seeks to evaluate how accurately DeepSeek R1 identifies the primary medical diagnosis in the medical scenarios portion of the Massive Multitask Language Understanding Pro (MMLU-Pro) dataset, using an open-ended format. Some clarifications of the methods and results (especially the roles of subject matter experts vs core team members in the publication) would help readers understand how these results were derived.

Specific Comments

Minor Comments

Introduction: consider citing DeepSeek-AI’s DeepSeek-R1 paper [2].

Methods: please clarify who your subject matter experts were (eg, physicians, researchers) in terms of rank, specialty, and role, and how they were used to grade answers (eg, selected based on specialty, 2-reviewer process).

Methods: please indicate when the analyses were run.

Results: who determines whether references are related or unrelated?

Results and Discussion: it is unclear from the discussion whether DeepSeek R1’s reasoning was correct for the questions it diagnosed correctly (eg, it may reach the right diagnosis through incorrect reasoning). Similarly, did you determine the “correct answer” based on string matching? For example, if the answer was “septic arthritis” and the DeepSeek output stated “septic shock,” would this be graded as incorrect?

Discussion: consider acknowledging the sample size of questions as a limitation.
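To illustrate the grading question raised in the Results and Discussion comment, here is a minimal sketch of two automated grading rules applied to the “septic arthritis” example. This is not the authors' method, which the manuscript does not specify. The function names (`normalize`, `exact_match`, `substring_match`) and the sample outputs are hypothetical and are used only to show how the choice of rule could change what counts as correct.

```python
import re


def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    return " ".join(text.split())


def exact_match(reference: str, model_output: str) -> bool:
    """Strict rule (hypothetical): correct only if the normalized strings are identical."""
    return normalize(reference) == normalize(model_output)


def substring_match(reference: str, model_output: str) -> bool:
    """Lenient rule (hypothetical): correct if the normalized reference
    diagnosis appears anywhere in the normalized model output."""
    return normalize(reference) in normalize(model_output)


reference = "septic arthritis"
# Illustrative outputs only; these are not taken from the study data.
outputs = [
    "Septic arthritis",
    "The most likely diagnosis is septic arthritis.",
    "Septic shock",
]

for output in outputs:
    print(f"{output!r}: exact={exact_match(reference, output)}, "
          f"substring={substring_match(reference, output)}")
```

Under both rules, “septic shock” is graded as incorrect, but a full-sentence answer that contains the right diagnosis passes only the lenient rule. The lenient rule would also accept an output that negates the diagnosis (eg, “no evidence of septic arthritis”), which is one reason the grading procedure, whether automated or expert based, should be stated explicitly.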

Round 2 Review

General Comments

The paper has been revised to address the Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis–Large Language Model (TRIPOD-LLM) guidelines. Overall, it appears most concerns from both reviewers have been addressed.

Conflicts of Interest

None declared.

Abbreviations

MMLU-Pro: Massive Multitask Language Understanding Pro

TRIPOD-LLM: Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis–Large Language Model

References

1. Bajwa, Hoyt, Knight, Haider. The performance of DeepSeek R1 and Gemini 3 in complex medical scenarios: comparative study. JMIRx Med. 2026;7:e76822. doi:10.2196/76822

2. Guo, Yang, Zhang. DeepSeek-R1: incentivizing reasoning capability in LLMs via reinforcement learning. Preprint posted online on Jan 22, 2025. doi:10.48550/arXiv.2501.12948