<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.0 20040830//EN" "journalpublishing.dtd"><article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" dtd-version="2.0" xml:lang="en" article-type="reviewer-report"><front><journal-meta><journal-id journal-id-type="nlm-ta">JMIRx Med</journal-id><journal-id journal-id-type="publisher-id">xmed</journal-id><journal-id journal-id-type="index">34</journal-id><journal-title>JMIRx Med</journal-title><abbrev-journal-title>JMIRx Med</abbrev-journal-title><issn pub-type="epub">2563-6316</issn><publisher><publisher-name>JMIR Publications</publisher-name><publisher-loc>Toronto, Canada</publisher-loc></publisher></journal-meta><article-meta><article-id pub-id-type="publisher-id">v7i1e96223</article-id><article-id pub-id-type="doi">10.2196/96223</article-id><article-categories><subj-group subj-group-type="heading"><subject>Peer-Review Report</subject></subj-group></article-categories><title-group><article-title>Peer Review of &#x201C;The Performance of DeepSeek R1 and Gemini 3 in Complex Medical Scenarios: Comparative Study&#x201D;</article-title></title-group><contrib-group><contrib contrib-type="author"><name name-style="western"><surname>Wang</surname><given-names>Ziyu</given-names></name><xref ref-type="aff" rid="aff1"/></contrib></contrib-group><aff id="aff1"><institution>University of California, Irvine</institution><addr-line>Irvine</addr-line><addr-line>CA</addr-line><country>United States</country></aff><contrib-group><contrib contrib-type="editor"><name name-style="western"><surname>Schwartz</surname><given-names>Amy</given-names></name></contrib></contrib-group><pub-date pub-type="collection"><year>2026</year></pub-date><pub-date pub-type="epub"><day>27</day><month>4</month><year>2026</year></pub-date><volume>7</volume><elocation-id>e96223</elocation-id><history><date 
date-type="received"><day>26</day><month>03</month><year>2026</year></date><date date-type="rev-recd"><day>26</day><month>03</month><year>2026</year></date><date date-type="accepted"><day>26</day><month>03</month><year>2026</year></date></history><copyright-statement>&#x00A9; Ziyu Wang. Originally published in JMIRx Med (<ext-link ext-link-type="uri" xlink:href="https://med.jmirx.org">https://med.jmirx.org</ext-link>), 27.4.2026. </copyright-statement><copyright-year>2026</copyright-year><license license-type="open-access" xlink:href="https://creativecommons.org/licenses/by/4.0/"><p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (<ext-link ext-link-type="uri" xlink:href="https://creativecommons.org/licenses/by/4.0/">https://creativecommons.org/licenses/by/4.0/</ext-link>), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIRx Med, is properly cited. The complete bibliographic information, a link to the original publication on <ext-link ext-link-type="uri" xlink:href="https://med.jmirx.org/">https://med.jmirx.org/</ext-link>, as well as this copyright and license information must be included.</p></license><self-uri xlink:type="simple" xlink:href="https://xmed.jmir.org/2026/1/e96223"/><related-article related-article-type="companion" ext-link-type="doi" xlink:href="10.1101/2025.04.29.25326666" xlink:title="Preprint (medRxiv)" xlink:type="simple">https://www.medrxiv.org/content/10.1101/2025.04.29.25326666v1</related-article><related-article related-article-type="companion" ext-link-type="doi" xlink:href="10.2196/96220" xlink:title="Authors' Response to Peer-Review Reports" xlink:type="simple">https://med.jmirx.org/2026/1/e96220</related-article><related-article related-article-type="companion" ext-link-type="doi" xlink:href="10.2196/76822" xlink:title="Published Article" 
xlink:type="simple">https://med.jmirx.org/2026/1/e76822</related-article><kwd-group><kwd>large reasoning model</kwd><kwd>LRM</kwd><kwd>large language model</kwd><kwd>LLM</kwd><kwd>accuracy</kwd><kwd>medical scenario</kwd><kwd>DeepSeek R1</kwd><kwd>Gemini 3</kwd></kwd-group></article-meta></front><body><p><italic>This is a peer-review report for &#x201C;The Performance of DeepSeek R1 and Gemini 3 in Complex Medical Scenarios: Comparative Study.&#x201D;</italic></p><sec id="s2"><title>Round 1 Review</title><sec id="s1-1"><title>General Comments</title><p>This timely and well-structured paper [<xref ref-type="bibr" rid="ref1">1</xref>] presents a follow-up evaluation of DeepSeek R1, a state-of-the-art large reasoning model, on open-ended medical scenarios drawn from the Massive Multitask Language Understanding Pro (MMLU-Pro) benchmark. The study finds that DeepSeek R1 achieves a high accuracy of 92% even without multiple-choice options, demonstrating its potential utility in more realistic clinical settings. The motivation is clear and the empirical results are strong. However, the paper would benefit from revisions to improve clarity, contextual grounding in existing work, and methodological detail. The authors may also consider citing recent work that examines the questioning strategies of large language models (LLMs) in clinical dialogues to better position this study in the broader landscape.</p><p>The study is commendable for combining expert validation with benchmark testing and for highlighting both the performance and interpretability of the model. The paper is generally well written, informative, and relevant to the research community on artificial intelligence (AI) in health care.</p><p>However, before the paper is suitable for publication, several important revisions are required. 
These include expanding the related work section to better situate the contribution relative to current research efforts, addressing some methodological limitations more transparently, and improving the robustness and generalizability of the conclusions. Thus, I recommend revision and re-review.</p></sec><sec id="s1-2"><title>Specific Comments</title><sec id="s1-2-1"><title>Major Comments</title><p>1. While the paper references MMLU, MedQA, and some domain-specific LLM evaluations, it lacks a deeper discussion of recent approaches to questioning capabilities and long-context understanding in medical AI. Two notable papers should be included. First, &#x201C;HealthQ: Unveiling Questioning Capabilities of LLM Chains in Healthcare Conversations&#x201D; by Wang et al [<xref ref-type="bibr" rid="ref2">2</xref>] presents a benchmarking framework focused on the inquiry and elicitation capacity of LLM chains, which relates directly to the &#x201C;reasoning&#x201D; and prompt design aspects discussed here. Second, &#x201C;Context Clues: Evaluating Long Context Models for Clinical Prediction Tasks on EHR Data&#x201D; by Wornow et al [<xref ref-type="bibr" rid="ref3">3</xref>] highlights how context windows and task framing affect LLM performance on clinical reasoning&#x2014;relevant for understanding how question complexity and format might interact with LLM accuracy.</p><p>In particular, Wang et al [<xref ref-type="bibr" rid="ref2">2</xref>], which evaluates the ability of LLM chains to refine questions through reflection and prompting, speaks directly to the current paper&#x2019;s interest in open-ended diagnostic reasoning and LLM behavior in clinical settings.</p><p>2. 
Although the DeepSeek R1 model is rigorously evaluated against MMLU-Pro, there is no direct performance comparison with other LLMs (eg, Med-PaLM, GPT-4, Claude) on the same dataset or medical scenarios. Even informal or partial benchmarks would help contextualize the model&#x2019;s effectiveness. The novelty should also be better emphasized&#x2014;is this the first comprehensive large reasoning model evaluation on MMLU-Pro&#x2019;s health subset?</p><p>3. The paper rightly points out issues with cueing and &#x201C;testwiseness&#x201D; in multiple-choice questions but does not propose concrete mitigations. The planned future work of testing without answer choices is excellent&#x2014;consider incorporating a small pilot of this now or discussing expected outcomes in more depth. The limitations of using only 162 scenarios across many specialties could also be made more transparent, especially regarding statistical robustness and specialty-specific insights.</p><p>4. The study uses a fixed prompt but does not explore or discuss the impact of prompt variations, which may influence results in open-ended tasks.</p><p>5. While the discussion of biases and failure modes is helpful, a more structured breakdown of error types and their frequencies would improve the interpretability of the findings.</p><p>6. The discussion of reasoning steps and transparency is insightful but could be expanded to address recent concerns about the faithfulness of chain-of-thought outputs.</p></sec><sec id="s1-2-2"><title>Minor Comments</title><p>7. Model latency and usability: while the latency of DeepSeek is acknowledged, it is not contextualized with respect to potential clinical utility or workflow integration. A brief paragraph on practical deployment implications would strengthen the discussion.</p><p>8. 
Citation formatting: ensure all references (especially web-based ones like Perplexity and PromptHub) are consistently formatted and maintained in the reference list.</p><p>9. Future directions could be made more actionable by suggesting benchmark expansions with real patient data or multimodal inputs.</p></sec></sec></sec><sec id="s3"><title>Round 2 Review</title><p>My comments have been addressed.</p></sec></body><back><fn-group><fn fn-type="conflict"><p>None declared.</p></fn></fn-group><glossary><title>Abbreviations</title><def-list><def-item><term id="abb1">AI</term><def><p>artificial intelligence</p></def></def-item><def-item><term id="abb2">LLM</term><def><p>large language model</p></def></def-item><def-item><term id="abb3">MMLU-Pro</term><def><p>Massive Multitask Language Understanding Pro</p></def></def-item></def-list></glossary><ref-list><title>References</title><ref id="ref1"><label>1</label><nlm-citation citation-type="journal"><person-group person-group-type="author"><name name-style="western"><surname>Bajwa</surname><given-names>M</given-names> </name><name name-style="western"><surname>Hoyt</surname><given-names>R</given-names> </name><name name-style="western"><surname>Knight</surname><given-names>D</given-names> </name><name name-style="western"><surname>Haider</surname><given-names>M</given-names> </name></person-group><article-title>The performance of DeepSeek R1 and Gemini 3 in complex medical scenarios: comparative study</article-title><source>JMIRx Med</source><year>2026</year><volume>7</volume><fpage>e76822</fpage><pub-id pub-id-type="doi">10.2196/76822</pub-id></nlm-citation></ref><ref id="ref2"><label>2</label><nlm-citation citation-type="journal"><person-group person-group-type="author"><name name-style="western"><surname>Wang</surname><given-names>Z</given-names> </name><name name-style="western"><surname>Li</surname><given-names>H</given-names> </name><name name-style="western"><surname>Huang</surname><given-names>D</given-names> </name><name 
name-style="western"><surname>Kim</surname><given-names>HS</given-names> </name><name name-style="western"><surname>Shin</surname><given-names>CW</given-names> </name><name name-style="western"><surname>Rahmani</surname><given-names>AM</given-names> </name></person-group><article-title>HealthQ: unveiling questioning capabilities of LLM chains in healthcare conversations</article-title><source>Smart Health (2014)</source><year>2025</year><month>06</month><volume>36</volume><fpage>100570</fpage><pub-id pub-id-type="doi">10.1016/j.smhl.2025.100570</pub-id></nlm-citation></ref><ref id="ref3"><label>3</label><nlm-citation citation-type="confproc"><person-group person-group-type="author"><name name-style="western"><surname>Wornow</surname><given-names>M</given-names> </name><name name-style="western"><surname>Bedi</surname><given-names>S</given-names> </name><name name-style="western"><surname>Hernandez</surname><given-names>MAF</given-names> </name><etal/></person-group><article-title>Context clues: evaluating long context models for clinical prediction tasks on EHR data</article-title><conf-name>ICLR 2025; the Thirteenth International Conference on Learning Representations</conf-name><conf-date>Apr 24-28, 2025</conf-date></nlm-citation></ref></ref-list></back></article>