<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.0 20040830//EN" "journalpublishing.dtd"><article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" dtd-version="2.0" xml:lang="en" article-type="reviewer-report"><front><journal-meta><journal-id journal-id-type="nlm-ta">JMIRx Med</journal-id><journal-id journal-id-type="publisher-id">xmed</journal-id><journal-id journal-id-type="index">34</journal-id><journal-title>JMIRx Med</journal-title><abbrev-journal-title>JMIRx Med</abbrev-journal-title><issn pub-type="epub">2563-6316</issn></journal-meta><article-meta><article-id pub-id-type="publisher-id">60280</article-id><article-id pub-id-type="doi">10.2196/60280</article-id><title-group><article-title>Peer Review of &#x201C;Performance Drift in Machine Learning Models for Cardiac Surgery Risk Prediction: Retrospective Analysis&#x201D;</article-title></title-group><contrib-group><contrib contrib-type="author"><name name-style="western"><surname>Zeng</surname><given-names>Juntong</given-names></name><degrees>MD, PhD</degrees><xref ref-type="aff" rid="aff1"/></contrib></contrib-group><aff id="aff1"><institution>Fuwai Hospital, National Center for Cardiovascular Diseases, Chinese Academy of Medical Sciences and Peking Union Medical College</institution>, <addr-line>Beijing</addr-line>, <country>China</country></aff><contrib-group><contrib contrib-type="editor"><name name-style="western"><surname>Meinert</surname><given-names>Edward</given-names></name></contrib></contrib-group><pub-date pub-type="collection"><year>2024</year></pub-date><pub-date pub-type="epub"><day>12</day><month>6</month><year>2024</year></pub-date><volume>5</volume><elocation-id>e60280</elocation-id><history><date date-type="received"><day>06</day><month>05</month><year>2024</year></date><date date-type="accepted"><day>06</day><month>05</month><year>2024</year></date></history><copyright-statement>&#x00A9; Juntong 
Zeng. Originally published in JMIRx Med (<ext-link ext-link-type="uri" xlink:href="https://med.jmirx.org">https://med.jmirx.org</ext-link>), 12.6.2024. </copyright-statement><copyright-year>2024</copyright-year><license license-type="open-access" xlink:href="https://creativecommons.org/licenses/by/4.0/"><p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (<ext-link ext-link-type="uri" xlink:href="https://creativecommons.org/licenses/by/4.0/">https://creativecommons.org/licenses/by/4.0/</ext-link>), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIRx Med, is properly cited. The complete bibliographic information, a link to the original publication on <ext-link ext-link-type="uri" xlink:href="https://med.jmirx.org/">https://med.jmirx.org/</ext-link>, as well as this copyright and license information must be included.</p></license><self-uri xlink:type="simple" xlink:href="https://xmed.jmir.org/2024/1/e60280"/><related-article related-article-type="companion" ext-link-type="doi" xlink:href="10.2196/preprints.45973" xlink:title="Preprint (JMIR Preprints)" xlink:type="simple">https://preprints.jmir.org/preprint/45973</related-article><related-article related-article-type="companion" ext-link-type="doi" xlink:href="10.1101/2023.01.21.23284795" xlink:title="Preprint (MedRxiv)" xlink:type="simple">https://www.medrxiv.org/content/10.1101/2023.01.21.23284795v1</related-article><related-article related-article-type="companion" ext-link-type="doi" xlink:href="10.2196/60384" xlink:title="Authors' Response to Peer-Review Reports" xlink:type="simple">https://med.jmirx.org/2024/1/e60384</related-article><related-article related-article-type="companion" ext-link-type="doi" xlink:href="10.2196/45973" xlink:title="Published Article" xlink:type="simple">https://med.jmirx.org/2024/1/e45973</related-article><kwd-group><kwd>cardiac 
surgery</kwd><kwd>artificial intelligence</kwd><kwd>risk prediction</kwd><kwd>machine learning</kwd><kwd>operative mortality</kwd><kwd>data set drift</kwd><kwd>performance drift</kwd><kwd>national data set</kwd><kwd>adult</kwd><kwd>data</kwd><kwd>cardiac</kwd><kwd>surgery</kwd><kwd>cardiology</kwd><kwd>heart</kwd><kwd>risk</kwd><kwd>prediction</kwd><kwd>United Kingdom</kwd><kwd>mortality</kwd><kwd>performance</kwd><kwd>model</kwd></kwd-group></article-meta></front><body><p><italic>This is the peer-review report for &#x201C;Performance Drift in Machine Learning Models for Cardiac Surgery Risk Prediction: Retrospective Analysis.&#x201D;</italic></p><sec id="s2"><title>Round 1 Review</title><sec id="s1-1"><title>General Comments</title><p>This manuscript [<xref ref-type="bibr" rid="ref1">1</xref>] presents an interesting study that explores temporal trends in various performance metrics for different types of prediction models used in the prediction of in-hospital mortality after cardiac surgery in the United Kingdom from 2012 to 2019. The data set was divided into 2 periods: from 2012 to 2016 for model training and internal validation and from 2017 to 2019 for external validation. The study evaluated 5 prediction models (logistic regression, support vector machine [SVM], random forest, extreme gradient boosting [XGBoost], and neural network) alongside the established European System for Cardiac Operative Risk Evaluation (EuroSCORE) II. 
The authors aimed to assess the model performance on 5 metrics (1 &#x2013; expected calibration error [ECE], area under the curve [AUC], 1 &#x2013; Brier score, <italic>F</italic><sub>1</sub>-score, and net benefit) and proposed a composite metric, the clinical effectiveness metric (CEM), calculated as the geometric mean of the 5 mentioned metrics, as the primary metric.</p><p>The study began with a nontemporal baseline evaluation of different models in the 2017&#x2010;2019 temporal validation and then conducted a series of drift analyses, including an examination of overall trends from 2012 to 2019, within-period trends in the first 3 months of 2017 and 2019, and between-period trends between the first 3 months of 2017 and 2019. The authors also analyzed drift in variable importance and variable distribution, defined by the temporal change in the ratio of several top-importance features within the data set, to profile data set drift.</p><p>The authors demonstrated that XGBoost and random forest were the best-performing models, both in nontemporal and temporal evaluations, whereas the EuroSCORE II model exhibited a significant drop in performance. Temporal declines in model performance were observed across all models and were consistent with data set drift.</p><p>Overall, the question of the generalizability of prediction models, whether temporal or spatial, has long been a topic of discussion in clinical research. This study takes a commendable approach to addressing this question. However, there are some issues that require clarification and revision, including (1) methodological concerns related to the justification of the main metric (CEM) using averaging, and the appropriateness of some statistical tests; (2) the clinical significance of the identified performance drift; and (3) the overall clarity of the study&#x2019;s design and presentation.</p></sec><sec id="s1-2"><title>Specific Comments</title><sec id="s1-2-1"><title>Major Comments</title><p>1. 
The statement of the study&#x2019;s objectives should be improved for clarity, particularly regarding the phrase &#x201C;verify suspected dataset drift by assessing the relationship between and within performance drift, variable importance drift, and dataset drift across ML and ES II approaches.&#x201D; It is unclear what is meant by the &#x201C;relationship between and within.&#x201D; Does this refer to the analysis of performance drift within and between different periods? The overall study design is quite challenging to grasp initially, even with the graphical overview provided in Figure 1. To enhance clarity, additional details and explanations should be added to the aims, overall design, graphical overview, and the text of the <italic>Methods</italic> and <italic>Results</italic> sections.</p><p>2. The rationale for introducing CEM as the primary performance metric, calculated as the geometric mean of 5 distinct individual metrics, is debatable and lacks strong justification. Although the geometric mean is less sensitive to outliers than the arithmetic mean, it raises the fundamental question of why these metrics need to be summarized. Is it merely to obtain a single quantitative measure for analysis, or does it aim to provide a more comprehensive understanding of overall model performance? It appears to serve primarily the former purpose, which may not be an appropriate practice given that the 5 metrics assess entirely different aspects of model performance: 1 &#x2013; ECE for calibration, AUC for discrimination, 1 &#x2013; Brier score (which already encompasses calibration and discrimination components), <italic>F</italic><sub>1</sub>-score for threshold-specific discrimination, and net benefit for decision-analytic clinical utility. Consequently, interpreting the exact meaning of CEM becomes challenging, as it reduces these diverse aspects to a single numerical value. 
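To make this point concrete, the composite as described is simple to compute; the following Python sketch uses made-up placeholder metric values (not figures from the manuscript) and also illustrates one practical pitfall, namely that a nonpositive component, such as a negative net benefit, leaves the geometric mean undefined:

```python
# Illustrative sketch of the CEM as described: the geometric mean of the
# 5 component metrics. All metric values below are made-up placeholders,
# not results from the manuscript.
import math

def clinical_effectiveness_metric(metrics):
    """Geometric mean of the component metrics.

    Assumes every component lies in (0, 1]; a nonpositive component
    (eg, a negative net benefit) leaves the geometric mean undefined,
    which is one practical drawback of this composite.
    """
    values = list(metrics.values())
    if any(v <= 0 for v in values):
        raise ValueError("geometric mean requires strictly positive components")
    return math.prod(values) ** (1.0 / len(values))

example = {
    "one_minus_ece": 0.95,    # calibration: 1 - expected calibration error
    "auc": 0.80,              # discrimination
    "one_minus_brier": 0.90,  # overall probabilistic accuracy
    "f1_score": 0.40,         # threshold-specific discrimination
    "net_benefit": 0.05,      # decision-analytic utility
}
cem = clinical_effectiveness_metric(example)  # one number blending 5 distinct aspects
```

As the placeholder example shows, a single CEM value cannot reveal which aspect of performance drove a change, which is exactly the interpretability concern raised above.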
Therefore, I suggest just reporting and examining all 5 metrics individually, with or without highlighting certain ones as primary areas of interest.</p><p>3. The manuscript uses several statistical tests, some of which are relatively uncommon. Please provide a more detailed description of the objectives and specific statistical situations for each test used. Additionally, for the baseline nontemporal performance comparison, a more conventional approach for comparing AUC would be the use of the DeLong method (you could choose the best model as the reference), and bootstrapping can be used to assess the statistical significance when comparing other metrics.</p><p>4. For the training and internal validation phase with 5-fold cross-validation, additional details are needed on how the final model of each type was selected for subsequent temporal validation, including whether hyperparameter tuning was carried out and whether the selected model was refit on the entire training data set after cross-validation.</p><p>5. The <italic>Introduction</italic> section should incorporate more background information on previous studies reporting or relating to performance variation in prediction models for cardiac surgery outcomes. In the <italic>Discussion</italic> section, it is also important to discuss how this work contributes to existing evidence in the context of these previous studies. Some relevant studies, based on my preliminary search, include Benedetto et al [<xref ref-type="bibr" rid="ref2">2</xref>], Zeng et al [<xref ref-type="bibr" rid="ref3">3</xref>], Mori et al [<xref ref-type="bibr" rid="ref4">4</xref>], and potentially more.</p><p>6. Although the authors observed numerical declines in CEM and other metrics, the magnitude of these declines appears to be relatively small, particularly when considering metrics such as AUC. 
As a result, it is essential to discuss how to interpret this magnitude of drift in the context of clinical practice. In other words, what is the clinical significance of this variation in performance, and how does it justify the necessity of actively monitoring model drift in terms of cost-effectiveness? Please discuss.</p><p>7. The conclusion should only focus on the primary findings outlined in the aims of the <italic>Introduction</italic> section. Avoid incorporating less central findings and speculative elements. Additionally, it may not be fair to suggest replacing the EuroSCORE II model simply based on the inferior performance in this study, since it was already established and this study essentially conducted an external validation for it, whereas the other machine learning models were developed using these data sets.</p></sec><sec id="s1-2-2"><title>Minor Comments</title><p>1. More detailed definitions and explanations should be provided for each performance metric.</p><p>2. In the <italic>Methods</italic> section, please provide a clear outline of the inclusion and exclusion criteria. Additionally, consider including a flowchart that illustrates the data set development process, outlining how these criteria were applied.</p><p>3. I had difficulty understanding what &#x201C;outliers&#x201D; and &#x201C;distribution&#x201D; meant in the <italic>Results</italic> section for the baseline nontemporal performance of each model. I thought that each metric of each model should be just a numerical value and a 95% CI from bootstrapping.</p><p>4. The title of the manuscript should be an objective reflection of the overall study design and aim, rather than drawing conclusions from the findings.</p><p>5. I did not find the supplementary materials in the review system. 
I am not sure whether this issue is on my end or not.</p></sec></sec></sec><sec id="s3"><title>Round 2 Review</title><sec id="s2-1"><title>General Comments</title><p>I appreciate the opportunity to rereview this manuscript. The authors&#x2019; efforts in revising their manuscript in response to previous concerns are commendable. This manuscript has been improved and is now in principle publishable. It could potentially be accepted upon reasonable response to a few follow-up minor comments, outlined below.</p></sec><sec id="s2-2"><title>Specific Comments</title><p>1. About my previous major comment 1, the authors meticulously elaborated on (1) the reasons for performance drift and (2) its importance, which are both valid points. However, the current <italic>Introduction</italic> (lines 121&#x2010;179) is quite lengthy. I recommend consolidating these 2 parts into a single paragraph, listing each point without the need for detailed individual explanations. Additionally, my query about the exact meaning of &#x201C;the relationship between and within variable importance drift, performance drift, and actual dataset drift&#x201D; remains unaddressed. Even though it was removed from the <italic>Introduction</italic>, it still appears in the abstract. I suggest the authors explicitly explain it to readers and incorporate it into the manuscript when first mentioned.</p><p>2. Regarding the justification for the CEM, the authors have added more explanation and supporting literature for its use. However, it would strengthen their case if they could provide examples from external studies or use cases where a similar practice (averaging different aspects of metrics for model performance evaluation) was used, beyond their own studies.</p><p>3. About the statistical tests for comparing AUC with the DeLong method, I believe that performing the DeLong test for AUC comparison is not overly computationally demanding, even on a relatively large data set. 
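Indeed, for authors working primarily in Python rather than R, the analytic DeLong comparison is short enough to implement directly; the sketch below is my own illustrative implementation (not the authors' code, and the helper names are hypothetical), not a substitute for a vetted package:

```python
# Minimal illustrative implementation of DeLong's test for comparing two
# correlated (paired) AUCs on the same test cases. This is a sketch under
# my own assumptions, not the authors' code. O(m*n) memory; fine for
# moderate sample sizes.
import numpy as np
from scipy.stats import norm

def _placements(pos, neg):
    """V10[i] = P(pos_i > random neg), V01[j] = P(random pos > neg_j)."""
    # psi(x, y) = 1 if x > y, 0.5 if x == y, 0 otherwise
    psi = (pos[:, None] > neg[None, :]) + 0.5 * (pos[:, None] == neg[None, :])
    return psi.mean(axis=1), psi.mean(axis=0)

def delong_test(y_true, scores_a, scores_b):
    """Two-sided P value for H0: AUC_A == AUC_B, using paired predictions."""
    y = np.asarray(y_true, dtype=bool)
    aucs, v10s, v01s = [], [], []
    for s in (scores_a, scores_b):
        s = np.asarray(s, dtype=float)
        v10, v01 = _placements(s[y], s[~y])
        aucs.append(v10.mean())          # mean placement equals the AUC
        v10s.append(v10)
        v01s.append(v01)
    m, n = int(y.sum()), int((~y).sum())
    s10 = np.cov(np.vstack(v10s))        # 2x2 covariance over positive cases
    s01 = np.cov(np.vstack(v01s))        # 2x2 covariance over negative cases
    var = ((s10[0, 0] + s10[1, 1] - 2 * s10[0, 1]) / m
           + (s01[0, 0] + s01[1, 1] - 2 * s01[0, 1]) / n)
    z = (aucs[0] - aucs[1]) / np.sqrt(var)
    return aucs[0], aucs[1], 2 * norm.sf(abs(z))
```

Note that the variance of the AUC difference is obtained analytically from the placement-value covariances, with no resampling required, which is why the test remains fast even on large data sets.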
I recommend the authors explore commonly used R packages (eg, &#x201C;pROC&#x201D;) that facilitate AUC calculation and comparison with the DeLong method. The DeLong comparison typically requires paired variables of the label and 2 models&#x2019; predicted probabilities, and the 95% CI and <italic>P</italic> value are then calculated analytically from these paired samples (with bootstrapping available as an alternative), which is relatively efficient.</p><p>4. Regarding model tuning and specification of the best models (PS: I still cannot find the supplements, only a revised clean manuscript; I am not sure if this was due to issues from my end), I am curious why different tuning practices were used for different models, especially grid search for XGBoost and SVM but manual tuning for random forest.</p><p>5. In response to the query about the clinical significance of the relatively small scale of performance drift, the authors referred to one of their previous studies briefly discussing this matter. However, it would be much clearer if the authors could elaborate on this point more explicitly in the present study and, if possible, provide additional analysis to support this argument.</p></sec></sec></body><back><fn-group><fn fn-type="conflict"><p>None declared.</p></fn></fn-group><glossary><title>Abbreviations</title><def-list><def-item><term id="abb1">AUC</term><def><p>area under the curve</p></def></def-item><def-item><term id="abb2">CEM</term><def><p>clinical effectiveness metric</p></def></def-item><def-item><term id="abb3">ECE</term><def><p>expected calibration error</p></def></def-item><def-item><term id="abb4">EuroSCORE</term><def><p>European System for Cardiac Operative Risk Evaluation</p></def></def-item><def-item><term id="abb5">SVM</term><def><p>support vector machine</p></def></def-item><def-item><term id="abb6">XGBoost</term><def><p>extreme gradient boosting</p></def></def-item></def-list></glossary><ref-list><title>References</title><ref id="ref1"><label>1</label><nlm-citation 
citation-type="journal"><person-group person-group-type="author"><name name-style="western"><surname>Dong</surname><given-names>T</given-names> </name><name name-style="western"><surname>Sinha</surname><given-names>S</given-names> </name><name name-style="western"><surname>Zhai</surname><given-names>B</given-names> </name><etal/></person-group><article-title>Performance drift in machine learning models for cardiac surgery risk prediction: retrospective analysis</article-title><source>JMIRx Med</source><year>2024</year><volume>5</volume><fpage>e45973</fpage><pub-id pub-id-type="doi">10.2196/45973</pub-id></nlm-citation></ref><ref id="ref2"><label>2</label><nlm-citation citation-type="journal"><person-group person-group-type="author"><name name-style="western"><surname>Benedetto</surname><given-names>U</given-names> </name><name name-style="western"><surname>Sinha</surname><given-names>S</given-names> </name><name name-style="western"><surname>Lyon</surname><given-names>M</given-names> </name><etal/></person-group><article-title>Can machine learning improve mortality prediction following cardiac surgery?</article-title><source>Eur J Cardiothorac Surg</source><year>2020</year><month>12</month><day>1</day><volume>58</volume><issue>6</issue><fpage>1130</fpage><lpage>1136</lpage><pub-id pub-id-type="doi">10.1093/ejcts/ezaa229</pub-id><pub-id pub-id-type="medline">32810233</pub-id></nlm-citation></ref><ref id="ref3"><label>3</label><nlm-citation citation-type="journal"><person-group person-group-type="author"><name name-style="western"><surname>Zeng</surname><given-names>J</given-names> </name><name name-style="western"><surname>Zhang</surname><given-names>D</given-names> </name><name name-style="western"><surname>Lin</surname><given-names>S</given-names> </name><etal/></person-group><article-title>Comparative analysis of machine learning vs. 
traditional modeling approaches for predicting in-hospital mortality after cardiac surgery: temporal and spatial external validation based on a nationwide cardiac surgery registry</article-title><source>Eur Heart J Qual Care Clin Outcomes</source><year>2024</year><month>03</month><day>1</day><volume>10</volume><issue>2</issue><fpage>121</fpage><lpage>131</lpage><pub-id pub-id-type="doi">10.1093/ehjqcco/qcad028</pub-id><pub-id pub-id-type="medline">37218710</pub-id></nlm-citation></ref><ref id="ref4"><label>4</label><nlm-citation citation-type="journal"><person-group person-group-type="author"><name name-style="western"><surname>Mori</surname><given-names>M</given-names> </name><name name-style="western"><surname>Durant</surname><given-names>TJS</given-names> </name><name name-style="western"><surname>Huang</surname><given-names>C</given-names> </name><etal/></person-group><article-title>Toward dynamic risk prediction of outcomes after coronary artery bypass graft: improving risk prediction with intraoperative events using gradient boosting</article-title><source>Circ Cardiovasc Qual Outcomes</source><year>2021</year><month>06</month><volume>14</volume><issue>6</issue><fpage>e007363</fpage><pub-id pub-id-type="doi">10.1161/CIRCOUTCOMES.120.007363</pub-id><pub-id pub-id-type="medline">34078100</pub-id></nlm-citation></ref></ref-list></back></article>