Kwarah et al. BMC Global and Public Health (2025) 3:64
https://doi.org/10.1186/s44263-025-00184-4

SYSTEMATIC REVIEW | Open Access

© The Author(s) 2025. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Evaluating predictive performance, validity, and applicability of machine learning models for predicting HIV treatment interruption: a systematic review

Williams Kwarah (1,2*), Frances Baaba da-Costa Vroom (1), Duah Dwomoh (1) and Samuel Bosomprah (1)

Abstract

Background: HIV treatment interruption remains a significant barrier to achieving global HIV/AIDS control goals. Machine learning (ML) models offer potential for predicting treatment interruption by leveraging large clinical datasets. Understanding how these models were developed, validated, and applied remains essential for advancing research.

Methods: We searched databases including PubMed, BMC, Cochrane Library, Scopus, ScienceDirect, Lancet, and Google Scholar for studies published in English from 1990 to September 2024. Search terms covered HIV, machine learning, treatment interruption, and loss to follow-up.
Articles were screened and reviewed independently, and data were extracted using the CHecklist for critical Appraisal and data extraction for systematic Reviews of prediction Modelling Studies (CHARMS) tool. Risk of bias was assessed with the Prediction model Risk Of Bias Assessment Tool (PROBAST). The Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines were followed throughout.

Results: Out of 116,672 records, 9 studies met the inclusion criteria and reported 12 ML models. Random forest, XGBoost, and AdaBoost were the predominant models (91.7%). Internal validation was performed for all models, but only two models included external validation. Performance varied, with a mean area under the receiver operating characteristic curve (AUC-ROC) of 0.668 (standard deviation (SD) = 0.066), indicating moderate discrimination. About 75% of models showed a high risk of bias due to inadequate handling of missing data, lack of calibration, and the absence of decision curve analysis (DCA).

Conclusions: ML models show promise for predicting HIV treatment interruption, particularly in resource-limited settings. Future research should prioritize external validation, robust missing data handling, and decision curve analysis, and include sociocultural predictors to improve model robustness.

Systematic review registration: PROSPERO CRD42024578109.

Keywords: HIV treatment interruption, Machine learning, Predictive modeling

Background

Human immunodeficiency virus (HIV) treatment interruption poses a significant challenge to global efforts in the HIV/AIDS epidemic response.
In 2022, an estimated 39 million people were living with HIV (PLHIV) globally, with an estimated 1.3 million new infections and 630,000 deaths reported [1]. The burden of HIV infection is disproportionately high in sub-Saharan Africa, Asia, and the Pacific, which together account for about 88% of all cases [2]. Despite the availability of antiretroviral therapy (ART), which has dramatically reduced the progression of HIV to AIDS and decreased AIDS-related mortality, many individuals living with HIV struggle to maintain consistent adherence to their treatment regimen [3, 4]. It has been estimated that only 46% to 85% of patients remain on ART 2 years after initiation [5, 6]. This lack of adherence is particularly concerning because, left untreated, HIV weakens the immune system and can lead to life-threatening complications [4]. People who stay in treatment remain economically productive, benefiting their families and communities [7]. Interrupting HIV treatment may result in viral rebound, deterioration of the immune system, heightened transmission risk, and the development of drug resistance, thereby compromising both individual health and community prevention initiatives. This places significant pressure on healthcare systems and undermines public health initiatives [8–11]. Improving ART adherence is critical to achieving global HIV/AIDS control goals.

*Correspondence: Williams Kwarah, Kwarah@gmail.com. 1 Department of Biostatistics, School of Public Health, University of Ghana, Accra, Ghana. 2 United States Agency for International Development (USAID), Ghana Mission, Accra, Ghana.
While current strategies to address treatment interruption primarily focus on re-engaging patients after missed doses [12, 13], these reactive measures often fall short of preventing the associated health risks and potential for increased transmission. The ability to predict treatment interruptions before they occur could revolutionize HIV care by enabling healthcare providers to implement targeted and proactive interventions that keep patients on therapy, thus enhancing their chances of achieving and sustaining viral suppression. Machine learning (ML) and artificial intelligence (AI) offer powerful tools for developing such predictive models due to their capacity to dynamically analyze large, complex datasets and uncover patterns that traditional methods might miss [14–18]. Despite the promise of these technologies, there remains a significant evidence gap in their application to HIV treatment adherence, particularly in low-resource settings where the burden of the disease is greatest. Addressing this gap through systematic evaluation of existing predictive models is crucial for advancing the use of ML and AI in HIV care. This can lead to more effective and personalized treatment strategies that help meet the ambitious Joint United Nations Programme on HIV/AIDS (UNAIDS) 95-95-95 targets by 2030 [2].

This systematic review aimed to evaluate the effectiveness of machine learning-based predictive models in forecasting HIV treatment interruptions. Specifically, the review (1) identified the types of predictive models previously developed, (2) assessed their accuracy and applicability in various settings, and (3) determined which models have been validated and how they performed in different populations.
This review could provide insights to guide the integration of advanced predictive technologies into HIV care programs, potentially improving patient retention, optimizing treatment outcomes, and supporting global efforts to eliminate HIV as a public health threat by 2030.

Methods

Search strategy

We searched multiple electronic databases, including Scopus, PubMed, The Lancet, BioMed Central (BMC) Public Health, ScienceDirect, Google Scholar, and Cochrane Library. Our search covered publications from January 1990 to September 2024. We searched using a combination of Medical Subject Headings (MeSH) and free-text terms. The key terms included "HIV," "Human Immunodeficiency Virus," "AIDS," and "Acquired Immunodeficiency Syndrome" for HIV-related concepts; "Machine Learning," "ML," "Artificial Intelligence," "AI," "Neural Networks," and "Predictive Modeling" for machine learning concepts; and "Treatment Interruption," "Loss to Follow-Up," "Non-adherence," "Default," and "Treatment Discontinuation" for treatment adherence concepts. These terms were combined using Boolean operators (AND, OR) to ensure a broad and inclusive search. Details of the search strategy for each database are provided in Additional File 1: Search Strategy.

Eligibility criteria

We applied specific eligibility criteria to select studies for inclusion. Eligible studies focused on developing or validating prediction models for HIV treatment interruption at the individual level using machine learning methods. We only included studies published in English. We included studies that defined HIV treatment interruption as missing a scheduled clinic or pharmacy appointment by at least 28 days. We excluded studies that identified predictors without focusing on prediction models and studies lacking full-text availability. Reviews, commentaries, conference abstracts, letters, reports, and opinions were excluded.
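As an illustration only (the exact registered strings per database are given in Additional File 1), the Boolean combination of the three concept groups described in the search strategy takes a form like:

```
("HIV" OR "Human Immunodeficiency Virus" OR "AIDS" OR "Acquired Immunodeficiency Syndrome")
AND ("Machine Learning" OR "ML" OR "Artificial Intelligence" OR "AI" OR "Neural Networks" OR "Predictive Modeling")
AND ("Treatment Interruption" OR "Loss to Follow-Up" OR "Non-adherence" OR "Default" OR "Treatment Discontinuation")
```

Each database required syntax adjustments (e.g., MeSH tags in PubMed), but the OR-within-concept, AND-across-concepts structure was the same throughout.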
In addition to database searches, we manually reviewed the reference lists of the included studies to identify additional relevant articles. To capture recent and unpublished research, we searched preprint servers such as bioRxiv, medRxiv, and arXiv. The corresponding authors of the included articles were emailed to seek further information and clarity. The search strategy was carefully documented (Additional File 1), and articles were managed using Zotero 6.0.37 reference management software, a project of Digital Scholar [19]. The Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) statement [20] (Additional File 2) and guidance on the conduct of systematic reviews [21] guided the review. A protocol for this review was registered on PROSPERO (CRD42024578109).

Selection process

Article selection was conducted in multiple stages to ensure that only studies meeting the predefined inclusion criteria were included. Initially, two independent reviewers (W. K. and G. J. P. I.) screened the titles and abstracts of all records retrieved from the database searches to identify potentially relevant studies. We resolved any disagreements between reviewers during the article selection process through discussion, and a third reviewer (N. Z.) was available to adjudicate unresolved disputes. To enhance the rigor of the selection process, the systematic review software DistillerSR 2.35, developed by DistillerSR Incorporated [22], was used to assist in the identification and removal of duplicate records before screening began.

Data extraction

Two independent reviewers (W. K. and G. J. P. I.) extracted data from the selected studies to ensure accuracy and consistency. Each reviewer independently extracted data using the standardized CHecklist for critical Appraisal and data extraction for systematic Reviews of prediction Modelling Studies (CHARMS) tool [23, 24].
CHARMS was developed for systematic reviews of prognostic or diagnostic prediction model studies, covering models developed without external validation, models developed with external validation, and external validation studies with or without model updating. The data collected included the data sources, study characteristics, details of the predictive models, outcomes, and performance metrics [24]. We resolved any disagreements between reviewers during the data extraction process through discussion, and a third reviewer (N. Z.) was available to adjudicate unresolved disputes. The reviewers manually extracted all data and then cross-verified it to maintain the integrity of the data collection process. A consolidated final completed CHARMS tool was compiled for this review.

Risk of bias and applicability assessment

We used the Prediction model Risk Of Bias Assessment Tool (PROBAST) [25] to assess the risk of bias (ROB) and applicability of the included studies. PROBAST was designed to evaluate the risk of bias and applicability of prediction model studies across four key domains: participants, predictors, outcome, and analysis. There were two signalling questions on participants, three on predictors, six on the outcome, and nine on the statistical analysis. Responses to these questions were "yes," "probably yes," "probably no," "no," or "no information." The ROB was classified as low, high, or unclear based on the responses within these domains. A domain was classified as high risk if at least one question was answered "no" or "probably no," low risk if all questions were answered "yes" or "probably yes," and unclear if the responses provided no information. If all domains were assessed as having a low risk, the overall risk of bias was classified as low. However, if at least one domain was determined to have a high risk, the overall risk of bias was classified as high.
If there was a recognized concern for bias in at least one domain while the level of concern was low for all others, the study was classified as having a moderate level of concern for bias. Two reviewers (W. K. and G. J. P. I.) independently evaluated the risk of bias in each included study. When the reviewers disagreed on the risk-of-bias judgment, the discrepancies were discussed to reach a consensus. If the disagreement persisted, a third reviewer (N. Z.) was consulted to decide. Similarly, model applicability was assessed for the first three domains (participants, predictors, and outcome) for each model. Model applicability was rated low concern, high concern, or unclear concern based on a defined rubric [25]. If there were low concerns regarding applicability for all domains, the prediction model evaluation was judged to have low concerns regarding applicability. If there were high concerns for at least one domain, the evaluation was judged to have high concerns. If there were unclear concerns (but no high concern) for at least one domain, the evaluation was judged to have unclear concerns regarding applicability overall. We conducted all evaluations manually and documented the results of the risk-of-bias and applicability assessments in detail, with summary judgments presented as charts to facilitate a clear understanding of the quality and reliability of the included studies.

Synthesis and analysis

We tabulated the results of individual studies to provide a clear and organized presentation of the key findings. This included details such as study characteristics, model performance metrics (e.g., area under the receiver operating characteristic curve, calibration statistics), and risk-of-bias assessments. We used visual displays, including charts, to enhance the clarity of the results and to facilitate the comparison of study outcomes.
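The PROBAST domain-level and overall risk-of-bias rules described in the risk-of-bias section reduce to a simple classification. The sketch below is an illustrative reduction of those rules, not the official PROBAST tool; the answer strings and function names are our own.

```python
# Illustrative sketch of the PROBAST judgement rules described above.

def domain_rob(answers):
    """Risk of bias for one domain from its signalling-question answers."""
    if any(a in ("no", "probably no") for a in answers):
        return "high"
    if all(a in ("yes", "probably yes") for a in answers):
        return "low"
    return "unclear"  # some answers carry no information

def overall_rob(domains):
    """Overall risk of bias across the four domain ratings."""
    if any(d == "high" for d in domains):
        return "high"
    if all(d == "low" for d in domains):
        return "low"
    return "unclear"
```

For example, a model whose analysis domain contains a single "probably no" is rated high risk overall, regardless of how the other three domains fare.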
For the synthesis of results, we used a narrative synthesis approach due to the anticipated heterogeneity of the included studies, particularly in terms of model types, outcome measures, and study populations. This approach allowed us to systematically describe and compare the predictive models, highlighting common themes and differences among the studies. We did not perform a meta-analysis because there were insufficient external validation studies of the same index model to justify a quantitative synthesis [21]. The synthesis followed guidelines from the Transparent Reporting of a Multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD) statement [26], the CHARMS checklist [24], and PROBAST [25].

Results

Characteristics of included studies

Our search identified 116,672 records, of which 9 studies met the inclusion criteria (Fig. 1). Seven of these studies focused on developing predictive models [27–33], while two included both model development and validation [34, 35]. Six studies were conducted in Africa [27–30, 33, 34]: three in South Africa, one in Nigeria, one in Tanzania, and one combining data from Nigeria and Mozambique. The remaining three studies were conducted in the United States of America (USA) [31, 32, 35] (Table 1). These studies were published between 2018 and 2024, with the majority published in 2023 and 2022. Seven studies were conducted in public healthcare facilities, while two were conducted in university clinics. Seven studies relied on retrospective cohort data, while two used existing registries (Table 1). Heterogeneity was not explored, as only three models were externally validated.

Fig. 1 PRISMA flow of article selection

Model performance metrics

Model performance is often measured using different metrics covering overall performance, discrimination, calibration, and (re)classification.
Discrimination assesses the model's capacity to differentiate between individuals who do and do not have the outcome. The c-statistic, which is equivalent to the area under the receiver operating characteristic curve (AUC-ROC), is frequently used to assess discrimination. Other classification measures such as sensitivity, specificity, negative predictive value (NPV), positive predictive value (PPV), and F1 score are also used to assess model discrimination. Calibration measures how well the predicted risks and observed outcomes match and is often assessed using graphical comparison of observed and predicted event rates. Formal statistical tests, such as the Hosmer–Lemeshow test for logistic regression, are commonly used in conjunction with calibration plots.

Among the 9 studies selected, a total of 12 machine learning models were reported, with 9 focused on model development and 3 on model validation (Table 2). The median sample size across studies was 136,415 (interquartile range: 178–450,000), though 1 model was developed using a sample size of less than 1000 participants. On average, 15 predictors (standard deviation (SD) = 4.0) were included in the final models. Ensemble learning techniques were the most frequently used algorithms, accounting for 92% of the total models. These included random forest (three models), Adaptive Boosting (AdaBoost, three models), Extreme Gradient Boosting (XGBoost, two models), decision trees (two models), and Categorical Boosting (CatBoost, one model) (Table 2). Logistic regression was used in only one model.

Model performance was primarily assessed using the c-statistic or area under the receiver operating characteristic curve (AUC), with an average AUC of 0.668 (SD: 0.07). Some models also reported additional metrics, including accuracy, sensitivity, specificity, negative predictive value (NPV), and positive predictive value (PPV) (Table 2).
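The two metric families described above, discrimination and calibration, can be computed directly. The following is a minimal self-contained sketch; the toy data are invented for illustration and do not come from any included study.

```python
# Sketch of discrimination (c-statistic, equivalent to AUC-ROC) and a
# crude calibration summary, using only the standard library.

def c_statistic(y_true, y_prob):
    """Probability that a randomly chosen event receives a higher
    predicted risk than a randomly chosen non-event (ties count half)."""
    events = [p for y, p in zip(y_true, y_prob) if y == 1]
    nonevents = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum((e > n) + 0.5 * (e == n) for e in events for n in nonevents)
    return wins / (len(events) * len(nonevents))

def calibration_in_the_large(y_true, y_prob):
    """Simplest calibration check: mean predicted risk vs observed rate.
    (Full calibration assessment compares these within risk groups.)"""
    return sum(y_prob) / len(y_prob), sum(y_true) / len(y_true)

y = [1, 1, 0, 0, 0]
p = [0.8, 0.6, 0.7, 0.2, 0.1]
print(c_statistic(y, p))  # 5 of 6 event/non-event pairs ordered correctly: 0.833...
```

A c-statistic of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation, which is why the review's mean AUC of 0.668 is read as moderate discrimination.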
Notably, two models reported only PPV, while another two reported the Matthews correlation coefficient. Model calibration methods were used in just three models, which reported an average F1 score of 0.292 (SD: 0.01) alongside the AUC. None of the studies used decision curve analysis (DCA) to assess clinical value and implications, a significant limitation in evaluating the practical utility of the models. DCA is essential for assessing a model's clinical relevance by weighing the benefits and risks at different decision thresholds, rendering its exclusion a significant constraint [36]. DCA complements calibration and discrimination measures in machine learning models [37] and helps incorporate the clinical consequences of using a model. Besides conducting DCA, net benefit analysis is an alternative measure for assessing the applicability of models in real-life situations. However, one study addressed model utility by gathering feedback from healthcare workers. Additional information is provided in Additional File 3: Model Characteristics Tab.

Risk-of-bias assessment

We reported the risk-of-bias assessment for the 12 models using the PROBAST tool (Fig. 2). Of these, nine models (75.0%) were rated as having a high risk of bias, two models (16.7%) were rated low risk, and one model (8.3%) had an unclear risk of bias. A notable majority (58.3%) showed high risk in the statistical analysis domain. For example, nearly half of the models failed to report how missing data were handled, and 10 models (83.3%) did not disclose the extent of missing data. Furthermore, only three models (25.0%) provided details on calibration measures, which are important for ensuring the reliability of predictions. None of the studies reported DCA or other methods to assess clinical utility, highlighting a critical gap in evaluating the practical application of these models.
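Net benefit, the quantity at the heart of DCA, is straightforward to compute: at a chosen threshold probability p_t, true positives are credited and false positives penalized by the odds p_t / (1 − p_t). The sketch below is illustrative only; data and names are invented, not drawn from any included study.

```python
# Minimal sketch of the net-benefit calculation underlying decision
# curve analysis (DCA).

def net_benefit(y_true, y_prob, p_t):
    """Net benefit of intervening on everyone with predicted risk >= p_t."""
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= p_t and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= p_t and y == 0)
    return tp / n - (fp / n) * (p_t / (1 - p_t))

# A decision curve plots net_benefit over a range of thresholds and
# compares the model with "intervene for all" and "intervene for none".
```

Reporting this curve would show, for each plausible intervention threshold, whether using a model to target retention support beats the default strategies — exactly the clinical-utility evidence the included studies omitted.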
Additional details on the risk-of-bias analysis are provided in the supplementary material (Additional File 3: PROBAST summary tables).

Applicability assessment

We evaluated the applicability of the models for use in the intended population and primary healthcare settings. Overall, 83% of the models were rated as low concern, indicating their suitability for primary healthcare use. However, 17% were rated as high concern, reflecting limitations in certain aspects of model development (Fig. 3). Predictors were rated as low concern, suggesting that the included predictors were relevant to the target population and routinely collected in clinical settings. Similarly, the outcome domain was rated as low concern in 92% of the models, while 8% were marked as unclear due to insufficient reporting of key details.

Table 1 Characteristics of the included studies

| Author, year | Study design | ML technique | Enrolment period | Study setting | Study region | Age of participants | Female n (%) | Male n (%) | Treatment interruption n (%) |
|---|---|---|---|---|---|---|---|---|---|
| 1. Matthew-David Ogbechie, 2023 [30] | Retrospective cohort | XGBoost | January 2005 - February 2021 | Health facility | Nigeria | | 91,982 (67.3) | 44,765 (32.7) | 56,581 (41.5) |
| 2. Esra, Rachel, 2023 [34] | Existing registry | AdaBoost | January 1, 2017 - March 31, 2020 | Public health facilities | South Africa | 33 (27-41) | 172,170 (65) | 92,707 (35) | 260,467 (11.9) |
| | | CatBoost | January 1, 2017 - October 1, 2018 | Public health facilities | South Africa | 33 (27-41) | 172,170 (65) | 92,707 (35) | |
| 3. Stockman, Jeni, 2022 [33] | Retrospective cohort | Random Forest | January 1, 2010 - November 28, 2019 | Public sector ART clinics | Mozambique | 47.3 (13.6) | | | |
| | | XGBoost | | Public sector ART clinics | Nigeria | 47.3 (13.6) | | | |
| 4. Arthi, Ramachandran, 2020 [32] | Retrospective cohort | Random Forest | January 1, 2008 - May 31, 2015 | University of Chicago HIV care clinic | USA | 47.3 (13.6) | 314 (44) | 399 (56) | |
| | | Decision Trees | | University of Chicago HIV care clinic | USA | 47.3 (13.6) | 314 (44) | 399 (56) | |
| 5. Brian W. Pence, 2018 [31] | Retrospective cohort | Logistic Regression | 2002 - 2015 | US-based HIV primary care clinics | USA | 46 (39-52) | 1660 (16) | 8714 (84) | 17,957 (17) |
| 6. Mhairi, Maskew, 2022 [28] | Retrospective cohort | AdaBoost | January 2016 - December 2018 | Public sector HIV care facilities | South Africa | 39 (31-49) | 311,945 (70) | 133,690 (30) | |
| 7. Mhairi, Maskew, 2024 [29] | Retrospective cohort | AdaBoost | January 2016 - December 2018 | Public sector HIV care facilities | South Africa | 39 (27-49) | 315,124 (68) | 148,294 (32) | |
| 8. Joseph A Mason, 2023 [35] | Existing registry | Random Forest | Jan 21 - March 30, 2022 | Hospital in a university | USA | | | | |
| 9. Carolyn A Fahey, 2022 [27] | Retrospective cohort | Decision Trees | 2018 | HIV care center | Tanzania | 36 (10) | 113 (63.5) | 65 (36.5) | |
Table 2 Summary of model performance metrics using the CHARMS checklist

| Author, year | Modelling method | Sample size | Events n (%) | Candidate predictors | Final predictors | EPV or EPP | Selection of candidate predictors | Selection of final predictors | Missing data n (%); handling | Type of validation | Performance measures |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1. Matthew-David Ogbechie, 2023 [30] | XGBoost | 136,747 | 56,581 (41.4) | 13 | 13 | 4352.4 | Based on prior knowledge | Pre-specified model (not selection) | Unknown; KNN imputation | Int: cross-validation and random split data; Ext: none | Calibration: not evaluated. Discrimination: accuracy 0.85 (0.85-0.86), sensitivity 0.81, specificity 0.88, PPV 0.83, NPV 0.87, kappa 0.70. Overall: not evaluated |
| 2. Esra, Rachel, 2023 [34] | AdaBoost | 264,877 | 35,985 (13.6) | 13 | 13 | 2768.1 | Based on prior knowledge | Recursive feature elimination | 1509 (0.6); single imputation | Int: random split data; Ext: different setting | Calibration: F1 score 0.288 (0.286-0.290). Discrimination: c-statistic/AUC graph; sensitivity 0.608 (0.604-0.611), specificity 0.647 (0.646-0.648), PPV 0.189 (0.187-0.190), NPV 0.924 (0.924-0.925). Overall: not evaluated |
| 3. Esra, Rachel, 2023 [34] | CatBoost | 136,082 | 35,985 (26.4) | 13 | 13 | 2768.1 | Based on prior knowledge | Recursive feature elimination | 1509 (1.1); single imputation | Int: random split data; Ext: different setting | Calibration: F1 score 0.299 (0.297-0.301). Discrimination: c-statistic/AUC graph; sensitivity 0.646 (0.642-0.649), specificity 0.646 (0.645-0.648), PPV 0.195 (0.193-0.196), NPV 0.932 (0.931-0.933). Overall: not evaluated |
| 4. Stockman, Jeni, 2022 [33] | Random Forest | 360,000 | | 70 | 12 | Unknown | Other | No information | Unknown; missing values excluded in analysis | Int: cross-validation; Ext: no information | Calibration: not evaluated. Discrimination: c-statistic/AUC-PR, MCC 0.45. Overall: not evaluated |
| 5. Stockman, Jeni, 2022 [33] | XGBoost | 450,000 | | 70 | 12 | Unknown | Other | No information | Unknown; missing values excluded in analysis | Int: temporal cross-validation; Ext: no information | Calibration: not evaluated. Discrimination: c-statistic/AUC-PR, MCC 0.37. Overall: not evaluated |
| 6. Arthi, Ramachandran, 2020 [32] | Random Forest | 11,445 | 1373 (12.0) | 1000 | 20 | 1.4 | Based on prior knowledge | Other | Unknown; single imputation | Int: temporal cross-validation; Ext: no information | Calibration: not evaluated. Discrimination: PPV 24.5 (SD = 0.01). Overall: not evaluated |
| 7. Arthi, Ramachandran, 2020 [32] | Decision Trees | 11,445 | 1373 (12.0) | 800 | 20 | 1.7 | Based on prior knowledge | Other | Unknown; single imputation | Int: random split data; Ext: no information | Calibration: not evaluated. Discrimination: PPV 15.5 (0.04). Overall: not evaluated |
| 8. Brian W. Pence, 2018 [31] | Logistic regression | 105,628 | 17,957 (17.0) | 14 | 14 | 1282.6 | Based on prior knowledge | Pre-specified model (not selection) | Unknown; no information | Int: cross-validation; Ext: no information | Calibration: not evaluated. Discrimination: c-statistic/AUC graph; sensitivity 0.74 (0.70-0.78), specificity 0.54 (0.44-0.64). Overall: not evaluated |
| 9. Mhairi, Maskew, 2022 [28] | AdaBoost | 1,399,145 | 146,881 (10.5) | 75 | 20 | 1958.4 | Other | Feature selection using random forest | Unknown; other | Int: random split data; Ext: no information | Calibration: F1 score 0.29. Discrimination: c-statistic/AUC graph; accuracy 0.786, sensitivity 0.406, specificity 0.83, NPV 0.92, PPV 0.22. Overall: not evaluated |
| 10. Mhairi, Maskew, 2024 [29] | AdaBoost | 3,264,671 | 146,881 (4.5) | 11 | 10 | 13,352.8 | No information | No information | Unknown; no information | Int: random split data; Ext: no information | Calibration: not evaluated. Discrimination: c-statistic; accuracy 0.63, sensitivity 0.52, specificity 0.64, PPV 0.19, NPV 0.89. Overall: not evaluated |
| 11. Joseph A Mason, 2023 [35] | Random Forest | 331 | 0 (0.0) | 11 | 11 | 0 | Based on prior knowledge | No information | Unknown; no information | Int: random split data; Ext: different dataset and provider feedback | Calibration: not evaluated. Discrimination: c-statistic/AUC graph. Overall: not evaluated |
| 12. Carolyn A Fahey, 2022 [27] | Decision Trees | 178 | 72 (40.4) | 22 | 22 | 3.3 | Based on prior knowledge | No information | 0 (0.0); other | Int: cross-validation and random split data; Ext: unclear | Calibration: not evaluated. Discrimination: c-statistic; accuracy 0.723. Overall: not evaluated |

EPV events per variable, EPP events per predictor, PPV positive predictive value, NPV negative predictive value, AUC-PR area under the precision-recall curve, MCC Matthews correlation coefficient

Model validation

All 12 models reported internal validation, using random sample split (6), cross-validation (4), or a combination of random sample split and cross-validation (2) (Table 2). Three models were externally validated, but only two reported discrimination measures, with an average F1 score of 0.2935 alongside c-statistic (AUC) values. These validations were done using datasets received from registries of people living with HIV and scheduled for clinical appointments. While sensitivity, specificity, PPV, and NPV were included, one model lacked critical details on eligibility criteria and missing data handling. None of the externally validated models assessed clinical utility. Further details are provided in the supplementary material (Additional File 3: Model characteristics tables).

Discussion

This review examined 12 machine learning models developed to predict interruptions in HIV treatment, with most relying on advanced ensemble techniques like random forest, AdaBoost, and XGBoost. These models were built using data from large retrospective cohorts, with a median sample size of 120,000 participants, and were validated internally through methods like cross-validation and random sample splitting.
The models dem- onstrated acceptable predictive performance, with an average AUC-ROC of 0.668, and utilized data commonly collected in clinical settings, making them practical for real-world use. For prognostic predictive models, AUC of 0.5–0.7 suggests poor discrimination, and 0.7–0.8 is considered acceptable, 0.8–0.9 excellent, and > 0.9 as out- standing [38, 39]. Although only two models were exter- nally validated, most models showed strong potential for application in primary healthcare, highlighting their promise in improving adherence and supporting HIV care strategies. Electronic medical records (EMRs) are increasingly prevalent worldwide, including in Africa [40], facilitating the ongoing accumulation of extensive healthcare data and enabling big data analytics [41–46], as well as the application of machine learning and artificial intelligence [44, 47, 48]. Numerous prognostic studies have employed EMR data to create models for predicting individual Fig. 2  Summary of risk-of-bias assessment Fig. 3  Summary of applicability assessment Page 11 of 15Kwarah et al. BMC Global and Public Health (2025) 3:64 diagnoses of HIV, healthcare attendance, and viral load suppression [49–51]. The growing utilization of these analytic tools is likely due to the interest in employing predictive models as decision support instruments at the point of care. Moreover, executing focused, high-impact treatments with limited resources in underprivileged healthcare environments is essential [52, 53]. Two-thirds of the research was conducted in Africa, predominantly in South Africa, an area characterized by a high incidence of HIV [54]. This emphasis is praise- worthy, yet it constrains the comprehension of predictive model application in areas with low prevalence. Utilizing data from high-prevalence regions, such as South Africa, offers essential insights into models that help tackle adherence difficulties in analogous circumstances. 
This emphasis requires careful consideration when extrapolating results to areas with different healthcare systems and adherence challenges. The studies conducted in the USA [31, 32, 35], though few in number, offered a divergent viewpoint, highlighting the necessity for regionally appropriate models.

The machine learning techniques in our analysis have shown significant potential in forecasting treatment interruption by utilizing routinely gathered clinical data. Ensemble learning methodologies, specifically random forest, AdaBoost, and XGBoost, were dominant, collectively representing 91.7% of the models created. Previous studies have demonstrated that ensemble approaches effectively address the complex, nonlinear interactions prevalent in healthcare datasets [55, 56]. These algorithms have achieved above 90% accuracy across many datasets [57, 58]. Ensemble algorithms are beneficial because of their resilience to overfitting and their capacity to handle extensive feature sets. The outcomes of our review correspond with these results. Most models in our study reported the c-statistic (AUC), which evaluates the discriminatory capability of predictive models. The average AUC of 0.668 in our analysis aligns with the findings of Chilamkurthy et al. (2018), who stated that whereas ML models excel at distinguishing different outcomes, clinical performance criteria such as accuracy, sensitivity, and specificity frequently lack efficacy due to the unbalanced datasets or inadequate predictor selection often found in healthcare data. Other studies have emphasized that ML algorithms should employ the AUC, in conjunction with calibration and decision curve analysis, as a more effective and superior metric than accuracy for assessing model performance [59]. We discovered in our review that several studies failed to report calibration and clinical utility.
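To make the c-statistic discussed above concrete, the sketch below computes it directly from its rank-based definition: the probability that a randomly chosen event receives a higher predicted risk than a randomly chosen non-event. This is an illustrative example only; the outcome labels and predicted risks are hypothetical, not data from the reviewed studies.

```python
# Illustrative sketch: the c-statistic (AUC-ROC) via the Mann-Whitney
# rank formulation. All data here are hypothetical.

def c_statistic(y_true, y_score):
    """Probability that a random event outranks a random non-event
    (ties count as 0.5)."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical predicted risks of treatment interruption (1 = interrupted):
y_true  = [1, 1, 1, 0, 0, 0, 0, 0]
y_score = [0.81, 0.40, 0.62, 0.35, 0.57, 0.22, 0.48, 0.30]
print(round(c_statistic(y_true, y_score), 3))  # → 0.867
```

Because the c-statistic depends only on ranks, it is insensitive to whether the predicted probabilities are well calibrated, which is why calibration must be reported separately.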
Although there are many possible pitfalls in the development and validation of prediction models, it is essential to disclose calibration measurements, which are vital components of statistical performance [60, 61]. Calibration measures are essential because they ensure that model-predicted probabilities correspond with observed probabilities, thereby ensuring model dependability. Only 25% of the studies included in our evaluation assessed model calibration. In the absence of calibration, predictive models may produce probabilities that inaccurately reflect actual risks, compromising their clinical relevance [62]. We noted significant problems with the risk of bias (ROB) in the developed prediction models. Seventy-five percent of the reviewed models were classified as exhibiting a high risk of bias, mostly due to inadequacies in statistical analysis and data management. Approximately 83.3% of models did not disclose the magnitude of missing data or the methodologies employed to mitigate it, underscoring this as a key concern. This conclusion aligns with prior research demonstrating that most predictive model studies do not report their methods for addressing missing data [63]. Missing data is a widespread problem in retrospective healthcare datasets and, if not properly managed, can compromise model performance and integrity [63–65]. Several studies have utilized imputation approaches that predict missing values to mirror reality, which increases the probability of acquiring high-quality and reusable data [66]. However, if imputation is not handled appropriately, it can introduce systematic biases and diminish the validity and integrity of models, particularly in datasets used in healthcare research [67, 68]. Furthermore, our review observed the lack of decision curve analysis (DCA) in all the included studies.
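The net-benefit calculation that underlies a decision curve can be sketched in a few lines. The example below is a minimal, hypothetical illustration of the standard weighting (true positives credited, false positives penalized by the odds of the risk threshold); the labels, scores, and thresholds are invented for demonstration.

```python
# Minimal sketch of net benefit, the quantity plotted in decision curve
# analysis (DCA). Data and thresholds are hypothetical.

def net_benefit(y_true, y_score, threshold):
    """Net benefit of treating everyone whose predicted risk is at least
    `threshold`: TP/N minus FP/N weighted by the threshold odds."""
    n = len(y_true)
    tp = sum(1 for y, s in zip(y_true, y_score) if s >= threshold and y == 1)
    fp = sum(1 for y, s in zip(y_true, y_score) if s >= threshold and y == 0)
    return tp / n - fp / n * threshold / (1 - threshold)

y_true  = [1, 1, 1, 0, 0, 0, 0, 0]
y_score = [0.81, 0.40, 0.62, 0.35, 0.57, 0.22, 0.48, 0.30]

# A full decision curve sweeps a range of clinically plausible thresholds
# and compares the model against "treat all" and "treat none" strategies:
for t in (0.2, 0.4, 0.6):
    print(t, net_benefit(y_true, y_score, t))
```

A model is clinically useful at a given threshold only if its net benefit exceeds both default strategies, which is exactly the information that discrimination and accuracy metrics alone cannot provide.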
Besides conducting DCA, net benefit analysis is an alternative measure to assess the applicability of models in real-life situations.

The reviewed models show potential for improving HIV treatment interruption predictions; nevertheless, their reliability and applicability in clinical environments are constrained, as shown in the risk-of-bias and applicability results. Overall, an 83% applicability score was achieved for the reviewed models, suggesting their broad appropriateness for the target groups and settings. This result reflects the incorporation of predictors frequently gathered in clinical contexts, including demographic information, adherence records, and clinical indicators, which improves the practicality of applying these models in actual healthcare settings [69]. Ninety-two percent of models were rated as of minimal concern in the outcome domain; nevertheless, the absence of external validation and decision curve analysis presents serious constraints on their practical use in guiding clinical decisions [62]. For optimal real-world applicability, models must address these deficiencies by integrating external validation across diverse contexts and evaluating clinical significance using methodologies such as DCA, net benefit analysis, or net reclassification improvement assessments. Aligning with clinical workflows is crucial for maximizing the efficacy of machine learning in enhancing adherence and minimizing inappropriate treatment exclusion in HIV care. Strengthening future research through stringent reporting standards and robust statistical methodologies, such as those outlined in the TRIPOD recommendations, is essential to mitigate biases and improve the reliability of predictive modeling in HIV care [70].

The results of this review should be interpreted with certain limitations in mind.
First, the review included only journal articles published in English with free-text availability, and the search was conducted across a limited number of databases, which may introduce language and publication bias. Excluding studies published in languages other than English presents a potential selection bias and limits the generalizability of the findings to English-speaking settings. To address potential selection and publication bias stemming from the restricted database search, we supplemented our efforts by conducting backward and forward citation searches in Google Scholar and reviewing article references. Most of the included studies were conducted in resource-poor settings, which made it difficult for validation studies to be carried out; in such circumstances, validation studies should be conducted on different datasets or in different settings.

Future studies should prioritize robust external validation across diverse populations and geographic regions, which is essential to evaluate model performance under varying demographic, clinical, and systemic conditions and to ensure reliability in real-world applications. The inclusion of sociocultural and structural factors in model development should also be considered in future research. Addressing missing data is likewise critical for enhancing model accuracy and reliability; future studies should adopt systematic strategies such as multiple imputation or sensitivity analyses and adhere to standardized reporting guidelines like TRIPOD. Finally, incorporating decision curve analysis (DCA) into model assessment is recommended to bridge the gap between statistical performance and practical, real-world impact.

Conclusions
This study provides key insights into the current state of predictive modeling for HIV treatment interruptions.
Machine learning, particularly ensemble learning techniques, is widely used with retrospective cohort data to address adherence issues in HIV programs, demonstrating moderate accuracy and applicability in primary healthcare settings. However, critical shortcomings, including insufficient calibration reporting, lack of decision curve analysis (DCA), and limited external validation, restrict the models' clinical utility and generalizability. Predictive modeling holds significant promise in supporting countries to achieve the UNAIDS 95-95-95 targets by advancing equitable access to medication, sustaining high treatment retention rates, and achieving widespread viral load suppression.

Abbreviations
HIV: Human immunodeficiency virus
AIDS: Acquired immunodeficiency syndrome
PLHIV: People living with HIV
ART: Antiretroviral therapy
ML: Machine learning
AI: Artificial intelligence
UNAIDS: Joint United Nations Programme on HIV/AIDS
PRISMA: Preferred Reporting Items for Systematic reviews and Meta-Analyses
PROSPERO: International Prospective Register of Systematic Reviews
BMC: BioMed Central
MeSH: Medical Subject Headings
CHARMS: CHecklist for critical Appraisal and data extraction for systematic Reviews of prediction Modelling Studies
PROBAST: Prediction model Risk Of Bias Assessment Tool
ROB: Risk of bias
TRIPOD: Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis
SD: Standard deviation
XGBoost: Extreme Gradient Boosting
AdaBoost: Adaptive Boosting
CatBoost: Categorical Boosting
AUC-ROC: Area under the receiver operating characteristic curve
AUC-PR: Area under the precision-recall curve
NPV: Negative predictive value
PPV: Positive predictive value
EPV: Events per variable
EPP: Events per predictor
MCC: Matthews correlation coefficient
DCA: Decision curve analysis
EMR: Electronic medical records

Supplementary Information
The online version contains supplementary material available at https://doi.org/10.1186/s44263-025-00184-4.

Additional File 1: Search Strategy (Revised).
Additional File 2: PRISMA 2020 Checklist.
Additional File 3: CHARMS checklist, PROBAST checklist. Study characteristics: Table 1. Characteristics of the studies included in the systematic review. Model characteristics: Table 2: Characteristics of the models included in the systematic review and critical for risk of bias and applicability. PROBAST summary: Table 3: Risk of bias and applicability assessment. Drop-down lists for CHARMS.

Acknowledgements
We would like to express our sincere gratitude to Gabriel Jamal Peazang Ibrahim and Nabilatu Zakari for their assistance in data extraction. We would also like to express sincere gratitude to Dr. Ekua E. Houphouet and Dr. Jasmin Kwarah for generously reviewing the manuscript and providing the stationery that was crucial for the successful completion of this systematic review.

Authors' contributions
WK conceived the research topic, led data review and extraction, analyzed and interpreted the extracted data, and wrote the first draft of the manuscript. FBV, DD, and SB contributed to the methods, analysis, and reporting and reviewed the manuscript. All authors read and approved the final manuscript.

Funding
Not applicable.

Data availability
All data generated or analyzed during this study are part of the supplementary information in the Additional File 3: SUMMARY, CHARMS, and PROBAST tabs.

Declarations

Ethics approval and consent to participate
Given that this study is nested within another study on HIV treatment interruptions, ethical approval was received from the Ghana Health Service Ethics Review Committee with approval number GHS-ERC:003/08/24. All ethical principles were followed in this review. Consent to participate is not applicable.
Consent for publication Not applicable. Competing interests The authors declare no competing interests. Received: 11 January 2025 Accepted: 9 July 2025 References 1. UNAIDS_FactSheet_en.pdf, (n.d.). https://​www.​unaids.​org/​sites/​defau​lt/​ files/​media_​asset/​UNAIDS_​FactS​heet_​en.​pdf. Accessed 17 Dec 2024. 2. Frescura L, Godfrey-Faussett P, Feizzadeh AA, El-Sadr W, Syarif O, Ghys PD. Achieving the 95 95 95 targets for all: a pathway to ending AIDS. PLoS ONE. 2022;17:e0272405. https://​doi.​org/​10.​1371/​journ​al.​pone.​02724​05. 3. Altice F, Evuarherhe O, Shina S, Carter G, Beaubrun AC. Adherence to HIV treatment regimens: systematic literature review and meta-analysis. Patient Prefer Adherence. 2019;13:475–90. https://​doi.​org/​10.​2147/​PPA.​S1927​35. 4. Dubrocq G, Rakhmanina N. Antiretroviral therapy interruptions: impact on HIV treatment and transmission. HIVAIDS - Res Palliat Care. 2018;10:91– 101. https://​doi.​org/​10.​2147/​HIV.​S1419​65. 5. Akpan U, Kakanfo K, Ekele OD, Ukpong K, Toyo O, Nwaokoro P, James E, Pandey S, Olatubosun K, Bateganya M. Predictors of treatment interrup‑ tion among patients on antiretroviral therapy in Akwa Ibom, Nigeria: outcomes after 12 months. AIDS Care. 2023;35:114–22. https://​doi.​org/​10.​ 1080/​09540​121.​2022.​20938​26. 6. Rosen S, Fox MP, Gill CJ. Patient retention in antiretroviral therapy pro‑ grams in sub-Saharan Africa: a systematic review. PLoS Med. 2007;4: e298. https://​doi.​org/​10.​1371/​journ​al.​pmed.​00402​98. 7. Thirumurthy H, Galárraga O, Larson B, Rosen S. HIV treatment produces economic returns through increased work and education, and warrants continued US support. Health Aff Proj Hope. 2012;31:1470–7. https://​doi.​ org/​10.​1377/​hltha​ff.​2012.​0217. 8. Jewell B, Smith J, Hallett T. The potential impact of interrup‑ tions to HIV services: a modelling case study for South Africa. 2020.2020.04.22.20075861. https://​doi.​org/​10.​1101/​2020.​04.​22.​20075​861. 9. 
Mills EJ, Funk A, Kanters S, Kawuma E, Cooper C, Mukasa B, Odit M, Kara‑ magi Y, Mwehire D, Nachega J, Yaya S, Featherstone A, Ford N. Long-term health care interruptions among HIV-positive patients in Uganda. JAIDS J Acquir Immune Defic Syndr. 2013;63: e23. https://​doi.​org/​10.​1097/​QAI.​ 0b013​e3182​8a3fb8. 10. Thomadakis C, Yiannoutsos CT, Pantazis N, Diero L, Mwangi A, Musick BS, Wools-Kaloustian K, Touloumi G. The effect of HIV treatment inter‑ ruption on subsequent immunological response. Am J Epidemiol. 2023;192:1181–91. https://​doi.​org/​10.​1093/​aje/​kwad0​76. 11. Trickey A, Zhang L, Rentsch CT, Pantazis N, Izquierdo R, Antinori A, Leierer G, Burkholder G, Cavassini M, Palacio-Vieira J, Gill MJ, Teira R, Stephan C, Obel N, Vehreschild J-J, Sterling TR, Van Der Valk M, Bonnet F, Crane HM, Silverberg MJ, Ingle SM, Sterne JAC, the A.T.C. Collaboration (ART-CC). Care interruptions and mortality among adults in Europe and North America. AIDS. 2024;38:1533. https://​doi.​org/​10.​1097/​QAD.​00000​00000​ 003924. 12. Chamberlin S, Mphande M, Phiri K, Kalande P, Dovel K. How HIV clients find their way back to the ART clinic: a qualitative study of disen‑ gagement and re-engagement with HIV care in Malawi. AIDS Behav. 2022;26:674–85. https://​doi.​org/​10.​1007/​s10461-​021-​03427-1. 13. Palacio-Vieira J, Reyes-Urueña JM, Imaz A, Bruguera A, Force L, Llaveria AO, Llibre JM, Vilaró I, Borràs FH, Falcó V, Riera M, Domingo P, de Lazzari E, Miró JM, Casabona J. Strategies to reengage patients lost to follow up in HIV care in high income countries, a scoping review. BMC Public Health. 2021;21:1596. https://​doi.​org/​10.​1186/​s12889-​021-​11613-y. 14. Bektaş M, Tuynman JB, Costa Pereira J, Burchell GL, van der Peet DL. Machine learning algorithms for predicting surgical outcomes after colorectal surgery: a systematic review. World J Surg. 2022;46:1. https://​ doi.​org/​10.​1007/​s00268-​022-​06728-1. 15. Huang Y, Li J, Li M, Aparasu RR. 
Application of machine learning in predicting survival outcomes involving real-world data: a scoping review. BMC Med Res Methodol. 2023;23:268. https://​doi.​org/​10.​1186/​ s12874-​023-​02078-1. 16. Senders JT, Staples PC, Karhade AV, Zaki MM, Gormley WB, Broekman MLD, Smith TR, Arnaout O. Machine learning and neurosurgical outcome prediction: a systematic review. World Neurosurg. 2018;109:476-486.e1. https://​doi.​org/​10.​1016/j.​wneu.​2017.​09.​149. 17. E.W. Steyerberg, Applications of Prediction Models, in: E.W. Steyerberg (Ed.), Clin. Predict. Models Pract. Approach Dev. Valid. Updat., Springer International Publishing, Cham, 2019: pp. 15–36. https://​doi.​org/​10.​1007/​ 978-3-​030-​16399-0_2. 18. Zu W, Huang X, Xu T, Du L, Wang Y, Wang L, Nie W. Machine learning in predicting outcomes for stroke patients following rehabilitation treat‑ ment: a systematic review. PLoS ONE. 2023;18: e0287308. https://​doi.​org/​ 10.​1371/​journ​al.​pone.​02873​08. 19. Corporation for Digital Scholarship. Zotero (6.0.37) [Software]. Listing the institution (Corporation for Digital Scholarship) instead of individu‑ als is advisable because several programmers and an active community contributed to developing the software. 2023. https://​www.​zotero.​org/. Original work published 2006. 20. Page MJ, Moher D, Bossuyt PM, Boutron I, Hoffmann TC, Mulrow CD, Shamseer L, Tetzlaff JM, Akl EA, Brennan SE, Chou R, Glanville J, Grimshaw JM, Hróbjartsson A, Lalu MM, Li T, Loder EW, Mayo-Wilson E, McDonald S, McGuinness LA, Stewart LA, Thomas J, Tricco AC, Welch VA, Whiting P, McKenzie JE. PRISMA 2020 explanation and elaboration: updated guidance and exemplars for reporting systematic reviews. BMJ. 2021;372: n160. https://​doi.​org/​10.​1136/​bmj.​n160. 21. Damen JAA, Moons KGM, van Smeden M, Hooft L. How to conduct a systematic review and meta-analysis of prognostic model studies. Clin Microbiol Infect. 2023;29:434–40. https://​doi.​org/​10.​1016/j.​cmi.​2022.​07.​019. 22. 
Systematic Review and Literature Review Software by DistillerSR, Distill‑ erSR (n.d.). https://​www.​disti​llersr.​com/. Accessed 17 Dec 2024. 23. Liberati A, Altman DG, Tetzlaff J, Mulrow C, Gøtzsche PC, Ioannidis JPA, Clarke M, Devereaux PJ, Kleijnen J, Moher D. The PRISMA Statement for Reporting Systematic reviews and Meta-Analyses of studies that evaluate health care interventions: explanation and elaboration. PLOS Med. 2009;6: e1000100. https://​doi.​org/​10.​1371/​journ​al.​pmed.​10001​00. 24. Moons KGM, de Groot JAH, Bouwmeester W, Vergouwe Y, Mallett S, Altman DG, Reitsma JB, Collins GS. Critical appraisal and data extraction for system‑ atic reviews of prediction modelling studies: the CHARMS checklist. PLOS Med. 2014;11: e1001744. https://​doi.​org/​10.​1371/​journ​al.​pmed.​10017​44. 25. Wolff RF, Moons KGM, Riley RD, Whiting PF, Westwood M, Collins GS, Reitsma JB, Kleijnen J, Mallett S. PROBAST Group†, PROBAST: a tool to assess the risk of bias and applicability of prediction model studies. Ann Intern Med. 2019;170:51–8. https://​doi.​org/​10.​7326/​M18-​1376. 26. Moons KGM, Altman DG, Reitsma JB, Ioannidis JPA, Macaskill P, Steyerberg EW, Vickers AJ, Ransohoff DF, Collins GS. Transparent Reporting of a multi‑ variable prediction model for Individual Prognosis Or Diagnosis (TRIPOD): explanation and elaboration. Ann Intern Med. 2015;162:W1–73. https://​ doi.​org/​10.​7326/​M14-​0698. 27. Fahey CA, Wei L, Njau PF, Shabani S, Kwilasa S, Maokola W, Packel L, Zheng Z, Wang J, McCoy SI. Machine learning with routine electronic medical record data to identify people at high risk of disengagement from HIV care in Tanzania. PLOS Glob Public Health. 2022;2: e0000720. https://​doi.​ org/​10.​1371/​journ​al.​pgph.​00007​20. 28. 
Maskew M, Sharpey-Schafer K, De Voux L, Crompton T, Bor J, Rennick M, Chirowodza A, Miot J, Molefi S, Onaga C, Majuba P, Sanne I, Pisa P. Applying machine learning and predictive modeling to retention and viral suppression in South African HIV treatment cohorts. Sci Rep. 2022;12:12715.
https://​doi.​org/​10.​1038/​s41598-​022-​16062-0. 29. Maskew M, Smith S, Voux LD, Sharpey-Schafer K, Crompton T, Govender A, Pisa P, Rosen S. Triaging clients at risk of disengagement from HIV care: application of a predictive model to clinical trial data in South Africa. 2024.2024.08.05.24311488. https://​doi.​org/​10.​1101/​2024.​08.​05.​24311​488. 30. Ogbechie M-D, Walker CF, Lee M-T, Gana AA, Oduola A, Idemudia A, Edor M, Harris EL, Stephens J, Gao X, Chen P-L, Persaud NE. Predicting treatment interruption among people living with HIV in Nigeria: machine learning approach. JMIR AI. 2023;2: e44432. https://​doi.​org/​10.​2196/​44432. 31. Pence BW, Bengtson AM, Boswell S, Christopoulos KA, Crane HM, Geng E, Keruly JC, Mathews WC, Mugavero MJ. Who will show? Predicting missed visits among patients in routine HIV primary care in the United States, AIDS Behav. 2019;23:418–26. https://​doi.​org/​10.​1007/​s10461-​018-​2215-1. 32. Ramachandran A, Kumar A, Koenig H, De Unanue A, Sung C, Walsh J, Schneider J, Ghani R, Ridgway JP. Predictive analytics for retention in care in an urban HIV clinic. Sci Rep. 2020;10:6421. https://​doi.​org/​10.​1038/​ s41598-​020-​62729-x. 33. Stockman J, Friedman J, Sundberg J, Harris E. Predictive analytics using machine learning to identify ART clients at health system level at greatest risk of treatment interruption in Mozambique and Nigeria. JAIDS J Acquir Immune Defic Syndr. 2022. https://​doi.​org/​10.​1097/​QAI.​00000​00000​002947.​ 10.​1097/​QAI.​00000​00000​002947. 34. Esra R, Carstens J, Le Roux S, Mabuto T, Eisenstein M, Keiser O, Orel E, Mer‑ zouki A, De Voux L, Maskew M, Sharpey-Schafer K. Validation and improve‑ ment of a machine learning model to predict interruptions in antiretroviral treatment in South Africa. JAIDS J Acquir Immune Defic Syndr. 2023;92:42. https://​doi.​org/​10.​1097/​QAI.​00000​00000​003108. 35. Mason JA, Friedman EE, Rojas JC, Ridgway JP. 
No-show prediction model performance among people with HIV: external validation study. J Med Internet Res. 2023;25: e43277. https://​doi.​org/​10.​2196/​43277. 36. Vickers AJ, van Calster B, Steyerberg EW. A simple, step-by-step guide to interpreting decision curve analysis. Diagn Progn Res. 2019;3:18. https://​doi.​ org/​10.​1186/​s41512-​019-​0064-7. 37. Wu Y, Xu L, Yang P, Lin N, Huang X, Pan W, Li H, Lin P, Li B, Bunpetch V, Luo C, Jiang Y, Yang D, Huang M, Niu T, Ye Z. Survival prediction in high-grade osteosarcoma using radiomics of diagnostic computed tomography. eBio‑ Medicine. 2018;34:27–34. https://​doi.​org/​10.​1016/j.​ebiom.​2018.​07.​006. 38. Carrington AM, Manuel DG, Fieguth PW, Ramsay T, Osmani V, Wernly B, Bennett C, Hawken S, Magwood O, Sheikh Y, McInnes M, Holzinger A. Deep ROC analysis and AUC as balanced average accuracy, for Improved Clas‑ sifier Selection, Audit and Explanation. IEEE Trans Pattern Anal Mach Intell. 2023;45:329–41. https://​doi.​org/​10.​1109/​TPAMI.​2022.​31453​92. 39. White N, Parsons R, Collins G, Barnett A. Evidence of questionable research practices in clinical prediction models. BMC Med. 2023;21:339. https://​doi.​ org/​10.​1186/​s12916-​023-​03048-6. 40. Akanbi MO, Ocheke AN, Agaba PA, Daniyam CA, Agaba EI, Okeke EN, Ukoli CO. Use of electronic health records in sub-Saharan Africa: progress and challenges. J Med Trop. 2012;14:1. 41. Colombo F, Oderkirk J, Slawomirski L. Health information systems, electronic medical records, and big data in global healthcare: progress and challenges in OECD countries, in: R. Haring, I. Kickbusch, D. Ganten, M. Moeti (Eds.), Handb. Glob. Health, Springer International Publishing, Cham, 2020: pp. 1–31. https://​doi.​org/​10.​1007/​978-3-​030-​05325-3_​71-1. 42. Cyganek B, Graña M, Krawczyk B, Kasprzak A, Porwik P, Walkowiak K, Woźniak M. A survey of big data issues in electronic health record analysis. Appl Artif Intell. 2016;30:497–520. 
https://​doi.​org/​10.​1080/​08839​514.​2016.​11937​14. 43. Khan ZF, Alotaibi SR. Applications of artificial intelligence and big data analytics in m-Health: a healthcare system perspective. J Healthc Eng. 2020;2020:8894694. https://​doi.​org/​10.​1155/​2020/​88946​94. 44. Schwartz JT, Gao M, Geng EA, Mody KS, Mikhail CM, Cho SK. Applications of machine learning using electronic medical records in spine surgery. Neurospine. 2019;16:643–53. https://​doi.​org/​10.​14245/​ns.​19383​86.​193. 45. Shinozaki A. Electronic medical records and machine learning in approaches to drug development, in: Artif. Intell. Oncol. Drug Discov. Dev., IntechOpen, 2020. https://​doi.​org/​10.​5772/​intec​hopen.​92613. 46. Syed FM, F.K.E. S, AI in securing electronic health records (EHR) systems. Int J Adv Eng Technol Innov. 1 (2024) 593–620. 47. Kawamoto K, Finkelstein J, Fiol GD. Implementing machine learning in the electronic health record: checklist of essential considerations. Mayo Clin Proc. 2023;98:366–9. https://​doi.​org/​10.​1016/j.​mayocp.​2023.​01.​013. 48. Weng SF, Reps J, Kai J, Garibaldi JM, Qureshi N. Can machine-learning improve cardiovascular risk prediction using routine clinical data? PLoS ONE. 2017;12: e0174944. https://​doi.​org/​10.​1371/​journ​al.​pone.​01749​44. 49. Critelli B, Hassan A, Lahooti I, Noh L, Park JS, Tong K, Lahooti A, Matzko N, Adams JN, Liss L, Quion J, Restrepo D, Nikahd M, Culp S, Lacy-Hulbert A, Speake C, Buxbaum J, Bischof J, Yazici C, Phillips AE, Terp S, Weissman A, Conwell D, Hart P, Ramsey M, Krishna S, Han S, Park E, Shah R, Akshintala V, Windsor JA, Mull NK, Papachristou GI, Celi LA, Lee PJ. A systematic review of machine learning-based prognostic models for acute pancreatitis: towards improving methods and reporting quality. 2024;2024.06.26.24309389. https://​doi.​org/​10.​1101/​2024.​06.​26.​24309​389. 50. Endebu T, Taye G, Addissie A, Deksisa A, Deressa W. 
Electronic medical record-based prediction models developed and deployed in the HIV care continuum: a systematic review. Discov Health Syst. 2024;3:25. https://​doi.​ org/​10.​1007/​s44250-​024-​00092-8. 51. Ridgway JP, Lee A, Devlin S, Kerman J, Mayampurath A. Machine learning and clinical informatics for improving HIV care continuum outcomes. Curr HIV/AIDS Rep. 2021;18:229–36. https://​doi.​org/​10.​1007/​s11904-​021-​00552-3. 52. Chin RJ, Sangmanee D, Piergallini L. PEPFAR funding and reduction in HIV infection rates in 12 focus sub-Saharan African countries: a quantitative analysis. Int J MCH AIDS. 2015;3:150. 53. Pal M, Parija S, Panda G, Dhama K, Mohapatra RK. Risk prediction of cardiovascular disease using machine learning classifiers. Open Med. 2022;17:1100–13. https://​doi.​org/​10.​1515/​med-​2022-​0508. 54. South Africa, (n.d.). https://​www.​unaids.​org/​en/​regio​nscou​ntries/​count​ries/​ south​africa. Accessed 17 Dec 2024. 55. Dietterich TG. Ensemble methods in machine learning, in: Mult. Clas‑ sif. Syst., Springer, Berlin, Heidelberg, 2000: pp. 1–15. https://​doi.​org/​10.​ 1007/3-​540-​45014-9_1. 56. Rane N, Choudhary SP, Rane J. Ensemble deep learning and machine learn‑ ing: applications, opportunities, challenges, and future directions. Stud Med Health Sci. 1 (2024) 18–41. https://​doi.​org/​10.​48185/​smhs.​v1i2.​1225. 57. Namamula LR, Chaytor D. Effective ensemble learning approach for large- scale medical data analytics. Int J Syst Assur Eng Manag. 2024;15:13–20. https://​doi.​org/​10.​1007/​s13198-​021-​01552-7. 58. Chilamkurthy S, Ghosh R, Tanamala S, Biviji M, Campeau NG, Venugopal VK, Mahajan V, Rao P, Warier P. Development and validation of deep learning algorithms for detection of critical findings in head CT scans, 2018. https://​ doi.​org/​10.​48550/​arXiv.​1803.​05854. 59. Huang J, Ling CX. Using AUC and accuracy in evaluating learning algo‑ rithms. IEEE Trans Knowl Data Eng. 2005;17:299–310. 
https://​doi.​org/​10.​ 1109/​TKDE.​2005.​50. 60. Alba AC, Agoritsas T, Walsh M, Hanna S, Iorio A, Devereaux PJ, McGinn T, Guyatt G. Discrimination and calibration of clinical prediction models: users’ guides to the medical literature. JAMA. 2017;318:1377–84. https://​doi.​org/​ 10.​1001/​jama.​2017.​12126. 61. Binuya MAE, Engelhardt EG, Schats W, Schmidt MK, Steyerberg EW. Meth‑ odological guidance for the evaluation and updating of clinical prediction models: a systematic review. BMC Med Res Methodol. 2022;22:316. https://​ doi.​org/​10.​1186/​s12874-​022-​01801-8. 62. Van Calster B, Nieboer D, Vergouwe Y, De Cock B, Pencina MJ, Steyerberg EW. A calibration hierarchy for risk models was defined: from utopia to empirical data. J Clin Epidemiol. 2016;74:167–76. https://​doi.​org/​10.​1016/j.​jclin​epi.​2015.​12.​005. 63. Nijman SWJ, Leeuwenberg AM, Beekers I, Verkouter I, Jacobs JJL, Bots ML, Asselbergs FW, Moons KGM, Debray TPA. Missing data is poorly handled and reported in prediction model studies using machine learning: a literature review. J Clin Epidemiol. 2022;142:218–29. https://​doi.​org/​10.​1016/j.​jclin​epi.​ 2021.​11.​023. 64. Misra DP, Yadav AS. Impact of preprocessing methods on healthcare predic‑ tions. 2019. https://​doi.​org/​10.​2139/​ssrn.​33495​86. 65. Newman DA. Missing data: five practical guidelines, Organ. Res. Methods. 2014;17:372–411. https://​doi.​org/​10.​1177/​10944​28114​548590. 66. Afkanpour M, Hosseinzadeh E, Tabesh H. Identify the most appropriate imputation method for handling missing values in clinical structured data‑ sets: a systematic review. BMC Med Res Methodol. 2024;24:188. https://​doi.​ org/​10.​1186/​s12874-​024-​02310-6. 67. Buuren, S. van. Flexible Imputation of Missing Data. CRC Press. 2012. 
68. Rios R, Miller RJ, Manral N, Sharir T, Einstein AJ, Fish MB, Ruddy TD, Kaufmann PA, Sinusas AJ, Miller EJ, Bateman TM, Dorbala S, Carli MD, Kriekinge SDV, Kavanagh PB, Parekh T, Liang JX, Dey D, Berman DS, Slomka PJ. Handling missing values in machine learning to predict patient-specific risk of adverse cardiac events: insights from REFINE SPECT registry. Comput Biol Med. 2022;145:105449. https://doi.org/10.1016/j.compbiomed.2022.105449. 69. Vickers AJ, Van Calster B, Wynants L, Steyerberg EW. Decision curve analysis: confidence intervals and hypothesis testing for net benefit. Diagn Progn Res. 2023;7:11. https://doi.org/10.1186/s41512-023-00148-y. 70. Collins GS, Reitsma JB, Altman DG, Moons KGM. Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD): the TRIPOD statement. BMJ. 2015;350:g7594. https://doi.org/10.1136/bmj.g7594. Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. 