Kwarah et al. BMC Global and Public Health (2025) 3:64
https://doi.org/10.1186/s44263-025-00184-4

SYSTEMATIC REVIEW | Open Access

© The Author(s) 2025. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Evaluating predictive performance, validity, and applicability of machine learning models for predicting HIV treatment interruption: a systematic review

Williams Kwarah (1,2*), Frances Baaba da-Costa Vroom (1), Duah Dwomoh (1) and Samuel Bosomprah (1)

Abstract

Background: HIV treatment interruption remains a significant barrier to achieving global HIV/AIDS control goals. Machine learning (ML) models offer potential for predicting treatment interruption by leveraging large clinical datasets. Understanding how these models were developed, validated, and applied remains essential for advancing research.

Methods: We searched databases including PubMed, BMC, Cochrane Library, Scopus, ScienceDirect, Lancet, and Google Scholar for studies published in English from 1990 to September 2024. Search terms covered HIV, machine learning, treatment interruption, and loss to follow-up.
Articles were screened and reviewed independently, and data were extracted using the CHecklist for critical Appraisal and data extraction for systematic Reviews of prediction Modelling Studies (CHARMS) tool. Risk of bias was assessed with the Prediction model Risk Of Bias Assessment Tool (PROBAST). The Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines were followed throughout.

Results: Out of 116,672 records, 9 studies met the inclusion criteria and reported 12 ML models. Random forest, XGBoost, and AdaBoost were the predominant models (91.7%). Internal validation was performed for all models, but only two models included external validation. Performance varied, with a mean area under the receiver operating characteristic curve (AUC-ROC) of 0.668 (standard deviation (SD) = 0.066), indicating moderate discrimination. About 75% of models showed a high risk of bias due to inadequate handling of missing data, lack of calibration, and the absence of decision curve analysis (DCA).

Conclusions: ML models show promise for predicting HIV treatment interruption, particularly in resource-limited settings. Future research should prioritize external validation, robust missing data handling, and decision curve analysis, and include sociocultural predictors to improve model robustness.

Systematic review registration: PROSPERO CRD42024578109.

Keywords: HIV treatment interruption, Machine learning, Predictive modeling

Background

Human immunodeficiency virus (HIV) treatment interruption poses a significant challenge to global efforts in the HIV/AIDS epidemic response.
In 2022, an estimated 39 million people were living with HIV (PLHIV) globally, with an estimated 1.3 million new infections and 630,000 deaths reported [1]. The burden of HIV infection is disproportionately high in sub-Saharan Africa, Asia, and the Pacific, which together account for about 88% of all cases [2]. Despite the availability of antiretroviral therapy (ART), which has dramatically reduced the progression of HIV to AIDS and decreased AIDS-related mortality, many individuals living with HIV struggle to maintain consistent adherence to their treatment regimen [3, 4]. It has been estimated that only 46% to 85% of patients remain on ART 2 years after initiation [5, 6]. This lack of adherence is particularly concerning because, left untreated, HIV weakens the immune system and can lead to life-threatening complications [4]. People who stay in treatment remain economically productive, benefiting their families and communities [7]. Interrupting HIV treatment may result in viral rebound, deterioration of the immune system, heightened transmission risk, and the development of drug resistance, thereby compromising both individual health and community prevention initiatives. This places significant pressure on healthcare systems and undermines public health initiatives [8–11]. Improving ART adherence is critical to achieving global HIV/AIDS control goals.

*Correspondence: Williams Kwarah, Kwarah@gmail.com. 1 Department of Biostatistics, School of Public Health, University of Ghana, Accra, Ghana. 2 United States Agency for International Development (USAID), Ghana Mission, Accra, Ghana.
While current strategies to address treatment interruption primarily focus on re-engaging patients after missed doses [12, 13], these reactive measures often fall short of preventing the associated health risks and potential for increased transmission. The ability to predict treatment interruptions before they occur could revolutionize HIV care by enabling healthcare providers to implement targeted and proactive interventions that keep patients on therapy, thus enhancing their chances of achieving and sustaining viral suppression. Machine learning (ML) and artificial intelligence (AI) offer powerful tools for developing such predictive models due to their capacity to dynamically analyze large, complex datasets and uncover patterns that traditional methods might miss [14–18]. Despite the promise of these technologies, there remains a significant evidence gap in their application to HIV treatment adherence, particularly in low-resource settings where the burden of the disease is greatest. Addressing this gap through systematic evaluation of existing predictive models is crucial for advancing the use of ML and AI in HIV care. This can lead to more effective and personalized treatment strategies that help meet the ambitious Joint United Nations Programme on HIV/AIDS (UNAIDS) 95-95-95 targets by 2030 [2].

This systematic review aimed to evaluate the effectiveness of machine learning-based predictive models in forecasting HIV treatment interruptions. Specifically, the review (1) identified the types of predictive models previously developed, (2) assessed their accuracy and applicability in various settings, and (3) determined which models have been validated and how they performed in different populations.
This review could provide insights to guide the integration of advanced predictive technologies into HIV care programs, potentially improving patient retention, optimizing treatment outcomes, and supporting global efforts to eliminate HIV as a public health threat by 2030.

Methods

Search strategy

We searched multiple electronic databases, including Scopus, PubMed, The Lancet, BioMed Central (BMC) Public Health, ScienceDirect, Google Scholar, and Cochrane Library. Our search covered publications from January 1990 to September 2024. We searched using a combination of Medical Subject Headings (MeSH) and free-text terms. The key terms included "HIV," "Human Immunodeficiency Virus," "AIDS," and "Acquired Immunodeficiency Syndrome" for HIV-related concepts; "Machine Learning," "ML," "Artificial Intelligence," "AI," "Neural Networks," and "Predictive Modeling" for machine learning concepts; and "Treatment Interruption," "Loss to Follow-Up," "Non-adherence," "Default," and "Treatment Discontinuation" for treatment adherence concepts. These terms were combined using Boolean operators (AND, OR) to ensure a broad and inclusive search. Details of the search strategy for each database are provided in Additional File 1: Search Strategy.

Eligibility criteria

We applied specific eligibility criteria to select studies for inclusion. Eligible studies focused on developing or validating prediction models for HIV treatment interruption at the individual level using machine learning methods. We only included studies published in English. We included studies that defined HIV treatment interruption as missing a scheduled clinic or pharmacy appointment by at least 28 days. We excluded studies that identified predictors without focusing on prediction models and studies lacking full-text availability. Reviews, commentaries, conference abstracts, letters, reports, and opinions were excluded.
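As an illustration only (the exact registered strings per database are given in Additional File 1), the Boolean combination of the three concept groups described in the search strategy takes a form like:

```
("HIV" OR "Human Immunodeficiency Virus" OR "AIDS" OR "Acquired Immunodeficiency Syndrome")
AND ("Machine Learning" OR "ML" OR "Artificial Intelligence" OR "AI" OR "Neural Networks" OR "Predictive Modeling")
AND ("Treatment Interruption" OR "Loss to Follow-Up" OR "Non-adherence" OR "Default" OR "Treatment Discontinuation")
```

Each database required syntax adjustments (e.g., MeSH tags in PubMed), but the OR-within-concept, AND-across-concepts structure was the same throughout.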
In addition to database searches, we manually reviewed the reference lists of the included studies to identify additional relevant articles. To capture recent and unpublished research, we searched preprint servers such as bioRxiv, medRxiv, and arXiv. The corresponding authors of the included articles were emailed to seek further information and clarity. The search strategy was carefully documented (Additional File 1), and articles were managed using Zotero 6.0.37 reference management software, a project of Digital Scholar [19]. The Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) statement [20] (Additional File 2) and guidance on the conduct of systematic reviews [21] guided the review. A protocol for this review was registered on PROSPERO (CRD42024578109).

Selection process

Article selection was conducted in multiple stages to ensure that only studies meeting the predefined inclusion criteria were included. Initially, two independent reviewers (W. K. and G. J. P. I.) screened the titles and abstracts of all records retrieved from the database searches to identify potentially relevant studies. We resolved any disagreements between reviewers during the article selection process through discussion, and a third reviewer (N. Z.) was available to adjudicate unresolved disputes. To enhance the rigor of the selection process, the systematic review software DistillerSR 2.35, developed by DistillerSR Incorporated [22], was used to assist in the identification and removal of duplicate records before screening began.

Data extraction

Two independent reviewers (W. K. and G. J. P. I.) extracted data from the selected studies to ensure accuracy and consistency. Each reviewer independently extracted data using the standardized CHecklist for critical Appraisal and data extraction for systematic Reviews of prediction Modelling Studies (CHARMS) tool [23, 24].
CHARMS was developed for systematic reviews of prognostic or diagnostic prediction model studies, covering models developed without external validation, models developed with external validation, and external validation studies with or without model updating. The data collected included the data sources, study characteristics, details of the predictive models, outcomes, and performance metrics [24]. We resolved any disagreements between reviewers during the data extraction process through discussion, and a third reviewer (N. Z.) was available to adjudicate unresolved disputes. The reviewers manually extracted all data and then cross-verified it to maintain the integrity of the data collection process. A consolidated final completed CHARMS tool was compiled for this review.

Risk of bias and applicability assessment

We used the Prediction model Risk Of Bias Assessment Tool (PROBAST) [25] to assess the risk of bias (ROB) and applicability of the included studies. PROBAST was designed to evaluate the risk of bias and applicability of prediction model studies across four key domains: participants, predictors, outcome, and analysis. There were two signalling questions on participants, three on predictors, six on the outcome, and nine on the statistical analysis. Responses to these questions were "yes," "probably yes," "probably no," "no," or "no information." The ROB was classified as low, high, or unclear based on the responses within these domains. A domain was classified as high risk if at least one question was answered "no" or "probably no," low risk if all questions were answered "yes" or "probably yes," and unclear if the responses provided no information. If all domains were assessed as having a low risk, the overall risk of bias was classified as low. However, if at least one domain was determined to have a high risk, the overall risk of bias was classified as high.
If there was a recognized concern for bias in at least one domain while the level of concern was low for all others, the study was classified as having a moderate level of concern for bias. Two reviewers (W. K. and G. J. P. I.) independently evaluated the risk of bias in each included study. When the reviewers disagreed on the risk-of-bias judgment, the discrepancies were discussed to reach a consensus. If the disagreement persisted, a third reviewer (N. Z.) was consulted to decide. Similarly, model applicability was assessed for the first three domains (participants, predictors, and outcome) for each model. Model applicability was rated low concern, high concern, or unclear concern based on a defined rubric [25]. If there were low concerns regarding applicability for all domains, the prediction model evaluation was judged to have low concerns regarding applicability. If there were high concerns for at least one domain, the evaluation was judged to have high concerns. If there were unclear concerns (but no high concern) for at least one domain, the evaluation was judged to have unclear concerns regarding applicability overall. We conducted all evaluations manually and documented the results of the risk-of-bias and applicability assessments in detail, with summary judgments presented as charts to facilitate a clear understanding of the quality and reliability of the included studies.

Synthesis and analysis

We tabulated the results of individual studies to provide a clear and organized presentation of the key findings. This included details such as study characteristics, model performance metrics (e.g., area under the receiver operating characteristic curve, calibration statistics), and risk-of-bias assessments. We used visual displays, including charts, to enhance the clarity of the results and to facilitate the comparison of study outcomes.
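The PROBAST domain-level and overall risk-of-bias rules described in the risk-of-bias section reduce to a simple classification. The sketch below is an illustrative reduction of those rules, not the official PROBAST tool; the answer strings and function names are our own.

```python
# Illustrative sketch of the PROBAST judgement rules described above.

def domain_rob(answers):
    """Risk of bias for one domain from its signalling-question answers."""
    if any(a in ("no", "probably no") for a in answers):
        return "high"
    if all(a in ("yes", "probably yes") for a in answers):
        return "low"
    return "unclear"  # some answers carry no information

def overall_rob(domains):
    """Overall risk of bias across the four domain ratings."""
    if any(d == "high" for d in domains):
        return "high"
    if all(d == "low" for d in domains):
        return "low"
    return "unclear"
```

For example, a model whose analysis domain contains a single "probably no" is rated high risk overall, regardless of how the other three domains fare.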
For the synthesis of results, we used a narrative synthesis approach due to the anticipated heterogeneity of the included studies, particularly in terms of model types, outcome measures, and study populations. This approach allowed us to systematically describe and compare the predictive models, highlighting common themes and differences among the studies. We did not perform a meta-analysis because there were insufficient external validation studies of the same index model to justify a quantitative synthesis [21]. The synthesis followed guidelines from the Transparent Reporting of a Multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD) statement [26], the CHARMS checklist [24], and PROBAST [25].

Results

Characteristics of included studies

Our search identified 116,672 records, of which 9 studies met the inclusion criteria (Fig. 1). Seven of these studies focused on developing predictive models [27–33], while two included both model development and validation [34, 35]. Six studies were conducted in Africa [27–30, 33, 34]: three in South Africa, one in Nigeria, one in Tanzania, and one combining data from Nigeria and Mozambique. The remaining three studies were conducted in the United States of America (USA) [31, 32, 35] (Table 1). These studies were published between 2018 and 2024, with the majority published in 2023 and 2022. Seven studies were conducted in public healthcare facilities, while two were conducted in university clinics. Seven studies relied on retrospective cohort data, while two used existing registries (Table 1). Heterogeneity was not explored, as only three models were externally validated.

Fig. 1 PRISMA flow of article selection

Model performance metrics

Model performance is often measured using different metrics covering overall performance, discrimination, calibration, and (re)classification.
Discrimination assesses the model's capacity to differentiate between individuals who do and do not have the outcome. The c-statistic, which is equivalent to the area under the receiver operating characteristic curve (AUC-ROC), is frequently used to assess discrimination. Other classification measures such as sensitivity, specificity, negative predictive value (NPV), positive predictive value (PPV), and F1 score are also used to assess model discrimination. Calibration measures how well the predicted risks and observed outcomes match and is often assessed using graphical comparison of observed and predicted event rates. Formal statistical tests, such as the Hosmer–Lemeshow test for logistic regression, are commonly used in conjunction with calibration plots.

Among the 9 studies selected, a total of 12 machine learning models were reported, with 9 focused on model development and 3 on model validation (Table 2). The median sample size across studies was 136,415 (interquartile range: 178–450,000), though 1 model was developed using a sample size of less than 1000 participants. On average, 15 predictors (standard deviation (SD) = 4.0) were included in the final models. Ensemble learning techniques were the most frequently used algorithms, accounting for 92% of the total models. These included random forest (three models), Adaptive Boosting (AdaBoost, three models), Extreme Gradient Boosting (XGBoost, two models), decision trees (two models), and Categorical Boosting (CatBoost, one model) (Table 2). Logistic regression was used in only one model.

Model performance was primarily assessed using the c-statistic or area under the receiver operating characteristic curve (AUC), with an average AUC of 0.668 (SD: 0.07). Some models also reported additional metrics, including accuracy, sensitivity, specificity, negative predictive value (NPV), and positive predictive value (PPV) (Table 2).
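The two metric families described above, discrimination and calibration, can be computed directly. The following is a minimal self-contained sketch; the toy data are invented for illustration and do not come from any included study.

```python
# Sketch of discrimination (c-statistic, equivalent to AUC-ROC) and a
# crude calibration summary, using only the standard library.

def c_statistic(y_true, y_prob):
    """Probability that a randomly chosen event receives a higher
    predicted risk than a randomly chosen non-event (ties count half)."""
    events = [p for y, p in zip(y_true, y_prob) if y == 1]
    nonevents = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum((e > n) + 0.5 * (e == n) for e in events for n in nonevents)
    return wins / (len(events) * len(nonevents))

def calibration_in_the_large(y_true, y_prob):
    """Simplest calibration check: mean predicted risk vs observed rate.
    (Full calibration assessment compares these within risk groups.)"""
    return sum(y_prob) / len(y_prob), sum(y_true) / len(y_true)

y = [1, 1, 0, 0, 0]
p = [0.8, 0.6, 0.7, 0.2, 0.1]
print(c_statistic(y, p))  # 5 of 6 event/non-event pairs ordered correctly: 0.833...
```

A c-statistic of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation, which is why the review's mean AUC of 0.668 is read as moderate discrimination.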
Notably, two models reported only PPV, while another two reported the Matthews correlation coefficient. Model calibration methods were used in just three models, which reported an average F1 score of 0.292 (SD: 0.01) alongside the AUC. None of the studies used decision curve analysis (DCA) to assess clinical value and implications, a significant limitation in evaluating the practical utility of the models. DCA is essential for assessing a model's clinical relevance by weighing the benefits and risks at different decision thresholds, rendering its exclusion a significant constraint [36]. DCA complements calibration and discrimination measures in machine learning models [37] and helps incorporate the clinical consequences of using a model. Besides conducting DCA, net benefit analysis is an alternative measure for assessing the applicability of models in real-life situations. However, one study addressed model utility by gathering feedback from healthcare workers. Additional information is provided in Additional File 3: Model Characteristics Tab.

Risk-of-bias assessment

We reported the risk-of-bias assessment for the 12 models using the PROBAST tool (Fig. 2). Of these, nine models (75.0%) were rated as having a high risk of bias, two models (16.7%) were rated low risk, and one model (8.3%) had an unclear risk of bias. A notable majority (58.3%) showed high risk in the statistical analysis domain. For example, nearly half of the models failed to report how missing data were handled, and 10 models (83.3%) did not disclose the extent of missing data. Furthermore, only three models (25.0%) provided details on calibration measures, which are important for ensuring the reliability of predictions. None of the studies reported DCA or other methods to assess clinical utility, highlighting a critical gap in evaluating the practical application of these models.
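Net benefit, the quantity at the heart of DCA, is straightforward to compute: at a chosen threshold probability p_t, true positives are credited and false positives penalized by the odds p_t / (1 − p_t). The sketch below is illustrative only; data and names are invented, not drawn from any included study.

```python
# Minimal sketch of the net-benefit calculation underlying decision
# curve analysis (DCA).

def net_benefit(y_true, y_prob, p_t):
    """Net benefit of intervening on everyone with predicted risk >= p_t."""
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= p_t and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= p_t and y == 0)
    return tp / n - (fp / n) * (p_t / (1 - p_t))

# A decision curve plots net_benefit over a range of thresholds and
# compares the model with "intervene for all" and "intervene for none".
```

Reporting this curve would show, for each plausible intervention threshold, whether using a model to target retention support beats the default strategies — exactly the clinical-utility evidence the included studies omitted.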
Additional details on the risk-of-bias analysis are provided in the supplementary material (Additional File 3: PROBAST summary tables).

Applicability assessment

We evaluated the applicability of the models for use in the intended population and primary healthcare settings. Overall, 83% of the models were rated as low concern, indicating their suitability for primary healthcare use. However, 17% were rated as high concern, reflecting limitations in certain aspects of model development (Fig. 3). Predictors were rated as low concern, suggesting that the included predictors were relevant to the target population and routinely collected in clinical settings. Similarly, the outcome domain was rated as low concern in 92% of the models, while 8% were marked as unclear due to insufficient reporting of key details.

Table 1 Characteristics of the included studies

| Author, year | Study design | ML technique | Enrolment period | Study setting | Study region | Age of participants | Female n (%) | Male n (%) | Treatment interruption n (%) |
|---|---|---|---|---|---|---|---|---|---|
| 1. Matthew-David Ogbechie, 2023 [30] | Retrospective cohort | XGBoost | January 2005 - February 2021 | Health facility | Nigeria | | 91,982 (67.3) | 44,765 (32.7) | 56,581 (41.5) |
| 2. Esra, Rachel, 2023 [34] | Existing registry | AdaBoost | January 1, 2017 - March 31, 2020 | Public health facilities | South Africa | 33 (27-41) | 172,170 (65) | 92,707 (35) | 260,467 (11.9) |
| | | CatBoost | January 1, 2017 - October 1, 2018 | Public health facilities | South Africa | 33 (27-41) | 172,170 (65) | 92,707 (35) | |
| 3. Stockman, Jeni, 2022 [33] | Retrospective cohort | Random Forest | January 1, 2010 - November 28, 2019 | Public sector ART clinics | Mozambique | 47.3 (13.6) | | | |
| | | XGBoost | | Public sector ART clinics | Nigeria | 47.3 (13.6) | | | |
| 4. Arthi, Ramachandran, 2020 [32] | Retrospective cohort | Random Forest | January 1, 2008 - May 31, 2015 | University of Chicago HIV care clinic | USA | 47.3 (13.6) | 314 (44) | 399 (56) | |
| | | Decision Trees | | University of Chicago HIV care clinic | USA | 47.3 (13.6) | 314 (44) | 399 (56) | |
| 5. Brian W. Pence, 2018 [31] | Retrospective cohort | Logistic Regression | 2002 - 2015 | US-based HIV primary care clinics | USA | 46 (39-52) | 1660 (16) | 8714 (84) | 17,957 (17) |
| 6. Mhairi, Maskew, 2022 [28] | Retrospective cohort | AdaBoost | January 2016 - December 2018 | Public sector HIV care facilities | South Africa | 39 (31-49) | 311,945 (70) | 133,690 (30) | |
| 7. Mhairi, Maskew, 2024 [29] | Retrospective cohort | AdaBoost | January 2016 - December 2018 | Public sector HIV care facilities | South Africa | 39 (27-49) | 315,124 (68) | 148,294 (32) | |
| 8. Joseph A Mason, 2023 [35] | Existing registry | Random Forest | Jan 21 - March 30, 2022 | Hospital in a university | USA | | | | |
| 9. Carolyn A Fahey, 2022 [27] | Retrospective cohort | Decision Trees | 2018 | HIV care center | Tanzania | 36 (10) | 113 (63.5) | 65 (36.5) | |
Table 2 Summary of model performance metrics using the CHARMS checklist

| Author, year | Modelling method | Sample size | Events n (%) | Candidate predictors | Final predictors | EPV or EPP | Selection of candidate predictors | Selection of final predictors | Missing data n (%); handling | Type of validation | Performance measures |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1. Matthew-David Ogbechie, 2023 [30] | XGBoost | 136,747 | 56,581 (41.4) | 13 | 13 | 4352.4 | Based on prior knowledge | Pre-specified model (not selection) | Unknown; KNN imputation | Int: cross-validation and random split data; Ext: none | Calibration: not evaluated. Discrimination: accuracy 0.85 (0.85-0.86), sensitivity 0.81, specificity 0.88, PPV 0.83, NPV 0.87, kappa 0.70. Overall: not evaluated |
| 2. Esra, Rachel, 2023 [34] | AdaBoost | 264,877 | 35,985 (13.6) | 13 | 13 | 2768.1 | Based on prior knowledge | Recursive feature elimination | 1509 (0.6); single imputation | Int: random split data; Ext: different setting | Calibration: F1 score 0.288 (0.286-0.290). Discrimination: c-statistic/AUC graph; sensitivity 0.608 (0.604-0.611), specificity 0.647 (0.646-0.648), PPV 0.189 (0.187-0.190), NPV 0.924 (0.924-0.925). Overall: not evaluated |
| 3. Esra, Rachel, 2023 [34] | CatBoost | 136,082 | 35,985 (26.4) | 13 | 13 | 2768.1 | Based on prior knowledge | Recursive feature elimination | 1509 (1.1); single imputation | Int: random split data; Ext: different setting | Calibration: F1 score 0.299 (0.297-0.301). Discrimination: c-statistic/AUC graph; sensitivity 0.646 (0.642-0.649), specificity 0.646 (0.645-0.648), PPV 0.195 (0.193-0.196), NPV 0.932 (0.931-0.933). Overall: not evaluated |
| 4. Stockman, Jeni, 2022 [33] | Random Forest | 360,000 | | 70 | 12 | Unknown | Other | No information | Unknown; missing values excluded in analysis | Int: cross-validation; Ext: no information | Calibration: not evaluated. Discrimination: c-statistic/AUC-PR, MCC 0.45. Overall: not evaluated |
| 5. Stockman, Jeni, 2022 [33] | XGBoost | 450,000 | | 70 | 12 | Unknown | Other | No information | Unknown; missing values excluded in analysis | Int: temporal cross-validation; Ext: no information | Calibration: not evaluated. Discrimination: c-statistic/AUC-PR, MCC 0.37. Overall: not evaluated |
| 6. Arthi, Ramachandran, 2020 [32] | Random Forest | 11,445 | 1373 (12.0) | 1000 | 20 | 1.4 | Based on prior knowledge | Other | Unknown; single imputation | Int: temporal cross-validation; Ext: no information | Calibration: not evaluated. Discrimination: PPV 24.5 (SD = 0.01). Overall: not evaluated |
| 7. Arthi, Ramachandran, 2020 [32] | Decision Trees | 11,445 | 1373 (12.0) | 800 | 20 | 1.7 | Based on prior knowledge | Other | Unknown; single imputation | Int: random split data; Ext: no information | Calibration: not evaluated. Discrimination: PPV 15.5 (0.04). Overall: not evaluated |
| 8. Brian W. Pence, 2018 [31] | Logistic regression | 105,628 | 17,957 (17.0) | 14 | 14 | 1282.6 | Based on prior knowledge | Pre-specified model (not selection) | Unknown; no information | Int: cross-validation; Ext: no information | Calibration: not evaluated. Discrimination: c-statistic/AUC graph; sensitivity 0.74 (0.70-0.78), specificity 0.54 (0.44-0.64). Overall: not evaluated |
| 9. Mhairi, Maskew, 2022 [28] | AdaBoost | 1,399,145 | 146,881 (10.5) | 75 | 20 | 1958.4 | Other | Feature selection using random forest | Unknown; other | Int: random split data; Ext: no information | Calibration: F1 score 0.29. Discrimination: c-statistic/AUC graph; accuracy 0.786, sensitivity 0.406, specificity 0.83, NPV 0.92, PPV 0.22. Overall: not evaluated |
| 10. Mhairi, Maskew, 2024 [29] | AdaBoost | 3,264,671 | 146,881 (4.5) | 11 | 10 | 13,352.8 | No information | No information | Unknown; no information | Int: random split data; Ext: no information | Calibration: not evaluated. Discrimination: c-statistic; accuracy 0.63, sensitivity 0.52, specificity 0.64, PPV 0.19, NPV 0.89. Overall: not evaluated |
| 11. Joseph A Mason, 2023 [35] | Random Forest | 331 | 0 (0.0) | 11 | 11 | 0 | Based on prior knowledge | No information | Unknown; no information | Int: random split data; Ext: different dataset and provider feedback | Calibration: not evaluated. Discrimination: c-statistic/AUC graph. Overall: not evaluated |
| 12. Carolyn A Fahey, 2022 [27] | Decision Trees | 178 | 72 (40.4) | 22 | 22 | 3.3 | Based on prior knowledge | No information | 0 (0.0); other | Int: cross-validation and random split data; Ext: unclear | Calibration: not evaluated. Discrimination: c-statistic; accuracy 0.723. Overall: not evaluated |

EPV events per variable, EPP events per predictor, PPV positive predictive value, NPV negative predictive value, AUC-PR area under the precision-recall curve, MCC Matthews correlation coefficient

Model validation

All 12 models reported internal validation, using random sample split (6), cross-validation (4), or a combination of random sample split and cross-validation (2) (Table 2). Three models were externally validated, but only two reported discrimination measures, with an average F1 score of 0.2935 alongside c-statistic (AUC) values. These validations were done using datasets received from registries of people living with HIV and scheduled for clinical appointments. While sensitivity, specificity, PPV, and NPV were included, one model lacked critical details on eligibility criteria and missing data handling. None of the externally validated models assessed clinical utility. Further details are provided in the supplementary material (Additional File 3: Model characteristics tables).

Discussion

This review examined 12 machine learning models developed to predict interruptions in HIV treatment, with most relying on advanced ensemble techniques like random forest, AdaBoost, and XGBoost. These models were built using data from large retrospective cohorts, with a median sample size of 120,000 participants, and were validated internally through methods like cross-validation and random sample splitting.
The models dem- onstrated acceptable predictive performance, with an average AUC-ROC of 0.668, and utilized data commonly collected in clinical settings, making them practical for real-world use. For prognostic predictive models, AUC of 0.5–0.7 suggests poor discrimination, and 0.7–0.8 is considered acceptable, 0.8–0.9 excellent, and > 0.9 as out- standing [38, 39]. Although only two models were exter- nally validated, most models showed strong potential for application in primary healthcare, highlighting their promise in improving adherence and supporting HIV care strategies. Electronic medical records (EMRs) are increasingly prevalent worldwide, including in Africa [40], facilitating the ongoing accumulation of extensive healthcare data and enabling big data analytics [41–46], as well as the application of machine learning and artificial intelligence [44, 47, 48]. Numerous prognostic studies have employed EMR data to create models for predicting individual Fig. 2  Summary of risk-of-bias assessment Fig. 3  Summary of applicability assessment Page 11 of 15Kwarah et al. BMC Global and Public Health (2025) 3:64 diagnoses of HIV, healthcare attendance, and viral load suppression [49–51]. The growing utilization of these analytic tools is likely due to the interest in employing predictive models as decision support instruments at the point of care. Moreover, executing focused, high-impact treatments with limited resources in underprivileged healthcare environments is essential [52, 53]. Two-thirds of the research was conducted in Africa, predominantly in South Africa, an area characterized by a high incidence of HIV [54]. This emphasis is praise- worthy, yet it constrains the comprehension of predictive model application in areas with low prevalence. Utilizing data from high-prevalence regions, such as South Africa, offers essential insights into models that help tackle adherence difficulties in analogous circumstances. 
This emphasis requires careful consideration when extrapolating results to areas with different healthcare systems and adherence challenges. The studies conducted in the USA [31, 32, 35], though few in number, offered a divergent viewpoint, highlighting the necessity for regionally appropriate models.

The machine learning techniques in our analysis have shown significant potential in forecasting treatment interruption by utilizing routinely gathered clinical data. Ensemble learning methodologies, specifically random forest, AdaBoost, and XGBoost, were dominant, collectively representing 91.7% of the models created. Previous studies have demonstrated that ensemble approaches effectively address the complex, nonlinear interactions prevalent in healthcare datasets [55, 56]. These algorithms have achieved above 90% accuracy across many datasets [57, 58]. Ensemble algorithms are beneficial because of their resilience to overfitting and their capacity to handle extensive feature sets. The outcomes of our review correspond with these results. Most models in our study reported the c-statistic (AUC), which evaluates the discriminatory capability of predictive models. The average AUC of 0.668 in our analysis aligns with the findings of Chilamkurthy et al. (2018), who stated that whereas ML models excel at distinguishing different outcomes, clinical performance criteria such as accuracy, sensitivity, and specificity frequently lack efficacy due to the unbalanced datasets or inadequate predictor selection often found in healthcare data. Other studies have emphasized that ML algorithms should employ the AUC, in conjunction with calibration and decision curve analysis, as a more effective and superior metric than accuracy for assessing model performance [59]. We discovered in our review that several studies failed to report calibration and clinical utility.
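To make the c-statistic discussed above concrete, the sketch below computes it directly from its rank-based definition: the probability that a randomly chosen event receives a higher predicted risk than a randomly chosen non-event. This is an illustrative example only; the outcome labels and predicted risks are hypothetical, not data from the reviewed studies.

```python
# Illustrative sketch: the c-statistic (AUC-ROC) via the Mann-Whitney
# rank formulation. All data here are hypothetical.

def c_statistic(y_true, y_score):
    """Probability that a random event outranks a random non-event
    (ties count as 0.5)."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical predicted risks of treatment interruption (1 = interrupted):
y_true  = [1, 1, 1, 0, 0, 0, 0, 0]
y_score = [0.81, 0.40, 0.62, 0.35, 0.57, 0.22, 0.48, 0.30]
print(round(c_statistic(y_true, y_score), 3))  # → 0.867
```

Because the c-statistic depends only on ranks, it is insensitive to whether the predicted probabilities are well calibrated, which is why calibration must be reported separately.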
Although there are many possible pitfalls in the development and validation of prediction models, it is essential to disclose calibration measurements, which are vital components of statistical performance [60, 61]. Calibration measures are essential because they ensure that model-predicted probabilities correspond with observed probabilities, thereby ensuring model dependability. Only 25% of the studies included in our evaluation assessed model calibration. In the absence of calibration, predictive models may produce probabilities that inaccurately reflect actual risks, compromising their clinical relevance [62]. We noted significant problems with the risk of bias (ROB) in the developed prediction models. Seventy-five percent of the reviewed models were classified as exhibiting a high risk of bias, mostly due to inadequacies in statistical analysis and data management. Approximately 83.3% of models did not disclose the magnitude of missing data or the methodologies employed to mitigate it, underscoring this as a key concern. This conclusion aligns with prior research demonstrating that most predictive model studies do not report their methods for addressing missing data [63]. Missing data is a widespread problem in retrospective healthcare datasets and, if not properly managed, can compromise model performance and integrity [63–65]. Several studies have utilized imputation approaches that predict missing values to mirror reality, which increases the probability of acquiring high-quality and reusable data [66]. However, if imputation is not handled appropriately, it can introduce systematic biases and diminish the validity and integrity of models, particularly in datasets used in healthcare research [67, 68]. Furthermore, our review observed the lack of decision curve analysis (DCA) in all the included studies.
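The net-benefit calculation that underlies a decision curve can be sketched in a few lines. The example below is a minimal, hypothetical illustration of the standard weighting (true positives credited, false positives penalized by the odds of the risk threshold); the labels, scores, and thresholds are invented for demonstration.

```python
# Minimal sketch of net benefit, the quantity plotted in decision curve
# analysis (DCA). Data and thresholds are hypothetical.

def net_benefit(y_true, y_score, threshold):
    """Net benefit of treating everyone whose predicted risk is at least
    `threshold`: TP/N minus FP/N weighted by the threshold odds."""
    n = len(y_true)
    tp = sum(1 for y, s in zip(y_true, y_score) if s >= threshold and y == 1)
    fp = sum(1 for y, s in zip(y_true, y_score) if s >= threshold and y == 0)
    return tp / n - fp / n * threshold / (1 - threshold)

y_true  = [1, 1, 1, 0, 0, 0, 0, 0]
y_score = [0.81, 0.40, 0.62, 0.35, 0.57, 0.22, 0.48, 0.30]

# A full decision curve sweeps a range of clinically plausible thresholds
# and compares the model against "treat all" and "treat none" strategies:
for t in (0.2, 0.4, 0.6):
    print(t, net_benefit(y_true, y_score, t))
```

A model is clinically useful at a given threshold only if its net benefit exceeds both default strategies, which is exactly the information that discrimination and accuracy metrics alone cannot provide.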
Besides conducting DCA, net benefit analysis is an alternative measure to assess the applicability of models in real-life situations.

The reviewed models show potential for improving HIV treatment interruption predictions; nevertheless, their reliability and applicability in clinical environments are constrained, as shown in the risk-of-bias and applicability results. Overall, an 83% applicability score was achieved for the reviewed models, suggesting their broad appropriateness for the target groups and settings. This result reflects the incorporation of predictors frequently gathered in clinical contexts, including demographic information, adherence records, and clinical indicators, which improves the practicality of applying these models in actual healthcare settings [69]. Ninety-two percent of models were rated as of minimal concern in the outcome domain; nevertheless, the absence of external validation and decision curve analysis presents serious constraints on their practical use in guiding clinical decisions [62]. For optimal real-world applicability, models must address these deficiencies by integrating external validation across diverse contexts and evaluating clinical significance using methodologies such as DCA, net benefit analysis, or net reclassification improvement assessments. Aligning with clinical workflows is crucial for maximizing the efficacy of machine learning in enhancing adherence and minimizing inappropriate treatment exclusion in HIV care. Strengthening future research through stringent reporting standards and robust statistical methodologies, such as those outlined in the TRIPOD recommendations, is essential to mitigate biases and improve the reliability of predictive modeling in HIV care [70].

The results of this review should be interpreted with certain limitations in mind.
First, the review included only journal articles published in English with free-text availability, and the search was conducted across a limited number of databases, which may introduce language and publication bias. Excluding studies published in languages other than English presents a potential selection bias and limits the generalizability of the findings to English-speaking settings. To address potential selection and publication bias stemming from the restricted database search, we supplemented our efforts by conducting backward and forward citation searches in Google Scholar and reviewing article references. Most of the included studies were conducted in resource-poor settings, which made it difficult for validation studies to be carried out; in such circumstances, validation studies should be conducted on different datasets or in different settings.

Future studies should prioritize robust external validation across diverse populations and geographic regions, which is essential to evaluate model performance under varying demographic, clinical, and systemic conditions and to ensure reliability in real-world applications. The inclusion of sociocultural and structural factors in model development should also be considered in future research. Addressing missing data is likewise critical for enhancing model accuracy and reliability; future studies should adopt systematic strategies such as multiple imputation or sensitivity analyses and adhere to standardized reporting guidelines like TRIPOD. Finally, incorporating decision curve analysis (DCA) into model assessment is recommended to bridge the gap between statistical performance and practical, real-world impact.

Conclusions
This study provides key insights into the current state of predictive modeling for HIV treatment interruptions.
Machine learning, particularly ensemble learning techniques, is widely used with retrospective cohort data to address adherence issues in HIV programs, demonstrating moderate accuracy and applicability in primary healthcare settings. However, critical shortcomings, including insufficient calibration reporting, lack of decision curve analysis (DCA), and limited external validation, restrict the models' clinical utility and generalizability. Predictive modeling holds significant promise in supporting countries to achieve the UNAIDS 95-95-95 targets by advancing equitable access to medication, sustaining high treatment retention rates, and achieving widespread viral load suppression.

Abbreviations
HIV: Human immunodeficiency virus
AIDS: Acquired immunodeficiency syndrome
PLHIV: People living with HIV
ART: Antiretroviral therapy
ML: Machine learning
AI: Artificial intelligence
UNAIDS: Joint United Nations Programme on HIV/AIDS
PRISMA: Preferred Reporting Items for Systematic reviews and Meta-Analyses
PROSPERO: International Prospective Register of Systematic Reviews
BMC: BioMed Central
MeSH: Medical Subject Headings
CHARMS: CHecklist for critical Appraisal and data extraction for systematic Reviews of prediction Modelling Studies
PROBAST: Prediction model Risk Of Bias Assessment Tool
ROB: Risk of bias
TRIPOD: Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis
SD: Standard deviation
XGBoost: Extreme Gradient Boosting
AdaBoost: Adaptive Boosting
CatBoost: Categorical Boosting
AUC-ROC: Area under the receiver operating characteristic curve
AUC-PR: Area under the precision-recall curve
NPV: Negative predictive value
PPV: Positive predictive value
EPV: Events per variable
EPP: Events per predictor
MCC: Matthews correlation coefficient
DCA: Decision curve analysis
EMR: Electronic medical records

Supplementary Information
The online version contains supplementary material available at https://doi.org/10.1186/s44263-025-00184-4.

Additional File 1: Search Strategy (Revised).
Additional File 2: PRISMA 2020 Checklist.
Additional File 3: CHARMS checklist, PROBAST checklist. Study characteristics: Table 1. Characteristics of the studies included in the systematic review. Model characteristics: Table 2: Characteristics of the models included in the systematic review and critical for risk of bias and applicability. PROBAST summary: Table 3: Risk of bias and applicability assessment. Drop-down lists for CHARMS.

Acknowledgements
We would like to express our sincere gratitude to Gabriel Jamal Peazang Ibrahim and Nabilatu Zakari for their assistance in data extraction. We would also like to express sincere gratitude to Dr. Ekua E. Houphouet and Dr. Jasmin Kwarah for generously reviewing the manuscript and providing the stationery that was crucial for the successful completion of this systematic review.

Authors' contributions
WK conceived the research topic, led data review and extraction, analyzed and interpreted the extracted data, and wrote the first draft of the manuscript. FBV, DD, and SB contributed to the methods, analysis, and reporting and reviewed the manuscript. All authors read and approved the final manuscript.

Funding
Not applicable.

Data availability
All data generated or analyzed during this study are part of the supplementary information in the Additional File 3: SUMMARY, CHARMS, and PROBAST tabs.

Declarations

Ethics approval and consent to participate
Given that this study is nested within another study on HIV treatment interruptions, ethical approval was received from the Ghana Health Service Ethics Review Committee with approval number GHS-ERC:003/08/24. All ethical principles were followed in this review. Consent to participate is not applicable.
Consent for publication Not applicable. Competing interests The authors declare no competing interests. Received: 11 January 2025 Accepted: 9 July 2025 References 1. UNAIDS_FactSheet_en.pdf, (n.d.). https://​www.​unaids.​org/​sites/​defau​lt/​ files/​media_​asset/​UNAIDS_​FactS​heet_​en.​pdf. Accessed 17 Dec 2024. 2. Frescura L, Godfrey-Faussett P, Feizzadeh AA, El-Sadr W, Syarif O, Ghys PD. Achieving the 95 95 95 targets for all: a pathway to ending AIDS. PLoS ONE. 2022;17:e0272405. https://​doi.​org/​10.​1371/​journ​al.​pone.​02724​05. 3. Altice F, Evuarherhe O, Shina S, Carter G, Beaubrun AC. Adherence to HIV treatment regimens: systematic literature review and meta-analysis. Patient Prefer Adherence. 2019;13:475–90. https://​doi.​org/​10.​2147/​PPA.​S1927​35. 4. Dubrocq G, Rakhmanina N. Antiretroviral therapy interruptions: impact on HIV treatment and transmission. HIVAIDS - Res Palliat Care. 2018;10:91– 101. https://​doi.​org/​10.​2147/​HIV.​S1419​65. 5. Akpan U, Kakanfo K, Ekele OD, Ukpong K, Toyo O, Nwaokoro P, James E, Pandey S, Olatubosun K, Bateganya M. Predictors of treatment interrup‑ tion among patients on antiretroviral therapy in Akwa Ibom, Nigeria: outcomes after 12 months. AIDS Care. 2023;35:114–22. https://​doi.​org/​10.​ 1080/​09540​121.​2022.​20938​26. 6. Rosen S, Fox MP, Gill CJ. Patient retention in antiretroviral therapy pro‑ grams in sub-Saharan Africa: a systematic review. PLoS Med. 2007;4: e298. https://​doi.​org/​10.​1371/​journ​al.​pmed.​00402​98. 7. Thirumurthy H, Galárraga O, Larson B, Rosen S. HIV treatment produces economic returns through increased work and education, and warrants continued US support. Health Aff Proj Hope. 2012;31:1470–7. https://​doi.​ org/​10.​1377/​hltha​ff.​2012.​0217. 8. Jewell B, Smith J, Hallett T. The potential impact of interrup‑ tions to HIV services: a modelling case study for South Africa. 2020.2020.04.22.20075861. https://​doi.​org/​10.​1101/​2020.​04.​22.​20075​861. 9. 
Mills EJ, Funk A, Kanters S, Kawuma E, Cooper C, Mukasa B, Odit M, Kara‑ magi Y, Mwehire D, Nachega J, Yaya S, Featherstone A, Ford N. Long-term health care interruptions among HIV-positive patients in Uganda. JAIDS J Acquir Immune Defic Syndr. 2013;63: e23. https://​doi.​org/​10.​1097/​QAI.​ 0b013​e3182​8a3fb8. 10. Thomadakis C, Yiannoutsos CT, Pantazis N, Diero L, Mwangi A, Musick BS, Wools-Kaloustian K, Touloumi G. The effect of HIV treatment inter‑ ruption on subsequent immunological response. Am J Epidemiol. 2023;192:1181–91. https://​doi.​org/​10.​1093/​aje/​kwad0​76. 11. Trickey A, Zhang L, Rentsch CT, Pantazis N, Izquierdo R, Antinori A, Leierer G, Burkholder G, Cavassini M, Palacio-Vieira J, Gill MJ, Teira R, Stephan C, Obel N, Vehreschild J-J, Sterling TR, Van Der Valk M, Bonnet F, Crane HM, Silverberg MJ, Ingle SM, Sterne JAC, the A.T.C. Collaboration (ART-CC). Care interruptions and mortality among adults in Europe and North America. AIDS. 2024;38:1533. https://​doi.​org/​10.​1097/​QAD.​00000​00000​ 003924. 12. Chamberlin S, Mphande M, Phiri K, Kalande P, Dovel K. How HIV clients find their way back to the ART clinic: a qualitative study of disen‑ gagement and re-engagement with HIV care in Malawi. AIDS Behav. 2022;26:674–85. https://​doi.​org/​10.​1007/​s10461-​021-​03427-1. 13. Palacio-Vieira J, Reyes-Urueña JM, Imaz A, Bruguera A, Force L, Llaveria AO, Llibre JM, Vilaró I, Borràs FH, Falcó V, Riera M, Domingo P, de Lazzari E, Miró JM, Casabona J. Strategies to reengage patients lost to follow up in HIV care in high income countries, a scoping review. BMC Public Health. 2021;21:1596. https://​doi.​org/​10.​1186/​s12889-​021-​11613-y. 14. Bektaş M, Tuynman JB, Costa Pereira J, Burchell GL, van der Peet DL. Machine learning algorithms for predicting surgical outcomes after colorectal surgery: a systematic review. World J Surg. 2022;46:1. https://​ doi.​org/​10.​1007/​s00268-​022-​06728-1. 15. Huang Y, Li J, Li M, Aparasu RR. 
Application of machine learning in predicting survival outcomes involving real-world data: a scoping review. BMC Med Res Methodol. 2023;23:268. https://​doi.​org/​10.​1186/​ s12874-​023-​02078-1. 16. Senders JT, Staples PC, Karhade AV, Zaki MM, Gormley WB, Broekman MLD, Smith TR, Arnaout O. Machine learning and neurosurgical outcome prediction: a systematic review. World Neurosurg. 2018;109:476-486.e1. https://​doi.​org/​10.​1016/j.​wneu.​2017.​09.​149. 17. E.W. Steyerberg, Applications of Prediction Models, in: E.W. Steyerberg (Ed.), Clin. Predict. Models Pract. Approach Dev. Valid. Updat., Springer International Publishing, Cham, 2019: pp. 15–36. https://​doi.​org/​10.​1007/​ 978-3-​030-​16399-0_2. 18. Zu W, Huang X, Xu T, Du L, Wang Y, Wang L, Nie W. Machine learning in predicting outcomes for stroke patients following rehabilitation treat‑ ment: a systematic review. PLoS ONE. 2023;18: e0287308. https://​doi.​org/​ 10.​1371/​journ​al.​pone.​02873​08. 19. Corporation for Digital Scholarship. Zotero (6.0.37) [Software]. Listing the institution (Corporation for Digital Scholarship) instead of individu‑ als is advisable because several programmers and an active community contributed to developing the software. 2023. https://​www.​zotero.​org/. Original work published 2006. 20. Page MJ, Moher D, Bossuyt PM, Boutron I, Hoffmann TC, Mulrow CD, Shamseer L, Tetzlaff JM, Akl EA, Brennan SE, Chou R, Glanville J, Grimshaw JM, Hróbjartsson A, Lalu MM, Li T, Loder EW, Mayo-Wilson E, McDonald S, McGuinness LA, Stewart LA, Thomas J, Tricco AC, Welch VA, Whiting P, McKenzie JE. PRISMA 2020 explanation and elaboration: updated guidance and exemplars for reporting systematic reviews. BMJ. 2021;372: n160. https://​doi.​org/​10.​1136/​bmj.​n160. 21. Damen JAA, Moons KGM, van Smeden M, Hooft L. How to conduct a systematic review and meta-analysis of prognostic model studies. Clin Microbiol Infect. 2023;29:434–40. https://​doi.​org/​10.​1016/j.​cmi.​2022.​07.​019. 22. 
Systematic Review and Literature Review Software by DistillerSR, Distill‑ erSR (n.d.). https://​www.​disti​llersr.​com/. Accessed 17 Dec 2024. 23. Liberati A, Altman DG, Tetzlaff J, Mulrow C, Gøtzsche PC, Ioannidis JPA, Clarke M, Devereaux PJ, Kleijnen J, Moher D. The PRISMA Statement for Reporting Systematic reviews and Meta-Analyses of studies that evaluate health care interventions: explanation and elaboration. PLOS Med. 2009;6: e1000100. https://​doi.​org/​10.​1371/​journ​al.​pmed.​10001​00. 24. Moons KGM, de Groot JAH, Bouwmeester W, Vergouwe Y, Mallett S, Altman DG, Reitsma JB, Collins GS. Critical appraisal and data extraction for system‑ atic reviews of prediction modelling studies: the CHARMS checklist. PLOS Med. 2014;11: e1001744. https://​doi.​org/​10.​1371/​journ​al.​pmed.​10017​44. 25. Wolff RF, Moons KGM, Riley RD, Whiting PF, Westwood M, Collins GS, Reitsma JB, Kleijnen J, Mallett S. PROBAST Group†, PROBAST: a tool to assess the risk of bias and applicability of prediction model studies. Ann Intern Med. 2019;170:51–8. https://​doi.​org/​10.​7326/​M18-​1376. 26. Moons KGM, Altman DG, Reitsma JB, Ioannidis JPA, Macaskill P, Steyerberg EW, Vickers AJ, Ransohoff DF, Collins GS. Transparent Reporting of a multi‑ variable prediction model for Individual Prognosis Or Diagnosis (TRIPOD): explanation and elaboration. Ann Intern Med. 2015;162:W1–73. https://​ doi.​org/​10.​7326/​M14-​0698. 27. Fahey CA, Wei L, Njau PF, Shabani S, Kwilasa S, Maokola W, Packel L, Zheng Z, Wang J, McCoy SI. Machine learning with routine electronic medical record data to identify people at high risk of disengagement from HIV care in Tanzania. PLOS Glob Public Health. 2022;2: e0000720. https://​doi.​ org/​10.​1371/​journ​al.​pgph.​00007​20. 28. 
Maskew M, Sharpey-Schafer K, De Voux L, Crompton T, Bor J, Rennick M, Chirowodza A, Miot J, Molefi S, Onaga C, Majuba P, Sanne I, Pisa P. Applying machine learning and predictive modeling to retention and viral suppression in South African HIV treatment cohorts. Sci Rep. 2022;12:12715.
https://​doi.​org/​10.​1038/​s41598-​022-​16062-0. 29. Maskew M, Smith S, Voux LD, Sharpey-Schafer K, Crompton T, Govender A, Pisa P, Rosen S. Triaging clients at risk of disengagement from HIV care: application of a predictive model to clinical trial data in South Africa. 2024.2024.08.05.24311488. https://​doi.​org/​10.​1101/​2024.​08.​05.​24311​488. 30. Ogbechie M-D, Walker CF, Lee M-T, Gana AA, Oduola A, Idemudia A, Edor M, Harris EL, Stephens J, Gao X, Chen P-L, Persaud NE. Predicting treatment interruption among people living with HIV in Nigeria: machine learning approach. JMIR AI. 2023;2: e44432. https://​doi.​org/​10.​2196/​44432. 31. Pence BW, Bengtson AM, Boswell S, Christopoulos KA, Crane HM, Geng E, Keruly JC, Mathews WC, Mugavero MJ. Who will show? Predicting missed visits among patients in routine HIV primary care in the United States, AIDS Behav. 2019;23:418–26. https://​doi.​org/​10.​1007/​s10461-​018-​2215-1. 32. Ramachandran A, Kumar A, Koenig H, De Unanue A, Sung C, Walsh J, Schneider J, Ghani R, Ridgway JP. Predictive analytics for retention in care in an urban HIV clinic. Sci Rep. 2020;10:6421. https://​doi.​org/​10.​1038/​ s41598-​020-​62729-x. 33. Stockman J, Friedman J, Sundberg J, Harris E. Predictive analytics using machine learning to identify ART clients at health system level at greatest risk of treatment interruption in Mozambique and Nigeria. JAIDS J Acquir Immune Defic Syndr. 2022. https://​doi.​org/​10.​1097/​QAI.​00000​00000​002947.​ 10.​1097/​QAI.​00000​00000​002947. 34. Esra R, Carstens J, Le Roux S, Mabuto T, Eisenstein M, Keiser O, Orel E, Mer‑ zouki A, De Voux L, Maskew M, Sharpey-Schafer K. Validation and improve‑ ment of a machine learning model to predict interruptions in antiretroviral treatment in South Africa. JAIDS J Acquir Immune Defic Syndr. 2023;92:42. https://​doi.​org/​10.​1097/​QAI.​00000​00000​003108. 35. Mason JA, Friedman EE, Rojas JC, Ridgway JP. 
No-show prediction model performance among people with HIV: external validation study. J Med Internet Res. 2023;25: e43277. https://​doi.​org/​10.​2196/​43277. 36. Vickers AJ, van Calster B, Steyerberg EW. A simple, step-by-step guide to interpreting decision curve analysis. Diagn Progn Res. 2019;3:18. https://​doi.​ org/​10.​1186/​s41512-​019-​0064-7. 37. Wu Y, Xu L, Yang P, Lin N, Huang X, Pan W, Li H, Lin P, Li B, Bunpetch V, Luo C, Jiang Y, Yang D, Huang M, Niu T, Ye Z. Survival prediction in high-grade osteosarcoma using radiomics of diagnostic computed tomography. eBio‑ Medicine. 2018;34:27–34. https://​doi.​org/​10.​1016/j.​ebiom.​2018.​07.​006. 38. Carrington AM, Manuel DG, Fieguth PW, Ramsay T, Osmani V, Wernly B, Bennett C, Hawken S, Magwood O, Sheikh Y, McInnes M, Holzinger A. Deep ROC analysis and AUC as balanced average accuracy, for Improved Clas‑ sifier Selection, Audit and Explanation. IEEE Trans Pattern Anal Mach Intell. 2023;45:329–41. https://​doi.​org/​10.​1109/​TPAMI.​2022.​31453​92. 39. White N, Parsons R, Collins G, Barnett A. Evidence of questionable research practices in clinical prediction models. BMC Med. 2023;21:339. https://​doi.​ org/​10.​1186/​s12916-​023-​03048-6. 40. Akanbi MO, Ocheke AN, Agaba PA, Daniyam CA, Agaba EI, Okeke EN, Ukoli CO. Use of electronic health records in sub-Saharan Africa: progress and challenges. J Med Trop. 2012;14:1. 41. Colombo F, Oderkirk J, Slawomirski L. Health information systems, electronic medical records, and big data in global healthcare: progress and challenges in OECD countries, in: R. Haring, I. Kickbusch, D. Ganten, M. Moeti (Eds.), Handb. Glob. Health, Springer International Publishing, Cham, 2020: pp. 1–31. https://​doi.​org/​10.​1007/​978-3-​030-​05325-3_​71-1. 42. Cyganek B, Graña M, Krawczyk B, Kasprzak A, Porwik P, Walkowiak K, Woźniak M. A survey of big data issues in electronic health record analysis. Appl Artif Intell. 2016;30:497–520. 
https://​doi.​org/​10.​1080/​08839​514.​2016.​11937​14. 43. Khan ZF, Alotaibi SR. Applications of artificial intelligence and big data analytics in m-Health: a healthcare system perspective. J Healthc Eng. 2020;2020:8894694. https://​doi.​org/​10.​1155/​2020/​88946​94. 44. Schwartz JT, Gao M, Geng EA, Mody KS, Mikhail CM, Cho SK. Applications of machine learning using electronic medical records in spine surgery. Neurospine. 2019;16:643–53. https://​doi.​org/​10.​14245/​ns.​19383​86.​193. 45. Shinozaki A. Electronic medical records and machine learning in approaches to drug development, in: Artif. Intell. Oncol. Drug Discov. Dev., IntechOpen, 2020. https://​doi.​org/​10.​5772/​intec​hopen.​92613. 46. Syed FM, F.K.E. S, AI in securing electronic health records (EHR) systems. Int J Adv Eng Technol Innov. 1 (2024) 593–620. 47. Kawamoto K, Finkelstein J, Fiol GD. Implementing machine learning in the electronic health record: checklist of essential considerations. Mayo Clin Proc. 2023;98:366–9. https://​doi.​org/​10.​1016/j.​mayocp.​2023.​01.​013. 48. Weng SF, Reps J, Kai J, Garibaldi JM, Qureshi N. Can machine-learning improve cardiovascular risk prediction using routine clinical data? PLoS ONE. 2017;12: e0174944. https://​doi.​org/​10.​1371/​journ​al.​pone.​01749​44. 49. Critelli B, Hassan A, Lahooti I, Noh L, Park JS, Tong K, Lahooti A, Matzko N, Adams JN, Liss L, Quion J, Restrepo D, Nikahd M, Culp S, Lacy-Hulbert A, Speake C, Buxbaum J, Bischof J, Yazici C, Phillips AE, Terp S, Weissman A, Conwell D, Hart P, Ramsey M, Krishna S, Han S, Park E, Shah R, Akshintala V, Windsor JA, Mull NK, Papachristou GI, Celi LA, Lee PJ. A systematic review of machine learning-based prognostic models for acute pancreatitis: towards improving methods and reporting quality. 2024;2024.06.26.24309389. https://​doi.​org/​10.​1101/​2024.​06.​26.​24309​389. 50. Endebu T, Taye G, Addissie A, Deksisa A, Deressa W. 
Electronic medical record-based prediction models developed and deployed in the HIV care continuum: a systematic review. Discov Health Syst. 2024;3:25. https://​doi.​ org/​10.​1007/​s44250-​024-​00092-8. 51. Ridgway JP, Lee A, Devlin S, Kerman J, Mayampurath A. Machine learning and clinical informatics for improving HIV care continuum outcomes. Curr HIV/AIDS Rep. 2021;18:229–36. https://​doi.​org/​10.​1007/​s11904-​021-​00552-3. 52. Chin RJ, Sangmanee D, Piergallini L. PEPFAR funding and reduction in HIV infection rates in 12 focus sub-Saharan African countries: a quantitative analysis. Int J MCH AIDS. 2015;3:150. 53. Pal M, Parija S, Panda G, Dhama K, Mohapatra RK. Risk prediction of cardiovascular disease using machine learning classifiers. Open Med. 2022;17:1100–13. https://​doi.​org/​10.​1515/​med-​2022-​0508. 54. South Africa, (n.d.). https://​www.​unaids.​org/​en/​regio​nscou​ntries/​count​ries/​ south​africa. Accessed 17 Dec 2024. 55. Dietterich TG. Ensemble methods in machine learning, in: Mult. Clas‑ sif. Syst., Springer, Berlin, Heidelberg, 2000: pp. 1–15. https://​doi.​org/​10.​ 1007/3-​540-​45014-9_1. 56. Rane N, Choudhary SP, Rane J. Ensemble deep learning and machine learn‑ ing: applications, opportunities, challenges, and future directions. Stud Med Health Sci. 1 (2024) 18–41. https://​doi.​org/​10.​48185/​smhs.​v1i2.​1225. 57. Namamula LR, Chaytor D. Effective ensemble learning approach for large- scale medical data analytics. Int J Syst Assur Eng Manag. 2024;15:13–20. https://​doi.​org/​10.​1007/​s13198-​021-​01552-7. 58. Chilamkurthy S, Ghosh R, Tanamala S, Biviji M, Campeau NG, Venugopal VK, Mahajan V, Rao P, Warier P. Development and validation of deep learning algorithms for detection of critical findings in head CT scans, 2018. https://​ doi.​org/​10.​48550/​arXiv.​1803.​05854. 59. Huang J, Ling CX. Using AUC and accuracy in evaluating learning algo‑ rithms. IEEE Trans Knowl Data Eng. 2005;17:299–310. 
https://​doi.​org/​10.​ 1109/​TKDE.​2005.​50. 60. Alba AC, Agoritsas T, Walsh M, Hanna S, Iorio A, Devereaux PJ, McGinn T, Guyatt G. Discrimination and calibration of clinical prediction models: users’ guides to the medical literature. JAMA. 2017;318:1377–84. https://​doi.​org/​ 10.​1001/​jama.​2017.​12126. 61. Binuya MAE, Engelhardt EG, Schats W, Schmidt MK, Steyerberg EW. Meth‑ odological guidance for the evaluation and updating of clinical prediction models: a systematic review. BMC Med Res Methodol. 2022;22:316. https://​ doi.​org/​10.​1186/​s12874-​022-​01801-8. 62. Van Calster B, Nieboer D, Vergouwe Y, De Cock B, Pencina MJ, Steyerberg EW. A calibration hierarchy for risk models was defined: from utopia to empirical data. J Clin Epidemiol. 2016;74:167–76. https://​doi.​org/​10.​1016/j.​jclin​epi.​2015.​12.​005. 63. Nijman SWJ, Leeuwenberg AM, Beekers I, Verkouter I, Jacobs JJL, Bots ML, Asselbergs FW, Moons KGM, Debray TPA. Missing data is poorly handled and reported in prediction model studies using machine learning: a literature review. J Clin Epidemiol. 2022;142:218–29. https://​doi.​org/​10.​1016/j.​jclin​epi.​ 2021.​11.​023. 64. Misra DP, Yadav AS. Impact of preprocessing methods on healthcare predic‑ tions. 2019. https://​doi.​org/​10.​2139/​ssrn.​33495​86. 65. Newman DA. Missing data: five practical guidelines, Organ. Res. Methods. 2014;17:372–411. https://​doi.​org/​10.​1177/​10944​28114​548590. 66. Afkanpour M, Hosseinzadeh E, Tabesh H. Identify the most appropriate imputation method for handling missing values in clinical structured data‑ sets: a systematic review. BMC Med Res Methodol. 2024;24:188. https://​doi.​ org/​10.​1186/​s12874-​024-​02310-6. 67. Buuren, S. van. Flexible Imputation of Missing Data. CRC Press. 2012. 
68. Rios R, Miller RJ, Manral N, Sharir T, Einstein AJ, Fish MB, Ruddy TD, Kaufmann PA, Sinusas AJ, Miller EJ, Bateman TM, Dorbala S, Carli MD, Kriekinge SDV, Kavanagh PB, Parekh T, Liang JX, Dey D, Berman DS, Slomka PJ. Handling missing values in machine learning to predict patient-specific risk of adverse cardiac events: insights from REFINE SPECT registry. Comput Biol Med. 2022;145:105449. https://doi.org/10.1016/j.compbiomed.2022.105449. 69. Vickers AJ, Van Calster B, Wynants L, Steyerberg EW. Decision curve analysis: confidence intervals and hypothesis testing for net benefit. Diagn Progn Res. 2023;7:11. https://doi.org/10.1186/s41512-023-00148-y. 70. Collins GS, Reitsma JB, Altman DG, Moons KGM. Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD): the TRIPOD statement. BMJ. 2015;350:g7594. https://doi.org/10.1136/bmj.g7594. Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. 