F1000Research 2021, 10:60 Last updated: 31 MAR 2021 RESEARCH ARTICLE Local adaptation in populations of Mycobacterium tuberculosis endemic to the Indian Ocean Rim [version 1; peer review: 3 approved] Fabrizio Menardo 1,2, Liliana K. Rutaihwa1,2, Michaela Zwyer1,2, Sonia Borrell1,2, Iñaki Comas3, Emilyn Costa Conceição 4,5, Mireia Coscolla 6, Helen Cox7, Moses Joloba8, Horng-Yunn Dou9, Julia Feldmann1,2, Lukas Fenner 1,2,10, Janet Fyfe11, Qian Gao12, Darío García de Viedma13,14, Alberto L. Garcia-Basteiro15,16, Sebastian M. Gygli1,2, Jerry Hella1,2,17, Hellen Hiza1,2, Levan Jugheli1,2, Lujeko Kamwela1,2,17, Midori Kato-Maeda18, Qingyun Liu12, Serej D. Ley1,2,19, Chloe Loiseau1,2, Surakameth Mahasirimongkol20,21, Bijaya Malla1,2, Prasit Palittapongarnpim20,21, Niaina Rakotosamimanana22, Voahangy Rasolofo22, Miriam Reinhard1,2, Klaus Reither2,23, Mohamed Sasamalo 1,2,17, Rafael Silva Duarte4, Christophe Sola24,25, Philip Suffys26, Karla Valeria Batista Lima27,28, Dorothy Yeboah-Manu 29, Christian Beisel30, Daniela Brites1,2, Sebastien Gagneux 1,2 1Department of Medical Parasitology and Infection Biology, Swiss Tropical and Public Health Institute, Basel, Switzerland 2University of Basel, Basel, Switzerland 3Institute of Biomedicine of Valencia, Valencia, Spain 4Instituto de Microbiologia, Federal University of Rio de Janeiro, Rio de Janeiro, Brazil 5Instituto Nacional de Infectologia Evandro Chagas, Fundação Oswaldo Cruz, Rio de Janeiro, Brazil 6I2SysBio, University of Valencia, Valencia, Spain 7Institute of Infectious Diseases and Molecular Medicine, University of Cape Town, Cape Town, South Africa 8Department of Medical Microbiology, Makerere University, Kampala, Uganda 9National Institute of Infectious Diseases and Vaccinology, National Health Research Institute, Zhunan, Taiwan 10Institute for Social and Preventive Medicine, University of Bern, Bern, Switzerland 11Victorian Infectious Diseases Reference Laboratory, Melbourne, Australia 12Institute of Medical Microbiology, School of Basic Medical Science of Fudan University, Shanghai, China 13Instituto de Investigación Sanitaria Gregorio Marañón, Madrid, Spain 14CIBER Enfermedades Respiratorias, Madrid, Spain 15Barcelona Institute for Global Health, Barcelona, Spain 16Centro de Investigação em Saúde de Manhiça, Maputo, Mozambique 17Ifakara Health Institute, Bagamoyo, Tanzania 18School of Medicine, University of California, San Francisco, USA 19Papua New Guinea Institute of Medical Research, Goroka, Papua New Guinea 20Department of Microbiology, Mahidol University, Bangkok, Thailand 21National Science and Technology Development Agency, Bangkok, Thailand 22Institut Pasteur de Madagascar, Antananarivo, Madagascar 23Department of Medicine, Swiss Tropical and Public Health Institute, Basel, Switzerland 24Université Paris-Saclay, Paris, France 25INSERM-Université de Paris, Paris, France   Page 1 of 24 F1000Research 2021, 10:60 Last updated: 31 MAR 2021 26Laboratório de Biologia Molecular Aplicada a Micobactérias, Fundação Oswaldo Cruz, Rio de Janeiro, Brazil 27Centro de Ciências Biológicas e da Saúde, Universidade do Estado do Pará, Belém, Brazil 28Instituto Evandro Chagas, Ananindeua, Brazil 29Noguchi Memorial Institute for Medical Research, University of Ghana, Accra, Ghana 30Department of Biosystems Science and Engineering, ETH Zürich, Basel, Switzerland v1 First published: 01 Feb 2021, 10:60 Open Peer Review https://doi.org/10.12688/f1000research.28318.1 Latest published: 29 Mar 2021, 10:60 https://doi.org/10.12688/f1000research.28318.2 Reviewer Status Invited Reviewers Abstract Background: Lineage 1 (L1) and 3 (L3) are two lineages of the 1 2 3 Mycobacterium tuberculosis complex (MTBC) causing tuberculosis (TB) in humans. L1 and L3 are prevalent around the rim of the Indian version 2 Ocean, the region that accounts for most of the world’s new TB cases. (revision) Despite their relevance for this region, L1 and L3 remain 29 Mar 2021 understudied. Methods: We analyzed 2,938 L1 and 2,030 L3 whole genome version 1 sequences originating from 69 countries. We reconstructed the 01 Feb 2021 report report report evolutionary history of these two lineages and identified genes under positive selection. Results: We found a strongly asymmetric pattern of migration from 1. Tanvi Honap , University of Oklahoma, South Asia toward neighboring regions, highlighting the historical role Norman, USA of South Asia in the dispersion of L1 and L3. Moreover, we found that several genes were under positive selection, including genes involved 2. Miguel Viveiros , NOVA University Lisbon, in virulence and resistance to antibiotics . For L1 we identified Lisbon, Portugal signatures of local adaptation at the esxH locus, a gene coding for a João Perdigão, University of Lisbon, Lisbon, secreted effector that targets the human endosomal sorting complex, and is included in several vaccine candidates. Portugal Conclusions: Our study highlights the importance of genetic diversity in the MTBC, and sheds new light on two of the most important MTBC 3. Ola B. Brynildsrud , Norwegian Institute lineages affecting humans. of Public Health, Oslo, Norway Keywords Any reports and responses or comments on the Mycobacterium tuberculosis, adaptation, coevolution article can be found at the end of the article. This article is included in the Antimicrobial Resistance collection.   Page 2 of 24 F1000Research 2021, 10:60 Last updated: 31 MAR 2021 Corresponding authors: Fabrizio Menardo (fabrizio.menardo@swisstph.ch), Sebastien Gagneux (sebastien.gagneux@swisstph.ch) Author roles: Menardo F: Conceptualization, Data Curation, Formal Analysis, Investigation, Methodology, Software, Visualization, Writing – Original Draft Preparation, Writing – Review & Editing; Rutaihwa LK: Data Curation, Investigation, Writing – Review & Editing; Zwyer M: Data Curation, Writing – Review & Editing; Borrell S: Resources, Supervision, Writing – Review & Editing; Comas I: Resources, Writing – Review & Editing; Conceição EC: Resources; Coscolla M: Data Curation; Cox H: Resources, Writing – Review & Editing; Joloba M : Resources; Dou HY: Resources; Feldmann J: Investigation; Fenner L: Resources; Fyfe J: Resources, Writing – Review & Editing; Gao Q: Data Curation; García de Viedma D: Resources, Writing – Review & Editing; Garcia-Basteiro AL: Resources, Writing – Review & Editing; Gygli SM: Data Curation, Writing – Review & Editing; Hella J: Resources; Hiza H: Resources; Jugheli L: Resources; Kamwela L: Resources; Kato-Maeda M: Resources; Liu Q: Resources; Ley SD: Resources, Writing – Review & Editing; Loiseau C: Data Curation, Software; Mahasirimongkol S: Resources; Malla B: Resources; Palittapongarnpim P: Resources, Writing – Review & Editing; Rakotosamimanana N: Resources; Rasolofo V: Resources; Reinhard M: Investigation; Reither K: Resources; Sasamalo M: Resources; Silva Duarte R: Supervision, Writing – Review & Editing; Sola C: Resources, Writing – Review & Editing; Suffys P: Resources; Batista Lima KV: Resources; Yeboah-Manu D: Resources; Beisel C: Resources, Supervision, Writing – Review & Editing; Brites D: Data Curation, Writing – Review & Editing; Gagneux S: Conceptualization, Funding Acquisition, Project Administration, Supervision, Writing – Review & Editing Competing interests: No competing interests were disclosed. Grant information: This work was supported by the Swiss National Science Foundation (grants 310030_188888, CRSII5_177163, IZRJZ3_164171 and IZLSZ3_170834) and the European Research Council (309540 EVODRTB and 883582-ECOEVODRTB). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. Copyright: © 2021 Menardo F et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. How to cite this article: Menardo F, Rutaihwa LK, Zwyer M et al. Local adaptation in populations of Mycobacterium tuberculosis endemic to the Indian Ocean Rim [version 1; peer review: 3 approved] F1000Research 2021, 10:60 https://doi.org/10.12688/f1000research.28318.1 First published: 01 Feb 2021, 10:60 https://doi.org/10.12688/f1000research.28318.1   Page 3 of 24 F1000Research 2021, 10:60 Last updated: 31 MAR 2021 Introduction Results The global tuberculosis (TB) epidemic is a major public health We screened a large collection of publicly available MTBC emergency, disproportionately affecting vulnerable populations genome sequences and selected those belonging to L1 and L3. and worsening existing inequalities. Although the estimated Additionally, we newly sequenced 767 strains to cover regions incidence and the number of fatalities have slowly decreased that were not well represented in the public data. We applied a over time, every year ten million people develop the disease, series of bioinformatic filters to exclude low quality sequences and 1.4 million TB patients die (WHO, 2020). Moreover, the and mixed infections (Methods), and obtained a curated Covid-19 pandemic will likely set back the progress toward TB genome dataset of 4,968 strains (2,938 L1, 2,030 L3). While the eradication for several years (Glaziou, 2020; Hogan et al., 2020; phylogenetic tree of L3 showed a ladder-like topology with- McQuaid et al., 2020; Saunders & Evans, 2020). TB is spread out an evident division in sublineages, L1 comprised five clearly worldwide, but not all regions are equally affected: 95% of new distinct sublineages coalescing close to the root of the tree TB cases occur in Africa and Asia, and the eight countries with (Extended data: Figures 1 and 2). the largest burden account for two thirds of the total number of cases (WHO, 2020). The genome sequences included in our final dataset repre- sented 69 countries and covered the known geographic range Human TB is caused by members of the Mycobacterium tuber- of L1 and L3 (Methods; Extended data: Table 1 and Figures 3 culosis complex (MTBC), which includes nine phylogenetic and 4). Because we were interested in studying L1 and L3 lineages with different geographic distributions. Five of these in their endemic range, we excluded strains originating from lineages are restricted to Africa (L5, L6, L7, L8, and L9), while North America, Europe, and Australia from the biogeographic the remaining four (L1, L2, L3, L4) are more broadly distributed analysis, as in these low-burden regions, most TB patients are (Chihota et al., 2018; Coscollá et al., 2020; Gagneux, 2018; recent migrants who were infected in their country of origin Ngabonziza et al., 2020). While L2 and L4 strains occur across (White et al., 2017). We assigned strains to different geographic the world, L1 strains are predominantly found around the rim regions following a modified version of the United Nation of the Indian Ocean (East Africa, South Asia, and Southeast geographic scheme (Methods, Figure 1 and Figure 2). Map- Asia). L3 has a distinct geographic range (East Africa, Cen- ping the regions of origin onto the phylogenetic trees revealed tral Asia, Western Asia, and South Asia) that overlaps partially that two sublineages of L1 are found almost exclusively in with L1 (Couvin et al., 2019; Wiens et al., 2018). While many Southeast Asia (L1.1.1 and L1.2.1), while the others are spread MTBC lineages also occur in Northern Europe, North America, over the complete geographic range of L1 (Figure 1; Extended Australia and New Zealand, in these low-burden regions, the Data: Figures 5–6). majority of TB cases are imported through recent migrations, We performed a phylogeographic analysis using two alterna- and local transmission is rare (White et al., 2017). tive approaches (Mascot and PASTML), which are based on dif- ferent models and inference methods (Ishikawa et al., 2019; Beyond their geographic distribution, MTBC lineages also dif- Müller et al., 2018). We found that South Asia was predicted fer in virulence, transmissibility, association with drug resist- to be the geographic range of the Most Recent Common Ances- ance, and the host immune responses they elicit (Peters et al., tor (MRCA) of both L1 and L3 (Extended Data: Information). 2020). There is an increasing notion that MTBC genetic variation Most interestingly, we found a strongly asymmetric pattern should be considered in the development of new antibiotics and of migration. For L1, PASTML identified most migration vaccines, and when studying the epidemiology and pathogen- events from South Asia toward other regions, followed by esis of the disease (Gagneux, 2017; Peters et al., 2020; Wiens further dispersion, but almost no back migration to South Asia et al., 2018). Yet, much of TB research to date has focused on (Figure 1; Extended data: Figure 7 and Files 1 and 2). When the laboratory strain H37Rv (belonging to L4) and a few other we estimated the migration rates with Mascot, we found that the strains belonging to L2 and L4, because of their broad geographic forward migration rate from South Asia toward the rest of the range, and their association with drug resistance. By contrast, world were 3 to 17 times larger than the migration rates in the L1 and L3 have largely been neglected, and only few studies opposite direction, confirming the results of PASTML (Figure 1; have investigated the global populations of these two lineages Extended data: Table 2). We found a similar scenario for L3: (Couvin et al., 2019; O’Neill et al., 2019). This knowledge PASTML inferred the largest number of migrations from gap is particularly severe as L1 and L3 are endemic to some of South Asia toward East Africa, further spread from East Africa the world regions with the heaviest TB burden. For example, toward neighboring regions, but essentially no migration back L1 and L3 cause the majority of TB cases in India, the country to South Asia (Figure 2; Extended data: Files 3–4). The with the highest number of new TB cases in the world, and L1 is Mascot analysis showed that the forward migration rates from by far the most important cause of TB in the Philippines, the South Asia toward East Africa were 26 times larger than the country with the 4th highest global TB burden (Wiens et al., migration rates in the opposite direction (Figure 2; Extended 2018). The aim of this study was to characterize the global popula- data: Table 2). tion structure of L1 and L3 using large-scale population genom- ics, and to investigate the evolutionary history and selective We performed tip dating to estimate the age of the trees but forces acting on these two lineages. got inconsistent results for L1, due to a lack of a reliable Page 4 of 24 F1000Research 2021, 10:60 Last updated: 31 MAR 2021 a c 50 0 −50 −100 0 100 200 Longitude d Sample size 1 10 100 e b 50 L1.2.1 L1.2.2 0 L1.1.3 L1.1.2 −50 L1.1.1 −100 0 100 200 Longitude East Africa North Africa South America Southeast Asia (island) West Africa East Asia South Africa South Asia Southeast Asia (mainland) West Asia Figure 1. Results of the biogeography analysis: L1. a) Heatmap indicating the origin of the 2,061 L1 strains included in the dataset used for the biogeography analysis. b) The geographic regions used in the biogeography analysis of L1. c) Results of PASTML (compressed tree), the color code follows the legend of panel (b), the size of the circles is proportional to the number of tips, and the size of the arrows is proportional to the number of times the pattern of migration was observed on the tree. This plot excluded nodes that represented less than three strains. The same plot including all nodes can be found as Extended Data File 1. d) Tree obtained with the Mascot analysis; the branches are colored following the legend in panel (b) and represent the inferred ancestral region with the largest posterior probability. e) Representation of the relative effective population sizes (circles) and migration rates (lines connecting circles) estimated by Mascot. Lines representing migration rates are colored based on the region of origin (interpreted forward in time). For example, the dark blue line connecting the dark blue circle with the red circle represents the forward migration rate between South Asia and East Africa, which is the largest migration rate estimated by Mascot. The parameter estimates are reported in Extended Data Table 2. temporal signal (Extended data: Information, Figures 8 and 9, With respect to the geographical aspects, we identified several Tables 3 and 4). However, the results of a previous study interesting instances of long-range dispersal. First, we found suggested a relatively fast evolutionary rate for L1 (~ 1.4×10-7 a clade of L1, composed of 11 strains sampled in five differ- nucleotide changes per site per year; Menardo et al., 2019), ent West African countries. This was surprising because West and that the MRCA of L1 lived around the 12th century AD Africa is usually not considered part of the geographic range of (Extended data: Information). By contrast, L3 had a good tempo- L1. This clade is nested within sublineage L1.1.1, which is essen- ral signal, and different methods estimated that its MRCA lived tially only found in mainland Southeast Asia, and the PASTML between the 2nd and the 13th century AD. However, the uncer- analysis inferred a direct introduction from Southeast Asia to tainty around all of these estimates was very large (Extended data: West Africa. This introduction is unlikely to have happened Information, Figure 8, Tables 5 and 6). Together, these results before the 16th century, when the Portuguese established the corroborate the findings of previous studies (Bos et al., 2014; maritime route between Europe and Asia by circumnavigating Duchêne et al., 2016; Menardo et al., 2019). Calibrating MTBC Africa. Assuming that this migration did not occur before the trees that are hundreds or thousands of years old, with sequences 16th century, we benchmarked the molecular clock of L1 and sampled in the last few decades is notoriously challenging found that this scenario is indeed compatible with a clock rate and subject to limitations (Menardo et al., 2019). Therefore, equal to, or larger than 1.4×10-7 nucleotide changes per site per we refrain from any strong interpretation of the results of the year, but not with lower ones (Extended data: Information, molecular clock analyses of L1 and L3. We emphasize that the Figure 10). ages reported here are the most likely estimates supported by the available data. Additional data, or alternative methods, might Second, we found that L1 was introduced to South America on result in different temporal scenarios. at least 11 independent occasions (Figure 1; Extended data: Page 5 of 24 Latitude Latitude F1000Research 2021, 10:60 Last updated: 31 MAR 2021 a c 50 0 −50 −100 0 100 200 Longitude d Sample size 1 10 100 b e 50 0 −50 −100 0 100 200 Longitude East Africa North Africa South Asia West Africa East Asia South Africa Southeast Asia (mainland) West Asia Figure 2. Results of the biogeography analysis: L3. a) Heatmap indicating the origin of 1,021 L3 strains included in the dataset used for the biogeography analysis. b) The geographic regions used in the biogeography analysis of L3. c) Results of PASTML (compressed tree), the color code follows the legend of panel (b), the size of the circles is proportional to the number of tips, and the size of the arrows is proportional to the number of times the pattern of migration was observed on the tree. This plot excluded nodes that represent less than three strains, the same plot including all nodes can be found as Extended Data File 3. d) Tree obtained with the Mascot analysis; the branches are colored following the legend in panel (b) and represent the inferred ancestral region with the largest posterior probability. e) Representation of the relative effective population sizes (circles) and migration rates (lines connecting circles) estimated by Mascot. Lines representing migration rates are colored based on the region of origin (interpreted forward in time). The effective population size of South Asia was estimated to be much larger than the one of East Africa. Additionally, the forward migration rate between South Asia and East Africa was much larger than the one in the opposite direction. The parameter estimates are reported in Extended Data Table 2. Files 1 and 2). Assuming a clock rate of 1.4×10-7, the earliest 16th century. An alternative explanation could be the Austrone- introduction was between 1620 and 1830 AD from East Africa, sian expansion. Austronesians are thought to have reached the while subsequent introductions occurred from East Africa, Comoros islands and Madagascar between the 9th and the South Africa, and South Asia. These results support the 13th century AD, possibly through direct navigation from hypothesis that L1 was first introduced to Brazil through the Southeast Asian islands. Malagasy speak an Austronesian lan- slave trade from East Africa (Allen, 2013; Conceição et al., guage, and Austronesian genetic signatures are found in human 2019). Interestingly, this is in contrast with L5 and L6, which are populations in the Comoros, Madagascar, and to a small extent endemic to West Africa, but did not establish themselves firmly also in the Horn of Africa (Blench, 2010; Boivin et al., 2013; in South America (De Jong et al., 2010; Rabahi et al., 2020). Brucato et al., 2016; Brucato et al., 2018; Brucato et al., 2019; Pierron et al., 2014). Third, similar to the West African clade of L1.1.1, we found an East African clade embedded within sublineage L1.2.1, which L1 and L3 coexist in many regions around the Indian Ocean. otherwise is found almost exclusively in Southeast Asia. This Yet, in their evolutionary history, these lineages colonized East African clade is composed of 11 strains from five coun- areas occupied by different human populations. Human genetic tries, and its sister clade is found in East Timor and Papua New variation has been shown to influence the susceptibility to TB Guinea. We inferred a direct migration from the islands of (Qu et al., 2011). Most notably, HLA genes play a crucial role Southeast Asia to East Africa that occurred between the 13th and in the activation of the immune responses to the MTBC by the 16th century AD (assuming a clock rate of 1.4×10-7). This exposing bacterial peptides (epitopes) to the surface of an would be compatible with early Portuguese expeditions, infected cell, where they can be recognized by T cells. HLA which reached East Timor and Papua New Guinea in the early genes are extremely polymorphic in human populations, and Page 6 of 24 Latitude Latitude F1000Research 2021, 10:60 Last updated: 31 MAR 2021 several alleles of different HLA genes are associated with TB sequences with 36,316 variable sites for L3). The epitope that susceptibility in different populations (Brahmajothi et al., 1991; accumulated most mutations was located at the N terminus of Salie et al., 2014; Sveinbjornsson et al., 2016; Vejbaesya esxH (Rv0288). This peptide is a T cell epitope for both classes, et al., 2002; Yuliwulandari et al., 2010). MHCI and MHCII; it is also a B cell epitope and was previ- ously identified as one of the few T cell epitopes that were not Previous studies have shown that T cell epitopes are hyper- hyper-conserved (Comas et al., 2010). We found 15 derived conserved in the MTBC, suggesting that immune escape does haplotypes (at the amino acid level), generated by 28 independ- not provide an advantage and that, contrary to other patho- ent replacements at five positions in a peptide of seven amino gens, the MTBC needs to be recognized by the immune system acids (Figure 3). By contrast, the second most variable epitope and to cause disease in order to transmit (Comas et al., 2010; accumulated only seven amino acid changes (Extended data: Coscolla et al., 2015). Our large dataset of L1 and L3 Table 7). Interestingly, we did not find this signature in L3, genome sequences from different geographic regions pro- as all strains carried the ancestral haplotype at the N-terminal vided an opportunity to scan for lineage- and/or region-specific esxH epitope. Moreover, when we extended the analysis to two signatures of selection at T cell epitopes in L1 and L3. large genomic datasets of L2 and L4 strains, we found a much weaker signal: while 21% of L1 strains carried a derived hap- We reconstructed the mutational history of T cell epitopes in lotype for this epitope, only 1% of L2 and L4 strains, and no L1 and L3, and found that 51% of all epitopes were variable L3 strain had a derived haplotype. Despite analyzing datasets (at the amino acid level) in at least one L1 strain, while only with more strains (6,752 and 10,466 for L2 and L4, respec- 20% were variable in at least one L3 strain (Extended data: tively), and more polymorphic positions (140,309 and 277,648) Table 7). However, this difference can be explained by the dif- compared to L1, we found only three amino acid replace- ferent size and diversity of the two datasets (2,061 genome ments at the N-terminal epitope of esxH in L2, and seven in L4 sequences with 136,023 variable sites for L1; 1,021 genome (Figure 3). The most frequently mutated position was the tenth Figure 3. The hypervariable epitope at the N terminus of esxH (aa 1–18). The ancestral epitope is reported in top position, the derived haplotypes are reported below: mutations that were present in more than one lineage are highlighted in yellow, haplotypes that were found exclusively in L1, L2 or L4 are highlighted in pink, blue and red, respectively. Asterisks indicate the position inferred to be under positive selection by PAML: * posterior probability > 0.95: ** posterior probability > 0.99. The table on the right reports for each lineage the number of strains harboring the corresponding haplotype, and between parentheses, the number of independent parallel occurrences of the mutations as inferred by PAUP. Page 7 of 24 F1000Research 2021, 10:60 Last updated: 31 MAR 2021 amino acid, where we found 12 independent replacements in whether replacements in the hypervariable esxH epitope were L1, and two in L2 and L4 (Figure 3). The amino acid replace- significantly associated with a particular geographic region ment A10V occurred eight times in parallel in L1, once in L2 (Methods). We found that South African strains were less likely and twice in L4. The most abundant derived haplotype was to harbor a derived haplotype in the N-terminal esxH epitope caused by a different amino acid replacement at position ten than expected by chance (empirical p-value = 0.013; Table 1). (A10T), which occurred once in L1 and once in L2 (Figure 3). While East African L1 strains were not associated with the Overall, the replacements in all lineages occurred in eight derived haplotypes (empirical p-value = 0.276), we noticed residues in a peptide of 13 amino acids (Figure 3). important differences between countries: of the 29 East African strains harboring a derived haplotype, 28 were sampled in We evaluated the robustness of these results by formally testing Madagascar. When we excluded Madagascar, we found that for positive selection on the complete sequence of esxH using East Africa had a strong negative association with the derived PAML (Yang, 2007). We found that esxH was indeed under haplotypes (i.e. East African strains harbored less derived positive selection in L1 (p-value = 0.00004) but not in the other haplotypes than expected by chance; empirical p-value = 0.0004). lineages (p-values = 0.39, 1.00 and 0.65 for L2, L3 and L4, We then tested the most frequently replaced position (position respectively). PAML identified four codons that have been under ten) alone, and again found that East African strains were nega- positive selection in L1 (posterior probability > 0.95), all of tively associated with the derived haplotypes, with and without them within the N-terminal epitope (codons 7, 9, 10 and 13; excluding Malagasy strains (empirical p-values = 0.0176 and Figure 3). Codon 76, which is part of a different T cell epitope, 0.034, respectively). Finally, we tested the derived haplotype had a posterior probability > 0.9, mutated three times in caused by the most frequent amino acid replacement (A10V; parallel and was possibly also under positive selection. 8 parallel occurrences). Again, we found a negative association with East African strains (empirical p-values = 0.046 and Our results further revealed that the derived haplotypes of 0.079, respectively including and excluding Malagasy strains; the N-terminal esxH epitope were not distributed randomly Table 1). across the geographic range of L1. Twenty-two of the 28 (79%) amino acid replacements occurred in sublineages L1.1.1 and We hypothesized that the accumulation of missense mutations L1.2.1, which are almost exclusively present in Southeast Asia at the N-terminal epitope of esxH was due to immune escape. (Extended data: Figure 11). We constructed a statistical test of Therefore, we performed in silico prediction of the binding association similar to phyC (Farhat et al., 2013) to determine affinity of the ancestral haplotype and of the two most frequent Table 1. Results of the test of association between haplotypes of the N-terminal esxH epitope and the geographic region of origin. Region1 Strains2 Association3 esxH (A10V)4 esxH (A10X)5 esxH (N terminus)6 Strains p-value Strains p-value Strains p-value South Asia 219 + 1 0.4852 6 0.385 17 0.453 Southeast Asia (islands) 235 + 10 0.154 156 0.081 171 0.174 Southeast Asia (mainland) 982 + 25 0.200 183 0.095 210 0.196 East Africa 466 - 0 0.046 1 0.034 29 0.276 East Africa (no Madagascar) 391 - 0 0.070 0 0.018 1 0.0004 South Africa 49 - 0 0.305 0 0.165 0 0.013 South America 77 - 0 0.496 0 0.352 0 0.095 1 We included only regions for which the sample size was 25 or more. 2 Total number of L1 strains from each region. 3 All p-values are one-tailed empirical p-values. This column indicates the direction of the association: + we tested for positive association between the derived haplotype(s) and the geographic region; - we tested for negative association between derived haplotype(s) and the geographic region. 4 Number of strains with the derived haplotype caused by the replacement A10V, and results of the test of association. P-values < than 0.1 are shown in bold. 5 Number of strains with the derived haplotypes caused by any replacement at position 10 (A10X), and results of the test of association. P-values < than 0.1 are shown in bold. 6 Number of strains with the derived haplotypes caused by any mutations occurred in the first 18 amino acids of EsxH, and results of the test of association. P-values < than 0.1 are shown in bold. Page 8 of 24 F1000Research 2021, 10:60 Last updated: 31 MAR 2021 derived haplotypes (caused by the amino acid replacements two hypotheses can be excluded with the available data. None- A10V and A10T) with different HLA-A, HLA-B and theless, our discovery of a West African clade that originated HLA-DRB1 alleles (Methods). We performed this analysis through a direct introduction from Southeast Asia supports a for: faster clock rate of L1, and therefore a more recent MRCA 1) Alleles found at high frequency (> 10%) in South- and (Extended data: Information). As we already mentioned, Southeast Asia, but not in East Africa. tip-dating analyses of MTBC trees with roots that are sev- eral hundreds of years old is extremely challenging, and the 2) A lleles found at high frequency (> 10%) in East Africa, results of such analyses should be taken with caution (Menardo but not in South- and Southeast Asia. et al., 2019). Moreover, all MTBC molecular clock studies so far assumed that evolutionary rate estimates do not depend on 3) Alleles found at high frequency (> 10%) in both regions. the age of the calibration points (Ho et al., 2005). There are some indications that the effect of time dependency in MTBC datasets is negligible (Pepperell et al., 2013; Menardo et al., However, we found no differences in the predicted binding 2019). However, this assumption has not yet been thoroughly affinities between alleles with different geographic distributions tested due to the lack of appropriate samples. (Extended data: Table 8). We found evidence for a strong selective pressure acting on While esxH was the most striking example of a selective pres- the N terminus of esxH in L1. In contrast, this selective pres- sure specific for one lineage, our analysis suggests that it was sure seems to be much reduced, if not absent, in the “modern” not an isolated case. We performed a genome-wide scan for lineages (L2, L3 and L4). It is known that L1 strains interact selection with PAML, and identified 17 genes under positive differently with the infected hosts compared to other lineages. selection, of which five in common between L1 and L3 For example, L1 strains show a lower virulence in animal (Extended data: Table 9). We found two genes coding for trans- models (Bottai et al., 2020), transmit less efficiently in some membrane proteins, members of the Esx-1 secretion system, clinical settings, and infect older patients (Holt et al., 2018). which were under positive selection in L1 (eccB1 and eccCa1, It has been shown recently that the increased virulence of the Bonferroni corrected p-values = 0.02 and 0.03), and several so-called “modern” MTBC strains is due to the loss of the genes involved in antibiotic resistance that were under posi- genomic region TbD1, which remains present in L1 (Bottai tive selection in both lineages (Extended data: Information). We et al., 2020). However, it was also reported that in some pop- further characterized the profile of drug resistance mutations ulations, L1 was associated with higher patient mortality of L1 and L3, and found that L1 strains harbored a greater pro- (Smittipat et al., 2019). Given these differences with the “mod- portion of inhA promoter mutations (conferring resistance to ern” lineages, it is likely that L1 is subject to different selective isoniazid) compared to L3 strains, confirming previous findings pressures, and it is possible that the greater selective pressure (Fenner et al., 2012; Extended data: Information, Figure 12, on esxH was caused by some epistatic effect specific to L1. Table 10). The signatures of selection at the N-terminal epitope of esxH Discussion were associated with strains sampled in South- and Southeast Our results highlight the central role of South Asia in the Asia, and were almost completely absent in East Africa (exclud- dispersion of L1 and L3. First, we confirmed that the two ing Madagascar). Region-specific signatures of positive selec- lineages probably expanded from South Asia (O’Neill et al., tion are a hallmark of local adaptation; in this case, adaptation 2019). Second, contrary to previous studies that assumed sym- of L1 strains to human hosts with South- and Southeast metric migration rates between regions (O’Neill et al., 2019), Asian genotypes. This corroborates previous studies reporting we found that these migrations occurred mostly in one direction. that in Taiwan, L1 is associated with indigenous populations South Asia was a source of migrant strains that fueled the epi- with Austronesian ancestry (reviewed in Dou et al., 2015). demics in other regions, especially in East Africa. Historically, This hypothesis is also supported by the L1 population in a network of maritime trade, which followed the seasonal- Madagascar. Madagascar is geographically linked to East ity of the Monsoons, connected East Africa and South Asia. It is Africa, however, Malagasies are genetically distinct from unclear how this could have promoted the spread of TB in Africans, as they have mixed African and Southeast Asian one direction, but not in the other. A possible explanation is ancestry due to the Austronesian colonization of Madagascar that strains originating in South Asia were more efficient in (Brucato et al., 2016). Madagascar was the region with transmitting in East Africa, compared to East African strains the second highest prevalence of derived haplotypes in the that migrated to South Asia. N-terminal epitope of esxH after Southeast Asia (islands), 37.3% and 72.8% respectively, opposed to 0.3% (one single strain) Another difference compared to previous studies is the tempo- in the rest of East Africa. ral framework for the evolutionary history of L1. O’Neill et al. (2019) estimated that the MRCA of L1 lived in the 4th century Although all the codons under positive selection in esxH were BC and that the migration rates decreased after the 7th century contained in one single T cell epitope, the selective pressure act- AD. Our results indicate that the MRCA of L1 did not exist ing on esxH could be due to some other factor, and not to the before the 12th century AD. This discrepancy is due to differ- binding of the epitope by the MHC, or to its recognition by ent assumptions about the clock rates of L1, but none of the T cells. A possible mechanism was suggested by experiments Page 9 of 24 F1000Research 2021, 10:60 Last updated: 31 MAR 2021 in a mouse model, showing that the N-terminal epitope of EsxH Methods generates an immunodominant CD8 T cells response. The Strain cultures, DNA extraction and genome amino acid replacement A10T (which we found to be the most sequencing prevalent among the derived haplotypes of the N-terminal MTBC isolates previously identified as L1 and L3, either by EsxH epitope) did not change MHC binding or T cell rec- SNP-typing or spoligotyping, were grown in Middlebrook ognition, but it accelerated the degradation of the epitope, 7H9 liquid medium supplemented with ADC and incubated disrupting the immunodominant response (Sutiwisesak et al., at 37°C. Purified genomic DNA was obtained from cultures 2020; Woodworth et al., 2008). using a CTAB extraction method. Whole genome sequenc- ing was performed on libraries prepared from purified genomic An additional driver of selection could be the effector func- DNA using Illumina Nextera ® XT library and NEBNext ® tion of EsxH. EsxH is a small effector secreted by the Ultra TM II FS DNA Library Prep Kits. Sequencing was per- Esx-3 secretion system as dimer with EsxG (Ilghari et al., 2011). formed using the Illumina HiSeq 2500 or NextSeq 500 paired-end Within the host macrophages, the dimer EsxH-EsxG targets technology. Hrs (a component of the human endosomal sorting complex), impairing trafficking, and hindering phagosomal maturation The sequence data generated by this study has been depos- and antigen presentation, thus contributing to the survival of the ited on SRA under the accession numbers PRJNA630228 pathogen (Mehra et al., 2013; Mittal et al., 2018; Portal-Celhay and PRJNA670836. et al., 2016). The observed signatures of selection could be due to the adaptation of L1 strains to human genotypes of hrs preva- Bioinformatic pipeline lent in South- and Southeast Asia. A similar signature of selection We screened a large collection of publicly available whole was observed in another Esx effector (EsxW), in MTBC strains genome sequences (Illumina) of MTBC strains belonging to belonging to L2 (Holt et al., 2018). Holt and colleagues found L1 and L3, using the diagnostic SNPs described in Steiner et al. evidence for parallel evolution at one residue in the N-terminal (2014). To cover geographic regions that were under-represented loop of EsxW, outside the region covered by known epitopes, by this dataset, we additionally sequenced the genomes of 767 suggesting that the selective pressure on EsxW was not due clinical strains (360 L1, and 407 L3). to antigen recognition. For all samples, Illumina reads were trimmed with Trimmo- The sampling effort in this study was considerable, and it matic v0.33 (SLIDINGWINDOW: 5:20,ILLUMINACLIP:{a provided a more complete picture compared to previous stud- dapter}:2:30:10) (Bolger et al., 2014). Reads shorter than 20 bp ies. Nevertheless, the sampling was not population-based, and were excluded from the downstream analysis. Overlapping for some regions the coverage was scarce (e.g. the Arabian paired-end reads were then merged with SeqPrep (overlap size Peninsula). Because of these limitations, our biogeographical = 15; https://github.com/jstjohn/SeqPrep). The resulting reads analyses were limited to the subcontinental level. This approach were mapped to the reconstructed MTBC ancestral sequence revealed the global population structure and the main (Comas et al., 2013) with BWA v0.7.12 (mem algorithm; macroscopic patterns of diversity and migration of L1 and Li & Durbin, 2010). Duplicated reads were marked by the L3. However, MTBC populations are diverse also within sub- MarkDuplicates module of Picard v 2.1.1 (https://github.com/ continental regions. For example, the MTBC population in broadinstitute/picard). The RealignerTargetCreator and Indel- Southern India is dominated by L1, while the most prevalent Realigner modules of GATK v.3.4.0 (McKenna et al., 2010) lineage in the North is L3 (Couvin et al., 2019). To inves- were used to perform local realignment of reads around Indels. tigate fine-scale processes in greater detail, including local Reads with alignment score lower than (0.93*read_length)-(read_ adaptation, large population-based studies will be necessary. length*4*0.07)) were excluded: this corresponds to more than 7 miss-matches per 100 bp. In conclusion, the results presented here improve our knowl- edge about the TB epidemic around the Indian Ocean. A better SNPs were called with Samtools v1.2 mpileup (Li et al., 2009) understanding of the evolutionary dynamics of different and VarScan v2.4.1 (Koboldt et al., 2012) using the following MTBC populations might inform the development of control thresholds: minimum mapping quality of 20, minimum base strategies for different regions. For example, esxH is part of quality at a position of 20, minimum read depth at a position several new vaccine candidates (Abel et al., 2010; Hoang of 7X, minimum percentage of reads supporting the call 90%. et al., 2013; Radošević et al., 2007), and at least one of these, SNPs in previously defined repetitive regions were excluded H4:IC31, is under clinical development in South Africa (i.e. PPE and PE-PGRS genes, phages, insertion sequences and (Bekker et al., 2020; Nemes et al., 2018). In the light of our repeats longer than 50 bp) as described before (Brites et al., findings, and to develop a globally effective vaccine, it would 2018). be important to know if the results of the clinical trials in South Africa can be replicated in Southeast Asia, where there is a We applied the following filters: genomes were excluded if high prevalence of derived esxH haplotypes in the circulating they had 1) an average coverage <15x, 2) more than 50% of MTBC populations. their SNPs excluded due to the strand bias filter, 3) more than Page 10 of 24 F1000Research 2021, 10:60 Last updated: 31 MAR 2021 50% (or more than 1,000 in absolute number) of their SNPs hav- subsampling strategy, we used Treemmer v0.3 (Menardo et al., ing a percentage of reads supporting the call between 10% and 2018) with options -X 400, -pr, -lm, and -mc 25. For both sub- 90%, 4) contained single nucleotide polymorphisms diagnos- sampling schemes, we generated three subsets of the original tic for different MTBC lineages (Steiner et al., 2014), as this datasets, resulting in six subsets for each lineage. We indicated that a mix of genomes was sequenced, 5) had more assembled SNP alignments including only variable posi- than 5,000 SNPs of difference compared to the reconstructed tions with less than 10% of missing data, and used jModel- ancestral genome of the MTBC (Comas et al., 2013). Addi- Test 2.1.10 v20160303 (Darriba et al., 2012) to identify the tionally, when multiple strains were sampled from the same best fitting nucleotide substitution model (according to the patient, we retained only one. We further excluded all strains Akaike information criterion) among 11 possible schemes that had less SNPs than (average - (3 * standard deviation)) of including unequal nucleotide frequencies (total models = 22, the respective lineage (calculated after all previous filter- options -s 11 and -f). ing steps). We built SNP alignments for L1 and L3 separately, including only variable positions with less than 10% of missing We performed Bayesian inference with Beast 2.5 (Bouckaert data, and finally, we excluded all genomes with more than et al., 2019). We corrected the xml files to specify the number 10% of missing data in the alignment of the respective line- of invariant sites as indicated here: https://groups.google.com/ age. After all filtering steps, we were able to retrieve 4,968 forum/#!topic/beast-users/QfBHMOqImFE, and used the tip sam- strains with high quality genome sequences for further analyses pling year as calibration. We assumed the best fitting nucleotide (2,938 L1, 2,030 L3; Extended data: Table 1). substitution model as identified by jModelTest, a relaxed log- normal clock model (Drummond et al., 2006) and an expo- Analysis of sublineages nential population size coalescent prior. We chose a 1/x prior We used the curated datasets and inferred phylogenetic trees for the population size [0–109], a 1/x prior for the mean of the based on all polymorphic positions (excluding the ones in lognormal distribution of the clock rate [10−10–10−5], a nor- repetitive regions, see above) with raxml 8.2.11 (Stamatakis, mal(0,1) prior for the standard deviation of the lognormal 2014; -m GTRCAT and -V options). We then identified subline- distribution of the clock rate [0 –infinity]. For the exponen- ages following the classification (and using the diagnostic SNPs) tial growth rate prior, we used the standard Laplace distribu- of Coll et al. (2014). tion [-infinity–infinity]. For all datasets, we ran at least two runs, we used Tracer 1.7.1 (Rambaut et al., 2018) to identify and Molecular clock analyses with LSD exclude the burn-in, to evaluate convergence among runs and We selected all strains for which the year of sampling was to calculate the estimated sample size. We stopped the runs known (2,499 strains, 1,672 L1, 827 L3). For both line- when at least two chains reached convergence, and the ESS of ages, we built SNP alignments including only variable posi- the posterior and of all parameters were larger than 200. tions with less than 10% of missing data. We inferred phy- logenetic trees with RAxML 8.2.11 (Stamatakis, 2014), Since we detected a strong temporal signal only for L3, we using the GTR model (-m GTRCAT -V options). Since the align- performed a set of additional analyses of the subsets of L3. ments contain only variable positions, we rescaled the branch We repeated the Beast analysis with an extended Bayesian lengths of the trees: rescaled_branch_length = ((branch_length Skyline Plot (BSP) prior instead of the exponential growth * alignment_lengths) / (alignment_length + invariant_sites)). prior, and performed a nested sampling analysis (Russel We rooted the trees using the genome sequence of a L2 et al., 2019) to identify which of these two models (exponential strain as outgroup (accession number SAMEA4441446). growth and extended BSP) fitted the data best. The nested sam- pling was run with chainLength = 20000, particleCount= 4, and We used the least square method implemented in LSD v0.3- subChainLength = 10000. beta (To et al., 2016) to estimate the molecular clock rate with the QPD algorithm and calculating the confidence interval All xml input files are available as Supplementary Files in (options -f 100 and -s). We also performed a date randomization Extended data. test by randomly reassigning the sampling dates among taxa 100 times and estimating the clock rate from the rand- Datasets for biogeography analyses omized and observed datasets. All LSD analyses were performed For the biogeography analysis, we considered only genome on the two lineages individually (L1 and L3), and on the five sequences obtained from strains for which the locality of sampling sublineages of L1. was known. When the country of sampling did not correspond to the country of origin of the patient (or was unknown), Molecular clock analyses with Beast we considered as sampling locality the country of origin of Bayesian molecular clock analyses are computationally demand- the patient (this affected 187 strains, 121 L1 and 66 L3). Fur- ing and they would be impossible to apply onto the complete thermore, similarly to other studies (O’Neill et al., 2019), we datasets. Therefore, we sub-sampled the L1 and L3 datasets excluded all genomes that were sampled from Europe, North used for the LSD analysis to 400 genomes with two different America and Australia, as most contemporary infections in strategies: 1) random subsampling 2) random subsam- these regions affect recent migrants. The final dataset for the pling keeping at least 25 genomes for each year of sampling biogeography analyses comprised 3,082 strains (2,061 L1 (where possible; “weighted subsampling”). For this second and 1,021 L3). We assigned the different isolates to different Page 11 of 24 F1000Research 2021, 10:60 Last updated: 31 MAR 2021 subcontinental geographic regions according to sampling local- We performed Bayesian inference with Beast2.5 (Bouckaert ity. To do this, we followed a modified version of the United et al., 2019). We corrected the xml files to specify the number of Nations geographic scheme (https://unstats.un.org/unsd/methodol- invariant sites as indicated here: https://groups.google.com/forum/ ogy/m49/; Extended data: Table 1 and Figures 1–3). #!topic/beast-users/QfBHMOqImFE. We assumed a lognormal uncorrelated clock and we fixed the mean of the lognormal dis- Phylogeography analysis with PASTML tribution of the clock rate to 1.4×10-7 (L1) and 9×10-8 (L3). We For both lineages, we built SNP alignments including only assigned the tip sampling years to the different strains, and variable positions with less than 10% of missing data, and when the sampling time was unknown, we assumed a uniform all strains with known sampling locality (excluding strains prior from 1995 to 2018 (similarly to what done in the from North America, Europe and Australia; see above, 2,061 PASTML analysis). We further assumed the best fitting nucle- L1 genomes and 1,021 L3 genomes). We inferred phyloge- otide substitution model as identified by jModelTest, a gamma netic trees with raxml 8.2.11 (Stamatakis, 2014), using the GTR prior for the standard deviation of the lognormal distribution model (-m GTRCAT -V options). Since the alignments contain of the clock rate [0 –infinity], and a lognormal prior for the only variable positions, we re-scaled the branch lengths of the population size with standard deviation = 0.2, and mean esti- trees: rescaled_branch_length = ((branch_length * alignment_ mated in real space. Finally, we used an exponential distribution lengths) / (alignment_length + invariant_sites)). with mean = 10-4 as prior for the migration rates. For each anal- ysis, we ran at least two runs. We used Tracer 1.7.1 (Rambaut Since PASTML needs a time tree as input, we calibrated the et al., 2018) to identify and exclude the burn-in, to evaluate phylogenies with LSD, assuming a clock rate of 1.4×10-7 for convergence among runs, and to calculate the estimated sam- L1, and 9×10-8 for L3. In this analysis, genomes for which the ple size. We stopped the runs when at least two chains reached sampling date was not known were assumed to have been sam- convergence, and the ESS of the posterior, prior and of the pled between 1995 and 2018, which is the period in which all parameters of interest (population sizes and migration rates) strains with known date of isolation were sampled. Importantly, were larger than 200. using different clock rates for this analysis would only change the time scale of the trees, but not the reconstruction of the The xml input files are available as Supplementary Files in ancestral characters. Extended data. We assigned to each strain the subcontinental geographic region L2 and L4 datasets of origin as character, and used PASTML (Ishikawa et al., For the analysis of esxH, we wanted to compare the results 2019) to reconstruct the ancestral geographical ranges and their obtained for L1 and L3 with the other two major human-adapted changes along the trees of L1 and L3. We used the MPPA as MTBC lineages L2 and L4. Therefore, we compiled two datasets prediction method (standard settings) and added the character of publicly available genome sequences for these two lineages. predicted by the joint reconstruction even if it was not selected We applied the same bioinformatic pipeline described above: by the Brier score (option -forced_joint). Additionally, we genomes were excluded if they had 1) an average coverage repeated the PASTML analysis for the sublineages of L1 <15x, 2) more than 50% (or more than 1,000 in absolute individually. number) of their SNPs having a percentage of reads supporting the call between 10% and 90%, or 3) contained single nucleotide Phylogeography analysis with Mascot polymorphisms diagnostic for different MTBC lineages (Steiner As a complementary method to reconstruct the ancestral range et al., 2014), as this could indicate mixed infection. The final and the migration pattern of different populations, we used dataset consisted of 6,752 L2 genome sequences (with 140,309 the Beast package Mascot (Müller et al., 2018). We assumed polymorphic positions) and 10,466 L4 genome sequences (with that strains sampled in the different subcontinental regions 277,648 polymorphic positions; Extended data: Table 1). We represent distinct subpopulations, and we considered only popu- reconstructed the phylogenetic tree of L2 with raxml 8.2.11 lations for which we had at least 75 genome sequences: four (Stamatakis, 2014) as described above. Due to the large size populations for L1 (East Africa, South Asia, Southeast Asia of the dataset, we used FastTree (Price et al., 2010) with (mainland) and Southeast Asia (islands)), and two popula- options -nt -nocat -nosupport to reconstruct the phylogenetic tions for L3 (East Africa and South Asia). For computational tree of L4. reasons, we subsampled the two datasets (L1 and L3) to ~300 strains. We sampled an equal number of strains from each Epitopes analysis geographic region (where possible), and within regions, we We downloaded the amino acid sequence of all MTBC randomly sampled an equal number of strains from each country epitopes described for Homo sapiens from the immune epitope (where possible). This sub-sampling scheme resulted in two database (https://www.iedb.org/; downloaded on the 03.08.2020). subsets of 303 and 300 strains for L1 and L3, respectively. We considered separately MHCI epitopes and MHCII epitopes. We mapped the epitope sequences onto the H37Rv genome We assembled SNP alignments including only variable positions (GCF_000195955.2) using tblastn and excluded sequences with less than 10% of missing data, and used jModelTest mapping equally well to multiple loci in the H37Rv genome. 2.1.10 v20160303 (Darriba et al., 2012) to identify the best Additionally, we retained only epitopes that mapped with fitting nucleotide substitution model as described above. two mismatches or less over the whole epitope length. This Page 12 of 24 F1000Research 2021, 10:60 Last updated: 31 MAR 2021 resulted in a final list of 539 MHCI epitopes, and 1,144 MHCII Prediction of binding affinity between T cell epitopes epitopes. (Extended data: Table 7). We used the datasets and HLA alleles obtained for the biogeography analysis (2,061 genome sequences We considered three HLA loci: HLA-A, HLA-B and HLA- for L1, and 1,021 genome sequences for L3). DRB1. We used the allele frequency database (Gonzalez-Galarza et al., 2020) to identify alleles that are prevalent in East Africa For each lineage, we independently assembled a multiple and not in South Asia and Southeast Asia, or the other way sequence alignment for each epitope. We then translated the around. Because the coverage of the allele frequency database sequence to amino acids and used PAUP 4.a (Wilgenbusch & is patchy, we focused on the following countries: Kenya and Swofford, 2003) to reconstruct the replacement history of all Zimbabwe as representatives of East Africa; India, Thailand and polymorphic positions on the rooted phylogenetic trees. We Taiwan as representatives of South- and Southeast Asia. We used two maximum parsimony algorithms (ACCTRAN and identified: DELTRAN) and considered only the events reconstructed by both algorithms. 1) Alleles that had a frequency of 10% or more in at least one population in South- and Southeast Asia, but had frequencies Analysis of esxH lower than 10% in all East African populations. We considered the first 18 amino acids of esxH (MSQIMY- NYPAMLGHAGDM), which was by far the epitope with 2) Alleles that had a frequency of 10% or more in at least one most amino acid replacements in L1. We expanded the analy- population in East Africa, but had frequencies lower than 10% sis of this epitope to the L2 and L4 datasets, so that we could in all populations in South- and Southeast Asia. compare the results with L1 and L3. For the PAML analysis, we randomly selected 500 strains from each MTBC lineage. We 3) Alleles that had a frequency of 10% or more in at least used the phylogenetic tree reconstructed by RAxML (same set- one population both in South- and Southeast Asia and in East tings as above), and the gene alignment to estimate the branch Africa. lengths of the tree using the M0 codon model implemented in PAML 4.9e (Yang, 2007). This step was necessary to obtain For all these alleles, we performed in silico binding predic- a tree with the branch length in expected substitutions per tion with three epitopes: the ancestral epitope at the esxH N codon. We then fitted two alternative codon models (M1a and terminus (MSQIMYNYPAMLGHAGDM), and the two most M2a) to the trees and alignments. M1a allows ω to be variable frequently observed derived epitopes (MSQIMYNYPTMLGH- across sites, and it assumes two different ω (0 < ω < 1, and ω 0 1 AGDM, and MSQIMYNYPVMLGHAGDM). For HLA-A and = 1), modeling nearly neutral evolution; M2a assumes one addi- ω ω HLA-B alleles, we used the NetMHCPan4.1 server (Reynisson tional ( > 1) compared to M1a, thus modeling positive 2 et al., 2020) with standard settings. For HLA-DRB1 alleles, selection. We performed a likelihood ratio test between the two we used the prediction tool of the immune epitope database models as described in Zhang et al. (2005). Templates for the con- (https://www.iedb.org/). trol files of the codeml analyses (M0, M1a, M2a) are available as Supplementary Files in Extended data. The codon under posi- tive selection were identified with the Bayes empirical Bayes Genome wide scan for positive selection with PAML method (Yang et al., 2005). For this analysis, we used the subsets generated for the Mascot analysis. These datasets are representative of the populations To test whether the derived haplotypes of this epitope were of L1 and L3 in their core geographic ranges, and are compu- associated with a specific geographic region, we constructed a tationally treatable. We generated sequence alignments for all statistical test analogous to PhyC (Farhat et al., 2013), which is genes individually, excluding genes in repetitive regions of the normally used to test for association between a variant and pheno- genome (see above). Because some genes are deleted in L1 typic drug resistance. We simulated mutations on the phylogenetic but not in L3, or the other way around, we obtained a slightly tree of L1 and counted how many strains from each region different number of gene alignments for the two lineages resulted to have a derived state according to the simulation. (L1: 3,623, L3: 3,622). For each gene, we performed a test for Under this procedure, mutations occur randomly on the tree, positive selection with PAML as described above for esxH. We and therefore independently from the geographic region where considered as under positive selection all genes, for which the tips were sampled. For each test, we performed 10,000 sim- the likelihood ratio test resulted in a Bonferroni-corrected ulations to obtain a null distribution, and we compared this dis- p-value < 0.05. tribution to the observed data to obtain empirical p-values. We used the same geographic region used for the biogeography Drug resistance mutations profiles analysis. Additionally, we considered East Africa excluding We considered 196 mutations conferring resistance to different Madagascar. antibiotics (Payne et al., 2019). We extracted the respective genomic positions from the vcf file of the 4,968 genomes of the R code and input files to perform this test are available as complete curated data set (2,938 L1, 2,030 L3) and assembled Supplementary Files in Extended data. them in phylip format. To determine the number of independent Page 13 of 24 F1000Research 2021, 10:60 Last updated: 31 MAR 2021 mutations, we reconstructed the nucleotide changes on the • S upplementary_file_3.html: PASTML results for L3, com- phylogenetic tree rooted with the L2 strain (SAMEA4441446). pressed tree. To do this, we used the maximum parsimony ACCTRAN and DELTRAN algorithms implemented in PAUP 4.0a (Wilgenbusch • S upplementary_file_4.html: PASTML results for L3, com- & Swofford, 2003), and considered only the events reconstructed plete tree. by both algorithms. • T he folder Dating_Beast contains the xml files for the dating analyses. Data availability Underlying data • T he folder EsxH_haplotype_association_test contains The sequence data generated by this study has been deposited the code and input files to perfomr the test of asso- on SRA (https://www.ncbi.nlm.nih.gov/sra) under the accession ciation between different haplotypes and geographic numbers PRJNA630228 and PRJNA670836. regions. • T he folder EsxH_PAML contains the control files and Extended data input files to perform the positive selection analysis on Extended data is available here: https://github.com/fmenardo/ EsxH. MTBC_L1_L3. • T he folder Mascot contains the xml files for the DOI: https://doi.org/10.5281/zenodo.4435760 (Menardo, 2021). Mascot analyses. This project contains the following extended data: Extended data is available under a GNU GENERAL PUBLIC LICENSE. • T he folder Supplementary_text_figures_tables contains supplementary text, figures and tables. • S upplementary_file_1.html: PASTML results for L1, com- pressed tree. Acknowledgements Calculations were performed at sciCORE (http://scicore.uni- • S upplementary_file_2.html: PASTML results for L1, com- bas.ch/) scientific computing core facility at the University of plete tree. Basel. References Abel B, Tameris M, Mansoor N, et al.: The novel tuberculosis vaccine, AERAS- Brahmajothi V, Pitchappan RM, Kakkanaiah VN, et al.: Association of 402, induces robust and polyfunctional CD4+ and CD8+ T cells in adults. Am J pulmonary tuberculosis and HLA in south India. Tubercle. 1991; 72(2): Respir Crit Care Med. 2010; 181(12): 1407–1417. 123–132. PubMed Abstract | Publisher Full Text | Free Full Text PubMed Abstract | Publisher Full Text Allen RB: Indian Ocean transoceanic migration, 16th–19th century. The Brites D, Loiseau C, Menardo F, et al.: A new phylogenetic framework for the Encyclopedia of Global Human Migration. 2013. animal-adapted Mycobacterium tuberculosis complex. Front Microbiol. 2018; Publisher Full Text 9: 2820. Bekker LG, Dintwe O, Fiore-Gartland A, et al.: A phase 1b randomized study PubMed Abstract | Publisher Full Text | Free Full Text of the safety and immunological responses to vaccination with H4:IC31, Brucato N, Fernandes V, Kusuma P, et al.: Evidence of Austronesian Genetic H56:IC31, and BCG revaccination in Mycobacterium tuberculosis-uninfected Lineages in East Africa and South Arabia: Complex Dispersal from adolescents in Cape Town, South Africa. EClinicalMedicine. 2020; 21: 100313. Madagascar and Southeast Asia. Genome Biol Evol. 2019; 11(3): 748–758. PubMed Abstract | Publisher Full Text | Free Full Text PubMed Abstract | Publisher Full Text | Free Full Text Blench R: Evidence for the Austronesian voyages in the Indian Ocean. The Brucato N, Fernandes V, Mazières S, et al.: The Comoros Show the Earliest Global Origins and Development of Seafaring. 2010; 239: 48. Austronesian Gene Flow into the Swahili Corridor. Am J Hum Genet. 2018; Reference Source 102(1): 58–68. Boivin N, Crowther A, Helm R, et al.: East Africa and Madagascar in the Indian PubMed Abstract | Publisher Full Text | Free Full Text Ocean world. J World Prehist. 2013; 26(3): 213–281. Brucato N, Kusuma P, Cox MP, et al.: Malagasy Genetic Ancestry Comes from Publisher Full Text an Historical Malay Trading Post in Southeast Borneo. Mol Biol Evol. 2016; Bolger AM, Lohse M, Usadel B: Trimmomatic: a flexible trimmer for Illumina 33(9): 2396–2400. sequence data. Bioinformatics. 2014; 30(15): 2114–2120. PubMed Abstract | Publisher Full Text | Free Full Text PubMed Abstract | Publisher Full Text | Free Full Text Chihota VN, Niehaus A, Streicher EM, et al.: Geospatial distribution of Bos KI, Harkins KM, Herbig A, et al.: Pre-Columbian mycobacterial genomes Mycobacterium tuberculosis genotypes in Africa. PLoS One. 2018; 13(8): reveal seals as a source of New World human tuberculosis. Nature. 2014; e0200632. 514(7523): 494–497. PubMed Abstract | Publisher Full Text | Free Full Text PubMed Abstract | Publisher Full Text | Free Full Text Coll F, McNerney R, Guerra-Assuncao JA, et al.: A robust SNP barcode for Bottai D, Frigui W, Sayes F, et al.: TbD1 deletion as a driver of the typing Mycobacterium tuberculosis complex strains. Nat Commun. 2014; 5(1): evolutionary success of modern epidemic Mycobacterium tuberculosis 4812. lineages. Nat Commun. 2020; 11(1): 684. PubMed Abstract | Publisher Full Text | Free Full Text PubMed Abstract | Publisher Full Text | Free Full Text Comas I, Chakravartti J, Small PM, et al.: Human T cell epitopes of Bouckaert R, Vaughan TG, Barido-Sottani J, et al.: BEAST 2.5: An advanced Mycobacterium tuberculosis are evolutionarily hyperconserved. Nat Genet. software platform for Bayesian evolutionary analysis. PLoS Comput Biol. 2010; 42(6): 498–503. 2019; 15(4): e1006650. PubMed Abstract | Publisher Full Text | Free Full Text PubMed Abstract | Publisher Full Text | Free Full Text Comas I, Coscolla M, Luo T, et al.: Out-of-Africa migration and Neolithic Page 14 of 24 F1000Research 2021, 10:60 Last updated: 31 MAR 2021 coexpansion of Mycobacterium tuberculosis with modern humans. Nat Genet Ishikawa SA, Zhukova A, Iwasaki W, et al.: A fast likelihood method to 2013; 45(10): 1176–1182. reconstruct and visualize ancestral scenarios. Mol Biol Evol. 2019; 36(9): PubMed Abstract | Publisher Full Text | Free Full Text 2069–2085. Conceição EC, Refregier G, Gomes HM, et al.: Mycobacterium tuberculosis PubMed Abstract | Publisher Full Text | Free Full Text lineage 1 genetic diversity in Pará, Brazil, suggests common ancestry with Koboldt DC, Zhang Q, Larson DE, et al.: VarScan 2: somatic mutation and copy east-African isolates potentially linked to historical slave trade. Infect Genet number alteration discovery in cancer by exome sequencing. Genome Res. Evol. 2019; 73: 337–341. 2012; 22(3): 568–576. PubMed Abstract | Publisher Full Text PubMed Abstract | Publisher Full Text | Free Full Text Coscollá M, Brites D, Menardo F, et al.: Phylogenomics of Mycobacterium Li H, Durbin R: Fast and accurate long-read alignment with Burrows- africanum reveals a new lineage and a complex evolutionary history. Wheeler transform. Bioinformatics. 2010; 26(5): 589–595. bioRxiv. 2020. PubMed Abstract | Publisher Full Text | Free Full Text Publisher Full Text Li H, Handsaker B, Wysoker A, et al.: The sequence alignment/map format Coscolla M, Copin R, Sutherland J, et al.: M. tuberculosis T Cell Epitope Analysis and SAMtools. Bioinformatics. 2009; 25(16): 2078–2079. Reveals Paucity of Antigenic Variation and Identifies Rare Variable TB PubMed Abstract | Publisher Full Text | Free Full Text Antigens. Cell Host Microbe. 2015; 18(5): 538–548. McKenna A, Hanna M, Banks E, et al.: The Genome Analysis Toolkit: a PubMed Abstract | Publisher Full Text | Free Full Text MapReduce framework for analyzing next-generation DNA sequencing Couvin D, Reynaud Y, Rastogi N: Two tales: Worldwide distribution of data. Genome Res. 2010; 20(9): 1297–1303. Central Asian (CAS) versus ancestral East-African Indian (EAI) lineages PubMed Abstract | Publisher Full Text | Free Full Text of Mycobacterium tuberculosis underlines a remarkable cleavage for McQuaid CF, McCreesh N, Read JM, et al.: The potential impact of COVID-19- phylogeographical, epidemiological and demographical characteristics. related disruption on tuberculosis burden. Eur Respir J. 2020; 56(2): 2001718. PLoS One. 2019; 14(7): e0219706. PubMed Abstract | Publisher Full Text | Free Full Text PubMed Abstract | Publisher Full Text | Free Full Text Mehra A, Zahra A, Thompson V, et al.: Mycobacterium tuberculosis type VII Darriba D, Taboada GL, Doallo R, et al.: jModelTest 2: more models, new secreted effector EsxH targets host ESCRT to impair trafficking. PLoS heuristics and parallel computing. Nat Methods. 2012; 9(8): 772. Pathog. 2013; 9(10): e1003734. PubMed Abstract | Publisher Full Text | Free Full Text PubMed Abstract | Publisher Full Text | Free Full Text De Jong BC, Antonio M, Gagneux S: Mycobacterium africanum--review of an Menardo F: fmenardo/MTBC_L1_L3: MTBC L1 L3 (Version v2). Zenodo. 2021. important cause of human tuberculosis in West Africa. PLoS Negl Trop Dis. http://www.doi.org/10.5281/zenodo.4435760 2010; 4(9): e744. PubMed Abstract | Publisher Full Text | Free Full Text Menardo F, Duchêne S, Brites D, et al.: The molecular clock of Mycobacterium tuberculosis. PLoS Pathog. 2019; 15(9): e1008067. Dou HY, Chen YY, Kou SC, et al.: Prevalence of Mycobacterium tuberculosis PubMed Abstract | Publisher Full Text | Free Full Text strain genotypes in Taiwan reveals a close link to ethnic and population migration. J Formos Med Assoc. 2015; 114(6): 484–488. Menardo F, Loiseau C, Brites D, et al.: Treemmer: a tool to reduce large PubMed Abstract | Publisher Full Text phylogenetic datasets with minimal loss of diversity. BMC Bioinformatics. 2018; 19(1): 164. Drummond AJ, Ho SYW, Phillips MJ, et al.: Relaxed phylogenetics and dating PubMed Abstract | Publisher Full Text | Free Full Text with confidence. PLoS Biol. 2006; 4(5): e88. PubMed Abstract | Publisher Full Text | Free Full Text Mittal E, Skowyra ML, Uwase G, et al.: Mycobacterium tuberculosis type VII secretion system effectors differentially impact the ESCRT endomembrane Duchêne S, Holt KE, Weill FX, et al.: Genome-scale rates of evolutionary damage response. mBio. 2018; 9(6): e01765–18. change in bacteria. Microb Genom. 2016; 2(11): e000094. PubMed Abstract | Publisher Full Text | Free Full Text PubMed Abstract | Publisher Full Text | Free Full Text Müller NF, Rasmussen D, Stadler T: MASCOT: parameter and state inference Farhat MR, Shapiro BJ, Kieser KJ, et al.: Genomic analysis identifies targets of under the marginal structured coalescent approximation. Bioinformatics. convergent positive selection in drug-resistant Mycobacterium tuberculosis. 2018; 34(22): 3843–3848. Nat Genet. 2013; 45(10): 1183–1189. PubMed Abstract | Publisher Full Text | Free Full Text PubMed Abstract | Publisher Full Text | Free Full Text Nemes E, Geldenhuys H, Rozot V, et al.: Prevention of M. tuberculosis Fenner L, Egger M, Bodmer T, et al.: Effect of mutation and genetic infection with H4:IC31 vaccine or BCG revaccination. N Engl J Med. 2018; background on drug resistance in Mycobacterium tuberculosis. Antimicrob 379(2): 138–149. Agents Chemother. 2012; 56(6): 3047–3053. PubMed Abstract | Publisher Full Text | Free Full Text PubMed Abstract | Publisher Full Text | Free Full Text Ngabonziza JCS, Loiseau C, Marceau M, et al.: A sister lineage of the Gagneux S (Ed.): Strain variation in the Mycobacterium tuberculosis Mycobacterium tuberculosis complex discovered in the African Great Lakes complex: its role in biology, Epidemiology and Control. Springer. 2017; 1019. region. Nat Commun. 2020; 11(1): 2917. Publisher Full Text PubMed Abstract | Publisher Full Text | Free Full Text Gagneux S: Ecology and evolution of Mycobacterium tuberculosis. Nat Rev O'Neill MB, Shockey A, Zarley A, et al.: Lineage specific histories of Microbiol. 2018; 16(4): 202–213. Mycobacterium tuberculosis dispersal in Africa and Eurasia. Mol Ecol. 2019; PubMed Abstract | Publisher Full Text 28(13): 3241–3256. Glaziou P: Predicted impact of the COVID-19 pandemic on global PubMed Abstract | Publisher Full Text | Free Full Text tuberculosis deaths in 2020. medRxiv. 2020. Publisher Full Text Qu HQ, Fisher-Hoch SP, McCormick JB: Knowledge gaining by human genetic studies on tuberculosis susceptibility. J Hum Genet. 2011; 56(3): 177–182. Gonzalez-Galarza FF, McCabe A, Santos EJMD, et al.: Allele frequency net PubMed Abstract | Publisher Full Text | Free Full Text database (AFND) 2020 update: gold-standard data classification, open access genotype data and new query tools. Nucleic Acids Res. 2020; 48(D1): Payne JL, Menardo F, Trauner A, et al.: Transition bias influences the D783–D788. evolution of antibiotic resistance in Mycobacterium tuberculosis. PLoS Biol. PubMed Abstract | Publisher Full Text | Free Full Text 2019; 17(5): e3000265. PubMed Abstract | Publisher Full Text | Free Full Text Ho SYW, Phillips MJ, Cooper A, et al.: Time dependency of molecular rate estimates and systematic overestimation of recent divergence times. Mol Pepperell CS, Casto AM, Kitchen A, et al.: The role of selection in shaping Biol Evol. 2005; 22(7): 1561–1568. diversity of natural M. tuberculosis populations. PLoS Pathog. 2013; 9(8): PubMed Abstract | Publisher Full Text e1003543. PubMed Abstract | Publisher Full Text | Free Full Text Hoang T, Aagaard C, Dietrich J, et al.: ESAT-6 (EsxA) and TB10.4 (EsxH) based vaccines for pre- and post-exposure tuberculosis vaccination. PLoS One. Peters JS, Ismail N, Dippenaar A, et al.: Genetic Diversity in Mycobacterium 2013; 8(12): e80579. tuberculosis Clinical Isolates and Resulting Outcomes of Tuberculosis PubMed Abstract | Publisher Full Text | Free Full Text Infection and Disease. Annu Rev Genet. 2020; 54: 511–537. PubMed Abstract | Publisher Full Text Hogan AB, Jewell BL, Sherrard-Smith E, et al.: Potential impact of the COVID-19 pandemic on HIV, tuberculosis, and malaria in low-income and Pierron D, Razafindrazaka H, Pagani L, et al.: Genome-wide evidence of middle-income countries: a modelling study. Lancet Glob Health. 2020; 8(9): Austronesian-Bantu admixture and cultural reversion in a hunter-gatherer e1132–e1141. group of Madagascar. Proc Natl Acad Sci U S A. 2014; 111(3): 936–941. PubMed Abstract | Publisher Full Text | Free Full Text PubMed Abstract | Publisher Full Text | Free Full Text Holt KE, McAdam P, Thai PVK, et al.: Frequent transmission of the Portal-Celhay C, Tufariello JM, Srivastava S, et al.: Mycobacterium tuberculosis Mycobacterium tuberculosis Beijing lineage and positive selection for the EsxH inhibits ESCRT-dependent CD4 + T-cell activation. Nat Microbiol. 2016; EsxW Beijing variant in Vietnam. Nat Genet. 2018; 50(6): 849–856. 2(2): 16232. PubMed Abstract | Publisher Full Text | Free Full Text PubMed Abstract | Publisher Full Text | Free Full Text Ilghari D, Lightbody KL, Veverka V, et al.: Solution Structure of the Price MN, Dehal PS, Arkin AP: FastTree 2--approximately maximum- Mycobacterium tuberculosis EsxG·EsxH Complex functional implications and likelihood trees for large alignments. PLoS One. 2010; 5(3): e9490. comparisons with other m. tuberculosis esx family complexes. J Biol Chem. PubMed Abstract | Publisher Full Text | Free Full Text 2011; 286(34): 29993–30002. Rabahi MF, Conceição EC, de Paiva LO, et al.: Characterization of PubMed Abstract | Publisher Full Text | Free Full Text Mycobacterium tuberculosis var. africanum isolated from a patient with Page 15 of 24 F1000Research 2021, 10:60 Last updated: 31 MAR 2021 pulmonary tuberculosis in Brazil. Infect Genet Evol. 2020; 85: 104550. Sveinbjornsson G, Gudbjartsson DF, Halldorsson BV, et al.: HLA class II PubMed Abstract | Publisher Full Text sequence variants influence tuberculosis risk in populations of European Radošević K, Wieland CW, Rodriguez A, et al.: Protective immune responses ancestry. Nat Genet. 2016; 48(3): 318–322. to a recombinant adenovirus type 35 tuberculosis vaccine in two mouse PubMed Abstract | Publisher Full Text | Free Full Text strains: CD4 and CD8 T-cell epitope mapping and role of gamma interferon. To TH, Jung M, Lycett S, et al.: Fast dating using least-squares criteria and Infect Immun. 2007; 75(8): 4105–4115. algorithms. Syst Biol. 2016; 65(1): 82–97. PubMed Abstract | Publisher Full Text | Free Full Text PubMed Abstract | Publisher Full Text | Free Full Text Rambaut A, Drummond AJ, Xie D, et al.: Posterior summarization in Bayesian Vejbaesya S, Chierakul N, Luangtrakool K, et al.: Associations of HLA class II phylogenetics using Tracer 1.7. Syst Biol. 2018; 67(5): 901–904. alleles with pulmonary tuberculosis in Thais. Eur J Immunogenet. 2002; 29(5): PubMed Abstract | Publisher Full Text | Free Full Text 431–434. PubMed Abstract | Publisher Full Text Reynisson B, Alvarez B, Paul S, et al.: NetMHCpan-4.1 and NetMHCIIpan-4.0: improved predictions of MHC antigen presentation by concurrent motif White Z, Painter J, Douglas P, et al.: Immigrant arrival and tuberculosis deconvolution and integration of MS MHC eluted ligand data. Nucleic Acids among large immigrant- and refugee-receiving countries, 2005–2009. Res. 2020; 48(W1): W449–W454. Tuberc Res Treat. 2017; 2017: 8567893. PubMed Abstract | Publisher Full Text | Free Full Text PubMed Abstract | Publisher Full Text | Free Full Text Russel PM, Brewer BJ, Klaere S, et al.: Model selection and parameter Wiens KE, Woyczynski LP, Ledesma JR, et al.: Global variation in bacterial inference in phylogenetics using Nested Sampling. Syst Biol. 2019; 68(2): strains that cause tuberculosis disease: a systematic review and meta- 219–233. analysis. BMC Med. 2018; 16(1): 196. PubMed Abstract | Publisher Full Text PubMed Abstract | Publisher Full Text | Free Full Text Salie M, van der Merwe L, Möller M, et al.: Associations between human Wilgenbusch JC, Swofford D: Inferring evolutionary trees with PAUP. Curr leukocyte antigen class I variants and the Mycobacterium tuberculosis Protoc Bioinformatics. 2003; (1): 6–4. subtypes causing disease. J Infect Dis. 2014; 209(2): 216–223. PubMed Abstract | Publisher Full Text PubMed Abstract | Publisher Full Text | Free Full Text Woodworth JS, Wu Y, Behar SM: Mycobacterium tuberculosis-specific CD8+ Saunders MJ, Evans CA: COVID-19, tuberculosis and poverty: preventing a T cells require perforin to kill target cells and provide protection in vivo. perfect storm. Eur Respir J. 2020; 56(1): 2001348. J Immunol. 2008; 181(12): 8595–8603. PubMed Abstract | Publisher Full Text | Free Full Text PubMed Abstract | Publisher Full Text | Free Full Text Smittipat N, Miyahara R, Juthayothin T, et al.: Indo-Oceanic Mycobacterium World Health Organization: Global Tuberculosis Report. 2020. tuberculosis strains from Thailand associated with higher mortality. Int J Reference Source Tuberc Lung Dis. 2019; 23(9): 972–979. Yang Z: PAML 4: phylogenetic analysis by maximum likelihood. Mol Biol Evol. PubMed Abstract | Publisher Full Text 2007; 24(8): 1586–1591. Stamatakis A: RAxML version 8: a tool for phylogenetic analysis and post- PubMed Abstract | Publisher Full Text analysis of large phylogenies. Bioinformatics. 2014; 30(9): 1312–1313. Yang Z, Wong WS, Nielsen R: Bayes empirical Bayes inference of amino acid PubMed Abstract | Publisher Full Text | Free Full Text sites under positive selection. Mol Biol Evol. 2005; 22(4): 1107–1118. Steiner A, Stucki D, Coscolla M, et al.: KvarQ: targeted and direct variant PubMed Abstract | Publisher Full Text calling from fastq reads of bacterial genomes. BMC Genomics. 2014; 15(1): Yuliwulandari R, Sachrowardi Q, Nakajima H, et al.: Association of HLA-A, -B, 881. and -DRB1 with pulmonary tuberculosis in western Javanese Indonesia. PubMed Abstract | Publisher Full Text | Free Full Text Hum Immunol. 2010; 71(7): 697–701. Sutiwisesak R, Hicks ND, Boyce S, et al.: A natural polymorphism of PubMed Abstract | Publisher Full Text Mycobacterium tuberculosis in the esxH gene disrupts immunodomination Zhang J, Nielsen R, Yang Z: Evaluation of an improved branch-site likelihood by the TB10.4-specific CD8 T cell response. PLoS Pathog. 2020; 16(10): method for detecting positive selection at the molecular level. Mol Biol Evol. e1009000. 2005; 22(12): 2472–2479. PubMed Abstract | Publisher Full Text | Free Full Text PubMed Abstract | Publisher Full Text Page 16 of 24 F1000Research 2021, 10:60 Last updated: 31 MAR 2021 Open Peer Review Current Peer Review Status: Version 1 Reviewer Report 15 March 2021 https://doi.org/10.5256/f1000research.31323.r79710 © 2021 Brynildsrud O. This is an open access peer review report distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Ola B. Brynildsrud Division of Infectious Disease Control and Environmental Health, Norwegian Institute of Public Health, Oslo, Norway In the paper "Local adaptation in populations of Mycobacterium tuberculosis endemic to the Indian Ocean Rim", Menardo and colleagues explore diversity, phylogeographic relationships and patterns of adaptation in TB lineages 1 and 3, both of which are distributed mainly around the Indian Ocean. The paper demonstrates a strongly asymmetric trend of South Asian spread to other regions. Further, it describes a search for positive selection throughout these genomes, and identify one gene in particular, exsH, as being a target for T-cell dependent selection. The authors additionally calibrates molecular clocks for both L1 and L3 and use this to date certain events in the tree. The methodology is solid and well suited to answer the questions posed. I found the paper to be well written and helpful in shedding light on important questions related to the evolution of L1 and L3. I have only a few minor discussion points. Now, the clock analyses are not heavily emphasized, but I thought it might be a potential flaw that the calibration rests on the assumption that the foundation of a single sub-lineage in West Africa could not have happened prior to the establishment of Portuguese sailing routes around the tip of Africa. While I'm certainly not an expert on historic migrations or anthropology, I would assume there must be many possible alternative scenarios. (For example, pilgrimages across the Sahara are known to have happened much prior to this). As far as I can tell, this calibration is the major driving force for the difference in L1 TMRCA from O'Neill. It would be interesting to know how the degree to which this calibration affects the TMRCA and clock rate. On a somewhat related note, this article fails to cite some key references on the phylogeography of TB. In one section the authors evaluate association between esxH haplotypes and the geographic region of origin. They calculate this association using a home-made statistical test based on phyC. I could not gather from the description exactly what they measure. It is a simulation-based test that measures "how many strains from each region resulted to have a derived state according to   Page 17 of 24 F1000Research 2021, 10:60 Last updated: 31 MAR 2021 the simulation". Is there a test statistic being ranked here? Is it single-region vs the rest or is the statistic a composite measure of multiple regions? How is the simulation for the null distribution set up? E.g. is it a forward-in-time simulation using a specified mutation rate or a permutation or something else? I see that the code for running this test is posted on github - However, I still think a few more details in the actual paper is in order for the biologists that don't read R. Speaking of this test, It is a bit unconvential for Table 1 to highlight all results with P<0.1, especially since a large number of associations are being tested here. I would not consider p<0.1 to be particularly strong evidence of association. I'd like to commend the authors for making so much of their data publicly available. Datasets are in immediately-usable formats which makes it simple to download all the raw data, and as mentioned code is also readily available.  Very minor: ○ HLA abbreviation never explained in manuscript.   ○ Tip date randomisation performed, but the results of this test is never stated.   ○ Why is PRJNA630228 posted? As far as I can tell it doesnt contain any SRA data, and the other accession contains data for all 767 strains. Is the work clearly and accurately presented and does it cite the current literature? Yes Is the study design appropriate and is the work technically sound? Yes Are sufficient details of methods and analysis provided to allow replication by others? Yes If applicable, is the statistical analysis and its interpretation appropriate? Yes Are all the source data underlying the results available to ensure full reproducibility? Yes Are the conclusions drawn adequately supported by the results? Yes Competing Interests: No competing interests were disclosed. Reviewer Expertise: Bioinformatics, microbial genomics I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard. Author Response 19 Mar 2021   Page 18 of 24 F1000Research 2021, 10:60 Last updated: 31 MAR 2021 Fabrizio Menardo, Swiss Tropical and Public Health Institute, Basel, Switzerland Thank you very much for your comments, and for reviewing our manuscript. We address all your points below. We prepared a revised version of the manuscript based on the feedback of all reviewers. Regarding the clock analysis of L1: We performed a tip dating analysis but found no temporal signal for L1. Therefore we used the clock rate estimate resulted from a previous study (Menardo et al. 2019) where we analyzed a different L1 dataset and found a good temporal signal and a clock rate of ~1.4x10-7. So far Menardo et al. 2019 is the only published study that estimated the clock rate for a L1 dataset. In a second step we tested whether this clock rate and two alternative values, 5x10-8 and 1.2x10-7,, (the lowest one was used by O’Neill et al. 2019) were compatible with the hypothesis of a West African introduction of L1.1.1 posterior to the Portuguese circumnavigation of Africa. We found that only the largest clock rate was compatible with this hypothesis. We are quite explicit in the manuscript about the limitation of this analysis, and how it should be interpreted in comparison to the results of O’Neill et al. 2019: “This discrepancy is due to different assumptions about the clock rates of L1, but none of the two hypotheses can be excluded with the available data”. Finally, it is possible that the West African L1.1.1 clade did not originate through a direct migration from South East Asia. However, a series of short range migrations seems less likely, as we did not find strains belonging to L1.1.1 in South and Central Asia, or in East Africa. As we write in the manuscript, the dating analyses for MTB in general, and for this L1 dataset in particular, have limitations. Here we discussed what we consider the most likely scenario given the data, we did not exclude alternative scenarios. We added a section in the discussion comparing our findings with the biogeography of other MTBC lineages, and referencing the relevant studies. We revised and expanded the Methods section regarding the phylogenetic test of association to provide more details. To answer the specific questions: the statistic ranked is the number of tips with a derived state; the test is performed for each region individually; the simulation does not use a mutation rate, but it distributes randomly a fixed number of mutations on the tree (the same number inferred from the observed data). We also removed special fonts from Table 1, now all p-values have standard formatting. Minor points: We now write HLA in full at the first occurrence.   Page 19 of 24 F1000Research 2021, 10:60 Last updated: 31 MAR 2021 The results of the DRT are reported in the supplementary Information (now amended to enhance clarity), in Sup. Figs 8 and 9 and in Sup. Table 3. In the main text we refer more generally to the lack of temporal signal (or to a good temporal signal), as this reflects the results of the DRT, but also of the Beast analyses. We now refer only to PRJNA670836, which contains data for all the newly sequenced strains. Thanks. Competing Interests: No competing interests were disclosed. Reviewer Report 01 March 2021 https://doi.org/10.5256/f1000research.31323.r79237 © 2021 Viveiros M et al. This is an open access peer review report distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Miguel Viveiros Global Health and Tropical Medicine, GHTM, Institute of Hygiene and Tropical Medicine, IHMT, NOVA University Lisbon, Lisbon, Portugal João Perdigão Research Institute for Medicines (iMed.ULisboa), Faculty of Pharmacy, University of Lisbon, Lisbon, Portugal This study depicts a very interesting and relevant account on the evolutionary history and phylogeography of Mycobacterium tuberculosis of Lineages 1 and 3. Historically, and while forming an important part of the M. tuberculosis species diversity, the study of both lineages have been neglected in comparison to Lineages 2 and 4. The authors of this study assembled a large genomic dataset encompassing 2938 and 2030 L1 and L3 isolates, respectively, and by implementing distinct molecular evolutionary algorithms were able to put forward a scenario for the dissemination of these strains. A South Asian origin is proposed for both clades from where these strains appeared to have radiated via migration-driven dispersion towards East Africa (L3) and other regions (L1). The weakest aspect of the work is the lack of phylogeographic and historical context of L1 and L3 versus L2 and L4 dissemination, especially L4 in South-America and L2 in Asia in the context of the migratory waves. The authors used PASTML and Mascot to conduct the phylogeographic analyses, revealing the migratory pattern of these lineages from South Asia revealing the origins of these two lineages. Several genes involved in virulence and drug resistance were shown to be under positive selection in L1 and L3 strains with important adaptative consequences. Importantly, esxH derived haplotypes were not randomly distributed through the geographic range of L1 and the finding of positive selection acting on vaccine candidates, such as this, is important and deserves attention.   Page 20 of 24 F1000Research 2021, 10:60 Last updated: 31 MAR 2021 The manuscript is very well written, easy to read and understand and the methods are robust and well applied and described. Important to note the absence of a few important references to other groups work on the third wave migration of migrations in the 16th century that impacted the dissemination of L1 in Africa and other regions. Some comments that can be useful for discussion: The analysis pertaining the introduction of L1 to South America is based on a more modest number of isolates (n=77), all but one from Brazil. This limits the robusteness of the conclusions for this region. It is possible that the evolutionary history of L1 in South America is far more complex than the one outlined in the article and there may be missing links. For example, EAI (L1) strains have been detected in several South and Central American countries (Mexico, Peru, Panama, Colombia, etc.) and besides slave trade, the Spain-operated Manila-Acapulco Galleon trade route (est. ca. 1568) connected Southeast Asia to Central and South America. Regarding the L1.1.1 sub-lineage, what precludes these strains to have been introduced to West Africa in a more recent period other than the XVI or the XVII century? As far as we can tell, this pertains a group of 10 isolates scattered over five countries, but what is the degree of divergence between these isolates? Eventually, can this also be linked with the indentured labor promoted by the British Empire in the XIX century? On the epitope analysis, upon downloading epitopes from IEDB did the authors include only epitopes with positive assays and was there a threshold for including only epitopes validated by independent assays (more reliable)? Also, how did overlapping epitopes been handled? Is the work clearly and accurately presented and does it cite the current literature? Yes Is the study design appropriate and is the work technically sound? Yes Are sufficient details of methods and analysis provided to allow replication by others? Yes If applicable, is the statistical analysis and its interpretation appropriate? Yes Are all the source data underlying the results available to ensure full reproducibility? Yes Are the conclusions drawn adequately supported by the results? Yes Competing Interests: No competing interests were disclosed. Reviewer Expertise: Mycobacteriology, Tuberculosis, Laboratory Diagnosis and Molecular Epidemiology.   Page 21 of 24 F1000Research 2021, 10:60 Last updated: 31 MAR 2021 We confirm that we have read this submission and believe that we have an appropriate level of expertise to confirm that it is of an acceptable scientific standard. Author Response 19 Mar 2021 Fabrizio Menardo, Swiss Tropical and Public Health Institute, Basel, Switzerland Thank you very much for reviewing and commenting our manuscript. We prepared a revised version of the manuscript based on the feedback of all reviewers. Regarding the comparison with other lineages: in this manuscript we focused on L1 and L3 because they are understudied. We now revised the text and added a section comparing our findings with the biogeography of other MTBC lineages, and referencing the relevant studies. Additionally, it is true that the evolutionary history of L1 in South America could be more complex, and we might fail to capture it because we have samples from a single country (Brazil). We now explicitly write this. Regarding the L1.1.1 sub-lineage: in the manuscript we write that the 16th century is the earliest possible time for the introduction of L1 in West Africa (assuming that the migration could not have happened before the circumnavigation of Africa by Portuguese), we do not exclude the possibility of a later introduction. However, this group of 11 L1 strains comprising the West African clade has a relatively old MRCA. To be compatible with an introduction in the 19th century, the molecular clock rate would need to be about 3x10-7 (this is a back of the envelope calculation, and not the result of a proper analysis). This would be a fast rate for MTB (Menardo et al. 2019), and we find this hypothesis unlikely. Nevertheless, because of the lack of a temporal signal we cannot exclude a more recent introduction of L1 in West Africa. Finally, in the epitope analysis we downloaded all T cell epitopes, to be as inclusive as possible. Overlapping epitopes were analyzed independently. We added this information to the manuscript. Competing Interests: No competing interests were disclosed. Reviewer Report 17 February 2021 https://doi.org/10.5256/f1000research.31323.r78953 © 2021 Honap T. This is an open access peer review report distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Tanvi Honap University of Oklahoma, Norman, OK, USA The aim of this study was to conduct a phylogeographic analysis of Mycobacterium tuberculosis   Page 22 of 24 F1000Research 2021, 10:60 Last updated: 31 MAR 2021 complex (MTBC) strains belonging to Lineages 1 and 3, which are the most prevalent lineages in South Asian countries with a high TB burden, such as India. The authors compiled an excellent dataset of nearly 5,000 MTBC genomes from these two lineages, comprising publicly available as well as newly-sequenced MTBC strains. Their phylogenomic analyses showed that L1 comprises a further five distinct sublineages, whereas L3 does not contain distinct sublineages. The authors used two different methods (PASTML and Mascot) for conducting phylogeographic analyses, with results from both methods suggesting an interesting asymmetric pattern of migration: these lineages originated in and migrated from South Asia to other regions, with very little back- migration into South Asia. This underscores the important role South Asia has played in the dispersal of these lineages. The authors found several genes involved in virulence and antibiotic resistance to be under positive selection in L1 and L3 strains. Lastly, the authors also found evidence of local adaptation at the esxH epitope in L1 strains from South and Southeast Asia, which requires further study as this epitope is a part of several vaccine candidates.  Overall, I found this to be a well-designed and thorough study, and a very well-written manuscript. I have no major concerns and only a few minor suggestions for aesthetic improvement: ○ The Methods section titled "Strain cultures, DNA extraction and genome sequencing" is missing citations for the methods used (eg. CTAB extraction method).   ○ Figures 1A and 2A heatmaps would look better in color than grayscale.    ○ The Supplementary Information would benefit from a thorough proofreading for grammatical and typological errors.  Is the work clearly and accurately presented and does it cite the current literature? Yes Is the study design appropriate and is the work technically sound? Yes Are sufficient details of methods and analysis provided to allow replication by others? Yes If applicable, is the statistical analysis and its interpretation appropriate? Yes Are all the source data underlying the results available to ensure full reproducibility? Yes Are the conclusions drawn adequately supported by the results? Yes Competing Interests: No competing interests were disclosed. Reviewer Expertise: Pathogen genomics I confirm that I have read this submission and believe that I have an appropriate level of   Page 23 of 24 F1000Research 2021, 10:60 Last updated: 31 MAR 2021 expertise to confirm that it is of an acceptable scientific standard. Author Response 19 Mar 2021 Fabrizio Menardo, Swiss Tropical and Public Health Institute, Basel, Switzerland Thank you very much for your comments, and for reviewing our manuscript. We prepared a revised version of the manuscript based on the feedback of all reviewers. We added the missing reference for the DNA extraction method, and we revised the supplementary material for errors and typos, thanks. Regarding Figures 1A and 2A: in both figures the other three panels have the same color code, where different colors correspond to different world regions. We decided to use a gray scale to avoid potential confusion of having two different color codes in the same figure. We did not modify the figures. Competing Interests: No competing interests were disclosed. The benefits of publishing with F1000Research: • Your article is published within days, with no editorial bias • You can publish traditional articles, null/negative results, case reports, data notes and more • The peer review process is transparent and collaborative • Your article is indexed in PubMed after passing peer review • Dedicated customer support at every stage For pre-submission enquiries, contact research@f1000.com   Page 24 of 24