Wellcome Open Research 2021, 6:42 Last updated: 16 NOV 2021 RESEARCH ARTICLE    An open dataset of Plasmodium falciparum genome variation in 7,000 worldwide samples [version 2; peer review: 2 approved] MalariaGEN, Ambroise Ahouidi1, Mozam Ali2, Jacob Almagro-Garcia2,3, Alfred Amambua-Ngwa2,4, Chanaki Amaratunga5, Roberto Amato2,3, Lucas Amenga-Etego 6,7, Ben Andagalu8, Tim J. C. Anderson 9, Voahangy Andrianaranjaka10, Tobias Apinjoh11, Cristina Ariani2, Elizabeth A. Ashley 12, Sarah Auburn13,14, Gordon A. Awandare7,15, Hampate Ba 16, Vito Baraka 17,18, Alyssa E. Barry19-21, Philip Bejon 22, Gwladys I. Bertin 23, Maciej F. Boni14,24, Steffen Borrmann25, Teun Bousema 26,27, Oralee Branch28, Peter C. Bull22,29, George B. J. Busby3, Thanat Chookajorn 30, Kesinee Chotivanich30, Antoine Claessens 4,31, David Conway 26, Alister Craig 32,33, Umberto D'Alessandro 4, Souleymane Dama34, Nicholas PJ Day 12, Brigitte Denis33, Mahamadou Diakite 34, Abdoulaye Djimdé 34, Christiane Dolecek14, Arjen M Dondorp 12, Chris Drakeley 26, Eleanor Drury2, Patrick Duffy5, Diego F. Echeverry35,36, Thomas G. Egwang37, Berhanu Erko38, Rick M. Fairhurst39, Abdul Faiz 40, Caterina A. Fanello 12, Mark M. Fukuda41, Dionicia Gamboa 42, Anita Ghansah43, Lemu Golassa 38, Sonia Goncalves2, William L. Hamilton 2,44, G. L. Abby Harrison21, Lee Hart 3, Christa Henrichs3, Tran Tinh Hien 24,45, Catherine A. Hill46, Abraham Hodgson47, Christina Hubbart 48, Mallika Imwong30, Deus S. Ishengoma 17,49, Scott A. Jackson 50, Chris G. Jacob2, Ben Jeffery3, Anna E. Jeffreys 48, Kimberly J. Johnson 3, Dushyanth Jyothi 2, Claire Kamaliddin 23, Edwin Kamau 51, Mihir Kekre2, Krzysztof Kluczynski3, Theerarat Kochakarn2,30, Abibatou Konaté52, Dominic P. Kwiatkowski 2,3,48, Myat Phone Kyaw53,54, Pharath Lim5,55, Chanthap Lon41, Kovana M. 
Loua 56, Oumou Maïga-Ascofaré34,57,58, Cinzia Malangone 2, Magnus Manske2, Jutta Marfurt13, Kevin Marsh 14,59, Mayfong Mayxay 60,61, Alistair Miles2,3, Olivo Miotto 2,3,12, Victor Mobegi 62, Olugbenga A. Mokuolu63, Jacqui Montgomery64, Ivo Mueller21,65, Paul N. Newton66, Thuy Nguyen2, Thuy-Nhien Nguyen24, Harald Noedl67, François Nosten 14,68, Rintis Noviyanti69, Alexis Nzila70, Lynette I. Ochola-Oyier22, Harold Ocholla71,72, Abraham Oduro 6, Irene Omedo 22, Marie A. Onyamboko73, Jean-Bosco Ouedraogo 74, Kolapo Oyebola 75,76, Richard D. Pearson 2,3, Norbert Peshu22, Aung Pyae Phyo 12,68, Chris V. Plowe77, Ric N. Price 12,13,45, Sasithon Pukrittayakamee30, Milijaona Randrianarivelojosia78,79, Julian C. Rayner 2, Pascal Ringwald80, Kirk A. Rockett 2,48, Katherine Rowlands48, Lastenia Ruiz81, David Saunders41, Alex Shayo 82, Peter Siba83, Victoria J. Simpson3, Jim Stalker2, Xin-zhuan Su 5, Colin Sutherland26, Shannon Takala-Harrison84, Livingstone Tavul83, Vandana Thathy22,85, Antoinette Tshefu86, Federica Verra87, Joseph Vinetz42,88, Thomas E. Wellems 5, Jason Wendler48, Nicholas J.
White12, Ian Wright 3, William Yavo52,89, Htut Ye90 1Hopital Le Dantec, Universite Cheikh Anta Diop, Dakar, Senegal 2Wellcome Sanger Institute, Hinxton, UK 3MRC Centre for Genomics and Global Health, Big Data Institute, University of Oxford, Oxford, UK 4Medical Research Council Unit The Gambia, at the London School of Hygiene and Tropical Medicine, Banjul, The Gambia 5National Institute of Allergy and Infectious Diseases (NIAID), NIH, Bethesda, USA 6Navrongo Health Research Centre, Ghana Health Service, Navrongo, Ghana 7West African Centre for Cell Biology of Infectious Pathogens (WACCBIP), University of Ghana, Accra, Ghana 8United States Army Medical Research Directorate-Africa, Kenya Medical Research Institute/Walter Reed Project, Kisumu, Kenya 9Texas Biomedical Research Institute, San Antonio, USA 10Université d'Antananarivo, Antananarivo, Madagascar 11University of Buea, Buea, Cameroon 12Mahidol-Oxford Tropical Medicine Research Unit (MORU), Bangkok, Thailand 13Menzies School of Health Research, Darwin, Australia 14Nuffield Department of Medicine, University of Oxford, Oxford, UK 15University of Ghana, Legon, Ghana 16Institut National de Recherche en Santé Publique, Nouakchott, Mauritania 17National Institute for Medical Research (NIMR), Dar es Salaam, Tanzania 18Department of Epidemiology, International Health Unit, University of Antwerp, Antwerp, Belgium 19Deakin University, Geelong, Australia 20Burnet Institute, Melbourne, Australia 21Walter and Eliza Hall Institute, Melbourne, Australia 22KEMRI Wellcome Trust Research Programme, Kilifi, Kenya 23Institute of Research for Development (IRD), Paris, France 24Oxford University Clinical Research Unit (OUCRU), Ho Chi Minh City, Vietnam 25Institute for Tropical Medicine, University of Tübingen, Tübingen, Germany 26London School of Hygiene and Tropical Medicine, London, UK 27Radboud University Medical Center, Nijmegen, The Netherlands 28NYU School of Medicine Langone Medical Center, New York, USA 29Department of 
Pathology, University of Cambridge, Cambridge, UK 30Mahidol University, Bangkok, Thailand 31LPHI, MIVEGEC, INSERM, CNRS, IRD, University of Montpellier, Montpellier, France 32Liverpool School of Tropical Medicine, Liverpool, UK 33Malawi-Liverpool-Wellcome Trust Clinical Research, Blantyre, Malawi 34Malaria Research and Training Centre, University of Science, Techniques and Technologies of Bamako, Bamako, Mali 35Centro Internacional de Entrenamiento e Investigaciones Médicas - CIDEIM, Cali, Colombia 36Universidad Icesi, Cali, Colombia 37Biotech Laboratories, Kampala, Uganda 38Aklilu Lemma Institute of Pathobiology, Addis Ababa University, Addis Ababa, Ethiopia 39National Institutes of Health (NIH), Bethesda, USA 40Dev Care Foundation, Dhaka, Bangladesh 41Department of Immunology and Medicine, US Army Medical Component, Armed Forces Research Institute of Medical Sciences (USAMC-AFRIMS), Bangkok, Thailand 42Laboratorio ICEMR-Amazonia, Laboratorios de Investigacion y Desarrollo, Facultad de Ciencias y Filosofia, Universidad Peruana Cayetano Heredia, Lima, Peru 43Nogouchi Memorial Institute for Medical Research, Legon-Accra, Ghana 44Cambridge University Hospitals NHS Foundation Trust, Cambridge, UK 45Centre for Tropical Medicine and Global Health, University of Oxford, Oxford, UK 46Department of Entomology, Purdue University, West Lafayette, USA 47Ghana Health Service, Ministry of Health, Accra, Ghana 48Wellcome Centre for Human Genetics, University of Oxford, Oxford, UK 49East African Consortium for Clinical Research (EACCR), Dar es Salaam, Tanzania 50Center for Applied Genetic Technologies, University of Georgia, Athens, GA, USA 51Walter Reed Army Institute of Research, U.S.
Military HIV Research Program, Silver Spring, MD, USA 52University Félix Houphouët-Boigny, Abidjan, Cote d'Ivoire 53The Myanmar Oxford Clinical Research Unit, University of Oxford, Yangon, Myanmar 54University of Public Health, Yangon, Myanmar 55Medical Care Development International, Maryland, USA 56Institut National de Santé Publique, Conakry, Guinea 57Bernhard Nocht Institute for Tropical Medicine, Hamburg, Germany 58Research in Tropical Medicine, Kwame Nkrumah University of Sciences and Technology, Kumasi, Ghana 59African Academy of Sciences, Nairobi, Kenya 60Lao-Oxford-Mahosot Hospital-Wellcome Trust Research Unit (LOMWRU), Vientiane, Lao People's Democratic Republic 61Institute of Research and Education Development (IRED), University of Health Sciences, Ministry of Health, Vientiane, Lao People's Democratic Republic 62School of Medicine, University of Nairobi, Nairobi, Kenya 63Department of Paediatrics and Child Health, University of Ilorin, Ilorin, Nigeria 64Institute of Vector-Borne Disease, Monash University, Clayton, Victoria, 3800, Australia 65Barcelona Centre for International Health Research, Barcelona, Spain 66Wellcome Trust-Mahosot Hospital-Oxford Tropical Medicine Research Collaboration, Vientiane, Lao People's Democratic Republic 67MARIB - Malaria Research Initiative Bandarban, Bandarban, Bangladesh 68Shoklo Malaria Research Unit, Bangkok, Thailand 69Eijkman Institute for Molecular Biology, Jakarta, Indonesia 70King Fahid University of Petroleum and Minerals (KFUMP), Dharhran, Saudi Arabia 71KEMRI - Centres for Disease Control and Prevention (CDC) Research Program, Kisumu, Kenya 72Centre for Bioinformatics and Biotechnology, University of Nairobi, Nairobi, Kenya 73Kinshasa School of Public Health, University of Kinshasa, Kinshasa, Congo, Democratic Republic 74Institut de Recherche en Sciences de la Santé, Ouagadougou, Burkina Faso 75Nigerian Institute of Medical Research, Lagos, Nigeria 76Parasitology and Bioinformatics Unit, Faculty of Science, 
University of Lagos, Lagos, Nigeria 77School of Medicine, University of Maryland, Baltimore, MD, USA 78Institut Pasteur de Madagascar, Antananarivo, Madagascar 79Universités d'Antananarivo et de Mahajanga, Antananarivo, Madagascar 80World Health Organization (WHO), Geneva, Switzerland 81Universidad Nacional de la Amazonia Peruana, Iquitos, Peru 82Nelson Mandela Institute of Science and Technology, Arusha, Tanzania 83Papua New Guinea Institute of Medical Research, Goroka, Papua New Guinea 84Center for Vaccine Development and Global Health, University of Maryland, School of Medicine, Baltimore, MD, USA 85Department of Microbiology and Immunology, Columbia University Irving Medical Center, New York, New York, USA 86University of Kinshasa, Kinshasa, Congo, Democratic Republic 87Sapienza University of Rome, Rome, Italy 88Yale School of Medicine, New Haven, CT, USA 89Malaria Research and Control Center of the National Institute of Public Health, Abidjan, Cote d'Ivoire 90Department of Medical Research, Yangon, Myanmar

First published: 24 Feb 2021, 6:42 https://doi.org/10.12688/wellcomeopenres.16168.1
Latest published: 13 Jul 2021, 6:42 https://doi.org/10.12688/wellcomeopenres.16168.2

Abstract
MalariaGEN is a data-sharing network that enables groups around the world to work together on the genomic epidemiology of malaria. Here we describe a new release of curated genome variation data on 7,000 Plasmodium falciparum samples from MalariaGEN partner studies in 28 malaria-endemic countries. High-quality genotype calls on 3 million single nucleotide polymorphisms (SNPs) and short indels were produced using a standardised analysis pipeline. Copy number variants associated with drug resistance and structural variants that cause failure of rapid diagnostic tests were also analysed. Almost all samples showed genetic evidence of resistance to at least one antimalarial drug, and some samples from Southeast Asia carried markers of resistance to six commonly-used drugs. Genes expressed during the mosquito stage of the parasite life-cycle are prominent among loci that show strong geographic differentiation. By continuing to enlarge this open data resource we aim to facilitate research into the evolutionary processes affecting malaria control and to accelerate development of the surveillance toolkit required for malaria elimination.

Keywords
malaria, plasmodium falciparum, genomics, genomic epidemiology, evolution, data resource, population genetics, drug resistance, rapid diagnostic test failure

Open Peer Review
Reviewer Status: Invited Reviewers 1 and 2. Version 2 (revision), 13 Jul 2021. Version 1, 24 Feb 2021: report (reviewer 1), report (reviewer 2).
1. Maria Isabel Veiga, University of Minho, Braga, Portugal; Nuno S. Osório, University of Minho, Braga, Portugal
2. Didier Menard, Institut Pasteur, Paris, France
Any reports and responses or comments on the article can be found at the end of the article.

Corresponding author: MalariaGEN (support@malariagen.net)

Author roles: Ahouidi A: Investigation, Resources, Writing – Review & Editing; Ali M: Investigation, Writing – Review & Editing; Almagro-Garcia J: Formal Analysis, Investigation, Writing – Review & Editing; Amambua-Ngwa A: Investigation, Resources, Writing – Review & Editing; Amaratunga C: Investigation, Resources, Writing – Review & Editing; Amato R: Conceptualization, Data Curation, Formal Analysis, Investigation, Project Administration, Software, Supervision, Writing – Original Draft Preparation, Writing – Review & Editing; Amenga-Etego L: Investigation, Resources, Writing – Review & Editing; Andagalu B: Investigation, Resources, Writing – Review & Editing; Anderson TJC: Investigation, Resources, Writing – Review & Editing; Andrianaranjaka V: Investigation, Resources, Writing – Review & Editing; Apinjoh T: Investigation, Resources, Writing – Review & Editing; Ariani C: Investigation, Writing – Review & Editing; Ashley EA: Investigation,
Resources, Writing – Review & Editing; Auburn S: Investigation, Resources, Writing – Review & Editing; Awandare GA: Investigation, Resources, Writing – Review & Editing; Ba H: Investigation, Resources, Writing – Review & Editing; Baraka V: Investigation, Resources, Writing – Review & Editing; Barry AE: Investigation, Resources, Writing – Review & Editing; Bejon P: Investigation, Resources, Writing – Review & Editing; Bertin GI: Investigation, Resources, Writing – Review & Editing; Boni MF: Investigation, Resources, Writing – Review & Editing; Borrmann S: Investigation, Resources, Writing – Review & Editing; Bousema T: Investigation, Resources, Writing – Review & Editing; Branch O: Investigation, Resources, Writing – Review & Editing; Bull PC: Investigation, Resources, Writing – Review & Editing; Busby GBJ: Investigation, Software, Writing – Review & Editing; Chookajorn T: Formal Analysis, Investigation, Writing – Review & Editing; Chotivanich K: Investigation, Resources, Writing – Review & Editing; Claessens A: Investigation, Resources, Writing – Review & Editing; Conway D: Investigation, Resources, Writing – Review & Editing; Craig A: Investigation, Resources, Writing – Review & Editing; D'Alessandro U: Investigation, Resources, Writing – Review & Editing; Dama S: Investigation, Resources, Writing – Review & Editing; Day NP: Investigation, Resources, Writing – Review & Editing; Denis B: Investigation, Resources, Writing – Review & Editing; Diakite M: Investigation, Resources, Writing – Review & Editing; Djimdé A: Investigation, Resources, Writing – Review & Editing; Dolecek C: Investigation, Resources, Writing – Review & Editing; Dondorp AM: Investigation, Resources, Writing – Review & Editing; Drakeley C: Investigation, Resources, Writing – Review & Editing; Drury E: Investigation, Writing – Review & Editing; Duffy P: Investigation, Resources, Writing – Review & Editing; Echeverry DF:
Investigation, Resources, Writing – Review & Editing; Egwang TG: Investigation, Resources, Writing – Review & Editing; Erko B: Investigation, Resources, Writing – Review & Editing; Fairhurst RM: Investigation, Resources, Writing – Review & Editing; Faiz A: Investigation, Resources, Writing – Review & Editing; Fanello CA: Investigation, Resources, Writing – Review & Editing; Fukuda MM: Investigation, Resources, Writing – Review & Editing; Gamboa D: Investigation, Resources, Writing – Review & Editing; Ghansah A: Investigation, Resources, Writing – Review & Editing; Golassa L: Investigation, Resources, Writing – Review & Editing; Goncalves S: Investigation, Project Administration, Writing – Review & Editing; Hamilton WL: Formal Analysis, Investigation, Writing – Original Draft Preparation, Writing – Review & Editing; Harrison GLA: Investigation, Resources, Writing – Review & Editing; Hart L: Investigation, Software, Writing – Review & Editing; Henrichs C: Investigation, Project Administration, Writing – Review & Editing; Hien TT: Investigation, Resources, Writing – Review & Editing; Hill CA: Investigation, Resources, Writing – Review & Editing; Hodgson A: Investigation, Resources, Writing – Review & Editing; Hubbart C: Investigation, Writing – Review & Editing; Imwong M: Investigation, Resources, Writing – Review & Editing; Ishengoma DS: Investigation, Resources, Writing – Review & Editing; Jackson SA: Investigation, Resources, Writing – Review & Editing; Jacob CG: Investigation, Writing – Review & Editing; Jeffery B: Investigation, Software, Writing – Review & Editing; Jeffreys AE: Investigation, Writing – Review & Editing; Johnson KJ: Investigation, Project Administration, Writing – Review & Editing; Jyothi D: Data Curation, Investigation, Software, Writing – Review & Editing; Kamaliddin C: Investigation, Resources, Writing – Review & Editing; Kamau E: Investigation, Resources, Writing – Review & Editing; Kekre M: Investigation, Writing – Review & Editing; 
Kluczynski K: Investigation, Software, Writing – Review & Editing; Kochakarn T: Formal Analysis, Investigation, Writing – Review & Editing; Konaté A: Investigation, Resources, Writing – Review & Editing; Kwiatkowski DP: Conceptualization, Formal Analysis, Funding Acquisition, Investigation, Project Administration, Supervision, Writing – Original Draft Preparation, Writing – Review & Editing; Kyaw MP: Investigation, Resources, Writing – Review & Editing; Lim P: Investigation, Resources, Writing – Review & Editing; Lon C: Investigation, Resources, Writing – Review & Editing; Loua KM: Investigation, Resources, Writing – Review & Editing; Maïga-Ascofaré O: Investigation, Resources, Writing – Review & Editing; Malangone C: Data Curation, Investigation, Software, Writing – Review & Editing; Manske M: Investigation, Software, Writing – Review & Editing; Marfurt J: Investigation, Resources, Writing – Review & Editing; Marsh K: Investigation, Resources, Writing – Review & Editing; Mayxay M: Investigation, Resources, Writing – Review & Editing; Miles A: Investigation, Software, Writing – Review & Editing; Miotto O: Data Curation, Formal Analysis, Investigation, Project Administration, Software, Writing – Review & Editing; Mobegi V: Investigation, Resources, Writing – Review & Editing; Mokuolu OA: Investigation, Resources, Writing – Review & Editing; Montgomery J: Investigation, Resources, Writing – Review & Editing; Mueller I: Investigation, Resources, Writing – Review & Editing; Newton PN: Investigation, Resources, Writing – Review & Editing; Nguyen T: Data Curation, Investigation, Software, Writing – Review & Editing; Nguyen TN: Investigation, Resources, Writing – Review & Editing; Noedl H: Investigation, Resources, Writing – Review & Editing; Nosten F: Investigation, Resources, Writing – Review & Editing; Noviyanti R: Investigation, Resources, Writing – Review & Editing; Nzila A: Investigation, Resources, Writing – Review & Editing; Ochola-Oyier LI: Investigation, 
Resources, Writing – Review & Editing; Ocholla H: Investigation, Resources, Writing – Review & Editing; Oduro A: Investigation, Resources, Writing – Review & Editing; Omedo I: Investigation, Resources, Writing – Review & Editing; Onyamboko MA: Investigation, Resources, Writing – Review & Editing; Ouedraogo JB: Investigation, Resources, Writing – Review & Editing; Oyebola K: Investigation, Resources, Writing – Review & Editing; Pearson RD: Conceptualization, Data Curation, Formal Analysis, Investigation, Project Administration, Software, Supervision, Writing – Original Draft Preparation, Writing – Review & Editing; Peshu N: Investigation, Resources, Writing – Review & Editing; Phyo AP: Investigation, Resources, Writing – Review & Editing; Plowe CV: Investigation, Resources, Writing – Review & Editing; Price RN: Investigation, Resources, Writing – Review & Editing; Pukrittayakamee S: Investigation, Resources, Writing – Review & Editing; Randrianarivelojosia M: Investigation, Resources, Writing – Review & Editing; Rayner JC: Investigation, Resources, Writing – Review & Editing; Ringwald P: Investigation, Resources, Writing – Review & Editing; Rockett KA: Investigation, Project Administration, Writing – Review & Editing; Rowlands K: Investigation, Writing – Review & Editing; Ruiz L: Investigation, Resources, Writing – Review & Editing; Saunders D: Investigation, Resources, Writing – Review & Editing; Shayo A: Investigation, Resources, Writing – Review & Editing; Siba P: Investigation, Resources, Writing – Review & Editing; Simpson VJ: Investigation, Project Administration, Writing – Review & Editing; Stalker J: Data Curation, Investigation, Software, Writing – Review & Editing; Su Xz: Investigation, Resources, Writing – Review & Editing; Sutherland C: Investigation, Resources, Writing – Review & Editing; Takala-Harrison S: Investigation, Resources, Writing – Review & Editing; Tavul L: Investigation, Resources, Writing – Review & Editing; Thathy V: Investigation, 
Resources, Writing – Review & Editing; Tshefu A: Investigation, Resources, Writing – Review & Editing; Verra F: Investigation, Resources, Writing – Review & Editing; Vinetz J: Investigation, Resources, Writing – Review & Editing; Wellems TE: Investigation, Resources, Writing – Review & Editing; Wendler J: Investigation, Resources, Writing – Review & Editing; White NJ: Investigation, Resources, Writing – Review & Editing; Wright I: Investigation, Software, Writing – Review & Editing; Yavo W: Investigation, Resources, Writing – Review & Editing; Ye H: Investigation, Resources, Writing – Review & Editing Competing interests: No competing interests were disclosed. Grant information: The sequencing, analysis, informatics and management of the Community Project are supported by Wellcome through Sanger Institute core funding (098051), a Strategic Award (090770/Z/09/Z) and the Wellcome Centre for Human Genetics core funding (203141/Z/16/Z), by the MRC Centre for Genomics and Global Health which is jointly funded by the Medical Research Council and the Department for International Development (DFID) (G0600718; M006212), and by the Bill & Melinda Gates Foundation (OPP1204628). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. Copyright: © 2021 MalariaGEN et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The author(s) is/are employees of the US Government and therefore domestic copyright protection in USA does not apply to this work. The work may be protected under the copyright laws of other jurisdictions when used in those jurisdictions. How to cite this article: MalariaGEN, Ahouidi A, Ali M et al.
An open dataset of Plasmodium falciparum genome variation in 7,000 worldwide samples [version 2; peer review: 2 approved] Wellcome Open Research 2021, 6:42 https://doi.org/10.12688/wellcomeopenres.16168.2 First published: 24 Feb 2021, 6:42 https://doi.org/10.12688/wellcomeopenres.16168.1

REVISED Amendments from Version 1
We are grateful to the reviewers for their suggestions and have updated the manuscript in response. We now include gene IDs every time a gene is mentioned for the first time in the manuscript. We have replaced “complex rearrangements” in the results section with an explicit description of the event. We have added a paragraph to detail that sample collection is heterogeneous and due care is needed when interpreting the results. No changes have been made to the data or figures.
Any further responses from the reviewers can be found at the end of the article.

Introduction
A major obstacle to malaria elimination is the great capacity of the parasite and vector populations to evolve in response to malaria control interventions. The widespread use of chloroquine and DDT in the 1950s led to high levels of drug and insecticide resistance, and the same pattern has been repeated for other first-line antimalarial drugs and insecticides. Over the past 15 years, mass distribution of pyrethroid-treated bednets in Africa and worldwide use of artemisinin combination therapy (ACT) have led to substantial reductions in malaria prevalence and mortality, but there are rapidly increasing levels of resistance to ACT in Southeast Asian parasites and of pyrethroid resistance in African mosquitoes. A deep understanding of local patterns of resistance and the continually changing nature of the local parasite and vector populations is necessary to manage the use of drugs and insecticides and to deploy public health resources for maximum sustainability and impact.

Current methods for genetic surveillance of the parasite population are largely based on targeted genotyping of specific loci, e.g. known markers of drug resistance. Whole genome sequencing of malaria parasites is currently more expensive and complex, particularly at the stage of data analysis, but it is an important adjunct to targeted genotyping, as it provides a more comprehensive picture of parasite genetic variation. It is particularly important for discovery of new drug resistance markers and for monitoring patterns of gene flow and evolutionary adaptation in the parasite population.

The Plasmodium falciparum Community Project (Pf Community Project) was established with the aim of integrating parasite genome sequencing into clinical and epidemiological studies of malaria (www.malariagen.net/projects). It forms part of the Malaria Genomic Epidemiology Network (MalariaGEN), a global data-sharing network comprising multiple partner studies, each with its own research objectives and led by a local investigator1. Genome sequencing was performed centrally, and partner studies were free to analyse and publish the genetic data produced on their own samples, in line with MalariaGEN’s guiding principles on equitable data sharing1–3. A programme of capacity building for research into parasite genetics was developed at multiple sites in Africa alongside the Pf Community Project4.

The first phase of the project focused on developing simple methods to obtain purified parasite genome DNA from small blood samples collected in the field5,6 and on establishing reliable computational methods for variant discovery and genotype calling from short-read sequencing data7. This presented a number of analytical challenges due to long tracts of highly repetitive sequence and hypervariable regions within the P. falciparum genome, and also because a single infection can contain a complex mixture of genotypes. Once a reliable analysis pipeline was in place, a process was established for periodic data releases to partners, with continual improvements in data quality as new analytical methods were developed.

Data from the Pf Community Project were initially released through a companion project called Pf3k, whose goal was to bring together leading analysts from multiple institutions to benchmark and standardise methods of variant discovery and genotype calling. A visual analytics web application was developed8 for researchers to explore the data. The open dataset was enlarged in 2016 when multiple partner studies contributed to a consortial publication on 3,488 samples from 23 countries9.

Data produced by the Pf Community Project have been used to address a broad range of research questions, both by the groups that generated samples and data and by the wider research community, and have generated over 50 previous publications (refs 5–55). These data have become a key resource for the epidemiology and population genetics of antimalarial drug resistance9–22 and an important platform for the discovery of new genetic markers and mechanisms of resistance through genome-wide association studies23–27 and combined genome-transcriptome analysis28. The data have also been used to study gene deletions that cause failure of rapid diagnostic tests29; to characterise genetic variation in malaria vaccine antigens30,31; to screen for new vaccine candidates32; to investigate specific host-parasite interactions33,34; and to describe the evolutionary adaptation and diversification of local parasite populations7,9,12,35–40.

The Pf Community Project data also provide an important resource for developing and testing new analytical and computational methods. A key area of methods development is quantification of within-host diversity7,41–46, estimation of inbreeding7,47, and deconvolution of mixed infections into individual strains48,49. The data have also been used to develop and test methods for estimating identity by descent50,51, imputation52, typing structural variants53, designing other SNP genotyping platforms54 and data visualisation8,55. In a companion study we performed whole genome sequencing of experimental genetic crosses of P. falciparum, and this provided a benchmark to test the accuracy of our genotyping methods, and to conduct an in-depth analysis of indels, structural variants and recombination events which are complicated to ascertain in these population genetic samples56.

Here we describe a new release of curated genome variation data on 7,113 samples of P. falciparum collected by 49 partner studies from 73 locations in Africa, Asia, South America and Oceania between 2002 and 2015 (Table 1, Supplementary Data; Supplementary Table 1 and 2).

Results
Variant discovery and genotyping
We used the Illumina platform to produce genome sequencing data on all samples and we mapped the sequence reads against the P. falciparum 3D7 v3 reference genome. The median depth of coverage was 73 sequence reads averaged across the whole genome and across all samples. We constructed an analysis pipeline for variant discovery and genotyping, including stringent quality control filters that took into account the unusual features of the P. falciparum genome, incorporating lessons learnt from our previous work7,56 and the Pf3k project, as outlined in the Methods section.

In the first stage of analysis we discovered variation at over six million positions, corresponding to about a quarter of the 23 Mb P. falciparum genome (Supplementary Data; Supplementary Table 3). These included 3,168,721 single nucleotide polymorphisms (SNPs): these were slightly more common in coding than non-coding regions and were mostly biallelic. The remaining 2,882,975 variants were predominantly short indels but also included more complex combinations of SNPs and indels: these were much more abundant in non-coding than coding regions, and mostly had at least three alleles. The predominance of indels in non-coding regions has been previously observed and is most likely a consequence of the extreme AT bias which leads to many short repetitive sequences56,57.

For the purpose of this analysis, we excluded all variants in subtelomeric and internal hypervariable regions, mitochondrial and apicoplast genomes, and some other regions of the genome where the mapping of short sequence reads is prone to a high error rate due to extremely high rates of variation56. A total of 1,838,733 SNPs (of which 1,626,886 were biallelic) and 1,276,027 indels (or SNP/indel combinations) passed all these filters. The pass rate for SNPs in coding regions (66%) was considerably higher than that for SNPs in non-coding regions (47%), indels in coding regions (37%) and indels in non-coding regions (47%). Finally, we removed samples with a low genotyping success rate or other quality control issues. We also removed replicates and 41 samples with genetic markers of infection by multiple Plasmodium species, leaving 5,970 high-quality samples from 28 countries (Table 1).

We used coverage and read pair analysis to determine duplication genotypes around mdr1 (PF3D7_0523000), plasmepsin2/3 (PF3D7_1408000 and PF3D7_1408100) and gch1 (PF3D7_1224000), each of which is associated with drug resistance. For each of these three genes we discovered many different sets of breakpoints (29, 10 and 3 pairs of breakpoints for mdr1, gch1, and plasmepsin 2/3, respectively), including a large and complex structural rearrangement involving a triplicated segment embedded within a duplication, in which the triplicated segment is inverted (“dup-trpinv-dup”)58 that to the best of our knowledge has not been observed before in Plasmodium species (Supplementary Data; Supplementary Note, Supplementary Tables 4–6). We also used sequence read coverage to identify large structural variants that appear to delete or disrupt hrp2 (PF3D7_0831800) and hrp3 (PF3D7_1372200), an event that can cause rapid diagnostic tests to malfunction.

The population genetic analyses in this paper are based on the filtered dataset of high-quality SNP genotypes in 5,970 samples. These data are openly available, together with annotated genotyping data on 6 million putative variants in all 7,113 samples, plus details of partner studies and sampling locations, at www.malariagen.net/resource/26.

Global population structure
The genetic structure of the global parasite population reflects its geographic regional structure7,9,10 as illustrated by a neighbour-joining tree and a principal component analysis of all samples based on their SNP genotypes (Figure 1). Based on these observations we grouped the samples into eight geographic regions: West Africa, Central Africa, East Africa, South Asia, the western part of Southeast Asia, the eastern part of Southeast Asia, Oceania and South America. Each of these can be viewed as a regional sub-population of parasites, which is more or less differentiated from other regional sub-populations depending on rates of gene flow and other factors. The different regions encompass a range of epidemiological and environmental settings, varying in transmission intensity, vector species and history of antimalarial drug usage. Note these regional classifications are intentionally broad, and therefore overlook many interesting aspects of local population structure, e.g. a distinctive Ethiopian sub-population can be identified by more detailed analysis of African samples12.

Genetically mixed infections were considerably more common in Africa than other regions, consistent with the high intensity of malaria transmission in Africa (Figure 2a). Analysis of FWS, a measure of within-host diversity7, shows that most samples from Southeast Asia (1763/2341), South America (37/37) and Oceania (158/201) have FWS >0.95, which to a first approximation indicates that the infection is dominated by a clonal population of parasites41. In contrast, nearly half of samples from Africa (1625/3314) have FWS <0.95, indicating the presence of more complex infections. Genetically mixed infections were also common in Bangladesh (41/77 samples have FWS <0.95), another area of high malaria transmission and the only South Asian country represented in this dataset, but did not reach the extremely high levels of within-host diversity (FWS <0.2) observed in some samples from Africa.

The average nucleotide diversity across the global sample collection was 0.040% (median=0.028%), i.e. two randomly-selected samples differ by an average of 4 nucleotide positions per 10kb. Levels of nucleotide diversity vary greatly across the genome56 and also geographically (Figure 2b). Distributions of values were highest in Africa, followed by Bangladesh, but the scale of regional differences was relatively modest, ranging from an average of 0.030% in Eastern Southeast Asia to 0.040% in West Africa (median=0.019% and 0.028% respectively; Figure 2b). In other words, the nucleotide diversity of each regional parasite population was not much less than that of

Table 1. Count of samples in the dataset. Countries are grouped into eight geographic regions based on their geographic and genetic characteristics. For each country, the table reports: the number of distinct sampling locations; the total number of samples sequenced; the number of high-quality samples included in the analysis; and the percentage of samples collected between 2012–2015, the most recent sampling period in the dataset. Eight samples were obtained from travellers returning from an endemic country, but where the precise site of the infection could not be determined. These were reported from Ghana (3 sequenced samples/2 analysis set samples), Kenya (2/1), Uganda (2/1) and Mozambique (1/1). “Lab samples” contains all sequences obtained from long-term in vitro cultured and adapted isolates, e.g. laboratory strains. The breakdown by site is reported in Supplementary table 1 and the list of contributing studies in Supplementary table 2.
| Region | Country | Sampling locations | Sequenced samples | Analysis set samples | % analysis samples 2012–2015 |
|---|---|---|---|---|---|
| South America (SAM) | Colombia | 4 | 16 | 16 | 0% |
| South America (SAM) | Peru | 2 | 23 | 21 | 0% |
| West Africa (WAF) | Benin | 1 | 102 | 36 | 100% |
| West Africa (WAF) | Burkina Faso | 1 | 57 | 56 | 0% |
| West Africa (WAF) | Cameroon | 1 | 239 | 235 | 100% |
| West Africa (WAF) | Gambia | 4 | 277 | 219 | 67% |
| West Africa (WAF) | Ghana | 3 | 1,003 | 849 | 56% |
| West Africa (WAF) | Guinea | 2 | 197 | 149 | 0% |
| West Africa (WAF) | Ivory Coast | 3 | 70 | 70 | 100% |
| West Africa (WAF) | Mali | 5 | 449 | 426 | 80% |
| West Africa (WAF) | Mauritania | 4 | 86 | 76 | 100% |
| West Africa (WAF) | Nigeria | 2 | 42 | 29 | 97% |
| West Africa (WAF) | Senegal | 1 | 86 | 84 | 100% |
| Central Africa (CAF) | Congo DR | 1 | 366 | 344 | 100% |
| East Africa (EAF) | Ethiopia | 2 | 34 | 21 | 100% |
| East Africa (EAF) | Kenya | 3 | 129 | 109 | 55% |
| East Africa (EAF) | Madagascar | 3 | 25 | 24 | 100% |
| East Africa (EAF) | Malawi | 2 | 351 | 254 | 0% |
| East Africa (EAF) | Tanzania | 5 | 350 | 316 | 85% |
| East Africa (EAF) | Uganda | 1 | 14 | 12 | 0% |
| South Asia (SAS) | Bangladesh | 2 | 93 | 77 | 64% |
| Western Southeast Asia (WSEA) | Myanmar | 5 | 250 | 211 | 71% |
| Western Southeast Asia (WSEA) | Western Thailand | 2 | 962 | 868 | 24% |
| Eastern Southeast Asia (ESEA) | Cambodia | 5 | 1,214 | 896 | 32% |
| Eastern Southeast Asia (ESEA) | Northeastern Thailand | 1 | 28 | 20 | 75% |
| Eastern Southeast Asia (ESEA) | Laos | 2 | 131 | 120 | 21% |
| Eastern Southeast Asia (ESEA) | Viet Nam | 2 | 264 | 226 | 11% |
| Oceania (OCE) | Indonesia | 1 | 92 | 80 | 73% |
| Oceania (OCE) | Papua New Guinea | 3 | 139 | 121 | 63% |
| Returning travellers | Various locations | 0 | 8 | 5 | 0% |
| Lab samples | Various locations | 0 | 16 | 0 | 0% |
| Total | | 73 | 7,113 | 5,970 | 52% |

Figure 1. Population structure. (A) Genome-wide unrooted neighbour-joining tree showing population structure across all sites, with sample branches coloured according to country groupings (Table 1): South America (green, n=37); West Africa (red, n=2231); Central Africa (orange, n=344); East Africa (yellow, n=739); South Asia (purple, n=77); West Southeast Asia (light blue; n=1079); East Southeast Asia (dark blue; n=1262); Oceania (magenta; n=201). The circular inset shows a magnified view of the part of the tree where the majority of samples from Africa coalesce, showing that the three African sub-regions are genetically close but distinct. (B, C) First three components of a genome-wide principal coordinate analysis. The first axis (PC1) captures the separation of African and South American from Asian samples.
The following two axes (PC2 and PC3) capture finer levels of population structure due to geographical separation and selective forces. Each point represents a sample and the colour legend is the same as above.

Figure 2. Characteristics of the eight regional parasite populations. (A) Distribution of within-host diversity, as measured by FWS, showing that genetically mixed infections were considerably more common in Africa than other regions, consistent with the high intensity of malaria transmission in Africa. (B) Distribution of per site nucleotide diversity calculated in non-overlapping 25kbp genomic windows. We only considered coding biallelic SNPs to reduce the ascertainment bias caused by poor accessibility of non-coding regions. In both previous panels, thick lines represent median values, boxes show the interquartile range, and whiskers represent the bulk of the distribution, discounting outliers. (C) Genome-wide median LD (y-axis, measured by r2) between pairs of SNPs as function of their physical distance (x-axis, in bp), showing a rapid decay in all regional parasite populations. The inset panel shows a magnified view of the decay, showing that in all populations r2 decayed below 0.1 (dashed horizontal line) within 500 bp. All panels utilise the same palette, with colours denoting each geographic region.

the global parasite population. This is consistent with the idea that the global P. falciparum population has a common African origin and that historically there must have been significant levels of migration.

All regional sub-populations showed very low levels of linkage disequilibrium relative to human populations, e.g. r2 decayed to <0.1 within 500 bp (Figure 2c). As expected, African populations had the highest rates of LD decay, implying the highest levels of haplotype diversity.

Geographic patterns of population differentiation and gene flow
Parasite sub-populations in different locations naturally tend to differentiate over time unless there is sufficient gene flow to counterbalance genetic drift. Genome-wide estimates of FST provide an indicator of this process of genetic differentiation, which is partly determined by geographic distance (Figure 3). For example, we observe much greater genetic differentiation between South America and South Asia (genome-wide average FST 0.22) or between Africa and Oceania (0.20) than between sub-regions within Asia (<0.1) or within Africa (<0.02).

These data reveal some interesting exceptions to the general rule that genome-wide FST is correlated with geographic distance. For example, African parasites are more strongly differentiated from Southeast Asian parasites (genome-wide average FST 0.20) than they are from parasites in neighbouring Bangladesh (0.11). If this is examined in more detail, there is an unexpectedly steep gradient of genetic differentiation at the geographical boundary between South Asia and Southeast Asia, i.e. parasites sampled in Myanmar and Western Thailand are much more strongly differentiated from parasites sampled in Bangladesh (genome-wide FST 0.07) than would be expected given that these are neighbouring countries. As discussed later, Southeast Asia is the global epicentre of antimalarial drug resistance, and these observations add to a growing body of evidence that Southeast Asian parasites have acquired a wide range of genomic features that are likely due to natural selection rather than genetic drift23,40.

It is noteworthy that the level of genetic differentiation between western and eastern parts of Southeast Asia (genome-wide FST 0.05) is greater than between West Africa and East Africa (0.02) although the geographic distances are much greater in Africa.
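To make the FST comparisons above more concrete, the following is a minimal sketch of how a pairwise FST could be computed from allele frequencies. It assumes the textbook definition FST = (HT − HS)/HT for biallelic SNPs and two equally weighted populations, and it combines SNPs as a ratio of sums; the estimator actually used for the published values is not specified in this section, and the function names are illustrative only.

```python
def expected_het(p):
    """Expected heterozygosity 2p(1-p) at a biallelic site with allele frequency p."""
    return 2.0 * p * (1.0 - p)


def fst_biallelic(p1, p2):
    """F_ST for one biallelic SNP, given the allele frequency in each of two populations.

    Assumption: both populations are weighted equally and no sample-size
    correction is applied.
    """
    h_t = expected_het((p1 + p2) / 2.0)          # diversity in the pooled population
    if h_t == 0.0:
        return 0.0                               # monomorphic overall: define F_ST as 0
    h_s = (expected_het(p1) + expected_het(p2)) / 2.0  # mean within-population diversity
    return (h_t - h_s) / h_t


def genome_wide_fst(freqs_pop1, freqs_pop2):
    """Genome-wide F_ST as a ratio of sums over SNPs (not a mean of per-SNP values)."""
    numerator = 0.0
    denominator = 0.0
    for p1, p2 in zip(freqs_pop1, freqs_pop2):
        h_t = expected_het((p1 + p2) / 2.0)
        h_s = (expected_het(p1) + expected_het(p2)) / 2.0
        numerator += h_t - h_s
        denominator += h_t
    return numerator / denominator if denominator > 0.0 else 0.0
```

Under these assumptions, a SNP fixed for different alleles in the two populations gives a value of 1, and a SNP with identical frequencies in both gives 0.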
This is likely due to the lower intensity of malaria transmission in Southeast Asia, and in particular the presence of a malaria-free corridor running through Thailand, which act as barriers to gene flow across the region23,40.

Figure 3. Geographic patterns of population differentiation and gene flow. Each point represents one pairwise comparison between two regional parasite populations. The x-axis reports the geographic separation between the two populations, measured as great-circle distance between the centre of mass of each population and without taking into account natural barriers. The y-axis reports the genetic differentiation between the two populations, measured as average genome-wide FST. Points are coloured based on the regional populations they represent: between African populations (red); between Asian populations (blue); between Southeast Asia (as a whole) and Oceania, Africa or South America (purple); all the rest (orange).

Genes with high levels of geographic differentiation
The FST metric can also be calculated for individual variants to identify specific genes that have acquired high levels of geographic differentiation relative to the genome as a whole. This can be done either at the global level (to identify variants that are highly differentiated between different regions of the world) or at the local level (to identify variants that are highly differentiated between different sampling locations within a region).

To identify variants that are strongly differentiated at the global level, we began by estimating FST for each SNP across all of the eight regional sub-populations. The group of SNPs with the highest global FST levels were found to be strongly enriched for non-synonymous mutations, suggesting that the process of differentiation is at least in part due to natural selection (Figure 4). After ranking all SNPs according to their global FST value, we calculated a global differentiation score for each gene based on the highest-ranking non-synonymous SNP within the gene (see Methods). All genes are ranked according to their global differentiation score in the accompanying data release, and those with the highest score are listed in Supplementary Table 7 (Supplementary Data). The most highly differentiated gene, p47 (PF3D7_1346800), is known to interact with the mosquito immune system59 and has two variants (S242L and V247A) that are at fixation in South America but absent in other geographic regions. Also among the five most highly differentiated genes are gig (PF3D7_0935600, implicated in gametocytogenesis60), pfs16 (PF3D7_0406200, expressed on the surface of gametes61) and ctrp (PF3D7_0315200, expressed on the ookinete cell surface and essential for mosquito infection62). Thus, four of the five most highly differentiated parasite genes are involved in the process of transmission by the mosquito vector, raising the possibility that this reflects evolutionary adaptation of the P. falciparum population to the different Anopheles species that transmit malaria in different geographical regions.

It is more difficult to characterise variants that are strongly differentiated at the local level, due to smaller sample sizes and various sources of sampling bias, but a crude estimate can be obtained by analysis of each of the six geographical regions with samples from multiple countries. FST was estimated for each SNP across different sampling locations within each geographical region, and the results for different regions were combined by a heuristic approach to obtain a local differentiation score for each gene (see Methods). A range of genes associated with drug resistance (crt (PF3D7_0709000), dhfr (PF3D7_0417200), dhps (PF3D7_0810800), kelch13 (PF3D7_1343700), mdr1 (PF3D7_0523000), mdr2 (PF3D7_1447900) and fd (PF3D7_1318100)) were in the top centile of local differentiation scores (Supplementary Data; Supplementary Figure 1, Supplementary Table 8, Supplementary Note).

Figure 4. SNPs geographic differentiation. Coloured lines show the proportions of SNPs in ten FST bins, stratified by genomic regions: non-synonymous (red), synonymous (yellow), intronic (green) and intergenic (blue). FST is calculated between all eight regional parasite populations and the number of SNPs in each bin is indicated in the background histogram. The y-axis on the right-hand side refers to the histogram and is on a log scale.

Geographic patterns of drug resistance
Classification of samples based on markers of drug resistance. Antimalarial drug resistance represents a major focus of research for many partner studies within the Pf Community Project, and this dataset therefore contains a significant body of data that have appeared in previous reports on drug resistance. Readers are referred to these publications for more detailed analyses of local patterns of resistance9–14,16–22 and of resistance to specific drugs including chloroquine16,21, sulfadoxine-pyrimethamine16,19,21 and artemisinin combination therapy9–11,13–15,17,18,21,22.

Here we have classified all samples into different types of drug resistance based on published genetic markers and current knowledge of the molecular mechanisms (see www.malariagen.net/resource/26 for details of the heuristic used). Table 2 summarises the frequency of different types of drug resistance in samples from different geographical regions. Overall, we observed higher prevalence of samples classified as resistant in Southeast Asia than anywhere else, with multiple samples resistant to all drugs considered. Note that samples were collected over a relatively long time period (2002–15) during which there were major changes in global patterns of drug resistance, and that the sampling locations represented in a given year depended on which partner studies were operative at the time. To alleviate this problem, we have also divided the data into samples collected before and after 2011 (Supplementary Data; Supplementary Table 10), but temporal trends in aggregated data should be interpreted with due caution.

Below we summarise the overall profile of drug resistance types in the regional sub-populations: this is intended simply to provide context for users of this dataset, and should not be regarded as a statement of the current epidemiological situation. The Supplementary Notes (Supplementary Data) contain a more detailed description of the geographical distribution of haplotypes, CNV breakpoints, interactions between genes, and variants associated with less commonly used antimalarial drugs. In the accompanying data release, we also identify samples with mdr1, plasmepsin2/3 and gch1 gene amplifications that can affect drug resistance.

Chloroquine resistance. Samples were classified as chloroquine resistant if they carried the crt 76T allele. As shown in Table 2, this was found in almost all samples from Southeast Asia, South America and Oceania. It was also found across Africa but at lower frequencies, particularly in East Africa where chloroquine resistance is known to have declined since

Table 2.
Cumulative frequency of different types of drug resistance in samples from different geographical regions. All samples were classified into different types of drug resistance based on published genetic markers; the classifications represent a best attempt based on the available data. Each type of resistance was considered to be either present, absent or unknown for a given sample. For each resistance type, the table reports: the genetic markers considered; the drug they are associated with; the proportion of samples in each region classified as resistant out of the samples where the type was not unknown. The number of samples classified as either resistant or not resistant varies for each type of resistance considered (e.g. due to different levels of genomic accessibility); numbers in brackets report the minimum and maximum number analysed, while the exact numbers considered are reported in Supplementary Table 9. SP: sulfadoxine-pyrimethamine; treatment: SP used for the clinical treatment of uncomplicated malaria; IPTp: SP used for intermittent preventive treatment in pregnancy; AS-MQ: artesunate + mefloquine combination therapy; DHA-PPQ: dihydroartemisinin + piperaquine combination therapy. Details of the rules used to infer resistance status from genetic markers can be found on the resource page at www.malariagen.net/resource/26.
| Marker | Associated with resistance to | South America (n=33–37) | West Africa (n=1851–2231) | Central Africa (n=262–344) | East Africa (n=678–739) | South Asia (n=62–77) | Western Southeast Asia (n=906–1079) | Eastern Southeast Asia (n=867–1256) | Oceania (n=185–201) |
|---|---|---|---|---|---|---|---|---|---|
| crt 76T | Chloroquine | 100% | 41% | 66% | 14% | 93% | 100% | 97% | 99% |
| dhfr 108N | Pyrimethamine | 97% | 84% | 100% | 98% | 100% | 100% | 100% | 100% |
| dhps 437G | Sulfadoxine | 30% | 75% | 97% | 93% | 97% | 100% | 87% | 61% |
| mdr1 2+ copies | Mefloquine | 0% | 0% | 0% | 0% | 0% | 44% | 12% | 1% |
| kelch13 WHO list | Artemisinin | 0% | 0% | 0% | 0% | 0% | 28% | 46% | 0% |
| plasmepsin 2-3 2+ copies | Piperaquine | 0% | 0% | 0% | 0% | 0% | 0% | 17% | 0% |
| dhfr triple mutant | SP (treatment) | 0% | 75% | 82% | 91% | 43% | 90% | 92% | 0% |
| dhfr and dhps sextuple mutant | SP (IPTp) | 0% | 0% | 1% | 10% | 19% | 82% | 19% | 0% |
| kelch13 and mdr1 | AS-MQ | 0% | 0% | 0% | 0% | 0% | 13% | 9% | 0% |
| kelch13 and plasmepsin 2-3 | DHA-PPQ | 0% | 0% | 0% | 0% | 0% | 0% | 15% | 0% |

chloroquine was discontinued63–65. Supplementary Table 11 (Supplementary Data) shows the geographical distribution of different crt haplotypes (based on amino acid positions 72–76) which is consistent with the theory that chloroquine resistance spread from Southeast Asia to Africa with multiple independent origins in South America and Oceania66,67. The crt locus is also relevant to other types of drug resistance, e.g. crt variants that are relatively specific to Southeast Asia form the genetic background of artemisinin resistance, and newly emerging crt alleles have been associated with the spread of ACT failure due to piperaquine resistance13,14,22,68.

Sulfadoxine-pyrimethamine resistance. Clinical resistance to sulfadoxine-pyrimethamine (SP) is determined by multiple mutations and their interactions, so following current practice69 we classified SP resistant samples into four overlapping types: (i) carrying the dhfr 108N allele, associated with pyrimethamine resistance; (ii) carrying the dhps 437G allele, associated with sulfadoxine resistance; (iii) carrying the dhfr triple mutant, which is strongly associated with SP failure; (iv) carrying the dhfr/dhps sextuple mutant, which confers a higher level of SP resistance. As shown in Table 2, dhfr 108N was found in almost all samples in all regions apart from West Africa, while dhps 437G was at very high frequency throughout most of Africa and Asia, and at lower frequencies in South America and Oceania (see also Supplementary Data; Supplementary Table 12). Triple mutant dhfr parasites were common throughout Africa and Asia, whereas sextuple mutant dhfr/dhps parasites were at much lower frequency except in Western Southeast Asia. In the accompanying data release, we also identify samples with gch1 gene amplifications (Supplementary Data; Supplementary Table 4) that can modulate SP resistance70, although their effect on the clinical outcome and interaction with mutations in dhfr and dhps is not fully established.

Resistance to artemisinin combination therapy. We classified samples as artemisinin resistant based on the World Health Organization classification of non-synonymous mutations in the propeller region of the kelch13 gene that have been associated with delayed parasite clearance71. By this definition, artemisinin resistance was confined to Southeast Asia but, as previously reported, this dataset contains a substantial number of non-synonymous kelch13 propeller SNPs occurring at <5% frequency in Africa and elsewhere9. The most common ACT formulations in Southeast Asia are artesunate-mefloquine (AS-MQ) and dihydroartemisinin-piperaquine (DHA-PPQ). We classified samples as mefloquine resistant if they had mdr1 amplification72 or as piperaquine resistant if they had plasmepsin 2/3 amplification25. Mefloquine resistance was observed throughout Southeast Asia and was most common in the western part. Piperaquine resistance was confined to eastern Southeast Asia with a notable concentration in western Cambodia. Elsewhere11,13 we describe the kel1/pla1 lineage of artemisinin- and piperaquine-resistant parasites that expanded in western Cambodia during 2008–13, and then spread to other countries during 2013–18, causing high rates of DHA-PPQ treatment failure across eastern Southeast Asia: since the current dataset extends only to 2015 it captures only the first phase of the kel1/pla1 lineage expansion.

HRP2/3 deletions that affect rapid diagnostic tests
Rapid diagnostic tests (RDTs) provide a simple and inexpensive way to test for parasites in the blood of patients who are suspected to have malaria, and have become a vital tool for malaria control73,74. The most widely used RDTs are designed to detect P. falciparum histidine-rich protein 2 and cross-react with histidine-rich protein 3, encoded by the hrp2 and hrp3 genes respectively. Parasites with gene deletions of hrp2 and/or hrp3 have emerged as an important cause of RDT failure in a number of locations75–79. It is difficult to devise a simple genetic assay to monitor for risk of RDT failure because hrp2 and hrp3 deletions comprise a diverse mixture of large structural variations with multiple independent origins, and both genes are located in subtelomeric regions of the genome with very high levels of natural variation29,80–83. In the absence of a well-validated algorithmic method, we visually inspected sequence read coverage and identified samples with clear evidence of large structural variants that disrupted or deleted the hrp2 and hrp3 genes. We took a conservative approach: samples that appeared to have a mixture of deleted and non-deleted genotypes were classified as non-deleted.

Deletions were found at relatively high frequency in Peru (8 of 21 samples had hrp2 deletions, 14 had hrp3 deletions and 6 had both) but were not seen in samples from Colombia and were relatively rare outside South America. Oceania was the only other region where we observed hrp2 deletions, but at very low frequency (4%, n=3/80), and also had hrp3 deletions (25%) though no combined deletions were seen. Deletions of hrp3 only were more geographically widespread than hrp2 deletions, being common in Ethiopia (43%, n=9/21) and in Senegal (7%, n=6/84), and at relatively low frequency (<5%) in Kenya, Cambodia, Laos, and Vietnam (Supplementary Data; Supplementary Table 13). Note that these findings might underestimate the true prevalence of hrp2/hrp3 deletions, due to sampling bias (our samples were primarily collected from RDT-positive cases) and also because we focused on large structural variants and did not consider polymorphisms that might also cause RDT failure but would require more sophisticated analytical approaches. There is a need for more reliable diagnostics of hrp2 and hrp3 deletions, and we hope that these open data will accelerate this important area of applied methodological research.

Discussion
This open dataset comprises sequence reads and genotype calls on over 7,000 P. falciparum samples from MalariaGEN partner studies in 28 countries. After excluding variants and samples that failed to meet stringent quality control criteria, the dataset contains high-quality genotype calls for 3 million polymorphisms including SNPs, indels, CNVs and large structural variations, in almost 6,000 samples. The data can be analysed in their entirety or can be filtered to select for specific genes, or geographical locations, or samples with particular genotypes. This is twice the sample size of our previous consortial publication9 and is the largest available data resource for analysis of P. falciparum population structure, gene flow and evolutionary adaptation. Each sample has been annotated to show its profile of resistance to six major antimalarial drugs and whether it carries structural variations that can cause RDT failure. The classification scheme is heuristic and based on a subset of known genetic markers, so it should not be treated as a failsafe predictor of the phenotype of a particular sample. Our purpose in providing these annotations is to make it easy for users without specialist training in genetics to explore the global dataset and to analyse any subset of samples for key features that are relevant to malaria control. Samples were collected by independent groups that were operative at a given time and in a given place with distinct objectives; while care needs to be taken when interpreting results spanning multiple years and geographical settings (e.g. aggregated trends of drug resistance prevalence), this heterogeneity also allows for the exploration of a wide range of epidemiological and transmission settings.

An important function of this curated dataset is to provide information on the provenance and key features of samples associated with each partner study, thus allowing the findings reported in different publications to be linked and compared. Data produced by the Pf Community Project have been analysed in more than 50 publications (refs 5–55) and a few examples will serve to illustrate the diverse ways in which the data are being used. An analysis of samples collected across Africa by Amambua-Ngwa, Djimde and colleagues found evidence that parasite population structure overlaps with historical patterns of human migration and that the P. falciparum population in Ethiopia is significantly diverged from other parts of the continent12. A series of studies by Amato, Miotto and colleagues have documented the evolution of a multidrug-resistant lineage of P. falciparum that originated in Western Cambodia over ten years ago and is now expanding rapidly across Southeast Asia, acquiring additional resistance mutations as it spreads11,13,14. McVean and colleagues have developed a computational method for deconvolution of the haplotypic structure of mixed infections, allowing analysis of the pedigree structure of parasites that are cotransmitted by the same mosquito49. Bahlo and colleagues have developed a different haplotype-based method to describe the relatedness structure of the parasite population and to identify new genomic loci with evidence of recent positive selection50.

A recent report from the World Health Organization highlights the need for improved surveillance systems in sustaining malaria control and achieving the long-term goal of malaria eradication84. To be of practical value for national malaria control programmes, genetic data must address well-defined use cases and be readily accessible85. Amplicon sequencing technologies provide a powerful new tool for targeted genotyping that could feasibly be implemented locally in malaria-endemic countries86,87, but there remains a need for the international malaria control community to generate and share whole genome sequencing data, e.g. to monitor for newly emerging forms of drug resistance and to understand regional patterns of parasite migration. The next generation of long-read sequencing technologies will improve the precision of population genomic inference, e.g. by enabling analysis of hypervariable regions of the genome, and of pedigree structures within mixed infections. The accuracy with which the resistance phenotype of a sample can be predicted from genome sequencing data will also improve as we gain better functional understanding of the polygenic determinants of drug resistance.

Thus, the next few years are likely to see major advances in both the scale and information content of parasite genomic data. The practical value for malaria control will be greatly enhanced by the progressive acquisition of longitudinal time-series data, particularly if this is linked to other sources of epidemiological data and translated into reliable, actionable information with sufficient rapidity to allow control programmes to monitor the impact of their interventions on the parasite population in near real time. The Pf Community Project provides proof of concept that systems can be developed for groups in different countries to share data, to analyse it using standardised methods, and to make it readily accessible to other researchers and the malaria control community.

Methods
Here we summarise the bioinformatics methods used to produce and analyse the data; further details are available at www.malariagen.net/resource/26.

Ethical approval
All samples in this study were derived from blood samples obtained from patients with P. falciparum malaria, collected with informed consent from the patient or a parent or guardian. At each location, sample collection was approved by the appropriate local and institutional ethics committees. The following local and institutional committees gave ethical approval for the partner studies: Human Research Ethics Committee of the Northern Territory Department of Health & Families and Menzies School of Health Research, Darwin, Australia; National Research Ethics Committee of Bangladesh Medical Research Council, Bangladesh; Comite d’Ethique de la Recherche - Institut des Sciences Biomedicales Appliquees, Benin; Ministere de la Sante – Republique du Benin, Benin; Comité d’Éthique, Ministère de la Santé, Bobo-Dioulasso, Burkina Faso; Institutional Review Board Centre Muraz, Burkina Faso; Ministry of Health National Ethics Committee for Health Research, Cambodia; Institutional Review Board University of Buea, Cameroon; Comite Institucional de Etica de investigaciones en humanos de CIDEIM, Colombia; Comité National d’Ethique de la Recherche, Cote d’Ivoire; Comite d’Ethique Universite de Kinshasa, Democratic Republic of Congo; Armauer Hansen Research Institute Institutional Review Board, Ethiopia; Addis Ababa University, Aklilu Lemma Institute of Pathobiology Institutional Review Board, Ethiopia; Kintampo Health Research Centre Institutional Ethics Committee, Ghana; Ghana Health Service Ethical Review Committee, Ghana; University of Ghana Noguchi Medical Research Institute, Ghana; Navrongo Health Research Centre Institutional Review Board, Ghana; Comite d’Ethique National Pour la Recherché en Santé, Republique de Guinee; Indian Council of Medical Research, India; Eijkman Institute Research Ethics Commission, Eijkman Institute for Molecular Biology, Jakarta, Indonesia; KEMRI Scientific and Ethics Review Unit, Kenya; Ministry of Health National Ethics Committee For Health Research, Laos; Ethical Review Committee of University of Ilorin Teaching Hospital, Nigeria; Comité National d’Ethique auprès du Ministère de la Santé Publique, Madagascar; College of Medicine Regional Ethics Committee University of Malawi, Malawi; Faculté de Médecine, de Pharmacie et d’Odonto-Stomatologie, University of Bamako, Bamako, Mali; Ethics Committee of the Ministry of Health, Mali; Ethics committee of the Ministry of Health, Mauritania; Department of Medical Research (Lower Myanmar); Ministry of Health, Government of The Republic of the Union of Myanmar; Institutional Review Board, Papua New Guinea Institute of Medical Research, Goroka, Papua New Guinea; PNG Medical Research Advisory Council (MRAC), Papua New Guinea; Institutional Review Board, Universidad Nacional de la Amazonia Peruana, Iquitos, Peru; Ethics Committee of the Ministry of Health, Senegal; National Institute for Medical Research and Ministry of Health and Social Welfare, Tanzania; Medical Research Coordinating Committee of the National Institute for Medical Research, Tanzania; Ethics Committee, Faculty of Tropical Medicine, Mahidol University, Bangkok, Thailand; Ethics Committee at Institute for the Development of Human Research Protections, Thailand; Gambia Government/MRC Joint Ethics Committee, Banjul, The Gambia; London School of Hygiene and Tropical Medicine Ethics Committee, London, UK; Oxford Tropical Research Ethics Committee, Oxford, UK; Walter Reed Army Institute of Research, USA; National Institute of Allergy and Infectious Diseases, Bethesda, MD, USA; Ethical Committee, Hospital for Tropical Diseases, Ho Chi Minh City, Vietnam; Ministry of Health Institute of Malariology-Parasitology-Entomology, Vietnam.

Standard laboratory protocols were used to determine DNA quantity and proportion of human DNA in each sample as previously described7,56.

Data generation and curation
Reads mapping to the human reference genome were discarded before all analyses, and the remaining reads were mapped to the P. falciparum 3D7 v3 reference genome using bwa mem88 version 0.7.15. “Improved” BAMs were created using the Picard tools CleanSam, FixMateInformation and MarkDuplicates version 2.6.0 and GATK v3 base quality score recalibration. All lanes for each sample were merged to create sample-level BAM files.

We discovered potential SNPs and indels by running GATK’s HaplotypeCaller89 independently across each of the 7,182 sample-level BAM files and genotyped these for each of the 16 reference sequences (14 chromosomes, 1 apicoplast and 1 mitochondrion) using GATK’s CombineGVCFs and GenotypeGVCFs.

SNPs and indels were filtered using GATK’s Variant Quality Score Recalibration (VQSR). Variants with a VQSLOD score ≤ 0 were filtered out. Functional annotations were

converted to ZARR format and subsequent analyses were mainly performed using scikit-allel version 1.1.18 and the ZARR files.

We identified species using nucleotide sequence from reads mapping to six different loci in the mitochondrial genome, using custom Java code (available at https://github.com/malariagen/GeneticReportCard). The loci were located within the cox3 gene (PF3D7_MIT01400), as described in a previously published species detection method91. Alleles at various mitochondrial positions within the six loci were genotyped and used for classification as shown in Supplementary Table 14 (Supplementary Data).

We created a final analysis set of 5,970 samples after removing replicates, low-coverage samples, suspected contaminations or mislabellings, and mixed-species samples.

Genotyping of drug resistance markers and samples classification
We used two complementary methods to determine tandem duplication genotypes around mdr1, plasmepsin2/3 and gch1, namely a coverage-based method and a method based on position and orientation of reads near discovered duplication breakpoints. In brief, the outline algorithm is: (1) determine copy number at each locus using a coverage-based hidden Markov model (HMM); (2) determine breakpoints of identified duplications by manual inspection of reads and face-away read pairs around all sets of breakpoints; (3) for each locus in each sample, initially set copy number to that determined by the HMM if ≤ 10 CNVs discovered in total, else consider undetermined; (4) if face-away pairs provide self-sufficient evidence for the presence or absence of the amplification, override the HMM call; (5) for each locus in each sample, set the breakpoint to be that with the highest proportion of face-away reads.

We genotyped deletions in hrp2 and hrp3 by manual inspection of sequence read coverage plots.

The procedure used to map genetic markers to inferred resistance status classification is described in detail for each drug in the accompanying data release (https://www.malariagen.net/resource/26).

In brief, we called amino acids at selected loci by first determining the reference amino acids and then, for each sample, applying all variations using the GT field of the VCF file. The amino acid and copy number calls generated were used to classify all samples into different types of drug resistance. Our methods of classification were heuristic and based on the available data and current knowledge of the molecular mechanisms.
Each type of resistance was considered to be either applied using snpEff90 version 4.1. Genome regions were anno- present, absent or unknown for a given sample. tated using vcftools version 0.1.10 and masked if they were outside the core genome. Unless otherwise specified, we used Population-level analysis and characterisation biallelic SNPs that pass all quality filters for all the analysis. We calculate genetic distance between samples using biallelic SNPs that pass filters using a method previously described9. In We removed 69 samples from lab studies to create the release addition to calculating genetic distance between all pairs of VCF files which contain 7,113 samples. VCF files were samples from the current data set, we also calculated the genetic Page 17 of 31 Wellcome Open Research 2021, 6:42 Last updated: 16 NOV 2021 distance between each sample and the lab strains 3D7, 7G8, nature, format and content of the data, appropriate mechanisms GB4, HB3 and Dd2 from the Pf3k project. have been utilised for data access, as detailed below. The matrix of genetic distances was used to generate This project contains the following underlying data that are neighbour-joining trees and principal coordinates. Based on these available as an online resource: www.malariagen.net/resource/26. observations we grouped the samples into eight geographic Data are also available from Figshare. regions: South America, West Africa, Central Africa, East Africa, South Asia, the western part of Southeast Asia, the eastern part Figshare: Supplementary data to: An open dataset of Plasmodium of Southeast Asia and Oceania, with samples assigned to region falciparum genome variation in 7,000 worldwide samples. https:// based on the geographic location of the sampling site. Five sam- doi.org/10.6084/m9.figshare.1338860392. ples from returning travellers were assigned to region based • Study information: Details of the 49 contributing partner on the reported country of travel. 
studies, including description, contact information and key people. F was calculated using custom python scripts using the WS method previously described7 • S ample provenance and sequencing metadata: sample . Nucleotide diversity (π) was cal- information including partner study information, location culated in non-overlapping 25 kbp genomic windows, only and year of collection, ENA accession numbers, and QC considering coding biallelic SNPs to reduce the ascertainment information for 7,113 samples from 28 countries. bias caused by poor accessibility of non-coding regions. LD decay (r2) was calculated using the method of Rogers and Huff • Measure of complexity of infections: characterisation of and biallelic SNPs with low missingness and regional allele within-host diversity (FWS) for 5,970 QC pass samples. frequency >10%. Mean F between populations was calculated • Drug resistance marker genotypes: genotypes at known ST using Hudson’s method. markers of drug resistance for 7,113 samples, containing amino acid and copy number genotypes at six loci: crt, Allele frequencies stratified by geographic regions and sam- dhfr, dhps, mdr1, kelch13, plasmepsin 2–3. pling sites were calculated using the genotype calls produced by • Inferred resistance status classification: classification of GATK. F was calculated between all 8 regions, and also 5,970 QC pass samples into different types of resistance ST between all sites with at least 25 QC pass samples. F between to 10 drugs or combinations of drugs and to RDT detection: ST different locations for individual SNPs was calculated using chloroquine, pyrimethamine, sulfadoxine, mefloquine, Weir and Cockerham’s method. 
artemisinin, piperaquine, sulfadoxine- pyrimethamine for treatment of uncomplicated malaria, sulfadoxine- pyrimethamine for intermittent preventive treatment in We defined the global differentiation score for a gene N pregnancy, artesunate-mefloquine, dihydroartemisinin- as 1− max (N ) , where is the rank of the non-synonymous SNP piperaquine, hrp2 and hrp3 genes deletions. with the highest global F value within that gene. To define the • Drug resistance markers to inferred resistance status: ST local differentiation score, we first calculated for each region details of the heuristics utilised to map genetic markers containing multiple sites (WAF, EAF, SAS, WSEA, ESEA and to resistance status classification. OCE) F for each SNP between sites within that region. For ST • Gene differentiation: estimates of global and local each gene, we then calculated the rank of the highest F non- ST differentiation for 5,561 genes. synonymous SNP within that gene for each of the six regions. • Short variants genotypes: Genotype calls on 6,051,696 We defined the local differentiation score for each gene using SNPs and short indels in 7,113 samples from 29 countries, the second highest of these six ranks (N), to ensure that the available both as VCF and zarr files. gene was highly ranked in at least two populations, i.e. to mini- mise the chance of artefactually ranked a gene highly due to a Extended data single variant in a single population. The final local differen- This project contains the following underlying supplementary tiation score was normalised to ensure that the range of possible data available as a single document download: www.malari- scores was between 0 and 1, local differentiation score was N agen.net/resource/26. Extended data are also available from defined as 1− . Figshare. max (N ) An earlier version of this article can be found on bioRxiv Figshare: Supplementary data to: An open dataset of Plasmodium (DOI: https://doi.org/10.1101/824730). 
falciparum genome variation in 7,000 worldwide samples. https:// doi.org/10.6084/m9.figshare.1338860392. Data availability Underlying data ‘File9_Pf_6_supplementary’ contains the Supplementary Note, Data are available under the MalariaGEN terms of use for the Supplementary Tables and Supplementary Figure: Pf Community Project: https://www.malariagen.net/data/terms- • S upplementary Note use/p-falciparum-community-project-terms-use. Depending on the ○ Analysis of local differentiation score Page 18 of 31 Wellcome Open Research 2021, 6:42 Last updated: 16 NOV 2021 ○ The classic 76T chloroquine resistance mutation in Local study design, implementation and sample crt is found on multiple haplotypes collection ○ S uplhadoxine-pyrimethamine resistance is widespread Ahouidi, A, Amambua-Ngwa, A, Amaratunga, C, Amenga-Etego, and associated with many haplotypes L, Andagalu, B, Anderson, TJC, Apinjoh, T, Ashley, EA, Auburn, S, Awandare, G, Ba, H, Baraka, V, Barry, AE, Bejon, P, Bertin, ○ mdr1 duplications have many different breakpoints GI, Boni, MF, Borrmann, S, Bousema, T, Branch, O, Bull, PC, ○ Artemisinin, piperaquine, and mefloquine resistance Chotivanich, K, Claessens, A, Conway, D, Craig, A, D’Alessandro, U, Dama, S, Day, N, Denis, B, Diakite, M, Djimdé, A, Dolecek, ○ No evidence of resistance to less commonly used C, Dondorp, A, Drakeley, C, Duffy, P, Echeverry, DF, Egwang, antimalarials TG, Erko, B, Fairhurst, RM, Faiz, A, Fanello, CA, Fukuda, MM, • Supplementary Table 1. Breakdown of analysis set samples Gamboa, D, Ghansah, A, Golassa, L, Harrison, GLA, Hien, TT, by geography. Hill, CA, Hodgson, A, Imwong, M, Ishengoma, DS, Jackson, SA, • Supplementary Table 2. Studies contributing samples. Kamaliddin, C, Kamau, E, Konaté, A, Kyaw, MP, Lim, P, Lon, C, Loua, KM, Maïga-Ascofaré, O, Marfurt, J, Marsh, K, Mayxay, M, • Supplementary Table 3. Summary of discovered variant Mobegi, V, Mokuolu, OA, Montgomery, J, Mueller, I, Newton, PN, positions. 
Nguyen, TN, Noedl, H, Nosten, F, Noviyanti, R, Nzila, A, Ochola- • Supplementary Table 4. Breakpoints of duplications of Oyier, LI, Ocholla, H, Oduro, A, Omedo, I, Onyamboko, MA, gch1. Ouedraogo, J, Oyebola, K, Peshu, N, Phyo, AP, Plowe, CV, • Supplementary Table 5. Breakpoints of duplications of Price, RN, Pukrittayakamee, S, Randrianarivelojosia, M, Rayner, mdr1. JC, Ringwald, P, Ruiz, L, Saunders, D, Shayo, A, Siba, P, Su, X, Sutherland, C, Takala-Harrison, S, Tavul, L, Thathy, V, Tshefu, A, • Supplementary Table 6. Breakpoints of duplications of Verra, F, Vinetz, J, Wellems, TE, Wendler, J, White, NJ, Yavo, W, plasmepsin 2–3. Ye, H • Supplementary Table 7. Genes ranked by global differentiation score. Sequencing, data production and informatics Pearson, RD, Stalker, J, Ali, M, Amato, R, Ariani, C, Busby, G, • S upplementary Table 8. Genes ranked by local Drury, E, Hart, L, Hubbart, C, Jacob, CG, Jeffery, B, Jeffreys, differentiation score. AE, Jyothi, D, Kekre, M, Kluczynski, K, Malangone, C, Manske, • S upplementary Table 9. Number of samples used to M, Miles, A, Nguyen, T, Rowlands, K, Wright, I, Goncalves, S, determine proportions in Table 2. Rockett, KA • Supplementary Table 10. Frequencies of mutations Partner study support and coordination associated with mono- and multi-drug resistance pre- and Simpson, VJ, Miotto, O, Amato, R, Goncalves, S, Henrichs, C, post-2011. Johnson, KJ, Pearson, RD, Rockett, KA, Kwiatkowski, DP • S upplementary Table 11. Frequency of crt amino acid 72–76 haplotypes. Acknowledgements • Supplementary Table 12. Frequencies of dhfr (51, 59, This study was conducted by the MalariaGEN Plasmodium 108, 164) and dhps (437, 540, 581, 613) multi-locus falciparum Community Project, and was made possible by haplotypes. clinical parasite samples contributed by partner studies, whose • Supplementary Table 13. Frequency of HRP2 and HRP3 investigators are represented in the author list and in the associ- deletions by country. 
ated data release (https://www.malariagen.net/resource/26). This research was supported in part by the Intramural Research • S upplementary Table 14. Alleles at six mitochondrial Programme of the NIH, NIAID. In addition, the authors would positions used for the species identification. like to thank the following individuals who contributed to partner studies, making this study possible: Dr Eugene Laman • Supplementary Figure 1. Histogram of local differentiation for work in sample collection in the Republic of Guinea; score for all genes. Dr Abderahmane Tandia and Dr Yacine Deh and Dr Samuel Assefa for work in sample collection in Mauritania; Dr Ibrahim Data hosted with Figshare are available under the terms of the Sanogo for work in sample collection in Mali; Dr James Creative Commons Attribution 4.0 International license (CC- Abugri and Dr Nicholas Amoako for work coordinating sample BY 4.0). collection in Ghana. Genome sequencing was undertaken by the Wellcome Sanger Institute and we thank the staff of the Data analysis group Wellcome Sanger Institute Sample Logistics, Sequencing, and Pearson, RD*, Amato, R*, Hamilton, WL, Almagro-Garcia, J, Informatics facilities for their contribution. The authors would Chookajorn, T, Kochakarn, T, Miotto, O, Kwiatkowski, DP like to thank Erin Courtier for her assistance with the journal sub- mission. The views expressed here are solely those of the authors *Joint analysis lead and do not reflect the views, policies or positions of the U.S. Page 19 of 31 Wellcome Open Research 2021, 6:42 Last updated: 16 NOV 2021 Government or Department of Defense. Material has been of Defense. The investigators have adhered to the policies for reviewed by the Walter Reed Army Institute of Research. There protection of human subjects as prescribed in AR 70–25. PR is is no objection to its presentation and/or publication. The opin- a staff member of the World Health Organization. 
PR alone is ions or assertions contained herein are the private views of the responsible for the views expressed in this publication and author, and are not to be construed as official, or as reflecting they do not necessarily represent the decisions, policy or true views of the Department of the Army or the Department views of the World Health Organization. References 1. Malaria Genomic Epidemiology Network: A global network for investigating  17. Ashley EA, Dhorda M, Fairhurst RM, et al.: Spread of Artemisinin Resistance in  the genomic epidemiology of malaria. Nature. 2008; 456(7223): 732–7. Plasmodium falciparum Malaria. N Engl J Med. 2014; 371(5): 411–23. PubMed Abstract | Publisher Full Text | Free Full Text  PubMed Abstract | Publisher Full Text | Free Full Text  2. Chokshi DA, Parker M, Kwiatkowski DP: Data sharing and intellectual  18. Kamau E, Campino S, Amenga-Etego L, et al.: K13-propeller polymorphisms in  property in a genomic epidemiology network: policies for large-scale  Plasmodium falciparum parasites from sub-Saharan Africa. J Infect Dis. 2015; research collaboration. Bull World Health Organ. 2006; 84(5): 382–7. 211(8): 1352–5. PubMed Abstract | Publisher Full Text | Free Full Text  PubMed Abstract | Publisher Full Text | Free Full Text  3. Parker M, Bull SJ, de Vries J, et al.: Ethical data release in genome-wide  19. Ravenhall M, Benavente ED, Mipando M, et al.: Characterizing the impact  association studies in developing countries. PLoS Med. 2009; 6(11): e1000143. of sustained sulfadoxine/pyrimethamine use upon the Plasmodium PubMed Abstract | Publisher Full Text | Free Full Text  falciparum population in Malawi. Malar J. 2016; 15(1): 575. 4. Ghansah A, Amenga-Etego L, Amambua-Ngwa A, et al.: Monitoring parasite  PubMed Abstract | Publisher Full Text | Free Full Text  diversity for malaria elimination in sub-Saharan Africa. Science. 2014; 20. Gomes AR, Ravenhall M, Benavente ED, et al.: Genetic diversity of next  345(6202): 1297–8. 
generation antimalarial targets: A baseline for drug resistance  PubMed Abstract | Publisher Full Text | Free Full Text  surveillance programmes. Int J Parasitol Drugs Drug Resist. 2017; 7(2): 174–180. 5. Auburn S, Campino S, Clark TG, et al.: An effective method to purify PubMed Abstract | Publisher Full Text | Free Full Text  Plasmodium falciparum DNA directly from clinical blood samples for whole  21. Apinjoh TO, Mugri RN, Miotto O, et al.: Molecular markers for artemisinin  genome high-throughput sequencing. PLoS One. 2011; 6(7): e22213. and partner drug resistance in natural Plasmodium falciparum populations  PubMed Abstract | Publisher Full Text | Free Full Text  following increased insecticide treated net coverage along the slope of  6. Venkatesan M, Amaratunga C, Campino S, et al.: Using CF11 cellulose columns  mount Cameroon: Cross-sectional study. Infect Dis Poverty. 2017; 6(1): 136. to inexpensively and effectively remove human DNA from Plasmodium PubMed Abstract | Publisher Full Text | Free Full Text  falciparum-infected whole blood samples. Malar J. 2012; 11: 41. 22. Ross LS, Dhingra SK, Mok S, et al.: Emerging Southeast Asian PfCRT  PubMed Abstract | Publisher Full Text | Free Full Text  mutations confer Plasmodium falciparum resistance to the first-line 7. Manske M, Miotto O, Campino S, et al.: Analysis of Plasmodium falciparum  antimalarial piperaquine. Nat Commun. 2018; 9(1): 3314. diversity in natural infections by deep sequencing. Nature. 2012; 487(7407): PubMed Abstract | Publisher Full Text | Free Full Text  375–9. 23. Miotto O, Amato R, Ashley EA, et al.: Genetic architecture of artemisinin- PubMed Abstract | Publisher Full Text | Free Full Text  resistant Plasmodium falciparum. Nat Genet. 2015; 47(3): 226–34. 8. Vauterin P, Jeffery B, Miles A, et al.: Panoptes: Web-based exploration of large  PubMed Abstract | Publisher Full Text | Free Full Text  scale genome variation data. Bioinformatics. 2017; 33(20): 3243–3249. 24. 
Takala-Harrison S, Jacob CG, Arze C, et al.: Independent Emergence of  PubMed Abstract | Publisher Full Text | Free Full Text  Artemisinin Resistance Mutations Among Plasmodium falciparum in  9. MalariaGEN Plasmodium falciparum Community Project: Genomic  Southeast Asia. J Infect Dis. 2015; 211(5): 670–9. epidemiology of artemisinin resistant malaria. eLife. 2016; 5: e08714. PubMed Abstract | Publisher Full Text | Free Full Text  PubMed Abstract | Publisher Full Text | Free Full Text  25. Amato R, Lim P, Miotto O, et al.: Genetic markers associated with  10. Miotto O, Almagro-Garcia J, Manske M, et al.: Multiple populations of  dihydroartemisinin-piperaquine failure in Plasmodium falciparum malaria  artemisinin-resistant Plasmodium falciparum in Cambodia. Nat Genet. 2013; in Cambodia: a genotype-phenotype association study. Lancet Infect Dis. 45(6): 648–55. 2017; 17(2): 164–73. PubMed Abstract | Publisher Full Text | Free Full Text  PubMed Abstract | Publisher Full Text | Free Full Text  11. Amato R, Pearson RD, Almagro-Garcia J, et al.: Origins of the current outbreak  26. Borrmann S, Straimer J, Mwai L, et al.: Genome-wide screen identifies new of multidrug-resistant malaria in southeast Asia: a retrospective genetic  candidate genes associated with artemisinin susceptibility in Plasmodium study. Lancet Infect Dis. 2018; 18(3): 337–45. falciparum in Kenya. Sci Rep. 2013; 3: 3318. PubMed Abstract | Publisher Full Text | Free Full Text  PubMed Abstract | Publisher Full Text | Free Full Text  12. Amambua-Ngwa A, Amenga-Etego L, Kamau E, et al.: Major subpopulations  27. Wendler JP, Okombo J, Amato R, et al.: A Genome Wide Association Study of  of Plasmodium falciparum in sub-Saharan Africa. Science. 2019; 365(6455): Plasmodium falciparum Susceptibility to 22 Antimalarial Drugs in Kenya. 813–6. PLoS One. 2014; 9(5): e96486. PubMed Abstract | Publisher Full Text  PubMed Abstract | Publisher Full Text | Free Full Text  13. 
Hamilton WL, Amato R, van der Pluijm RW, et al.: Evolution and expansion of  28. Zhu L, Tripathi J, Rocamora FM, et al.: The origins of malaria artemisinin  multidrug-resistant malaria in southeast Asia: a genomic epidemiology  resistance defined by a genetic and transcriptomic background. Nat study. Lancet Infect Dis. 2019; 19(9): 943–51. Commun. 2018; 9(1): 5158. PubMed Abstract | Publisher Full Text | Free Full Text  PubMed Abstract | Publisher Full Text | Free Full Text  14. van der Pluijm RW, Imwong M, Chau NH, et al.: Determinants of  29. Sepúlveda N, Phelan J, Diez-Benavente E, et al.: Global analysis of Plasmodium dihydroartemisinin-piperaquine treatment failure in Plasmodium falciparum histidine-rich protein-2 (pfhrp2) and pfhrp3 gene deletions using  falciparum malaria in Cambodia, Thailand, and Vietnam: a prospective  whole-genome sequencing data and meta-analysis. Infect Genet Evol. 2018; clinical, pharmacological, and genetic study. Lancet Infect Dis. 2019; 19(9): 62: 211–9. 952–61. PubMed Abstract | Publisher Full Text  PubMed Abstract | Publisher Full Text | Free Full Text  30. Williams AR, Douglas AD, Miura K, et al.: Enhancing blockade of Plasmodium 15. Ariey F, Witkowski B, Amaratunga C, et al.: A molecular marker of artemisinin- falciparum erythrocyte invasion: assessing combinations of antibodies  resistant Plasmodium falciparum malaria. Nature. 2014; 505(7481): 50–5. against PfRH5 and other merozoite antigens. PLoS Pathog. 2012; 8(11): PubMed Abstract | Publisher Full Text | Free Full Text  e1002991. 16. Nwakanma DC, Duffy CW, Amambua-Ngwa A, et al.: Changes in malaria  PubMed Abstract | Publisher Full Text | Free Full Text  parasite drug resistance in an endemic population over a 25-year period  31. Benavente ED, Oresegun DR, de Sessions PF, et al.: Global genetic diversity  with resulting genomic evidence of selection. J Infect Dis. 2014; 209(7): of var2csa in Plasmodium falciparum with implications for malaria in  1126–35. 
pregnancy and vaccine development. Sci Rep. 2018; 8(1): 15429. PubMed Abstract | Publisher Full Text | Free Full Text  PubMed Abstract | Publisher Full Text | Free Full Text  Page 20 of 31 Wellcome Open Research 2021, 6:42 Last updated: 16 NOV 2021 32. Amambua-Ngwa A, Tetteh KKA, Manske M, et al.: Population genomic  identity by descent between haploid genotypes. Malar J. 2018; 17(1): 196. scan for candidate signatures of balancing selection to guide antigen  PubMed Abstract | Publisher Full Text | Free Full Text  characterization in malaria parasites. PLoS Genet. 2012; 8(11): e1002992. 52. Samad H, Coll F, Preston MD, et al.: Imputation-Based Population Genetics  PubMed Abstract | Publisher Full Text | Free Full Text  Analysis of Plasmodium falciparum Malaria Parasites. PLoS Genet. 2015; 11(4): 33. Campino S, Marin-Menendez A, Kemp A, et al.: A forward genetic screen  e1005131. reveals a primary role for Plasmodium falciparum Reticulocyte Binding  PubMed Abstract | Publisher Full Text | Free Full Text  Protein Homologue 2a and 2b in determining alternative erythrocyte  53. Ravenhall M, Campino S, Clark TG: SV-Pop: population-based structural  invasion pathways. PLoS Pathog. 2018; 14(11): e1007436. variant analysis and visualization. BMC Bioinformatics. 2019; 20(1): 136. PubMed Abstract | Publisher Full Text | Free Full Text  PubMed Abstract | Publisher Full Text | Free Full Text  34. Crosnier C, Iqbal Z, Knuepfer E, et al.: Binding of Plasmodium falciparum  54. Jacob CG, Tan JC, Miller BA, et al.: A microarray platform and novel SNP  merozoite surface proteins DBLMSP and DBLMSP2 to human  calling algorithm to evaluate Plasmodium falciparum field samples of low immunoglobulin M is conserved among broadly diverged sequence  DNA quantity. BMC Genomics. 2014; 15(1): 719. variants. J Biol Chem. 2016; 291(27): 14285–99. PubMed Abstract | Publisher Full Text | Free Full Text  PubMed Abstract | Publisher Full Text | Free Full Text  55. 
Preston MD, Assefa SA, Ocholla H, et al.: PlasmoView: A Web-based Resource  35. Amambua-Ngwa A, Jeffries D, Amato R, et al.: Consistent signatures  to Visualise Global Plasmodium falciparum Genomic Variation. J Infect Dis. of selection from genomic analysis of pairs of temporal and spatial  2014; 209(11): 1808–15. Plasmodium falciparum populations from the Gambia. Sci Rep. 2018; 8(1): PubMed Abstract | Publisher Full Text | Free Full Text  9687. PubMed Abstract | Publisher Full Text | Free Full Text  56. Miles A, Iqbal Z, Vauterin P, et al.: Indels, structural variation, and recombination drive genomic diversity in Plasmodium falciparum. Genome 36. Duffy CW, Amambua-Ngwa A, Ahouidi AD, et al.: Multi-population genomic  Res. 2016; 26(9): 1288–99. analysis of malaria parasites indicates local selection and differentiation PubMed Abstract | Publisher Full Text | Free Full Text  at the gdv1 locus regulating sexual development. Sci Rep. 2018; 8(1): 15763. PubMed Abstract | Publisher Full Text | Free Full Text  57. Hamilton WL, Claessens A, Otto TD, et al.: Extreme mutation bias and high AT  content in Plasmodium falciparum. Nucleic Acids Res. 2017; 45(4): 1889–901. 37. Duffy CW, Ba H, Assefa S, et al.: Population genetic structure and adaptation  PubMed Abstract | Publisher Full Text | Free Full Text  of malaria parasites on the edge of endemic distribution. Mol Ecol. 2017; 26(11): 2880–2894. 58. Carvalho CMB, Ramocki MB, Pehlivan D, et al.: Inverted genomic segments  PubMed Abstract | Publisher Full Text | Free Full Text  and complex triplication rearrangements are mediated by inverted  38. Duffy CW, Assefa SA, Abugri J, et al.: Comparison of genomic signatures of  repeats in the human genome. Nat Genet. 2011; 43(11): 1074–81. selection on Plasmodium falciparum between different regions of a country PubMed Abstract | Publisher Full Text | Free Full Text  with high malaria endemicity. BMC Genomics. 2015; 16(1): 527. 59. 
Molina-Cruz A, Garver LS, Alabaster A, et al.: The human malaria parasite  PubMed Abstract | Publisher Full Text | Free Full Text  Pfs47 gene mediates evasion of the mosquito immune system. Science. 39. Mobegi VA, Duffy CW, Amambua-Ngwa A, et al.: Genome-wide analysis of  2013; 340(6135): 984–7. selection on the malaria parasite Plasmodium falciparum in West African  PubMed Abstract | Publisher Full Text | Free Full Text  populations of differing infection endemicity. Mol Biol Evol. 2014; 31(6): 60. Gardiner DL, Dixon MWA, Spielmann T, et al.: Implication of a Plasmodium 1490–9. falciparum gene in the switch between asexual reproduction and  PubMed Abstract | Publisher Full Text | Free Full Text  gametocytogenesis. Mol Biochem Parasitol. 2005; 140(2): 153–60. 40. Shetty AC, Jacob CG, Huang F, et al.: Genomic structure and diversity of  PubMed Abstract | Publisher Full Text  Plasmodium falciparum in Southeast Asia reveal recent parasite migration  61. Moelans II, Meis JF, Kocken C, et al.: A novel protein antigen of the malaria  patterns. Nat Commun. 2019; 10(1): 2665. parasite Plasmodium falciparum, located on the surface of gametes and  PubMed Abstract | Publisher Full Text | Free Full Text  sporozoites. Mol Biochem Parasitol. 1991; 45(2): 193–204. 41. Auburn S, Campino S, Miotto O, et al.: Characterization of within-host  PubMed Abstract | Publisher Full Text  Plasmodium falciparum diversity using next-generation sequence data. 62. Dessens JT, Beetsma AL, Dimopoulos G, et al.: CTRP is essential for mosquito  PLoS One. 2012; 7(2): e32891. infection by malaria ookinetes. EMBO J. 1999; 18(22): 6221–7. PubMed Abstract | Publisher Full Text | Free Full Text  PubMed Abstract | Free Full Text  42. Assefa SA, Preston MD, Campino S, et al.: estMOI: estimating multiplicity of  63. Laufer MK, Thesing PC, Eddington ND, et al.: Return of Chloroquine  infection using parasite deep sequencing data. Bioinformatics. 2014; 30(9): Antimalarial Efficacy in Malawi. N Engl J Med. 
2006; 355(19): 1959–66. 1292–4.
43. Murray L, Mobegi VA, Duffy CW, et al.: Microsatellite genotyping and genome-wide single nucleotide polymorphism-based indices of Plasmodium falciparum diversity within clinical infections. Malar J. 2016; 15(1): 275.
44. Chang HH, Worby CJ, Yeka A, et al.: THE REAL McCOIL: A method for the concurrent estimation of the complexity of infection and SNP allele frequency for malaria parasites. PLoS Comput Biol. 2017; 13(1): e1005348.
45. O’Brien JD, Iqbal Z, Wendler J, et al.: Inferring Strain Mixture within Clinical Plasmodium falciparum Isolates from Genomic Sequence Data. PLoS Comput Biol. 2016; 12(6): e1004824.
46. Robinson T, Campino SG, Auburn S, et al.: Drug-resistant genotypes and multi-clonality in Plasmodium falciparum analysed by direct genome sequencing from peripheral blood of malaria patients. in press. PLoS One. 2011; 6(8): e23204.
47. O’Brien JD, Amenga-Etego L, Li R: Approaches to estimating inbreeding coefficients in clinical isolates of Plasmodium falciparum from genomic sequence data. Malar J. 2016; 15: 473.
48. Zhu SJ, Almagro-Garcia J, McVean G: Deconvolution of multiple infections in Plasmodium falciparum from high throughput sequencing data. Bioinformatics. 2018; 34(1): 9–15.
49. Zhu SJ, Hendry JA, Almagro-Garcia J, et al.: The origins and relatedness structure of mixed infections vary with local prevalence of P. falciparum malaria. eLife. 2019; 8: e40845.
50. Henden L, Lee S, Mueller I, et al.: Identity-by-descent analyses for measuring population dynamics and selection in recombining pathogens. PLoS Genet. 2018; 14(5): e1007279.
51. Schaffner SF, Taylor AR, Wong W, et al.: hmmIBD: software to infer pairwise
64. Laufer MK, Takala-Harrison S, Dzinjalamala FK, et al.: Return of Chloroquine-Susceptible Falciparum Malaria in Malawi Was a Reexpansion of Diverse Susceptible Parasites. J Infect Dis. 2010; 202(5): 801–8.
65. Frosch AEP, Laufer MK, Mathanga DP, et al.: Return of Widespread Chloroquine-Sensitive Plasmodium falciparum to Malawi. J Infect Dis. 2014; 210(7): 1110–4.
66. Wootton JC, Feng X, Ferdig MT, et al.: Genetic diversity and chloroquine selective sweeps in Plasmodium falciparum. Nature. 2002; 418(6895): 320–3.
67. Mita T, Tanabe K, Kita K: Spread and evolution of Plasmodium falciparum drug resistance. Elsevier, Parasitol Int. 2009; 58(3): 201–9.
68. Agrawal S, Moser KA, Morton L, et al.: Association of a Novel Mutation in the Plasmodium falciparum Chloroquine Resistance Transporter With Decreased Piperaquine Sensitivity. J Infect Dis. 2017; 216(4): 468–76.
69. Naidoo I, Roper C: Mapping ‘partially resistant’, ‘fully resistant’, and ‘super resistant’ malaria. Trends Parasitol. 2013; 29(10): 505–15.
70. Heinberg A, Kirkman L: The molecular basis of antifolate resistance in Plasmodium falciparum: looking beyond point mutations. Ann N Y Acad Sci. 2015; 1342(1): 10–8.
71. World Health Organization: Artemisinin and artemisinin-based combination therapy resistance: status report. 2018.
72. Price RN, Uhlemann AC, Brockman A, et al.: Mefloquine resistance in Plasmodium falciparum and increased pfmdr1 gene copy number. Lancet. 2004; 364(9432): 438–47.
73. Cheng Q, Gatton ML, Barnwell J, et al.: Plasmodium falciparum parasites lacking histidine-rich protein 2 and 3: a review and recommendations for accurate reporting. Malar J. 2014; 13: 283.
74. WHO: Malaria rapid diagnostic test performance. Results of WHO product testing of malaria RDTs: round 8 (2016-2018). WHO, 2018; (accessed Aug 22, 2019).
75. Gamboa D, Ho MF, Bendezu J, et al.: A Large Proportion of P. falciparum Isolates in the Amazon Region of Peru Lack pfhrp2 and pfhrp3: Implications for Malaria Rapid Diagnostic Tests. PLoS One. 2010; 5(1): e8091.
76. Rachid Viana GM, Akinyi Okoth S, Silva-Flannery L, et al.: Histidine-rich protein 2 (pfhrp2) and pfhrp3 gene deletions in Plasmodium falciparum isolates from select sites in Brazil and Bolivia. PLoS One. 2017; 12(3): e0171150.
77. Parr JB, Verity R, Doctor SM, et al.: Pfhrp2-deleted Plasmodium falciparum parasites in the democratic republic of the congo: a national cross-sectional survey. J Infect Dis. 2017; 216(1): 36–44.
78. Menegon M, L’Episcopia M, Nurahmed AM, et al.: Identification of Plasmodium falciparum isolates lacking histidine-rich protein 2 and 3 in Eritrea. Infect Genet Evol. 2017; 55: 131–4.
79. Bharti PK, Chandel HS, Ahmad A, et al.: Prevalence of pfhrp2 and/or pfhrp3 Gene Deletion in Plasmodium falciparum Population in Eight Highly Endemic States in India. PLoS One. 2016; 11(8): e0157949.
80. Baker J, Ho MF, Pelecanos A, et al.: Global sequence variation in the histidine-rich proteins 2 and 3 of Plasmodium falciparum: implications for the performance of malaria rapid diagnostic tests. Malar J. 2010; 9: 129.
81. Akinyi S, Hayden T, Gamboa D, et al.: Multiple genetic origins of histidine-rich protein 2 gene deletion in Plasmodium falciparum parasites from Peru. Sci Rep. 2013; 3: 2797.
82. Akinyi Okoth S, Abdallah JF, Ceron N, et al.: Variation in Plasmodium falciparum Histidine-Rich Protein 2 (Pfhrp2) and Plasmodium falciparum Histidine-Rich Protein 3 (Pfhrp3) Gene Deletions in Guyana and Suriname. PLoS One. 2015; 10(5): e0126805.
83. Parr JB, Anderson O, Juliano JJ, et al.: Streamlined, PCR-based testing for pfhrp2- and pfhrp3-negative Plasmodium falciparum. Malar J. 2018; 17(1): 137.
84. World Health Organisation: WHO Strategic Advisory Group on Malaria Eradication. Malaria eradication: benefits, future scenarios and feasibility. Executive Summary. WHO Strategic Advisory Group on Malaria Eradication. Executive Summary. Geneva: World Health Organisation, 2019.
85. Dalmat R, Naughton B, Kwan-Gett TS, et al.: Use cases for genetic epidemiology in malaria elimination. Malar J. 2019; 18(1): 163.
86. Early AM, Daniels RF, Farrell TM, et al.: Detection of low-density Plasmodium falciparum infections using amplicon deep sequencing. Malar J. 2019; 18(1): 219.
87. Boyce RM, Hathaway N, Fulton T, et al.: Reuse of malaria rapid diagnostic tests for amplicon deep sequencing to estimate Plasmodium falciparum transmission intensity in western Uganda. Sci Rep. 2018; 8(1): 10159.
88. Li H, Durbin R: Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009; 25(14): 1754–60.
89. DePristo MA, Banks E, Poplin R, et al.: A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet. 2011; 43(5): 491–8.
90. Cingolani P, Platts A, Wang LL, et al.: A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly (Austin). 2012; 6(2): 80–92.
91. Echeverry DF, Deason NA, Davidson J, et al.: Human malaria diagnosis using a single-step direct-PCR based on the Plasmodium cytochrome oxidase III gene. Malar J. 2016; 15: 128.
92. MalariaGEN: Supplementary data to: An open dataset of Plasmodium falciparum genome variation in 7,000 worldwide samples. figshare. Dataset. 2021. http://www.doi.org/10.6084/m9.figshare.13388603.v1

Open Peer Review
Current Peer Review Status:
Version 1
Reviewer Report 25 March 2021
https://doi.org/10.21956/wellcomeopenres.17752.r42796
© 2021 Menard D. This is an open access peer review report distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Didier Menard
Malaria Genetics and Resistance Unit, Parasites and Insect Vectors Department, Institut Pasteur, Paris, France

This manuscript from the MalariaGEN consortium, a data-sharing community of teams working on Plasmodium falciparum genomic epidemiology, presents the new release of curated P. falciparum genomes from isolates collected in 73 locations in Africa, Asia, South America and Oceania.

Based on robust and thoroughly detailed methods (covering the treatment of the blood samples, the DNA extraction, and the Illumina and computational platforms developed to produce genome sequences for variant discovery and genotype calling), they analyzed 7000 P. falciparum genome sequences and provided a wealth of exciting data. For instance, they found that variations (SNPs and indels) in the P. falciparum genome affected about a quarter of the 23 Mb genome (and mostly coding regions), and that duplication genotypes are frequent around mdr1, plasmepsin2/3 and gch1, which are known to be associated with antimalarial drug resistance (including mefloquine, piperaquine and sulfadoxine/pyrimethamine). Moreover, population genetic analyses conducted on this resource, the largest available, depict a comprehensive picture of P. falciparum parasite populations globally and of sub-populations at continental level. In the results, a large section is devoted to the description of the geographic patterns of validated molecular markers (SNPs and CNVs) associated with antimalarial drug resistance. By compiling data on all samples collected from 2002–2015, they present clear profiles of drug resistance by regional sub-populations for the most used antimalarial drugs. Finally, they reveal the global landscape of a major challenge for malaria elimination: deletions in the hrp2 and hrp3 genes, which are linked to false-negative results from HRP2-based malaria RDTs.
The manuscript is written in a very clear way, and it must be pointed out that the authors have made huge efforts to make these data understandable for a general audience, especially for non-experts in genomics and for policy makers in malaria endemic countries. Their data effectively depict the main challenges currently encountered in the fight against malaria: monitoring the strategies deployed by assessing their impact on P. falciparum parasite populations, the geographical evolution of antimalarial drug resistance, and the effectiveness of the diagnostic tools used in malaria endemic areas (i.e. malaria RDT).

Of note, the authors fairly present the main issues and drawbacks related to the methods used (i.e. the analytical challenges due to long tracts of highly repetitive sequence and hypervariable regions within the P. falciparum genome, and the challenges of studying a complex mixture of genotypes from polyclonal infections).

Although I am impressed by the work done by the consortium, I have several minor comments that could improve the manuscript:
○ Sample collection - P. falciparum samples investigated are not from systematic sampling collections dedicated to this study but rather from multiple studies conducted by groups with different objectives and from heterogeneous populations (patients living in malaria endemic areas, travelers, etc.). I think this issue should be discussed in the manuscript.
○ Likewise, the long time period covering the samples collection (2002–2015) is also a major bias which can alter the final results.
○ I guess that all samples were collected from symptomatic patients seen at health facilities level? Unfortunately, this means that the data presented capture only the P. falciparum populations infecting this group.
With the rise of new technologies, I am wondering whether the MalariaGEN consortium could investigate samples collected from asymptomatic individuals and explore the genomic profiles of this hidden reservoir, which nevertheless represents the major parasite biomass?
○ I am aware that the authors have performed a difficult and complex exercise by providing high quality genomic data and a comprehensive description of their data for a large audience. The major challenge that is not addressed in the manuscript is how these important data can be translated into concrete actions in the field by health providers.
○ Last comment regarding the database. It would be helpful to provide, for each sample/genome sequence, the location (country) and the date of collection.

Is the work clearly and accurately presented and does it cite the current literature? Yes
Is the study design appropriate and is the work technically sound? Yes
Are sufficient details of methods and analysis provided to allow replication by others? Yes
If applicable, is the statistical analysis and its interpretation appropriate? I cannot comment. A qualified statistician is required.
Are all the source data underlying the results available to ensure full reproducibility? Yes
Are the conclusions drawn adequately supported by the results? Yes

Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Expert in antimalarial drug resistance
I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.

Author Response 06 Jul 2021
Richard Pearson, Wellcome Sanger Institute, Hinxton, UK

We thank the reviewers for the extremely positive and supportive feedback. In their comments and suggestions both reviewers have well captured the spirit of this data resource and of the large collaborative network behind it.
We are pleased to submit detailed responses and a revised version of the manuscript that addresses their comments.

2.1) Sample collection - P. falciparum samples investigated are not from systematic sampling collections dedicated to this study but rather from multiple studies conducted by groups with different objectives and from heterogeneous populations (patients living in malaria endemic areas, travelers, etc.). I think this issue should be discussed in the manuscript.

Thanks for raising this point. On the one hand, the heterogeneity of sampling approaches offers a unique opportunity to investigate questions in a variety of epidemiological settings in a systematic way. Specifics of each study are provided in ftp://ngs.sanger.ac.uk/production/malaria/pfcommunityproject/Pf6/Pf_6_partner_studies.pdf and users of the resource can contact individual investigators for further details. On the other hand, we agree that this heterogeneity can also act as a confounder in some analyses, which is why we’ve devoted significant time to the curation of the dataset to make it “analysis ready”. As suggested, we have amended the manuscript in version 2 to include the considerations above in the paragraph: “Samples were collected by independent groups that were operative at a given time and in a given place with distinct objectives; while care needs to be taken when interpreting results spanning multiple years and geographical settings (e.g. aggregated trends of drug resistance prevalence), this heterogeneity also allows for the exploration of a wide range of epidemiological and transmission settings.”

2.2) Likewise, the long time period covering the samples collection (2002–2015) is also a major bias which can alter the final results.
This is an important point, in particular for interpreting drug resistance results, and one we explicitly bring out in the paragraph: “Note that samples were collected over a relatively long time period (2002–15) during which there were major changes in global patterns of drug resistance, and that the sampling locations represented in a given year depended on which partner studies were operative at the time. To alleviate this problem, we have also divided the data into samples collected before and after 2011 (Supplementary Data; Supplementary table 10), but temporal trends in aggregated data should be interpreted with due caution.” Following the reviewer’s suggestion, we have now stressed this point further in our reply to point (2.1) above.

2.3) I guess that all samples were collected from symptomatic patients seen at health facilities level? Unfortunately, this means that the data presented capture only the P. falciparum populations infecting this group. With the rise of new technologies, I am wondering whether the MalariaGEN consortium could investigate samples collected from asymptomatic individuals and explore the genomic profiles of this hidden reservoir, which nevertheless represents the major parasite biomass?

Asymptomatic infections are indeed a highly significant reservoir that needs to be explicitly considered to achieve a complete and accurate picture of the transmission landscape. New technologies have begun to dig deeper into this area, and initial results are very encouraging that good quality data can indeed be obtained from asymptomatic and/or low parasitemia subjects. MalariaGEN would certainly be supportive of this kind of effort, and we do indeed have active collaborations with partners exploring these questions.
To the best of our knowledge, though, some of these methodologies are still of limited sensitivity and in part experimental, and will require further work before they can be deployed on the large scale required by this scientific question; that is certainly an area for future investigation.

2.4) I am aware that the authors have performed a difficult and complex exercise by providing high quality genomic data and a comprehensive description of their data for a large audience. The major challenge that is not addressed in the manuscript is how these important data can be translated into concrete actions in the field by health providers.

This data resource represents a clear step towards the ultimate objective of translating genomic surveillance outputs into concrete actions, although it is fair to say that this is a long journey with many different components. The ability of multiple groups to share data, to analyse it using standardised methods, and to make it readily accessible is the foundation for translational impact to reach maturity. In the discussion we highlighted a series of future translational directions which have been and will be facilitated by resources like this one (and future ones), but it is certainly true that these results require careful interpretation due to the caveats highlighted in the paper and by the reviewer, which inevitably limit their impact. At the same time, this dataset does create a systematic framework to enact and contextualize future discoveries of that nature and, indirectly, contributes to them. Ultimately, the practical value for malaria control will be greatly enhanced by the progressive acquisition of longitudinal time-series data and their integration with other sources of epidemiological data, which will allow control programmes to monitor the impact of their interventions on the parasite population in near real time.

2.5) Last comment regarding the database.
It would be helpful to provide, for each sample/genome sequence, the location (country) and the date of collection.

This information is included in the “Sample provenance and sequencing metadata” file available at the resource page https://www.malariagen.net/resource/26

Competing Interests: No competing interests were disclosed.

Reviewer Report 22 March 2021
https://doi.org/10.21956/wellcomeopenres.17752.r42794
© 2021 Veiga M et al. This is an open access peer review report distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Maria Isabel Veiga
ICVS/3B's - PT Government Associate Laboratory, University of Minho, Braga, Portugal
Nuno S. Osório
ICVS/3B's - PT Government Associate Laboratory, University of Minho, Braga, Portugal

The analysis of whole-genome sequences obtained from Plasmodium falciparum is particularly challenging due to the presence of hypervariable regions, highly repetitive sequences, and the frequent mixture of parasites due to multiple infections of the host. The authors of this study describe a curated list of over three million high-confidence polymorphisms obtained from the genome sequence analysis of more than 7000 samples of P. falciparum collected by several studies in 73 locations in Africa, Asia, South America and Oceania. This work, reporting a laudable effort to substantially enrich publicly available genome data of P. falciparum worldwide, is of paramount importance for the field. The contribution is in line with the authors' previous consortium publications, greatly extending the amount of available data that can be analysed via the web with powerful data analysis pipelines. By providing open access to a curated list of polymorphisms based on reproducible and high-quality protocols for the sequencing and analysis of P.
falciparum genomes, this study is likely to reduce the difficulties that have delayed research on the genomic epidemiology and population genomics of P. falciparum. Among other advances, studies in this area are likely to have important implications for a better understanding of the evolution towards drug resistance of the different global parasite populations, ultimately contributing to better control of this devastating disease.

The manuscript is very well written and clear. It presents eight genetically distinct populations of parasites, each endemic to a different world region: South America, West Africa, Central Africa, East Africa, South Asia, West Southeast Asia, East Southeast Asia and Oceania. An interesting genetic and geographic characterization of the eight parasite populations is also shown. Of note are the finding of higher within-host diversity in the parasite populations endemic to Africa, the identification of single nucleotide polymorphisms with high levels of geographic differentiation, and the further characterization of geographic patterns of drug resistance and of polymorphisms with potential impact on rapid diagnostic tests.

We do not have major criticisms of the study. Our minor suggestions for the improvement of the manuscript focus on:
○ Increasing the accessibility of the table listing polymorphisms in supplementary data. The authors do provide the data in VCF and zarr files, which are neither very user friendly nor allow a fast search for a specific polymorphism. We understand that developing a web interface for this purpose would be a challenge beyond this research article, but perhaps the VCF data could be exported into tables made available in online repositories.
○ Add PfMDR1 N86Y to supplementary file 4, which describes the drug resistance marker genotypes.
This SNP is a well-known modulator of antimalarial response and considered a risk factor for the treatment of artemether-lumefantrine.
○ Add the ID of the genes most mentioned in the main article. The gene ID (PF3D7_xxxxxxx) is provided in supplementary file 7 but, for the reader's clarity, we recommend also adding it in the main article when the genes are first described.
○ In the results section, when describing gene amplification and different sets of breakpoints, the authors describe complex rearrangements that have not been observed before in Plasmodium species. With regard to pfmdr1, duplication events have been described that vary in size and span different genes in different parasites (refs 1–4). Using a genome-walking-like approach, different amplicon sizes containing pfmdr1 have been described in clinical isolates from Southeast Asia, in a study that also investigated whether the type (i.e., which genes are included) and size of the amplicon influence drug susceptibility phenotypes (ref 5).

References
1. Foote S, Thompson J, Cowman A, Kemp D: Amplification of the multidrug resistance gene in some chloroquine-resistant isolates of P. falciparum. Cell. 1989; 57 (6): 921-930
2. Nair S, Nash D, Sudimack D, Jaidee A, et al.: Recurrent gene amplification and soft selective sweeps during evolution of multidrug resistance in malaria parasites. Mol Biol Evol. 2007; 24 (2): 562-73
3. Triglia T, Foote SJ, Kemp DJ, Cowman AF: Amplification of the multidrug resistance gene pfmdr1 in Plasmodium falciparum has arisen as multiple independent events. Mol Cell Biol. 1991; 11 (10): 5244-50
4. Ribacke U, Mok BW, Wirta V, Normark J, et al.: Genome wide gene amplifications and deletions in Plasmodium falciparum. Mol Biochem Parasitol. 2007; 155 (1): 33-44
5. Veiga MI, Ferreira PE, Malmberg M, Jörnhagen L, et al.: pfmdr1 amplification is related to increased Plasmodium falciparum in vitro sensitivity to the bisquinoline piperaquine. Antimicrob Agents Chemother. 2012; 56 (7): 3615-9

Is the work clearly and accurately presented and does it cite the current literature? Yes
Is the study design appropriate and is the work technically sound? Yes
Are sufficient details of methods and analysis provided to allow replication by others? Yes
If applicable, is the statistical analysis and its interpretation appropriate? Yes
Are all the source data underlying the results available to ensure full reproducibility? Yes
Are the conclusions drawn adequately supported by the results? Yes

Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Molecular epidemiology, antimalarial drug resistance
We confirm that we have read this submission and believe that we have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.

Author Response 06 Jul 2021
Richard Pearson, Wellcome Sanger Institute, Hinxton, UK

We thank the reviewers for the extremely positive and supportive feedback. In their comments and suggestions both reviewers have well captured the spirit of this data resource and of the large collaborative network behind it. We are pleased to submit detailed responses and a revised version of the manuscript that addresses their comments.

1.1) Increasing the accessibility of the table listing polymorphisms in supplementary data. The authors do provide the data in VCF and zarr files, which are neither very user friendly nor allow a fast search for a specific polymorphism.
We understand that developing a web interface for this purpose would be a challenge beyond this research article, but perhaps the VCF data could be exported into tables made available in online repositories.

We thank the reviewer for this important feedback on how to increase the reach of this resource. Since the publication of this article, we have been working on an initial web interface that allows users to navigate some aspects of the data: please see https://www.malariagen.net/apps/pf6. The current version mainly focuses on epidemiologically relevant data and emphasises the community behind the project; at the moment it does not provide access to the genomic variation information, which will require further work.

Of course, accessibility is a relative criterion and as such it requires balancing different priorities. In the past we have provided tabular versions of the data (www.malariagen.net/data) but the benefits have been very limited. For example, handling multiallelic and non-SNP variations requires somewhat arbitrary encoding decisions that significantly affect the simplicity and intuitiveness of the tabular format. Increasing the sample size has made these variations more common (e.g. in this release there are about 50% non-SNP variants and 50% multiallelic variants) to the point that there was no real advantage in maintaining the format. The decision to rely primarily on the VCF format comes from the recognition that these files are the de facto standard in the genomic community, which in turn has developed a large ecosystem of tools to handle them: please see the README at ftp://ngs.sanger.ac.uk/production/malaria/pfcommunityproject/Pf6/Pf_6_README_20191010.txt for some examples, e.g. to subset the data. However, we agree this might still be limiting for some use cases and we are working towards a more integrated solution.
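The encoding issue raised in this response can be made concrete with a minimal sketch. It is an illustration only and is not taken from the Pf6 release or its README: the toy records, the contig name `chr1`, and the helper names `classify` and `summarise` are all hypothetical, and the working definition used here (a record is a SNP when REF and every ALT allele is a single base; it is multiallelic when it lists more than one ALT allele) is an assumption that may differ from the criteria the authors used for their counts.

```python
from collections import Counter

# Toy VCF text: invented records for illustration, not from the Pf6 release.
TOY_VCF = """\
##fileformat=VCFv4.2
#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO
chr1\t100\t.\tA\tG\t.\tPASS\t.
chr1\t250\t.\tC\tT,G\t.\tPASS\t.
chr1\t400\t.\tAT\tA\t.\tPASS\t.
chr1\t610\t.\tG\tGTA,GTATA\t.\tPASS\t.
"""

BASES = {"A", "C", "G", "T"}


def classify(ref, alt_field):
    """Return (is_snp, is_multiallelic) for one record (assumed definitions)."""
    alts = alt_field.split(",")
    # SNP: the reference and every alternate allele are single bases.
    is_snp = ref in BASES and all(alt in BASES for alt in alts)
    # Multiallelic: more than one alternate allele on the same record.
    is_multiallelic = len(alts) > 1
    return is_snp, is_multiallelic


def summarise(vcf_text):
    """Count SNP/non-SNP and biallelic/multiallelic records in VCF text."""
    counts = Counter()
    for line in vcf_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines, meta lines and the column header
        fields = line.split("\t")
        is_snp, is_multi = classify(fields[3], fields[4])  # REF, ALT columns
        counts["snp" if is_snp else "non_snp"] += 1
        counts["multiallelic" if is_multi else "biallelic"] += 1
    return counts


counts = summarise(TOY_VCF)
print(dict(counts))
```

A flat table with one row per variant and one genotype column per sample would have to choose how to represent the second and fourth toy records (for example, one row per ALT allele, or a single row with a composite code), which is the kind of arbitrary encoding decision the response refers to.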
As an example of our direction of travel, please see https://malariagen.github.io/vector-data/landing-page.html, which presents some simplified data access workflows for the MalariaGEN Anopheles gambiae 1000 Genomes Project.

1.2) Add PfMDR1 N86Y to supplementary file 4, which describes the drug resistance marker genotypes. This SNP is a well-known modulator of antimalarial response and considered a risk factor for the treatment of artemether-lumefantrine.

We recognise that there is growing evidence of the role of PfMDR1 N86Y in artemether-lumefantrine resistance. In particular, multiple studies have shown that lumefantrine appears to select for N86. Despite that, WHO still reports markers of resistance to lumefantrine as “Yet to be validated” (p. 22 - https://www.who.int/publications/i/item/9789240012813). In this release, supplementary file 4 only contains validated markers, so it would be inconsistent to add this marker. However, we will consider adding putative markers in future releases where appropriate.

1.3) Add the ID of the genes most mentioned in the main article. The gene ID (PF3D7_xxxxxxx) is provided in supplementary file 7 but, for the reader's clarity, we recommend also adding it in the main article when the genes are first described.

We have implemented the recommendation: in version 2 of the manuscript, the gene ID is given the first time each gene is mentioned.

1.4) In the results section, when describing gene amplification and different sets of breakpoints, the authors describe complex rearrangements that have not been observed before in Plasmodium species. With regard to pfmdr1, duplication events have been described that vary in size and span different genes in different parasites (refs 1–4).
Using a genome-walking-like approach, different amplicon sizes containing pfmdr1 have been described in clinical isolates from Southeast Asia, in a study that also investigated whether the type (i.e., which genes are included) and size of the amplicon influence drug susceptibility phenotypes (ref 5).

The previously unobserved complex rearrangements we were referring to here are “dup-trpinv-dup” rearrangements, which to the best of our knowledge have only previously been described in human data (see ref 58). This large, complex structural rearrangement involves a triplicated segment embedded within a duplication, in which the triplicated segment is inverted. We recognise that the original wording in the text was ambiguous and we’ve replaced “complex rearrangements” with an explicit description of the event.

Competing Interests: No competing interests were disclosed.