A molecular barcode and online tool to identify and map imported P. vivax infection

24 Sep 2019
Hidayat Trimarsanto, Roberto Amato, Richard D Pearson, Edwin Sutanto, Rintis Noviyanti, Leily Trianty, Jutta Marfurt, Zuleima Pava, Diego F Echeverry, Tatiana M Lopera-Mesa, Lidia Madeline Montenegro, Alberto Tobón-Castaño, Matthew J Grigg, Bridget Barber, Timothy William, Nicholas M Anstey, Sisay Getachew, Beyene Petros, Abraham Aseffa, Ashenafi Assefa, Awab Ghulam Rahim, Nguyen Hoang Chau, Tran Tinh Hien, Mohammad Shafiul Alam, Wasif A Khan, Benedikt Ley, Kamala Thriemer, Sonam Wangchuck, Yaghoob Hamedi, Ishag Adam, Yaobao Liu, Qi Gao, Kanlaya Sriprawat, Marcelo U Ferreira, Alyssa Barry, Ivo Mueller, Eleanor Drury, Sonia Goncalves, Victoria Simpson, Olivo Miotto, Alistair Miles, Nicholas J White, Francois Nosten, Dominic P Kwiatkowski, Ric N Price, Sarah Auburn

Imported cases present a considerable challenge to the elimination of malaria. Traditionally, patient travel history has been used to identify imported cases, but the long-latency liver stages confound this approach in Plasmodium vivax. Molecular tools to identify and map imported cases offer a more robust approach that can be combined with drug resistance and other surveillance markers in high-throughput, population-based genotyping frameworks. Using a machine learning approach incorporating hierarchical FST (HFST) and decision tree (DT) analysis applied to 831 P. vivax genomes from 20 countries, we identified a 28-single nucleotide polymorphism (SNP) barcode with high capacity to predict the country of origin. The Matthews correlation coefficient (MCC), which measures the quality of the classifications on a scale from −1 (total disagreement) to 1 (perfect prediction), exceeded 0.9 in 15 countries in cross-validation evaluations. When combined with an existing 37-SNP P. vivax barcode, the 65-SNP panel exhibits MCC scores exceeding 0.9 in 17 countries with up to 30% missing data. As a secondary objective, several genes were identified with moderate MCC scores (median MCC ranging from 0.54 to 0.68), making them amenable as markers for rapid testing using low-throughput genotyping approaches. A likelihood-based classifier framework was established that supports analysis of missing data and polyclonal infections. To facilitate investigator-led analyses, the likelihood framework is provided as a web-based, open-access platform (vivaxGEN-geo) to support the analysis and interpretation of data produced at either the 28-SNP core or the full 65-SNP barcode. These tools can be used by malaria control programs to identify the main reservoirs of infection so that resources can be focused where they are needed most.
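To make the MCC scale concrete, the following is a minimal sketch of how a per-country score could be computed from predicted and true countries of origin, treating each country as a one-vs-rest binary problem. The function name `country_mcc`, the country codes, and the toy labels are illustrative assumptions and are not taken from the study or from vivaxGEN-geo. Returning 0 when the denominator is zero is a common convention, also assumed here.

```python
import math


def country_mcc(true_labels, predicted_labels, country):
    """One-vs-rest Matthews correlation coefficient for a single country.

    Hypothetical helper for illustration only; not the authors' code.
    """
    tp = tn = fp = fn = 0
    for truth, pred in zip(true_labels, predicted_labels):
        is_true = truth == country
        is_pred = pred == country
        if is_true and is_pred:
            tp += 1          # correctly assigned to this country
        elif not is_true and not is_pred:
            tn += 1          # correctly assigned elsewhere
        elif is_pred:
            fp += 1          # wrongly assigned to this country
        else:
            fn += 1          # sample from this country was missed

    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0           # assumed convention for an undefined MCC
    return (tp * tn - fp * fn) / denominator


# Toy data (made up): six samples from three hypothetical countries.
truth = ["ID", "ID", "ET", "ET", "CO", "CO"]
predicted = ["ID", "ID", "ET", "CO", "CO", "CO"]

for code in ("ID", "ET", "CO"):
    print(code, round(country_mcc(truth, predicted, code), 3))
```

In this toy example, the country whose samples are all classified correctly scores 1, while a single misassignment lowers the scores of both the country that lost the sample and the country that wrongly gained it.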