Phylogenetic Analysis
The world is currently in a pandemic of Coronavirus Disease 2019 (COVID-19), caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). As of April 29, 2020, there have been more than three million confirmed cases of COVID-19 and more than 224,000 deaths, and these numbers continue to increase rapidly. Because so little time has passed since the outbreak began, scientific studies are still limited and many questions about this ongoing threat remain open.
In this project, we work with our collaborators at Cleveland Clinic to study the spread of SARS-CoV-2 within hospital settings by analyzing genomic variation data. The project addresses fundamental questions in battling the COVID-19 pandemic. The genomic sequences, and the variation identified from them, will contribute to basic knowledge about the virus. Evolutionary analysis, coupled with epidemiological and clinical information, will lead to a better understanding of transmission patterns. The pipeline will be shared with the research community to aid further analysis as new data become available.
The proposed research has significant societal impact. First and foremost, understanding transmission patterns, especially in hospital settings, has direct clinical implications for protecting healthcare providers and minimizing infections within hospitals. This is of paramount importance given the unprecedented number of patients and the alarmingly high rate of infection among healthcare providers. Our goal is to understand transmission patterns through computational analysis and to help minimize infections in hospitals and communities. The project is sponsored by NSF. This section presents our visualization results.
We performed phylogenetic analysis on a data set of 302 samples, using the NC_045512.2 [1] sequence from Wuhan as the reference sequence. The analysis was done with the Nextstrain [2] visualization tool. Our data span 01/07/2020 to 04/24/2020. The analysis is organized into three major feature groups: basic features, lab results, and comorbid features. Each group has the following subcategories:
- Basic Features: Age, Gender, Clade, Date, Region, Country, State, Location, Employee, ICU admitted or not, Hospitalized or not, Dead or Alive, and GISAID Clade
- Lab Features: Creatinine, d-dimer, White Blood Cell Count, Absolute Lymphocyte Count, Interleukin-6 Count, Serum Ferritin, and Troponin
- Comorbid Features: Diabetes, Emphysema, Asthma, Hypertension (HTN), Coronary Heart Disease, Heart Failure, and Immune Suppression
We used a set of worst lab values provided by experts in the field to categorize lab features as normal or abnormal. The normal lab values we followed are:
- Creatinine: 0.73 - 1.22 mg/dL
- d-dimer: 500 ng/mL FEU
- White blood cells (WBC): 3.70 - 11.00 k/uL
- Absolute lymphocyte count: 1.0 - 4.0 k/uL
- Interleukin-6: <=5 pg/mL
- Troponin T: 0.000 - 0.029 ng/mL
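The categorization described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dictionary keys and the function name `classify_lab` are hypothetical, the ranges are copied from the list above, and d-dimer and serum ferritin are left out because the list does not give an unambiguous range for them.

```python
# Normal ranges copied from the list above as (low, high); None means "no bound".
# The key names are hypothetical labels, not the project's actual column names.
NORMAL_RANGES = {
    "creatinine_mg_dl": (0.73, 1.22),
    "wbc_k_ul": (3.70, 11.00),
    "abs_lymphocyte_k_ul": (1.0, 4.0),
    "il6_pg_ml": (None, 5.0),           # listed as <= 5 pg/mL
    "troponin_t_ng_ml": (0.000, 0.029),
}


def classify_lab(name, value):
    """Label one lab value 'normal', 'abnormal', or 'unknown' if it is missing."""
    if value is None:
        return "unknown"
    low, high = NORMAL_RANGES[name]
    if low is not None and value < low:
        return "abnormal"
    if high is not None and value > high:
        return "abnormal"
    return "normal"


# Example: one hypothetical patient's worst recorded values.
patient = {"creatinine_mg_dl": 1.8, "wbc_k_ul": 6.2, "il6_pg_ml": None}
labels = {name: classify_lab(name, value) for name, value in patient.items()}
```

Treating a missing value as its own category, rather than as normal, keeps unmeasured labs from being colored as healthy in the tree.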
We used the Nextstrain [x0] visualization tool for our analysis. It is an open-source tool that is highly useful for analyzing genomic data. Nextstrain relies on the following packages at the back end to generate the final output:
- Python 3.5 or higher
- Conda: a package management tool used to create the virtual environment for Nextstrain [1]
- Augur: a Unix-like bioinformatics tool set used to build the different analyses [2]
- Auspice: a web-based visualization package for phylogeny data [3]
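As a rough illustration of how these packages fit together, the sketch below assembles a typical chain of Augur commands (align, tree, refine, export) whose final JSON output is what Auspice displays. It reflects a common Nextstrain-style workflow and is an assumption, not the project's actual pipeline; the function name `build_augur_commands` and all file names are hypothetical, and the commands are only printed, not executed.

```python
import shlex


def build_augur_commands(sequences, metadata, reference, out_dir="results"):
    """Return a list of Augur command lines for a minimal align-to-export build.

    All paths are hypothetical placeholders; nothing is run here.
    """
    aligned = f"{out_dir}/aligned.fasta"
    raw_tree = f"{out_dir}/tree_raw.nwk"
    tree = f"{out_dir}/tree.nwk"
    branch_lengths = f"{out_dir}/branch_lengths.json"
    auspice_json = f"{out_dir}/dataset.json"
    return [
        # Align the samples against the reference sequence.
        ["augur", "align", "--sequences", sequences,
         "--reference-sequence", reference, "--output", aligned],
        # Build an initial tree from the alignment.
        ["augur", "tree", "--alignment", aligned, "--output", raw_tree],
        # Refine the tree using the sample metadata.
        ["augur", "refine", "--tree", raw_tree, "--alignment", aligned,
         "--metadata", metadata, "--output-tree", tree,
         "--output-node-data", branch_lengths],
        # Export a JSON dataset that Auspice can visualize.
        ["augur", "export", "v2", "--tree", tree, "--metadata", metadata,
         "--node-data", branch_lengths, "--output", auspice_json],
    ]


commands = build_augur_commands("sequences.fasta", "metadata.tsv", "reference.gb")
for command in commands:
    print(shlex.join(command))
```

Building the commands as lists keeps each step inspectable before it is passed to a process runner such as `subprocess.run`.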
The major steps in building the phylogenetic tree are sequence analysis and sequence alignment with the mafft [4] tool, constructing and adjusting color codes based on the metadata, and building and refining the tree. The visualization consists of three basic sections: phylogeny, transmission, and diversity.
In the phylogeny section, the tool builds the phylogenetic tree according to the selected conditions. We used several features available in our dataset: age, sex, race, GISAID clade [5], hospitalized or not, dead or alive, ICU admitted or not, hospital employee or not, lab results and comorbidities (creatinine level, coronary heart disease, etc.), region, country, location, and date.
The transmission section shows how the disease spread on a world map, using circles as markers. The size of the circle at a location is proportional to the number of samples from that location, and its color matches the color of those samples in the phylogenetic tree. In addition, the transmission lines on the map represent the movement of the pathogen over the given time frame. Finally, the diversity section shows how the samples vary from one another, using entropy and the reference sequence.
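To make the entropy measure concrete, the following sketch computes the Shannon entropy of each column of a small nucleotide alignment. It is a minimal illustration, not the code used by the visualization tool: the function names are hypothetical, the natural logarithm is an assumption, and gaps or ambiguous bases are simply ignored.

```python
import math
from collections import Counter


def site_entropy(column):
    """Shannon entropy (natural log) of one alignment column.

    Only A, C, G, and T are counted; gaps and ambiguous bases are skipped.
    """
    counts = Counter(base for base in column.upper() if base in "ACGT")
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log(n / total) for n in counts.values())


def alignment_entropy(sequences):
    """Per-site entropy for a list of equal-length aligned sequences."""
    if not sequences:
        return []
    if len({len(seq) for seq in sequences}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    return [site_entropy("".join(column)) for column in zip(*sequences)]


# Toy alignment: only the last site varies between the samples.
toy_alignment = ["ACGT", "ACGT", "ACGA", "ACGA"]
entropies = alignment_entropy(toy_alignment)
```

A site where every sample carries the same base has entropy zero, while a site split evenly between two bases reaches ln 2, so peaks in this profile mark the positions where the samples differ most.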
Phylogenetic Analysis of Cleveland Data
In this project we analyzed and visualized the SARS-CoV-2 samples collected from the Cleveland area.
Phylogenetic Analysis of North America Data
In this project we analyzed and visualized the SARS-CoV-2 samples collected from the Cleveland area together with samples collected across North America. Please zoom in on the map to view the names of the countries. The SARS-CoV-2 data used in this project can be found at GISAID.
Phylogenetic Analysis of Global Data
In this project we analyzed and visualized the SARS-CoV-2 samples collected from the Cleveland area together with samples collected across the world. Please zoom in on the map to view the names of the countries. The global data used in this project can be found at GISAID.
Try Phylogenetic Analysis for Your Data
Please follow these instructions to view the phylogenetic analysis of your own data. To view the data, you must upload a .json file generated with nextstrain [1]. Larger files will take some time to upload. You can link your samples and view them in our covid browser. Please refer to our covid browser page for more information.
- Select Json File
- Email Address for Link to Phylogenetic Tree:
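Before uploading, it may help to confirm that the file looks like a Nextstrain dataset. The sketch below assumes the file is an Auspice v2 dataset JSON, whose top level contains `version`, `meta`, and `tree`; whether the upload form performs the same check is not stated here, and the function names are hypothetical.

```python
import json

# Top-level keys expected in an Auspice v2 dataset JSON (assumed format).
REQUIRED_KEYS = ("version", "meta", "tree")


def missing_dataset_keys(dataset):
    """Return the required top-level keys that are absent from a parsed dataset."""
    if not isinstance(dataset, dict):
        return list(REQUIRED_KEYS)
    return [key for key in REQUIRED_KEYS if key not in dataset]


def check_dataset_file(path):
    """Parse a JSON file and report which required keys it lacks."""
    with open(path, encoding="utf-8") as handle:
        dataset = json.load(handle)
    return missing_dataset_keys(dataset)


# Example with an in-memory dataset instead of a real file.
example = json.loads('{"version": "v2", "meta": {}, "tree": {}}')
problems = missing_dataset_keys(example)
```

An empty list means the basic structure is present; it does not guarantee that the tree or metadata inside are complete.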
References
- Welcome to Nextstrain Documentation
- Conda
- Augur: A bioinformatics toolkit for phylogenetic analysis
- Auspice: An Open-source Interactive Tool for Visualising Phylogenomic Data
- MAFFT
- Clade and Lineage Nomenclature
- Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1, complete genome
- Nextstrain: Real-time tracking of pathogen evolution
Acknowledgements
We would like to offer our special thanks to:
- Frank Esper, Center for Pediatric Infectious Disease, Cleveland Clinic Children’s, Cleveland, Ohio
- Yu-Wei Cheng, Robert J. Tomsich Pathology and Laboratory Medicine Institute, Cleveland Clinic, Cleveland, Ohio
- Zheng Tu, Robert J. Tomsich Pathology and Laboratory Medicine Institute, Cleveland Clinic, Cleveland, Ohio
- Dan Farkas, Robert J. Tomsich Pathology and Laboratory Medicine Institute, Cleveland Clinic, Cleveland, Ohio
- Gary Procop, Robert J. Tomsich Pathology and Laboratory Medicine Institute, Cleveland Clinic, Cleveland, Ohio
- Jennifer Ko, Robert J. Tomsich Pathology and Laboratory Medicine Institute, Cleveland Clinic, Cleveland, Ohio
- Timothy A. Chan, Center for Immunotherapy and Precision Immuno-Oncology, Lerner Research Center, Taussig Cancer Institute, Cleveland Clinic, Cleveland, Ohio
- Sheriff Mossad, Department of Infectious Diseases, Respiratory Institute, Cleveland Clinic, Cleveland, Ohio
- Brian Rubin, Robert J. Tomsich Pathology and Laboratory Medicine Institute, Cleveland Clinic, Cleveland, Ohio
- Cindy Martin, Cyberinfrastructure Engineer, UTech Research Computing and Cyberinfrastructure, Case Western Reserve University, Cleveland, Ohio
- Mike Warfe, Director, UTech Research Computing and Cyberinfrastructure, Case Western Reserve University, Cleveland, Ohio
- Hadrian Djohari, HPC Service Manager, UTech Research Computing and Cyberinfrastructure, Case Western Reserve University, Cleveland, Ohio
- E.M. Dragowsky, Computing Technologist, Scientist & Information Analyst, UTech Research Computing and Cyberinfrastructure, Case Western Reserve University, Cleveland, Ohio
- Nasir Yilmaz, IT Engineer, UTech Research Computing and Cyberinfrastructure, Case Western Reserve University, Cleveland, Ohio
- Derek Li, Department of Computer and Data Sciences, Case Western Reserve University, Cleveland, Ohio and the College at the University of Chicago, Chicago, IL
- Erik A. Li, Department of Computer and Data Sciences, Case Western Reserve University, Cleveland, Ohio and Solon High School, Solon, Ohio

