Research Projects

Below are our current and completed research projects. To see results from active projects, please click on the "Projects" tab on the navigation bar.

Completed Research Projects
P1. Haplotype Inference

Haplotypes, which reflect the correlation structure of heritable variations, hold the key to our understanding of "disease genes" for many complex diseases. However, haplotype data are not collected directly, so efficient and accurate computational methods for reconstructing haplotypes from genotype data are greatly needed. We mainly focus on the problem of haplotype reconstruction from family data, and have been working on a combinatorial formulation that aims to minimize the total number of recombination events (the MRHC problem). Our group has made significant advances in understanding the complexity of the problem and in developing efficient algorithms and software tools (PedPhase) to solve it [1-4]. More recently, we proposed a hidden Markov model (HMM) based approach to predict pair-wise identical-by-descent (IBD) status [5], which can be used to solve the haplotype inference problem for large families with many untyped individuals [6-7].
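To make the MRHC objective concrete, the following minimal sketch counts the fewest recombination events needed to explain one transmitted haplotype from a single parent's two haplotypes. It covers only this one-transmission building block and is not the PedPhase algorithm; the function name and the string encoding of alleles are assumptions made for illustration.

```python
def min_recombinations(parent_hap_a, parent_hap_b, child_hap):
    """Fewest crossovers needed to copy child_hap from the two parental haplotypes.

    Haplotypes are equal-length strings of allele symbols (hypothetical encoding).
    Returns None if some allele in child_hap appears on neither parental haplotype.
    """
    inf = float("inf")
    # cost[0] / cost[1]: fewest switches so far when the current locus is
    # copied from parental haplotype a / b.
    cost = [
        0 if parent_hap_a[0] == child_hap[0] else inf,
        0 if parent_hap_b[0] == child_hap[0] else inf,
    ]
    for a, b, c in zip(parent_hap_a[1:], parent_hap_b[1:], child_hap[1:]):
        # Staying on the same haplotype is free; switching costs one event.
        next_a = min(cost[0], cost[1] + 1) if a == c else inf
        next_b = min(cost[1], cost[0] + 1) if b == c else inf
        cost = [next_a, next_b]
    best = min(cost)
    return None if best == inf else best


# Toy example with four markers.
print(min_recombinations("0011", "1100", "0000"))
print(min_recombinations("0011", "1100", "0110"))
```

In the full problem the parental haplotypes are themselves unknown and the total is minimized jointly over every transmission in the pedigree, which is what makes the problem hard.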

P2. Disease Gene Mapping Based on Haplotype Similarities

Haplotype-based association mapping approaches usually provide higher power in identifying genes underlying complex diseases. However, the large number of distinct haplotypes may compromise the power of haplotype-based association methods because of the high degrees of freedom. We have developed an algorithmic method for haplotype mapping using density-based clustering and proposed a new haplotype similarity measure [8]. The method regards haplotype segments as data points in a high-dimensional space. Haplotype segments that carry a disease susceptibility gene, especially mutants of recent origin, tend to be close to each other due to linkage disequilibrium, while other haplotype segments can be regarded as random noise sampled from the haplotype space. The algorithm is efficient and robust, and it does not require any assumptions about the evolutionary model or the inheritance pattern of the disease. It can also deal with a high level of phenocopies. The approach was later extended to quantitative traits [9] and was implemented as a software tool called HapMiner. Our more recent experiments also show that the clustering can enhance the power of the score test to detect association [10]. The approach has also been extended to family data [11].
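The clustering idea can be illustrated with a small DBSCAN-style sketch. It is not HapMiner: the similarity function below (fraction of matching alleles) is a simple stand-in for the measure proposed in [8], and the thresholds, function names, and toy haplotypes are assumptions chosen for illustration.

```python
def similarity(h1, h2):
    """Fraction of marker positions at which two equal-length segments agree.

    Illustrative stand-in only; not the similarity measure of reference [8].
    """
    return sum(a == b for a, b in zip(h1, h2)) / len(h1)


def density_cluster(haps, min_sim=0.8, min_pts=3):
    """DBSCAN-style clustering; returns one label per haplotype (-1 = noise)."""
    n = len(haps)
    # Neighborhood of each segment (includes the segment itself).
    neighbors = [
        [j for j in range(n) if similarity(haps[i], haps[j]) >= min_sim]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_pts:
            labels[i] = -1  # provisionally noise; may become a border point
            continue
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point: joins but does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_pts:
                queue.extend(neighbors[j])  # core point: keep expanding
        cluster += 1
    return labels


# Four similar segments (a dense region) plus three unrelated ones (noise).
haps = ["11010110", "11010111", "11010010", "11110110",
        "00101001", "01001100", "10100011"]
print(density_cluster(haps))
```

In a mapping setting, a dense cluster enriched for haplotypes from affected individuals would point to a candidate region, while scattered segments are treated as noise.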

P3. Structural Variation and Rare SNP Detection

Structural variations such as copy-number alterations may result in genomic disorders, and somatic copy-number variations (CNVs) play an important role in cancers. Given the remarkable capacity of current technologies for assessing CNVs, the research community has recently shown great interest in investigating inheritable as well as somatic CNVs. We have developed approaches to identify CNVs based on array comparative genomic hybridization (aCGH) data [12-14], as well as approaches for structural variation detection based on high-throughput sequencing technologies [15,16]. We have also developed an effective framework for rare SNP identification and calling using overlapped pooling designs [17].

P4. Management and Visualization of Genome-wide Association Results

Large-scale genome-wide association (GWA) studies are increasingly common. With this change in paradigm for genetic studies of complex diseases, it is vital to develop valid, powerful, and efficient tools to manage, analyze, visualize, share and integrate such data. We developed a web application named MAVEN, for Management, Analysis, Visualization and rEsults shariNg of GWA data using cutting-edge technologies [18].

P5. Gene-gene Interactions

It is well known that gene-gene interactions may play an important role in the etiology of complex diseases. We developed an efficient two-stage analysis strategy for detecting such interactions [19]. We also developed new machine learning approaches to the problem [20,21]. As a byproduct, we made available the program gs, which can generate simulated data for various interaction models [22,23].

P6. Network-based Integration for Disease Gene Ranking, Target Prediction and Drug Repositioning

The general idea of disease gene prioritization is to rank candidate genes from linkage/association results according to their relationships with known disease genes, as reflected in other data sources such as gene co-expression profiles, pathways, protein-protein interactions, etc. We have developed an expandable framework for gene prioritization that can integrate multiple heterogeneous data sources by taking advantage of a unified graph representation (i.e., biological networks) [24]. Gene-gene relationships are defined based on a normalized global measure (i.e., a diffusion kernel). Cross-validation results have shown that our approach consistently outperforms two other state-of-the-art programs. An extended framework has been used by our group for drug target prediction and ranking [25]. Furthermore, we recently proposed a heterogeneous network model for drug repositioning [26]. Some preliminary results can be found here.
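As a rough illustration of ranking with a diffusion kernel, the sketch below computes K = exp(-beta * L) on a toy network, where L is the graph Laplacian, and scores each candidate by its summed kernel value to the known disease genes. This is not the implementation from [24]: it uses the unnormalized Laplacian, a truncated Taylor series suitable only for small graphs, and a hypothetical five-gene path network; beta and all function names are assumptions.

```python
def diffusion_kernel(adj, beta=0.5, terms=40):
    """Approximate K = exp(-beta * L), with L = D - A, by a truncated Taylor series.

    adj is a symmetric adjacency matrix given as a list of lists.
    Intended for small toy graphs only.
    """
    n = len(adj)
    # m = -beta * L, where L has node degrees on the diagonal.
    m = [
        [-beta * ((sum(adj[i]) if i == j else 0.0) - adj[i][j]) for j in range(n)]
        for i in range(n)
    ]
    kernel = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in kernel]
    for k in range(1, terms + 1):
        # term becomes m^k / k!
        term = [
            [sum(term[i][x] * m[x][j] for x in range(n)) / k for j in range(n)]
            for i in range(n)
        ]
        kernel = [[kernel[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return kernel


def rank_candidates(kernel, seeds, candidates):
    """Order candidates by total kernel similarity to the seed (known disease) genes."""
    scores = {c: sum(kernel[c][s] for s in seeds) for c in candidates}
    return sorted(candidates, key=lambda c: scores[c], reverse=True)


# Hypothetical network: genes 0-1-2-3-4 connected in a path; gene 0 is a known disease gene.
adj = [
    [0, 1, 0, 0, 0],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 1, 0],
    [0, 0, 1, 0, 1],
    [0, 0, 0, 1, 0],
]
kernel = diffusion_kernel(adj)
print(rank_candidates(kernel, seeds=[0], candidates=[4, 2, 1]))
```

Unlike a shortest-path distance, the kernel accumulates contributions from all paths between two genes, which is why it is described as a global measure.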

P7. Other projects and collaborations

We also have a few other projects that are in the development stages. We warmly welcome collaborations on these and other emerging topics, as well as on real data analyses. Send me an email at

  1. Li, J. & Jiang, T. Efficient inference of haplotypes from genotypes on a pedigree. J Bioinform Comput Biol 1, 41-69 (2003).
  2. Doi, K., Li, J. & Jiang, T. Minimum recombinant haplotype configuration on tree pedigrees. In Algorithms in bioinformatics, Proceedings of the third Annual Workshop on Algorithms in Bioinformatics, 339-353 (Springer, Budapest, Hungary, 2003).
  3. Li, J. & Jiang, T. Computing the minimum recombinant haplotype configuration from incomplete genotype data on a pedigree by integer linear programming. J Comput Biol 12, 719-39 (2005).
  4. Li, X. & Li, J. Efficient haplotype inference from pedigrees with missing data using linear systems with disjoint-set data structures. In Proceedings of the seventh annual international conference on computational systems bioinformatics, 297-310 (World Scientific, Palo Alto, CA, USA, 2008).
  5. X Li, X-L Yin and J Li, Efficient identification of identical-by-descent status in pedigrees with many untyped individuals. Bioinformatics 2010 26(12):i191-i198 (ISMB'10 special issue).
  6. X Li and J Li. Haplotype Reconstruction in Large Pedigrees with Many Untyped Individuals. In Proc. 15th Annual Conference on Research in Computational Molecular Biology (RECOMB’11), Lecture Notes in Computer Science, 6577:189-203, 2011.
  7. X Li and J Li. Haplotype Reconstruction in Large Pedigrees with Untyped Individuals through IBD Inference. Journal of Computational Biology, 18(11):1411-21, 2011. PMID: 21923410. (journal version of RECOMB11 paper)
  8. Li, J. & Jiang, T. Haplotype-based linkage disequilibrium mapping via direct data mining. Bioinformatics 21, 4384-93 (2005).
  9. Li, J., Zhou, Y. & Elston, R. C. Haplotype-based quantitative trait mapping using a clustering algorithm. BMC Bioinformatics 7, 258 (2006).
  10. Igo, R. P., Jr., Li, J. & Goddard, K. A. Association mapping by generalized linear regression with density-based haplotype clustering. Genetic Epidemiology 32, 1-11 (2008).
  11. Y Chen, X Li and J Li. A New Approach for Family-based Haplotype Association Testing and Fine Mapping. BMC Bioinformatics 2010, 11(Suppl 1):S45. Presented at APBC'10.
  12. Hayes, M. & Li, J. A linear-time algorithm for analyzing array CGH data using log ratio triangulation. In Proceedings of the fifth annual International Symposium on Bioinformatics Research and Applications (ISBRA), Lecture Notes in Bioinformatics, 248-259 (Springer, Ft. Lauderdale, FL, USA, 2009).
  13. X-L Yin and J Li. Conditional random field model for detecting copy number variation. Journal of Bioinformatics and Computational Biology, 8(2):295-314, 2010.
  14. R Azad and J Li. Interpreting Biological Data via Entropic Dissection. Nucleic Acids Research doi: 10.1093/nar/gks917, 2012.
  15. M Hayes, YS Pyon and J Li. A model-based clustering method for genomic structural variant prediction and genotyping using paired-end sequencing data. PLoS One 7(12): e52881. doi:10.1371/journal.pone.0052881, 2012.
  16. M Hayes and J Li. Bellerophon: a hybrid method for detecting interchromosomal rearrangements at base pair resolution using next generation sequencing data. BMC Bioinformatics, accepted, 2013. Also presented at RECOMB Seq conference.
  17. W Wang, X Yin, Y-S Pyon, M Hayes and J Li. Rare variant discovery and calling by sequencing pooled samples with overlaps. Bioinformatics 29 (1): 29-38. doi: 10.1093/bioinformatics/bts645, 2012.
  18. K Narayanan and J Li, MAVEN: a Visualization and Functional Analysis Tool for Genome-Wide Association Results. Bioinformatics 2010 26(2):270-272.
  19. Li, J. A novel strategy for detecting multiple loci in genome-wide association studies of complex diseases. Int J Bioinform Res Appl 4, 150-63 (2008).
  20. J Li, B Horstman and Y Chen. Detecting Epistasis Effects in Genome-Wide Association Studies Based on Ensemble Approaches. Bioinformatics, 27(13):i230-i238, 2011 (ISMB'11 special issue).
  21. M Xie, J Li and T Jiang. Detecting genome-wide epistases based on the clustering of relatively frequent items. Bioinformatics 28(1):5-12, 2012.
  22. Li, J. & Chen, Y. Generating samples for association studies based on HapMap data. BMC Bioinformatics 9, 44 (2008).
  23. Y Chen and J Li. Generation of synthetic data and experimental designs in evaluating interactions for association studies. J Bioinform Comput Biol 10(1): (1240005), 2012.
  24. Y Chen, W Wang, Y Zhou, R Shields, SK Chanda, RC Elston, and J Li. In Silico Gene Prioritization by Integrating Multiple Data Sources. PLoS One 6(6): e21137, 2011.
  25. W Wang, S Yang and J Li. Drug Target Predictions Based on Heterogeneous Graph Inference, Pacific Symposium on Biocomputing, 18:53-64, 2013.
  26. W Wang, S Yang, X Zhang and J Li. Drug repositioning by integrating target information through a heterogeneous network model. Submitted. 2014.