I'm a computer science Ph.D. candidate at the University of Maryland, College Park, working with Professor Mihai Pop at the Center for Bioinformatics and Computational Biology. I have primarily worked on next-generation sequencing data analysis and algorithm design, focusing on projects including metagenomic assembly and 16S rRNA gene sequence clustering. Through these projects, I have also gained experience with Python, R, HPC, Bash scripting, Nextflow, and RDBMS.
Before my time at UMD, I received my B.A. in computer science from Bryn Mawr College in Pennsylvania, where I was mentored by Professor Dianna Xu on projects in computational geometry. I grew up in Guangzhou, China, where I completed my high school education and developed an interest in computer science.
I expect to graduate in May 2024 and am eager to hear about job opportunities.
Contact:
Email / GitHub / Google Scholar / LinkedIn
For decades, the 16S rRNA gene has been used to taxonomically classify prokaryotic species and to taxonomically profile microbial communities. The 16S rRNA gene has been criticized for being too conserved to differentiate between distinct species. We argue that the inability to differentiate between species is not a unique feature of the 16S rRNA gene. Rather, we observe the gradual loss of species-level resolution for other marker genes as the number of gene sequences increases in reference databases. We demonstrate this effect through the analysis of three commonly used databases of nearly-universal prokaryotic marker genes: the SILVA 16S rRNA gene database, the Genome Taxonomy Database (GTDB), and a set of 40 taxonomically-informative single-copy genes. Our results reflect a more fundamental property of the taxonomies themselves and have broad implications for bioinformatic analyses beyond taxonomic classification. Effective solutions for fine-level taxonomic classification require a more precise, and operationally-relevant, definition of the taxonomic labels being sought, and the use of combinations of genomic markers in the classification process.
16S rRNA gene sequence clustering is an important tool in characterizing the diversity of microbial communities. As 16S rRNA gene data sets are growing in size, existing sequence clustering algorithms increasingly become an analytical bottleneck. Part of this bottleneck is due to the substantial computational cost expended on small clusters and singleton sequences. We propose an iterative sampling-based 16S rRNA gene sequence clustering approach that targets the largest clusters in the data set, allowing users to stop the clustering process when sufficient clusters are available for the specific analysis being targeted. We describe a probabilistic analysis of the iterative clustering process that supports the intuition that the clustering process identifies the larger clusters in the data set first. Using real data sets of 16S rRNA gene sequences, we show that the iterative algorithm, coupled with an adaptive sampling process and a mode-shifting strategy for identifying cluster representatives, substantially speeds up the clustering process while being effective at capturing the large clusters in the data set. The experiments also show that SCRAPT (Sample, Cluster, Recruit, AdaPt and iTerate) is able to produce operational taxonomic units that are less fragmented than those produced by popular tools: UCLUST, CD-HIT and DNACLUST. The algorithm is implemented in the open-source package SCRAPT. The source code used to generate the results presented in this paper is available at https://github.com/hsmurali/SCRAPT.
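As a rough illustration of the sample, cluster, recruit, adapt, and iterate loop described above, here is a minimal Python sketch. It is not the SCRAPT implementation. The function names (`similarity`, `greedy_cluster`, `iterative_cluster`) and all parameters are hypothetical. Similarity is a toy per-position match fraction on equal-length strings rather than an alignment-based identity, the adaptive step is assumed to simply grow the sampling rate when a sample yields no multi-member cluster, and the mode-shifting strategy is omitted entirely.

```python
import random


def similarity(a, b):
    """Fraction of matching positions; a toy stand-in for alignment identity."""
    if not a or len(a) != len(b):
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_cluster(seqs, threshold):
    """Assign each sequence to the first center it matches, else open a new cluster."""
    clusters = []  # list of (center, members) pairs
    for s in seqs:
        for center, members in clusters:
            if similarity(s, center) >= threshold:
                members.append(s)
                break
        else:
            clusters.append((s, [s]))
    return clusters


def iterative_cluster(seqs, threshold=0.9, rate=0.1, growth=2.0,
                      min_sample_hits=2, max_iter=10, seed=0):
    """Repeatedly sample, cluster the sample, and recruit from the full pool.

    Returns (found, remaining): found is a list of (center, members) pairs,
    remaining holds the sequences never recruited into any cluster.
    """
    rng = random.Random(seed)
    remaining = list(seqs)
    found = []
    for _ in range(max_iter):
        if not remaining:
            break
        # Sample: draw a small random subset of the unclustered sequences.
        k = max(1, min(len(remaining), int(rate * len(remaining))))
        sample = rng.sample(remaining, k)
        # Cluster: keep only centers that attracted several sampled sequences,
        # since large clusters are the ones most likely to be hit repeatedly.
        centers = [c for c, members in greedy_cluster(sample, threshold)
                   if len(members) >= min_sample_hits]
        if not centers:
            if rate >= 1.0:
                break  # everything left is a singleton at this threshold
            # Adapt (assumed rule): sample more densely on the next pass.
            rate = min(1.0, rate * growth)
            continue
        # Recruit: pull every matching unclustered sequence into a center.
        recruited = [[] for _ in centers]
        leftover = []
        for s in remaining:
            for i, c in enumerate(centers):
                if similarity(s, c) >= threshold:
                    recruited[i].append(s)
                    break
            else:
                leftover.append(s)
        found.extend(zip(centers, recruited))
        remaining = leftover  # Iterate on what is still unclustered.
    return found, remaining
```

Because the loop can be stopped after any iteration, a caller who only needs the dominant clusters could lower `max_iter` and treat `remaining` as unprocessed input rather than as singletons.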
Throughout my academic and research career, I have gained proficiency in a wide range of tools and technologies for bioinformatics research and data analysis. Below is a list of some of the key tools I have used extensively: