Education
University of Maryland, College Park
Ph.D. in Computer Science · Advisor: Dr. Mihai Pop
2018 – 2024
Bryn Mawr College
B.A. in Computer Science
2014 – 2018
Professional Experience
Senior Bioinformatics Scientist (ML / Production Pipelines)
July 2024 – Present
Natera, Inc. · San Carlos, CA
Production Genomic Data Pipeline
Python · WDL · Docker · AWS Batch · CI/CD · CloudWatch
- Owned and shipped an end-to-end Dockerized production pipeline on AWS Batch (~50 GB/run, 20-50 samples/run): raw-data ingestion, automated QC gating, and standardized outputs consumed by downstream models; achieved <8 h turnaround via WDL parallelization.
- Pipeline powers Natera's publicly launched Fetal Focus™ product; early blinded readout reported 91% sensitivity (n=101), scaled to 1,600+ enrollments.
- Built operational tooling: job-dispatch CLI for parameterized AWS Batch submissions and smoke tests; instrumented CloudWatch peak-memory telemetry with automatic retries (2x) for transient OOM failures.
Feature-Constrained Ancestry Classification
Python · scikit-learn · RandomForest · Feature Selection · Cross-Validation
- Designed a feature-selection + modeling pipeline to classify ancestry under a fixed marker budget (no new assay sites), including data cleaning/standardization for missing/invalid markers and a CV-driven search for the minimal feature set meeting target accuracy.
- Built a one-vs-one RandomForest ensemble with probability-weighted aggregation and a confidence-based abstain option; reached 97.6% accuracy (F1=0.976) on 5-class with 112 features and 98.2% accuracy on 3-class with 23 features; validated on a small real-world cohort (13/14 correct).
Research & Projects
Research Assistant · Pop Lab
Aug 2019 – Aug 2024
University of Maryland, College Park
Neural Search and Ranking System for Genomic Sequences
Python · PyTorch · DNABert · FastAPI · Docker
- Built a genome-scale retrieval system for detecting CRISPR off-target effects; processes millions of sequence candidates using a DNABert bi-encoder for retrieval and a cross-encoder for re-ranking.
- Outperformed traditional CFD/MIT baseline models with significant improvement in nDCG@50 on out-of-distribution datasets, demonstrating robust generalization across experimental assays.
- Deployed as a FastAPI service returning top-k matches with calibrated scores; Docker-containerized.
SCRAPT — Ultra-Fast Unsupervised Clustering for Large Genomic Datasets
C++ · Multithreading · Bash
- Designed an unsupervised two-stage sequence clustering algorithm with adaptive sampling and mean-shift-style centroid recentering; developed probabilistic guarantees for recovering large clusters first with early-stopping criteria.
- Implemented in multithreaded C++ to process over 22.5M data points using ~1.2 GB of memory.
- Achieved a 27x speedup over the state-of-the-art baseline while maintaining comparable cluster quality.
Technical Skills
Languages
ML / Data Science
MLOps / Infrastructure
Data & Pipelines
Selected Publications
T. Luan*, et al. (2023). SCRAPT: an iterative algorithm for clustering large 16S rRNA gene data sets.
Nucleic Acids Research, 51(8).
IF: 13.1
Liu, S., Rodriguez, J.S., Munteanu, V., et al. (including T. Luan). (2025). Analysis of metagenomic data.
Nature Reviews Methods Primers, 5.
IF: 50.1