Tu Luan

Machine Learning Engineer · Ph.D. Computer Science

Machine Learning Engineer (Ph.D. CS) who shipped production genomic data/compute pipelines (WDL/Docker/AWS Batch) powering a launched prenatal screening workflow (1.6k+ enrollments). Built and evaluated ML models (semi-supervised under label scarcity) and DNABERT-based retrieval and ranking for large-scale genomic sequences.

1.6k+ · Patient enrollments on shipped screening product
27x · Speedup over SOTA clustering baseline
22.5M · Data points processed in 1.2 GB memory

Education

University of Maryland, College Park

Ph.D. in Computer Science · Advisor: Dr. Mihai Pop

2018 – 2024

Bryn Mawr College

B.A. in Computer Science

2014 – 2018

Professional Experience

Senior Bioinformatics Scientist (ML / Production Pipelines)
July 2024 – Present
Natera, Inc. · San Carlos, CA
Production Genomic Data Pipeline
Python · WDL · Docker · AWS Batch · CI/CD · CloudWatch
  • Owned and shipped an end-to-end Dockerized production pipeline on AWS Batch (~50 GB/run, 20-50 samples/run): raw-data ingestion, automated QC gating, and standardized outputs consumed by downstream models; achieved <8 h turnaround via WDL parallelization.
  • Pipeline powers Natera's publicly launched Fetal Focus™ product; an early blinded readout reported 91% sensitivity (n=101), and the product has scaled to 1,600+ enrollments.
  • Built operational tooling: job-dispatch CLI for parameterized AWS Batch submissions and smoke tests; instrumented CloudWatch peak-memory telemetry with automatic retries (2x) for transient OOM failures.
Feature-Constrained Ancestry Classification
Python · scikit-learn · RandomForest · Feature Selection · Cross-Validation
  • Designed a feature-selection + modeling pipeline to classify ancestry under a fixed marker budget (no new assay sites), including data cleaning/standardization for missing/invalid markers and a CV-driven search for the minimal feature set meeting target accuracy.
  • Trained a one-vs-one RandomForest ensemble with probability-weighted aggregation and a confidence-based abstain option; reached 97.6% accuracy (F1 = 0.976) on the 5-class task with 112 features and 98.2% accuracy on the 3-class task with 23 features; validated on a small real-world cohort (13/14 correct).

Research & Projects

Research Assistant · Pop Lab
Aug 2019 – Aug 2024
University of Maryland, College Park
Neural Search and Ranking System for Genomic Sequences
Python · PyTorch · DNABERT · FastAPI · Docker
  • Built a genome-scale retrieval system for detecting CRISPR off-target effects; processes millions of sequence candidates using a DNABERT bi-encoder for retrieval and a cross-encoder for re-ranking.
  • Outperformed the traditional CFD and MIT scoring baselines on nDCG@50 on out-of-distribution datasets, demonstrating robust generalization across experimental assays.
  • Deployed as a FastAPI service returning top-k matches with calibrated scores; Docker-containerized.
SCRAPT — Ultra-Fast Unsupervised Clustering for Large Genomic Datasets
C++ · Multithreading · Bash
  • Designed an unsupervised two-stage sequence clustering algorithm with adaptive sampling and mean-shift-style centroid recentering; developed probabilistic guarantees for recovering large clusters first with early-stopping criteria.
  • Implemented with multithreading in C++ to process over 22.5M data points in ~1.2 GB memory.
  • Achieved a 27x speedup over the SOTA clustering baseline while maintaining comparable cluster quality.

Technical Skills

Languages

Python · C++ · Java · SQL · Bash

ML / Data Science

PyTorch · scikit-learn · pandas · NumPy · SciPy

MLOps / Infrastructure

Docker · AWS (EC2/S3/Batch) · GitHub Actions · GitLab CI · BigQuery · HPC

Data & Pipelines

WDL · Nextflow · DVC · Conda · pytest · RDBMS

Selected Publications

T. Luan*, et al. (2023). SCRAPT: an iterative algorithm for clustering large 16S rRNA gene data sets.
Nucleic Acids Research, 51(8). IF: 13.1
Liu, S., Rodriguez, J.S., Munteanu, V., et al. (including T. Luan). (2025). Analysis of metagenomic data.
Nature Reviews Methods Primers, 5. IF: 50.1