Tu Luan | Machine Learning Engineer

Education

Ph.D. in Computer Science · Advisor: Dr. Mihai Pop

2018 – 2024

B.A. in Computer Science

2014 – 2018

Experience

Senior Bioinformatics Scientist (ML / Production Pipelines)

July 2024 – Present

Natera, Inc. · San Carlos, CA

Production Genomic Data Pipeline

Python · WDL · Docker · AWS Batch · CI/CD · CloudWatch

Owned and shipped an end-to-end Dockerized production pipeline on AWS Batch (~50 GB/run, 20–50 samples/run): raw-data ingestion, automated QC gating, and standardized outputs consumed by downstream models; achieved <8 h turnaround via WDL parallelization.
Pipeline powers Natera's publicly launched Fetal Focus™ product; an early blinded readout reported 91% sensitivity (n=101), and enrollment has since scaled to 1,600+.
Built operational tooling: job-dispatch CLI for parameterized AWS Batch submissions and smoke tests; instrumented CloudWatch peak-memory telemetry with automatic retries (2x) for transient OOM failures.

Feature-Constrained Ancestry Classification

Python · scikit-learn · RandomForest · Feature Selection · Cross-Validation

Designed a feature-selection + modeling pipeline to classify ancestry under a fixed marker budget (no new assay sites), including data cleaning/standardization for missing/invalid markers and a CV-driven search for the minimal feature set meeting target accuracy.
Built a one-vs-one RandomForest ensemble with probability-weighted aggregation and a confidence-based abstain option; reached 97.6% accuracy (F1 = 0.976) on the 5-class task with 112 features and 98.2% accuracy on the 3-class task with 23 features; validated on a small real-world cohort (13/14 correct).

Research Assistant · Pop Lab

Aug 2019 – Aug 2024

University of Maryland, College Park

Neural Search and Ranking System for Genomic Sequences

Python · PyTorch · DNABert · FastAPI · Docker

Built a genome-scale retrieval system for detecting CRISPR off-target effects; processes millions of sequence candidates using a DNABert bi-encoder for retrieval and a cross-encoder for re-ranking.
Outperformed the traditional CFD and MIT scoring baselines on nDCG@50 on out-of-distribution datasets, indicating robust generalization across experimental assays.
Deployed as a Docker-containerized FastAPI service returning top-k matches with calibrated scores.

SCRAPT — Ultra-Fast Unsupervised Clustering for Large Genomic Datasets

C++ · Multithreading · Bash

Designed an unsupervised two-stage sequence clustering algorithm with adaptive sampling and mean-shift-style centroid recentering; developed probabilistic guarantees for recovering large clusters first with early-stopping criteria.
Implemented in multithreaded C++; processes over 22.5M data points using ~1.2 GB of memory.
Achieved a 27x speedup over the state-of-the-art baseline while maintaining comparable cluster quality.

Skills

Python · C++ · Java · SQL · Bash

PyTorch · scikit-learn · pandas · NumPy · SciPy

Docker · AWS (EC2/S3/Batch) · GitHub Actions · GitLab CI · BigQuery · HPC

WDL · Nextflow · DVC · Conda · pytest · RDBMS

Publications

T. Luan*, et al. (2023). SCRAPT: an iterative algorithm for clustering large 16S rRNA gene data sets.

Nucleic Acids Research, 51(8). IF: 13.1

Liu, S., Rodriguez, J.S., Munteanu, V., et al. (including T. Luan). (2025). Analysis of metagenomic data.

Nature Reviews Methods Primers, 5. IF: 50.1