A selection of my data science, machine learning, and computational humanities work: applied ML pipelines, NLP research, analytics, and AI systems. Featured projects link to public GitHub repositories.
Multimodal pipeline combining YOLO panel segmentation, OCR, and VLM/LLM-based reverse prompting to extract structured semantic representations from 26K–64K mid-20th-century German comic panels. Ongoing research at Stanford.
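The end product of a pipeline like this is a structured record per panel. A minimal sketch of such a schema, with purely illustrative field names (not the project's actual schema):

```python
from dataclasses import dataclass, asdict, field

# Hypothetical record for one extracted comic panel; every field name
# here is illustrative, not taken from the real project.
@dataclass
class PanelRecord:
    page_id: str
    panel_index: int
    bbox: tuple          # (x, y, width, height) from the panel detector
    ocr_text: str        # raw OCR output for speech bubbles / captions
    caption: str         # VLM-generated description of the panel
    reverse_prompt: str  # LLM-reconstructed "prompt" summarizing the scene
    entities: list = field(default_factory=list)

    def to_dict(self) -> dict:
        return asdict(self)

record = PanelRecord(
    page_id="example_page_12",
    panel_index=3,
    bbox=(40, 55, 310, 240),
    ocr_text="Vorwärts, Männer!",
    caption="A knight raises his sword and calls to his men.",
    reverse_prompt="medieval knight rallying soldiers, dynamic action panel",
)
```

Serializing each panel to a flat dict makes the corpus easy to load into a dataframe for downstream analysis.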
Custom German Romanticism corpus and NLP pipeline (spaCy, BERTopic, MALLET, R tidytext) tracking thematic evolution across early, transitional, and late Romanticism through topic modeling and cross-period comparative analysis.
End-to-end e-commerce analytics and ML project over 1.2M+ transaction records. Built clustering, regression, and tree-based models to surface profitability drivers, and delivered a polished R Markdown report with multi-panel analytics visualizations.
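The clustering step amounts to segmenting customers on scaled behavioral features. A sketch in Python with scikit-learn (the actual project is in R), on synthetic stand-in features:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-ins for per-customer features (spend, frequency,
# margin); the real project derives these from 1.2M+ transactions.
X = np.column_stack([
    rng.gamma(2.0, 50.0, 500),   # total spend
    rng.poisson(5, 500),         # order frequency
    rng.normal(0.2, 0.05, 500),  # average margin
])

# Standardize first so spend (large scale) doesn't dominate the distance.
km = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = km.fit_predict(StandardScaler().fit_transform(X))
```

Profiling each cluster's mean margin then points to the profitability drivers.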
End-to-end multi-class credit tier prediction on 4M+ financial records. Improved accuracy by 4% and reduced false positives by 6% through ensemble modeling and feature engineering. Built during a volunteer engagement with the Microsoft Data Science team.
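A common form of the ensemble step is soft voting over heterogeneous base models. A minimal sketch with scikit-learn on synthetic multi-class data (not the real credit features or models):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a multi-tier classification problem.
X, y = make_classification(n_samples=2000, n_classes=4, n_informative=8,
                           n_features=12, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, stratify=y, random_state=0)

ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
    ],
    voting="soft",  # average predicted class probabilities
)
ensemble.fit(Xtr, ytr)
acc = ensemble.score(Xte, yte)
```

Soft voting tends to help when the base models make uncorrelated errors, which is the usual rationale for ensembling here.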
Independent ML project tackling severe class imbalance (500 frauds in 200K+ records). Benchmarked Random Forest, SGDClassifier, and MLP with PCA, downsampling, and class reweighting to identify the best algorithm for production fraud detection.
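Class reweighting is the simplest of the three rebalancing strategies to show. A sketch benchmarking two of the named model families with `class_weight="balanced"` on synthetic data at a similar imbalance ratio:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalance mimicking ~500 frauds in 200K rows (scaled down:
# roughly 50 positives in 20K).
X, y = make_classification(n_samples=20000, weights=[0.9975],
                           n_features=10, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, stratify=y, random_state=0)

# class_weight="balanced" upweights the rare fraud class in the loss.
models = {
    "rf": RandomForestClassifier(class_weight="balanced", random_state=0),
    "sgd": SGDClassifier(class_weight="balanced", random_state=0),
}
results = {}
for name, m in models.items():
    m.fit(Xtr, ytr)
    results[name] = recall_score(yte, m.predict(Xte))
```

Recall on the minority class, rather than raw accuracy, is the metric that matters under this imbalance; a model predicting "not fraud" everywhere scores 99.75% accuracy and 0 recall.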
Two-phase AI engine for Gomoku (Five-in-a-Row): Phase 1 used Minimax with Alpha-Beta pruning and a custom evaluation function; Phase 2 extended it with self-play and Q-learning for adaptive competitive play.
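The Phase 1 search can be sketched generically: minimax with alpha-beta pruning over an abstract game tree, with the Gomoku-specific part confined to the evaluation and move-generation callbacks (here replaced by a toy tree):

```python
import math

def alphabeta(node, depth, alpha, beta, maximizing, evaluate, children):
    """Minimax with alpha-beta pruning over an abstract game tree."""
    kids = children(node)
    if depth == 0 or not kids:
        return evaluate(node)
    if maximizing:
        value = -math.inf
        for child in kids:
            value = max(value, alphabeta(child, depth - 1, alpha, beta,
                                         False, evaluate, children))
            alpha = max(alpha, value)
            if alpha >= beta:
                break  # beta cutoff: the opponent will avoid this branch
        return value
    value = math.inf
    for child in kids:
        value = min(value, alphabeta(child, depth - 1, alpha, beta,
                                     True, evaluate, children))
        beta = min(beta, value)
        if alpha >= beta:
            break  # alpha cutoff
    return value

# Toy two-leaf tree standing in for a Gomoku position; the real engine
# plugs in a board evaluation scoring open rows of stones.
tree = {"A": ["B", "C"], "B": [], "C": []}
scores = {"B": 3, "C": 5}
best = alphabeta("A", 2, -math.inf, math.inf, True,
                 lambda n: scores.get(n, 0), lambda n: tree.get(n, []))
# best == 5: the maximizing player takes branch C
```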
LightGBM model predicting patient psychological risk levels from medical and behavioral indicators. Achieved AUC 0.87 on imbalanced healthcare data, with classification reports and confusion-matrix visualizations designed for clinical interpretability.
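The evaluation pattern is standard: fit a gradient-boosted model and report ranking quality via AUC. A sketch using scikit-learn's `GradientBoostingClassifier` as a stand-in for LightGBM, on synthetic imbalanced data rather than the clinical dataset:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced stand-in data; sklearn's GBDT stands in for
# LightGBM, which shares the same fit/predict_proba interface shape.
X, y = make_classification(n_samples=3000, weights=[0.85], random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, stratify=y, random_state=0)

gbdt = GradientBoostingClassifier(random_state=0).fit(Xtr, ytr)
# AUC is computed on predicted probabilities, not hard labels, so it
# stays informative under class imbalance.
auc = roc_auc_score(yte, gbdt.predict_proba(Xte)[:, 1])
```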
Production-ready fraud detection pipeline on 800K+ Google Pay transactions. Reduced false positives by 8.7% and improved recall by 2.87% through feature engineering, careful model selection, and a tuned GBDT. Built during a volunteer engagement with Google engineers.