CV
A PDF version is available on request — reach out at cyf0906@stanford.edu.
Education
- M.S. in Data Science, Johns Hopkins University, Jan 2026 – Dec 2026 (expected)
  - GPA: 4.0 / 4.0
- M.A. in German Studies — Specialization in Computational NLP & Textual Data Science, Stanford University, Sep 2024 – Jun 2026 (expected)
  - GPA: 3.6 / 4.0
  - Research focus: NLP, topic modeling, and multimodal analysis applied to historical German textual and visual corpora.
- B.A. in German Studies, University of California, Santa Barbara, Sep 2019 – Sep 2023
  - GPA: 3.74 / 4.0
  - Related coursework: Applied Mathematics, Data Science
Skills
- Programming Languages
  - Python, R, SQL, Java, HTML & CSS
- Machine Learning & Statistics
  - Tree-based models (Random Forest, Gradient Boosting, LightGBM)
  - Regression models (Linear, Logistic, regularized)
  - Support Vector Machines
  - Unsupervised methods (k-means, PCA, BERTopic, LDA)
  - Model evaluation, cross-validation, hyperparameter tuning
  - Feature engineering, MLOps
- Data Science & Visualization
  - pandas, NumPy, SciPy, scikit-learn
  - seaborn, Matplotlib, Plotly, ggplot2 (R)
  - Apache Parquet
  - R Markdown, Jupyter
- Cloud & Tools
  - AWS (S3, SageMaker AI, EC2)
  - Google Cloud BigQuery
  - IBM Watson Studio
  - GitHub, Flask, Tableau
- Specialties
  - Natural Language Processing (spaCy, BERTopic, MALLET)
  - Multimodal analysis (vision-language models, OCR pipelines)
  - Fraud detection and risk modeling
  - Class-imbalance methods (resampling, reweighting, threshold tuning)
- Languages
  - Mandarin Chinese — Native
  - English — Fluent (US high school through master's)
  - German — Fluent
Certifications
- IBM Data Science Professional Certificate — Fall 2025
- Google Data Analytics Professional Certificate — Summer 2025
Selected Projects
A curated list of data science, machine learning, and computational humanities projects. Full writeups, GitHub links, and additional detail live on the Projects page.
- Mosaik Reverse Prompting + Multimodal Analysis (ongoing, Stanford)
  - Multimodal pipeline (YOLO panel segmentation + OCR + VLM/LLM reverse prompting) processing 26K–64K mid-20th-century German comic panels into structured semantic representations.
- Credit Tier Prediction ML Pipeline (Volunteer Data Scientist, Microsoft DS Team, Summer 2025)
  - End-to-end pipeline on 4M+ financial records. Improved accuracy by 4% and reduced false positives by 6% through ensemble modeling and feature engineering.
- Transaction Fraud Detection ML Pipeline (Volunteer Data Scientist, Google, Winter 2023)
  - Production-ready GBDT model on 800K+ Google Pay transactions. Reduced false positives by 8.7% and improved recall by 2.87%.
- E-Commerce Profitability Analytics — Amazon India (Spring 2025 – Oct 2025)
  - R Markdown project on 1.2M+ transactions: clustering, regression, and tree-based models for profitability drivers and pricing optimization. Code on GitHub.
- German Romanticism Corpus Analysis (1780–1840) (Sep 2025 – Dec 2025, Stanford)
  - Custom corpus + NLP pipeline (spaCy, BERTopic, MALLET, R tidytext) tracking thematic evolution across early, transitional, and late Romanticism.
- Credit Fraud Detection (Mar 2025 – Jun 2025)
  - Class-imbalance study (500 fraud cases in 200K+ records) benchmarking Random Forest, SGDClassifier, and MLP with PCA, downsampling, and reweighting.
- SpaceX Falcon 9 Launch Analysis (Nov 2025)
  - EDA, ML models (Logistic Regression, Decision Tree, SVM), and interactive Plotly/Dash dashboards on Falcon 9 launch data. Code on GitHub.
- Gomoku AI Engine — Minimax + RL (Sep 2024 – Mar 2025, Stanford)
  - Two-phase project: a Minimax baseline with Alpha-Beta pruning, then a self-play and Q-learning extension. Code on GitHub.
- Psychological Impact Prediction (Jun 2024 – Sep 2024)
  - LightGBM classifier for psychological risk prediction on imbalanced healthcare data. AUC 0.87, with interpretable visualizations for clinicians.
Contact
- Email: cyf0906@stanford.edu
- Phone: 626-536-6862
- LinkedIn: peter-cheng-a2436a331
- GitHub: PeterCheng0906