German Romanticism Corpus Analysis (1780–1840) — NLP + Topic Modeling
Role: Lead researcher · Affiliation: Stanford University · Period: Sep 2025 – Dec 2025
Overview
Designed and built a custom German Romanticism textual corpus (1780–1840), then ran a full NLP pipeline on it to study how aesthetic, philosophical, and poetic priorities shifted across the early, transitional, and late phases of the movement.
What I did
- Built the corpus from scratch: collected and cataloged literary works across five historical-linguistic “buckets” spanning 1780–1840, then ran OCR extraction and text cleaning to standardize heterogeneous historical sources.
- Applied multiple NLP frameworks — spaCy and BERTopic in Python, MALLET for LDA-based topic modeling, and R (tidytext) for cross-validation — to conduct topic modeling, topic-flow analysis, and cross-period comparison.
- Constructed document–topic matrices, visualized topic transitions over time, and traced thematic evolution across the three Romantic periods (early, transitional, and late).
- Analyzed linguistic patterns, recurring motifs, and conceptual clusters to explore how philosophical and poetic priorities shifted — bridging humanities-based interpretation with quantitative ML methods.
- Delivered an analytic report (GRC Analytic Report) documenting the methodology, results, and humanistic interpretation in one integrated document.
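To make the topic-flow step concrete, here is a minimal sketch of how a document–topic matrix can be rolled up into per-period topic shares and period-to-period shifts. It is an illustration under stated assumptions, not the project's actual code: the function names, the period labels as strings, and the toy numbers are all hypothetical, and in practice the matrix would come from a topic model's output (for example MALLET or BERTopic) rather than being typed in by hand.

```python
from collections import defaultdict

# Hypothetical labels for the three phases; the date boundaries are not assumed here.
PERIODS = ["early", "transitional", "late"]


def mean_topic_share_by_period(doc_topics, doc_periods):
    """Average each topic's share over the documents in each period.

    doc_topics:  document-topic matrix, one row of topic proportions per document.
    doc_periods: period label for each document, aligned with doc_topics.
    """
    if len(doc_topics) != len(doc_periods):
        raise ValueError("doc_topics and doc_periods must have the same length")
    sums = {}
    counts = defaultdict(int)
    for row, period in zip(doc_topics, doc_periods):
        if period not in sums:
            sums[period] = [0.0] * len(row)
        for k, share in enumerate(row):
            sums[period][k] += share
        counts[period] += 1
    return {p: [s / counts[p] for s in sums[p]] for p in sums}


def topic_flow(period_means, ordered_periods):
    """Change in mean topic share between each pair of consecutive periods."""
    flow = {}
    for prev, nxt in zip(ordered_periods, ordered_periods[1:]):
        flow[(prev, nxt)] = [
            later - earlier
            for earlier, later in zip(period_means[prev], period_means[nxt])
        ]
    return flow


# Toy values for illustration only; these are NOT results from the project.
doc_topics = [
    [0.75, 0.25],  # document 1: mostly topic 0
    [0.25, 0.75],  # document 2: mostly topic 1
    [0.50, 0.50],  # document 3: evenly split
    [0.25, 0.75],  # document 4: mostly topic 1
]
doc_periods = ["early", "early", "transitional", "late"]

period_means = mean_topic_share_by_period(doc_topics, doc_periods)
flow = topic_flow(period_means, PERIODS)

for (prev, nxt), deltas in flow.items():
    print(f"{prev} -> {nxt}: {deltas}")
```

Averaging per period before differencing keeps periods with many documents from dominating the comparison, which matters when the corpus is unevenly distributed across time.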
Why it matters
This is the kind of project that only works if you take both sides seriously — the literary-historical context that decides which periods, authors, and texts go into the corpus, and the computational rigor that makes the topic models reproducible and the trends defensible. The output is a quantitative view of Romanticism’s thematic arc that complements, rather than replaces, traditional close reading.
Tech stack
Python · spaCy · BERTopic · MALLET · R · tidytext · OCR pipelines · LDA · topic modeling · Jupyter