Mosaik Reverse Prompting + Multimodal Analysis

Role: Lead researcher · Affiliation: Stanford University · Status: Ongoing

Overview

Mosaik is a multimodal data pipeline built to process tens of thousands of mid-20th-century German comic panels at scale. The goal: turn unstructured visual-textual material into structured, queryable semantic representations that humanities researchers can actually analyze.

What I built

  • Panel segmentation using YOLO-based object detection paired with Magic3Comic to break full pages into individual panels and text boxes.
  • Text extraction through OCR, followed by NLP cleanup and topic modeling (BERTopic, MALLET) to handle noisy historical typography and German-language idioms.
  • Reverse prompting workflow leveraging vision-language and large language models to generate structured semantic annotations — scene type, character interactions, focal objects, perspective — directly from panel images.
  • Scaled inference across 26K–64K panels, with annotation reliability checks (inter-annotator agreement via Cohen’s κ) to validate model output against human reviewers.
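The OCR cleanup step can be sketched as a small normalization pass. This is an illustrative example only: the substitution rules below (ligature fixes, rejoining words hyphenated across line breaks) are common for historical German scans, not the project's actual ruleset, and `clean_ocr_text` is a hypothetical helper name.

```python
import re

# Illustrative fixes for mid-century German type; the long s ("ſ") and
# typographic ligatures frequently confuse OCR on older material.
LIGATURES = {"ﬁ": "fi", "ﬂ": "fl", "ſ": "s"}

def clean_ocr_text(raw: str) -> str:
    text = raw
    for bad, good in LIGATURES.items():
        text = text.replace(bad, good)
    # Rejoin words hyphenated across line breaks ("Aben-\nteuer" -> "Abenteuer")
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse remaining line breaks and whitespace runs inside a text box
    text = re.sub(r"\s+", " ", text).strip()
    return text

print(clean_ocr_text("Das Aben-\nteuer   beginnt"))  # → Das Abenteuer beginnt
```

Rule-based normalization like this runs before any topic modeling, so downstream tools see one token per word rather than hyphenated fragments.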
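Reverse prompting only pays off if the model's free-form responses are forced into a fixed schema. A minimal sketch of that validation step, assuming the VLM is instructed to answer in JSON: the field names below mirror the annotation dimensions listed above (scene type, character interactions, focal objects, perspective) but are hypothetical, as is `parse_annotation`.

```python
import json
from typing import Optional

# Hypothetical schema; one record per panel.
REQUIRED_FIELDS = {"scene_type", "characters", "focal_objects", "perspective"}

def parse_annotation(vlm_output: str) -> Optional[dict]:
    """Parse one VLM response into a structured record; reject malformed output."""
    try:
        record = json.loads(vlm_output)
    except json.JSONDecodeError:
        return None  # model ignored the JSON instruction
    if not isinstance(record, dict) or not REQUIRED_FIELDS.issubset(record):
        return None  # drop incomplete annotations rather than guess
    return record

raw = ('{"scene_type": "street", "characters": ["A", "B"], '
       '"focal_objects": ["car"], "perspective": "eye-level"}')
record = parse_annotation(raw)
print(record["scene_type"])  # → street
```

Rejected records can be re-queued for a retry prompt or routed to a human reviewer, which is what makes the pipeline's output queryable rather than a pile of free text.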
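The reliability check compares model labels against a human-coded sample using Cohen's κ, which corrects raw percent agreement for the agreement two raters would reach by chance. A self-contained sketch (the label values are invented for illustration):

```python
from collections import Counter

def cohens_kappa(rater_a: list, rater_b: list) -> float:
    """Cohen's kappa for two label sequences of equal length."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    # Observed agreement: fraction of items labeled identically
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: sum over labels of the product of marginal frequencies
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

model = ["indoor", "street", "street", "indoor", "street"]
human = ["indoor", "street", "indoor", "indoor", "street"]
print(round(cohens_kappa(model, human), 2))  # → 0.62
```

In practice a library implementation (e.g. scikit-learn's `cohen_kappa_score`) does the same arithmetic; the point is that κ, not raw accuracy, is the number that justifies the "rivals human annotators" claim.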

Why it matters

The pipeline lets humanities researchers ask quantitative questions about visual narrative — how shot composition shifts across decades, which character archetypes recur, how “Otherness” is depicted across publishers — that were previously impossible without manually coding every panel. It’s also a working example of how VLM/LLM systems, paired with the right validation infrastructure, can produce labels that rival human annotation.

Tech stack

YOLO · Magic3Comic · BERTopic · MALLET · OCR · Vision-language models · Python · Jupyter