Mosaik Reverse Prompting + Multimodal Analysis

Role: Lead researcher · Affiliation: Stanford University · Status: Ongoing

Overview

Mosaik is a multimodal data pipeline built to process tens of thousands of mid-20th-century German comic panels at scale. The goal: turn unstructured visual-textual material into structured, queryable semantic representations that humanities researchers can actually analyze.

What I built

  • Panel segmentation using YOLO-based object detection paired with Magic3Comic to break full pages into individual panels and text boxes.
  • Text extraction through OCR, followed by NLP cleanup and topic modeling (BERTopic, MALLET) to handle noisy historical typography and German-language idioms.
  • Reverse prompting workflow leveraging vision-language and large language models to generate structured semantic annotations — scene type, character interactions, focal objects, perspective — directly from panel images.
  • Scaled inference across 26K–64K panels, with annotation reliability checks (inter-annotator agreement via Cohen’s κ) to validate model output against human reviewers.
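The OCR cleanup step can be sketched as a small normalization pass. This is an illustrative example only: the substitution rules below (ligature fixes, rejoining words hyphenated across line breaks) are common for historical German scans, not the project's actual ruleset, and `clean_ocr_text` is a hypothetical helper name.

```python
import re

# Illustrative fixes for mid-century German type; the long s ("ſ") and
# typographic ligatures frequently confuse OCR on older material.
LIGATURES = {"ﬁ": "fi", "ﬂ": "fl", "ſ": "s"}

def clean_ocr_text(raw: str) -> str:
    text = raw
    for bad, good in LIGATURES.items():
        text = text.replace(bad, good)
    # Rejoin words hyphenated across line breaks ("Aben-\nteuer" -> "Abenteuer")
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse remaining line breaks and whitespace runs inside a text box
    text = re.sub(r"\s+", " ", text).strip()
    return text

print(clean_ocr_text("Das Aben-\nteuer   beginnt"))  # → Das Abenteuer beginnt
```

Rule-based normalization like this runs before any topic modeling, so downstream tools see one token per word rather than hyphenated fragments.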
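Reverse prompting only pays off if the model's free-form responses are forced into a fixed schema. A minimal sketch of that validation step, assuming the VLM is instructed to answer in JSON: the field names below mirror the annotation dimensions listed above (scene type, character interactions, focal objects, perspective) but are hypothetical, as is `parse_annotation`.

```python
import json
from typing import Optional

# Hypothetical schema; one record per panel.
REQUIRED_FIELDS = {"scene_type", "characters", "focal_objects", "perspective"}

def parse_annotation(vlm_output: str) -> Optional[dict]:
    """Parse one VLM response into a structured record; reject malformed output."""
    try:
        record = json.loads(vlm_output)
    except json.JSONDecodeError:
        return None  # model ignored the JSON instruction
    if not isinstance(record, dict) or not REQUIRED_FIELDS.issubset(record):
        return None  # drop incomplete annotations rather than guess
    return record

raw = ('{"scene_type": "street", "characters": ["A", "B"], '
       '"focal_objects": ["car"], "perspective": "eye-level"}')
record = parse_annotation(raw)
print(record["scene_type"])  # → street
```

Rejected records can be re-queued for a retry prompt or routed to a human reviewer, which is what makes the pipeline's output queryable rather than a pile of free text.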
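The reliability check compares model labels against a human-coded sample using Cohen's κ, which corrects raw percent agreement for the agreement two raters would reach by chance. A self-contained sketch (the label values are invented for illustration):

```python
from collections import Counter

def cohens_kappa(rater_a: list, rater_b: list) -> float:
    """Cohen's kappa for two label sequences of equal length."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    # Observed agreement: fraction of items labeled identically
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: sum over labels of the product of marginal frequencies
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

model = ["indoor", "street", "street", "indoor", "street"]
human = ["indoor", "street", "indoor", "indoor", "street"]
print(round(cohens_kappa(model, human), 2))  # → 0.62
```

In practice a library implementation (e.g. scikit-learn's `cohen_kappa_score`) does the same arithmetic; the point is that κ, not raw accuracy, is the number that justifies the "rivals human annotators" claim.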

Why it matters

The pipeline lets humanities researchers ask quantitative questions about visual narrative — how shot composition shifts across decades, which character archetypes recur, how “Otherness” is depicted across publishers — that were previously impossible without manually coding every panel. It’s also a working example of how VLM/LLM systems, paired with the right validation infrastructure, can produce labels that rival human annotation.

Tech stack

YOLO · Magic3Comic · BERTopic · MALLET · OCR · Vision-language models · Python · Jupyter