From first prototype to expert-validated detection
Building a system to detect Talmud stories wasn't straightforward. Each version taught us something new about what makes a story—and what doesn't.
Simple AI prompt asking "Is this a story?" for each page. Found at most one story per page.
Many pages have multiple stories. Need to detect them all.
Enhanced to find multiple stories per page. Started tracking story boundaries.
Character offsets were unreliable. Boundaries often cut mid-sentence.
Used first/last words as boundary markers instead of character offsets.
Hebrew/English alignment was imperfect. Needed a better approach.
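Word anchors survive the retranslations and whitespace changes that silently invalidate stored character offsets. A minimal sketch of the idea (function name and page representation are illustrative, not the project's actual code):

```python
def locate_story(page_text: str, first_words: str, last_words: str):
    """Resolve a story's span from its opening and closing words.

    Unlike raw character offsets, word anchors still resolve after the
    underlying text is re-fetched or renormalized.
    """
    start = page_text.find(first_words)
    if start == -1:
        return None  # opening anchor not found on this page
    end = page_text.find(last_words, start)
    if end == -1:
        return None  # closing anchor missing or out of order
    return start, end + len(last_words)
```

If either anchor fails to match, the span is flagged for re-detection rather than silently truncated.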
Preserved Sefaria's segment structure—Hebrew and English perfectly aligned. Added Hebrew narrative markers.
Ready for expert validation to test real-world accuracy.
Jeffrey Rubenstein reviewed 30 stories. Results: 50% false positive rate. Identified specific patterns causing errors.
The AI mistook legal attribution ("Rabbi X quotes Rabbi Y") for characters interacting in a story.
Replaced percentage scores with clear categories: YES, HIGH, LOW, NOT_A_STORY. Added 6 explicit criteria.
Clearer decisions help experts focus review on uncertain cases.
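The categorical verdicts might be modeled as below; the names mirror the categories above, while the review filter is a hypothetical illustration of how clear categories let experts focus on the uncertain middle:

```python
from enum import Enum

class Verdict(Enum):
    YES = "YES"                  # clearly a story
    HIGH = "HIGH"                # likely a story
    LOW = "LOW"                  # possible story, weak evidence
    NOT_A_STORY = "NOT_A_STORY"  # disqualified by explicit criteria

def needs_expert_review(verdict: Verdict) -> bool:
    # Experts spend their time on uncertain calls, not the clear ones.
    return verdict in (Verdict.HIGH, Verdict.LOW)
```

A percentage score like "72%" invites false precision; a category forces the model to commit to a decision an expert can agree or disagree with.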
Addressed all false positive patterns from expert review. Added self-check mechanism. Stricter causality and transformation tests. Jeff reviewed 128 passages: 86% accuracy.
Expert validation at scale (128 passages) revealed 20 errors and 12 refinement opportunities.
Complete overhaul based on Jeff's 128-passage review. Anonymous characters now count fully. Refined what constitutes a "narrative event" vs legal activity. Cross-page story merging. Borderline story calibration. 82.7% accuracy.
Each expert correction taught the system something new.
Instead of one monolithic AI call, decomposed detection into stages. Event triage classifies every segment as narrative, verbal, deliberation, or habitual—then skips pages without enough narrative events. 87.4% accuracy (up from 82.7% in v6).
Triage alone accounted for +4.7% accuracy. By never showing the detector purely legal pages, most false positives disappeared.
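A sketch of the triage gate; the threshold, classifier interface, and names are assumptions for illustration, not the published pipeline:

```python
from enum import Enum
from typing import Callable

class EventType(Enum):
    NARRATIVE = "narrative"        # something happens in story time
    VERBAL = "verbal"              # a speech act only
    DELIBERATION = "deliberation"  # legal back-and-forth
    HABITUAL = "habitual"          # customary or repeated practice

MIN_NARRATIVE_EVENTS = 2  # assumed threshold

def page_passes_triage(segments: list[str],
                       classify: Callable[[str], EventType]) -> bool:
    """Stage 1: label every segment, then only forward pages with enough
    narrative events to the (more expensive) story detector."""
    labels = [classify(seg) for seg in segments]
    narrative = sum(1 for label in labels if label is EventType.NARRATIVE)
    return narrative >= MIN_NARRATIVE_EVENTS
```

Because purely legal pages never reach the detector, a whole class of false positives is eliminated before the detector can make them.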
Upgrading to Google's Gemini 3 Flash model jumped accuracy from 87.4% to 92.1% (117/127 passages). Surprisingly, the smaller Flash model outperformed the larger Pro model, and it was accurate enough that post-processing rules added nothing; it got the same cases right on its own.
Flash > Pro. The Pro model was too conservative, missing borderline stories that Flash correctly identified.
Ran the winning pipeline on Ketubot pages 61-112—104 pages the system had never seen. Found 98 stories (35 YES, 20 HIGH, 43 LOW). The entire tractate is now analyzed: 222 pages, ~153 stories total. Awaiting Jeff's expert review.
Generalization test passed. The pipeline found stories at a similar rate to the training set without any tuning.
Jeff reviewed 113 stories on pages 61-112 and confirmed 96.5% accuracy (109/113). His main feedback: stories cut off at page boundaries. v8 fixes this with improved cross-page merging—16 stories now span two pages (up from 7). Total stories refined from 113 to 103. Built a focused delta review UI so Jeff only needs to re-review the 49 changed stories.
Expert feedback drives iteration. Jeff's single observation about page boundaries led to a new merge pipeline and a 172-story tractate-wide catalog.
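One way to sketch the cross-page merge heuristic: a story ending at the last segment of one page, followed by a story starting at the first segment of the next page, is really one story split by an arbitrary boundary. The data model and rule here are assumptions for illustration:

```python
from dataclasses import dataclass

@dataclass
class Story:
    start: tuple  # (page, segment) where the story begins
    end: tuple    # (page, segment) where it ends
    segments_on_end_page: int  # total segments on the page where it ends

def merge_cross_page(stories: list) -> list:
    """Merge stories split by a page break: one ending at the last segment
    of page N, the next starting at segment 0 of page N + 1.

    Note: mutates the merged-into Story objects; fine for a sketch.
    """
    out = []
    for story in sorted(stories, key=lambda s: s.start):
        if out:
            prev = out[-1]
            ends_at_page_edge = prev.end[1] == prev.segments_on_end_page - 1
            starts_next_page = story.start == (prev.end[0] + 1, 0)
            if ends_at_page_edge and starts_next_page:
                prev.end = story.end
                prev.segments_on_end_page = story.segments_on_end_page
                continue
        out.append(story)
    return out
```

This is also why merging *reduces* the story count: two detected fragments collapse into one continuous narrative.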
Version 4.1's expert review transformed our approach:
"The AI was confusing attribution with characters. When it sees 'Rabbi X said that Rabbi Y said...', it thought there was a story with characters, but it's just legal attribution."
This single insight led to a new disqualifier that caught 53 false positives in our test set.
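As a crude illustration of the disqualifier (the real check is a criterion in the model prompt, not a regex; the pattern here is hypothetical and English-only):

```python
import re

# Attribution chains quote a ruling; they do not put rabbis on stage.
ATTRIBUTION_CHAIN = re.compile(
    r"Rabbi \w+ (?:said|says) (?:that|in the name of) Rabbi \w+"
)

def looks_like_attribution(text: str) -> bool:
    """Flag 'Rabbi X said that Rabbi Y said ...' constructions so the
    named figures inside them are not counted as interacting characters."""
    return ATTRIBUTION_CHAIN.search(text) is not None
```

A named rabbi becomes a character only when he *does* something in story time, not when his ruling is transmitted.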
"Stories can be about unnamed people. The anonymous character does not weaken the confidence." ... "The page is a totally arbitrary marker and should be ignored when identifying the boundaries of stories."
Expert validation doesn't just measure accuracy—it teaches the system to be better. Each round of feedback reveals patterns we couldn't see from the code alone.