From first prototype to expert-validated detection
Building a system to detect Talmud stories wasn't straightforward. Each version taught us something new about what makes a story—and what doesn't.
Simple AI prompt asking "Is this a story?" for each page. Found at most one story per page.
Many pages have multiple stories. Need to detect them all.
Enhanced to find multiple stories per page. Started tracking story boundaries.
Character offsets were unreliable. Boundaries often cut mid-sentence.
Used first/last words as boundary markers instead of character offsets.
Hebrew/English alignment was imperfect. Needed a better approach.
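Word anchors survive the retranslations and whitespace changes that silently invalidate stored character offsets. A minimal sketch of the idea (function name and page representation are illustrative, not the project's actual code):

```python
def locate_story(page_text: str, first_words: str, last_words: str):
    """Resolve a story's span from its opening and closing words.

    Unlike raw character offsets, word anchors still resolve after the
    underlying text is re-fetched or renormalized.
    """
    start = page_text.find(first_words)
    if start == -1:
        return None  # opening anchor not found on this page
    end = page_text.find(last_words, start)
    if end == -1:
        return None  # closing anchor missing or out of order
    return start, end + len(last_words)
```

If either anchor fails to match, the span is flagged for re-detection rather than silently truncated.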
Preserved Sefaria's segment structure—Hebrew and English perfectly aligned. Added Hebrew narrative markers.
Ready for expert validation to test real-world accuracy.
Jeffrey Rubenstein reviewed 30 stories. Results: 50% false positive rate. Identified specific patterns causing errors.
The AI mistook legal attribution ("Rabbi X quotes Rabbi Y") for characters interacting in a story.
Replaced percentage scores with clear categories: YES, HIGH, LOW, NOT_A_STORY. Added 6 explicit criteria.
Clearer decisions help experts focus review on uncertain cases.
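The categorical verdicts might be modeled as below; the names mirror the categories above, while the review filter is a hypothetical illustration of how clear categories let experts focus on the uncertain middle:

```python
from enum import Enum

class Verdict(Enum):
    YES = "YES"                  # clearly a story
    HIGH = "HIGH"                # likely a story
    LOW = "LOW"                  # possible story, weak evidence
    NOT_A_STORY = "NOT_A_STORY"  # disqualified by explicit criteria

def needs_expert_review(verdict: Verdict) -> bool:
    # Experts spend their time on uncertain calls, not the clear ones.
    return verdict in (Verdict.HIGH, Verdict.LOW)
```

A percentage score like "72%" invites false precision; a category forces the model to commit to a decision an expert can agree or disagree with.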
Addressed all false positive patterns from expert review. Added self-check mechanism. Stricter causality and transformation tests. Jeff reviewed 128 passages: 86% accuracy.
Expert validation at scale (128 passages) revealed 20 errors and 12 refinement opportunities.
Complete overhaul based on Jeff's 128-passage review. Anonymous characters now count fully. Refined what constitutes a "narrative event" vs legal activity. Cross-page story merging. Borderline story calibration. 82.7% accuracy.
Each expert correction taught the system something new.
Instead of one monolithic AI call, decomposed detection into stages. Event triage classifies every segment as narrative, verbal, deliberation, or habitual—then skips pages without enough narrative events. 87.4% accuracy (up from 82.7% in v6).
Triage alone accounted for +4.7% accuracy. By never showing the detector purely legal pages, most false positives disappeared.
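A sketch of the triage gate; the threshold, classifier interface, and names are assumptions for illustration, not the published pipeline:

```python
from enum import Enum
from typing import Callable

class EventType(Enum):
    NARRATIVE = "narrative"        # something happens in story time
    VERBAL = "verbal"              # a speech act only
    DELIBERATION = "deliberation"  # legal back-and-forth
    HABITUAL = "habitual"          # customary or repeated practice

MIN_NARRATIVE_EVENTS = 2  # assumed threshold

def page_passes_triage(segments: list[str],
                       classify: Callable[[str], EventType]) -> bool:
    """Stage 1: label every segment, then only forward pages with enough
    narrative events to the (more expensive) story detector."""
    labels = [classify(seg) for seg in segments]
    narrative = sum(1 for label in labels if label is EventType.NARRATIVE)
    return narrative >= MIN_NARRATIVE_EVENTS
```

Because purely legal pages never reach the detector, a whole class of false positives is eliminated before the detector can make them.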
Upgrading to Google's Gemini 3 Flash model jumped accuracy from 87.4% to 92.1% (117/127 passages). Surprisingly, the smaller Flash model outperformed the larger Pro model, and it was accurate enough that post-processing rules added nothing; it got the same cases right on its own.
Flash > Pro. The Pro model was too conservative, missing borderline stories that Flash correctly identified.
Ran the winning pipeline on Ketubot pages 61-112—104 pages the system had never seen. Found 98 stories (35 YES, 20 HIGH, 43 LOW). The entire tractate is now analyzed: 222 pages, ~153 stories total. Awaiting Jeff's expert review.
Generalization test passed. The pipeline found stories at a similar rate to the training set without any tuning.
Jeff reviewed 113 stories on pages 61-112 and confirmed 96.5% accuracy (109/113). His main feedback: stories cut off at page boundaries. v8 fixes this with improved cross-page merging—16 stories now span two pages (up from 7). Total stories refined from 113 to 103. Built a focused delta review UI so Jeff only needs to re-review the 49 changed stories.
Expert feedback drives iteration. Jeff's single observation about page boundaries led to a new merge pipeline and a 172-story tractate-wide catalog.
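One way to sketch the cross-page merge heuristic: a story ending at the last segment of one page, followed by a story starting at the first segment of the next page, is really one story split by an arbitrary boundary. The data model and rule here are assumptions for illustration:

```python
from dataclasses import dataclass

@dataclass
class Story:
    start: tuple  # (page, segment) where the story begins
    end: tuple    # (page, segment) where it ends
    segments_on_end_page: int  # total segments on the page where it ends

def merge_cross_page(stories: list) -> list:
    """Merge stories split by a page break: one ending at the last segment
    of page N, the next starting at segment 0 of page N + 1.

    Note: mutates the merged-into Story objects; fine for a sketch.
    """
    out = []
    for story in sorted(stories, key=lambda s: s.start):
        if out:
            prev = out[-1]
            ends_at_page_edge = prev.end[1] == prev.segments_on_end_page - 1
            starts_next_page = story.start == (prev.end[0] + 1, 0)
            if ends_at_page_edge and starts_next_page:
                prev.end = story.end
                prev.segments_on_end_page = story.segments_on_end_page
                continue
        out.append(story)
    return out
```

This is also why merging *reduces* the story count: two detected fragments collapse into one continuous narrative.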
Version 4.1's expert review transformed our approach:
"The AI was confusing attribution with characters. When it sees 'Rabbi X said that Rabbi Y said...', it thought there was a story with characters, but it's just legal attribution."
This single insight led to a new disqualifier that caught 53 false positives in our test set.
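As a crude illustration of the disqualifier (the real check is a criterion in the model prompt, not a regex; the pattern here is hypothetical and English-only):

```python
import re

# Attribution chains quote a ruling; they do not put rabbis on stage.
ATTRIBUTION_CHAIN = re.compile(
    r"Rabbi \w+ (?:said|says) (?:that|in the name of) Rabbi \w+"
)

def looks_like_attribution(text: str) -> bool:
    """Flag 'Rabbi X said that Rabbi Y said ...' constructions so the
    named figures inside them are not counted as interacting characters."""
    return ATTRIBUTION_CHAIN.search(text) is not None
```

A named rabbi becomes a character only when he *does* something in story time, not when his ruling is transmitted.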
"Stories can be about unnamed people. The anonymous character does not weaken the confidence." ... "The page is a totally arbitrary marker and should be ignored when identifying the boundaries of stories."
Expert validation doesn't just measure accuracy—it teaches the system to be better. Each round of feedback reveals patterns we couldn't see from the code alone.