Eval · Feedback · MLOps

The Feedback Loop: Building a Self-Improving Tutor

Every 👍/👎 is a training signal. Here's how thousands of ratings become a better system prompt every week.

Priya Nair

22 March 2025

9 min read

The data model

Every response is logged with question, subject, level, prompt version, response text, video id, rating, optional feedback text, and timestamp.
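As a rough illustration, one logged record could be modelled like this. The class name, field names, types, and the 1/0 rating encoding are assumptions made for the sketch; only the list of fields comes from the description above.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class ResponseLog:
    """One tutor response plus the feedback attached to it (hypothetical schema)."""
    question: str
    subject: str
    level: str
    prompt_version: str
    response_text: str
    video_id: Optional[str]        # assumed optional: not every answer links a video
    rating: Optional[int]          # assumed encoding: 1 = thumbs up, 0 = thumbs down, None = unrated
    feedback_text: Optional[str]   # optional free-text feedback
    timestamp: datetime


# Illustrative values only, not real data.
example = ResponseLog(
    question="Why is the sky blue?",
    subject="physics",
    level="beginner",
    prompt_version="v12",
    response_text="Sunlight scatters off air molecules...",
    video_id=None,
    rating=1,
    feedback_text=None,
    timestamp=datetime(2025, 3, 1, tzinfo=timezone.utc),
)
```

Keeping prompt_version on every row is what lets later stages compare prompts against each other.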

Eval pipeline

A nightly job groups feedback by subject × prompt_version. For each group it computes the thumbs-up rate, the average sentiment of the feedback text, and how ratings correlate with the follow-up question rate, which we use as an engagement proxy.
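A minimal sketch of the grouping step, covering only the thumbs-up rate (sentiment and the follow-up correlation are left out). It assumes rows are plain dicts with a 1/0 rating; the function name and row shape are hypothetical.

```python
from collections import defaultdict


def thumbs_up_rate_by_group(rows):
    """Return {(subject, prompt_version): thumbs-up rate} for rated rows.

    Assumes each row is a dict with "subject", "prompt_version" and
    "rating" (1 = up, 0 = down, None = unrated).
    """
    counts = defaultdict(lambda: [0, 0])  # key -> [thumbs_up, total_rated]
    for row in rows:
        if row["rating"] is None:
            continue  # unrated responses do not affect the rate
        key = (row["subject"], row["prompt_version"])
        counts[key][0] += row["rating"]
        counts[key][1] += 1
    return {key: ups / total for key, (ups, total) in counts.items()}


# Illustrative rows only.
sample_rows = [
    {"subject": "math", "prompt_version": "v1", "rating": 1},
    {"subject": "math", "prompt_version": "v1", "rating": 0},
    {"subject": "math", "prompt_version": "v2", "rating": 1},
    {"subject": "physics", "prompt_version": "v1", "rating": None},
]
rates = thumbs_up_rate_by_group(sample_rows)
```

In a production job this would more likely be a SQL GROUP BY or a dataframe aggregation; the dict version just makes the grouping key explicit.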

LLM-as-judge

For every poorly-rated response, we ask a judge model:

Given this question, this answer, and this user feedback, identify the failure mode (factual, depth, tone, format).

We cluster the resulting failure modes. The largest cluster each week becomes a prompt patch proposal.
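The judge step can be sketched as below. The instruction text is the one quoted above; the template layout, the helper names, and the parsing rule are assumptions, and the call to the judge model itself is deliberately omitted. Clustering is simplified here to counting labels, which is a stand-in for whatever clustering the real pipeline uses.

```python
import re
from collections import Counter

FAILURE_MODES = ("factual", "depth", "tone", "format")

# Instruction taken from the post; the field layout below it is assumed.
JUDGE_TEMPLATE = (
    "Given this question, this answer, and this user feedback, "
    "identify the failure mode (factual, depth, tone, format).\n\n"
    "Question: {question}\n"
    "Answer: {answer}\n"
    "User feedback: {feedback}\n"
)


def build_judge_prompt(question, answer, feedback):
    """Fill the judge template; missing feedback is shown as '(none)'."""
    return JUDGE_TEMPLATE.format(
        question=question, answer=answer, feedback=feedback or "(none)"
    )


def parse_failure_mode(judge_output):
    """Return the first whole word in the judge output that names a known mode."""
    for word in re.findall(r"[a-z]+", judge_output.lower()):
        if word in FAILURE_MODES:
            return word
    return None  # the judge did not name a recognised mode


def top_failure_mode(labels):
    """Return (mode, count) for the most frequent label, or None if there are none."""
    counts = Counter(label for label in labels if label is not None)
    return counts.most_common(1)[0] if counts else None


prompt = build_judge_prompt("What is a derivative?", "It is a slope.", "too shallow")
weekly_top = top_failure_mode(["depth", "tone", "depth", None])
```

Matching on whole words avoids false hits such as "format" inside "information".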

A/B test before promote

New prompt versions are served to 5% of traffic for 48 hours, with the current prompt as control. A version is promoted if its thumbs-up rate beats the control's by at least 0.5 percentage points.
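The promotion rule might look like the following. This sketch reads the 0.5% threshold as an absolute difference of 0.5 percentage points, which is an assumption, and it does no significance testing; the function and parameter names are hypothetical.

```python
def should_promote(control_ups, control_total,
                   candidate_ups, candidate_total,
                   min_lift=0.005):
    """Promote if the candidate's thumbs-up rate beats control by min_lift.

    min_lift=0.005 is the threshold read as 0.5 percentage points (assumed).
    Returns False when either arm has no rated responses.
    """
    if control_total == 0 or candidate_total == 0:
        return False
    control_rate = control_ups / control_total
    candidate_rate = candidate_ups / candidate_total
    return candidate_rate - control_rate >= min_lift


# Illustrative counts only: 80% control vs 82% candidate.
promote = should_promote(800, 1000, 41, 50)
```

With only 5% of traffic over 48 hours the candidate arm can be small, so a real gate would probably also require a minimum sample size or a confidence interval before promoting.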

Why this matters

Most LLM products ship a prompt and forget it. We treat prompts like code: versioned, evaluated, and continuously improved.
