The Feedback Loop: Building a Self-Improving Tutor
Every 👍/👎 is a training signal. Here's how thousands of ratings become a better system prompt every week.
The data model
Every response is logged with question, subject, level, prompt version, response text, video id, rating, optional feedback text, and timestamp.
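As a rough illustration, the logged record could be modeled like this. The field names follow the list above; the class name, the types, the rating encoding (1 for 👍, 0 for 👎), and the JSON-lines serialization are assumptions made for the sketch, not the production schema.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class FeedbackRecord:
    """One logged tutor response (hypothetical schema)."""
    question: str
    subject: str
    level: str
    prompt_version: str
    response_text: str
    video_id: Optional[str]       # assumed nullable: not every response has a video
    rating: Optional[int]         # assumed encoding: 1 = thumbs up, 0 = thumbs down
    feedback_text: Optional[str]  # optional free-text feedback
    timestamp: datetime


def to_json_line(record: FeedbackRecord) -> str:
    """Serialize a record as one JSON line, with an ISO 8601 timestamp."""
    row = asdict(record)
    row["timestamp"] = record.timestamp.isoformat()
    return json.dumps(row, ensure_ascii=False)
```

Storing prompt_version on every row is what lets later stages compare versions without joining against a separate deployment log.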
Eval pipeline
A nightly job groups feedback by subject × prompt_version and, for each group, computes the thumbs-up rate, the average sentiment, and how these correlate with the follow-up question rate (a proxy for engagement).
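A minimal sketch of that nightly aggregation, using only the standard library. It assumes each row is a dict with a 0/1 rating, a precomputed sentiment score, and a 0/1 followed_up flag. The last two fields are hypothetical and not part of the data model above. For simplicity the sketch correlates only the rating with the follow-up flag.

```python
from collections import defaultdict
from math import sqrt
from typing import Optional


def pearson(xs, ys) -> Optional[float]:
    """Pearson correlation, or None when it is undefined."""
    n = len(xs)
    if n < 2:
        return None
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None  # one of the series is constant
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sqrt(sxx * syy)


def nightly_metrics(rows):
    """Group rows by (subject, prompt_version) and compute per-group metrics."""
    groups = defaultdict(list)
    for row in rows:
        groups[(row["subject"], row["prompt_version"])].append(row)

    report = {}
    for key, items in groups.items():
        ratings = [r["rating"] for r in items]
        followups = [r["followed_up"] for r in items]
        report[key] = {
            "n": len(items),
            "thumbs_up_rate": sum(ratings) / len(ratings),
            "avg_sentiment": sum(r["sentiment"] for r in items) / len(items),
            "rating_followup_corr": pearson(ratings, followups),
        }
    return report
```

Keeping n in the report matters: a thumbs-up rate computed from a handful of ratings should not be read the same way as one computed from thousands.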
LLM-as-judge
For every poorly-rated response, we ask a judge model:
"Given this question, this answer, and this user feedback, identify the failure mode (factual, depth, tone, format)."
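One way to wire up that judge call is sketched below. The instruction text is taken from the prompt above; the surrounding template, the one-word reply format, the "unknown" fallback, and the call_model callable (a stand-in for whatever client talks to the judge model) are all assumptions.

```python
from typing import Callable, Optional

FAILURE_MODES = ("factual", "depth", "tone", "format")

# Hypothetical template: only the first sentence comes from the article.
JUDGE_TEMPLATE = (
    "Given this question, this answer, and this user feedback, "
    "identify the failure mode (factual, depth, tone, format).\n\n"
    "Question: {question}\n"
    "Answer: {answer}\n"
    "User feedback: {feedback}\n\n"
    "Reply with exactly one word from the list."
)


def build_judge_prompt(question: str, answer: str, feedback: Optional[str]) -> str:
    """Fill the template; feedback text is optional in the data model."""
    return JUDGE_TEMPLATE.format(
        question=question, answer=answer, feedback=feedback or "(none)"
    )


def parse_failure_mode(reply: str) -> str:
    """Return the first known failure mode in the reply, else 'unknown'."""
    words = reply.lower().replace(",", " ").replace(".", " ").split()
    for word in words:
        if word in FAILURE_MODES:
            return word
    return "unknown"


def judge(call_model: Callable[[str], str], question: str, answer: str,
          feedback: Optional[str]) -> str:
    """Ask the judge model and normalize its reply to one label."""
    return parse_failure_mode(call_model(build_judge_prompt(question, answer, feedback)))
```

Constraining the judge to a closed set of labels, with an explicit fallback for anything else, keeps the downstream grouping step simple.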
Failure modes are clustered. The top cluster each week becomes a prompt patch proposal.
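The article does not say how the clustering is done. As a deliberately simple stand-in, the sketch below counts judged failures by (subject, failure_mode) and picks the largest group; a real system might cluster on embeddings of the feedback text instead. The function name and the dict keys are hypothetical.

```python
from collections import Counter


def top_cluster(judged):
    """Return the most common (subject, failure_mode) pair, or None if empty.

    `judged` is a list of dicts with 'subject' and 'failure_mode' keys.
    """
    counts = Counter((j["subject"], j["failure_mode"]) for j in judged)
    if not counts:
        return None
    (subject, mode), size = counts.most_common(1)[0]
    return {
        "subject": subject,
        "failure_mode": mode,
        "size": size,
        "share": size / len(judged),  # fraction of all judged failures
    }
```

The returned share gives a rough sense of how much of the week's negative feedback a single prompt patch could address.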
A/B test before promote
New prompt versions are shadow-served to 5% of traffic for 48 hours. A version is promoted if its thumbs-up rate is at least the control's plus 0.5%.
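The traffic split and promotion rule might look roughly like this. The sketch assumes users are assigned to an arm by hashing their id, and it reads the 0.5% threshold as 0.5 percentage points of absolute lift. Both are assumptions, as are the function names.

```python
import hashlib

TREATMENT_SHARE = 0.05  # 5% of traffic sees the candidate prompt
MIN_LIFT = 0.005        # assumption: 0.5 percentage points, absolute


def assign_arm(user_id: str, experiment: str) -> str:
    """Deterministically map a user to 'candidate' or 'control'."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return "candidate" if bucket < TREATMENT_SHARE else "control"


def should_promote(control_up: int, control_n: int,
                   cand_up: int, cand_n: int,
                   min_lift: float = MIN_LIFT) -> bool:
    """Promote when the candidate's thumbs-up rate beats control by min_lift."""
    if control_n == 0 or cand_n == 0:
        return False  # no ratings in one arm: nothing to compare
    return cand_up / cand_n >= control_up / control_n + min_lift
```

Hashing on the experiment name plus the user id keeps each user in the same arm for the whole 48-hour window while reshuffling assignments between experiments. With only 5% of traffic in the candidate arm, a significance check on top of the raw threshold may be worth considering.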
Why this matters
Most LLM products ship a prompt and forget it. We treat prompts like code: versioned, evaluated, and continuously improved.