AI, NLP, Architecture

The Brain Behind the Bot: Subject Classification with Small Language Models

How we route every question to the right subject in under 200ms using a cheap classifier — and the prompt design tricks that made it 96% accurate.

Aarav Sinha

12 February 2025

8 min read

Why classification is the first step

Before we can give a learner a good answer, we need to know what they're actually asking about. "Explain pointers" lives in a different universe of context than "Explain pointers in React" — and treating both the same way wastes tokens and confuses the model.

Choosing a model

We benchmarked four candidates against 2,000 hand-labeled CS questions:

| Model | Accuracy | p50 latency | Cost / 1k |
|---|---|---|---|
| GPT-4o | 98.1% | 1.4s | $5.00 |
| Claude Haiku | 96.4% | 380ms | $0.25 |
| GPT-4o-mini | 95.9% | 410ms | $0.15 |
| Llama 3 8B | 92.1% | 220ms | $0.04 |

Haiku won. The 1.7-point accuracy gap to GPT-4o vanished entirely once we added a confidence threshold that triggers a fallback to the bigger model.
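For readers who want to run a similar comparison, here is a minimal sketch of a benchmark harness that measures accuracy and p50 latency for one candidate. The `benchmark` function and the `classify` callable are hypothetical names for illustration, not our actual tooling; `classify` is assumed to wrap a single model call and return a subject label.

```python
import statistics
import time
from typing import Callable, Iterable


def benchmark(
    classify: Callable[[str], str],
    labeled: Iterable[tuple[str, str]],
) -> dict[str, float]:
    """Return accuracy and median (p50) latency in ms for one classifier.

    `labeled` yields (question, expected_subject) pairs, e.g. a hand-labeled
    eval set. `classify` is a hypothetical wrapper around one model call.
    """
    correct = 0
    latencies_ms: list[float] = []
    for question, expected in labeled:
        start = time.perf_counter()
        predicted = classify(question)
        latencies_ms.append((time.perf_counter() - start) * 1000)
        correct += predicted == expected
    if not latencies_ms:
        raise ValueError("labeled set is empty")
    return {
        "accuracy": correct / len(latencies_ms),
        "p50_ms": statistics.median(latencies_ms),
    }
```

Running the same labeled set through each candidate's wrapper gives directly comparable rows; cost would be tracked separately from provider billing.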

Prompt design

```text
You are a subject classifier for a CS tutoring platform.
Allowed subjects: DSA, OS, Networks, DBMS, OOP, SystemDesign, Other.
Reply with strict JSON: { "subject": ..., "confidence": 0.0-1.0 }
```

No chain-of-thought. No examples. Just a strict JSON shape and a closed enum.
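A strict shape and a closed enum only pay off if the caller enforces them. The sketch below shows one way to validate a reply against the contract in the prompt; `parse_classification` and `ALLOWED_SUBJECTS` are illustrative names, and returning `None` on any violation is an assumption about how a caller might want to handle bad output.

```python
import json
from typing import Optional

# The closed enum from the prompt above.
ALLOWED_SUBJECTS = {"DSA", "OS", "Networks", "DBMS", "OOP", "SystemDesign", "Other"}


def parse_classification(raw: str) -> Optional[tuple[str, float]]:
    """Return (subject, confidence) if the reply honors the contract, else None."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None

    subject = data.get("subject")
    confidence = data.get("confidence")

    # Reject anything outside the closed enum, including non-string values.
    if not isinstance(subject, str) or subject not in ALLOWED_SUBJECTS:
        return None
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        return None
    if not 0.0 <= confidence <= 1.0:
        return None
    return subject, float(confidence)
```

A reply that fails validation can be treated the same way as a low-confidence one, so malformed output never reaches the learner.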

Confidence routing

≥ 0.85 → use Haiku's answer
0.55 to < 0.85 → re-classify with GPT-4o-mini
< 0.55 → tag as Other, log for offline review
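The three bands above can be sketched as a small routing function. This is an illustration under stated assumptions, not our production code: `route`, `primary`, and `fallback` are hypothetical names, each classifier is assumed to return a `(subject, confidence)` pair, and a score of exactly 0.85 is assumed to fall in the top band.

```python
from typing import Callable

HIGH_THRESHOLD = 0.85  # at or above: trust the primary classifier
LOW_THRESHOLD = 0.55   # below: give up and send to offline review

# Hypothetical classifier signature: question -> (subject, confidence).
Classifier = Callable[[str], tuple[str, float]]


def route(question: str, primary: Classifier, fallback: Classifier) -> tuple[str, str]:
    """Return (subject, source), where source is 'primary', 'fallback', or 'review'."""
    subject, confidence = primary(question)
    if confidence >= HIGH_THRESHOLD:
        return subject, "primary"
    if confidence >= LOW_THRESHOLD:
        # Middle band: ask the second model and take its label.
        fallback_subject, _ = fallback(question)
        return fallback_subject, "fallback"
    # Low band: tag as Other; the caller logs the question for offline review.
    return "Other", "review"
```

Returning the source alongside the subject makes it easy to count how often each band fires, which is the number to watch when tuning the thresholds.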

The interesting engineering is not in picking the model — it's in the confidence routing, the eval set, and the offline review loop.
