April 23, 2026 · 6 min read

Grading the grader: qwen 1.5b vs 7b

I swapped the local AI that grades my chatbot for a bigger one and compared the two across 55 real conversations. Here’s what a smaller model misses, and where Claude Haiku and Sonnet would sit on the same ladder.

Yield: Two graders running side-by-side on every new Ask Goose conversation, with both scores flowing into a private Google Sheet daily for head-to-head comparison
Difficulty: Intermediate (local LLM swap via Ollama, Python cron job, Supabase schema with a graded_by column, Google Sheets service account, rubric design)
Total Cook Time: ~2 hours. One-line model swap, 45 minutes of auto-backfill across 55 historical sessions, ~90 minutes updating the sheet sync to track both graders

Ingredients

Why I needed a grader

Ask Goose is the chatbot on this site. It answers visitor questions about my projects, writing, and background. It’s been live for a few weeks and has handled 55 conversations so far. Not a huge number, but enough that I wasn’t going to read them all myself, so I needed something to grade them for me.

The first grader didn’t really work. The second one does. Comparing them taught me something about how AI models behave as they get bigger, and it’s the kind of thing that’s easy to miss if you only look at the final score.

How grading works

Every time someone chats with Goose, the conversation lands in a database. A small AI on my home server pulls new ones every 15 minutes and scores each one from 1 to 5 on five things: Accuracy, Helpful, Tone, Brevity, and Fallback.

Scores flow into a Google Sheet so I can scroll through and spot problems. The grader itself runs on a free open-source model instead of Claude. It’s cheaper, it’s private, and it runs on hardware I already own. I started with qwen2.5:1.5b, the smallest model in the list above.
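A minimal sketch of what one grading call could look like, assuming Ollama's local REST endpoint (`POST /api/generate` on the default port 11434) with JSON output. The metric names come from the charts in this post. The rubric wording, the lowercase keys, and the helper names `build_payload`, `parse_scores`, and `grade` are illustrative assumptions, not the actual grader code.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint
METRICS = ["accuracy", "helpful", "tone", "brevity", "fallback"]  # assumed key names


def build_payload(model: str, transcript: str) -> dict:
    """Build a non-streaming request asking for one JSON object of 1-5 scores."""
    prompt = (
        "Score the chatbot conversation below from 1 to 5 on each of: "
        + ", ".join(METRICS)
        + ". Reply with a single JSON object using exactly those keys.\n\n"
        + transcript
    )
    return {"model": model, "prompt": prompt, "stream": False, "format": "json"}


def parse_scores(raw: str) -> dict:
    """Validate the model's reply; raise ValueError rather than store bad scores."""
    data = json.loads(raw)  # JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    scores = {}
    for metric in METRICS:
        value = data.get(metric)
        if isinstance(value, bool) or not isinstance(value, int) or not 1 <= value <= 5:
            raise ValueError(f"bad score for {metric}: {value!r}")
        scores[metric] = value
    return scores


def grade(model: str, transcript: str) -> dict:
    """Send one conversation to the local model and return validated scores."""
    body = json.dumps(build_payload(model, transcript)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=120) as response:
        reply = json.load(response)
    return parse_scores(reply["response"])
```

Rejecting out-of-range or missing scores up front keeps a confused model from quietly writing junk into the sheet. A call such as `grade("qwen2.5:1.5b", transcript)` needs a running Ollama server with that model pulled.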

Why the first grader wasn’t working

After a few weeks of data, the scores looked suspicious. 82% of conversations got a perfect Accuracy score. Helpful was almost as high, at 80%.

Nothing’s that good. A grader that hands out 5s to almost everything isn’t really reading the answers. It’s just nodding along.

Fallback was worse. The rubric asked: “if Goose didn’t know, did it redirect?” Most conversations scored a 1 or a 2, but Goose actually redirects correctly most of the time. The grader had read the question backwards. It was scoring “did fallback happen?” instead of “was fallback used appropriately?” A right answer with no need to redirect was getting punished as if the bot had refused to help.

A grader that can’t use the full scale and reads its own rubric backwards isn’t measuring anything. It’s guessing.

Swapping in a bigger model

I switched the grader to qwen2.5:7b. Same model family, but roughly four and a half times the size. My grader was already built to re-score any conversation the current model hadn’t seen, so flipping the name in the config auto-triggered a backfill. Forty-five minutes later, all 55 existing conversations had a second score from the bigger model, stored next to the old one. Same questions, same rubric, different brain.
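The auto-backfill falls out of one rule: a conversation is pending whenever it has no score from the currently configured model. Here is a rough in-memory sketch of that rule. The `Grade` record and `sessions_needing_grade` helper are hypothetical names, and a real version would query the database's graded_by column instead of Python lists.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Grade:
    session_id: str
    graded_by: str  # model name, e.g. "qwen2.5:7b"


def sessions_needing_grade(session_ids, grades, current_model):
    """Return sessions with no grade from current_model, preserving input order."""
    already_graded = {g.session_id for g in grades if g.graded_by == current_model}
    return [sid for sid in session_ids if sid not in already_graded]


sessions = ["s1", "s2", "s3"]
grades = [Grade("s1", "qwen2.5:1.5b"), Grade("s2", "qwen2.5:1.5b")]

# Before the swap, only the ungraded session is pending.
print(sessions_needing_grade(sessions, grades, "qwen2.5:1.5b"))  # ['s3']
# After changing the configured model name, every session is pending again.
print(sessions_needing_grade(sessions, grades, "qwen2.5:7b"))  # ['s1', 's2', 's3']
```

Because old grades are never overwritten, both models' scores end up stored side by side for the same session.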

What changed

Here are the average scores across all 55 conversations.

Average score per metric (55 conversations, 1–5 scale), qwen-1.5b (old) vs. qwen-7b (new):
Accuracy: 4.62 old, 3.89 new
Helpful: 4.75 old, 4.23 new
Tone: 4.09 old, 4.44 new
Brevity: 3.42 old, 3.35 new
Fallback: 2.20 old, 3.89 new
Overall: 4.04 old, 3.94 new
The old grader rated almost everything a 4 or 5. It wasn’t really using the scale. The new grader spread scores out. Look at Fallback: the two graders land on opposite ends of the scale for the same conversations.

The headline number is Accuracy: the average dropped from 4.62 to 3.89. That’s not the new model being mean. That’s the old model giving almost everything a 5.

Fallback tells the clearest story.

Fallback scores: same conversations, opposite ends of the scale. Number of conversations per score, qwen-1.5b (old) vs. qwen-7b (new):
Score 1: 15 old, 1 new
Score 2: 30 old, 5 new
Score 3: 2 old, 5 new
Score 4: 0 old, 32 new
Score 5: 8 old, 12 new
The old grader gave 45 out of 55 conversations a 1 or 2. The new grader gave 44 of them a 4 or 5. Same conversations, same rubric — they just disagree about what the question even means. This is what it looks like when a smaller model misreads the instructions.

The smaller model crowded almost every conversation into the bottom of the scale. The bigger model did the opposite. They’re not disagreeing about the answers. They’re disagreeing about the question.
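As a quick arithmetic check, the Fallback averages and the "45 of 55" and "44 of 55" figures can be recomputed from the per-score counts. The counts below are read off the Fallback chart's bar labels, with the old grader's unlabeled Score 4 bar taken as zero so that each grader's counts sum to 55. The helper names are just for illustration.

```python
# Score counts read off the Fallback chart (score -> number of conversations).
fallback_old = {1: 15, 2: 30, 3: 2, 4: 0, 5: 8}   # qwen-1.5b
fallback_new = {1: 1, 2: 5, 3: 5, 4: 32, 5: 12}   # qwen-7b


def mean_score(counts):
    """Weighted mean of a score histogram."""
    total = sum(counts.values())
    return sum(score * n for score, n in counts.items()) / total


def share(counts, scores):
    """Fraction of conversations whose score is in `scores`."""
    return sum(counts[s] for s in scores) / sum(counts.values())


print(round(mean_score(fallback_old), 2))     # 2.2
print(round(mean_score(fallback_new), 2))     # 3.89
print(round(share(fallback_old, (1, 2)), 2))  # 0.82
print(round(share(fallback_new, (4, 5)), 2))  # 0.8
```

The two means match the 2.20 and 3.89 averages reported for Fallback, which is a useful sanity check that the histogram and the averages describe the same data.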

Accuracy shows the same problem in a different shape.

Accuracy scores: the old grader only knew how to say ‘5’. Number of conversations per score, qwen-1.5b (old) vs. qwen-7b (new):
Score 1: 0 old, 0 new
Score 2: 3 old, 1 new
Score 3: 5 old, 12 new
Score 4: 2 old, 34 new
Score 5: 45 old, 8 new
45 out of 55 conversations got a perfect score from the old grader. The new one only gave a 5 to eight. That’s not the new grader being tough. It’s the old one not knowing how to tell ‘close enough’ from ‘nailed it.’
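One way to catch this kind of flat scoring automatically is to measure how much of the distribution sits on a single score. The sketch below uses counts read off the Accuracy chart. The 0.8 threshold and the names `modal_share` and `looks_rubber_stamped` are assumptions for illustration, not part of the actual grader.

```python
# Score counts read off the Accuracy chart (score -> number of conversations).
accuracy_old = {1: 0, 2: 3, 3: 5, 4: 2, 5: 45}   # qwen-1.5b
accuracy_new = {1: 0, 2: 1, 3: 12, 4: 34, 5: 8}  # qwen-7b


def modal_share(counts):
    """Fraction of conversations that received the single most common score."""
    return max(counts.values()) / sum(counts.values())


def looks_rubber_stamped(counts, threshold=0.8):
    """Flag a grader whose most common score covers at least `threshold` of cases."""
    return modal_share(counts) >= threshold


print(round(modal_share(accuracy_old), 2), looks_rubber_stamped(accuracy_old))  # 0.82 True
print(round(modal_share(accuracy_new), 2), looks_rubber_stamped(accuracy_new))  # 0.62 False
```

A check like this says nothing about whether the scores are right, only whether the grader is using more than one point on the scale.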

Why smaller models fail at this

Four things went wrong with the small grader, and they’re common patterns you’ll see in any small AI.

Small models like to say yes. They’re trained to be friendly and agreeable. Ask one to score something 1 to 5 and it defaults to “probably a 5.” That’s what made 82% of Accuracy scores perfect.

They skip the “if” in questions. “If it didn’t know, did it redirect?” is two ideas stitched together. Small models drop the “if” and answer the easier half. That’s why Fallback got flipped.

They only read the start of a long conversation. One of the test sessions had four user questions. The small grader summarized only the first. Everything after got quietly ignored.

They hedge when they’re unsure. The old grader averaged 183 characters of commentary per conversation. The new one averaged 84. Wordy wasn’t smarter, just less sure.

The net effect: the small model was making thumbs-up-or-down judgments wearing the costume of a 1-to-5 score. The bigger model actually uses the scale.

Where Claude would sit

Models stack roughly like this, with each step a jump in capability of maybe 3 to 5 times (my rough estimate, not a benchmark): 1.5B → 7B → Claude Haiku → Claude Sonnet → Claude Opus.

If I swapped the local model for a Claude API call:

Claude Haiku 4.5 would handle the “if” easily and probably drop Accuracy a bit further. Not from being tough, but from catching small factual slips that the 7B misses. It might notice “Goose said Jose studied X, but the site says Y” where the 7B waves that by. Cost: about half a cent per conversation. Re-grading all 55 conversations costs less than 30 cents.

Claude Sonnet 4.6 is near the ceiling for this task. It would effectively be the answer key, the tiebreaker when the cheaper graders disagree. Cost: about three cents per conversation, so about $1.65 for a full re-grade. Roughly 6× more than Haiku, still trivial for this volume.
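The re-grade totals are simple multiplication, sketched here with the per-conversation estimates quoted above (half a cent and three cents). These are the post's ballpark figures, not measured API bills, and `regrade_cost` is a hypothetical helper.

```python
from decimal import Decimal

CONVERSATIONS = 55

# Estimated dollars per graded conversation, taken from the text above.
COST_PER_CONVERSATION = {
    "haiku": Decimal("0.005"),
    "sonnet": Decimal("0.03"),
}


def regrade_cost(grader, conversations=CONVERSATIONS):
    """Total estimated cost in dollars to grade every conversation once."""
    return COST_PER_CONVERSATION[grader] * conversations


print(regrade_cost("haiku"))   # 0.275
print(regrade_cost("sonnet"))  # 1.65
```

`Decimal` keeps the cents exact, which avoids floating-point noise in totals this small.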

The plan

Keep the free 7B as the always-on grader. Add a Claude Haiku score next to it for pennies per month, an independent second opinion. Run Sonnet once as a tiebreaker pass to calibrate. The sheet is already built to show each grader in its own column, so the whole ladder stays visible.
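A tiebreaker pass only needs the conversations where the two cheaper graders actually disagree. Here is a minimal sketch of that filter, assuming each grader's scores arrive as a dict of metric to score. The two-point gap and the example scores are made up for illustration.

```python
def needs_tiebreak(scores_a, scores_b, min_gap=2):
    """Return the metrics where two graders differ by at least min_gap points."""
    return sorted(
        metric
        for metric in scores_a.keys() & scores_b.keys()
        if abs(scores_a[metric] - scores_b[metric]) >= min_gap
    )


# Made-up scores for one conversation from two graders.
local_7b = {"accuracy": 4, "fallback": 5, "tone": 4}
haiku = {"accuracy": 2, "fallback": 5, "tone": 5}

print(needs_tiebreak(local_7b, haiku))  # ['accuracy']
```

Sending only the flagged conversations to the most expensive model keeps the calibration pass cheap even as volume grows.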

Three lessons from this

Your grader has to be as smart as what it’s grading. Goose runs on Claude Haiku. A small open-source model can’t really judge Haiku’s work. It can only check if the answer looks like an answer.

Flat scores mean the grader isn’t grading. If 80% of your scores are the same number, the grader is rubber-stamping. Real measurement has shape to it.

Keep the old scores when you upgrade. You can’t see progress without a reference. Every future grader (Haiku, Sonnet, whatever comes next) gets its own column right next to the ones before it. The gap between them is the actual story.
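The "one column per grader" layout amounts to a pivot: one stored row per (conversation, grader) pair becomes one sheet row per conversation. A rough sketch follows, with hypothetical field names (`session_id`, `graded_by`) and made-up scores. The real sheet sync may be shaped differently.

```python
def to_sheet_rows(grades, graders, metric="accuracy"):
    """Pivot per-grader score records into one row per session."""
    by_session = {}
    for g in grades:
        by_session.setdefault(g["session_id"], {})[g["graded_by"]] = g[metric]
    header = ["session_id"] + [f"{metric} ({name})" for name in graders]
    rows = [
        [sid] + [scores.get(name, "") for name in graders]  # blank if not graded yet
        for sid, scores in sorted(by_session.items())
    ]
    return [header] + rows


# Made-up records: s2 has not been graded by the newer model yet.
grades = [
    {"session_id": "s1", "graded_by": "qwen2.5:1.5b", "accuracy": 5},
    {"session_id": "s1", "graded_by": "qwen2.5:7b", "accuracy": 3},
    {"session_id": "s2", "graded_by": "qwen2.5:1.5b", "accuracy": 5},
]

for row in to_sheet_rows(grades, ["qwen2.5:1.5b", "qwen2.5:7b"]):
    print(row)
```

Adding a future grader is then just one more name in the `graders` list, and older columns stay untouched.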
