Grading the grader: qwen 1.5b vs 7b

I swapped the local AI that grades my chatbot for a bigger one and compared the two across 55 real conversations. Here’s what a smaller model misses, and where Claude Haiku and Sonnet would sit on the same ladder.

Yield: Two graders running side-by-side on every new Ask Goose conversation, with both scores flowing into a private Google Sheet daily for head-to-head comparison

Difficulty: Intermediate (local LLM swap via Ollama, Python cron job, Supabase schema with a graded_by column, Google Sheets service account, rubric design)

Total Cook Time: ~2 hours. One-line model swap, 45 minutes of auto-backfill across 55 historical sessions, ~90 minutes updating the sheet sync to track both graders

Ingredients

Ollama — runs local LLMs on my home server (free, open source)
qwen2.5:1.5b — the old grader, about 1.5 billion parameters, fits in a couple gigs of memory (free)
qwen2.5:7b-instruct-q4_K_M — the new grader, about 7 billion parameters, 4-bit quantized, ~5 GB resident (free)
Supabase — stores chat sessions, messages, and grades across three tables (free tier)
Python + Ollama REST API — the grading script that pulls ungraded conversations every 15 minutes (free)
cron — schedules the grader every 15 min and the Google Sheet sync at 5:30 am daily (free)
gspread + Google Sheets API — pushes both graders’ scores into a Google Sheet with model-named columns (free)
Alienware home server with 16 GB RAM — the hardware that made the 7B upgrade possible (already had it)

Why I needed a grader

Ask Goose is the chatbot on this site. It answers visitor questions about my projects, writing, and background. It’s been live for a few weeks and has handled 55 conversations so far. Not a huge number, but enough that I wasn’t going to read them all myself, so I needed something to grade them for me.

The first grader didn’t really work. The second one does. Comparing them taught me something about how AI models behave as they get bigger, and it’s the kind of thing that’s easy to miss if you only look at the final score.

How grading works

Every time someone chats with Goose, the conversation lands in a database. A small AI on my home server pulls new ones every 15 minutes and scores each one on five things:

Accuracy — did it actually answer the question?
Helpful — was the answer useful?
Tone — did it sound like Goose?
Brevity — did it get to the point?
Fallback — if it didn’t know, did it point the person to /contact?
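Here's a rough sketch of how that rubric might be turned into a prompt and a validated set of scores. This isn't my exact script: the key names, the prompt wording, and the JSON reply format are all assumptions for illustration.

```python
import json

# The five rubric dimensions from the list above. The key names and the
# JSON reply format are assumptions for this sketch.
RUBRIC = {
    "accuracy": "Did it actually answer the question?",
    "helpful": "Was the answer useful?",
    "tone": "Did it sound like Goose?",
    "brevity": "Did it get to the point?",
    "fallback": "If it didn't know, did it point the person to /contact?",
}


def build_prompt(transcript: str) -> str:
    """Assemble a grading prompt that asks for one 1-5 integer per dimension."""
    lines = [f"- {name}: {question}" for name, question in RUBRIC.items()]
    return (
        "Score this conversation from 1 to 5 on each dimension.\n"
        + "\n".join(lines)
        + "\nReply with a JSON object using exactly these keys.\n\n"
        + transcript
    )


def parse_scores(reply: str) -> dict[str, int]:
    """Parse the model's JSON reply and reject anything off the 1-5 scale."""
    data = json.loads(reply)
    scores = {}
    for name in RUBRIC:
        value = data[name]
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or not 1 <= value <= 5:
            raise ValueError(f"{name} must be an integer from 1 to 5, got {value!r}")
        scores[name] = value
    return scores
```

Validating the reply matters more with small models, since they are the ones most likely to return a missing key or a score off the scale.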

Scores flow into a Google Sheet so I can scroll through and spot problems. The grader itself runs on a free open-source model instead of Claude. It’s cheaper, it’s private, and it runs on hardware I already own. I started with qwen2.5:1.5b, the smallest model in the list above.
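For anyone who hasn't called Ollama from Python, a minimal sketch of the request looks something like this. It uses Ollama's `/api/generate` endpoint on the default local port; the function names and the timeout are my own choices for the example, not lifted from the real grader.

```python
import json
import urllib.request

# Ollama's default local endpoint; change the host if your server differs.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_payload(model: str, prompt: str) -> dict:
    """Request body for Ollama's /api/generate endpoint."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,   # return one JSON object instead of a token stream
        "format": "json",  # ask Ollama to constrain the output to valid JSON
    }


def grade(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the prompt to the local model and return its raw text reply."""
    body = json.dumps(build_payload(model, prompt)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["response"]
```

Calling `grade("qwen2.5:1.5b", prompt)` would return the model's reply as a string, which still has to be parsed and checked before it goes anywhere near the database.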

Why the first grader wasn’t working

After a few weeks of data, the scores looked suspicious. 82% of conversations got a perfect Accuracy score. Helpful was almost as high, at 80%.

Nothing’s that good. A grader that hands out 5s to almost everything isn’t really reading the answers. It’s just nodding along.
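A quick way to catch this kind of nodding along is to check how much of a column is taken up by its single most common score. This is a sketch, not part of the original grader. The sample data matches the 45-of-55 count for perfect Accuracy scores, but how the other 10 scores split is made up.

```python
from collections import Counter


def modal_share(scores: list[int]) -> tuple[int, float]:
    """Return the most common score and the fraction of all scores it covers."""
    if not scores:
        raise ValueError("scores must not be empty")
    value, count = Counter(scores).most_common(1)[0]
    return value, count / len(scores)


# Illustrative data shaped like the old Accuracy column: 45 of 55 are fives.
# The split of the remaining 10 scores is invented for the example.
accuracy = [5] * 45 + [4] * 7 + [3] * 3
value, share = modal_share(accuracy)
print(f"most common score: {value}, share: {share:.0%}")  # share prints as 82%
```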

Fallback was worse. The rubric asked: “if Goose didn’t know, did it redirect?” Most conversations scored a 1 or a 2, but Goose actually redirects correctly most of the time. The grader had read the question backwards. It was scoring “did fallback happen?” instead of “was fallback used appropriately?” A right answer with no need to redirect was getting punished as if the bot had refused to help.

A grader that can’t use the full scale and reads its own rubric backwards isn’t measuring anything. It’s guessing.

Swapping in a bigger model

I switched the grader to qwen2.5:7b-instruct-q4_K_M. Same model family, but between four and five times the parameter count. My grader was already built to re-score any conversation the current model hadn’t seen, so flipping the name in the config auto-triggered a backfill. Forty-five minutes later, all 55 existing conversations had a second score from the bigger model, stored next to the old one. Same questions, same rubric, different brain.
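The "re-score anything the current model hasn't seen" check can be sketched in a few lines. Plain Python lists stand in for the Supabase rows here. The `graded_by` column is the real one from the schema; the `session_id` field name and the function name are assumptions for the example.

```python
def sessions_needing_grade(session_ids, grades, current_model):
    """Return the session ids with no grade row from current_model, in order."""
    already_graded = {
        g["session_id"] for g in grades if g["graded_by"] == current_model
    }
    return [sid for sid in session_ids if sid not in already_graded]


# Every session already has a grade from the old model, none from the new one.
session_ids = ["s1", "s2", "s3"]
grades = [
    {"session_id": "s1", "graded_by": "qwen2.5:1.5b"},
    {"session_id": "s2", "graded_by": "qwen2.5:1.5b"},
    {"session_id": "s3", "graded_by": "qwen2.5:1.5b"},
]

# Changing the model name makes every old session eligible again.
print(sessions_needing_grade(session_ids, grades, "qwen2.5:7b"))
```

Because the check is keyed on the model name rather than on "has any grade", the old scores are never overwritten. A new model just adds rows.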

What changed

Here are the average scores across all 55 conversations.

The old grader rated almost everything a 4 or 5. It wasn’t really using the scale. The new grader spread scores out. Look at Fallback: the two graders land on opposite ends of the scale for the same conversations.

The headline number is Accuracy: the average dropped from 4.62 to 3.89. That’s not the new model being mean. That’s the old model giving almost everything a 5.

Fallback tells the clearest story.

The old grader gave 45 out of 55 conversations a 1 or 2. The new grader gave 44 of them a 4 or 5. Same conversations, same rubric — they just disagree about what the question even means. This is what it looks like when a smaller model misreads the instructions.

The smaller model crowded almost every conversation into the bottom of the scale. The bigger model did the opposite. They’re not disagreeing about the answers. They’re disagreeing about the question.
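To put a number on that kind of disagreement, one option is to count the conversations where one grader lands at the bottom of the scale and the other at the top. This is a sketch with invented scores for six conversations, not the real Fallback data. The thresholds (2 and below is low, 4 and above is high) mirror the buckets used above.

```python
def count_flips(old_scores, new_scores, low=2, high=4):
    """Count conversations one grader scored <= low and the other >= high."""
    if len(old_scores) != len(new_scores):
        raise ValueError("both graders must score the same conversations")
    return sum(
        1
        for old, new in zip(old_scores, new_scores)
        if (old <= low and new >= high) or (new <= low and old >= high)
    )


# Invented scores for six conversations, purely for illustration.
old_fallback = [1, 2, 1, 5, 2, 3]
new_fallback = [5, 4, 5, 5, 2, 4]
print(count_flips(old_fallback, new_fallback))  # 3
```

A high flip count on one dimension, with the other dimensions mostly agreeing, points at the rubric wording for that dimension rather than at the conversations.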

Accuracy shows the same problem in a different shape.

45 out of 55 conversations got a perfect score from the old grader. The new one gave a 5 to only eight. That’s not the new grader being tough. It’s the old one not knowing how to tell “close enough” from “nailed it.”

Why smaller models fail at this

Four things went wrong with the small grader, and they’re common patterns you’ll see in any small AI.

Small models like to say yes. They’re trained to be friendly and agreeable. Ask one to score something 1 to 5 and it defaults to “probably a 5.” That’s what made 82% of Accuracy scores perfect.

They skip the “if” in questions. “If it didn’t know, did it redirect?” is two ideas stitched together. Small models drop the “if” and answer the easier half. That’s why Fallback got flipped.

They only read the start of a long conversation. One of the test sessions had four user questions. The small grader summarized only the first. Everything after got quietly ignored.

They hedge when they’re unsure. The old grader averaged 183 characters of commentary per conversation. The new one averaged 84. Wordy wasn’t smarter, just less sure.

The net effect: the small model was making thumbs-up-or-down judgments wearing the costume of a 1-to-5 score. The bigger model actually uses the scale.

Where Claude would sit

Models stack roughly like this, and my rough, unmeasured guess is that each step is about 3 to 5 times more capable than the last: 1.5B → 7B → Claude Haiku → Claude Sonnet → Claude Opus.

If I swapped the local model for a Claude API call:

Claude Haiku 4.5 would handle the “if” easily and probably drop Accuracy a bit further. Not from being tough, but from catching small factual slips that the 7B misses. It might notice “Goose said Jose studied X, but the site says Y” where the 7B waves it through. Cost: about half a cent per conversation, so re-grading all 55 conversations would cost less than 30 cents.

Claude Sonnet 4.6 is near the ceiling for this task. It would effectively be the answer key, the tiebreaker when the cheaper graders disagree. Cost: about three cents per conversation, so about $1.65 for a full re-grade. That is roughly 6× more than Haiku, still trivial for this volume.
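The arithmetic behind those estimates is simple enough to check in a few lines. The per-conversation figures are the rough guesses from the text, not measured API bills, and they're kept in cents so the numbers stay readable.

```python
CONVERSATIONS = 55

# Rough per-conversation estimates from the text, in US cents.
HAIKU_CENTS = 0.5   # "about half a cent"
SONNET_CENTS = 3.0  # "about three cents"

haiku_total = CONVERSATIONS * HAIKU_CENTS
sonnet_total = CONVERSATIONS * SONNET_CENTS
ratio = SONNET_CENTS / HAIKU_CENTS

print(f"Haiku full re-grade:  {haiku_total:.1f} cents")
print(f"Sonnet full re-grade: {sonnet_total:.1f} cents")
print(f"Sonnet costs {ratio:.0f}x Haiku per conversation")
```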

The plan

Keep the free 7B as the always-on grader. Add a Claude Haiku score next to it for pennies per month, an independent second opinion. Run Sonnet once as a tiebreaker pass to calibrate. The sheet is already built to show each grader in its own column, so the whole ladder stays visible.
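The one-column-per-grader layout comes down to pivoting the grade rows before they're pushed to the sheet. Here's a minimal sketch of that pivot. The field names (`session_id`, `accuracy`) and the `dimension_model` header format are assumptions, not the real sync script.

```python
def build_sheet_rows(grades, models, dimension):
    """Pivot long-format grade rows into one sheet row per session."""
    by_session = {}
    for g in grades:
        by_session.setdefault(g["session_id"], {})[g["graded_by"]] = g[dimension]

    # Header: one column per grader, named after the model.
    rows = [["session_id"] + [f"{dimension}_{model}" for model in models]]
    for session_id in sorted(by_session):
        scores = by_session[session_id]
        # Leave the cell blank when a grader has not scored this session yet.
        rows.append([session_id] + [scores.get(model, "") for model in models])
    return rows


grades = [
    {"session_id": "s1", "graded_by": "qwen2.5:1.5b", "accuracy": 5},
    {"session_id": "s1", "graded_by": "qwen2.5:7b", "accuracy": 4},
    {"session_id": "s2", "graded_by": "qwen2.5:1.5b", "accuracy": 5},
]
for row in build_sheet_rows(grades, ["qwen2.5:1.5b", "qwen2.5:7b"], "accuracy"):
    print(row)
```

With gspread, the resulting rows could be written with something like `worksheet.append_rows(rows)`. Adding Haiku or Sonnet later would just mean adding another name to the `models` list.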

Three lessons from this

Your grader has to be as smart as what it’s grading. Goose runs on Claude Haiku. A small open-source model can’t really judge Haiku’s work. It can only check if the answer looks like an answer.

Flat scores mean the grader isn’t grading. If 80% of your scores are the same number, the grader is rubber-stamping. Real measurement has shape to it.

Keep the old scores when you upgrade. You can’t see progress without a reference. Every future grader (Haiku, Sonnet, whatever comes next) gets its own column right next to the ones before it. The gap between them is the actual story.