How I Built Ask Goose, a RAG Chatbot for My Personal Site
A streaming AI chatbot grounded in real content — 291 embedded chunks, keyword-boosted retrieval, page-aware context, and a floating widget that knows where you are on the site
Ingredients
- Claude Code — terminal-based AI for direct file editing ($200/yr)
- Claude Haiku API — the LLM behind Goose’s responses, streaming via Server-Sent Events (~$0.01/conversation)
- Supabase with pgvector — vector storage, chat sessions, message history (free tier)
- Hugging Face Inference API — query embedding at runtime via all-MiniLM-L6-v2 (free tier)
- @xenova/transformers — local embedding model for build-time chunk vectorization (free)
- Ollama + qwen2.5 — local LLM on the Alienware server for automated QA grading (free)
What Ask Goose Is and Why It’s the Capstone
Ask Goose is a chatbot that answers questions about me, my work, my projects, and the technical content on this site. It doesn’t make things up — every answer is grounded in actual content I’ve written, structured data I’ve curated, and metadata I’ve generated across 19 posts and 6 projects.
Under the hood, it uses a technique called RAG — which stands for Retrieval-Augmented Generation. Here’s the plain-English version: instead of asking an AI to answer your question purely from memory (where it might confidently make something up), the system first searches through everything I’ve actually written to find the most relevant pieces, then hands those pieces to the AI and says “answer based on this.” Think of it like the difference between asking someone a question from memory versus handing them the right page of a book and saying “the answer is in here somewhere.”
This makes Ask Goose fundamentally different from the search bar that already existed on the site. Search gives you a list of links and says “one of these probably has what you need.” Ask Goose reads the content for you and gives you the answer directly — with links to the sources so you can verify or go deeper. The AI doesn’t replace the search — it builds on top of the same vector infrastructure to have an actual conversation about what’s here.
If you’ve been following this series, Ask Goose is the point where everything converges. The search bar gave me a content index. Semantic search gave me vector embeddings and pgvector infrastructure. TL;DR by Goose proved that Claude could summarize my content well. The content pipeline automated tags, related posts, and reading times. Ask Goose takes all of that — the embeddings, the summaries, the structured data — and makes it conversational. It’s the feature that every earlier feature was quietly building toward.
Architecture: How the Pieces Fit
The system has three layers: a content chunking pipeline that runs at build time, a retrieval + generation API that runs at query time, and a frontend that handles streaming display and session persistence.
At build time, a script reads every writing post, splits it by H2 headings into individual sections, and generates a 384-dimensional embedding for each chunk. It does the same for structured data files — resume roles, project descriptions, and prompt library entries. The result: 291 chunks stored in a rag_chunks table in Supabase with vector indexes for fast similarity search.
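The splitting step is simple enough to sketch. This is a minimal version of the H2 splitter, not the site's actual script — the real pipeline reads TSX source and then embeds each chunk with @xenova/transformers, but assuming markdown-style "## " headings, the core logic looks like this:

```typescript
// Minimal sketch of build-time chunking: split a post body on H2 headings.
// Content before the first H2 is grouped under the post title.
interface Chunk {
  heading: string;
  text: string;
}

function splitByH2(body: string, postTitle: string): Chunk[] {
  const chunks: Chunk[] = [];
  let heading = postTitle;
  let buffer: string[] = [];

  const flush = () => {
    const text = buffer.join("\n").trim();
    if (text.length > 0) chunks.push({ heading, text });
    buffer = [];
  };

  for (const line of body.split("\n")) {
    const match = line.match(/^## (.+)$/);
    if (match) {
      flush();           // close out the previous section
      heading = match[1];
    } else {
      buffer.push(line);
    }
  }
  flush(); // don't lose the final section
  return chunks;
}
```

Each chunk's text then goes through the embedding model and gets upserted to Supabase; I've left the embedding call out of the sketch so it runs anywhere.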
At query time, the user’s question hits /api/chat. The API embeds the question via Hugging Face, runs a vector similarity search against the chunks, applies keyword-based forced retrieval for known topics (career, tech stack, projects), injects the current page path for context awareness, and streams the top chunks into Claude Haiku with a system prompt that tells Goose to be concise, cite sources, and redirect to /contact when it doesn’t know.
Build time (left): content is split into chunks and embedded as vectors. Query time (right): each question triggers a similarity search, feeds the best chunks to Claude, and streams the answer back.
🔧 Developer section: Data flow
- Build time: Post TSX → split by H2 → embed with Xenova/MiniLM → upsert to rag_chunks (291 rows)
- Query time: User question → HF embed → match_rag_chunks RPC (pgvector cosine similarity) → keyword boost → page-aware injection → Claude Haiku stream → SSE to browser
- Storage: chat_sessions + chat_messages tables track every conversation for QA grading
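The Supabase hop in the query-time flow can be sketched as a single call to the match_rag_chunks RPC through PostgREST's REST endpoint. The RPC name comes from the flow above; the parameter names and returned shape are assumptions about how such a pgvector function is typically defined, so treat this as a sketch rather than the site's exact code:

```typescript
// Sketch of the query-time similarity search via Supabase's REST RPC endpoint.
// query_embedding / match_count parameter names and the RagChunk shape are
// assumptions, not confirmed from the actual schema.
interface RagChunk {
  source_url: string;
  heading: string;
  content: string;
  similarity: number;
}

async function matchRagChunks(
  queryEmbedding: number[],
  matchCount = 8
): Promise<RagChunk[]> {
  const res = await fetch(
    `${process.env.SUPABASE_URL}/rest/v1/rpc/match_rag_chunks`,
    {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        apikey: process.env.SUPABASE_ANON_KEY ?? "",
        Authorization: `Bearer ${process.env.SUPABASE_ANON_KEY}`,
      },
      body: JSON.stringify({
        query_embedding: queryEmbedding,
        match_count: matchCount,
      }),
    }
  );
  if (!res.ok) throw new Error(`RPC failed: ${res.status}`);
  return res.json();
}
```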
Why Hugging Face and qwen2.5
Two tool choices worth explaining, because they’re not the obvious ones.
Hugging Face for embeddings — when a user types a question, it needs to be converted into a vector (that list of 384 numbers) so the database can compare it to the stored content. The natural instinct is to run that conversion locally on the server, but Vercel’s serverless functions have strict time and memory limits — loading a machine learning model on every request wasn’t reliable. Hugging Face offers a free hosted version of the exact same model I use at build time (all-MiniLM-L6-v2), so the API sends the text to Hugging Face, gets the vector back, and passes it to Supabase. Same model on both sides means the vectors are comparable. Zero cost, no cold-start issues.
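The runtime embedding call is a single HTTP request. This sketch uses the Hugging Face Inference API's feature-extraction pipeline; verify the exact URL and response shape against current HF docs before relying on it:

```typescript
// Sketch of runtime query embedding via the Hugging Face Inference API.
// Same model as build time (all-MiniLM-L6-v2), so vectors are comparable.
const HF_MODEL = "sentence-transformers/all-MiniLM-L6-v2";

async function embedQuery(text: string): Promise<number[]> {
  const res = await fetch(
    `https://api-inference.huggingface.co/pipeline/feature-extraction/${HF_MODEL}`,
    {
      method: "POST",
      headers: {
        Authorization: `Bearer ${process.env.HF_API_KEY}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({ inputs: text }),
    }
  );
  if (!res.ok) throw new Error(`HF embedding failed: ${res.status}`);
  const vector: number[] = await res.json();
  // Sanity check: MiniLM-L6-v2 produces 384-dimensional vectors.
  if (vector.length !== 384) throw new Error("unexpected embedding size");
  return vector;
}
```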
qwen2.5 for QA grading — I wanted a way to automatically grade Goose’s answers without paying for API calls on every evaluation. The solution: run a local LLM on the Alienware server that evaluates batches of 10 conversations and emails me a report. The constraint was hardware — the Alienware has 6GB of RAM and a 1.4GHz dual-core CPU from 2011. Most local models need 8–16GB minimum. Qwen 2.5 at 1.5 billion parameters fits in under 1GB of RAM, follows structured instructions well enough to return valid JSON grades, and runs acceptably on limited hardware. It’s not the smartest model available, but it’s the smartest model that actually fits.
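The grading call itself goes through Ollama's local HTTP API. This is a sketch under assumptions — the grading rubric and prompt are illustrative, not the actual grader — but the /api/generate endpoint and its format: "json" option are standard Ollama:

```typescript
// Sketch of the local QA grader: ask qwen2.5 via Ollama's /api/generate to
// return a structured JSON grade. The rubric and prompt are illustrative.
interface Grade {
  score: number; // 1-5
  reasoning: string;
}

async function gradeAnswer(question: string, answer: string): Promise<Grade> {
  const res = await fetch("http://localhost:11434/api/generate", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: "qwen2.5:1.5b",
      stream: false,
      format: "json", // constrain the model's output to valid JSON
      prompt:
        `Grade this chatbot answer from 1 to 5 for accuracy and helpfulness.\n` +
        `Question: ${question}\nAnswer: ${answer}\n` +
        `Respond as JSON: {"score": <1-5>, "reasoning": "<one sentence>"}`,
    }),
  });
  if (!res.ok) throw new Error(`Ollama failed: ${res.status}`);
  const data = await res.json();
  return JSON.parse(data.response) as Grade;
}
```

Batching ten conversations per run keeps the old hardware comfortable: each grade is one short generation, and nothing needs to stay resident between calls.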
How the Vector Store Made This Significantly Easier
When I built semantic search, I wrote at the end that vector embeddings were “half of a pattern called RAG” and that Ask Goose would be “a matter of wiring the retrieval results into a Claude API call.” That turned out to be exactly right — and exactly incomplete.
The retrieval function, the pgvector index, the embedding model, and the Supabase RPC pattern were all reusable. I didn’t rebuild any of that. But the original content_embeddings table embedded metadata only — title, description, and TLDR summary, roughly 50 words per item. That’s fine for ranking search results but too thin for RAG. When someone asks “how did Jose set up port knocking?” the chatbot needs the actual post content, not a one-line description of it.
So I built a parallel table — rag_chunks — that stores dense, retrievable text chunks split by H2 section boundaries. The original search table stays untouched, and Ask Goose gets its own purpose-built retrieval layer. Same embedding model, same Supabase infrastructure, different granularity.
System Prompt Design Decisions
The system prompt went through several iterations during testing. Three decisions shaped the final version:
Brevity as default. Early responses ran 3–4 paragraphs for simple questions. The fix was explicit: “One paragraph is ideal. Never exceed two short paragraphs. For broad questions, give a tight summary and ask the user what they’d like to dig into.” This turned Goose from an essay writer into a conversationalist.
Honest fallback. When Goose doesn’t have enough context, early versions would hedge vaguely — “Jose probably has thoughts on that.” Now the prompt says to be direct and point to /contact: “I don’t have the details on that, but Jose would — you can reach him at /contact.”
Current question focus. Chat history was causing prior topics to bleed into unrelated answers. A user who asked a joke question and then asked about projects would get a response that started with “I appreciate the question, but I’m not here to play matchmaker” before answering about projects. The fix was two-fold: trim history to just the last exchange (not the last six messages), and add an explicit instruction to always prioritize the current question.
Three Failure Modes and How I Handled Each
1. Resume chunks were invisible to retrieval. The career data was in the database, but queries like “tell me about Jose’s work experience” returned zero results. The vector similarity scores for resume chunks were below 0.12 — essentially noise. MiniLM-L6-v2 can’t bridge the gap between a conversational question and a formal role description. The fix: keyword-based forced retrieval. When the query contains words like “career,” “Goldman,” or “experience,” the system force-includes resume chunks regardless of vector score. I also rewrote the resume data in conversational language (“Jose worked at DoorDash from 2020 to 2023…”) instead of formal bullet points.
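The forced-retrieval logic is a small merge step layered on top of the vector results. This sketch uses an illustrative keyword map, not the site's actual topic list:

```typescript
// Sketch of keyword-based forced retrieval: when the query mentions a known
// topic, force-include that topic's chunks even if their similarity score is
// noise. The keyword map here is illustrative.
interface ScoredChunk {
  id: string;
  topic: string; // e.g. "resume", "projects"
  similarity: number;
}

const FORCED_TOPICS: Record<string, string[]> = {
  resume: ["career", "experience", "goldman", "doordash", "work"],
  projects: ["project", "built", "portfolio"],
};

function applyKeywordBoost(
  query: string,
  vectorResults: ScoredChunk[],
  allChunks: ScoredChunk[]
): ScoredChunk[] {
  const q = query.toLowerCase();
  const forcedTopics = Object.entries(FORCED_TOPICS)
    .filter(([, words]) => words.some((w) => q.includes(w)))
    .map(([topic]) => topic);

  // Pull in matching-topic chunks the vector search missed, ahead of the rest.
  const forced = allChunks.filter(
    (c) =>
      forcedTopics.includes(c.topic) &&
      !vectorResults.some((r) => r.id === c.id)
  );
  return [...forced, ...vectorResults];
}
```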
2. “What is this page for?” returned generic answers. The widget sends the current page path to the API, but retrieval wasn’t using it. Asking “how long did this take to build?” while reading a specific post returned “I don’t have that information.” The answer was literally on the page — in the recipe card. The fix: when currentPage is a writing post URL, force-include all chunks from that post in the retrieval results. Now the chatbot always has the page you’re reading in its context.
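The injection itself is a one-pass filter over the chunk set. Field names here are assumptions about the chunk shape, but the logic matches the fix described above:

```typescript
// Sketch of page-aware context injection: if the user is reading a writing
// post, force-include every chunk from that post ahead of the vector results.
interface PageChunk {
  id: string;
  sourcePath: string; // e.g. "/writing/ask-goose" (illustrative path)
  content: string;
}

function injectCurrentPage(
  currentPage: string | null,
  retrieved: PageChunk[],
  allChunks: PageChunk[]
): PageChunk[] {
  if (!currentPage || !currentPage.startsWith("/writing/")) return retrieved;
  const pageChunks = allChunks.filter(
    (c) =>
      c.sourcePath === currentPage &&
      !retrieved.some((r) => r.id === c.id) // avoid duplicates
  );
  return [...pageChunks, ...retrieved];
}
```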
3. Conversations reset on page navigation. The floating widget re-mounted on every navigation, wiping React state. Users would ask three questions, click to another page, and find a blank chat with the counter reset. The fix: persist messages, session ID, question count, and open/closed state to sessionStorage. The widget restores its full state on mount and saves on every change. Conversations survive navigation but clear when the tab closes — the right boundary for a casual chat widget.
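The persistence layer reduces to a save/restore pair. This sketch is written against a minimal Storage-like interface so the logic is testable outside a browser; in the widget you'd pass window.sessionStorage, and the key name and state shape here are illustrative:

```typescript
// Sketch of the widget's state persistence to sessionStorage.
// StorageLike lets the same logic run against a mock outside the browser.
interface StorageLike {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
}

interface ChatState {
  messages: { role: "user" | "assistant"; content: string }[];
  sessionId: string | null;
  questionCount: number;
  isOpen: boolean;
}

const STATE_KEY = "ask-goose-state"; // illustrative key name

function saveChatState(storage: StorageLike, state: ChatState): void {
  storage.setItem(STATE_KEY, JSON.stringify(state));
}

function restoreChatState(storage: StorageLike): ChatState | null {
  const raw = storage.getItem(STATE_KEY);
  if (!raw) return null;
  try {
    return JSON.parse(raw) as ChatState;
  } catch {
    return null; // corrupt state: start fresh rather than crash the widget
  }
}
```

In the React widget, restoreChatState runs once on mount and saveChatState runs in an effect on every state change; sessionStorage's own lifecycle gives the "survives navigation, clears on tab close" boundary for free.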
All three failures shared a theme: the retrieval layer was the bottleneck, not the generation layer. Claude Haiku answered well when given the right context. The hard part was making sure the right context showed up. If I were advising someone building their first RAG system: spend 80% of your time on retrieval quality, 20% on prompt tuning.
What I’d Design Differently from Scratch
Upgrade the embedding model. MiniLM-L6-v2 has a 256-token context window. Anything beyond ~200 words gets silently truncated before the model sees it, which limits chunk size and forces aggressive splitting. A model like text-embedding-3-small (8,191-token window, 1,536 dimensions) would let me embed longer sections and get better similarity scores on conversational queries. I kept MiniLM because it was already wired into the site search — but for a purpose-built RAG system, I’d start with a larger model.
Build the chunking pipeline first. I designed the schema, then the data files, then the chunking script, then the API, then the frontend. If I did it again, I’d build the chunking pipeline and immediately test retrieval quality in isolation before writing a single line of API or UI code. Two of my three bugs were retrieval problems that would have surfaced earlier with a simple test script.
Rate limiting from day one. The 10-question limit is client-side only — anyone with curl can hit /api/chat directly. For a personal site this is fine, but the architecture choice should be deliberate. If I were building this for a client, I’d add server-side rate limiting per session and per IP from the start, not bolt it on later.
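For the single-server case, the limiter I have in mind is a few lines: a sliding window of timestamps per key (session ID or IP). On serverless you'd back this with a shared store such as a database table instead of process memory; the limits below are illustrative:

```typescript
// Sketch of server-side rate limiting: sliding window per key, in memory.
// On Vercel's serverless functions this map would not survive across
// instances, so a shared store would be needed in practice.
const WINDOW_MS = 60 * 60 * 1000; // 1 hour (illustrative)
const MAX_REQUESTS = 10;          // matches the client-side question cap

const hits = new Map<string, number[]>();

function allowRequest(key: string, now = Date.now()): boolean {
  // Drop timestamps that have aged out of the window.
  const recent = (hits.get(key) ?? []).filter((t) => now - t < WINDOW_MS);
  if (recent.length >= MAX_REQUESTS) {
    hits.set(key, recent);
    return false; // over the limit: reject
  }
  recent.push(now);
  hits.set(key, recent);
  return true;
}
```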
If you’re still reading, you should just go try it. Hit the chat icon in the bottom right corner and ask me anything. I have opinions about sticks, strong feelings about DoorDash’s grocery strategy, and I know exactly how long it took Jose to build every feature on this site.