
I Built an AI Chatbot That Knows Everything About Me

DEV Community · by Akromdev · April 1, 2026 · 10 min read


My portfolio site has project pages, work experience entries, and blog posts, all written as MDX files. When someone visits, they usually have a specific question: "Has this person worked with React?" or "What's their most recent project?" The answer is somewhere on the site, but finding it means clicking through pages and scanning project cards.

I wanted visitors to be able to just ask. Not a FAQ page with canned answers, but something that reads the actual content on the site and answers questions from it.

Why Not Just Feed It Everything?

Your first thought might be: take all the content, send it to a language model like GPT-4o or Claude, and let it answer questions. This works for short content. But language models hallucinate. Ask about a technology you never mentioned, and the model might confidently say "yes, they have 3 years of experience with that" because it sounds plausible.

There's also a scale problem. My site has around 30 content files. Sending all of them as context every time someone asks a question is wasteful, and the more content you include, the more room there is for the model to drift.

Search First, Then Answer

Instead of sending everything, what if I first searched my own content to find the pieces relevant to the question, and only sent those to the model? That's the core idea behind RAG (Retrieval-Augmented Generation). The model writes its answer from a small, focused set of context instead of your entire site. Because it only sees what's relevant, it stays grounded in what's actually there.

To make this work, I needed three things: a way to split my content into searchable pieces, a way to search by meaning (not just keywords), and a language model to write the final answer.

Splitting Content Into Chunks

My content lives in MDX files: one per project, one per job, one per blog post. Some of these are long. A single project page might describe the tech stack, what I built, and how it works, all in one file. Sending an entire file as context when the user only asked about the tech stack wastes tokens and adds noise.

So I split each file into smaller chunks at paragraph boundaries, capped at 500 characters:

function chunkText(text: string, maxLen = 500): string[] {
  const paragraphs = text.split(/\n\n+/);
  const chunks: string[] = [];
  let current = "";

  for (const para of paragraphs) {
    if (current.length + para.length > maxLen && current) {
      chunks.push(current.trim());
      current = para;
    } else {
      current += (current ? "\n\n" : "") + para;
    }
  }

  if (current.trim()) chunks.push(current.trim());
  return chunks;
}


One thing I learned through testing: raw chunks with no context confused the model. A chunk that says "Built with TypeScript and PostgreSQL" is meaningless without knowing whether it's describing a personal project or a company I worked at. The fix was adding type prefixes. Every chunk starts with [PROJECT], [WORK EXPERIENCE], [BLOG POST], or [PROFILE], so the AI immediately knows what kind of content it's looking at. I also added catalog chunks (complete lists of all projects or all work history) so questions like "list all my projects" don't return partial results.
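The post doesn't show the prefixing code itself; a minimal sketch of how it might look (the names labelChunks, ContentType, and PREFIX are mine, not the author's):

```typescript
// Hypothetical content types matching the article's four prefixes.
type ContentType = "project" | "work" | "blog" | "profile";

const PREFIX: Record<ContentType, string> = {
  project: "[PROJECT]",
  work: "[WORK EXPERIENCE]",
  blog: "[BLOG POST]",
  profile: "[PROFILE]",
};

// Prefix every chunk with its content type and source title so the
// model can tell a project description from a job description.
function labelChunks(type: ContentType, title: string, chunks: string[]): string[] {
  return chunks.map((c) => `${PREFIX[type]} ${title}: ${c}`);
}
```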

Searching by Meaning

Now I have chunks, but how do I find which ones are relevant to a question? Keyword search is the obvious choice, but it's brittle. If someone asks about "React experience" and my project description says "built with NextJS", there's no keyword match, even though NextJS is a React framework.

This is where embeddings come in. An embedding model takes a piece of text and converts it into a list of numbers that represent its meaning. "React" and "NextJS" produce similar numbers because they're related concepts. "PostgreSQL" and "Redis" end up close together because they're both databases. When someone asks about "React experience", the question gets converted to numbers too, and it naturally lands close to anything frontend-related in my content.

To convert text into these numbers, you need an embedding model. My first attempt used the HuggingFace Inference API, which worked but had a problem: 0.5 seconds when the model was warm, 9.4 seconds when it was cold. HuggingFace spins down free-tier models after inactivity, so the chatbot would randomly hang for nearly 10 seconds. I switched to running the same model locally: all-MiniLM-L6-v2 is a popular open-source option, only 22MB, and it produces 384 numbers per piece of text in about 12ms:

import { pipeline } from "@huggingface/transformers";

const extractor = await pipeline("feature-extraction", "Xenova/all-MiniLM-L6-v2");

async function embedText(text: string): Promise<number[]> {
  const result = await extractor(text, { pooling: "mean", normalize: true });
  return result.tolist()[0]; // 384 numbers
}


At build time, I run this on every chunk and save the results to a JSON file. At runtime, I embed the user's question and find the closest chunks by comparing their numbers using cosine similarity (how much two sets of numbers point in the same direction):

async function searchChunks(query: string, topK = 8) {
  const queryEmbedding = await embedText(query);

  return chunks
    .map((chunk) => ({
      ...chunk,
      score: cosineSimilarity(queryEmbedding, chunk.embedding),
    }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);
}

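searchChunks calls a cosineSimilarity helper the post doesn't show. A standard implementation looks like this (the function name matches the call site; the body is my sketch):

```typescript
// Cosine similarity: dot(a, b) / (|a| * |b|). Since the embeddings
// were created with normalize: true, the norms are already 1 and this
// reduces to a plain dot product, but the general form is safer.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```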

If you're working with thousands of chunks, you'd want a vector database like Pinecone or Weaviate to handle the search. For a personal site with around 160 chunks, looping through all of them in memory works fine.

Generating the Answer

At this point I have the top 8 chunks most relevant to the user's question. The last step is sending them to a language model to write a readable answer.

I went with Groq's free tier running Llama 3.1 8B. The model doesn't know anything about me by default. It only sees whatever chunks I send it. The system prompt tells it how to interpret the content and what the type prefixes mean:

const SYSTEM_PROMPT = `You are a helpful assistant on a personal website. Answer questions using only the provided context.

Pay attention to type labels:

- [PROJECT]: Portfolio projects
- [WORK EXPERIENCE]: Employment history
- [BLOG POST]: Articles written
- [PROFILE]: Personal info

Keep answers concise and friendly. Do not make up information.`;


The API call:

const response = await fetch("https://api.groq.com/openai/v1/chat/completions", {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    Authorization: `Bearer ${process.env.GROQ_API_KEY}`,
  },
  body: JSON.stringify({
    model: "llama-3.1-8b-instant",
    messages: [
      { role: "system", content: SYSTEM_PROMPT },
      ...conversationHistory,
      { role: "user", content: `Context:\n${relevantChunks}\n\nQuestion: ${question}` },
    ],
    temperature: 0.3,
  }),
});


Temperature controls how creative the model gets. At a low setting like 0.3, it stays close to the most likely answer, which is what you want when accuracy matters. Conversation history (the last 10 messages) goes in with each request, so follow-up questions like "tell me more about that project" work without losing context.
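Keeping only the last 10 messages can be as simple as a slice before building the request; a sketch (trimHistory and ChatMessage are hypothetical names, not from the post):

```typescript
type ChatMessage = { role: "system" | "user" | "assistant"; content: string };

// Keep only the most recent messages so each request stays small and
// the context window never fills up with stale conversation.
function trimHistory(messages: ChatMessage[], max = 10): ChatMessage[] {
  return messages.slice(-max);
}
```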

Deploying to Vercel

At this point everything worked locally and I was ready to deploy and move on. The chatbot ran as a serverless function through Astro's Vercel adapter, the model was only 22MB, and the embeddings were a static JSON file. Should have been the easy part.

I deployed and immediately hit Vercel's 250MB size limit on serverless functions. The model is only 22MB, so that wasn't the issue. @huggingface/transformers depends on onnxruntime-node, which ships native binaries for every platform. They all get bundled into your function, and that alone pushes you way past 250MB.

There's a lighter alternative called onnxruntime-web that uses WebAssembly instead of native binaries, around 11MB. But it's built for browsers. Run it in Node.js and it tries to fetch WASM files from a CDN over HTTPS, which Node.js refuses to do.

The workaround: swap onnxruntime-node for onnxruntime-web with a pnpm override, copy the WASM files to a local directory during the build, and tell the runtime to load them from the filesystem instead of the CDN:

import { env as onnxEnv } from "onnxruntime-web";
import { join } from "node:path";

const wasmDir = join(process.cwd(), ".wasm");
onnxEnv.wasm.wasmPaths = {
  wasm: `file://${wasmDir}/ort-wasm-simd-threaded.wasm`,
  mjs: `file://${wasmDir}/ort-wasm-simd-threaded.mjs`,
};
onnxEnv.wasm.numThreads = 1;

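The post doesn't show the pnpm override itself. In package.json it would be roughly this shape (the alias syntax is standard pnpm; the version range is an assumption about the author's setup):

```json
{
  "pnpm": {
    "overrides": {
      "onnxruntime-node": "npm:onnxruntime-web@^1"
    }
  }
}
```

With the override in place, any dependency that asks for onnxruntime-node gets the WebAssembly package instead, so the native binaries never enter the bundle.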

With Vercel's includeFiles bundling the model and WASM into the function, the same local inference that works on my laptop works in production. No embedding API, no cold starts, no cost.

What It Costs

  • Embedding a query: ~50ms

  • Searching 164 chunks: under 1ms

  • LLM response: ~400ms

  • Total: under 500ms

Monthly cost: $0. Groq's free tier covers the LLM, embeddings run inside the serverless function, and chunk data is a static JSON file built at deploy time.

The whole thing is around 250 lines of TypeScript. There's a chat button on my site if you want to try it.

Originally published on akrom.dev. For quick dev tips, join @akromdotdev on Telegram.
