AI Engineering · Next.js · System Design · Architecture

How I Built a Production RAG Pipeline with Next.js and Pinecone

Sumit Yadav
May 02, 2026 · 8 min read

What is RAG and Why Does It Matter?

Large Language Models are powerful but fundamentally limited — they only know what they were trained on. Ask GPT-4 about your internal company documents, a PDF you just uploaded, or data from last week and it either hallucinates or says it doesn't know.

Retrieval-Augmented Generation (RAG) solves this by giving the model a memory it can look up at query time. Instead of relying on training data, a RAG system:

  1. Stores your documents as searchable vector embeddings
  2. When you ask a question, retrieves the most relevant chunks
  3. Passes those chunks as context to the LLM
  4. Has the LLM answer from your actual documents rather than from its training data

The result is an AI that gives accurate, grounded answers specifically about your content.


What I Built

AI-Powered RAG Assistant — a production-ready document chat application where you can:

  • Upload any PDF document
  • Ask questions about its content in natural language
  • Get streamed, accurate answers that reference your specific document

Built with Next.js 16, Google Gemini, Pinecone, and the Vercel AI SDK. Let me walk you through every architectural decision.


The RAG Pipeline — End to End

The complete data flow splits into two separate pipelines:

  • Ingestion pipeline — runs when you upload a document
  • Query pipeline — runs on every user message

Step 1: Document Ingestion

Text Extraction

The first challenge is getting clean text out of a PDF. PDFs are notoriously messy — they're designed for printing, not parsing.

import pdf from "pdf-parse";

async function extractText(buffer: Buffer): Promise<string> {
  const data = await pdf(buffer);
  return data.text;
}

pdf-parse handles most PDFs well, extracting raw text while ignoring formatting. The output is a single string of all the document's text.
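
Raw extracted text usually still contains hard line breaks, words hyphenated across lines, and runs of whitespace. Here is a minimal cleanup sketch you could run before chunking. `cleanExtractedText` is a hypothetical helper, not part of pdf-parse, and its heuristics are assumptions that may need tuning for your documents.

```typescript
// Hypothetical helper: light normalization of text extracted from a PDF.
// Caveat: the hyphen rule also joins real compounds that happen to break
// at a line end (e.g. "well-\nknown" becomes "wellknown").
function cleanExtractedText(raw: string): string {
  return raw
    .replace(/\r\n?/g, "\n")            // normalize line endings
    .replace(/(\w)-\n(\w)/g, "$1$2")    // re-join words hyphenated across lines
    .replace(/[ \t]+/g, " ")            // collapse runs of spaces and tabs
    .replace(/ ?\n ?/g, "\n")           // drop spaces around line breaks
    .replace(/\n{3,}/g, "\n\n")         // keep at most one blank line
    .replace(/([^\n])\n(?!\n)/g, "$1 ") // a single line break becomes a space
    .trim();
}

const cleaned = cleanExtractedText(
  "Retrieval-aug-\nmented   generation\nworks.\r\n\r\n\r\nNext  paragraph.",
);
```

Paragraph breaks (blank lines) are kept, so a later chunking step can still use them as natural boundaries.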

Chunking — The Most Underrated Decision

This is where most RAG tutorials skip the important details. You can't just embed the entire document as one vector — the embedding would lose too much nuance, and you'd blow past token limits.

Instead, split the text into overlapping chunks:

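function chunkText(text: string, chunkSize = 500, overlap = 50): string[] {
  // An overlap >= chunkSize would never advance and loop forever
  if (chunkSize <= 0 || overlap < 0 || overlap >= chunkSize) {
    throw new Error("chunkSize must be positive and overlap smaller than chunkSize");
  }

  const chunks: string[] = [];
  const stride = chunkSize - overlap; // distance between consecutive chunk starts
  let start = 0;

  while (start < text.length) {
    const end = Math.min(start + chunkSize, text.length);
    chunks.push(text.slice(start, end));
    if (end === text.length) break; // avoid a trailing chunk made only of overlap
    start += stride; // overlap preserves context across the boundary
  }

  return chunks;
}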

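Why overlap matters: Imagine a key sentence that falls at the boundary between chunk 3 and chunk 4. Without overlap, the sentence is split and neither chunk contains the complete idea. With a 50-character overlap, the last 50 characters of chunk 3 are repeated at the start of chunk 4, so any boundary-straddling sentence of up to 50 characters appears whole in chunk 4. Longer sentences can still be split, which is why the overlap size is worth tuning.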
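
To make the boundary effect concrete, here is a small sketch with toy numbers. `slidingChunks` is an illustrative stand-in for the chunker above, and the 26-letter sample string is made up so the windows are easy to read.

```typescript
// Same sliding-window idea as chunkText, reduced to tiny numbers for illustration.
function slidingChunks(text: string, chunkSize: number, overlap: number): string[] {
  if (overlap >= chunkSize) throw new Error("overlap must be smaller than chunkSize");
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += chunkSize - overlap) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // this window reached the end
  }
  return chunks;
}

const sample = "abcdefghijklmnopqrstuvwxyz";

const withoutOverlap = slidingChunks(sample, 10, 0);
// ["abcdefghij", "klmnopqrst", "uvwxyz"]

const withOverlap = slidingChunks(sample, 10, 3);
// ["abcdefghij", "hijklmnopq", "opqrstuvwx", "vwxyz"]
```

With an overlap of 3, the sequence "hij" ends the first chunk and starts the second, so anything up to 3 characters long that crosses that boundary is still found intact in the second chunk.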

Chunk size trade-offs:

Chunk Size            Pros                Cons
Small (200 chars)     Precise retrieval   Loses broader context
Medium (500 chars)    Balanced            Good default
Large (1000+ chars)   More context        Less precise, costs more

I settled on 500 characters with 50 character overlap — a good balance for most documents.
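
The cost side of this trade-off is easy to estimate, because every chunk means one embedding call and one stored vector. The sketch below is a rough calculator, assuming a sliding-window chunker that stops once a window reaches the end of the text. `estimateChunkCount`, the 100,000-character document, and the 10% overlaps for the small and large sizes are illustrative assumptions.

```typescript
// Estimate how many chunks a sliding window produces, assuming the chunker
// stops as soon as a window reaches the end of the text.
function estimateChunkCount(textLength: number, chunkSize = 500, overlap = 50): number {
  if (overlap >= chunkSize) throw new Error("overlap must be smaller than chunkSize");
  if (textLength <= 0) return 0;
  if (textLength <= chunkSize) return 1;
  const stride = chunkSize - overlap; // distance between consecutive chunk starts
  return Math.ceil((textLength - chunkSize) / stride) + 1;
}

const docLength = 100000; // hypothetical document length in characters

const counts = {
  small: estimateChunkCount(docLength, 200, 20),    // 556 chunks
  medium: estimateChunkCount(docLength, 500, 50),   // 223 chunks
  large: estimateChunkCount(docLength, 1000, 100),  // 111 chunks
};
```

Smaller chunks multiply the number of vectors to embed and store, while larger chunks push more text into the prompt on every query.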

Generating Embeddings

Each chunk gets converted into a vector — a numerical representation of its semantic meaning:

import { GoogleGenerativeAI } from "@google/generative-ai";

const genAI = new GoogleGenerativeAI(process.env.GOOGLE_GENERATIVE_AI_API_KEY!);

async function embedChunks(chunks: string[]): Promise<number[][]> {
  const model = genAI.getGenerativeModel({ model: "gemini-embedding-001" });

  const embeddings = await Promise.all(
    chunks.map((chunk) =>
      model.embedContent(chunk).then((r) => r.embedding.values),
    ),
  );

  return embeddings;
}

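gemini-embedding-001 returns 3072-dimensional vectors by default and also supports smaller output sizes such as 768. Whichever size you use has to match the dimension of your Pinecone index. Two chunks about similar topics will have vectors that are close together in this high-dimensional space, which is what makes semantic search possible.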
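
"Close together" is commonly measured with cosine similarity, which is one of the metrics a vector index can be configured with. The sketch below assumes that metric. The three-dimensional vectors are made-up numbers for illustration, not real model output.

```typescript
// Cosine similarity: the cosine of the angle between two vectors.
// Near 1 means they point the same way, near 0 means unrelated.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("vectors must have the same dimension");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // a zero vector has no direction
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Toy "embeddings" (hypothetical values)
const refundPolicy = [0.9, 0.1, 0.0];
const returnPolicy = [0.8, 0.2, 0.1];
const officeParking = [0.0, 0.2, 0.9];

const related = cosineSimilarity(refundPolicy, returnPolicy);    // high
const unrelated = cosineSimilarity(refundPolicy, officeParking); // low
```

A vector database does the same kind of comparison at scale, using an approximate index instead of checking every stored vector.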

Storing in Pinecone

import { Pinecone } from "@pinecone-database/pinecone";

const pc = new Pinecone({ apiKey: process.env.PINECONE_API_KEY! });
const index = pc.index(process.env.PINECONE_INDEX!);

async function storeEmbeddings(
  chunks: string[],
  embeddings: number[][],
): Promise<void> {
  const vectors = chunks.map((chunk, i) => ({
    id: `chunk-${Date.now()}-${i}`,
    values: embeddings[i],
    metadata: { text: chunk },
  }));

  await index.upsert(vectors);
}

Each vector is stored with its original text as metadata. When we retrieve vectors later, we get the text back alongside the similarity score.
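
A long PDF can produce hundreds of vectors, and a single write request has size limits, so it is safer to upsert in batches. This is a minimal sketch: `toBatches` is a hypothetical helper and the batch size of 100 is an assumption, so check the current limits in the Pinecone documentation.

```typescript
// Hypothetical helper: split an array into fixed-size batches.
function toBatches<T>(items: T[], batchSize = 100): T[][] {
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new Error("batchSize must be a positive integer");
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}

// Usage sketch with the `vectors` array and `index` from storeEmbeddings:
//
//   for (const batch of toBatches(vectors)) {
//     await index.upsert(batch);
//   }

const demoBatches = toBatches([1, 2, 3, 4, 5, 6, 7], 3);
// [[1, 2, 3], [4, 5, 6], [7]]
```

Sending batches one after another also keeps memory use and request rates predictable for large uploads.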


Step 2: The Query Pipeline

Semantic Search

When a user asks a question, we embed their query using the same model and find the most similar chunks:

async function retrieveContext(query: string, topK = 5): Promise<string[]> {
  // Embed the user's question
  const model = genAI.getGenerativeModel({ model: "gemini-embedding-001" });
  const queryEmbedding = await model.embedContent(query);

  // Find most similar chunks in Pinecone
  const results = await index.query({
    vector: queryEmbedding.embedding.values,
    topK,
    includeMetadata: true,
  });

  // Extract and return the text
  return results.matches
    .filter((m) => m.score! > 0.7) // only high-confidence matches
    .map((m) => m.metadata!.text as string);
}

The similarity threshold matters. Setting score > 0.7 filters out loosely related chunks that would add noise to the context. Too low and you get irrelevant context; too high and you might miss relevant content.
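
Here is a small sketch of how the threshold changes what reaches the prompt. `ScoredChunk` and `selectContext` are hypothetical names, and the scores are made up. Real scores depend on the embedding model and the index's similarity metric.

```typescript
interface ScoredChunk {
  text: string;
  score: number;
}

// Hypothetical helper: keep matches above a threshold, best first.
function selectContext(matches: ScoredChunk[], minScore = 0.7, maxChunks = 5): string[] {
  return matches
    .filter((m) => m.score > minScore)   // filter returns a copy, input is untouched
    .sort((a, b) => b.score - a.score)   // highest score first
    .slice(0, maxChunks)
    .map((m) => m.text);
}

const demoMatches: ScoredChunk[] = [
  { text: "Refunds are issued within 14 days.", score: 0.86 },
  { text: "Contact support to start a return.", score: 0.74 },
  { text: "Our office is closed on public holidays.", score: 0.52 },
];

const strict = selectContext(demoMatches, 0.8);   // 1 chunk: may miss useful context
const standard = selectContext(demoMatches, 0.7); // 2 chunks
const loose = selectContext(demoMatches, 0.5);    // 3 chunks: lets in the unrelated one
```

Logging the scores of retrieved chunks for a handful of real questions is a practical way to pick the cutoff for your own documents.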

Building the Prompt

The retrieved chunks become the context window for Gemini:

function buildPrompt(query: string, context: string[]): string {
  return `You are a helpful assistant that answers questions based on the provided document context.

DOCUMENT CONTEXT:
${context.join("\n\n---\n\n")}

USER QUESTION: ${query}

Answer the question based only on the context above. If the answer isn't in the context, say so clearly. Do not make up information.`;
}

These instructions are critical, even though the code sends them inside the user message rather than as a separate system prompt. They tell Gemini to:

  • Only use the provided context
  • Not hallucinate information outside the documents
  • Be transparent when it can't find an answer
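
One case the prompt alone does not cover is retrieval returning nothing, for example when every match falls below the similarity threshold. A possible guard is sketched below. `decidePrompt`, `PromptDecision`, and the fallback message are hypothetical, and the prompt text is a shortened variant of the one above.

```typescript
// Hypothetical guard: decide what to do based on the retrieval result.
type PromptDecision =
  | { kind: "answer"; prompt: string }
  | { kind: "no-context"; message: string };

function decidePrompt(query: string, context: string[]): PromptDecision {
  // Ignore empty or whitespace-only chunks
  const usable = context.map((c) => c.trim()).filter((c) => c.length > 0);

  if (usable.length === 0) {
    // Nothing relevant was retrieved: skip the model call entirely
    return {
      kind: "no-context",
      message: "I couldn't find anything in the document about that.",
    };
  }

  const prompt = [
    "DOCUMENT CONTEXT:",
    usable.join("\n\n---\n\n"),
    "",
    `USER QUESTION: ${query}`,
    "",
    "Answer the question based only on the context above. If the answer isn't in the context, say so clearly.",
  ].join("\n");

  return { kind: "answer", prompt };
}
```

Returning early avoids paying for a model call whose only honest answer would be "I don't know".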

Streaming the Response

The Vercel AI SDK makes streaming trivial:

import { streamText } from "ai";
import { google } from "@ai-sdk/google";

export async function POST(req: Request) {
  const { messages } = await req.json();
  const lastMessage = messages[messages.length - 1].content;

  const context = await retrieveContext(lastMessage);
  const prompt = buildPrompt(lastMessage, context);

  const result = await streamText({
    model: google("gemini-2.5-flash"),
    messages: [{ role: "user", content: prompt }],
  });

  return result.toDataStreamResponse();
}

toDataStreamResponse() handles all the streaming protocol details, so the frontend receives tokens as they're generated instead of waiting for the full response. The exact method name depends on your AI SDK major version.


Architecture Decisions

Why Gemini over OpenAI?

Three reasons:

1. Context window: Gemini 2.5 Flash has a 1M token context window. For large documents, this matters enormously.

2. Embedding quality: gemini-embedding-001 produces high-quality semantic embeddings that compete with OpenAI's text-embedding-3-large.

3. Cost: Gemini is significantly cheaper at scale, which matters for a production app with real usage.

Why Pinecone over pgvector?

For a serverless Next.js app on Vercel, Pinecone's serverless offering is the natural fit. No database to manage, scales automatically, and the free tier is generous enough for production experimentation.

pgvector is excellent for teams already running Postgres — but introduces operational overhead that doesn't fit a lean, serverless architecture.

Why Vercel AI SDK?

The killer feature: provider abstraction. Switching from Gemini to Claude to GPT-4 is a one-line change:

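import { google } from "@ai-sdk/google";
import { anthropic } from "@ai-sdk/anthropic";
import { openai } from "@ai-sdk/openai";

// Pick one; the rest of the streamText call stays the same.
const model = google("gemini-2.5-flash"); // Gemini
// const model = anthropic("claude-sonnet-4-5"); // Claude
// const model = openai("gpt-4o"); // GPT-4o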

In production, you want this flexibility. Model quality and pricing shift constantly — you don't want to be locked in.


What I Learned

1. Chunking strategy is everything. The quality of your RAG system is largely determined by how well you chunk documents. Size, overlap, and whether you chunk by character, word, or sentence all matter significantly.

2. The similarity threshold is a tunable parameter. There's no universal right answer — it depends on your documents and use case. Start at 0.7 and tune from there.

3. System prompt engineering is underrated. The instruction to "only answer from context" is what prevents hallucinations. Without it, the LLM fills gaps with training data — defeating the purpose of RAG.

4. Streaming is non-negotiable for AI UX. Waiting 5 seconds for a full response feels broken. Streaming tokens as they arrive feels intelligent and responsive. The Vercel AI SDK makes this trivially easy.

5. RAG is not a magic bullet. It works brilliantly for factual questions about specific documents. It struggles with synthesis across many documents, numerical reasoning, and questions that require understanding the document as a whole rather than specific passages.
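
On the first lesson: the chunker in this post splits by character count, which can cut sentences in half. Below is a minimal sketch of a sentence-based alternative. `chunkBySentence` is a hypothetical helper with a deliberately naive sentence splitter (it will mis-split abbreviations such as "e.g."), so treat it as a starting point rather than a drop-in replacement.

```typescript
// Hypothetical sentence-aware chunker: packs whole sentences up to maxChars.
function chunkBySentence(text: string, maxChars = 500): string[] {
  // Naive splitter: a sentence is a run of text ending in ., ! or ?
  const sentences = (text.match(/[^.!?]+[.!?]*/g) || [])
    .map((s) => s.trim())
    .filter((s) => s.length > 0);

  const chunks: string[] = [];
  let current = "";

  for (const sentence of sentences) {
    const candidate = current ? `${current} ${sentence}` : sentence;
    if (candidate.length <= maxChars || current === "") {
      current = candidate; // fits, or a single over-long sentence kept whole
    } else {
      chunks.push(current); // close the current chunk and start a new one
      current = sentence;
    }
  }
  if (current) chunks.push(current);

  return chunks;
}

const packed = chunkBySentence(
  "RAG retrieves context. It then generates an answer. Chunking matters a lot!",
  60,
);
// ["RAG retrieves context. It then generates an answer.", "Chunking matters a lot!"]
```

No sentence is ever split here, at the cost of chunks with uneven lengths. Overlap could be added by repeating the last sentence of each chunk at the start of the next.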


What's Next

  • Source citations — show which chunk of the document answered the query
  • Multi-document support — upload and query across multiple PDFs
  • Conversation history — multi-turn RAG that remembers previous context
  • Hybrid search — combine semantic search with keyword search for better retrieval
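
For the hybrid search item, one common way to merge a semantic ranking with a keyword ranking is reciprocal rank fusion (RRF). The sketch below shows the general technique, not a planned implementation for this project. The chunk IDs are made up, and k = 60 is the conventional default rather than a tuned value.

```typescript
// Reciprocal rank fusion: each list contributes 1 / (k + rank) per item,
// so items ranked well by several retrievers rise to the top.
function reciprocalRankFusion(rankings: string[][], k = 60): string[] {
  const scores: Record<string, number> = {};

  for (const ranking of rankings) {
    ranking.forEach((id, position) => {
      const rank = position + 1; // ranks are 1-based
      scores[id] = (scores[id] || 0) + 1 / (k + rank);
    });
  }

  // Highest fused score first
  return Object.keys(scores).sort((a, b) => scores[b] - scores[a]);
}

const semanticRanking = ["chunk-3", "chunk-1", "chunk-7"]; // e.g. from vector search
const keywordRanking = ["chunk-1", "chunk-9", "chunk-3"];  // e.g. from keyword search

const fused = reciprocalRankFusion([semanticRanking, keywordRanking]);
// ["chunk-1", "chunk-3", "chunk-9", "chunk-7"]
```

RRF only needs rank positions, so it works even when the two retrievers produce scores on incompatible scales.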

Try the live demo →

View the source code →


RAG is one of the most practically important AI architecture patterns right now. Many companies are building internal knowledge bases, document assistants, and search systems using these exact techniques. Understanding how to build one end to end, not just call an API, is what separates engineers who can talk about AI from engineers who can ship AI.
