My First Time Building a RAG System

Building a RAG system taught me that the intelligence of an AI product lives not in the model, but in the pipeline you design around it.

2025.11.14 · Engineering · 7 min

I was tasked with building an AI-powered Document Management System (DMS). The goal wasn't just storage and retrieval; it was to enable users to interact with documents intelligently. Think: asking questions, extracting insights, and navigating large PDFs without manually reading everything.

The solution I arrived at was a Retrieval-Augmented Generation system, which I came to think of simply as the "AI Worker." But before getting into the mechanics of how it was built, it's worth spending a moment on why RAG exists as an architectural pattern in the first place — because understanding the motivation clarifies every design decision that follows.

Why RAG, and Why It Matters

Large language models are, in a narrow sense, extraordinarily well-read. They have processed more text than any human could in a thousand lifetimes. But their knowledge is frozen at the moment of training, and it is entirely general. They know nothing about your company's contracts, your internal research, or the specific regulatory filings that matter to your team. Worse, when asked about things they don't know, they don't always admit ignorance — they improvise, fluently and convincingly, which is precisely what makes hallucination so dangerous in professional contexts.

Retrieval-Augmented Generation was designed to address both of these limitations in a single architectural move. Rather than asking the model to recall information from memory, you retrieve relevant passages from a private knowledge base and hand them to the model as context. The model's job is then not to remember but to reason — to synthesize, summarize, and respond based on material placed directly in front of it. The intelligence shifts from the weights of the model to the quality of the retrieval system that feeds it. This distinction is not academic. It is, as I came to learn, the central truth around which everything else in the project orbited.

Designing the Pipeline

I structured the system as a deterministic pipeline, meaning each stage had a clearly defined input, a clearly defined output, and a single responsibility. Modularity here was not an aesthetic preference — it was an engineering necessity. When something went wrong, and things always go wrong, I needed to isolate the problem without having to untangle an interconnected web of logic.

The pipeline divides naturally into two phases: ingestion and retrieval. During ingestion, documents are processed, chunked, embedded, and stored. During retrieval, user queries are embedded, matched against stored vectors, and the results are assembled into a prompt that the language model can reason over. These two phases are cleanly separated, which means you can update your chunking strategy or swap your embedding model without rebuilding the query pipeline, and vice versa.

The Ingestion Phase: Where Quality is Determined

Every document in this system arrived as a PDF, which simplified one dimension of the problem while introducing several others. PDF is a presentation format, not a content format. It was designed to make documents look a certain way on screen and on paper, not to make their text easily extractable. The result is that raw extraction often produces text that is technically correct but structurally broken — lines that wrap mid-sentence, headers that appear randomly between paragraphs, page numbers that interrupt the flow of prose, and whitespace that follows the logic of a layout engine rather than the logic of language.

I built a preprocessing layer to address this before any other stage touched the text. Normalizing whitespace, stripping repeated headers and footers, and preserving paragraph boundaries were not glamorous tasks, but they were consequential ones. The quality of every downstream stage — chunking, embedding, retrieval — depends entirely on the quality of the text you start with. A beautiful embedding model cannot compensate for malformed input. Garbage in, garbage out is not a cliché in this context; it is a precise description of what happens when you skip preprocessing and wonder later why your answers are incoherent.
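To make the cleanup concrete, here is a minimal sketch of the two repairs described above. The function names, the 60% repetition threshold, and the line-length cutoff are illustrative choices, not the actual implementation:

```python
import re
from collections import Counter

def normalize_whitespace(text: str) -> str:
    # Join lines that were wrapped mid-sentence by the PDF layout engine,
    # while keeping blank lines as paragraph boundaries.
    paragraphs = re.split(r"\n\s*\n", text)
    cleaned = []
    for p in paragraphs:
        joined = re.sub(r"\s*\n\s*", " ", p.strip())  # unwrap lines inside a paragraph
        joined = re.sub(r"[ \t]+", " ", joined)       # collapse runs of spaces/tabs
        if joined:
            cleaned.append(joined)
    return "\n\n".join(cleaned)

def strip_repeated_lines(pages: list[str], threshold: float = 0.6) -> list[str]:
    # Short lines that recur on most pages (headers, footers) are layout
    # artifacts, not content, so drop every occurrence.
    counts = Counter(line.strip()
                     for page in pages
                     for line in page.splitlines() if line.strip())
    repeated = {line for line, n in counts.items()
                if n / len(pages) >= threshold and len(line) < 80}
    return ["\n".join(l for l in page.splitlines()
                      if l.strip() not in repeated)
            for page in pages]
```

The frequency-based header detection is the interesting part: rather than hardcoding patterns, it treats anything that repeats across most pages as boilerplate.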

Once the text was clean, I split it into chunks of roughly three hundred to five hundred tokens, with an overlap of fifty to one hundred tokens between adjacent chunks. The overlap deserves emphasis because it is frequently omitted in introductory explanations of RAG. When you split text mechanically, you inevitably cut across ideas that span boundaries. A sentence that begins at the end of one chunk and concludes at the start of the next will be orphaned — retrievable in neither chunk with its full meaning intact. The overlap ensures that context bleeds across boundaries in a controlled way, preserving semantic continuity without duplicating large amounts of content.

The guiding principle for chunking was to follow the structure of the text wherever possible. Paragraph and section boundaries are natural seams. They represent places where the author chose to transition between ideas, and splitting there respects that intention. Token-based splitting was reserved for cases where no natural seam existed within the acceptable size range — a fallback, not a default.
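As a sketch of that strategy, the splitter below tries paragraph seams first and falls back to an overlapping token window only for oversized paragraphs. The whitespace tokenizer is a stand-in for a real one, and the defaults mirror the ranges mentioned above:

```python
def chunk_text(text: str, max_tokens: int = 400, overlap: int = 75) -> list[str]:
    # Prefer paragraph boundaries as seams; fall back to token windows
    # with overlap only when a paragraph exceeds the size limit.
    chunks = []
    for para in text.split("\n\n"):
        tokens = para.split()  # whitespace tokens as a stand-in for a real tokenizer
        if len(tokens) <= max_tokens:
            if tokens:
                chunks.append(para.strip())
            continue
        # Sliding window: each chunk repeats the last `overlap` tokens of the
        # previous one, so an idea that straddles a cut survives intact
        # in at least one chunk.
        step = max_tokens - overlap
        for start in range(0, len(tokens), step):
            chunks.append(" ".join(tokens[start:start + max_tokens]))
            if start + max_tokens >= len(tokens):
                break
    return chunks
```

The overlap is visible directly in the output: the tail of one chunk reappears as the head of the next.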

Embedding: Turning Language into Geometry

The next stage converts each text chunk into a vector embedding — a high-dimensional numerical representation of the chunk's semantic content. This is where the system gains its ability to understand meaning rather than just match keywords. Two chunks that use different words but discuss the same concept will have embeddings that are geometrically close to each other. Two chunks that share surface-level vocabulary but address different topics will be far apart. Similarity in embedding space corresponds to similarity in meaning, which is what makes semantic search possible.
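The geometric claim is easy to demonstrate with cosine similarity, which is the standard closeness measure in embedding space. The vectors below are hand-made toys standing in for real model outputs:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # 1.0 means the vectors point the same way (same meaning, under the
    # embedding model's geometry); near 0 means unrelated.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy stand-ins: imagine these came from an embedding model.
feline_care  = np.array([0.9, 0.1, 0.0])   # "how to look after a cat"
cat_welfare  = np.array([0.8, 0.2, 0.1])   # different words, same topic
tax_filing   = np.array([0.0, 0.1, 0.9])   # different topic entirely
```

With real embeddings, "feline care" and "cat welfare" land close together despite sharing no keywords, which is exactly what keyword matching cannot do.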

Because the document corpus was entirely in English, I was able to use an off-the-shelf embedding model without modification. The language constraint removed what is often a significant source of complexity in multilingual systems — the need to ensure that embeddings are semantically aligned across languages. This was a fortunate simplification, and I was careful not to overcomplicate it by reaching for solutions to problems I did not have.

Vector Storage and the Geometry of Retrieval

All embeddings are stored in a vector database built specifically for approximate nearest neighbor search. Exact nearest neighbor search — finding the mathematically closest vector to a query — becomes computationally prohibitive at scale. As the number of stored embeddings grows into the hundreds of thousands or millions, the time required for an exhaustive search grows proportionally. Approximate nearest neighbor algorithms trade a small, bounded amount of accuracy for dramatic gains in speed, making real-time retrieval practical even across large document collections.
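Exact search is still the right mental model (and a fine baseline at small scale); it is what ANN libraries such as FAISS or hnswlib approximate. A brute-force sketch, assuming unit-normalizable vectors:

```python
import numpy as np

def top_k_exact(query: np.ndarray, index: np.ndarray, k: int = 5) -> list[int]:
    # Exhaustive cosine search: O(N * d) per query. Practical for
    # thousands of vectors; ANN indexes trade a little exactness for
    # speed beyond that.
    index_norm = index / np.linalg.norm(index, axis=1, keepdims=True)
    q = query / np.linalg.norm(query)
    scores = index_norm @ q                 # cosine similarity to every stored vector
    return np.argsort(scores)[::-1][:k].tolist()
```

The cost of the matrix multiply grows linearly with the number of stored vectors, which is exactly why approximate methods exist.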

Each entry in the vector store contains three things: the embedding vector, the original text chunk it represents, and metadata describing where that chunk came from — the document name, page number, section heading, and similar attributes. The metadata turns out to be more important than it might initially appear. At retrieval time, it allows the system to tell the user not just what the answer is, but where it comes from. Provenance matters in professional contexts. An answer that can be traced to a specific page of a specific document is fundamentally more trustworthy than one that cannot.
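A stored entry can be sketched as a small record; the field names here are illustrative, not the actual schema:

```python
from dataclasses import dataclass

@dataclass
class StoredChunk:
    embedding: list[float]   # the vector used for similarity search
    text: str                # the original chunk, fed back into the prompt
    document: str            # provenance metadata starts here
    page: int
    section: str

    def citation(self) -> str:
        # Provenance string surfaced alongside answers, so users can
        # verify where a claim came from.
        return f"{self.document}, p. {self.page} ({self.section})"
```

Keeping the raw text next to the vector matters: the embedding is only a search key, and it is the text that reaches the model.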

The Query Pipeline: Putting It Together

When a user submits a question, the system mirrors the ingestion process in reverse. The query is converted into an embedding using the same model that embedded the document chunks — this is essential, because the query and the chunks must exist in the same vector space for similarity search to be meaningful. The system then retrieves the top three to five chunks whose embeddings are closest to the query embedding, assembles them into a structured prompt, and passes that prompt to the language model.
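End to end, the query path can be sketched in a dozen lines. `embed_fn` and `llm_fn` are placeholders for the embedding model and language model calls, and the store is a plain list of dicts for illustration:

```python
import math

def cosine(a, b):
    # Cosine similarity between two plain-list vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def answer_query(question, embed_fn, store, llm_fn, k=4):
    # 1. Embed the query with the SAME model used at ingestion,
    #    so query and chunks live in one vector space.
    q_vec = embed_fn(question)
    # 2. Retrieve the k chunks closest to the query embedding.
    hits = sorted(store, key=lambda e: -cosine(q_vec, e["embedding"]))[:k]
    # 3. Assemble retrieved context and the question into a prompt.
    context = "\n\n".join(e["text"] for e in hits)
    prompt = (f"Context:\n{context}\n\nQuestion: {question}\n"
              "Answer using only the context above.")
    return llm_fn(prompt)
```

Note that step 1 is where the "same model" requirement lives: swap the query-time embedder and every similarity score becomes meaningless.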

The prompt structure I settled on was deliberately constrained. The retrieved chunks were presented as context, the user's question followed, and the model was explicitly instructed to answer based only on what was provided. This constraint is not a limitation — it is the mechanism by which the system remains trustworthy. An unconstrained model will fill gaps in the retrieved context with its own knowledge, and those fills are unverifiable. A constrained model that says "I cannot find this information in the provided documents" is more useful, in a professional setting, than one that answers confidently from memory.
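A template in that spirit might look like the following; the exact wording is illustrative, but the shape (context first, question second, an explicit refusal instruction) is the point:

```python
def build_prompt(chunks: list[str], question: str) -> str:
    # Context first, question second, and an explicit instruction to
    # refuse rather than improvise when the context is insufficient.
    context = "\n\n---\n\n".join(chunks)
    return (
        "Answer the question using ONLY the context below.\n"
        "If the answer is not in the context, reply: "
        '"I cannot find this information in the provided documents."\n\n'
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
```

The refusal sentence is given verbatim so the model has a concrete, low-effort escape hatch instead of a vague instruction to "be honest".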

Extending Beyond Basic Retrieval

One of the more significant decisions I made, after extensive testing, was to move beyond treating chunks as isolated units. In a naive RAG implementation, each chunk is independent. The system retrieves whichever chunks are most similar to the query and presents them without regard for their relationship to each other or to the broader document from which they were drawn. This works adequately for simple factual questions, but it breaks down for questions that require understanding how pieces of information relate — questions about processes, sequences, comparisons, or any topic that spans multiple sections of a document.

To address this, I introduced relational context into the retrieval layer. Chunks were linked to their parent sections, and sections were linked to their parent documents. When a chunk was retrieved, its surrounding structural context came with it. This allowed the system to answer questions with a coherence that pure similarity search cannot achieve on its own — not just finding the right information, but understanding how that information connects to the rest of what the document says.
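One simple way to realize those links, sketched here with illustrative names rather than the actual data model, is to record each chunk's parent section and, on a hit, return the hit together with its in-section neighbors in document order:

```python
from dataclasses import dataclass, field

@dataclass
class Section:
    title: str
    chunk_ids: list[int] = field(default_factory=list)  # document order

def expand_with_context(hit_id, chunks, sections):
    # chunks: {id: {"text": ..., "section": section_key}}
    # Return the retrieved chunk plus its immediate neighbors within the
    # same parent section, preserving document order.
    section = sections[chunks[hit_id]["section"]]
    ids = section.chunk_ids
    pos = ids.index(hit_id)
    window = ids[max(0, pos - 1): pos + 2]   # hit plus one neighbor each side
    return [chunks[i]["text"] for i in window]
```

Similarity search finds the entry point; the structural links decide what travels with it into the prompt.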

What I Learned Building This

The most important insight from this project is one that runs counter to how AI systems are typically presented in popular discourse. The language model is not where the intelligence lives. The language model is the last mile — the component that synthesizes and articulates. The intelligence of the system as a whole lives in the pipeline: in the quality of the preprocessing, the precision of the chunking, the care taken in prompt construction, and the tuning of the retrieval mechanism.

A state-of-the-art language model fed poor retrieval results will produce poor answers. A modest language model fed excellent retrieval results will produce answers that are accurate, grounded, and useful. I tested this directly, and the results were unambiguous. Improving retrieval quality had a larger impact on answer quality than upgrading the model did. This is a humbling finding if you have spent time focused on model selection, and a liberating one if you are an engineer who prefers working on systems over waiting for the next model release.

Managing the context window — deciding how much retrieved content to include in each prompt — required ongoing attention. Including too little context meant that the model sometimes lacked the information needed to answer completely. Including too much meant that relevant content was diluted by tangentially related material, and the model would sometimes anchor on the wrong passage. The optimal range was narrower than I expected, and finding it required iteration against real queries from real users rather than synthetic benchmarks.
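The budgeting logic reduces to a greedy cut: take chunks best-first until the token budget is spent, and drop the tail rather than dilute the prompt. A sketch, using whitespace counting as a stand-in for a real token counter:

```python
def fit_to_budget(chunks_scored, budget_tokens: int) -> list[str]:
    # chunks_scored: (score, text) pairs. Greedily keep the best-scoring
    # chunks until the token budget is spent; a truncated context beats
    # one padded with marginally relevant material.
    kept, used = [], 0
    for _, text in sorted(chunks_scored, key=lambda t: -t[0]):
        n = len(text.split())   # stand-in for a real tokenizer's count
        if used + n > budget_tokens:
            break
        kept.append(text)
        used += n
    return kept
```

Tuning then becomes a search over a single number — the budget — against real queries.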

Latency was a constant consideration. Every stage of the pipeline adds time: embedding the query, executing the similarity search, assembling the prompt, and waiting for the model to generate a response. I addressed this by caching embeddings for documents that had already been processed, precomputing vectors during ingestion rather than at query time, and batching operations wherever the architecture permitted. These optimizations were not dramatic individually, but their cumulative effect was significant.
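The caching piece is the easiest of those wins to show. Keying by a hash of the chunk text means re-ingesting an unchanged document never re-embeds; the class below is a minimal in-memory sketch, not the production cache:

```python
import hashlib

class EmbeddingCache:
    # Memoize embeddings by a content hash of the chunk text, so
    # identical text is only ever embedded (and paid for) once.
    def __init__(self, embed_fn):
        self.embed_fn = embed_fn
        self._cache = {}
        self.misses = 0   # how often we actually called the model

    def get(self, text: str):
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in self._cache:
            self.misses += 1
            self._cache[key] = self.embed_fn(text)
        return self._cache[key]
```

A persistent version would back `_cache` with the vector store itself, which is effectively what precomputing at ingestion time accomplishes.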

Evaluation was the most intellectually honest challenge of the project. There is no clean metric for answer quality in an open-ended QA system. I built a set of task-specific test queries drawn from real use cases, evaluated responses manually against ground truth, and used prompt-based scoring as a secondary signal. None of these methods are perfect, but together they gave me enough signal to understand whether changes to the pipeline were improvements or regressions.

Final Thought

What this project ultimately reinforced is something that experienced engineers know but that the current moment in AI tends to obscure: building useful AI systems is primarily a systems engineering problem. The model is a component, not the product. The product is the pipeline, the data quality, the retrieval strategy, the prompt design, and the feedback mechanisms that keep the whole thing improving over time. Getting that right is harder than selecting a model, and it matters more.