Vol. XV / Issue 02

The McKinnie Dispatch

Filed from the knowledge bench

RAG archaeology: Knowledge systems, 2023 to 2024

A private knowledge problem

The knowledge wasn't in the model. Getting it there was the work.

Archibus and WebCentral have a large body of documentation. GPT-4 knew none of it. Getting that material into a form a model could retrieve reliably turned out to be the harder problem, and the more useful lesson.

GPT-4 was a real step up. It could reason, explain, synthesize. What it could not do was answer a specific question about a specific Archibus workflow it had never seen. That was not a model limitation in the usual sense. It was a context problem.

Archibus and WebCentral ship a large, detailed body of material: configuration guides, help content, schema references, workflow documentation accumulated over many releases. None of that was in the model's training data in any useful form. If you wanted it to answer questions about private software knowledge, you had to get that material in front of it. In 2023, doing that reliably was not yet a solved problem.

The naive approach hits the first constraint fast. Token counts explode when you try to pass a full corpus through a chat completion call. Even the wider context options at the time only worked if you were sending well-selected text. You could not brute-force the whole corpus into a single call. Which meant the first real problem was not "where do I store embeddings." It was "what am I even trying to store, and in what form."
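
The arithmetic behind that constraint is quick to check: total the corpus tokens and compare against a 16k-token call. A minimal sketch; the source path is a placeholder.
import tiktoken
from pathlib import Path

enc = tiktoken.get_encoding("cl100k_base")
total = sum(len(enc.encode(p.read_text(errors="ignore"))) for p in Path("kb_source/").rglob("*.txt"))
print(f"{total} corpus tokens vs roughly 16,000 per call")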

The rewrite pass

Before any vector store, there was a simpler step: reshape the source material into something more retrievable.

Archibus help documents are written for users who are already looking at the product and need to get unstuck on a specific step. Useful for that context. Not shaped for cosine similarity. The structure that helps a reader navigate a UI screen does not help a retrieval system find the right chunk when someone phrases a question differently than the docs expect.

The first experiment was a rewrite pass: count the tokens, send each section to the model with an instruction to make it more searchable, store the result. The idea was that the shape of the stored text should match the shape of the expected question more closely than the original doc structure did.

Exhibit A: The reshaping pass. Before anything went into a vector store, it went through a rewrite. Count tokens, send to the model with a retrieval-oriented prompt, store the output. The shape of what you store matters.
import tiktoken
import openai  # pre-1.0 openai SDK, which these scripts used

# Count tokens so oversized sections can be split before the call.
encoding = tiktoken.encoding_for_model("gpt-3.5-turbo-16k")
tokens = len(encoding.encode(source_text))

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo-16k",
    messages=[
        {"role": "system", "content": "Rewrite these docs so they are easier to search."},
        {"role": "user", "content": source_text},
    ],
)

# The rewritten text is what goes on to the splitting and embedding pass.
rewritten = response["choices"][0]["message"]["content"]

The FAQ-style output of that rewrite pass was not always the right answer. It was one approach to making text more retrievable. But it forced the useful question: what does "retrievable" actually mean for this material? That question mattered more than the specific technique.

The retrieval apparatus

Once the source was reshaped, the retrieval stack came next. LangChain was the obvious harness in 2023. It was everywhere. Document loaders, text splitters, and vector store abstractions made it easy to go from raw file to embedded chunk to stored document in a handful of lines.

Exhibit B: Splitting and embedding. Split the reshaped docs, embed locally, store in a vector database. The first product-shaped test query: "how do I set a user role?" Similarity search returned results. The model gave a reasonable answer.
from langchain.document_loaders import DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import SentenceTransformerEmbeddings
from langchain.vectorstores import SupabaseVectorStore
from supabase import create_client

# Supabase credentials and the corpus path are placeholders.
supabase = create_client(SUPABASE_URL, SUPABASE_KEY)
loader = DirectoryLoader("reshaped_docs/")

splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
docs = splitter.split_documents(loader.load())

embeddings = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")
store = SupabaseVectorStore.from_documents(
    client=supabase,
    documents=docs,
    embedding=embeddings,
    table_name="documents",
)
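
The first product-shaped test ran straight against that store; roughly like this, with k and the printout illustrative:
results = store.similarity_search("how do I set a user role?", k=4)
for doc in results:
    print(doc.page_content[:200])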

That worked. It also immediately opened the real questions: how do you decide chunk size? What happens when the source document structure does not align with how users phrase questions? When does the embedding model matter, and when is it noise?

LangChain helped at first, then started getting in the way. The abstractions were still moving. Things that worked at tutorial scale broke under real usage. At some point the decision was to drop down to the raw embedding APIs and the vector store clients directly. Less magic, fewer surprises once the scripts stopped being examples and started handling real material.
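
In practice, "closer to the raw APIs" meant calling the embedding endpoint yourself and writing rows through the store's own client. A minimal sketch, assuming the pre-1.0 openai SDK and a Supabase documents table with content and embedding columns; the chunks variable stands in for whatever splitter output is in use.
import openai

def embed(text):
    result = openai.Embedding.create(model="text-embedding-ada-002", input=text)
    return result["data"][0]["embedding"]

for chunk in chunks:  # output of whichever splitter is in use
    supabase.table("documents").insert({
        "content": chunk,
        "embedding": embed(chunk),
    }).execute()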

The vector store parade

Over the next stretch the bench tried a long list of stores and retrieval frameworks: Supabase/pgvector, FAISS, in-memory DocArray, ChromaDB, Weaviate hybrid search, Milvus, Vertex AI RAG Engine, Haystack with its own pipeline abstractions, LlamaIndex. Later, sparse and hybrid retrieval through a BGE-M3 model served by Ollama.

Each one surfaced something different. Weaviate hybrid search showed that BM25 and vector similarity often disagree in useful ways. Milvus raised different deployment questions than Chroma. Vertex AI RAG raised the "how much do you want the cloud to own" question directly. Embedding model choices were real: from text-embedding-ada-002 to newer OpenAI generations, to Google's text-embedding-004 when Gemini entered the stack, to local all-MiniLM-L6-v2 when remote API calls were too slow. But the switching was never the main variable.
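
The hybrid search point is easiest to see at the query level: the v3 Weaviate Python client exposes an alpha knob that blends BM25 with vector similarity. A sketch; the Document class name and the limit are placeholders.
import weaviate

client = weaviate.Client("http://localhost:8080")
result = (
    client.query.get("Document", ["content", "source"])
    .with_hybrid(query="how do I set a user role?", alpha=0.5)  # 0 = pure BM25, 1 = pure vector
    .with_limit(5)
    .do()
)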

Source prep is most of the work

The main variable was always the source material.

The Archibus and WebCentral knowledge base is spread across help files, schema references, workflow documentation, and implementation guides that accumulated over many release cycles. Not structured for ingestion. Not deduplicated. Not consistently formatted. The useful scripts from that period were mostly not about embeddings. They were about flattening directory trees into a single corpus, deleting noisy or irrelevant files, deduplicating chunks, normalizing sources, merging documents, and running tree-sitter to split code differently from prose.
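
Most of those scripts reduce to plain filesystem work. A representative sketch of the flatten-and-dedupe step; the paths and the naive paragraph split are stand-ins for the real chunker.
import hashlib
from pathlib import Path

seen, corpus = set(), []
for path in Path("kb_source/").rglob("*.txt"):
    text = path.read_text(encoding="utf-8", errors="ignore")
    for chunk in text.split("\n\n"):  # stand-in for the real chunker
        chunk = chunk.strip()
        digest = hashlib.sha256(chunk.encode("utf-8")).hexdigest()
        if chunk and digest not in seen:
            seen.add(digest)
            corpus.append(chunk)

Path("corpus.txt").write_text("\n\n".join(corpus), encoding="utf-8")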

Changing the embedding model moved retrieval quality noticeably less than fixing source prep did. Better-cleaned, better-chunked input reliably beat better models on bad input.

The model was not the bottleneck. The corpus was.

A chat interface is a product question

At some point the command-line retrieval test became a chat interface. Once retrieval worked well enough to trust, the natural next question was: can I build something a person can actually talk to?

Exhibit C: Retrieval as a chat surface. A Streamlit chat interface backed by a LlamaIndex query engine, Milvus, and a Gemini LLM. The retrieval path became something that looked like a product. That shift made the design questions visible.
import streamlit as st

if "messages" not in st.session_state:  # chat history lives in session state
    st.session_state.messages = []
if prompt := st.chat_input("What would you like to know about Archibus?"):
    st.session_state.messages.append({"role": "user", "content": prompt})
    response = query_engine.query(prompt)  # LlamaIndex query engine over Milvus and Gemini
    st.session_state.messages.append({"role": "assistant", "content": str(response)})
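
The query_engine behind that loop was wired up separately. A sketch of the assembly, assuming the newer llama_index package layout; the Milvus URI, collection name, embedding dimension, and model names are placeholders.
from llama_index.core import VectorStoreIndex
from llama_index.embeddings.gemini import GeminiEmbedding
from llama_index.llms.gemini import Gemini
from llama_index.vector_stores.milvus import MilvusVectorStore

vector_store = MilvusVectorStore(uri="http://localhost:19530", collection_name="archibus_docs", dim=768)
index = VectorStoreIndex.from_vector_store(
    vector_store,
    embed_model=GeminiEmbedding(model_name="models/text-embedding-004"),
)
query_engine = index.as_query_engine(llm=Gemini(model="models/gemini-pro"))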

It worked. It also surfaced the product design questions that retrieval alone does not answer: what does the user see when retrieval fails? How does chat history change what the model understands about the question? What scope is right: the whole documentation set, or a narrower slice of it? Those questions did not have answers in the script. They required product decisions.

When retrieval needs memory and tools

The logical next step was also the more interesting one: if retrieval makes the model smarter about a knowledge base, what happens when you give the retrieval system memory and tools?

Exhibit D: The agent loop moment. Workers with vector-backed memory, web search, and file actions. The retrieval step became one tool among several. The loop could plan before it answered. "Retrieve context for a chat answer" started becoming something else.
# Worker, Memory, the action classes, and the admin object come from the agent
# framework of the period; imports and setup are omitted here.
worker = Worker(
    role="Customer Service Agent",
    instructions="Use the knowledge base and tools before answering.",
    memory=Memory(storage=chroma_storage),
    actions=[DuckDuckGoSearch, WebBaseContextTool, ReadFileAction, WriteFileAction],
)

admin.assign_workers([worker])
admin.run("Answer the user's request with the available tools.")

This is where "retrieve context for a chat answer" started turning into something closer to an agent: a system that could reason, remember, and act rather than just look something up. It raised the same design questions as the chat interface, but at larger scope.

What the bench produced

The scripts from that period were never intended to ship. They were for learning. But the learning had a shape.

The useful things that came out of it were not vector store opinions. They were judgments about what actually moves retrieval quality: corpus preparation matters more than model choice; chunk boundaries matter more than most benchmarks measure; the chat interface and the retrieval system are separate design problems that happen to share the same pipeline; agent loops are retrieval extended, not retrieval replaced.

Archibot's later knowledge systems, which eventually took the form of an MCP server, shaped what ended up in product. They were not built from scratch. They were built on top of what the bench learned about cleaning, reshaping, chunking, embedding, storing, retrieving, and asking a better question.

The first useful lesson was not that embeddings are magic. It was that the material had to be reshaped before a model could retrieve it reliably.