Elliot One
Issue #3: Local RAG System using Semantic Kernel, Ollama, and Qdrant
6 min read  |  November 22, 2025

Recent developments in AI, particularly large language models (LLMs), enable applications to combine document retrieval with generative reasoning. Retrieval-Augmented Generation (RAG) is an approach in which a system retrieves relevant documents from a vector database and then generates responses grounded in those documents. This grounding makes outputs more accurate, context-aware, and verifiable, which suits knowledge-intensive domains such as legal, compliance, and technical workflows.

This issue presents a fully local RAG system implemented in C#, using Semantic Kernel to orchestrate the pipeline, Ollama for local LLM hosting, and Qdrant for high-performance vector search. The system combines deterministic retrieval with AI reasoning, providing both precision and interpretability.


System Overview

The core RAG workflow relies on four components:

  • Embeddings: Text is converted into vector representations using nomic-embed-text:latest.
  • Vector Search: Qdrant stores embeddings and performs high-performance semantic similarity queries.
  • LLM Reasoning: Ollama hosts the gemma3:1b model locally to generate context-aware answers.
  • Semantic Kernel: Orchestrates memory, embeddings, vector search, and model calls, enabling deterministic and AI components to work together consistently.

RAG System Architecture

The following diagram illustrates the flow of the Retrieval-Augmented Generation system, showing how a user query is processed, relevant documents are retrieved, and a grounded response is generated by the local LLM.

The workflow is as follows:

  • User Query: The user submits a natural language question via the console interface.
  • Semantic Kernel: Orchestrates the embedding generation, vector search, and LLM reasoning.
  • Embedding Generation: The query is transformed into a vector representation using nomic-embed-text.
  • Vector Search (Qdrant): Retrieves the most relevant document chunks based on semantic similarity.
  • LLM Reasoning (Ollama gemma3:1b): Generates a context-aware response grounded in the retrieved documents.
  • Response & References: The system returns the answer to the user along with links or metadata for the source documents.

Kernel and Memory Configuration

Semantic Kernel centralizes orchestration of AI and memory components. The KernelMemoryBuilder integrates embeddings, vector storage, and model configuration. A typical setup is:
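A minimal sketch of that setup is shown below. It assumes the Microsoft.KernelMemory package with its Ollama and Qdrant connectors; the endpoints are the default local ports for Ollama (11434) and Qdrant (6333), and exact builder APIs may differ slightly between package versions:

```csharp
using Microsoft.KernelMemory;
using Microsoft.KernelMemory.AI.Ollama;

// Build a serverless Kernel Memory instance wired to local Ollama and Qdrant.
var memory = new KernelMemoryBuilder()
    // gemma3:1b handles answer generation via the local Ollama runtime.
    .WithOllamaTextGeneration(new OllamaConfig
    {
        Endpoint = "http://localhost:11434",           // default Ollama endpoint (assumed)
        TextModel = new OllamaModelConfig("gemma3:1b")
    })
    // nomic-embed-text produces the vectors stored in Qdrant.
    .WithOllamaTextEmbeddingGeneration(new OllamaConfig
    {
        Endpoint = "http://localhost:11434",
        EmbeddingModel = new OllamaModelConfig("nomic-embed-text:latest")
    })
    // Qdrant stores the embeddings and serves semantic similarity queries.
    .WithQdrantMemoryDb("http://localhost:6333")       // default Qdrant endpoint (assumed)
    .Build<MemoryServerless>();
```

With this single `memory` instance, all subsequent import and query calls go through one orchestration point rather than separate clients per service.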

This configuration ensures consistent model management, centralized memory access, and simplified orchestration. All AI calls, embeddings, and retrieval steps are coordinated through a single Kernel instance, allowing for maintainable and modular development.

Document Import and Indexing

Legal documents such as NDAs, employment contracts, SLAs, and DPAs are processed into embeddings and stored in QDrant with associated metadata for semantic retrieval. A code snippet for importing a document is:
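A sketch of such an import call follows, assuming Kernel Memory's `ImportDocumentAsync` API and a configured `IKernelMemory` instance; the file path, document id, and tag values are illustrative:

```csharp
using Microsoft.KernelMemory;

// Import one document with a stable id and metadata tags for later retrieval.
static async Task ImportNdaAsync(IKernelMemory memory)
{
    await memory.ImportDocumentAsync(
        filePath: "documents/nda.pdf",   // hypothetical path to a legal document
        documentId: "nda-001",           // unique id, also usable for updates/deletes
        tags: new TagCollection
        {
            { "type", "nda" },           // illustrative tag values
            { "department", "legal" }
        });
}
```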

Each document receives a unique identifier and metadata tags, enabling fast and accurate retrieval during query processing.

Chat Loop and Semantic Search

The application provides an interactive console interface. User queries are augmented with conversation history and sent to Kernel Memory for retrieval and LLM-based response generation:
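The loop can be sketched as follows, assuming a configured `IKernelMemory` instance. `AskAsync` performs the embedding, retrieval, and generation steps internally, and the returned citations supply the source references:

```csharp
using Microsoft.KernelMemory;

// Interactive console loop: each question is answered from memory with citations.
static async Task RunChatLoopAsync(IKernelMemory memory)
{
    while (true)
    {
        Console.Write("You: ");
        var question = Console.ReadLine();
        if (string.IsNullOrWhiteSpace(question)) break;

        // AskAsync embeds the query, searches Qdrant, and calls the LLM.
        var answer = await memory.AskAsync(question);
        Console.WriteLine($"Assistant: {answer.Result}");

        // Each citation points back to a grounding source document.
        foreach (var source in answer.RelevantSources)
            Console.WriteLine($"  source: {source.SourceName}");
    }
}
```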

The system ensures responses are contextually grounded and include references to relevant source documents. Chat history is maintained to preserve context across multiple interactions:
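One simple way to maintain that history is a rolling buffer whose most recent turns are prepended to each query before it is sent to memory; the buffer size and prompt format here are illustrative, not the article's exact implementation:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var history = new List<string>();

// Prepend the most recent turns so retrieval and generation see prior context.
string BuildPrompt(string question)
{
    var context = string.Join("\n", history.TakeLast(6)); // last 3 exchanges (assumed window)
    return context.Length == 0
        ? question
        : $"Conversation so far:\n{context}\n\nQuestion: {question}";
}

// Record both sides of each exchange after the model answers.
void Record(string question, string answer)
{
    history.Add($"User: {question}");
    history.Add($"Assistant: {answer}");
}
```

Capping the window keeps prompts short for a small local model like gemma3:1b while still preserving recent context.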


User Experience

Users can ask questions such as:

  • "What are the key confidentiality clauses in the NDA?"
  • "How can I terminate the employment contract?"
  • "What is the defined uptime in the SLA?"

The system executes the following steps:

  • Embeds the query using nomic-embed-text.
  • Performs a semantic search in Qdrant to retrieve the most relevant document sections.
  • Passes the retrieved content and conversation context to Ollama (gemma3:1b).
  • Generates a grounded, context-aware answer and displays relevant source references.

Architectural Insights

The architecture separates deterministic and AI-driven components. Qdrant handles fast, deterministic semantic retrieval, while Semantic Kernel orchestrates embeddings, memory access, and LLM reasoning. This separation maintains modularity, testability, and predictability, while keeping the AI reasoning focused and context-aware.

Key Advantages

  • Separation of Concerns: Deterministic retrieval and AI reasoning operate independently but are coordinated centrally.
  • Modularity: KernelMemory ensures consistent handling of embeddings, prompts, and AI calls.
  • Local Execution: All models and vector storage run locally, ensuring data privacy and low latency.
  • Extensibility: Additional document sources, models, or interfaces can be incorporated with minimal changes to the core system.

Setup Highlights

  • .NET 9.0 SDK for development and execution.
  • Qdrant running locally for vector storage.
  • Ollama for local AI runtime with nomic-embed-text:latest and gemma3:1b.
  • The codebase is structured with clear separation of concerns: document import (LegalDocumentImporter.cs), configuration (LegalDocConfig.cs), orchestration (LegalDocRagApp.cs), and chat interface (ChatService.cs).

Potential Enhancements

Future improvements could include:

  • Additional document sources to broaden coverage.
  • Web or desktop front-end interfaces.
  • Long-term memory to maintain context across sessions.
  • Natural language query parsing and dynamic indexing pipelines.
  • Adapting the RAG workflow to other knowledge-intensive domains.

Final Notes

This newsletter issue demonstrated a fully local RAG system using Semantic Kernel, Ollama, and Qdrant. By combining deterministic vector search with local LLM reasoning, the system produces accurate, context-aware responses while maintaining transparency through source document references. The architecture is modular, predictable, and extensible, illustrating how deterministic computation and AI reasoning can coexist in a clean, professional codebase.

Explore the source code at the GitHub repository.

See you in the next issue.

Stay curious.
