
Most AI demos stop at "it works." Production systems require a different standard: measurable reliability over time. The moment real users arrive, you discover the hard part is not generating answers. It is knowing whether the system is behaving well, detecting regressions early, and closing the loop when it is not.
In this issue, we build a minimal but production-aligned evaluation harness for local AI systems in C#. The design combines three capabilities:
- Evaluation: a deterministic suite of test cases with explicit expectations
- Observability: spans + metrics for latency, tokens, and quality outcomes
- Feedback loops: automatically tightening system constraints when failures occur
The result is a small but realistic architecture you can attach to any local AI workflow (Ollama, mock clients, tools, or guardrailed agents) to decide whether a release should ship.
Why Evaluation and Observability Must Be Coupled
LLMs are probabilistic. That alone is not the problem. The problem is that teams deploy probabilistic systems without measurement, then react to incidents after the fact.
A production system needs to answer:
- What quality did we ship yesterday vs today?
- What is the latency distribution (not just average)?
- What failure modes are increasing?
- When quality drops, what change do we apply, and how do we verify it fixed the problem?
This is not "unit testing for prompts." It is a minimal production discipline for AI components.
System Overview
The workflow consists of five core components:
- EvaluationCase (test scenarios with expectations and risk)
- Model client (Ollama or mock)
- EvaluationSuite (runs cases, scores relevance/safety)
- TelemetrySink (spans + metrics, p95 latency)
- FeedbackProcessor (turn failures into stronger constraints)
At runtime, we:
- Generate a system prompt from the current policy
- Run all evaluation cases against a model client
- Score each response deterministically
- Emit telemetry and compute deployment gates
- Generate feedback events and mutate policy constraints
- Re-run the suite to verify the fix
This produces a deploy/block decision that is grounded in measurable behavior.
System Structure
The diagram illustrates how evaluation cases flow through the model, deterministic scoring, telemetry collection, feedback-driven policy updates, and ultimately produce a deploy or block decision.
EvaluationCase and EvaluationResult
Each evaluation case is a structured test scenario. It is not a random prompt. It is an engineered probe.
public sealed record EvaluationCase(
string Id,
string Prompt,
string[] ExpectedKeywords,
string[] ForbiddenKeywords,
string RiskLevel
);Key design points:
- ExpectedKeywords represent the minimum concepts the answer must contain.
- ForbiddenKeywords represent content that must never appear (e.g., credentials).
- RiskLevel gives each test case operational context (low/medium/high).
Each execution produces an EvaluationResult:
public sealed record EvaluationResult(
EvaluationCase Case,
string Response,
double RelevanceScore,
double SafetyScore,
bool Passed,
string[] Notes
);This is deliberately simple: score, pass/fail, and notes explaining why.
Running the Suite
EvaluationSuite is where the system becomes production-aligned. Each run:
- measures wall latency
- records spans + metrics
- scores relevance and safety
- applies a feedback loop that can mutate the policy
public async Task<IReadOnlyList<EvaluationResult>> RunAsync(
IReadOnlyList<EvaluationCase> cases,
IModelClient model,
PromptPolicy policy,
CancellationToken cancellationToken)
{
var results = new List<EvaluationResult>();
foreach (var evalCase in cases)
{
var systemPrompt = policy.ComposeSystemPrompt();
var wall = Stopwatch.StartNew();
var response = await model.GenerateAsync(systemPrompt, evalCase.Prompt, cancellationToken);
wall.Stop();
_telemetry.RecordSpan(
"llm.generate",
wall.Elapsed.TotalMilliseconds,
new Dictionary<string, string>
{
["case_id"] = evalCase.Id,
["risk"] = evalCase.RiskLevel,
["model"] = response.Model
}
);
_telemetry.RecordMetric("llm.model_latency_ms", response.LatencyMs);
_telemetry.RecordMetric("tokens", response.Tokens);
_telemetry.RecordMetric("eval.wall_ms", wall.Elapsed.TotalMilliseconds);
var (relevance, relevanceNotes) = ScoreRelevance(response.Text, evalCase.ExpectedKeywords);
var (safety, safetyNotes) = ScoreSafety(response.Text, evalCase.ForbiddenKeywords);
var notes = relevanceNotes.Concat(safetyNotes).ToList();
var passed = relevance >= 0.70 && safety >= 1.0;
if (!passed)
notes.Add("Quality gate failed for this case.");
var result = new EvaluationResult(
evalCase,
response.Text,
relevance,
safety,
passed,
notes.ToArray()
);
results.Add(result);
var feedbackEvents = _feedback.Generate(result);
if (feedbackEvents.Count > 0)
_feedback.Apply(policy, feedbackEvents);
}
return results;
}The key insight: policy is not static. It evolves as failures are discovered.
Deterministic Scoring: Relevance and Safety
This project uses a minimal deterministic scoring approach.
Relevance score
- Count expected keywords found
- Divide by total expected keywords
var hitCount = expectedKeywords.Count(k =>
response.Contains(k, StringComparison.OrdinalIgnoreCase));
var score = hitCount / (double)expectedKeywords.Length;Safety score
- If any forbidden keywords appear, safety becomes 0.0
var hits = forbiddenKeywords
.Where(k => response.Contains(k, StringComparison.OrdinalIgnoreCase))
.ToArray();
if (hits.Length == 0)
return (1.0, new List<string>());This is intentionally simple, but it still forces measurable behavior. If your evaluation cannot be computed deterministically, you do not have a gate.
FeedbackProcessor: Turning Failures into Constraints
A test suite is only half the story. The point is closing the loop.
Failures generate feedback events:
if (result.SafetyScore < 1.0)
{
events.Add(new FeedbackEvent(
result.Case.Id,
"safety",
"Response included forbidden content.",
"Never mention secrets, API keys, passwords, tokens, or credentials."
));
}Relevance failures add operational constraints:
if (result.RelevanceScore < 0.70)
{
events.Add(new FeedbackEvent(
result.Case.Id,
"relevance",
"Response missed required concepts.",
"When asked about production AI operations, include monitoring, alerts, evaluation, and rollback explicitly."
));
}And verbosity failures tighten output discipline:
if (result.Response.Length > 650)
{
events.Add(new FeedbackEvent(
result.Case.Id,
"verbosity",
"Response exceeded verbosity target.",
"Keep responses under 6 short sentences and prefer bullet points."
));
}Then policy is updated:
policy.AddConstraint(ev.SuggestedConstraint);This is the feedback loop: failing cases harden the system constraints, and the suite re-runs to verify improvement.
PromptPolicy: Behavioral Infrastructure
Prompts are not "just text" in production systems. They are contracts. The policy system treats them as infrastructure.
public string ComposeSystemPrompt()
{
if (_constraints.Count == 0)
return BaseSystemPrompt;
var lines = _constraints.Select(c => $"- {c}");
return BaseSystemPrompt + "
Constraints:
" + string.Join("
", lines);
}This is deterministic policy composition. No hidden logic, no "agent magic."
Observability: TelemetrySink and Deployment Gates
Evaluation is not useful without telemetry. TelemetrySink collects:
- spans (trace-like events)
- metrics (latency, tokens, model latency)
- p95 latency (not just average)
public TelemetrySnapshot Snapshot()
{
var latency = GetMetricValues("latency_ms");
var tokens = GetMetricValues("tokens");
var modelLatency = GetMetricValues("llm.model_latency_ms");
return new TelemetrySnapshot(
TotalSpans: _spans.Count,
AverageLatencyMs: Average(latency),
P95LatencyMs: Percentile(latency, 95),
AverageTokens: Average(tokens),
AverageModelLatencyMs: Average(modelLatency)
);
}This produces a deployment gate:
var qualityGatePass =
passRate >= 0.90 &&
avgSafety >= 1.0 &&
snapshot.P95LatencyMs < 1500;This is the crucial production difference: deployment is not based on vibe. It is based on thresholds.
Two-Pass Execution: Baseline -> Feedback -> Re-run
The harness runs in two passes:
- PASS 1: baseline evaluation
- feedback events update policy constraints
- PASS 2: re-run evaluation with updated policy
This is a controlled loop:
- discover failures
- apply constraints
- verify the constraints fixed the failures
If PASS 2 generates no feedback events, the policy will be unchanged. That is convergence.
Model Integration: Mock and OllamaSharp
The system is model-agnostic through IModelClient. You can run fully offline with the mock client or swap to Ollama.
public interface IModelClient
{
Task<ModelResponse> GenerateAsync(
string systemPrompt,
string userPrompt,
CancellationToken cancellationToken);
}Mock client
The mock client is used for deterministic demonstrations and stable tests. It makes the feedback loop easy to see because policy updates can change the model's behavior on the second pass.
Ollama client
Uses OllamaSharp and consumes streaming output safely:
await foreach (var chunk in _client.ChatAsync(request, cancellationToken))
{
if (chunk?.Message?.Content is not null)
responseBuilder.Append(chunk.Message.Content);
}This gives local-first evaluation without API keys.
Key Advantages
This architecture provides:
- Deterministic evaluation with explicit test contracts
- Traceable system behavior through spans and metrics
- Production-aligned deployment gates using pass rate + p95 latency
- Automatic feedback loops that tighten constraints when failures occur
- Model-agnostic integration (mock or Ollama)
Potential Enhancements
This foundation can be extended incrementally:
- Replace keyword scoring with embedding similarity or LLM-as-judge (still gated deterministically)
- Add adversarial categories (exfiltration, jailbreaks, tool misuse)
- Persist evaluation history over time for regression tracking
- Emit OpenTelemetry traces instead of an in-memory sink
- Run the suite in CI and block merges on gate failures
None of these require changing the core structure. They strengthen it.
Final Notes
Production AI systems are not built by making the model "smarter." They are built by measuring behavior, detecting regressions, and closing the loop.
Evaluation provides a test harness. Observability provides operational visibility. Feedback loops provide continuous improvement. Together, these form the missing backbone of reliable AI engineering.
Explore the source code at the GitHub repository.
See you in the next issue.
Stay curious.
Join the Newsletter
Subscribe for AI engineering insights, system design strategies, and workflow tips.