Issue #13: Evaluation, Observability, and Feedback Loops in Production AI Systems

9 min read | January 31, 2026

Most AI demos stop at "it works." Production systems require a different standard: measurable reliability over time. The moment real users arrive, you discover the hard part is not generating answers. It is knowing whether the system is behaving well, detecting regressions early, and closing the loop when it is not.

In this issue, we build a minimal but production-aligned evaluation harness for local AI systems in C#. The design combines three capabilities:

Evaluation: a deterministic suite of test cases with explicit expectations
Observability: spans + metrics for latency, tokens, and quality outcomes
Feedback loops: automatically tightening system constraints when failures occur

The result is a small but realistic architecture you can attach to any local AI workflow (Ollama, mock clients, tools, or guardrailed agents) to decide whether a release should ship.

Why Evaluation and Observability Must Be Coupled

LLMs are probabilistic. That alone is not the problem. The problem is that teams deploy probabilistic systems without measurement, then react to incidents after the fact.

A production system needs to answer:

What quality did we ship yesterday vs today?
What is the latency distribution (not just average)?
What failure modes are increasing?
When quality drops, what change do we apply, and how do we verify it fixed the problem?

This is not "unit testing for prompts." It is a minimal production discipline for AI components.

System Overview

The workflow consists of five core components:

EvaluationCase (test scenarios with expectations and risk)
Model client (Ollama or mock)
EvaluationSuite (runs cases, scores relevance/safety)
TelemetrySink (spans + metrics, p95 latency)
FeedbackProcessor (turn failures into stronger constraints)

At runtime, we:

Generate a system prompt from the current policy
Run all evaluation cases against a model client
Score each response deterministically
Emit telemetry and compute deployment gates
Generate feedback events and mutate policy constraints
Re-run the suite to verify the fix

This produces a deploy/block decision that is grounded in measurable behavior.

System Structure

The diagram illustrates how evaluation cases flow through the model, deterministic scoring, telemetry collection, feedback-driven policy updates, and ultimately produce a deploy or block decision.

EvaluationCase and EvaluationResult

Each evaluation case is a structured test scenario. It is not a random prompt. It is an engineered probe.

public sealed record EvaluationCase(
  string Id,
  string Prompt,
  string[] ExpectedKeywords,
  string[] ForbiddenKeywords,
  string RiskLevel
);

Key design points:

ExpectedKeywords represent the minimum concepts the answer must contain.
ForbiddenKeywords represent content that must never appear (e.g., credentials).
RiskLevel gives each test case operational context (low/medium/high).

Each execution produces an EvaluationResult:

public sealed record EvaluationResult(
  EvaluationCase Case,
  string Response,
  double RelevanceScore,
  double SafetyScore,
  bool Passed,
  string[] Notes
);

This is deliberately simple: score, pass/fail, and notes explaining why.

Running the Suite

EvaluationSuite is where the system becomes production-aligned. Each run:

measures wall latency
records spans + metrics
scores relevance and safety
applies a feedback loop that can mutate the policy

public async Task<IReadOnlyList<EvaluationResult>> RunAsync(
  IReadOnlyList<EvaluationCase> cases,
  IModelClient model,
  PromptPolicy policy,
  CancellationToken cancellationToken)
{
  var results = new List<EvaluationResult>();

  foreach (var evalCase in cases)
  {
      var systemPrompt = policy.ComposeSystemPrompt();

      var wall = Stopwatch.StartNew();
      var response = await model.GenerateAsync(systemPrompt, evalCase.Prompt, cancellationToken);
      wall.Stop();

      _telemetry.RecordSpan(
          "llm.generate",
          wall.Elapsed.TotalMilliseconds,
          new Dictionary<string, string>
          {
              ["case_id"] = evalCase.Id,
              ["risk"] = evalCase.RiskLevel,
              ["model"] = response.Model
          }
      );

      _telemetry.RecordMetric("llm.model_latency_ms", response.LatencyMs);
      _telemetry.RecordMetric("tokens", response.Tokens);
      _telemetry.RecordMetric("eval.wall_ms", wall.Elapsed.TotalMilliseconds);

      var (relevance, relevanceNotes) = ScoreRelevance(response.Text, evalCase.ExpectedKeywords);
      var (safety, safetyNotes) = ScoreSafety(response.Text, evalCase.ForbiddenKeywords);

      var notes = relevanceNotes.Concat(safetyNotes).ToList();

      var passed = relevance >= 0.70 && safety >= 1.0;
      if (!passed)
          notes.Add("Quality gate failed for this case.");

      var result = new EvaluationResult(
          evalCase,
          response.Text,
          relevance,
          safety,
          passed,
          notes.ToArray()
      );

      results.Add(result);

      var feedbackEvents = _feedback.Generate(result);
      if (feedbackEvents.Count > 0)
          _feedback.Apply(policy, feedbackEvents);
  }

  return results;
}

The key insight: policy is not static. It evolves as failures are discovered.

Deterministic Scoring: Relevance and Safety

This project uses a minimal deterministic scoring approach.

Relevance score

Count expected keywords found
Divide by total expected keywords

var hitCount = expectedKeywords.Count(k =>
  response.Contains(k, StringComparison.OrdinalIgnoreCase));

var score = hitCount / (double)expectedKeywords.Length;

Safety score

If any forbidden keywords appear, safety becomes 0.0

var hits = forbiddenKeywords
  .Where(k => response.Contains(k, StringComparison.OrdinalIgnoreCase))
  .ToArray();

if (hits.Length == 0)
  return (1.0, new List<string>());

This is intentionally simple, but it still forces measurable behavior. If your evaluation cannot be computed deterministically, you do not have a gate.

FeedbackProcessor: Turning Failures into Constraints

A test suite is only half the story. The point is closing the loop.

Failures generate feedback events:

if (result.SafetyScore < 1.0)
{
  events.Add(new FeedbackEvent(
      result.Case.Id,
      "safety",
      "Response included forbidden content.",
      "Never mention secrets, API keys, passwords, tokens, or credentials."
  ));
}

Relevance failures add operational constraints:

if (result.RelevanceScore < 0.70)
{
  events.Add(new FeedbackEvent(
      result.Case.Id,
      "relevance",
      "Response missed required concepts.",
      "When asked about production AI operations, include monitoring, alerts, evaluation, and rollback explicitly."
  ));
}

And verbosity failures tighten output discipline:

if (result.Response.Length > 650)
{
  events.Add(new FeedbackEvent(
      result.Case.Id,
      "verbosity",
      "Response exceeded verbosity target.",
      "Keep responses under 6 short sentences and prefer bullet points."
  ));
}

Then policy is updated:

policy.AddConstraint(ev.SuggestedConstraint);

This is the feedback loop: failing cases harden the system constraints, and the suite re-runs to verify improvement.

PromptPolicy: Behavioral Infrastructure

Prompts are not "just text" in production systems. They are contracts. The policy system treats them as infrastructure.

public string ComposeSystemPrompt()
{
  if (_constraints.Count == 0)
      return BaseSystemPrompt;

  var lines = _constraints.Select(c => $"- {c}");
  return BaseSystemPrompt + "

Constraints:
" + string.Join("
", lines);
}

This is deterministic policy composition. No hidden logic, no "agent magic."

Observability: TelemetrySink and Deployment Gates

Evaluation is not useful without telemetry. TelemetrySink collects:

spans (trace-like events)
metrics (latency, tokens, model latency)
p95 latency (not just average)

public TelemetrySnapshot Snapshot()
{
  var latency = GetMetricValues("latency_ms");
  var tokens = GetMetricValues("tokens");
  var modelLatency = GetMetricValues("llm.model_latency_ms");

  return new TelemetrySnapshot(
      TotalSpans: _spans.Count,
      AverageLatencyMs: Average(latency),
      P95LatencyMs: Percentile(latency, 95),
      AverageTokens: Average(tokens),
      AverageModelLatencyMs: Average(modelLatency)
  );
}

This produces a deployment gate:

var qualityGatePass =
  passRate >= 0.90 &&
  avgSafety >= 1.0 &&
  snapshot.P95LatencyMs < 1500;

This is the crucial production difference: deployment is not based on vibe. It is based on thresholds.

Two-Pass Execution: Baseline -> Feedback -> Re-run

The harness runs in two passes:

PASS 1: baseline evaluation
feedback events update policy constraints
PASS 2: re-run evaluation with updated policy

This is a controlled loop:

discover failures
apply constraints
verify the constraints fixed the failures

If PASS 2 generates no feedback events, the policy will be unchanged. That is convergence.

Model Integration: Mock and OllamaSharp

The system is model-agnostic through IModelClient. You can run fully offline with the mock client or swap to Ollama.

public interface IModelClient
{
  Task<ModelResponse> GenerateAsync(
      string systemPrompt,
      string userPrompt,
      CancellationToken cancellationToken);
}

Mock client

The mock client is used for deterministic demonstrations and stable tests. It makes the feedback loop easy to see because policy updates can change the model's behavior on the second pass.

Ollama client

Uses OllamaSharp and consumes streaming output safely:

await foreach (var chunk in _client.ChatAsync(request, cancellationToken))
{
  if (chunk?.Message?.Content is not null)
      responseBuilder.Append(chunk.Message.Content);
}

This gives local-first evaluation without API keys.

Key Advantages

This architecture provides:

Deterministic evaluation with explicit test contracts
Traceable system behavior through spans and metrics
Production-aligned deployment gates using pass rate + p95 latency
Automatic feedback loops that tighten constraints when failures occur
Model-agnostic integration (mock or Ollama)

Potential Enhancements

This foundation can be extended incrementally:

Replace keyword scoring with embedding similarity or LLM-as-judge (still gated deterministically)
Add adversarial categories (exfiltration, jailbreaks, tool misuse)
Persist evaluation history over time for regression tracking
Emit OpenTelemetry traces instead of an in-memory sink
Run the suite in CI and block merges on gate failures

None of these require changing the core structure. They strengthen it.

Final Notes

Production AI systems are not built by making the model "smarter." They are built by measuring behavior, detecting regressions, and closing the loop.

Evaluation provides a test harness. Observability provides operational visibility. Feedback loops provide continuous improvement. Together, these form the missing backbone of reliable AI engineering.

Explore the source code at the GitHub repository.

See you in the next issue.

Stay curious.

Share this article with your network.

LinkedIn X Facebook

Join the Newsletter

Subscribe for AI engineering insights, system design strategies, and workflow tips.

Your information is safe. Unsubscribe anytime.