
Most prompt engineering is still done like a demo workflow. Someone edits a system prompt, eyeballs two or three outputs, and decides the new version feels better.
That is not a stable engineering loop. A prompt change can alter classification boundaries, output structure, refusal behavior, and operational safety. If the change matters, it should be versioned, replayed on a frozen evaluation set, and promoted only when it wins on explicit metrics.
In this issue, we build a local-first prompt experiment harness in C#. The model stays local through LM Studio. Deterministic code owns prompt loading, evaluation cases, structured response validation, keyword and safety scoring, variant ranking, and promotion guidance.
What You Are Building
You are building a prompt evaluation workflow that compares multiple prompt versions against the same fixed task set before promotion:
- Load runtime config from appsettings.json and PROMPTAB_ environment overrides
- Load prompt variants from versioned text files
- Load a frozen evaluation set from JSON
- Run each case through each prompt version against a live local model or deterministic mock client
- Require structured JSON output from the model
- Score category accuracy, priority accuracy, keyword coverage, and safety violations
- Rank prompt variants deterministically and recommend a winner
This is prompt change control, not vibes-based prompt tweaking.
System Structure
The harness is intentionally simple and deterministic around the model path:
- Config path: runtime mode, model endpoint, and prompt file list load first.
- Prompt path: each variant is read from a file and assigned a stable version ID.
- Execution path: the same frozen cases are replayed against every prompt version.
- Scoring path: model output is normalized, parsed, validated, and scored against the contract.
- Ranking path: variants are sorted by pass rate, then accuracy, then latency.
The diagram below shows the high-level control flow:
Runtime Configuration First
The app starts by loading experiment and model configuration before any prompt execution:
var configuration = new ConfigurationBuilder()
.SetBasePath(AppContext.BaseDirectory)
.AddJsonFile("appsettings.json", optional: true, reloadOnChange: false)
.AddEnvironmentVariables(prefix: "PROMPTAB_")
.Build();
var experimentConfig = PromptExperimentConfig.Load(configuration);
experimentConfig.Validate();
var llmConfig = LlmAppConfig.Load(configuration);
if (!experimentConfig.UseMockClient)
{
llmConfig.Validate();
}

The default live profile used in this repo:
{
"Experiment": {
"UseMockClient": false,
"DatasetPath": "data/support_ticket_eval.json",
"PromptDirectory": "prompts",
"PromptFiles": [
"support_triage_v1.txt",
"support_triage_v2.txt"
]
},
"Llm": {
"Provider": "lmstudio",
"BaseUrl": "http://192.168.0.20:1234/v1",
"ApiKey": "not-needed",
"ModelId": "openai/gpt-oss-20b",
"Temperature": 0
}
}

Prompt evaluation should be reproducible. That starts with explicit runtime configuration, not hidden local state.
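PromptExperimentConfig itself is not listed in the article. A minimal sketch of the shape it would need, assuming the property names follow the "Experiment" section above and that Load binds via the standard configuration binder (the repo's actual type may differ):

```csharp
using System;
using Microsoft.Extensions.Configuration;

// Hypothetical sketch of the experiment config; the real type may differ.
public sealed class PromptExperimentConfig
{
    public bool UseMockClient { get; init; }
    public string DatasetPath { get; init; } = string.Empty;
    public string PromptDirectory { get; init; } = "prompts";
    public string[] PromptFiles { get; init; } = [];

    // Binds the "Experiment" section (Microsoft.Extensions.Configuration.Binder).
    public static PromptExperimentConfig Load(IConfiguration configuration) =>
        configuration.GetSection("Experiment").Get<PromptExperimentConfig>()
            ?? new PromptExperimentConfig();

    // Fail fast before any model call: a run without a dataset or without
    // at least two variants cannot produce a meaningful A/B comparison.
    public void Validate()
    {
        if (string.IsNullOrWhiteSpace(DatasetPath))
            throw new InvalidOperationException("Experiment:DatasetPath is required.");
        if (PromptFiles.Length < 2)
            throw new InvalidOperationException("At least two prompt files are required for an A/B run.");
    }
}
```

Validating before execution is what makes a misconfigured run fail loudly instead of producing an empty, misleading report.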
Prompt Versions Are Deployable Artifacts
Prompt variants are loaded from files and assigned stable identifiers:
public sealed record PromptVariant(string Id, string DisplayName, string SystemPrompt);

foreach (var promptFile in promptFiles)
{
var fullPath = ResolvePath(promptFile);
var id = Path.GetFileNameWithoutExtension(promptFile);
var displayName = string.Join(" ", id.Split('_').Select(TitleCaseToken));
var systemPrompt = File.ReadAllText(fullPath).Trim();
variants.Add(new PromptVariant(id, displayName, systemPrompt));
}

That matters because prompts stop being inline strings and become controlled artifacts with diffs, review, rollback, and promotion history.
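The loop above calls TitleCaseToken, which is not shown. A plausible one-liner (an assumption, not the repo's actual helper) that produces display names like "Support Triage V2" from the file IDs:

```csharp
using System;
using System.Linq;

// Hypothetical helper: uppercases the first character of each filename token,
// turning an ID like "support_triage_v1" into the display name "Support Triage V1".
static string TitleCaseToken(string token) =>
    token.Length == 0 ? token : char.ToUpperInvariant(token[0]) + token[1..];

Console.WriteLine(string.Join(" ", "support_triage_v1".Split('_').Select(TitleCaseToken)));
```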
In this repo, support_triage_v1.txt is intentionally more generic. support_triage_v2.txt is stricter about taxonomy, priority rules, and secret-handling constraints.
The Eval Set Is the Contract
Each evaluation case defines what the model is expected to do, not just how the answer should sound:
public sealed class EvalCase
{
public string Id { get; init; } = string.Empty;
public string UserMessage { get; init; } = string.Empty;
public string ExpectedCategory { get; init; } = string.Empty;
public string ExpectedPriority { get; init; } = string.Empty;
public string[] MustContainKeywords { get; init; } = [];
public string[] ForbiddenKeywords { get; init; } = [];
}

A representative case from the sample dataset:
{
"Id": "PT-001",
"UserMessage": "Our production chat assistant is timing out for every customer after the latest deployment. What should support do first?",
"ExpectedCategory": "incident",
"ExpectedPriority": "P1",
"MustContainKeywords": [
"status page",
"incident commander"
],
"ForbiddenKeywords": [
"password"
]
}

This is what makes the evaluation useful. A prompt can return valid JSON and still fail because it picked the wrong priority or omitted critical action language.
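Loading the frozen set is a single deserialization step. A generic sketch (LoadDataset is an assumed name; in the harness T would be EvalCase[]) that fails fast, so a missing or truncated dataset aborts the run instead of silently shrinking it:

```csharp
using System;
using System.IO;
using System.Text.Json;

// Hypothetical loader sketch; the repo's actual loader may differ.
static T LoadDataset<T>(string datasetPath) where T : class
{
    if (!File.Exists(datasetPath))
        throw new FileNotFoundException($"Eval set not found: {datasetPath}");
    var json = File.ReadAllText(datasetPath);
    return JsonSerializer.Deserialize<T>(json)
        ?? throw new InvalidOperationException($"Eval set at {datasetPath} is empty or invalid.");
}
```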
Live Local Models Through Semantic Kernel
The default execution path uses Semantic Kernel against an OpenAI-compatible local endpoint such as LM Studio:
_kernel = Kernel.CreateBuilder()
.AddOpenAIChatCompletion(
modelId: config.ModelId,
apiKey: config.ApiKey,
endpoint: new Uri(config.BaseUrl))
.Build();

Prompt execution is simple and measurable:
var prompt = $"""
<system>
{variant.SystemPrompt}
</system>
<user>
{evalCase.UserMessage}
</user>
""";
var stopwatch = Stopwatch.StartNew();
var result = await _kernel.InvokePromptAsync(
prompt,
new KernelArguments(settings),
cancellationToken: cancellationToken);
stopwatch.Stop();

This keeps the model path local while leaving the evaluation logic framework-agnostic.
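When UseMockClient is true, the same execution path needs a deterministic stand-in. A minimal sketch of such a mock (the function name and canned payload are assumptions, not the repo's actual client): it returns a schema-valid triage decision so the scoring and ranking paths can be exercised offline with zero model variance between runs.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical deterministic mock: always returns the same schema-valid
// triage decision, regardless of the prompt or the user message.
static Task<string> MockCompleteAsync(
    string systemPrompt, string userMessage, CancellationToken ct = default) =>
    Task.FromResult("""
        {"category":"incident","priority":"P1","suggestedReply":"Post to the status page and page the incident commander."}
        """);

var response = await MockCompleteAsync("system prompt", "user message");
Console.WriteLine(response);
```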
Scoring Is Deterministic
The scorer normalizes model output, parses JSON, validates the response contract, and then checks exact task expectations:
if (!TriageDecision.TryParse(normalized, out var decision, out var parseError))
{
notes.Add($"Invalid JSON response: {parseError}");
return new PromptEvaluationResult(
variant,
evalCase,
response,
StructuredOutputValid: false,
CategoryMatch: false,
PriorityMatch: false,
KeywordCoverage: 0.0,
SafetyPassed: false,
Passed: false,
Notes: [.. notes],
Decision: null);
}

Keyword coverage and safety are enforced explicitly:
var keywordCoverage = ScoreKeywordCoverage(decision!.SuggestedReply, evalCase.MustContainKeywords, notes);
var safetyPassed = ScoreSafety(decision.SuggestedReply, evalCase.ForbiddenKeywords, notes);
var passed = structuredOutputValid
&& categoryMatch
&& priorityMatch
&& keywordCoverage >= 1.0
&& safetyPassed;

This is the core idea of the repo: prompt evaluation should be deterministic around the model output, not hand-waved after the fact.
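ScoreKeywordCoverage and ScoreSafety are not listed above. Plausible implementations, assuming case-insensitive substring matching over the suggested reply (the repo's actual matching rules may be stricter):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical: fraction of required keywords present in the reply.
static double ScoreKeywordCoverage(string reply, string[] mustContain, List<string> notes)
{
    if (mustContain.Length == 0) return 1.0;
    var missing = mustContain
        .Where(k => !reply.Contains(k, StringComparison.OrdinalIgnoreCase))
        .ToArray();
    foreach (var keyword in missing)
        notes.Add($"Missing required keyword: {keyword}");
    return (double)(mustContain.Length - missing.Length) / mustContain.Length;
}

// Hypothetical: true only when no forbidden term appears in the reply.
static bool ScoreSafety(string reply, string[] forbidden, List<string> notes)
{
    var violations = forbidden
        .Where(k => reply.Contains(k, StringComparison.OrdinalIgnoreCase))
        .ToArray();
    foreach (var keyword in violations)
        notes.Add($"Forbidden keyword present: {keyword}");
    return violations.Length == 0;
}

var notes = new List<string>();
var reply = "Check the status page and page the incident commander.";
Console.WriteLine(ScoreKeywordCoverage(reply, new[] { "status page", "incident commander" }, notes)); // 1
Console.WriteLine(ScoreSafety(reply, new[] { "password" }, notes)); // True
```

Substring matching is deliberately simple; it makes low coverage scores easy to audit by reading the notes list.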
Promotion Is Ranked, Not Guessed
After all runs complete, variants are ranked deterministically:
var summaries = variants
.Select(variant => BuildSummary(variant, results.Where(result => result.Variant.Id == variant.Id).ToArray()))
.OrderByDescending(summary => summary.PassRate)
.ThenByDescending(summary => summary.CategoryAccuracy)
.ThenByDescending(summary => summary.PriorityAccuracy)
.ThenByDescending(summary => summary.StructuredOutputRate)
.ThenBy(summary => summary.AverageLatencyMs)
.ToArray();

Pass rate is primary. Accuracy comes next. Latency only breaks ties after behavioral quality has already been decided.
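BuildSummary is referenced but not shown. A self-contained sketch of the aggregation it would need, using tuples instead of the repo's presumable record types, and assuming a LatencyMs value is captured on each result:

```csharp
using System;
using System.Linq;

// Hypothetical aggregation behind BuildSummary; field names mirror the
// OrderBy chain used for ranking.
static (double PassRate, double CategoryAccuracy, double PriorityAccuracy,
        double StructuredOutputRate, double AverageLatencyMs) BuildSummary(
    (bool Passed, bool CategoryMatch, bool PriorityMatch,
     bool StructuredOutputValid, double LatencyMs)[] results) =>
(
    results.Average(r => r.Passed ? 1.0 : 0.0),
    results.Average(r => r.CategoryMatch ? 1.0 : 0.0),
    results.Average(r => r.PriorityMatch ? 1.0 : 0.0),
    results.Average(r => r.StructuredOutputValid ? 1.0 : 0.0),
    results.Average(r => r.LatencyMs)
);

var summary = BuildSummary(new[]
{
    (Passed: true,  CategoryMatch: true, PriorityMatch: true,  StructuredOutputValid: true, LatencyMs: 1500.0),
    (Passed: false, CategoryMatch: true, PriorityMatch: false, StructuredOutputValid: true, LatencyMs: 2100.0),
});
Console.WriteLine(summary.PassRate); // 0.5
```

Each metric is an average over the variant's results, which keeps the summaries comparable regardless of eval set size.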
Reading a Real Live Run
A real run against openai/gpt-oss-20b through LM Studio produced:
Prompt Versioning and A/B Evaluation
Mode: live
Dataset: 5 evaluation cases
Variants: Support Triage V1, Support Triage V2
Provider: lmstudio
Model: openai/gpt-oss-20b
Endpoint: http://192.168.0.20:1234/v1
Support Triage V2
- Pass rate: 20%
- Structured output rate: 100%
- Category accuracy: 100%
- Priority accuracy: 80%
- Keyword coverage: 27%
- Safety pass rate: 100%
- Average latency: 1946 ms
Support Triage V1
- Pass rate: 0%
- Structured output rate: 100%
- Category accuracy: 100%
- Priority accuracy: 60%
- Keyword coverage: 13%
- Safety pass rate: 100%
- Average latency: 1524 ms
Winner: Support Triage V2

How to interpret this:
- The live model returned valid structured JSON for every case, which means the output contract is working
- Category accuracy was strong for both prompts, so taxonomy guidance was mostly clear
- The main weakness was not JSON validity but action specificity, shown by low keyword coverage
- support_triage_v2 still won clearly, but the low pass rate shows it should not be promoted blindly without another revision
This is exactly why the harness is useful. It prevents a prompt from being promoted just because the outputs look acceptable in a few ad-hoc spot checks.
Why This Architecture Works
The model remains useful, but prompt promotion authority stays in deterministic code:
- Prompt versions are explicit files instead of hidden inline strings
- Evaluation cases are frozen and replayable
- Structured output validity is enforced instead of assumed
- Required action language and forbidden terms are measured directly
- Variant ranking is deterministic and inspectable
- Live model behavior can be tested locally before rollout
Potential Enhancements
To extend this project further, consider:
- Add more prompt variants and compare them in the same harness
- Add larger eval sets with adversarial and edge-case prompts
- Add rubric-style scoring for refusal quality or citation quality
- Add prompt metadata such as owner, version date, and release notes
- Add promotion thresholds that fail the run when pass rate drops below a fixed floor
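The last enhancement is only a few lines of deterministic code. A sketch of such a promotion gate (the 0.8 floor is an illustrative value, not from the repo):

```csharp
using System;

// Hypothetical promotion gate: block promotion when the winning variant's
// pass rate falls below a fixed floor. The floor value is illustrative.
static int EvaluatePromotion(double winnerPassRate, double floor)
{
    if (winnerPassRate < floor)
    {
        Console.Error.WriteLine(
            $"Promotion blocked: pass rate {winnerPassRate:P0} is below the {floor:P0} floor.");
        return 1; // non-zero exit code fails a CI job
    }
    Console.WriteLine("Promotion allowed.");
    return 0;
}

// The live run above would be blocked: the winner passed only 20% of cases.
Console.WriteLine(EvaluatePromotion(0.20, 0.80));
```

Wiring the return value into the process exit code turns the harness into a CI quality gate rather than a report that can be ignored.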
Final Notes
Prompt engineering becomes much more useful when prompt changes are treated as controlled system changes instead of informal edits.
When prompts are versioned, replayed on a frozen eval set, scored deterministically, and promoted by measured behavior, you get a real engineering loop instead of a demo loop.
Explore the source code at the GitHub repository.
See you in the next issue.
Stay curious.