
Most prompt engineering is still done like a demo workflow. Someone edits a system prompt, eyeballs two or three outputs, and decides the new version feels better.
That is not a stable engineering loop. A prompt change can alter classification boundaries, output structure, refusal behavior, and operational safety. If the change matters, it should be versioned, replayed on a frozen evaluation set, and promoted only when it wins on explicit metrics.
In this issue, we build a local-first prompt experiment harness in C#. The model stays local through LM Studio. Deterministic code owns prompt loading, evaluation cases, structured response validation, keyword and safety scoring, variant ranking, and promotion guidance.
What You Are Building
You are building a prompt evaluation workflow that compares multiple prompt versions against the same fixed task set before promotion:
- Load runtime config from appsettings.json and PROMPTAB_ environment overrides
- Load prompt variants from versioned text files
- Load a frozen evaluation set from JSON
- Run each case through each prompt version against a live local model or deterministic mock client
- Require structured JSON output from the model
- Score category accuracy, priority accuracy, keyword coverage, and safety violations
- Rank prompt variants deterministically and recommend a winner
This is prompt change control, not vibes-based prompt tweaking.
System Structure
The harness is intentionally simple and deterministic around the model path:
- Config path: runtime mode, model endpoint, and prompt file list load first.
- Prompt path: each variant is read from a file and assigned a stable version ID.
- Execution path: the same frozen cases are replayed against every prompt version.
- Scoring path: model output is normalized, parsed, validated, and scored against the contract.
- Ranking path: variants are sorted by pass rate, then accuracy, then latency.
The diagram below shows the high-level control flow:
Runtime Configuration First
The app starts by loading experiment and model configuration before any prompt execution:
var configuration = new ConfigurationBuilder()
.SetBasePath(AppContext.BaseDirectory)
.AddJsonFile("appsettings.json", optional: true, reloadOnChange: false)
.AddEnvironmentVariables(prefix: "PROMPTAB_")
.Build();
var experimentConfig = PromptExperimentConfig.Load(configuration);
experimentConfig.Validate();
var llmConfig = LlmAppConfig.Load(configuration);
if (!experimentConfig.UseMockClient)
{
llmConfig.Validate();
}

The default live profile used in this repo:
{
"Experiment": {
"UseMockClient": false,
"DatasetPath": "data/support_ticket_eval.json",
"PromptDirectory": "prompts",
"PromptFiles": [
"support_triage_v1.txt",
"support_triage_v2.txt"
]
},
"Llm": {
"Provider": "lmstudio",
"BaseUrl": "http://192.168.0.20:1234/v1",
"ApiKey": "not-needed",
"ModelId": "openai/gpt-oss-20b",
"Temperature": 0
}
}

Prompt evaluation should be reproducible. That starts with explicit runtime configuration, not hidden local state.
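PromptExperimentConfig itself is not listed in the article. A minimal sketch of the shape it would need, assuming the property names follow the "Experiment" section above and that Load binds via the standard configuration binder (the repo's actual type may differ):

```csharp
using System;
using Microsoft.Extensions.Configuration;

// Hypothetical sketch of the experiment config; the real type may differ.
public sealed class PromptExperimentConfig
{
    public bool UseMockClient { get; init; }
    public string DatasetPath { get; init; } = string.Empty;
    public string PromptDirectory { get; init; } = "prompts";
    public string[] PromptFiles { get; init; } = [];

    // Binds the "Experiment" section (Microsoft.Extensions.Configuration.Binder).
    public static PromptExperimentConfig Load(IConfiguration configuration) =>
        configuration.GetSection("Experiment").Get<PromptExperimentConfig>()
            ?? new PromptExperimentConfig();

    // Fail fast before any model call: a run without a dataset or without
    // at least two variants cannot produce a meaningful A/B comparison.
    public void Validate()
    {
        if (string.IsNullOrWhiteSpace(DatasetPath))
            throw new InvalidOperationException("Experiment:DatasetPath is required.");
        if (PromptFiles.Length < 2)
            throw new InvalidOperationException("At least two prompt files are required for an A/B run.");
    }
}
```

Validating before execution is what makes a misconfigured run fail loudly instead of producing an empty, misleading report.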
Prompt Versions Are Deployable Artifacts
Prompt variants are loaded from files and assigned stable identifiers:
public sealed record PromptVariant(string Id, string DisplayName, string SystemPrompt);

foreach (var promptFile in promptFiles)
{
var fullPath = ResolvePath(promptFile);
var id = Path.GetFileNameWithoutExtension(promptFile);
var displayName = string.Join(" ", id.Split('_').Select(TitleCaseToken));
var systemPrompt = File.ReadAllText(fullPath).Trim();
variants.Add(new PromptVariant(id, displayName, systemPrompt));
}

That matters because prompts stop being inline strings and become controlled artifacts with diffs, review, rollback, and promotion history.
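The loop above calls TitleCaseToken, which is not shown. A plausible one-liner (an assumption, not the repo's actual helper) that produces display names like "Support Triage V2" from the file IDs:

```csharp
using System;
using System.Linq;

// Hypothetical helper: uppercases the first character of each filename token,
// turning an ID like "support_triage_v1" into the display name "Support Triage V1".
static string TitleCaseToken(string token) =>
    token.Length == 0 ? token : char.ToUpperInvariant(token[0]) + token[1..];

Console.WriteLine(string.Join(" ", "support_triage_v1".Split('_').Select(TitleCaseToken)));
```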
In this repo, support_triage_v1.txt is intentionally more generic. support_triage_v2.txt is stricter about taxonomy, priority rules, and secret-handling constraints.
The Eval Set Is the Contract
Each evaluation case defines what the model is expected to do, not just how the answer should sound:
public sealed class EvalCase
{
public string Id { get; init; } = string.Empty;
public string UserMessage { get; init; } = string.Empty;
public string ExpectedCategory { get; init; } = string.Empty;
public string ExpectedPriority { get; init; } = string.Empty;
public string[] MustContainKeywords { get; init; } = [];
public string[] ForbiddenKeywords { get; init; } = [];
}

A representative case from the sample dataset:
{
"Id": "PT-001",
"UserMessage": "Our production chat assistant is timing out for every customer after the latest deployment. What should support do first?",
"ExpectedCategory": "incident",
"ExpectedPriority": "P1",
"MustContainKeywords": [
"status page",
"incident commander"
],
"ForbiddenKeywords": [
"password"
]
}

This is what makes the evaluation useful. A prompt can return valid JSON and still fail because it picked the wrong priority or omitted critical action language.
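Loading the frozen set is a single deserialization step. A generic sketch (LoadDataset is an assumed name; in the harness T would be EvalCase[]) that fails fast, so a missing or truncated dataset aborts the run instead of silently shrinking it:

```csharp
using System;
using System.IO;
using System.Text.Json;

// Hypothetical loader sketch; the repo's actual loader may differ.
static T LoadDataset<T>(string datasetPath) where T : class
{
    if (!File.Exists(datasetPath))
        throw new FileNotFoundException($"Eval set not found: {datasetPath}");
    var json = File.ReadAllText(datasetPath);
    return JsonSerializer.Deserialize<T>(json)
        ?? throw new InvalidOperationException($"Eval set at {datasetPath} is empty or invalid.");
}
```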
Live Local Models Through Semantic Kernel
The default execution path uses Semantic Kernel against an OpenAI-compatible local endpoint such as LM Studio:
_kernel = Kernel.CreateBuilder()
.AddOpenAIChatCompletion(
modelId: config.ModelId,
apiKey: config.ApiKey,
endpoint: new Uri(config.BaseUrl))
.Build();

Prompt execution is simple and measurable:
var prompt = $"""
<system>
{variant.SystemPrompt}
</system>
<user>
{evalCase.UserMessage}
</user>
""";
var stopwatch = Stopwatch.StartNew();
var result = await _kernel.InvokePromptAsync(
prompt,
new KernelArguments(settings),
cancellationToken: cancellationToken);
stopwatch.Stop();

This keeps the model path local while leaving the evaluation logic framework-agnostic.
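When UseMockClient is true, the same execution path needs a deterministic stand-in. A minimal sketch of such a mock (the function name and canned payload are assumptions, not the repo's actual client): it returns a schema-valid triage decision so the scoring and ranking paths can be exercised offline with zero model variance between runs.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical deterministic mock: always returns the same schema-valid
// triage decision, regardless of the prompt or the user message.
static Task<string> MockCompleteAsync(
    string systemPrompt, string userMessage, CancellationToken ct = default) =>
    Task.FromResult("""
        {"category":"incident","priority":"P1","suggestedReply":"Post to the status page and page the incident commander."}
        """);

var response = await MockCompleteAsync("system prompt", "user message");
Console.WriteLine(response);
```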
Scoring Is Deterministic
The scorer normalizes model output, parses JSON, validates the response contract, and then checks exact task expectations:
if (!TriageDecision.TryParse(normalized, out var decision, out var parseError))
{
notes.Add($"Invalid JSON response: {parseError}");
return new PromptEvaluationResult(
variant,
evalCase,
response,
StructuredOutputValid: false,
CategoryMatch: false,
PriorityMatch: false,
KeywordCoverage: 0.0,
SafetyPassed: false,
Passed: false,
Notes: [.. notes],
Decision: null);
}

Keyword coverage and safety are enforced explicitly:
var keywordCoverage = ScoreKeywordCoverage(decision!.SuggestedReply, evalCase.MustContainKeywords, notes);
var safetyPassed = ScoreSafety(decision.SuggestedReply, evalCase.ForbiddenKeywords, notes);
var passed = structuredOutputValid
&& categoryMatch
&& priorityMatch
&& keywordCoverage >= 1.0
&& safetyPassed;

This is the core idea of the repo: prompt evaluation should be deterministic around the model output, not hand-waved after the fact.
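ScoreKeywordCoverage and ScoreSafety are not listed above. Plausible implementations, assuming case-insensitive substring matching over the suggested reply (the repo's actual matching rules may be stricter):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical: fraction of required keywords present in the reply.
static double ScoreKeywordCoverage(string reply, string[] mustContain, List<string> notes)
{
    if (mustContain.Length == 0) return 1.0;
    var missing = mustContain
        .Where(k => !reply.Contains(k, StringComparison.OrdinalIgnoreCase))
        .ToArray();
    foreach (var keyword in missing)
        notes.Add($"Missing required keyword: {keyword}");
    return (double)(mustContain.Length - missing.Length) / mustContain.Length;
}

// Hypothetical: true only when no forbidden term appears in the reply.
static bool ScoreSafety(string reply, string[] forbidden, List<string> notes)
{
    var violations = forbidden
        .Where(k => reply.Contains(k, StringComparison.OrdinalIgnoreCase))
        .ToArray();
    foreach (var keyword in violations)
        notes.Add($"Forbidden keyword present: {keyword}");
    return violations.Length == 0;
}

var notes = new List<string>();
var reply = "Check the status page and page the incident commander.";
Console.WriteLine(ScoreKeywordCoverage(reply, new[] { "status page", "incident commander" }, notes)); // 1
Console.WriteLine(ScoreSafety(reply, new[] { "password" }, notes)); // True
```

Substring matching is deliberately simple; it makes low coverage scores easy to audit by reading the notes list.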
Promotion Is Ranked, Not Guessed
After all runs complete, variants are ranked deterministically:
var summaries = variants
.Select(variant => BuildSummary(variant, results.Where(result => result.Variant.Id == variant.Id).ToArray()))
.OrderByDescending(summary => summary.PassRate)
.ThenByDescending(summary => summary.CategoryAccuracy)
.ThenByDescending(summary => summary.PriorityAccuracy)
.ThenByDescending(summary => summary.StructuredOutputRate)
.ThenBy(summary => summary.AverageLatencyMs)
.ToArray();

Pass rate is primary. Accuracy comes next. Latency only breaks ties after behavioral quality has already been decided.
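BuildSummary is referenced but not shown. A self-contained sketch of the aggregation it would need, using tuples instead of the repo's presumable record types, and assuming a LatencyMs value is captured on each result:

```csharp
using System;
using System.Linq;

// Hypothetical aggregation behind BuildSummary; field names mirror the
// OrderBy chain used for ranking.
static (double PassRate, double CategoryAccuracy, double PriorityAccuracy,
        double StructuredOutputRate, double AverageLatencyMs) BuildSummary(
    (bool Passed, bool CategoryMatch, bool PriorityMatch,
     bool StructuredOutputValid, double LatencyMs)[] results) =>
(
    results.Average(r => r.Passed ? 1.0 : 0.0),
    results.Average(r => r.CategoryMatch ? 1.0 : 0.0),
    results.Average(r => r.PriorityMatch ? 1.0 : 0.0),
    results.Average(r => r.StructuredOutputValid ? 1.0 : 0.0),
    results.Average(r => r.LatencyMs)
);

var summary = BuildSummary(new[]
{
    (Passed: true,  CategoryMatch: true, PriorityMatch: true,  StructuredOutputValid: true, LatencyMs: 1500.0),
    (Passed: false, CategoryMatch: true, PriorityMatch: false, StructuredOutputValid: true, LatencyMs: 2100.0),
});
Console.WriteLine(summary.PassRate); // 0.5
```

Each metric is an average over the variant's results, which keeps the summaries comparable regardless of eval set size.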
Reading a Real Live Run
A real run against openai/gpt-oss-20b through LM Studio produced:
Prompt Versioning and A/B Evaluation
Mode: live
Dataset: 5 evaluation cases
Variants: Support Triage V1, Support Triage V2
Provider: lmstudio
Model: openai/gpt-oss-20b
Endpoint: http://192.168.0.20:1234/v1
Support Triage V2
- Pass rate: 20%
- Structured output rate: 100%
- Category accuracy: 100%
- Priority accuracy: 80%
- Keyword coverage: 27%
- Safety pass rate: 100%
- Average latency: 1946 ms
Support Triage V1
- Pass rate: 0%
- Structured output rate: 100%
- Category accuracy: 100%
- Priority accuracy: 60%
- Keyword coverage: 13%
- Safety pass rate: 100%
- Average latency: 1524 ms
Winner: Support Triage V2

How to interpret this:
- The live model returned valid structured JSON for every case, which means the output contract is working
- Category accuracy was strong for both prompts, so taxonomy guidance was mostly clear
- The main weakness was not JSON validity but action specificity, shown by low keyword coverage
- support_triage_v2 still won clearly, but the low pass rate shows it should not be promoted blindly without another revision
This is exactly why the harness is useful. It prevents a prompt from being promoted just because the outputs look acceptable in a few ad-hoc spot checks.
Why This Architecture Works
The model remains useful, but prompt promotion authority stays in deterministic code:
- Prompt versions are explicit files instead of hidden inline strings
- Evaluation cases are frozen and replayable
- Structured output validity is enforced instead of assumed
- Required action language and forbidden terms are measured directly
- Variant ranking is deterministic and inspectable
- Live model behavior can be tested locally before rollout
Potential Enhancements
To extend this project further, consider:
- Add more prompt variants and compare them in the same harness
- Add larger eval sets with adversarial and edge-case prompts
- Add rubric-style scoring for refusal quality or citation quality
- Add prompt metadata such as owner, version date, and release notes
- Add promotion thresholds that fail the run when pass rate drops below a fixed floor
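The last enhancement is only a few lines of deterministic code. A sketch of such a promotion gate (the 0.8 floor is an illustrative value, not from the repo):

```csharp
using System;

// Hypothetical promotion gate: block promotion when the winning variant's
// pass rate falls below a fixed floor. The floor value is illustrative.
static int EvaluatePromotion(double winnerPassRate, double floor)
{
    if (winnerPassRate < floor)
    {
        Console.Error.WriteLine(
            $"Promotion blocked: pass rate {winnerPassRate:P0} is below the {floor:P0} floor.");
        return 1; // non-zero exit code fails a CI job
    }
    Console.WriteLine("Promotion allowed.");
    return 0;
}

// The live run above would be blocked: the winner passed only 20% of cases.
Console.WriteLine(EvaluatePromotion(0.20, 0.80));
```

Wiring the return value into the process exit code turns the harness into a CI quality gate rather than a report that can be ignored.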
Final Notes
Prompt engineering becomes much more useful when prompt changes are treated as controlled system changes instead of informal edits.
When prompts are versioned, replayed on a frozen eval set, scored deterministically, and promoted by measured behavior, you get a real engineering loop instead of a demo loop.
Explore the source code at the GitHub repository.
See you in the next issue.
Stay curious.