
Local inference still needs explicit resource control. Repeated deterministic requests should not execute twice, low-urgency work should not consume the interactive lane, and tenant workloads still need hard guardrails around queue depth, budget caps, and model-call timeout windows.
The economics are different from hosted AI APIs, but the engineering problem is still real. In this project, queue capacity, timeout windows, cache reuse, and daily budget limits are explicit control surfaces. If those controls are implicit, local inference becomes harder to predict even for small production-shaped workloads.
In this issue, we build a local-first inference control layer in C#. The model path stays local through Ollama. Deterministic code owns request estimation, exact cache keys, budget reservation, sync-vs-batch routing, deferred queue draining, structured output fallback, and measurable runtime cost signals.
What You Are Building
You are building a production-shaped inference gateway that controls spend before and after local model execution:
- Load runtime config from appsettings.json and AICOST_ environment overrides
- Estimate request cost before execution and compute actual charge after execution
- Use exact-response caching for deterministic requests
- Derive cache keys from tenant, task, prompt versions, lane, model, and normalized input
- Reserve org and tenant budget before admitting work
- Route urgent work to an interactive lane and low-urgency or high-cost work to a deferred batch lane
- Queue deferred requests and drain them later with explicit execution semantics
- Validate structured JSON output and apply deterministic fallback when local models drift off schema
- Track charged cost, avoided cost, queued estimated cost, queue depth, and cache hit rate at runtime
This is cost control engineering for AI systems, not just model selection.
System Structure
The gateway is built as a deterministic control loop: load runtime config, estimate the interactive path, decide the execution lane, attempt exact cache reuse, reserve budget, either enqueue or execute immediately, then commit actual spend only after the response is complete. If model output is malformed or semantically weak, the policy layer corrects or replaces it before the result is trusted.
The high-level control flow: load config → estimate cost → decide lane → check exact cache → reserve budget → enqueue or execute → commit actual spend → validate or correct output.
Runtime Configuration
The app starts by loading and validating the cost-control profile before any model call happens:
var configuration = new ConfigurationBuilder()
.SetBasePath(AppContext.BaseDirectory)
.AddJsonFile("appsettings.json", optional: true, reloadOnChange: false)
.AddEnvironmentVariables(prefix: "AICOST_")
.Build();
var config = AppConfig.Load(configuration);
config.Validate();The tuned live local profile used in this repo:
{
"App": {
"Provider": "ollama",
"UseMockModel": false,
"OllamaBaseUrl": "http://localhost:11434",
"InteractiveModel": "gpt-oss:20b",
"BatchModel": "mistral:7b",
"FixedSystemTokens": 140,
"ReservedOutputTokens": 220,
"CacheTtlMinutes": 120,
"MaxCacheEntries": 512,
"DailyBudgetUsd": 6,
"DefaultTenantDailyBudgetUsd": 2,
"MaxInteractiveEstimatedCostUsd": 0.0018,
"MaxInteractiveInputTokens": 900,
"BatchQueueCapacity": 128,
"BatchProcessMaxItemsPerDrain": 32,
"InteractiveTimeoutSeconds": 30,
"BatchTimeoutSeconds": 90,
"InteractiveInputCostPer1KTokensUsd": 0.00045,
"InteractiveOutputCostPer1KTokensUsd": 0.0009,
"BatchInputCostPer1KTokensUsd": 0.0003,
"BatchOutputCostPer1KTokensUsd": 0.0006
}
}
The validation rules protect the budget model from invalid configuration:
if (DefaultTenantDailyBudgetUsd > DailyBudgetUsd)
{
throw new InvalidOperationException("DefaultTenantDailyBudgetUsd cannot exceed DailyBudgetUsd.");
}
if (BatchProcessMaxItemsPerDrain <= 0 || BatchQueueCapacity <= 0)
{
throw new InvalidOperationException("Batch queue settings must be greater than zero.");
}
Cost control starts with explicit constraints. If the runtime profile is ambiguous, the economics will be ambiguous too.
Cost Estimation and Routing
The gateway estimates the interactive path before deciding what to do with a request:
var interactiveEstimate = _costCalculator.Estimate(systemPrompt, userPrompt, ExecutionLane.Interactive);
var route = _routingPolicy.Decide(request, interactiveEstimate);
var routeEstimate = route.Lane == ExecutionLane.Interactive
? interactiveEstimate
: _costCalculator.Estimate(systemPrompt, userPrompt, ExecutionLane.DeferredBatch);
The estimator itself separates the planning estimate from the actual post-execution charge:
public UsageEstimate Estimate(
string systemPrompt,
string userPrompt,
ExecutionLane lane)
{
var inputTokens = EstimateInputTokens(systemPrompt, userPrompt);
var outputTokens = config.ReservedOutputTokens;
var totalCostUsd = CalculateCost(inputTokens, outputTokens, lane);
return new UsageEstimate(
InputTokens: inputTokens,
EstimatedOutputTokens: outputTokens,
EstimatedCostUsd: totalCostUsd);
}
public ExecutionCharge ComputeActualCharge(
int inputTokens,
string responseText,
ExecutionLane lane)
{
var outputTokens = Math.Max(1, tokenEstimator.EstimateTokens(responseText));
var totalCostUsd = CalculateCost(inputTokens, outputTokens, lane);
return new ExecutionCharge(
InputTokens: inputTokens,
OutputTokens: outputTokens,
TotalCostUsd: totalCostUsd);
}
That distinction matters. Estimated cost is used to decide admission and routing. Actual charged cost is committed only after the response exists.
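The helpers EstimateInputTokens and CalculateCost are referenced above but not shown in the excerpt. A minimal sketch, assuming a rough four-characters-per-token heuristic and the per-1K-token lane rates from the config profile; the ratio and these signatures are illustrative assumptions, not the repo's exact code:

```csharp
// Sketch of the estimation helpers referenced above. The ~4 chars/token
// heuristic and these signatures are assumptions for illustration.
static int EstimateTokens(string text) =>
    Math.Max(1, (int)Math.Ceiling(text.Length / 4.0)); // crude chars-per-token heuristic

static int EstimateInputTokens(string systemPrompt, string userPrompt, int fixedSystemTokens) =>
    fixedSystemTokens + EstimateTokens(systemPrompt) + EstimateTokens(userPrompt);

// Lane pricing: cost scales linearly with tokens at per-1K-token rates.
static double CalculateCost(int inputTokens, int outputTokens, double inputPer1K, double outputPer1K) =>
    inputTokens / 1000.0 * inputPer1K + outputTokens / 1000.0 * outputPer1K;
```

With the interactive rates from the profile (0.00045 input, 0.0009 output per 1K tokens), 760 input tokens plus the 220 reserved output tokens estimate to about $0.00054, well under the 0.0018 interactive cost threshold.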
Execution Routing
The routing policy is deterministic and inspectable:
if (request.RequiresFreshResponse)
{
return new RoutingDecision(ExecutionLane.Interactive, config.InteractiveModel, "fresh-response-required");
}
if (request.Urgency is RequestUrgency.Critical or RequestUrgency.High)
{
return new RoutingDecision(ExecutionLane.Interactive, config.InteractiveModel, "high-urgency");
}
if (interactiveEstimate.InputTokens > config.MaxInteractiveInputTokens)
{
return new RoutingDecision(ExecutionLane.DeferredBatch, config.BatchModel, "interactive-token-threshold-exceeded");
}
if (interactiveEstimate.EstimatedCostUsd > config.MaxInteractiveEstimatedCostUsd)
{
return new RoutingDecision(ExecutionLane.DeferredBatch, config.BatchModel, "interactive-cost-threshold-exceeded");
}
if (request.Urgency == RequestUrgency.Low)
{
return new RoutingDecision(ExecutionLane.DeferredBatch, config.BatchModel, "low-urgency-batch-lane");
}
High-urgency support tickets stay interactive on gpt-oss:20b. Lower-value or threshold-exceeding work is pushed to the deferred batch lane on mistral:7b. The route reason is always attached to the result.
Exact Cache Keys
The cache is intentionally strict. Exact reuse is only allowed when the request is deterministic, tenant-scoped, and version-aligned:
public string Build(
CostAwareRequest request,
string modelId,
ExecutionLane lane)
{
var normalizedInput = Normalize(request.TicketText);
var payload = string.Join(
"\n",
request.TenantId.Trim(),
request.TaskType.Trim(),
request.Versions.PromptVersion.Trim(),
request.Versions.SchemaVersion.Trim(),
request.Versions.RetrievalVersion.Trim(),
request.Versions.ToolsetVersion.Trim(),
request.Urgency,
request.Temperature.ToString("0.###", CultureInfo.InvariantCulture),
lane,
modelId.Trim(),
normalizedInput);
var bytes = SHA256.HashData(Encoding.UTF8.GetBytes(payload));
return Convert.ToHexString(bytes);
}
Cache eligibility is also explicit:
private static bool CanUseCache(CostAwareRequest request)
=> request.AllowResponseCache
&& !request.RequiresFreshResponse
&& Math.Abs(request.Temperature) < 0.0001d;
This avoids a common production mistake where cache reuse ignores prompt changes, tenant boundaries, or model identity and quietly returns stale or invalid outputs.
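The Normalize step used by the key builder is not shown in the excerpt. A plausible minimal version, assuming the goal is that trivially different copies of the same ticket text produce the same cache key; the exact normalization rules are an assumption:

```csharp
// Hypothetical normalization for cache keying: trim, collapse whitespace,
// and lowercase so formatting-only differences do not defeat exact reuse.
static string Normalize(string input)
{
    var collapsed = System.Text.RegularExpressions.Regex.Replace(input.Trim(), @"\s+", " ");
    return collapsed.ToLowerInvariant();
}
```

Anything stronger than this (stripping punctuation, stemming) starts to blur the "exact" guarantee and belongs in a separate near-duplicate layer, not in the exact key.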
Budget Reservation
The budget ledger checks both org-wide and tenant-scoped daily caps before work is admitted:
public bool TryReserve(
string tenantId,
double amountUsd,
DateTimeOffset nowUtc,
out BudgetReservation reservation,
out string rejectionReason)
{
var day = DateOnly.FromDateTime(nowUtc.UtcDateTime);
lock (_gate)
{
var state = GetOrCreate(day);
var tenant = state.GetOrCreateTenant(tenantId);
if (state.SpentUsd + state.ReservedUsd + amountUsd > config.DailyBudgetUsd)
{
reservation = new BudgetReservation(day, tenantId, 0);
rejectionReason = "org-daily-budget-exceeded";
return false;
}
if (tenant.SpentUsd + tenant.ReservedUsd + amountUsd > config.DefaultTenantDailyBudgetUsd)
{
reservation = new BudgetReservation(day, tenantId, 0);
rejectionReason = "tenant-daily-budget-exceeded";
return false;
}
state.ReservedUsd += amountUsd;
tenant.ReservedUsd += amountUsd;
reservation = new BudgetReservation(day, tenantId, amountUsd);
rejectionReason = string.Empty;
return true;
}
}
A deferred request is admitted with reserved budget but zero charged cost at submission time:
if (route.Lane == ExecutionLane.DeferredBatch)
{
_metrics.RecordQueued(routeEstimate.EstimatedCostUsd);
return new GatewayResult(
RequestId: request.RequestId,
TenantId: request.TenantId,
Outcome: RequestOutcome.Queued,
Lane: route.Lane,
ModelId: route.ModelId,
CacheHit: false,
ResponseText: string.Empty,
Estimate: routeEstimate,
ChargedCostUsd: 0,
AvoidedCostUsd: 0,
Detail: route.Reason,
StructuredResponse: null);
}
This is an important operational distinction. A queued request is not free. It is reserved. It simply has not been charged yet because the model has not executed yet.
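The excerpt shows reservation but not settlement. A minimal, self-contained sketch of the settle step, written functionally for brevity; the repo presumably mutates the same per-day ledger state under the lock used by TryReserve, and the names here are assumptions:

```csharp
// Settling a reservation: release the full hold, then record only what
// execution actually cost. Illustrative sketch; names are assumptions.
static (double SpentUsd, double ReservedUsd) Commit(
    (double SpentUsd, double ReservedUsd) ledger, double reservedUsd, double actualCostUsd) =>
    (ledger.SpentUsd + actualCostUsd, ledger.ReservedUsd - reservedUsd);

// A cache hit, rejection, or failed call releases the hold without charging.
static (double SpentUsd, double ReservedUsd) Release(
    (double SpentUsd, double ReservedUsd) ledger, double reservedUsd) =>
    (ledger.SpentUsd, ledger.ReservedUsd - reservedUsd);
```

The invariant to preserve: reserved amounts are temporary holds against the daily caps, and spent amounts reflect only completed executions.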
Deferred Queue
The batch lane uses a simple in-memory FIFO queue with bounded capacity:
private bool TryEnqueue(DeferredWorkItem workItem)
{
lock (_queueGate)
{
if (_deferredQueue.Count >= _config.BatchQueueCapacity)
{
return false;
}
_deferredQueue.Enqueue(workItem);
return true;
}
}
Draining is explicit and processes only up to the configured max items per pass:
public async Task<IReadOnlyList<GatewayResult>> DrainDeferredAsync(CancellationToken cancellationToken = default)
{
var items = Dequeue(_config.BatchProcessMaxItemsPerDrain);
if (items.Count == 0)
{
return [];
}
_metrics.RecordDequeued(items.Sum(static item => item.Estimate.EstimatedCostUsd));
var results = new List<GatewayResult>(items.Count);
foreach (var item in items)
{
var result = await ExecuteReservedAsync(
item.Request,
item.Route,
item.Estimate,
item.Reservation,
item.CacheKey,
item.SystemPrompt,
item.UserPrompt,
cancellationToken);
results.Add(result);
}
return results;
}
This keeps queue semantics obvious for the demo. Submission and execution are separate events. That is why queued rows show zero charge at first and a non-zero charge later when the drain actually runs.
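The Dequeue helper called at the top of DrainDeferredAsync is not shown. A sketch of its likely shape, written generically here; the repo's version presumably operates on DeferredWorkItem under the same _queueGate lock that guards TryEnqueue:

```csharp
// Take up to maxItems from the head of the queue in one pass, under the
// same lock that guards enqueue, so each drain batch is a consistent snapshot.
static List<T> DequeueUpTo<T>(Queue<T> queue, object gate, int maxItems)
{
    lock (gate)
    {
        var items = new List<T>(Math.Min(maxItems, queue.Count));
        while (items.Count < maxItems && queue.Count > 0)
        {
            items.Add(queue.Dequeue());
        }
        return items;
    }
}
```

Dequeuing the whole batch under one lock acquisition keeps the per-drain item cap exact even if submissions race with the drain.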
Structured Output Validation
The task contract is strict: category, priority, summary, and at least two next actions are required. The parser validates that structure before the result is accepted:
if (!AllowedCategories.Contains(triage.Category))
{
errors.Add("Category must be one of Billing, Bug, Performance, Access, FeatureRequest.");
}
if (!AllowedPriorities.Contains(triage.Priority))
{
errors.Add("Priority must be one of P1, P2, P3, P4.");
}
if (string.IsNullOrWhiteSpace(triage.Summary))
{
errors.Add("Summary is required.");
}
if (triage.NextActions.Count < 2)
{
errors.Add("At least two next actions are required.");
}
When local model output is malformed or weak, the policy layer replaces raw parser noise with deterministic fallback behavior and stable reason codes:
if (parsed is null)
{
return new PolicyResolution(
Triage: BuildFallback(ticketText),
UsedFallback: true,
Adjusted: true,
Reason: $"deterministic-fallback-used: {ClassifyParseError(parseError)}");
}
var corrected = Apply(ticketText, parsed);
var adjusted = !Equivalent(parsed, corrected);
return new PolicyResolution(
Triage: corrected,
UsedFallback: false,
Adjusted: adjusted,
Reason: adjusted ? "policy-corrected" : string.Empty);
That is the practical lesson of local inference engineering: even when the model is local, output trust still belongs to deterministic code.
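BuildFallback itself is not shown above. A hypothetical sketch of what a deterministic fallback triage could look like; the keyword rules, the P3 default, and the tuple shape are illustrative assumptions, not the repo's implementation:

```csharp
// Deterministic fallback: no model involved, so the result is always
// schema-valid and reproducible. All rules here are illustrative.
static (string Category, string Priority, string Summary, string[] NextActions) BuildFallback(string ticketText)
{
    var lowered = ticketText.ToLowerInvariant();
    var category =
        lowered.Contains("invoice") || lowered.Contains("charge") ? "Billing" :
        lowered.Contains("slow") || lowered.Contains("timeout") ? "Performance" :
        lowered.Contains("login") || lowered.Contains("password") ? "Access" :
        "Bug";
    var summary = ticketText.Length <= 120 ? ticketText : ticketText[..120];
    // P3 is a conservative default; a human pass can re-prioritize later.
    return (category, "P3", summary, new[] { "Route to human triage", "Request reproduction details" });
}
```

Because the fallback is pure and keyword-driven, the same malformed model output always yields the same triage, which keeps the reason codes in the result log trustworthy.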
OllamaSharp Integration
The live path uses Ollama locally and assembles the streamed response into a final string:
public async Task<string> CompleteAsync(
string modelId,
string systemPrompt,
string userPrompt,
TimeSpan timeout,
CancellationToken cancellationToken = default)
{
var client = GetOrCreateClient(modelId);
using var timeoutCts = CancellationTokenSource.CreateLinkedTokenSource(cancellationToken);
timeoutCts.CancelAfter(timeout);
var request = new ChatRequest
{
Model = modelId,
Messages =
[
new Message(ChatRole.System, systemPrompt),
new Message(ChatRole.User, userPrompt)
]
};
var builder = new StringBuilder(capacity: 512);
await foreach (var chunk in client.ChatAsync(request, timeoutCts.Token))
{
if (chunk?.Message?.Content is { Length: > 0 } content)
{
builder.Append(content);
}
}
return builder.ToString().Trim();
}
Here, the interactive lane runs on gpt-oss:20b and the deferred lane runs on mistral:7b. The cost-control logic remains independent of the specific model choice.
Runtime Metrics
The gateway tracks cost, queue, and cache behavior as a snapshot instead of leaving economics hidden in logs:
public sealed record GatewayMetricsSnapshot(
long Received,
long Completed,
long Queued,
long Rejected,
long Failed,
long CacheHits,
long CacheMisses,
double CacheHitRate,
long InteractiveCompleted,
long BatchCompleted,
double ChargedCostUsd,
double AvoidedCostUsd,
double QueuedEstimatedCostUsd,
int QueueDepth,
int CacheEntries);
Queue and cache accounting are updated explicitly:
public void RecordCacheHit(double avoidedCostUsd)
{
lock (_gate)
{
_completed++;
_cacheHits++;
_avoidedCostUsd += Math.Max(0, avoidedCostUsd);
}
}
public void RecordQueued(double estimatedCostUsd)
{
lock (_gate)
{
_queued++;
_queuedEstimatedCostUsd += Math.Max(0, estimatedCostUsd);
}
}
public void RecordDequeued(double estimatedCostUsd)
{
lock (_gate)
{
_queuedEstimatedCostUsd = Math.Max(0, _queuedEstimatedCostUsd - Math.Max(0, estimatedCostUsd));
}
}
This gives you three separate signals that matter operationally: what was actually charged, what was avoided by cache, and what deferred work is still sitting in the queue as estimated future spend.
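The CacheHitRate field in the snapshot is presumably derived from the raw counters at snapshot time. The arithmetic is tiny, but the divide-by-zero guard matters on a cold start; the function name here is an assumption:

```csharp
// Hit rate over all cache lookups; returns 0 before any lookup has happened.
static double ComputeCacheHitRate(long cacheHits, long cacheMisses)
{
    var lookups = cacheHits + cacheMisses;
    return lookups == 0 ? 0d : (double)cacheHits / lookups;
}
```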
Reading a Real Live Run
A real run against local Ollama models produced:
=== AI Cost Engineering for Local Inference (.NET) ===
Provider: ollama
Interactive model: gpt-oss:20b
Batch model: mistral:7b
[REQ-1001] tenant=contoso outcome=Completed lane=Interactive model=gpt-oss:20b cacheHit=False
detail: high-urgency; policy-corrected
estimated cost: $0.000391 | charged: $0.000290 | avoided: $0.000000
[REQ-1002] tenant=contoso outcome=Completed lane=Cache model=gpt-oss:20b cacheHit=True
detail: served-from-exact-cache
estimated cost: $0.000391 | charged: $0.000000 | avoided: $0.000391
[REQ-1003] tenant=contoso outcome=Queued lane=DeferredBatch model=mistral:7b cacheHit=False
detail: low-urgency-batch-lane
estimated cost: $0.000261 | charged: $0.000000 | avoided: $0.000000
Draining deferred batch lane...
[REQ-1003] tenant=contoso outcome=Completed lane=DeferredBatch model=mistral:7b cacheHit=False
detail: low-urgency-batch-lane
estimated cost: $0.000261 | charged: $0.000184 | avoided: $0.000000
Metrics snapshot:
- received: 5
- completed: 5
- queued: 2
- cache hits: 1
- charged cost USD: 0.000971
- avoided cost USD: 0.000391
- queued estimated cost USD: 0.000000
- queue depth: 0
How to interpret this:
- The second identical interactive request was served from exact cache and avoided another full model call
- Low-urgency feature work entered the deferred lane with reserved budget but zero immediate charge
- The actual charge for deferred work appeared only when the queue was drained and the model really executed
- The final metrics show the queue is empty and no deferred estimated cost remains outstanding
This is the real operational value of the gateway. You can explain why a request was interactive, queued, cached, corrected, rejected, or charged.
Why This Architecture Works
The model remains useful, but spending authority stays in deterministic code:
- Estimated cost is computed before execution and used for routing decisions
- Exact cache keys prevent accidental reuse across tenants, models, or prompt versions
- Budget is reserved before work is admitted, not after the money is already spent
- Deferred work uses explicit queue semantics instead of hidden asynchronous execution
- Actual spend is committed only after execution completes
- Malformed or weak model output is normalized before it becomes a system result
- Metrics expose charged, avoided, and queued cost separately
Potential Enhancements
To push this further, you can consider:
- Persist the deferred queue in SQLite or Redis instead of process memory
- Run batch draining on a timed background worker instead of a manual command
- Add per-model concurrency limits or GPU-slot reservation policies
- Split tenant budget policy by model tier or workload class
- Add semantic near-duplicate detection before exact cache lookup
- Export cost and queue metrics to OpenTelemetry or Prometheus backends
Final Notes
Inference systems become much more reliable when resource usage is engineered as deliberately as output quality.
When cache keys, routing thresholds, queue semantics, budget reservation, and actual charge accounting are explicit, the model path stops being a black box and becomes a controllable service boundary.
Explore the source code at the GitHub repository.
See you in the next issue.
Stay curious.