
Local inference still needs explicit resource control. Repeated deterministic requests should not execute twice, low-urgency work should not consume the interactive lane, and tenant workloads still need hard guardrails around queue depth, budget caps, and model-call timeout windows.
The economics are different from hosted AI APIs, but the engineering problem is still real. In this project, queue capacity, timeout windows, cache reuse, and daily budget limits are explicit control surfaces. If those controls are implicit, local inference becomes harder to predict even for small production-shaped workloads.
In this issue, we build a local-first inference control layer in C#. The model path stays local through Ollama. Deterministic code owns request estimation, exact cache keys, budget reservation, sync-vs-batch routing, deferred queue draining, structured output fallback, and measurable runtime cost signals.
What You Are Building
You are building a production-shaped inference gateway that controls spend before and after local model execution:
- Load runtime config from appsettings.json and AICOST_ environment overrides
- Estimate request cost before execution and compute actual charge after execution
- Use exact-response caching for deterministic requests
- Derive cache keys from tenant, task, prompt versions, lane, model, and normalized input
- Reserve org and tenant budget before admitting work
- Route urgent work to an interactive lane and low-urgency or high-cost work to a deferred batch lane
- Queue deferred requests and drain them later with explicit execution semantics
- Validate structured JSON output and apply deterministic fallback when local models drift off schema
- Track charged cost, avoided cost, queued estimated cost, queue depth, and cache hit rate at runtime
This is cost control engineering for AI systems, not just model selection.
System Structure
The gateway is built as a deterministic control loop: load runtime config, estimate the interactive path, decide the execution lane, attempt exact cache reuse, reserve budget, either enqueue or execute immediately, then commit actual spend only after the response is complete. If model output is malformed or semantically weak, the policy layer corrects or replaces it before the result is trusted.
The high-level control flow: load config → estimate cost → decide lane → check exact cache → reserve budget → enqueue or execute → commit actual spend → validate or correct output.
Runtime Configuration
The app starts by loading and validating the cost-control profile before any model call happens:
var configuration = new ConfigurationBuilder()
.SetBasePath(AppContext.BaseDirectory)
.AddJsonFile("appsettings.json", optional: true, reloadOnChange: false)
.AddEnvironmentVariables(prefix: "AICOST_")
.Build();
var config = AppConfig.Load(configuration);
config.Validate();The tuned live local profile used in this repo:
{
"App": {
"Provider": "ollama",
"UseMockModel": false,
"OllamaBaseUrl": "http://localhost:11434",
"InteractiveModel": "gpt-oss:20b",
"BatchModel": "mistral:7b",
"FixedSystemTokens": 140,
"ReservedOutputTokens": 220,
"CacheTtlMinutes": 120,
"MaxCacheEntries": 512,
"DailyBudgetUsd": 6,
"DefaultTenantDailyBudgetUsd": 2,
"MaxInteractiveEstimatedCostUsd": 0.0018,
"MaxInteractiveInputTokens": 900,
"BatchQueueCapacity": 128,
"BatchProcessMaxItemsPerDrain": 32,
"InteractiveTimeoutSeconds": 30,
"BatchTimeoutSeconds": 90,
"InteractiveInputCostPer1KTokensUsd": 0.00045,
"InteractiveOutputCostPer1KTokensUsd": 0.0009,
"BatchInputCostPer1KTokensUsd": 0.0003,
"BatchOutputCostPer1KTokensUsd": 0.0006
}
}
The validation rules protect the budget model from invalid configuration:
if (DefaultTenantDailyBudgetUsd > DailyBudgetUsd)
{
throw new InvalidOperationException("DefaultTenantDailyBudgetUsd cannot exceed DailyBudgetUsd.");
}
if (BatchProcessMaxItemsPerDrain <= 0 || BatchQueueCapacity <= 0)
{
throw new InvalidOperationException("Batch queue settings must be greater than zero.");
}
Cost control starts with explicit constraints. If the runtime profile is ambiguous, the economics will be ambiguous too.
Cost Estimation and Routing
The gateway estimates the interactive path before deciding what to do with a request:
var interactiveEstimate = _costCalculator.Estimate(systemPrompt, userPrompt, ExecutionLane.Interactive);
var route = _routingPolicy.Decide(request, interactiveEstimate);
var routeEstimate = route.Lane == ExecutionLane.Interactive
? interactiveEstimate
: _costCalculator.Estimate(systemPrompt, userPrompt, ExecutionLane.DeferredBatch);
The estimator itself separates the planning estimate from the actual post-execution charge:
public UsageEstimate Estimate(
string systemPrompt,
string userPrompt,
ExecutionLane lane)
{
var inputTokens = EstimateInputTokens(systemPrompt, userPrompt);
var outputTokens = config.ReservedOutputTokens;
var totalCostUsd = CalculateCost(inputTokens, outputTokens, lane);
return new UsageEstimate(
InputTokens: inputTokens,
EstimatedOutputTokens: outputTokens,
EstimatedCostUsd: totalCostUsd);
}
public ExecutionCharge ComputeActualCharge(
int inputTokens,
string responseText,
ExecutionLane lane)
{
var outputTokens = Math.Max(1, tokenEstimator.EstimateTokens(responseText));
var totalCostUsd = CalculateCost(inputTokens, outputTokens, lane);
return new ExecutionCharge(
InputTokens: inputTokens,
OutputTokens: outputTokens,
TotalCostUsd: totalCostUsd);
}
That distinction matters. Estimated cost is used to decide admission and routing. Actual charged cost is committed only after the response exists.
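The helpers EstimateInputTokens and CalculateCost are referenced above but not shown in the excerpt. A minimal sketch, assuming a rough four-characters-per-token heuristic and the per-1K-token lane rates from the config profile; the ratio and these signatures are illustrative assumptions, not the repo's exact code:

```csharp
// Sketch of the estimation helpers referenced above. The ~4 chars/token
// heuristic and these signatures are assumptions for illustration.
static int EstimateTokens(string text) =>
    Math.Max(1, (int)Math.Ceiling(text.Length / 4.0)); // crude chars-per-token heuristic

static int EstimateInputTokens(string systemPrompt, string userPrompt, int fixedSystemTokens) =>
    fixedSystemTokens + EstimateTokens(systemPrompt) + EstimateTokens(userPrompt);

// Lane pricing: cost scales linearly with tokens at per-1K-token rates.
static double CalculateCost(int inputTokens, int outputTokens, double inputPer1K, double outputPer1K) =>
    inputTokens / 1000.0 * inputPer1K + outputTokens / 1000.0 * outputPer1K;
```

With the interactive rates from the profile (0.00045 input, 0.0009 output per 1K tokens), 760 input tokens plus the 220 reserved output tokens estimate to about $0.00054, well under the 0.0018 interactive cost threshold.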
Execution Routing
The routing policy is deterministic and inspectable:
if (request.RequiresFreshResponse)
{
return new RoutingDecision(ExecutionLane.Interactive, config.InteractiveModel, "fresh-response-required");
}
if (request.Urgency is RequestUrgency.Critical or RequestUrgency.High)
{
return new RoutingDecision(ExecutionLane.Interactive, config.InteractiveModel, "high-urgency");
}
if (interactiveEstimate.InputTokens > config.MaxInteractiveInputTokens)
{
return new RoutingDecision(ExecutionLane.DeferredBatch, config.BatchModel, "interactive-token-threshold-exceeded");
}
if (interactiveEstimate.EstimatedCostUsd > config.MaxInteractiveEstimatedCostUsd)
{
return new RoutingDecision(ExecutionLane.DeferredBatch, config.BatchModel, "interactive-cost-threshold-exceeded");
}
if (request.Urgency == RequestUrgency.Low)
{
return new RoutingDecision(ExecutionLane.DeferredBatch, config.BatchModel, "low-urgency-batch-lane");
}
High-urgency support tickets stay interactive on gpt-oss:20b. Lower-value or threshold-exceeding work is pushed to the deferred batch lane on mistral:7b. The route reason is always attached to the result.
Exact Cache Keys
The cache is intentionally strict. Exact reuse is only allowed when the request is deterministic, tenant-scoped, and version-aligned:
public string Build(
CostAwareRequest request,
string modelId,
ExecutionLane lane)
{
var normalizedInput = Normalize(request.TicketText);
var payload = string.Join(
"\n",
request.TenantId.Trim(),
request.TaskType.Trim(),
request.Versions.PromptVersion.Trim(),
request.Versions.SchemaVersion.Trim(),
request.Versions.RetrievalVersion.Trim(),
request.Versions.ToolsetVersion.Trim(),
request.Urgency,
request.Temperature.ToString("0.###", CultureInfo.InvariantCulture),
lane,
modelId.Trim(),
normalizedInput);
var bytes = SHA256.HashData(Encoding.UTF8.GetBytes(payload));
return Convert.ToHexString(bytes);
}
Cache eligibility is also explicit:
private static bool CanUseCache(CostAwareRequest request)
=> request.AllowResponseCache
&& !request.RequiresFreshResponse
&& Math.Abs(request.Temperature) < 0.0001d;
This avoids a common production mistake where cache reuse ignores prompt changes, tenant boundaries, or model identity and quietly returns stale or invalid outputs.
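The Normalize step used by the key builder is not shown in the excerpt. A plausible minimal version, assuming the goal is that trivially different copies of the same ticket text produce the same cache key; the exact normalization rules are an assumption:

```csharp
// Hypothetical normalization for cache keying: trim, collapse whitespace,
// and lowercase so formatting-only differences do not defeat exact reuse.
static string Normalize(string input)
{
    var collapsed = System.Text.RegularExpressions.Regex.Replace(input.Trim(), @"\s+", " ");
    return collapsed.ToLowerInvariant();
}
```

Anything stronger than this (stripping punctuation, stemming) starts to blur the "exact" guarantee and belongs in a separate near-duplicate layer, not in the exact key.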
Budget Reservation
The budget ledger checks both org-wide and tenant-scoped daily caps before work is admitted:
public bool TryReserve(
string tenantId,
double amountUsd,
DateTimeOffset nowUtc,
out BudgetReservation reservation,
out string rejectionReason)
{
var day = DateOnly.FromDateTime(nowUtc.UtcDateTime);
lock (_gate)
{
var state = GetOrCreate(day);
var tenant = state.GetOrCreateTenant(tenantId);
if (state.SpentUsd + state.ReservedUsd + amountUsd > config.DailyBudgetUsd)
{
reservation = new BudgetReservation(day, tenantId, 0);
rejectionReason = "org-daily-budget-exceeded";
return false;
}
if (tenant.SpentUsd + tenant.ReservedUsd + amountUsd > config.DefaultTenantDailyBudgetUsd)
{
reservation = new BudgetReservation(day, tenantId, 0);
rejectionReason = "tenant-daily-budget-exceeded";
return false;
}
state.ReservedUsd += amountUsd;
tenant.ReservedUsd += amountUsd;
reservation = new BudgetReservation(day, tenantId, amountUsd);
rejectionReason = string.Empty;
return true;
}
}
A deferred request is admitted with reserved budget but zero charged cost at submission time:
if (route.Lane == ExecutionLane.DeferredBatch)
{
_metrics.RecordQueued(routeEstimate.EstimatedCostUsd);
return new GatewayResult(
RequestId: request.RequestId,
TenantId: request.TenantId,
Outcome: RequestOutcome.Queued,
Lane: route.Lane,
ModelId: route.ModelId,
CacheHit: false,
ResponseText: string.Empty,
Estimate: routeEstimate,
ChargedCostUsd: 0,
AvoidedCostUsd: 0,
Detail: route.Reason,
StructuredResponse: null);
}
This is an important operational distinction. A queued request is not free. It is reserved. It simply has not been charged yet because the model has not executed yet.
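The excerpt shows reservation but not settlement. A minimal, self-contained sketch of the settle step, written functionally for brevity; the repo presumably mutates the same per-day ledger state under the lock used by TryReserve, and the names here are assumptions:

```csharp
// Settling a reservation: release the full hold, then record only what
// execution actually cost. Illustrative sketch; names are assumptions.
static (double SpentUsd, double ReservedUsd) Commit(
    (double SpentUsd, double ReservedUsd) ledger, double reservedUsd, double actualCostUsd) =>
    (ledger.SpentUsd + actualCostUsd, ledger.ReservedUsd - reservedUsd);

// A cache hit, rejection, or failed call releases the hold without charging.
static (double SpentUsd, double ReservedUsd) Release(
    (double SpentUsd, double ReservedUsd) ledger, double reservedUsd) =>
    (ledger.SpentUsd, ledger.ReservedUsd - reservedUsd);
```

The invariant to preserve: reserved amounts are temporary holds against the daily caps, and spent amounts reflect only completed executions.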
Deferred Queue
The batch lane uses a simple in-memory FIFO queue with bounded capacity:
private bool TryEnqueue(DeferredWorkItem workItem)
{
lock (_queueGate)
{
if (_deferredQueue.Count >= _config.BatchQueueCapacity)
{
return false;
}
_deferredQueue.Enqueue(workItem);
return true;
}
}
Draining is explicit and processes only up to the configured max items per pass:
public async Task<IReadOnlyList<GatewayResult>> DrainDeferredAsync(CancellationToken cancellationToken = default)
{
var items = Dequeue(_config.BatchProcessMaxItemsPerDrain);
if (items.Count == 0)
{
return [];
}
_metrics.RecordDequeued(items.Sum(static item => item.Estimate.EstimatedCostUsd));
var results = new List<GatewayResult>(items.Count);
foreach (var item in items)
{
var result = await ExecuteReservedAsync(
item.Request,
item.Route,
item.Estimate,
item.Reservation,
item.CacheKey,
item.SystemPrompt,
item.UserPrompt,
cancellationToken);
results.Add(result);
}
return results;
}
This keeps queue semantics obvious for the demo. Submission and execution are separate events. That is why queued rows show zero charge at first and a non-zero charge later when the drain actually runs.
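The Dequeue helper called at the top of DrainDeferredAsync is not shown. A sketch of its likely shape, written generically here; the repo's version presumably operates on DeferredWorkItem under the same _queueGate lock that guards TryEnqueue:

```csharp
// Take up to maxItems from the head of the queue in one pass, under the
// same lock that guards enqueue, so each drain batch is a consistent snapshot.
static List<T> DequeueUpTo<T>(Queue<T> queue, object gate, int maxItems)
{
    lock (gate)
    {
        var items = new List<T>(Math.Min(maxItems, queue.Count));
        while (items.Count < maxItems && queue.Count > 0)
        {
            items.Add(queue.Dequeue());
        }
        return items;
    }
}
```

Dequeuing the whole batch under one lock acquisition keeps the per-drain item cap exact even if submissions race with the drain.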
Structured Output Validation
The task contract is strict: category, priority, summary, and at least two next actions are required. The parser validates that structure before the result is accepted:
if (!AllowedCategories.Contains(triage.Category))
{
errors.Add("Category must be one of Billing, Bug, Performance, Access, FeatureRequest.");
}
if (!AllowedPriorities.Contains(triage.Priority))
{
errors.Add("Priority must be one of P1, P2, P3, P4.");
}
if (string.IsNullOrWhiteSpace(triage.Summary))
{
errors.Add("Summary is required.");
}
if (triage.NextActions.Count < 2)
{
errors.Add("At least two next actions are required.");
}
When local model output is malformed or weak, the policy layer replaces raw parser noise with deterministic fallback behavior and stable reason codes:
if (parsed is null)
{
return new PolicyResolution(
Triage: BuildFallback(ticketText),
UsedFallback: true,
Adjusted: true,
Reason: $"deterministic-fallback-used: {ClassifyParseError(parseError)}");
}
var corrected = Apply(ticketText, parsed);
var adjusted = !Equivalent(parsed, corrected);
return new PolicyResolution(
Triage: corrected,
UsedFallback: false,
Adjusted: adjusted,
Reason: adjusted ? "policy-corrected" : string.Empty);
That is the practical lesson of local inference engineering: even when the model is local, output trust still belongs to deterministic code.
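BuildFallback itself is not shown above. A hypothetical sketch of what a deterministic fallback triage could look like; the keyword rules, the P3 default, and the tuple shape are illustrative assumptions, not the repo's implementation:

```csharp
// Deterministic fallback: no model involved, so the result is always
// schema-valid and reproducible. All rules here are illustrative.
static (string Category, string Priority, string Summary, string[] NextActions) BuildFallback(string ticketText)
{
    var lowered = ticketText.ToLowerInvariant();
    var category =
        lowered.Contains("invoice") || lowered.Contains("charge") ? "Billing" :
        lowered.Contains("slow") || lowered.Contains("timeout") ? "Performance" :
        lowered.Contains("login") || lowered.Contains("password") ? "Access" :
        "Bug";
    var summary = ticketText.Length <= 120 ? ticketText : ticketText[..120];
    // P3 is a conservative default; a human pass can re-prioritize later.
    return (category, "P3", summary, new[] { "Route to human triage", "Request reproduction details" });
}
```

Because the fallback is pure and keyword-driven, the same malformed model output always yields the same triage, which keeps the reason codes in the result log trustworthy.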
OllamaSharp Integration
The live path uses Ollama locally and assembles the streamed response into a final string:
public async Task<string> CompleteAsync(
string modelId,
string systemPrompt,
string userPrompt,
TimeSpan timeout,
CancellationToken cancellationToken = default)
{
var client = GetOrCreateClient(modelId);
using var timeoutCts = CancellationTokenSource.CreateLinkedTokenSource(cancellationToken);
timeoutCts.CancelAfter(timeout);
var request = new ChatRequest
{
Model = modelId,
Messages =
[
new Message(ChatRole.System, systemPrompt),
new Message(ChatRole.User, userPrompt)
]
};
var builder = new StringBuilder(capacity: 512);
await foreach (var chunk in client.ChatAsync(request, timeoutCts.Token))
{
if (chunk?.Message?.Content is { Length: > 0 } content)
{
builder.Append(content);
}
}
return builder.ToString().Trim();
}
Here, the interactive lane runs on gpt-oss:20b and the deferred lane runs on mistral:7b. The cost-control logic remains independent of the specific model choice.
Runtime Metrics
The gateway tracks cost, queue, and cache behavior as a snapshot instead of leaving economics hidden in logs:
public sealed record GatewayMetricsSnapshot(
long Received,
long Completed,
long Queued,
long Rejected,
long Failed,
long CacheHits,
long CacheMisses,
double CacheHitRate,
long InteractiveCompleted,
long BatchCompleted,
double ChargedCostUsd,
double AvoidedCostUsd,
double QueuedEstimatedCostUsd,
int QueueDepth,
int CacheEntries);
Queue and cache accounting are updated explicitly:
public void RecordCacheHit(double avoidedCostUsd)
{
lock (_gate)
{
_completed++;
_cacheHits++;
_avoidedCostUsd += Math.Max(0, avoidedCostUsd);
}
}
public void RecordQueued(double estimatedCostUsd)
{
lock (_gate)
{
_queued++;
_queuedEstimatedCostUsd += Math.Max(0, estimatedCostUsd);
}
}
public void RecordDequeued(double estimatedCostUsd)
{
lock (_gate)
{
_queuedEstimatedCostUsd = Math.Max(0, _queuedEstimatedCostUsd - Math.Max(0, estimatedCostUsd));
}
}
This gives you three separate signals that matter operationally: what was actually charged, what was avoided by cache, and what deferred work is still sitting in the queue as estimated future spend.
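The CacheHitRate field in the snapshot is presumably derived from the raw counters at snapshot time. The arithmetic is tiny, but the divide-by-zero guard matters on a cold start; the function name here is an assumption:

```csharp
// Hit rate over all cache lookups; returns 0 before any lookup has happened.
static double ComputeCacheHitRate(long cacheHits, long cacheMisses)
{
    var lookups = cacheHits + cacheMisses;
    return lookups == 0 ? 0d : (double)cacheHits / lookups;
}
```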
Reading a Real Live Run
A real run against local Ollama models produced:
=== AI Cost Engineering for Local Inference (.NET) ===
Provider: ollama
Interactive model: gpt-oss:20b
Batch model: mistral:7b
[REQ-1001] tenant=contoso outcome=Completed lane=Interactive model=gpt-oss:20b cacheHit=False
detail: high-urgency; policy-corrected
estimated cost: $0.000391 | charged: $0.000290 | avoided: $0.000000
[REQ-1002] tenant=contoso outcome=Completed lane=Cache model=gpt-oss:20b cacheHit=True
detail: served-from-exact-cache
estimated cost: $0.000391 | charged: $0.000000 | avoided: $0.000391
[REQ-1003] tenant=contoso outcome=Queued lane=DeferredBatch model=mistral:7b cacheHit=False
detail: low-urgency-batch-lane
estimated cost: $0.000261 | charged: $0.000000 | avoided: $0.000000
Draining deferred batch lane...
[REQ-1003] tenant=contoso outcome=Completed lane=DeferredBatch model=mistral:7b cacheHit=False
detail: low-urgency-batch-lane
estimated cost: $0.000261 | charged: $0.000184 | avoided: $0.000000
Metrics snapshot:
- received: 5
- completed: 5
- queued: 2
- cache hits: 1
- charged cost USD: 0.000971
- avoided cost USD: 0.000391
- queued estimated cost USD: 0.000000
- queue depth: 0
How to interpret this:
- The second identical interactive request was served from exact cache and avoided another full model call
- Low-urgency feature work entered the deferred lane with reserved budget but zero immediate charge
- The actual charge for deferred work appeared only when the queue was drained and the model really executed
- The final metrics show the queue is empty and no deferred estimated cost remains outstanding
This is the real operational value of the gateway. You can explain why a request was interactive, queued, cached, corrected, rejected, or charged.
Why This Architecture Works
The model remains useful, but spending authority stays in deterministic code:
- Estimated cost is computed before execution and used for routing decisions
- Exact cache keys prevent accidental reuse across tenants, models, or prompt versions
- Budget is reserved before work is admitted, not after the money is already spent
- Deferred work uses explicit queue semantics instead of hidden asynchronous execution
- Actual spend is committed only after execution completes
- Malformed or weak model output is normalized before it becomes a system result
- Metrics expose charged, avoided, and queued cost separately
Potential Enhancements
To push this further, you can consider:
- Persist the deferred queue in SQLite or Redis instead of process memory
- Run batch draining on a timed background worker instead of a manual command
- Add per-model concurrency limits or GPU-slot reservation policies
- Split tenant budget policy by model tier or workload class
- Add semantic near-duplicate detection before exact cache lookup
- Export cost and queue metrics to OpenTelemetry or Prometheus backends
Final Notes
Inference systems become much more reliable when resource usage is engineered as deliberately as output quality.
When cache keys, routing thresholds, queue semantics, budget reservation, and actual charge accounting are explicit, the model path stops being a black box and becomes a controllable service boundary.
Explore the source code at the GitHub repository.
See you in the next issue.
Stay curious.