
Most local LLM demos stop at model output quality. Production systems fail for different reasons: request spikes, queue buildup, timeout cascades, and retry storms that push latency out of control.
In this issue, we build a local-first reliability layer for AI APIs in C#. The model remains local through Ollama. Deterministic code owns rate limits, queue pressure, time budgets, retry policy, circuit state, fallback routing, and metrics.
What You Are Building
You are building a production-shaped inference gateway that protects local LLM calls under load:
- Load runtime config from appsettings.json and RELIABILITY_ environment overrides
- Enforce fixed-window ingress rate limits
- Use a bounded queue and worker pool for controlled execution
- Apply per-attempt timeout plus request-level deadline budget
- Apply deterministic retry policy with bounded delay
- Use per-model circuit breakers with half-open probes
- Fallback across a local model chain when upstream models fail
- Track p50 and p95 latency, timeout rate, fallback rate, and queue saturation in real time
This is reliability engineering for AI APIs, not prompt engineering.
System Structure
The gateway is built as a deterministic control loop: receive request, apply rate gate, enqueue into bounded channel, process on workers with timeout and retry constraints, fallback across model chain with circuit awareness, and publish metrics snapshots continuously. Every rejection and every failure mode has an explicit reason.
The diagram below shows the high-level control flow:
Runtime Configuration Is the Reliability Contract
The runtime starts by loading and validating a strict config object:
```csharp
var configuration = new ConfigurationBuilder()
    .SetBasePath(AppContext.BaseDirectory)
    .AddJsonFile("appsettings.json", optional: true, reloadOnChange: false)
    .AddEnvironmentVariables(prefix: "RELIABILITY_")
    .Build();

var config = AppConfig.Load(configuration);
config.Validate();
```

The tuned local profile used for the successful run:
```json
{
  "App": {
    "RateLimitRequestsPerSecond": 8,
    "QueueCapacity": 24,
    "WorkerCount": 3,
    "EndToEndTimeoutMs": 30000,
    "AttemptTimeoutMs": 9000,
    "MaxAttemptsPerModel": 1,
    "CircuitBreakerFailureThreshold": 2,
    "CircuitBreakerOpenSeconds": 15,
    "LoadTotalRequests": 30,
    "LoadConcurrentClients": 4,
    "LoadBurstMode": false
  }
}
```

The key validation rule prevents an invalid timeout hierarchy:
```csharp
if (AttemptTimeoutMs > EndToEndTimeoutMs)
{
    throw new InvalidOperationException(
        "App:AttemptTimeoutMs must be less than or equal to App:EndToEndTimeoutMs.");
}
```

Reliability starts with explicit runtime constraints, not hidden defaults.
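The RELIABILITY_ prefix makes every setting overridable per environment. As a minimal sketch (assuming the same Microsoft.Extensions.Configuration packages used above): the standard environment-variables provider maps a double underscore to the ":" section separator, so RELIABILITY_App__RateLimitRequestsPerSecond overrides App:RateLimitRequestsPerSecond.

```csharp
// Sketch: environment override for the config above. The standard provider
// maps "__" in variable names to the ":" section separator.
using System;
using Microsoft.Extensions.Configuration;

Environment.SetEnvironmentVariable("RELIABILITY_App__RateLimitRequestsPerSecond", "4");

var configuration = new ConfigurationBuilder()
    .AddEnvironmentVariables(prefix: "RELIABILITY_")
    .Build();

Console.WriteLine(configuration["App:RateLimitRequestsPerSecond"]); // prints 4
```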
Ingress Control: Rate Limit and Queue Backpressure
Rate limiting is deterministic with a fixed one-second window:
```csharp
public bool TryAcquire(DateTimeOffset nowUtc)
{
    lock (_gate)
    {
        if (_windowStartUtc == DateTimeOffset.MinValue)
        {
            _windowStartUtc = nowUtc;
        }

        if ((nowUtc - _windowStartUtc) >= _window)
        {
            _windowStartUtc = nowUtc;
            _requestCount = 0;
        }

        if (_requestCount >= _maxRequestsPerWindow)
        {
            return false;
        }

        _requestCount++;
        return true;
    }
}
```

Queueing uses a bounded channel that drops writes when full:
```csharp
_channel = Channel.CreateBounded<QueuedRequest>(new BoundedChannelOptions(config.QueueCapacity)
{
    FullMode = BoundedChannelFullMode.DropWrite,
    SingleReader = false,
    SingleWriter = false
});
```

Ingress responses are explicit for each rejection mode:
```csharp
if (!_rateLimiter.TryAcquire(DateTimeOffset.UtcNow))
{
    _metrics.RecordRateLimited();
    return InferenceResponse.FailureResponse(
        requestId: request.RequestId,
        outcome: RequestOutcome.RateLimited,
        responseText: "Rate limit exceeded. Retry shortly.",
        latencyMs: 0,
        modelAttempts: 0,
        detail: "rate-limited");
}

var depth = Interlocked.Increment(ref _queueDepth);
if (!_channel.Writer.TryWrite(queued))
{
    Interlocked.Decrement(ref _queueDepth);
    _metrics.RecordQueueRejected();
    return InferenceResponse.FailureResponse(
        requestId: request.RequestId,
        outcome: RequestOutcome.QueueRejected,
        responseText: "Queue is saturated. Retry shortly.",
        latencyMs: 0,
        modelAttempts: 0,
        detail: "queue-rejected");
}
```

This removes ambiguous overload behavior. Requests are either accepted with bounded processing or rejected with concrete reasons.
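The queue-rejection branch relies on a documented behavior of bounded channels: once capacity is reached, TryWrite simply returns false rather than blocking. In miniature:

```csharp
// DropWrite semantics in miniature: TryWrite returns false once the
// bounded channel is at capacity, so the caller must handle rejection.
using System;
using System.Threading.Channels;

var channel = Channel.CreateBounded<int>(new BoundedChannelOptions(capacity: 2)
{
    FullMode = BoundedChannelFullMode.DropWrite
});

Console.WriteLine(channel.Writer.TryWrite(1)); // True
Console.WriteLine(channel.Writer.TryWrite(2)); // True
Console.WriteLine(channel.Writer.TryWrite(3)); // False: capacity reached
```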
Time Budget, Retry, Circuit, and Fallback
Each request carries a global deadline. Each model call gets a bounded attempt timeout inside that budget:
```csharp
var deadline = request.Deadline <= TimeSpan.Zero
    ? TimeSpan.FromMilliseconds(_config.EndToEndTimeoutMs)
    : request.Deadline;

deadlineCts.CancelAfter(deadline);

var remainingBudget = deadline - sw.Elapsed;
var attemptTimeout = remainingBudget < _attemptTimeout ? remainingBudget : _attemptTimeout;
```

Circuit breakers protect each model independently and allow half-open probes:
```csharp
if (!breaker.CanExecute(DateTimeOffset.UtcNow, out var breakerReason))
{
    attemptTrace.Add($"{model}:skip({breakerReason})");
    _metrics.RecordCircuitOpenSkip(model);
    continue;
}
```

The breaker's CanExecute method implements the state machine:

```csharp
public bool CanExecute(DateTimeOffset nowUtc, out string reason)
{
    lock (_gate)
    {
        if (_state == CircuitState.Open)
        {
            var elapsed = nowUtc - _openedAtUtc;
            if (elapsed < _openDuration)
            {
                reason = "circuit-open";
                return false;
            }

            _state = CircuitState.HalfOpen;
            _halfOpenProbeReserved = false;
        }

        if (_state == CircuitState.HalfOpen)
        {
            if (_halfOpenProbeReserved)
            {
                reason = "half-open-probe-in-flight";
                return false;
            }

            _halfOpenProbeReserved = true;
            reason = "half-open-probe";
            return true;
        }

        reason = "closed";
        return true;
    }
}
```

Fallback routing is deterministic by model-chain order:
```csharp
for (var modelIndex = 0; modelIndex < _modelChain.Count; modelIndex++)
{
    var model = _modelChain[modelIndex];

    for (var attempt = 1; attempt <= _retryPolicy.MaxAttempts; attempt++)
    {
        var text = await modelClient.CompleteAsync(
            systemPrompt,
            request.Prompt,
            attemptTimeout,
            requestToken);

        return InferenceResponse.SuccessResponse(
            requestId: request.RequestId,
            responseText: text,
            modelUsed: model,
            modelAttempts: attemptsMade,
            fallbackDepth: modelIndex,
            latencyMs: sw.Elapsed.TotalMilliseconds,
            detail: string.Join(" | ", attemptTrace));
    }
}
```

Primary model first, then secondary, then tertiary. No hidden routing decisions.
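The timeout hierarchy is easy to check in isolation. A small sketch, with constants taken from the tuned profile and an illustrative helper name: each attempt gets the 9-second ceiling until the 30-second end-to-end budget runs lower than that.

```csharp
// Sketch of the budget clamp: each attempt gets at most AttemptTimeoutMs,
// shrunk to whatever remains of the request's end-to-end deadline.
using System;

var deadline = TimeSpan.FromMilliseconds(30000);       // App:EndToEndTimeoutMs
var attemptCeiling = TimeSpan.FromMilliseconds(9000);  // App:AttemptTimeoutMs

TimeSpan AttemptTimeout(TimeSpan elapsed)
{
    var remainingBudget = deadline - elapsed;
    return remainingBudget < attemptCeiling ? remainingBudget : attemptCeiling;
}

Console.WriteLine(AttemptTimeout(TimeSpan.FromSeconds(5)).TotalMilliseconds);  // 9000: full ceiling
Console.WriteLine(AttemptTimeout(TimeSpan.FromSeconds(25)).TotalMilliseconds); // 5000: clamped to remaining budget
```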
Local Models Through OllamaSharp
The gateway uses Ollama locally through OllamaSharp with streaming response assembly:
```csharp
public async Task<string> CompleteAsync(
    string systemPrompt,
    string userPrompt,
    TimeSpan timeout,
    CancellationToken cancellationToken)
{
    using var timeoutCts = CancellationTokenSource.CreateLinkedTokenSource(cancellationToken);
    timeoutCts.CancelAfter(timeout);

    var request = new ChatRequest
    {
        Model = ModelName,
        Messages =
        [
            new Message(ChatRole.System, systemPrompt),
            new Message(ChatRole.User, userPrompt)
        ]
    };

    var builder = new StringBuilder(capacity: 512);
    await foreach (var chunk in _client.ChatAsync(request, timeoutCts.Token))
    {
        if (chunk?.Message?.Content is { Length: > 0 } text)
        {
            builder.Append(text);
        }
    }

    return builder.ToString().Trim();
}
```

The model path stays local while gateway reliability remains framework-agnostic.
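The per-attempt timeout inside CompleteAsync is plain .NET rather than anything Ollama-specific: a linked CancellationTokenSource cancels when either the caller cancels or the timeout elapses. Isolated below, with a Task.Delay standing in for the model call:

```csharp
// The linked-token timeout pattern from CompleteAsync, in isolation.
using System;
using System.Threading;
using System.Threading.Tasks;

using var callerCts = new CancellationTokenSource();
using var timeoutCts = CancellationTokenSource.CreateLinkedTokenSource(callerCts.Token);
timeoutCts.CancelAfter(TimeSpan.FromMilliseconds(100));

var timedOut = false;
try
{
    // Stand-in for a slow streaming model call.
    await Task.Delay(TimeSpan.FromSeconds(5), timeoutCts.Token);
}
catch (OperationCanceledException)
{
    timedOut = true;
    Console.WriteLine("attempt timed out"); // fires after ~100 ms
}
```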
Metrics as Reliability Signals
Metrics snapshots compute latency percentiles and failure rates in process:
```csharp
var samples = _latencySamples.ToArray();
Array.Sort(samples);

var avg = samples.Length == 0 ? 0 : samples.Average();
var p50 = Percentile(samples, 0.50);
var p95 = Percentile(samples, 0.95);

var errorRate = completed == 0 ? 0 : (double)(failed + timedOut) / completed;
var timeoutRate = completed == 0 ? 0 : (double)timedOut / completed;
var fallbackRate = succeeded == 0 ? 0 : (double)Volatile.Read(ref _fallbackServed) / succeeded;
```

The console reporter publishes those values continuously:
```csharp
Console.WriteLine($"Latency(ms): avg={snapshot.AverageLatencyMs:F1} p50={snapshot.P50LatencyMs:F1} p95={snapshot.P95LatencyMs:F1}");
Console.WriteLine($"Rates: error={snapshot.ErrorRate:P1} timeout={snapshot.TimeoutRate:P1} fallback={snapshot.FallbackRate:P1}");
Console.WriteLine($"Queue: current={snapshot.CurrentQueueDepth} peak={snapshot.PeakQueueDepth} saturation={snapshot.QueueSaturation:P1}");
```

This keeps operations measurable, with no external dependencies required for the demo.
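The Percentile helper the snapshot code calls is not shown above. A common nearest-rank implementation over the already-sorted samples would look like this; the exact interpolation used in the repository may differ.

```csharp
// Hedged sketch of a Percentile helper over pre-sorted samples
// (nearest-rank method; the article does not show the real body).
using System;

static double Percentile(double[] sortedSamples, double quantile)
{
    if (sortedSamples.Length == 0) return 0;

    // Nearest rank: ceil(q * N), converted to a zero-based index.
    var rank = (int)Math.Ceiling(quantile * sortedSamples.Length);
    var index = Math.Clamp(rank - 1, 0, sortedSamples.Length - 1);
    return sortedSamples[index];
}

double[] samples = { 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000 };
Console.WriteLine(Percentile(samples, 0.50)); // 500
Console.WriteLine(Percentile(samples, 0.95)); // 1000
```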
Reading the Full Run Output
A full successful run from this implementation produced:
```
Load summary:
- Total: 30
- Success: 30
- Failed: 0
- Timed out: 0
- Rate limited: 0
- Queue rejected: 0
- Canceled: 0
- Avg latency (served): 11964.0 ms
- P95 latency (served): 22886.2 ms

Final metrics snapshot:
Requests: recv=30 enq=30 done=30 ok=30 fail=0 timeout=0
Rates: error=0.0% timeout=0.0% fallback=33.3%
Model attempts=40 | circuit-open-skips=3
llama3.2:3b attempts=27 ok=20 fail=7 timeout=7 skip=3
mistral:7b  attempts=10 ok=7  fail=3 timeout=3 skip=0
phi3:mini   attempts=3  ok=3  fail=0 timeout=0 skip=0
```

How to interpret this:
- The gateway remained healthy under full test load with no dropped or timed-out requests
- Fallback served roughly one third of successful requests, which means resilience controls actively protected availability
- The primary model still timed out on some attempts, but the system recovered through deterministic fallback routing
- Latency is high for this hardware profile, which is expected in local model setups and still operationally visible through p50 and p95 reporting
Parameter Notes That Matter Most
These settings control most runtime behavior:
- RateLimitRequestsPerSecond: ingress safety valve against burst overload
- QueueCapacity: buffered backlog limit before explicit rejection
- WorkerCount: max in-process parallelism at the gateway layer
- AttemptTimeoutMs: upper bound for a single model call attempt
- EndToEndTimeoutMs: hard request deadline across all fallback stages
- CircuitBreakerFailureThreshold: failures needed before opening a model circuit
- CircuitBreakerOpenSeconds: cooldown before half-open probe attempt
- LoadConcurrentClients: synthetic test pressure used by the load harness
Tune these as a system. Raising timeouts without adjusting concurrency can still degrade p95 under local compute limits.
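A quick way to reason about these settings together is Little's law: sustainable throughput is roughly WorkerCount divided by average service time. Using the served latency from the run above (~12 s), this back-of-envelope check shows why the queue, not the rate limiter, is the binding constraint under sustained load:

```csharp
// Back-of-envelope capacity check for the tuned profile
// (worker count and latency taken from the run output above).
using System;
using System.Globalization;

var workerCount = 3;            // App:WorkerCount
var avgLatencySeconds = 12.0;   // ~12s average served latency on this hardware

var sustainableRps = workerCount / avgLatencySeconds;
Console.WriteLine(sustainableRps.ToString("F2", CultureInfo.InvariantCulture)); // 0.25

// At 8 req/s ingress, arrivals can outpace drain by ~32x, so the 24-slot
// queue (and its explicit rejections) is what bounds sustained bursts.
```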
Why This Architecture Works
The model remains useful, but deterministic code owns service reliability:
- Ingress is bounded by explicit rate and queue controls
- Execution respects hard timeout budgets and bounded retry behavior
- Circuit breakers prevent repeated failure amplification
- Fallback is deterministic and inspectable, not heuristic
- Metrics expose latency and failure behavior continuously
- Each failure mode has explicit outcome labeling for debugging and policy gates
Potential Enhancements
To further harden this system, you can consider:
- Add startup model warm-up calls to reduce cold-start latency spikes
- Add per-tenant rate limits and fairness policies
- Add adaptive timeout profiles by prompt class and model
- Add persistent metrics export to OpenTelemetry backends
- Add canary model rollout mode with automatic rollback thresholds
Final Notes
Reliable AI APIs are built with deterministic controls around the model path, not by trusting the model to manage service behavior.
When rate limits, queue bounds, timeout budgets, circuit states, fallback routing, and metrics are explicit, local AI systems become production-operable instead of demo-only.
Explore the source code at the GitHub repository.
See you in the next issue.
Stay curious.
Join the Newsletter
Subscribe for AI engineering insights, system design strategies, and workflow tips.