
Most local LLM demos stop at model output quality. Production systems fail for different reasons: request spikes, queue buildup, timeout cascades, and retry storms that push latency out of control.
In this issue, we build a local-first reliability layer for AI APIs in C#. The model remains local through Ollama. Deterministic code owns rate limits, queue pressure, time budgets, retry policy, circuit state, fallback routing, and metrics.
What You Are Building
You are building a production-shaped inference gateway that protects local LLM calls under load:
- Load runtime config from appsettings.json and RELIABILITY_ environment overrides
- Enforce fixed-window ingress rate limits
- Use a bounded queue and worker pool for controlled execution
- Apply per-attempt timeout plus request-level deadline budget
- Apply deterministic retry policy with bounded delay
- Use per-model circuit breakers with half-open probes
- Fallback across a local model chain when upstream models fail
- Track p50 and p95 latency, timeout rate, fallback rate, and queue saturation in real time
This is reliability engineering for AI APIs, not prompt engineering.
System Structure
The gateway is built as a deterministic control loop: receive request, apply rate gate, enqueue into bounded channel, process on workers with timeout and retry constraints, fallback across model chain with circuit awareness, and publish metrics snapshots continuously. Every rejection and every failure mode has an explicit reason.
The diagram below shows the high-level control flow:
Runtime Configuration Is the Reliability Contract
The runtime starts by loading and validating a strict config object:
```csharp
var configuration = new ConfigurationBuilder()
    .SetBasePath(AppContext.BaseDirectory)
    .AddJsonFile("appsettings.json", optional: true, reloadOnChange: false)
    .AddEnvironmentVariables(prefix: "RELIABILITY_")
    .Build();

var config = AppConfig.Load(configuration);
config.Validate();
```

The tuned local profile used for the successful run:
```json
{
  "App": {
    "RateLimitRequestsPerSecond": 8,
    "QueueCapacity": 24,
    "WorkerCount": 3,
    "EndToEndTimeoutMs": 30000,
    "AttemptTimeoutMs": 9000,
    "MaxAttemptsPerModel": 1,
    "CircuitBreakerFailureThreshold": 2,
    "CircuitBreakerOpenSeconds": 15,
    "LoadTotalRequests": 30,
    "LoadConcurrentClients": 4,
    "LoadBurstMode": false
  }
}
```

The key validation rule prevents an invalid timeout hierarchy:
```csharp
if (AttemptTimeoutMs > EndToEndTimeoutMs)
{
    throw new InvalidOperationException(
        "App:AttemptTimeoutMs must be less than or equal to App:EndToEndTimeoutMs.");
}
```

Reliability starts with explicit runtime constraints, not hidden defaults.
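The RELIABILITY_ prefix makes every setting overridable per environment. As a minimal sketch (assuming the same Microsoft.Extensions.Configuration packages used above): the standard environment-variables provider maps a double underscore to the ":" section separator, so RELIABILITY_App__RateLimitRequestsPerSecond overrides App:RateLimitRequestsPerSecond.

```csharp
// Sketch: environment override for the config above. The standard provider
// maps "__" in variable names to the ":" section separator.
using System;
using Microsoft.Extensions.Configuration;

Environment.SetEnvironmentVariable("RELIABILITY_App__RateLimitRequestsPerSecond", "4");

var configuration = new ConfigurationBuilder()
    .AddEnvironmentVariables(prefix: "RELIABILITY_")
    .Build();

Console.WriteLine(configuration["App:RateLimitRequestsPerSecond"]); // prints 4
```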
Ingress Control: Rate Limit and Queue Backpressure
Rate limiting is deterministic with a fixed one-second window:
```csharp
public bool TryAcquire(DateTimeOffset nowUtc)
{
    lock (_gate)
    {
        if (_windowStartUtc == DateTimeOffset.MinValue)
        {
            _windowStartUtc = nowUtc;
        }

        if ((nowUtc - _windowStartUtc) >= _window)
        {
            _windowStartUtc = nowUtc;
            _requestCount = 0;
        }

        if (_requestCount >= _maxRequestsPerWindow)
        {
            return false;
        }

        _requestCount++;
        return true;
    }
}
```

Queueing uses a bounded channel that drops writes when full:
```csharp
_channel = Channel.CreateBounded<QueuedRequest>(new BoundedChannelOptions(config.QueueCapacity)
{
    FullMode = BoundedChannelFullMode.DropWrite,
    SingleReader = false,
    SingleWriter = false
});
```

Ingress responses are explicit for each rejection mode:
```csharp
if (!_rateLimiter.TryAcquire(DateTimeOffset.UtcNow))
{
    _metrics.RecordRateLimited();
    return InferenceResponse.FailureResponse(
        requestId: request.RequestId,
        outcome: RequestOutcome.RateLimited,
        responseText: "Rate limit exceeded. Retry shortly.",
        latencyMs: 0,
        modelAttempts: 0,
        detail: "rate-limited");
}

var depth = Interlocked.Increment(ref _queueDepth);
if (!_channel.Writer.TryWrite(queued))
{
    Interlocked.Decrement(ref _queueDepth);
    _metrics.RecordQueueRejected();
    return InferenceResponse.FailureResponse(
        requestId: request.RequestId,
        outcome: RequestOutcome.QueueRejected,
        responseText: "Queue is saturated. Retry shortly.",
        latencyMs: 0,
        modelAttempts: 0,
        detail: "queue-rejected");
}
```

This removes ambiguous overload behavior. Requests are either accepted with bounded processing or rejected with concrete reasons.
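The queue-rejection branch relies on a documented behavior of bounded channels: once capacity is reached, TryWrite simply returns false rather than blocking. In miniature:

```csharp
// DropWrite semantics in miniature: TryWrite returns false once the
// bounded channel is at capacity, so the caller must handle rejection.
using System;
using System.Threading.Channels;

var channel = Channel.CreateBounded<int>(new BoundedChannelOptions(capacity: 2)
{
    FullMode = BoundedChannelFullMode.DropWrite
});

Console.WriteLine(channel.Writer.TryWrite(1)); // True
Console.WriteLine(channel.Writer.TryWrite(2)); // True
Console.WriteLine(channel.Writer.TryWrite(3)); // False: capacity reached
```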
Time Budget, Retry, Circuit, and Fallback
Each request carries a global deadline. Each model call gets a bounded attempt timeout inside that budget:
```csharp
var deadline = request.Deadline <= TimeSpan.Zero
    ? TimeSpan.FromMilliseconds(_config.EndToEndTimeoutMs)
    : request.Deadline;

deadlineCts.CancelAfter(deadline);

var remainingBudget = deadline - sw.Elapsed;
var attemptTimeout = remainingBudget < _attemptTimeout ? remainingBudget : _attemptTimeout;
```

Circuit breakers protect each model independently and allow half-open probes:
```csharp
if (!breaker.CanExecute(DateTimeOffset.UtcNow, out var breakerReason))
{
    attemptTrace.Add($"{model}:skip({breakerReason})");
    _metrics.RecordCircuitOpenSkip(model);
    continue;
}
```

The breaker's CanExecute method implements the state machine:

```csharp
public bool CanExecute(DateTimeOffset nowUtc, out string reason)
{
    lock (_gate)
    {
        if (_state == CircuitState.Open)
        {
            var elapsed = nowUtc - _openedAtUtc;
            if (elapsed < _openDuration)
            {
                reason = "circuit-open";
                return false;
            }

            _state = CircuitState.HalfOpen;
            _halfOpenProbeReserved = false;
        }

        if (_state == CircuitState.HalfOpen)
        {
            if (_halfOpenProbeReserved)
            {
                reason = "half-open-probe-in-flight";
                return false;
            }

            _halfOpenProbeReserved = true;
            reason = "half-open-probe";
            return true;
        }

        reason = "closed";
        return true;
    }
}
```

Fallback routing is deterministic by model-chain order:
```csharp
for (var modelIndex = 0; modelIndex < _modelChain.Count; modelIndex++)
{
    var model = _modelChain[modelIndex];

    for (var attempt = 1; attempt <= _retryPolicy.MaxAttempts; attempt++)
    {
        var text = await modelClient.CompleteAsync(
            systemPrompt,
            request.Prompt,
            attemptTimeout,
            requestToken);

        return InferenceResponse.SuccessResponse(
            requestId: request.RequestId,
            responseText: text,
            modelUsed: model,
            modelAttempts: attemptsMade,
            fallbackDepth: modelIndex,
            latencyMs: sw.Elapsed.TotalMilliseconds,
            detail: string.Join(" | ", attemptTrace));
    }
}
```

Primary model first, then secondary, then tertiary. No hidden routing decisions.
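The timeout hierarchy is easy to check in isolation. A small sketch, with constants taken from the tuned profile and an illustrative helper name: each attempt gets the 9-second ceiling until the 30-second end-to-end budget runs lower than that.

```csharp
// Sketch of the budget clamp: each attempt gets at most AttemptTimeoutMs,
// shrunk to whatever remains of the request's end-to-end deadline.
using System;

var deadline = TimeSpan.FromMilliseconds(30000);       // App:EndToEndTimeoutMs
var attemptCeiling = TimeSpan.FromMilliseconds(9000);  // App:AttemptTimeoutMs

TimeSpan AttemptTimeout(TimeSpan elapsed)
{
    var remainingBudget = deadline - elapsed;
    return remainingBudget < attemptCeiling ? remainingBudget : attemptCeiling;
}

Console.WriteLine(AttemptTimeout(TimeSpan.FromSeconds(5)).TotalMilliseconds);  // 9000: full ceiling
Console.WriteLine(AttemptTimeout(TimeSpan.FromSeconds(25)).TotalMilliseconds); // 5000: clamped to remaining budget
```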
Local Models Through OllamaSharp
The gateway uses Ollama locally through OllamaSharp with streaming response assembly:
```csharp
public async Task<string> CompleteAsync(
    string systemPrompt,
    string userPrompt,
    TimeSpan timeout,
    CancellationToken cancellationToken)
{
    using var timeoutCts = CancellationTokenSource.CreateLinkedTokenSource(cancellationToken);
    timeoutCts.CancelAfter(timeout);

    var request = new ChatRequest
    {
        Model = ModelName,
        Messages =
        [
            new Message(ChatRole.System, systemPrompt),
            new Message(ChatRole.User, userPrompt)
        ]
    };

    var builder = new StringBuilder(capacity: 512);
    await foreach (var chunk in _client.ChatAsync(request, timeoutCts.Token))
    {
        if (chunk?.Message?.Content is { Length: > 0 } text)
        {
            builder.Append(text);
        }
    }

    return builder.ToString().Trim();
}
```

The model path stays local while gateway reliability remains framework-agnostic.
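The per-attempt timeout inside CompleteAsync is plain .NET rather than anything Ollama-specific: a linked CancellationTokenSource cancels when either the caller cancels or the timeout elapses. Isolated below, with a Task.Delay standing in for the model call:

```csharp
// The linked-token timeout pattern from CompleteAsync, in isolation.
using System;
using System.Threading;
using System.Threading.Tasks;

using var callerCts = new CancellationTokenSource();
using var timeoutCts = CancellationTokenSource.CreateLinkedTokenSource(callerCts.Token);
timeoutCts.CancelAfter(TimeSpan.FromMilliseconds(100));

var timedOut = false;
try
{
    // Stand-in for a slow streaming model call.
    await Task.Delay(TimeSpan.FromSeconds(5), timeoutCts.Token);
}
catch (OperationCanceledException)
{
    timedOut = true;
    Console.WriteLine("attempt timed out"); // fires after ~100 ms
}
```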
Metrics as Reliability Signals
Metrics snapshots compute latency percentiles and failure rates in process:
```csharp
var samples = _latencySamples.ToArray();
Array.Sort(samples);

var avg = samples.Length == 0 ? 0 : samples.Average();
var p50 = Percentile(samples, 0.50);
var p95 = Percentile(samples, 0.95);

var errorRate = completed == 0 ? 0 : (double)(failed + timedOut) / completed;
var timeoutRate = completed == 0 ? 0 : (double)timedOut / completed;
var fallbackRate = succeeded == 0 ? 0 : (double)Volatile.Read(ref _fallbackServed) / succeeded;
```

The console reporter publishes those values continuously:
```csharp
Console.WriteLine($"Latency(ms): avg={snapshot.AverageLatencyMs:F1} p50={snapshot.P50LatencyMs:F1} p95={snapshot.P95LatencyMs:F1}");
Console.WriteLine($"Rates: error={snapshot.ErrorRate:P1} timeout={snapshot.TimeoutRate:P1} fallback={snapshot.FallbackRate:P1}");
Console.WriteLine($"Queue: current={snapshot.CurrentQueueDepth} peak={snapshot.PeakQueueDepth} saturation={snapshot.QueueSaturation:P1}");
```

This keeps operations measurable, with no external dependencies required for the demo.
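The Percentile helper the snapshot code calls is not shown above. A common nearest-rank implementation over the already-sorted samples would look like this; the exact interpolation used in the repository may differ.

```csharp
// Hedged sketch of a Percentile helper over pre-sorted samples
// (nearest-rank method; the article does not show the real body).
using System;

static double Percentile(double[] sortedSamples, double quantile)
{
    if (sortedSamples.Length == 0) return 0;

    // Nearest rank: ceil(q * N), converted to a zero-based index.
    var rank = (int)Math.Ceiling(quantile * sortedSamples.Length);
    var index = Math.Clamp(rank - 1, 0, sortedSamples.Length - 1);
    return sortedSamples[index];
}

double[] samples = { 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000 };
Console.WriteLine(Percentile(samples, 0.50)); // 500
Console.WriteLine(Percentile(samples, 0.95)); // 1000
```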
Reading the Full Run Output
A full successful run from this implementation produced:
```
Load summary:
- Total: 30
- Success: 30
- Failed: 0
- Timed out: 0
- Rate limited: 0
- Queue rejected: 0
- Canceled: 0
- Avg latency (served): 11964.0 ms
- P95 latency (served): 22886.2 ms

Final metrics snapshot:
Requests: recv=30 enq=30 done=30 ok=30 fail=0 timeout=0
Rates: error=0.0% timeout=0.0% fallback=33.3%
Model attempts=40 | circuit-open-skips=3
llama3.2:3b attempts=27 ok=20 fail=7 timeout=7 skip=3
mistral:7b  attempts=10 ok=7  fail=3 timeout=3 skip=0
phi3:mini   attempts=3  ok=3  fail=0 timeout=0 skip=0
```

How to interpret this:
- The gateway remained healthy under full test load with no dropped or timed-out requests
- Fallback served roughly one third of successful requests, which means resilience controls actively protected availability
- The primary model still timed out on some attempts, but the system recovered through deterministic fallback routing
- Latency is high for this hardware profile, which is expected in local model setups and still operationally visible through p50 and p95 reporting
Parameter Notes That Matter Most
These settings control most runtime behavior:
- RateLimitRequestsPerSecond: ingress safety valve against burst overload
- QueueCapacity: buffered backlog limit before explicit rejection
- WorkerCount: max in-process parallelism at the gateway layer
- AttemptTimeoutMs: upper bound for a single model call attempt
- EndToEndTimeoutMs: hard request deadline across all fallback stages
- CircuitBreakerFailureThreshold: failures needed before opening a model circuit
- CircuitBreakerOpenSeconds: cooldown before half-open probe attempt
- LoadConcurrentClients: synthetic test pressure used by the load harness
Tune these as a system. Raising timeouts without adjusting concurrency can still degrade p95 under local compute limits.
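A quick way to reason about these settings together is Little's law: sustainable throughput is roughly WorkerCount divided by average service time. Using the served latency from the run above (~12 s), this back-of-envelope check shows why the queue, not the rate limiter, is the binding constraint under sustained load:

```csharp
// Back-of-envelope capacity check for the tuned profile
// (worker count and latency taken from the run output above).
using System;
using System.Globalization;

var workerCount = 3;            // App:WorkerCount
var avgLatencySeconds = 12.0;   // ~12s average served latency on this hardware

var sustainableRps = workerCount / avgLatencySeconds;
Console.WriteLine(sustainableRps.ToString("F2", CultureInfo.InvariantCulture)); // 0.25

// At 8 req/s ingress, arrivals can outpace drain by ~32x, so the 24-slot
// queue (and its explicit rejections) is what bounds sustained bursts.
```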
Why This Architecture Works
The model remains useful, but deterministic code owns service reliability:
- Ingress is bounded by explicit rate and queue controls
- Execution respects hard timeout budgets and bounded retry behavior
- Circuit breakers prevent repeated failure amplification
- Fallback is deterministic and inspectable, not heuristic
- Metrics expose latency and failure behavior continuously
- Each failure mode has explicit outcome labeling for debugging and policy gates
Potential Enhancements
To further harden this system, you can consider:
- Add startup model warm-up calls to reduce cold-start latency spikes
- Add per-tenant rate limits and fairness policies
- Add adaptive timeout profiles by prompt class and model
- Add persistent metrics export to OpenTelemetry backends
- Add canary model rollout mode with automatic rollback thresholds
Final Notes
Reliable AI APIs are built with deterministic controls around the model path, not by trusting the model to manage service behavior.
When rate limits, queue bounds, timeout budgets, circuit states, fallback routing, and metrics are explicit, local AI systems become production-operable instead of demo-only.
Explore the source code at the GitHub repository.
See you in the next issue.
Stay curious.
Join the Newsletter
Subscribe for AI engineering insights, system design strategies, and workflow tips.