
A lot of multi-agent examples still confuse coordination with control. They add more agents, more message passing, and more orchestration layers, but do not materially improve grounding or answer reliability.
That is the wrong optimization target. If a system cannot explain where its claims came from, whether those claims were reviewed, and which deterministic checks stand between model output and the final answer, adding more agents usually adds more surface area for failure.
In this issue, we build a small grounded multi-agent research workflow in C#. Microsoft Agent Framework handles the agent runtime, but deterministic code still owns retrieval, evidence formatting, structured-output parsing, normalization, citation validation, and final approval boundaries.
What You Are Building
You are building a production-shaped research-and-review workflow that keeps agent behavior narrow and inspectable:
- Load runtime config from `appsettings.json` and `MARR_` environment overrides
- Resolve either an Azure AI Foundry project endpoint or an Azure OpenAI v1 endpoint into a direct chat endpoint
- Seed a small local knowledge base with stable document identifiers
- Run deterministic lexical retrieval before any agent executes
- Ask `ResearchAgent` to draft a grounded JSON answer from retrieved evidence only
- Ask `ReviewerAgent` to remove weak claims, preserve supported content, and decide approval
- Normalize text, strip formatting artifacts, validate citations, and trust only the final reviewed structure
This is not multi-agent orchestration for its own sake. It is a compact control loop for grounded answer generation.
System Structure
The architecture is intentionally small. Configuration loads first, a deterministic retrieval pass selects the evidence set, the first agent drafts a structured answer, the second agent reviews grounding and unsupported claims, and deterministic validation runs after both model calls before anything is treated as final.
The diagram below shows the high-level control flow:
Runtime Configuration First
The app starts by loading the agent runtime profile and validating it before any model interaction happens:
```csharp
var configuration = new ConfigurationBuilder()
    .SetBasePath(AppContext.BaseDirectory)
    .AddJsonFile("appsettings.json", optional: true, reloadOnChange: false)
    .AddEnvironmentVariables(prefix: "MARR_")
    .Build();

var config = AgentAppConfig.Load(configuration);
config.Validate();

var chatEndpoint = config.GetChatEndpoint();
```

The default configuration in this repo:
```json
{
  "Agent": {
    "Provider": "foundry-openai",
    "BaseUrl": "Your-Foundry-OpenAI-Base-URL",
    "ApiKey": "Your-Foundry-OpenAI-API-Key",
    "ModelId": "gpt-oss-120b",
    "TopDocumentCount": 3
  }
}
```

This matters because grounded systems begin with explicit runtime assumptions. Endpoint style, model identity, and retrieval depth are all visible control surfaces instead of hidden local state.
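The `AgentAppConfig` type itself is not shown in the excerpt. A minimal sketch consistent with the settings keys above, where the binder usage and validation messages are assumptions rather than the repo's exact code:

```csharp
using Microsoft.Extensions.Configuration;

public sealed class AgentAppConfig
{
    public string Provider { get; set; } = "";
    public string BaseUrl { get; set; } = "";
    public string ApiKey { get; set; } = "";
    public string ModelId { get; set; } = "";
    public int TopDocumentCount { get; set; } = 3;

    // Assumption: the repo binds the "Agent" section with the configuration binder.
    public static AgentAppConfig Load(IConfiguration configuration) =>
        configuration.GetSection("Agent").Get<AgentAppConfig>() ?? new AgentAppConfig();

    // Fail fast before any model call if required values are missing.
    public void Validate()
    {
        if (string.IsNullOrWhiteSpace(BaseUrl))
            throw new InvalidOperationException("Agent:BaseUrl is required.");
        if (string.IsNullOrWhiteSpace(ModelId))
            throw new InvalidOperationException("Agent:ModelId is required.");
        if (TopDocumentCount <= 0)
            throw new InvalidOperationException("Agent:TopDocumentCount must be positive.");
    }
}
```

The point is the shape, not the exact property set: configuration is loaded once, validated once, and everything downstream works with a typed object instead of string lookups.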
Endpoint Resolution Is Explicit
The configuration layer accepts more than one Azure-style endpoint and normalizes it into a direct chat endpoint:
```csharp
public Uri GetChatEndpoint()
{
    if (!Uri.TryCreate(BaseUrl, UriKind.Absolute, out var inputUri))
    {
        throw new InvalidOperationException("Agent:BaseUrl must be a valid absolute URI.");
    }

    var host = inputUri.Host;

    if (host.EndsWith(".openai.azure.com", StringComparison.OrdinalIgnoreCase))
    {
        return EnsureOpenAiPath(inputUri);
    }

    if (host.EndsWith(".services.ai.azure.com", StringComparison.OrdinalIgnoreCase))
    {
        var resourceName = host[..host.IndexOf(".services.ai.azure.com", StringComparison.OrdinalIgnoreCase)];
        return new Uri($"https://{resourceName}.openai.azure.com/openai/v1/");
    }

    return EnsureOpenAiPath(inputUri);
}
```

That keeps the app operationally simple. The caller can point at a Foundry project endpoint or an Azure OpenAI endpoint, and the runtime resolves the model-call boundary once up front instead of scattering endpoint assumptions through the codebase.
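`EnsureOpenAiPath` is referenced above but not shown in the excerpt. A minimal sketch that matches the described behavior, assuming the direct chat endpoint always ends in `/openai/v1/`:

```csharp
private static Uri EnsureOpenAiPath(Uri inputUri)
{
    // Assumption: a direct chat endpoint always ends with /openai/v1/.
    var path = inputUri.AbsolutePath.TrimEnd('/');
    if (path.EndsWith("/openai/v1", StringComparison.OrdinalIgnoreCase))
    {
        return new Uri($"{inputUri.Scheme}://{inputUri.Authority}{path}/");
    }
    return new Uri($"{inputUri.Scheme}://{inputUri.Authority}/openai/v1/");
}
```

With a helper like this, `https://my-resource.openai.azure.com` resolves to `https://my-resource.openai.azure.com/openai/v1/`, and an already-resolved URL passes through unchanged.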
Deterministic Retrieval Comes Before Agents
Before either agent runs, the system performs a local lexical retrieval pass over seeded documents:
```csharp
public IReadOnlyList<KnowledgeHit> Search(string query, int top)
{
    var queryTerms = Tokenize(query);
    return _documents
        .Select(document =>
        {
            var haystack = $"{document.Title} {document.Category} {document.Content}";
            var score = queryTerms.Count == 0
                ? 0
                : queryTerms.Count(term => haystack.Contains(term, StringComparison.OrdinalIgnoreCase));
            return new KnowledgeHit(document, score);
        })
        .Where(hit => hit.Score > 0)
        .OrderByDescending(hit => hit.Score)
        .ThenBy(hit => hit.Document.DocumentId, StringComparer.Ordinal)
        .Take(top)
        .ToList();
}
```

The seeded knowledge base is small, but it encodes the right engineering pattern: stable document IDs, bounded evidence, and deterministic selection before generation:
```csharp
new(
    "DOC-005",
    "Agent Role Design",
    "Agents",
    "Small multi-agent systems work best when each agent has a narrow role. One agent can gather evidence and draft an answer, while another agent reviews grounding, unsupported claims, and output quality before approval.")
```

This is one of the most important design decisions in the repo. The model does not decide what the evidence set is. Deterministic code does that first, then the agents are constrained to work inside that boundary.
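The retrieval code leans on two small types and a tokenizer that are not shown in the excerpt. A plausible sketch, where the record names are inferred from the calling code and the split character set is an assumption:

```csharp
// Document and hit shapes inferred from Search and the seed data above.
public sealed record KnowledgeDocument(
    string DocumentId,   // stable identifier such as "DOC-005"
    string Title,
    string Category,
    string Content);

public sealed record KnowledgeHit(KnowledgeDocument Document, int Score);

// Assumed tokenizer: split on whitespace and punctuation, drop
// one-character noise, and deduplicate so each term counts once.
private static IReadOnlyList<string> Tokenize(string query) =>
    query.Split(
            new[] { ' ', '\t', '\r', '\n', ',', '.', '?', '!', ';', ':', '(', ')' },
            StringSplitOptions.RemoveEmptyEntries)
        .Where(term => term.Length > 1)
        .Distinct(StringComparer.OrdinalIgnoreCase)
        .ToList();
```

Because the score counts distinct matching terms rather than term frequency, and ties fall back to ordinal `DocumentId` ordering, the ranking is fully deterministic for a fixed corpus.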
Two Agents, Two Narrow Roles
The app creates two `ChatClientAgent` instances that share the same model client but not the same responsibility:

```csharp
AIAgent researchAgent = new ChatClientAgent(
    chatClient,
    new ChatClientAgentOptions
    {
        Name = "ResearchAgent",
        Instructions = ResearchReviewService.BuildResearchInstructions()
    });

AIAgent reviewerAgent = new ChatClientAgent(
    chatClient,
    new ChatClientAgentOptions
    {
        Name = "ReviewerAgent",
        Instructions = ResearchReviewService.BuildReviewerInstructions()
    });
```

The research role is intentionally constrained:
```text
Non-negotiable rules:
1. Use only the supplied evidence.
2. Every key factual claim must be supported by at least one citation.
3. If evidence is weak or incomplete, say so.
4. Do not fabricate sources, metrics, dates, or product behavior.
5. Return only valid JSON matching the required schema.
6. Keep the answer concise: one short paragraph plus keyPoints.
```

The reviewer then acts as a second control boundary, not a second free-form writer:
```text
Non-negotiable rules:
1. Keep only claims that are supported by the supplied evidence.
2. Remove any fabricated or weak claims and list them in unsupportedClaims.
3. citations must contain only document ids from the supplied evidence.
4. confidence must be one of high, medium, or low.
5. Set approved=true only when the answer is properly grounded.
```

That separation is subtle but important. The first agent is optimized for synthesis under evidence constraints. The second agent is optimized for rejection, cleanup, and approval discipline.
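The excerpt does not show how the retrieved evidence reaches either agent. A hedged sketch of how a `BuildResearchPrompt`-style helper might format the bounded evidence packet; the repo's actual prompt layout may differ:

```csharp
using System.Text;

private static string BuildResearchPrompt(string question, IReadOnlyList<KnowledgeHit> hits)
{
    var builder = new StringBuilder();
    builder.AppendLine($"Question: {question}");
    builder.AppendLine();
    builder.AppendLine("Evidence (cite only these document ids):");
    foreach (var hit in hits)
    {
        // Each document is labeled with its stable id so the model
        // can cite it and the validator can check the citation later.
        builder.AppendLine($"[{hit.Document.DocumentId}] {hit.Document.Title}");
        builder.AppendLine(hit.Document.Content);
        builder.AppendLine();
    }
    return builder.ToString();
}
```

Whatever the exact format, the key property is that the evidence packet is assembled by deterministic code from the retrieval output, so the agent never sees documents outside the retrieved boundary.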
Structured Output Is the Contract
Both agents must return the same JSON shape:
```json
{
  "answer": "string",
  "keyPoints": [
    "string",
    "string"
  ],
  "citations": [
    "DOC-001"
  ],
  "unsupportedClaims": [
    "string"
  ],
  "confidence": "high|medium|low",
  "approved": false
}
```

The runtime extracts the JSON payload, deserializes it into a typed contract, and normalizes the result before validation:
```csharp
private static ResearchAnswer ParseStructuredAnswer(string response)
{
    var payload = ExtractJsonObject(response);
    var answer = JsonSerializer.Deserialize<ResearchAnswer>(payload, SerializerOptions);
    if (answer is null)
    {
        throw new InvalidOperationException("Model returned an empty structured answer.");
    }
    Normalize(answer);
    return answer;
}
```

This is the point where agent output stops being opaque text. Once the response is forced into a stable shape, deterministic code can decide what is acceptable and what is not.
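Two supporting pieces of `ParseStructuredAnswer` are not shown in the excerpt: the `ResearchAnswer` contract and the `ExtractJsonObject` helper. A sketch of both, where the property defaults, the serializer options, and the brace-scanning strategy are assumptions rather than the repo's exact code:

```csharp
using System.Text.Json;

public sealed class ResearchAnswer
{
    public string Answer { get; set; } = string.Empty;
    public List<string> KeyPoints { get; set; } = new();
    public List<string> Citations { get; set; } = new();
    public List<string> UnsupportedClaims { get; set; } = new();
    public string Confidence { get; set; } = "low";
    public bool Approved { get; set; }
}

// Assumption: case-insensitive binding maps the camelCase JSON
// properties onto the PascalCase C# contract.
private static readonly JsonSerializerOptions SerializerOptions =
    new() { PropertyNameCaseInsensitive = true };

// Common approach: take the span from the first '{' to the last '}',
// which also strips markdown fences the model may wrap around JSON.
private static string ExtractJsonObject(string response)
{
    var start = response.IndexOf('{');
    var end = response.LastIndexOf('}');
    if (start < 0 || end <= start)
    {
        throw new InvalidOperationException("Model response did not contain a JSON object.");
    }
    return response[start..(end + 1)];
}
```

The outer-brace scan is deliberately dumb: it does not parse nested structure, it just trusts `JsonSerializer` to reject anything malformed inside the extracted span.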
Normalization Removes Common Model Artifacts
The service does not trust the raw text even after JSON parsing. It cleans common artifacts before the reviewed answer is accepted:
```csharp
private static string CleanText(string text)
{
    if (string.IsNullOrWhiteSpace(text))
    {
        return string.Empty;
    }
    var cleaned = InlineCitationRegex.Replace(text, " ");
    cleaned = cleaned.Replace("?.", ".")
        .Replace("?,", ",")
        .Replace("?;", ";")
        .Replace("?:", ":")
        .Replace("?)", ")")
        .Replace("(?", "(");
    cleaned = MultiSpaceRegex.Replace(cleaned, " ").Trim();
    cleaned = cleaned.Trim(' ', '-', '*');
    if (cleaned.Length > 1 && char.IsDigit(cleaned[0]) && cleaned[1] is '.' or ')')
    {
        cleaned = cleaned[2..].TrimStart();
    }
    return cleaned;
}
```

That cleanup step is a practical engineering detail many demos ignore. Real model output often arrives with inline citation markers, stray numbering, or formatting leftovers. If you want stable downstream behavior, you normalize those artifacts deliberately.
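The two regexes used by `CleanText` are not shown in the excerpt. Plausible definitions, labeled as assumptions since the repo's actual patterns may differ:

```csharp
using System.Text.RegularExpressions;

// Assumed pattern: matches inline citation markers such as [DOC-005] or [1].
private static readonly Regex InlineCitationRegex =
    new(@"\[(?:DOC-\d+|\d+)\]", RegexOptions.Compiled);

// Collapses runs of whitespace into a single space.
private static readonly Regex MultiSpaceRegex =
    new(@"\s{2,}", RegexOptions.Compiled);
```

Compiling both once as static fields keeps the cleanup path cheap even when it runs on every key point of every answer.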
Final Validation Keeps Trust in Code
After the reviewer returns a structured answer, the service validates the final result against deterministic rules tied to the retrieved evidence set:
```csharp
private static void Validate(ResearchAnswer answer, IReadOnlyList<KnowledgeHit> hits)
{
    if (string.IsNullOrWhiteSpace(answer.Answer))
    {
        throw new InvalidOperationException("Research answer must include non-empty answer text.");
    }
    if (answer.KeyPoints.Count is < 2 or > 4)
    {
        throw new InvalidOperationException("Research answer must include between 2 and 4 key points.");
    }
    if (!ValidConfidenceLevels.Contains(answer.Confidence.Trim().ToLowerInvariant()))
    {
        throw new InvalidOperationException("Research answer confidence must be high, medium, or low.");
    }
    var allowedCitations = hits
        .Select(hit => hit.Document.DocumentId)
        .ToHashSet(StringComparer.Ordinal);
    if (answer.Citations.Count == 0)
    {
        throw new InvalidOperationException("Research answer must cite at least one retrieved document.");
    }
    // Enforce the citation boundary: without this check, allowedCitations
    // is computed but never used, and out-of-scope citations slip through.
    if (answer.Citations.Any(citation => !allowedCitations.Contains(citation)))
    {
        throw new InvalidOperationException("Research answer citations must reference retrieved documents only.");
    }
}
```

This is where the architecture earns its credibility. The reviewer can approve an answer, but the system still checks that the answer has text, that the key-point count is sane, that confidence is normalized, and that every citation points only to retrieved documents.
The Workflow Is Sequential by Design
The actual control loop is compact and readable:
```csharp
var researchPrompt = BuildResearchPrompt(question, hits);
var researchResponse = await _researchAgent.RunAsync(researchPrompt, cancellationToken: ct);
var researchText = GetLastText(researchResponse);
var researchDraft = ParseStructuredAnswer(researchText);

var reviewerPrompt = BuildReviewerPrompt(question, hits, researchText);
var reviewResponse = await _reviewerAgent.RunAsync(reviewerPrompt, cancellationToken: ct);
var reviewText = GetLastText(reviewResponse);
var reviewedAnswer = ParseStructuredAnswer(reviewText);

Validate(reviewedAnswer, hits);
return new ResearchReviewRun(researchText, researchDraft, reviewText, reviewedAnswer);
```

That is a better fit for this problem than a more elaborate agent graph. The second step depends on the first step's draft, and both depend on the same fixed evidence set. A sequential workflow keeps the reasoning path clear and the failure boundaries obvious.
Walking the Sample Question
The sample question bundled into the app is:
```text
What are the most important engineering controls for a small multi-agent AI system, and how should we keep it grounded?
```

Given the seeded knowledge base, the deterministic retrieval layer will naturally favor documents about agent role design, retrieval design, observability, and prompt change control. That means the model is not being asked to improvise from general prior knowledge. It is being asked to synthesize from a bounded evidence packet.
The console application then prints the retrieved evidence, the raw JSON from `ResearchAgent`, the reviewed JSON from `ReviewerAgent`, and the final validated answer object. That output sequence is useful because it exposes where the draft changed, which claims were retained, and whether unsupported content was pushed into `unsupportedClaims`.
For a repo like this, that visibility is more valuable than trying to simulate a complex autonomous swarm. You can inspect the exact handoff between retrieval, drafting, review, and validation.
Reading a Real Live Run
A real run against `gpt-oss-120b` through the resolved Azure OpenAI v1 endpoint produced:

```text
=== Minimal Multi-Agent Research and Review System ===
Provider: foundry-openai
Model: gpt-oss-120b
Configured endpoint: https://aoai-newsletter-lab.services.ai.azure.com/api/projects/newsletter-agent-lab
Resolved chat endpoint: https://aoai-newsletter-lab.openai.azure.com/openai/v1/
Knowledge documents: 6

Question> What are the most important engineering controls for a small multi-agent AI system, and how should we keep it grounded?

Retrieved evidence:
- DOC-005 score=7 title=Agent Role Design
- DOC-006 score=4 title=Azure AI Cost Controls
- DOC-001 score=3 title=Prompt Change Control Checklist

=== Agent 1: ResearchAgent Draft ===
Draft Approved: False
Draft Confidence: high
Draft Key Points:
- Assign narrow roles and a grounding reviewer to each agent
- Version and evaluate prompts on frozen data and enforce request budgets with caching
Draft Citations:
- DOC-005
- DOC-006
- DOC-001

=== Agent 2: ReviewerAgent Review ===
Reviewed Approved: True
Reviewed Confidence: high
Reviewed Key Points:
- Define narrow roles and use a grounding reviewer for each agent
- Version and test prompts on frozen data before rollout
- Apply pre-execution cost limits, route low-priority work cheaply, and reuse cached deterministic responses
Reviewed Citations:
- DOC-005
- DOC-006
- DOC-001

=== Final Validated Output ===
Final Approved: True
Final Confidence: high
```

How to interpret this:
- The retrieval layer selected the evidence set before any agent reasoning happened, and every final citation stayed inside that retrieved boundary
- `ResearchAgent` produced a grounded draft but left `approved` as `false`, which preserves the approval boundary for the second stage
- `ReviewerAgent` kept the same evidence base, refined the wording, expanded the key points, and explicitly promoted the result to `approved=true`
- The final validated output passed deterministic checks for answer text, citation scope, confidence shape, and key-point count before the system trusted it
This is the practical value of the design. The model is not just answering a question. It is moving through a controlled sequence where evidence selection, draft generation, review, and final trust are all visible as separate engineering steps.
Why This Architecture Works
The model remains useful, but trust stays in deterministic code and narrow responsibilities:
- Retrieval is deterministic and runs before any agent starts generating
- Each agent has a single job instead of a vague shared mandate
- Both agents are constrained to the same structured output contract
- Normalization removes formatting artifacts before outputs are trusted
- Final citations are validated against the retrieved document IDs only
- Approval is explicit instead of implied by the existence of an answer
Potential Enhancements
To extend this project further, you can consider:
- Replace the seeded in-memory knowledge base with Azure AI Search or another external retrieval layer
- Add automated tests for retrieval ranking, endpoint resolution, JSON parsing, and citation validation
- Add richer retrieval scoring while preserving deterministic ranking and inspectability
- Introduce a policy or safety agent only if its approval semantics remain explicit and bounded
- Persist research-review runs so unsupported-claim patterns can be analyzed over time
Final Notes
Multi-agent systems become more useful when the extra agent is introduced to enforce a distinct control boundary, not just to create more conversational output.
When retrieval is deterministic, roles are narrow, outputs are structured, and final validation remains outside the model, you get a multi-agent workflow that is still understandable as software.
Explore the source code at the GitHub repository.
See you in the next issue.
Stay curious.
Join the Newsletter
Subscribe for AI engineering insights, system design strategies, and workflow tips.