Wednesday, June 24, 2026

How Not to Trust an LLM

 

As impressive as Large Language Models (LLMs) have become, one fundamental reality remains: they are not deterministic systems. Their outputs are built from a chaotic blend of training data, fine-tuning, context windows, and billions of opaque parameters. Even the engineers who build these models cannot fully explain why a specific output appears when it does.

Despite this, modern AI architectures often hand LLMs the keys to the kingdom. The current industry standard typically looks like this:

frontend → LLM → tool use

In this flow, the model acts as the brain: it interprets the user's goal, decides which tool to call, generates the necessary parameters, and effectively acts as the system’s primary decision-maker.

From an engineering perspective, this is a flawed premise. We are asking a probabilistic engine to make deterministic choices—often involving file systems, sensitive data, or critical infrastructure. That isn't just risky; it’s a failure of architectural design.

There is a more robust way to build these systems: treat the LLM as a contributor, not a commander.


A Deterministic Alternative

frontend → LLM → router → MCP server

In a secure architecture, the LLM’s role is stripped back to what it actually does well: producing structured text. Instead of being granted permission to execute commands, the model is restricted to emitting a simple, inert JSON object:

        {
              "action": "savefile",
              "filename": "notes.txt"
        }

This output is not "intelligence"—it is data. It is not executed; it is not trusted. It is simply passed to a deterministic backend for verification.

The Router

Think of the router as a lightweight gatekeeper. It performs one task: reading the action field and forwarding the request to the corresponding Model Context Protocol (MCP) server. It does not "interpret intent" or guess at user meaning. If the JSON fails to parse, lacks a required field, or names an action the router does not know, the router rejects the request immediately.
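As a rough illustration of the router described above, here is a minimal Python sketch. The names route, RouterError, HANDLERS, REQUIRED_FIELDS, and save_file_handler are hypothetical, and the handler is only a stand-in for a real MCP server; this is not the MCP protocol itself, just the dispatch logic under those assumptions.

```python
import json


class RouterError(Exception):
    """Raised when the LLM output cannot be routed."""


def save_file_handler(request: dict) -> dict:
    # Hypothetical stand-in for an MCP server. A real one would
    # validate the request further and execute it inside a sandbox.
    return {
        "status": "accepted",
        "action": request["action"],
        "filename": request["filename"],
    }


# Hypothetical registry: action name -> handler, and the exact
# set of fields each action must carry.
HANDLERS = {"savefile": save_file_handler}
REQUIRED_FIELDS = {"savefile": {"action", "filename"}}


def route(raw_output: str) -> dict:
    """Parse the LLM's text as JSON and dispatch on the action field only."""
    try:
        request = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        raise RouterError("malformed JSON") from exc

    if not isinstance(request, dict):
        raise RouterError("expected a JSON object")

    action = request.get("action")
    if not isinstance(action, str) or action not in HANDLERS:
        raise RouterError("unknown action")

    # Reject requests with missing or unexpected fields.
    if set(request) != REQUIRED_FIELDS[action]:
        raise RouterError("missing or unexpected fields")

    return HANDLERS[action](request)
```

Calling route('{"action": "savefile", "filename": "notes.txt"}') forwards the request to the handler, while anything that fails a check raises RouterError before any handler runs. The dispatch is a plain dictionary lookup, so the router's behavior for a given input is always the same.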

The MCP Server

This is where true decision-making occurs. Each MCP server is a hard-coded, sandboxed executor designed for a single category of action. It validates the JSON, checks file system boundaries, enforces security policies, and rejects anything outside its predefined scope. If the LLM tries to access a restricted directory or perform an unauthorized operation, the MCP server simply refuses.
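One of the checks mentioned above, the file system boundary, might look roughly like the following Python sketch. The directory /srv/llm-files and the names ALLOWED_ROOT, PolicyViolation, and resolve_safe_path are assumptions made for illustration; a real MCP server would combine a check like this with its other policies and with OS-level sandboxing.

```python
from pathlib import Path

# Hypothetical sandbox directory; the only place files may be written.
ALLOWED_ROOT = Path("/srv/llm-files")


class PolicyViolation(Exception):
    """Raised when a request falls outside the server's predefined scope."""


def resolve_safe_path(filename: str, root: Path = ALLOWED_ROOT) -> Path:
    """Return the absolute path for filename, or refuse if it leaves root."""
    if not isinstance(filename, str) or not filename or "\x00" in filename:
        raise PolicyViolation("invalid filename")

    root_resolved = root.resolve()
    # resolve() collapses ".." segments and follows symlinks, so the
    # comparison below is made against the real target location.
    candidate = (root_resolved / filename).resolve()

    # The target must sit strictly inside the allowed directory.
    if root_resolved not in candidate.parents:
        raise PolicyViolation("path escapes the allowed directory")

    return candidate
```

Under these assumptions, a filename such as "notes.txt" resolves to a path under the allowed directory, while "../secret.txt" or an absolute path like "/etc/passwd" raises PolicyViolation. The decision depends only on the path itself, never on anything the model says about its intent.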

We can turn invalid requests into a security asset:

  • Silently log them for auditing.
  • Return a clean, standardized error to the frontend.

In this model, the LLM never has the opportunity to bypass safety filters or cause unintended harm.
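The handling of invalid requests described above (log for auditing, return a standardized error) can be sketched in a few lines of Python. The logger name llm.audit, the error payload, and the reject function are hypothetical choices for this example, not part of any specification.

```python
import logging

# Hypothetical audit logger; where its records go is a deployment choice.
audit_log = logging.getLogger("llm.audit")

# The same generic payload is returned for every rejection, so the
# frontend (and the model) learn nothing about which check failed.
GENERIC_ERROR = {"status": "error", "message": "Request could not be completed."}


def reject(raw_output: str, reason: str) -> dict:
    """Record the rejected request for auditing and return a standard error."""
    # Truncate the payload so an oversized output cannot flood the log.
    audit_log.warning(
        "rejected LLM request: reason=%s payload=%r", reason, raw_output[:500]
    )
    return dict(GENERIC_ERROR)
```

Keeping the detailed reason in the audit log while returning the same generic message every time is one way to make rejections useful to operators without giving an attacker feedback to iterate on.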


Why This Architecture Works

This approach succeeds because it shifts the paradigm: LLMs are workers, not decision-makers.

  • Constraint: The LLM generates text and nothing else.
  • Determinism: The router dispatches requests via rigid, predictable logic.
  • Enforcement: MCP servers maintain human-defined, hard-coded boundaries.
  • Isolation: Every action is sandboxed.

Because every request passes through explicit validation, the failure modes of the older "LLM-as-decision-maker" model are contained: a hallucinated action is rejected rather than executed, a prompt injection cannot obtain privileges the MCP servers do not already grant, and no LLM output can directly touch the underlying operating environment.

The previous paradigm relied on the assumption that a model could reliably choose safe actions. That assumption has never been true, and given the probabilistic nature of current models, it never will be.


The Bottom Line

If you want to build a truly robust AI system, stop asking the LLM to think like a security auditor or an operating system. Ask it to do what it is optimized for: producing structured data. Then, let deterministic, human-designed components decide the execution path.

It isn't about being cynical or "distrusting" AI. It is about building systems that simply don't require you to trust them in the first place.



