One of the most surprising discoveries when deploying Large Language Models for real-world use is variability in judgment.
When we first started experimenting with LLMs to support risk mitigation in contracts (by having the LLM directly revise contract clauses), we quickly learned that getting the prompt right was only part of the battle.
The initial assumption was: "If it worked once, it will always work." In reality, even when we sent exactly the same input, the model could produce outputs of noticeably different quality.
This problem is made worse by the pace at which providers like OpenAI update their models. You might think you're using GPT-4, but that label can refer to different sub-versions over time, each updated for speed, cost, or other optimization goals.
One day, the API returned flawless English risk assessments. The next day - with no code or prompt changes - it inexplicably returned content in Chinese. And no, there was absolutely nothing in the prompt that directed the model to respond in any language other than English. That’s when we realized: reliability cannot be taken for granted.
In legal work, traceability and clarity are non-negotiable. It wasn’t enough for the machine to revise text - the user needed to see exactly what had changed: additions, deletions, and modifications. Because no matter how advanced the AI, humans remain accountable for the output. This is Machine-Assisted Human Decision Making, and it must be engineered deliberately.
On paper, the value proposition of LLMs sounds simple. But real innovation requires deliberate engineering, especially when you want to differentiate with new services rather than just automate old ones. Our use case of recommending mitigations by revising contract clauses would not be possible without LLMs.
Turning that into a production-grade system required addressing four key challenges:
1. Output Normalization: Require LLMs to generate outputs in a predefined, structured format (e.g., JSON or marked-up text) to ensure consistency and predictability across responses (a minimal validation sketch follows this list).
2. Diff Alignment: Implement a service that visually displays changes between the original and revised contract text, akin to Microsoft Word’s “Track Changes,” highlighting additions, deletions, and modifications for review (see the second sketch below).
3. Human Override Interfaces: Design interfaces that clearly show what changes were made, where they occur in the text, and why (e.g., risk mitigation rationale), enabling users to accept, reject, or modify AI suggestions.
4. Resilience to Upstream Model Changes: Lock model versions to maintain consistent behavior, since provider updates can alter performance and are not guaranteed to improve results (see the third sketch below).
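
To make the first point concrete, here is a minimal sketch of output normalization in Python. It assumes the model has been prompted to return a JSON object per clause revision; the field names (clause_id, revised_text, rationale) and the use of the pydantic library are illustrative choices, not the exact schema or stack behind our system.

```python
import json
from pydantic import BaseModel, ValidationError

# Hypothetical schema for one clause revision; the field names are
# illustrative, not the exact production schema.
class ClauseRevision(BaseModel):
    clause_id: str
    original_text: str
    revised_text: str
    rationale: str  # why the revision mitigates the identified risk

def parse_llm_output(raw: str) -> ClauseRevision:
    """Accept only valid JSON that matches the schema.

    Anything else raises, so the caller can re-prompt the model
    instead of letting malformed output reach the user.
    """
    try:
        return ClauseRevision(**json.loads(raw))
    except (json.JSONDecodeError, ValidationError) as exc:
        raise ValueError(f"LLM output failed normalization: {exc}") from exc
```

The point of the gate is that malformed or free-form answers never flow downstream: they are caught at the boundary and retried, which is what makes the rest of the pipeline predictable.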
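For the second point, the "Track Changes" view does not have to come from the model itself; it can be computed deterministically from the original and revised text. The sketch below uses Python's standard difflib at word level; the [-...-] / {+...+} markup is just one possible rendering.

```python
import difflib

def track_changes(original: str, revised: str) -> str:
    """Word-level diff: deletions wrapped in [-...-], insertions in {+...+}."""
    orig_words, rev_words = original.split(), revised.split()
    matcher = difflib.SequenceMatcher(None, orig_words, rev_words)
    parts = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("delete", "replace"):
            parts.append("[-" + " ".join(orig_words[i1:i2]) + "-]")
        if tag in ("insert", "replace"):
            parts.append("{+" + " ".join(rev_words[j1:j2]) + "+}")
        if tag == "equal":
            parts.append(" ".join(orig_words[i1:i2]))
    return " ".join(parts)

print(track_changes(
    "The supplier may terminate this agreement at any time.",
    "The supplier may terminate this agreement with 30 days written notice.",
))
# The supplier may terminate this agreement [-at any time.-] {+with 30 days written notice.+}
```

In the interface, each marked span becomes an individual change the reviewer can accept, reject, or edit, shown alongside the model's rationale - which is what makes the human override loop in point 3 workable.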
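And for the fourth point, the simplest defense is to call a dated model snapshot rather than a floating alias. The sketch below uses the OpenAI Python SDK (v1.x); "gpt-4-0613" is an example snapshot name, so substitute whichever version you have actually validated.

```python
from openai import OpenAI

# Pin an exact, dated snapshot rather than the floating "gpt-4" alias,
# so provider-side updates cannot silently change behavior.
# "gpt-4-0613" is an example; pin the version you validated against.
PINNED_MODEL = "gpt-4-0613"

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def revise_clause(clause: str, instructions: str) -> str:
    response = client.chat.completions.create(
        model=PINNED_MODEL,   # never the floating alias
        temperature=0,        # reduces (but does not eliminate) run-to-run variance
        messages=[
            {"role": "system", "content": instructions},
            {"role": "user", "content": clause},
        ],
    )
    return response.choices[0].message.content
```

Treat any change of the pinned snapshot as a release of your own system: re-run your evaluation set before switching, because, as noted above, upstream updates are not guaranteed to improve your results.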
In short, deploying LLMs in high-stakes environments isn’t about hype; it’s about reliability, explainability, and execution.