Large enterprises run on rules — eligibility, pricing, underwriting, compliance, claims. These systems are deterministic by design because businesses need consistent decisions and auditors need clear explanations. So what happens when you introduce LLMs, which are inherently probabilistic, into these environments?
The core principle that emerged: Use LLMs as a compiler front-end; keep your deterministic pipeline as the back-end.
We tried three approaches for automated rule generation in a Drools-based platform:
Attempt 1: End-to-End Generation
The LLM acted like a full-stack engineer: select dependencies, write executable Drools DRL, and generate the SQL needed to persist rules into the platform. The DRL executed successfully, but the approach was brittle. In practice, "inserting a rule" required coordinated inserts across 11 tables, and context limits forced us to generate SQL for only three and hard-code references to the rest, undermining determinism and auditability. Dependency selection introduced nondeterminism, semantic correctness was hard to verify, and the solution was tightly coupled to platform internals, making it unrealistic for production.
Attempt 2: Constrained Translation
We constrained the LLM to translator mode: convert natural language requirements into a structured JSON template matching our existing Excel ingestion format, then feed that to our proven Java ingestion utility. The LLM proposed; the deterministic system validated and executed. This was slower to prototype but far more realistic: easier validation, better governance, operational fit with existing workflows, and separation of concerns between interpretation (LLM) and execution (deterministic pipeline).
Attempt 3: Agentic Validation Loops
Building on Attempt 2, we added self-correction capability. The loop works like this:
- LLM generates JSON template from requirements
- Run Java ingestion utility
- Validate against historical template corpus → if mismatches, LLM investigates
- Run regression tests → if failures, LLM analyzes and corrects
- Present final candidate to human for review
- Deploy with full traceability
This combines LLM interpretation speed with iterative self-correction — all within deterministic guardrails. The LLM learns from its mistakes in real-time, dramatically improving output quality while maintaining the structured, governed approach of Attempt 2.
A Simple Framework: Six Dimensions That Matter
When an LLM touches deterministic systems, the bar isn't "looks right" — it's repeatable, auditable, testable, and safe to ship. Evaluate any LLM initiative along six axes:
- Contract Strength: Free-form text vs. structured schema with tight constraints
- Semantic Correctness: Does it implement requirements exactly, including edge cases?
- Reproducibility: Same input → consistent output (through validation, not generation)
- Verification Harness: Automated checks, tests, agentic loops that iterate on failures
- Auditability: Full traceability from requirement to deployed rule
- Operational Fit: Works with existing tools, UIs, approval workflows
Rule of thumb: The more "code-shaped" the output, the more verification you need. The more "form-shaped," the more you can rely on existing deterministic machinery.
Key Learnings From These Attempts
-
Avoid end-to-end code generation in governed systems. It introduces hidden nondeterminism (dependency selection that compiles but is subtly wrong), brittle workarounds from context limits (generating SQL for 3 of 11 tables), and creates review bottlenecks in regulated environments.
-
Use LLMs as translators to frozen schemas, not autonomous engineers. Convert requirements into tight, reviewable schemas (JSON templates matching existing formats), predefine dependency bundles and contracts to eliminate creative variance, then hand off execution to the deterministic pipeline.
-
Leverage existing deterministic infrastructure. Feed LLM-generated structured inputs into proven ingestion/compilation utilities instead of replacing them — reliability for free, LLM speed as bonus.
-
Validate against historical examples early. Comparing generated templates to hundreds of historical examples catches structural problems before they touch the rules engine.
-
Enforce determinism through agentic validation loops, not generation. Run schema checks, compilation, regression tests, and corpus comparisons programmatically. Let the LLM iterate on error messages until tests pass — self-correction within guardrails dramatically improves output quality.
-
Put humans at decision points, not in every iteration. Experts review the final validated artifact; automated loops handle refinement. Store full traceability: requirement version, artifact version, test results, approvals, reviewer, timestamp, deployed rule IDs, rollback pointers.
The winning architecture
The winning architecture separates concerns so each component does what it does best:
- LLM as intelligent translator → converts requirements to structured representations
- Agentic validation loops → iterative refinement based on automated test failures
- Deterministic compilation/execution → existing systems remain source of truth
- Frozen dependencies and schemas → eliminate nondeterministic selection
- Human-in-the-loop at decision points → experts review final output; LLM handles iteration speed
This isn't "AI replaces developers" or "AI is just autocomplete." It's "AI as a force multiplier for specialized knowledge workers operating within well-defined guardrails."
Conclusion: The Right Question
The enterprises succeeding with LLMs aren't asking "How do we replace our legacy systems with AI?"
They're asking: How do we make our legacy systems 10x more responsive to business needs using LLMs as an intelligent interface layer?
Start there. Build from there. And measure success not by what you've replaced, but by how much faster your organization can turn business requirements into business value.