Original Paper: Do Large Language Models Respect Contracts? Evaluating and Enforcing Contract-Adherence in Code Generation
Authors: Soohan Lim (1), Joonghyuk Hahn (1), Hyunwoo Park (2), Sang-Ki Ko (2), Yo-Sub Han (1)
(1) Yonsei University, Seoul, Republic of Korea; (2) University of Seoul, Seoul, Republic of Korea
{aness1219,greghahn,emmous}@yonsei.ac.kr, {hwpark03,sangkiko}@uos.ac.kr

---
TLDR:
- Large Language Models consistently prioritize functional correctness over explicit adherence to contractual constraints (preconditions) in generated code.
- Standard code benchmarks (e.g., HumanEval+) are inadequate for assessing the robustness and real-world deployability of LLM-generated software.
- Enforcing contract adherence requires concrete examples of failure (negative test cases) in the prompt, as descriptive natural language instructions alone are ineffective.
The prevailing discourse on Large Language Model (LLM) efficacy often centers on speed and functional output quality. However, the critical question of robustness and compliance remains undertheorized. A recent paper by Lim, Hahn, Park, Ko, and Han—titled “Do Large Language Models Respect Contracts? Evaluating and Enforcing Contract-Adherence in Code Generation”—shifts the focus from simple functionality to contractual fidelity, revealing a significant gap in current generative AI capabilities.
The critical technical and legal knot this research untangles is the discrepancy between “code that works” and “code that complies.” In software engineering, a contract defines the preconditions, post-conditions, and validity constraints that dictate how a function must behave, especially when receiving ill-formed or hostile inputs. Current evaluation metrics (such as pass@k on HumanEval+) assess if code works with valid inputs, ignoring the crucial requirement that robust code must explicitly reject inputs that violate its contract (e.g., rejecting a negative number when a positive value is mandated).
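To make the distinction concrete, here is a toy illustration (ours, not drawn from the paper's benchmark): both functions below satisfy a typical functional test such as order_total(10, 3) == 30, but only the second honors a contract that mandates positive inputs.

```python
# "Code that works": returns the right answer for valid inputs, which is all
# that pass@k-style functional tests typically exercise.
def order_total(unit_price, quantity):
    return unit_price * quantity

# "Code that complies": the contract mandates positive values, so ill-formed
# input is rejected explicitly instead of producing a silently wrong total.
def order_total_checked(unit_price, quantity):
    if unit_price <= 0:
        raise ValueError("unit_price must be positive")
    if quantity <= 0:
        raise ValueError("quantity must be a positive integer")
    return unit_price * quantity
```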
Failure to enforce these constraints is not merely a technical oversight; it introduces predictable vulnerabilities, violates security requirements (CWEs), and creates clear lines of commercial liability. When deployed in enterprise or regulated environments, code that ignores preconditions can lead to exploitable system states, data corruption, or denial of service. The authors introduce the Program Assessment and Contract-Adherence Evaluation Framework (PACT) to systematically measure this ignored dimension of code quality, providing the first rigorous mechanism to quantify this risk.
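The paper's PACT implementation is not reproduced here, but the core idea of scoring the two dimensions separately can be sketched in a few lines; the test-suite layout and the names positive_tests and violating_inputs are our own illustrative assumptions, not the authors' API.

```python
# Illustrative scoring sketch (assumed structure, not the authors' PACT code):
# a solution is functionally correct if it passes all valid-input tests, and
# contract-adherent if it explicitly rejects every contract-violating input.
def evaluate(solution, positive_tests, violating_inputs):
    functional = all(solution(*args) == expected for args, expected in positive_tests)
    adherent = True
    for args in violating_inputs:
        try:
            solution(*args)
            adherent = False       # silently accepted an invalid input
        except (ValueError, TypeError):
            pass                   # explicit rejection is the compliant outcome
    return functional, adherent

# On the toy example above, order_total scores (True, False) against violating
# inputs such as (-1, 2), while order_total_checked scores (True, True).
```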
Key Findings
- Functional Correctness Masks Compliance Failure: LLMs can achieve high scores on conventional benchmarks, indicating functional correctness, while simultaneously exhibiting poor contract adherence. This confirms that high pass rates are misleading indicators of deployability or robustness, particularly when the generated code lacks necessary input validation or error handling mandated by the contract.
- Descriptive Prompting is Brittle: Simply augmenting the prompt with clear, natural language descriptions of the contract (e.g., “ensure the input list is non-empty”) is insufficient to compel the LLM to generate compliant code. The models often prioritize the core functional task over the explicit constraints, indicating a fundamental difficulty in translating abstract legalistic rules into mandatory defensive coding practices.
- Learning from Failure is Key: Contract adherence improves markedly, often dramatically, when prompts are augmented not just with contract descriptions but also with specific, contract-violating test cases. This suggests that LLMs internalize constraints more effectively from concrete examples of failures that must be prevented than from abstract statements of rules that must be followed (see the prompt sketch after this list).
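A hypothetical prompt fragment (the wording is ours, not taken from the paper) illustrates the contrast: the descriptive version states the constraint in prose, while the augmented version adds a concrete contract-violating call that the generated code must reject.

```python
# Descriptive-only prompt: states the contract in prose. The paper finds this
# alone is often not enough to compel defensive code.
descriptive_prompt = """
Write mean(values) that returns the arithmetic mean of a list of numbers.
Contract: the input list must be non-empty.
"""

# Augmented prompt: appends a concrete contract-violating test case that the
# generated code must handle by raising, the strategy that improved adherence.
augmented_prompt = descriptive_prompt + """
The following call violates the contract and must raise ValueError:
    mean([])   # empty list -> ValueError, never ZeroDivisionError or 0
"""
```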
Legal and Practical Impact
These findings directly challenge the common practice of using LLMs for rapid prototyping and deployment in sectors sensitive to reliability and compliance, such as finance, healthcare, and critical infrastructure.
If an LLM-generated function fails to validate input constraints, and that failure subsequently leads to a system exploit or operational failure, the development organization faces a difficult defense. Litigators will increasingly argue that the failure to implement standard input validation (a clearly foreseeable risk, well-documented in security standards) demonstrates a breach of the duty of care in development.
The PACT framework provides compliance and audit teams with necessary metrics for due diligence. Organizations can no longer accept high pass@k scores as proof of robustness; they must demand quantitative proof of contract adherence. This necessitates making the generation, evaluation, and inclusion of negative or contract-violating test cases a mandatory, auditable step in the software supply chain when relying on generative AI tools. Furthermore, organizations must update their prompt engineering guidelines to mandate the inclusion of specific failure examples, shifting the focus from simply optimizing for desired output to explicitly constraining unacceptable behavior.
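As a concrete artifact for such an audit trail, a negative test can live in the repository next to the generated code; below is a minimal pytest-style sketch, with a hand-written compliant mean() standing in for the LLM-generated function under audit.

```python
import pytest

# Stand-in for the LLM-generated implementation of the mean() task sketched
# earlier; in a real pipeline this would be imported from the audited module.
def mean(values):
    if not values:
        raise ValueError("values must be non-empty")
    return sum(values) / len(values)

# Negative (contract-violating) test: it documents the precondition and fails
# the build if a regenerated or refactored function stops rejecting bad input.
def test_empty_list_is_rejected():
    with pytest.raises(ValueError):
        mean([])
```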
Risks and Caveats
While PACT establishes a crucial baseline, its current scope focuses primarily on explicit pre- and post-conditions within single functions. It does not fully address complex, system-level contracts, inter-module dependencies, or the evolving nature of enterprise security policies, which often function as implicit contracts that override local code specifications.
Furthermore, the most effective solution identified—augmenting prompts with contract-violating test cases—adds significant overhead to the prompt engineering process. These additions consume valuable token context, increase generation latency, and may exceed the input limits of smaller or less capable models, potentially limiting the scalability of this compliance solution in large-scale, real-time generation pipelines. The technical community must also grapple with the inherent difficulty of defining complete and exhaustive contract violations for highly complex functions.
For professionals deploying LLM-generated code, functional correctness is a necessary but insufficient condition for demonstrating legal and commercial robustness; contractual fidelity is the new threshold for safety and compliance.