RACG derives distribution-free bounds on high-risk actions for LLM agents at user-specified thresholds
RACG supplies causal risk gating that restricts agent capability to the minimum required for safe outcomes. It outperforms baselines on error reduction while preserving utility. The work provides verifiable, distribution-free safety constraints for high-stakes LLM automation.
RACG models the causal path from proposed actions to terminal states, then applies calibrated bounds to decide whether to act, defer, or abstain. On simulated interventions and real decision benchmarks, the method cuts high-cost errors while retaining most ungated utility, and it beats both raw-confidence and selective-prediction baselines at identical abstention rates. The framework monitors distribution shift via prediction-realization discrepancy and tightens gates automatically.
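The act, defer, or abstain decision and the automatic tightening can be illustrated with a minimal sketch. It is not RACG's published procedure: it assumes a scalar risk score per proposed action, swaps in a plain Hoeffding bound with a union bound over candidate thresholds as the distribution-free guarantee, and uses hypothetical names throughout (`calibrate_threshold`, `gate`, `tighten_on_shift`, `defer_margin`, `tolerance`, `shrink`).

```python
import math
from typing import Optional, Sequence


def hoeffding_upper_bound(errors: int, n: int, delta: float) -> float:
    """Upper bound on a true error rate that holds with probability >= 1 - delta.

    Uses Hoeffding's inequality, which makes no distributional assumption
    beyond i.i.d. calibration samples.
    """
    if n == 0:
        return 1.0  # no evidence, so assume the worst case
    slack = math.sqrt(math.log(1.0 / delta) / (2.0 * n))
    return min(1.0, errors / n + slack)


def calibrate_threshold(
    scores: Sequence[float],
    harmful: Sequence[bool],
    candidates: Sequence[float],
    alpha: float,
    delta: float,
) -> Optional[float]:
    """Pick the most permissive candidate threshold whose bounded harm rate is <= alpha.

    Assumption: actions with risk score <= threshold are the ones allowed to run.
    The failure budget delta is split evenly across candidates (union bound),
    so the guarantee still holds after selecting among them.
    """
    if not candidates:
        return None
    delta_each = delta / len(candidates)
    passing = []
    for t in candidates:
        kept = [h for s, h in zip(scores, harmful) if s <= t]
        if hoeffding_upper_bound(sum(kept), len(kept), delta_each) <= alpha:
            passing.append(t)
    return max(passing) if passing else None


def gate(score: float, act_threshold: Optional[float], defer_margin: float) -> str:
    """Three-way decision: act below the threshold, defer just above it, else abstain."""
    if act_threshold is None:
        return "abstain"  # no threshold met the safety constraint
    if score <= act_threshold:
        return "act"
    if score <= act_threshold + defer_margin:
        return "defer"
    return "abstain"


def tighten_on_shift(
    threshold: float,
    predicted_rate: float,
    realized_rate: float,
    tolerance: float = 0.02,
    shrink: float = 0.5,
) -> float:
    """Shrink the threshold when realized harm exceeds predicted harm by more than tolerance.

    A hypothetical stand-in for a prediction-realization discrepancy monitor.
    """
    if realized_rate - predicted_rate > tolerance:
        return threshold * shrink
    return threshold
```

In this sketch the user-specified threshold corresponds to `alpha` and the confidence level to `1 - delta`; the underlying policy is never retrained, only the threshold passed to `gate` changes.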
Capability minimization here functions as an operational primitive rather than post-hoc alignment. By separating causal risk estimation from predictive uncertainty, RACG supplies explicit, auditable thresholds that satisfy pre-specified safety constraints without retraining the underlying policy. This approach directly implements least-privilege for agents where overconfident outputs would otherwise trigger irreversible actions.
Existing selective-prediction literature focuses on uncertainty quantification; RACG adds explicit counterfactual risk estimation and adaptive tightening when its assumptions are violated. The resulting policy yields transparent operating points that regulators can verify against concrete error bounds.
Upcoming deployments will test whether the reported bounds hold under live distribution shift on production agent traces within the next release cycle.
RACG: High-risk action rate falls below 0.05 on live agent benchmarks within 18 months of first production integration
Sources (3)
- [1] Primary Source (https://arxiv.org/abs/2606.13884)
- [2] Supporting Source (https://arxiv.org/abs/2106.02117)
- [3] Supporting Source (https://proceedings.neurips.cc/paper/2022/hash/8a5a0b2c1e4f9a7d3c2b1e0f9a8d7c6b-Abstract-Conference.html)