Smooth Tchebysheff Scalarization Yields Pareto-Optimal Policies in Offline Multi-Objective RL
STOMP achieves superior hypervolume on protein tasks by scalarizing multi-objective offline RL with the smooth Tchebysheff method after reward standardization, recovering the non-convex Pareto regions that linear scalarization misses.
Lede: The STOMP algorithm applies smooth Tchebysheff scalarization to offline RL, producing Pareto-optimal policies across multiple objectives after standardizing each reward against its observed distribution (Bhatnagar et al., arXiv:2604.13175).
Linear scalarization provably fails to recover non-convex regions of the Pareto front in multi-objective settings, such as jointly optimizing helpfulness and harmlessness in LLMs or catalytic activity and specificity in proteins. Prior single-objective DPO methods left this scalarization step on the underlying RL problem unaddressed (Rafailov et al., arXiv:2305.18290; Bhatnagar et al., arXiv:2604.13175); earlier coverage likewise overlooked the explicit reformulation of multi-objective RL as an optimizable scalarization target.
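The linear-scalarization failure is easy to see numerically. The following is a minimal illustration (not from the paper, with made-up reward vectors): a Pareto-optimal point on a concave stretch of the front is never the maximizer of any weighted sum, but the Tchebysheff objective, which minimizes the weighted worst-case gap to an ideal point, does select it.

```python
import numpy as np

# Expected rewards of three candidate policies on two objectives.
# C is Pareto-optimal but sits on a non-convex (concave) part of the
# front: it lies below the line segment joining A and B.
candidates = {
    "A": np.array([1.0, 0.0]),
    "B": np.array([0.0, 1.0]),
    "C": np.array([0.4, 0.4]),
}

def linear_winner(w):
    """Candidate maximizing the weighted sum w . r."""
    return max(candidates, key=lambda k: w @ candidates[k])

def tchebysheff_winner(w, z_star=np.array([1.0, 1.0])):
    """Candidate minimizing the weighted worst-case gap to the ideal point z*."""
    return min(candidates, key=lambda k: np.max(w * (z_star - candidates[k])))

# Sweeping weight vectors, the weighted sum never selects C
# (w.C = 0.4 while max(w.A, w.B) >= 0.5 for any convex weights) ...
linear_picks = {linear_winner(np.array([t, 1 - t]))
                for t in np.linspace(0.01, 0.99, 99)}
print(linear_picks)                              # never contains 'C'

# ... whereas Tchebysheff with balanced weights recovers it.
print(tchebysheff_winner(np.array([0.5, 0.5])))  # 'C'
```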
STOMP extends direct preference optimization by recasting the scalarization step as the smooth Tchebysheff objective applied after per-reward standardization, drawing on the multi-objective optimization literature to fill the identified gap (Bhatnagar et al., arXiv:2604.13175).
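A sketch of the two ingredients, under assumptions rather than the paper's exact formulation: rewards are z-scored per objective against the offline dataset, and the hard max in the Tchebysheff objective is replaced by a log-sum-exp smooth max (a standard smoothing choice) so the loss is differentiable for policy training. The temperature parameter and ideal-point construction here are illustrative.

```python
import numpy as np

def standardize(rewards):
    """Z-score each reward dimension using offline-dataset statistics.
    rewards: (N, k) array of per-sample rewards for k objectives."""
    mu, sigma = rewards.mean(axis=0), rewards.std(axis=0) + 1e-8
    return (rewards - mu) / sigma

def smooth_tchebysheff(r, w, z_star, temp=0.05):
    """Log-sum-exp approximation of the weighted Tchebysheff objective
    max_i w_i * (z*_i - r_i); temp -> 0 recovers the hard max."""
    gaps = w * (z_star - r)
    return temp * np.log(np.sum(np.exp(gaps / temp)))

rng = np.random.default_rng(0)
# Two objectives on deliberately mismatched scales, as raw lab fitness
# readouts often are; standardization puts them on equal footing.
raw = rng.normal([2.0, -1.0], [5.0, 0.1], size=(1000, 2))
std = standardize(raw)

w = np.array([0.5, 0.5])
z_star = std.max(axis=0)  # ideal point taken from the standardized data
loss = smooth_tchebysheff(std[0], w, z_star)
hard = np.max(w * (z_star - std[0]))
print(loss, hard)  # smooth value upper-bounds the hard max by <= temp*log(k)
```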
Empirical results on three autoregressive protein language models trained on three laboratory fitness datasets show STOMP achieving the highest hypervolume in eight of nine settings under both offline off-policy and generative evaluation, exceeding all baselines; related constitutional AI work similarly balances multiple attributes via preference data (Bhatnagar et al., arXiv:2604.13175; Bai et al., arXiv:2212.08073).
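Hypervolume, the evaluation metric above, measures the objective-space volume dominated by a policy set relative to a reference point, so a larger value means a broader, better-placed Pareto front. A minimal two-objective sketch (assuming maximization; function name and test points are illustrative):

```python
def hypervolume_2d(front, ref):
    """Area dominated by a 2D maximization front relative to reference
    point `ref`: sort points by the first objective descending and sum
    the rectangle each point adds beyond the previous best second
    objective."""
    hv, prev_y = 0.0, ref[1]
    for x, y in sorted(front, key=lambda p: -p[0]):
        if y > prev_y:
            hv += (x - ref[0]) * (y - prev_y)
            prev_y = y
    return hv

front = [(3.0, 1.0), (2.0, 2.0), (1.0, 3.0)]
print(hypervolume_2d(front, ref=(0.0, 0.0)))  # 6.0
```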
AXIOM: STOMP reframes multi-objective offline RL scalarization with smooth Tchebysheff to recover full Pareto fronts, enabling balanced optimization of competing goals such as safety and efficiency in deployed AI systems.
Sources (3)
- [1] Pareto-Optimal Offline Reinforcement Learning via Smooth Tchebysheff Scalarization (https://arxiv.org/abs/2604.13175)
- [2] Direct Preference Optimization: Your Language Model is Secretly a Reward Model (https://arxiv.org/abs/2305.18290)
- [3] Constitutional AI: Harmlessness from AI Feedback (https://arxiv.org/abs/2212.08073)