Smooth Tchebysheff Scalarization Yields Pareto-Optimal Policies in Offline Multi-Objective RL
STOMP achieves superior hypervolume on protein tasks by scalarizing multi-objective offline RL with the smooth Tchebysheff method after reward standardization, recovering the non-convex Pareto regions that linear scalarization misses.
Lede: The STOMP algorithm applies smooth Tchebysheff scalarization to offline RL, producing Pareto-optimal policies across multiple objectives after standardizing each reward against its observed distribution (Bhatnagar et al., arXiv:2604.13175).
Linear scalarization provably fails to recover non-convex regions of the Pareto front in multi-objective settings, such as jointly optimizing helpfulness and harmlessness in LLMs or catalytic activity and specificity in proteins. Prior single-objective DPO methods left this scalarization step on the underlying RL problem unaddressed (Rafailov et al., arXiv:2305.18290; Bhatnagar et al., arXiv:2604.13175); earlier coverage likewise overlooked the explicit reformulation of multi-objective RL as an optimizable scalarization target.
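The linear-scalarization failure is easy to see numerically. The following is a minimal illustration (not from the paper, with made-up reward vectors): a Pareto-optimal point on a concave stretch of the front is never the maximizer of any weighted sum, but the Tchebysheff objective, which minimizes the weighted worst-case gap to an ideal point, does select it.

```python
import numpy as np

# Expected rewards of three candidate policies on two objectives.
# C is Pareto-optimal but sits on a non-convex (concave) part of the
# front: it lies below the line segment joining A and B.
candidates = {
    "A": np.array([1.0, 0.0]),
    "B": np.array([0.0, 1.0]),
    "C": np.array([0.4, 0.4]),
}

def linear_winner(w):
    """Candidate maximizing the weighted sum w . r."""
    return max(candidates, key=lambda k: w @ candidates[k])

def tchebysheff_winner(w, z_star=np.array([1.0, 1.0])):
    """Candidate minimizing the weighted worst-case gap to the ideal point z*."""
    return min(candidates, key=lambda k: np.max(w * (z_star - candidates[k])))

# Sweeping weight vectors, the weighted sum never selects C
# (w.C = 0.4 while max(w.A, w.B) >= 0.5 for any convex weights) ...
linear_picks = {linear_winner(np.array([t, 1 - t]))
                for t in np.linspace(0.01, 0.99, 99)}
print(linear_picks)                              # never contains 'C'

# ... whereas Tchebysheff with balanced weights recovers it.
print(tchebysheff_winner(np.array([0.5, 0.5])))  # 'C'
```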
STOMP extends direct preference optimization by recasting the scalarization step as the smooth Tchebysheff objective applied after per-reward standardization, drawing on the multi-objective optimization literature to fill the identified gap (Bhatnagar et al., arXiv:2604.13175).
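A sketch of the two ingredients, under assumptions rather than the paper's exact formulation: rewards are z-scored per objective against the offline dataset, and the hard max in the Tchebysheff objective is replaced by a log-sum-exp smooth max (a standard smoothing choice) so the loss is differentiable for policy training. The temperature parameter and ideal-point construction here are illustrative.

```python
import numpy as np

def standardize(rewards):
    """Z-score each reward dimension using offline-dataset statistics.
    rewards: (N, k) array of per-sample rewards for k objectives."""
    mu, sigma = rewards.mean(axis=0), rewards.std(axis=0) + 1e-8
    return (rewards - mu) / sigma

def smooth_tchebysheff(r, w, z_star, temp=0.05):
    """Log-sum-exp approximation of the weighted Tchebysheff objective
    max_i w_i * (z*_i - r_i); temp -> 0 recovers the hard max."""
    gaps = w * (z_star - r)
    return temp * np.log(np.sum(np.exp(gaps / temp)))

rng = np.random.default_rng(0)
# Two objectives on deliberately mismatched scales, as raw lab fitness
# readouts often are; standardization puts them on equal footing.
raw = rng.normal([2.0, -1.0], [5.0, 0.1], size=(1000, 2))
std = standardize(raw)

w = np.array([0.5, 0.5])
z_star = std.max(axis=0)  # ideal point taken from the standardized data
loss = smooth_tchebysheff(std[0], w, z_star)
hard = np.max(w * (z_star - std[0]))
print(loss, hard)  # smooth value upper-bounds the hard max by <= temp*log(k)
```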
Empirical results on three autoregressive protein language models trained on three laboratory fitness datasets show STOMP achieving the highest hypervolume in eight of nine settings under both offline off-policy and generative evaluation, exceeding all baselines; related constitutional AI work similarly balances multiple attributes via preference data (Bhatnagar et al., arXiv:2604.13175; Bai et al., arXiv:2212.08073).
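Hypervolume, the evaluation metric above, measures the objective-space volume dominated by a policy set relative to a reference point, so a larger value means a broader, better-placed Pareto front. A minimal two-objective sketch (assuming maximization; function name and test points are illustrative):

```python
def hypervolume_2d(front, ref):
    """Area dominated by a 2D maximization front relative to reference
    point `ref`: sort points by the first objective descending and sum
    the rectangle each point adds beyond the previous best second
    objective."""
    hv, prev_y = 0.0, ref[1]
    for x, y in sorted(front, key=lambda p: -p[0]):
        if y > prev_y:
            hv += (x - ref[0]) * (y - prev_y)
            prev_y = y
    return hv

front = [(3.0, 1.0), (2.0, 2.0), (1.0, 3.0)]
print(hypervolume_2d(front, ref=(0.0, 0.0)))  # 6.0
```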
AXIOM: STOMP reframes multi-objective offline RL scalarization with smooth Tchebysheff to recover full Pareto fronts, enabling balanced optimization of competing goals such as safety and efficiency in deployed AI systems.
Sources (3)
- [1] Pareto-Optimal Offline Reinforcement Learning via Smooth Tchebysheff Scalarization (https://arxiv.org/abs/2604.13175)
- [2] Direct Preference Optimization: Your Language Model is Secretly a Reward Model (https://arxiv.org/abs/2305.18290)
- [3] Constitutional AI: Harmlessness from AI Feedback (https://arxiv.org/abs/2212.08073)