Research artifacts for variational / Bayesian approaches to reinforcement learning, centered on parameter-space exploration for RLVR.
Our current release accompanies the paper "Parameter Exploration for RLVR via Variational Learning", which introduces Perturbed Parameter Policy Optimization (3PO): a family of exploration strategies for Reinforcement Learning with Verifiable Rewards (RLVR). Rather than relying only on action-space heuristics (temperature, clipping, entropy bonuses), 3PO samples model weights from an approximate posterior learned with the variational optimizer IVON, turning the amount of weight noise into an explicit control lever for exploration.
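As a rough illustration of what "sampling model weights from an approximate posterior" can look like, the sketch below draws one perturbed weight vector from a diagonal Gaussian. It assumes the IVON-style parameterisation in which the per-parameter variance is `1 / (ess * (hessian + weight_decay))`. The function name, the plain-list weight representation, and the `noise_scale` knob are hypothetical and chosen only for this example; they are not taken from the released code.

```python
import math
import random


def sample_perturbed_weights(mean, hessian, ess, weight_decay,
                             noise_scale=1.0, rng=None):
    """Draw one weight vector from a diagonal Gaussian posterior.

    Assumption: sigma^2 = 1 / (ess * (hessian + weight_decay)), the
    IVON-style variance. `noise_scale` is a hypothetical multiplier
    standing in for the "amount of weight noise" lever.
    """
    rng = rng or random.Random()
    sample = []
    for m, h in zip(mean, hessian):
        sigma = 1.0 / math.sqrt(ess * (h + weight_decay))
        sample.append(m + noise_scale * sigma * rng.gauss(0.0, 1.0))
    return sample


if __name__ == "__main__":
    mean = [0.10, -0.30, 0.25]
    hessian = [0.5, 2.0, 1.0]  # toy diagonal curvature estimates
    print(sample_perturbed_weights(mean, hessian, ess=1e4,
                                   weight_decay=1e-2,
                                   rng=random.Random(0)))
```

Setting `noise_scale=0.0` recovers the posterior mean, while larger values widen the perturbation, which is the sense in which weight noise acts as an exploration control in this toy version.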
📦 Code: insait-institute/c3po
| Variant | Brief Method Description |
|---|---|
| B3PO | One weight perturbation from the IVON posterior per gradient step, synced to the rollout engine. |
| M3PO | M Monte-Carlo perturbations per step; rollouts and advantages computed per sample, gradients averaged. |
| C3PO | Each GRPO group of G rollouts is split across N independent perturbations (G/N rollouts each); advantages are computed over the full, more diverse group with a Seq-MIS importance-sampling correction. |
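
The C3PO row can be pictured with a small sketch: assign each of the G rollouts in a group to one of N perturbations, then normalize rewards over the whole group. This is a minimal sketch under stated assumptions, not the released implementation. The helper names are hypothetical, the advantage uses the common GRPO-style mean/std normalization with a population standard deviation, and the Seq-MIS importance-sampling correction is deliberately left out.

```python
import statistics


def split_group(group_size, num_perturbations):
    """Return, for each rollout index, the perturbation it is assigned to.

    Assumes G is divisible by N so every perturbation gets G/N rollouts.
    """
    if group_size % num_perturbations != 0:
        raise ValueError("group_size must be divisible by num_perturbations")
    per_perturbation = group_size // num_perturbations
    return [i // per_perturbation for i in range(group_size)]


def group_advantages(rewards, eps=1e-6):
    """Mean/std-normalized advantages over the full group of rollouts.

    Assumption: population std plus a small eps; the importance-sampling
    correction mentioned in the table is not modelled here.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    assignment = split_group(group_size=8, num_perturbations=4)
    rewards = [1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0]  # toy verifier rewards
    for idx, (p, adv) in enumerate(zip(assignment, group_advantages(rewards))):
        print(f"rollout {idx}: perturbation {p}, advantage {adv:+.3f}")
```

The point of normalizing over all G rollouts, rather than within each perturbation's G/N rollouts, is that the baseline is shared across perturbations, so a perturbation that yields better rollouts receives positive advantages relative to the others.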

M3PO+ and decoupled-MC ablations).

allenai/Olmo-3-1025-7B and Qwen/Qwen2.5-Math-7B

@misc{venkatkrishna2026parameter,
title={Parameter Exploration for RLVR via Variational Learning},
author={Vatsal Venkatkrishna and Nico Daheim and Iryna Gurevych},
year={2026},
}