BayesRL

Research artifacts for variational / Bayesian approaches to reinforcement learning, centered on parameter-space exploration for RLVR.

Our current release accompanies the paper "Parameter Exploration for RLVR via Variational Learning", which introduces Perturbed Parameter Policy Optimization (3PO): a family of exploration strategies for Reinforcement Learning with Verifiable Rewards (RLVR). Rather than relying only on action-space heuristics (temperature, clipping, entropy bonuses), 3PO samples model weights from an approximate posterior learned with the variational optimizer IVON, turning the amount of weight noise into an explicit control lever for exploration.
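To make the "weight noise as a control lever" idea concrete, here is a minimal sketch of drawing perturbed weights from a diagonal Gaussian posterior. It is not the released implementation: the function `sample_weights`, the `noise_scale` parameter, and the toy `mean`/`std` values are hypothetical, and plain Python lists stand in for model tensors.

```python
import random

def sample_weights(mean, std, noise_scale=1.0, rng=None):
    """Draw one weight vector elementwise from N(mean, (noise_scale * std)^2).

    `noise_scale` is a hypothetical knob for this sketch: 0.0 returns the
    posterior mean unchanged, larger values explore further from it.
    """
    rng = rng or random.Random()
    return [m + noise_scale * s * rng.gauss(0.0, 1.0) for m, s in zip(mean, std)]

# Toy posterior over three "weights" (illustrative values only).
mean = [0.5, -1.2, 0.3]
std = [0.1, 0.05, 0.2]

rng = random.Random(0)
no_noise = sample_weights(mean, std, noise_scale=0.0, rng=rng)   # equals the mean
perturbed = sample_weights(mean, std, noise_scale=1.0, rng=rng)  # one exploration sample
```

In this sketch the posterior standard deviation is given directly; in practice it would come from the variational optimizer's learned state.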

📦 Code: insait-institute/c3po

The 3PO family

| Variant | Brief Method Description |
|---------|--------------------------|
| B3PO | One weight perturbation from the IVON posterior per gradient step, synced to the rollout engine. |
| M3PO | M Monte-Carlo perturbations per step; rollouts and advantages computed per sample, gradients averaged. |
| C3PO | Each GRPO group of G rollouts is split across N independent perturbations (G/N each); advantages are computed over the full, more-diverse group with a Seq-MIS importance-sampling correction. |
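The C3PO description can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in the repository: `split_group` and `group_advantages` are hypothetical helper names, the advantage uses the common GRPO-style mean/std normalization, the rewards are made up, and the Seq-MIS correction is omitted entirely.

```python
import statistics

def split_group(group_size, num_perturbations):
    """Assign each of G rollout slots to one of N perturbations (G/N each)."""
    if group_size % num_perturbations != 0:
        raise ValueError("G must be divisible by N")
    per_perturbation = group_size // num_perturbations
    return [i // per_perturbation for i in range(group_size)]

def group_advantages(rewards, eps=1e-6):
    """Group-relative advantages over the full group of G rollouts.

    Assumes GRPO-style normalization: (reward - group mean) / (group std + eps).
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# G = 8 rollouts split across N = 4 perturbations, 2 rollouts each.
assignment = split_group(8, 4)

# Made-up verifiable rewards (1.0 = correct, 0.0 = incorrect), one per rollout.
rewards = [1.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0]

# Advantages are normalized over all 8 rollouts, not per perturbation.
advantages = group_advantages(rewards)
```

The point of the sketch is the last step: rollouts generated under different perturbations share a single baseline, so the group statistics reflect the more diverse set of samples.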

Citation

@misc{venkatkrishna2026parameter,
      title={Parameter Exploration for RLVR via Variational Learning},
      author={Vatsal Venkatkrishna and Nico Daheim and Iryna Gurevych},
      year={2026},
}