Logging
As reinforcement learning algorithms are historically challenging to debug, it’s important to pay careful attention to logging.
By default, the TRL PPOTrainer saves a lot of relevant information to wandb or tensorboard.
Upon initialization, pass one of these two options to the PPOConfig:
```python
config = PPOConfig(
    model_name=args.model_name,
    log_with="wandb",  # or "tensorboard"
)
```
If you want to log with tensorboard, add the kwarg `project_kwargs={"logging_dir": PATH_TO_LOGS}` to the PPOConfig.
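For example, a TensorBoard logging config might look like the sketch below; the model name and logging directory are placeholders, not values prescribed by the library:

```python
from trl import PPOConfig

# Minimal sketch of a TensorBoard logging setup.
# "gpt2" and "./ppo_tb_logs" are placeholder values.
config = PPOConfig(
    model_name="gpt2",
    log_with="tensorboard",
    project_kwargs={"logging_dir": "./ppo_tb_logs"},
)
```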
PPO Logging
Crucial values
During training, many values are logged. Here are the most important ones:

- `env/reward_mean`, `env/reward_std`, `env/reward_dist`: the properties of the reward distribution from the "environment".
- `ppo/mean_scores`: the mean scores directly out of the reward model.
- `ppo/mean_non_score_reward`: the mean negated KL penalty during training (shows the delta between the reference model and the new policy over the batch in the step).
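These values come out of the statistics dictionary returned by each PPO step. A minimal sketch of forwarding them to the configured tracker (assuming `ppo_trainer`, the query/response tensors, the rewards, and the `batch` dict are already set up earlier in the training loop):

```python
# Sketch only: ppo_trainer, query_tensors, response_tensors, rewards, and the
# `batch` dict (with "query" and "response" text columns) are assumed to exist.
stats = ppo_trainer.step(query_tensors, response_tensors, rewards)

# log_stats sends the collected statistics (env/reward_mean, ppo/mean_scores,
# ppo/mean_non_score_reward, ...) to wandb or tensorboard.
ppo_trainer.log_stats(stats, batch, rewards)
```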
Training stability parameters:
Here are some parameters that are useful to monitor for stability (when these diverge or collapse to 0, try tuning variables):

- `ppo/loss/value`: the value function loss; it will spike or go to NaN when training is not going well.
- `ppo/val/clipfrac`: the fraction of clipped values in the value function loss. This is often between 0.3 and 0.6.
- `objective/kl_coef`: the target coefficient with `AdaptiveKLController`. It often increases before numerical instabilities.
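Because these values are plain entries in the statistics dictionary returned by `step()`, you can also inspect them directly in your training loop. A rough sketch follows; the thresholds are illustrative assumptions, not recommendations from the library:

```python
import math

def check_ppo_stability(stats: dict) -> None:
    """Print crude warnings based on the PPO statistics dictionary."""
    value_loss = float(stats["ppo/loss/value"])
    clipfrac = float(stats["ppo/val/clipfrac"])
    kl_coef = float(stats["objective/kl_coef"])

    if math.isnan(value_loss) or value_loss > 100.0:  # illustrative threshold
        print(f"Warning: value function loss looks unstable: {value_loss}")
    if clipfrac > 0.6:  # upper end of the typical 0.3-0.6 range
        print(f"Warning: high value clip fraction: {clipfrac}")
    print(f"KL coefficient this step: {kl_coef}")
```

You would call `check_ppo_stability(stats)` right after `ppo_trainer.step(...)`, alongside `log_stats`.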