tinker_cookbook.rl.RetryOnFailure
class tinker_cookbook.rl.RetryOnFailure(RolloutStrategy)
Retry failed or timed-out trajectories with fresh environments.
When a trajectory fails (container crash, sandbox flake, transient error)
or exceeds per_rollout_timeout seconds, a fresh env is created via
make_envs() and the rollout is retried. This continues until either
all trajectories succeed or the retry budget is exhausted.
If the retry budget is exhausted and a failure still occurs, the remaining in-flight tasks are cancelled and the exception is re-raised. This avoids partial-group bias from training on an incomplete set of trajectories.
Uses asyncio.wait(FIRST_COMPLETED) so retries start as soon as a
failure is detected, without waiting for other in-flight rollouts.
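The retry loop described above can be illustrated with a small standalone sketch. This is not the library's implementation: `run_with_retries` and `make_attempt` are hypothetical names, the retry budget is assumed to be shared across the whole group, and passing `None` to disable the timeout is a convention of this sketch only (it says nothing about how `per_rollout_timeout=0` is interpreted by `RetryOnFailure`).

```python
import asyncio
from typing import Awaitable, Callable, Dict, List, Optional


async def run_with_retries(
    make_attempt: Callable[[int], Awaitable[str]],  # hypothetical: runs one rollout in a fresh env
    n_rollouts: int,
    max_retries: int = 3,
    per_rollout_timeout: Optional[float] = None,  # sketch convention: None means no timeout
) -> List[str]:
    """Run n_rollouts concurrently and retry each failure as soon as it is seen."""

    def launch(slot: int) -> "asyncio.Task[str]":
        # wait_for turns an over-long attempt into a TimeoutError,
        # which is then handled like any other failure.
        return asyncio.create_task(
            asyncio.wait_for(make_attempt(slot), per_rollout_timeout)
        )

    pending: Dict["asyncio.Task[str]", int] = {
        launch(slot): slot for slot in range(n_rollouts)
    }
    results: Dict[int, str] = {}
    retries_left = max_retries  # assumption: one budget shared by the whole group

    while pending:
        # Wake up on the first finished task instead of waiting for all of them.
        done, _ = await asyncio.wait(set(pending), return_when=asyncio.FIRST_COMPLETED)
        for task in done:
            slot = pending.pop(task)
            exc = task.exception()
            if exc is None:
                results[slot] = task.result()
            elif retries_left > 0:
                # Retry immediately; other in-flight rollouts keep running.
                retries_left -= 1
                pending[launch(slot)] = slot
            else:
                # Budget exhausted: cancel everything still in flight and re-raise.
                for other in pending:
                    other.cancel()
                await asyncio.gather(*pending, return_exceptions=True)
                raise exc

    return [results[slot] for slot in range(n_rollouts)]
```

Calling `asyncio.run(run_with_retries(attempt, 4))` with an `attempt` coroutine that occasionally raises would return four results as long as no more than three failures occur in total; a fourth failure cancels the remaining tasks and propagates.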
Fields:
- max_retries (int, default: 3)
- per_rollout_timeout (float, default: 0)
execute(env_group_builder, policy)
Run rollouts with automatic retry on individual trajectory failures.
Creates environments, launches all rollouts concurrently, and retries
any that fail (or time out) by creating a fresh environment. Uses
asyncio.wait(FIRST_COMPLETED) so retries begin immediately upon
detecting a failure.
Parameters:
- env_group_builder (EnvGroupBuilder) – Builder used to create (and re-create on retry) environments for this rollout group.
- policy (TokenCompleter) – The policy used to generate actions.
Returns: RolloutResult – Result containing the successfully completed trajectories, surviving environments, and a list of any errors encountered (including retried ones).
Raises:
- Exception: Re-raises the failing exception when the retry budget is exhausted, after cancelling all remaining in-flight tasks.