tinker_cookbook.rl.RetryOnFailure

class tinker_cookbook.rl.RetryOnFailure(RolloutStrategy)

Retry failed or timed-out trajectories with fresh environments.

When a trajectory fails (container crash, sandbox flake, transient error) or exceeds per_rollout_timeout seconds, a fresh environment is created via make_envs() and the rollout is retried. Retries continue until either all trajectories succeed or the retry budget is exhausted.

If the retry budget is exhausted and a failure still occurs, the remaining in-flight tasks are cancelled and the exception is re-raised. This avoids partial-group bias from training on an incomplete set of trajectories.

Uses asyncio.wait(FIRST_COMPLETED) so retries start as soon as a failure is detected, without waiting for other in-flight rollouts.
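The retry pattern described above can be sketched with plain asyncio. This is a minimal illustration, not the tinker_cookbook implementation: run_with_retries, make_rollout, max_retries, and flaky_rollout are hypothetical names, and the sketch assumes a single retry budget shared across the whole group. Only asyncio.wait, asyncio.wait_for, and asyncio.create_task are real library calls.

```python
import asyncio


async def run_with_retries(make_rollout, n, max_retries, per_rollout_timeout):
    """Run n rollouts concurrently, retrying each failure as soon as it is seen.

    make_rollout(slot) is a hypothetical factory that returns a fresh coroutine
    for the given slot (standing in for "create a fresh env and roll it out").
    """

    def launch(slot):
        # wait_for turns an over-long rollout into a TimeoutError, which is
        # then handled like any other failure.
        return asyncio.create_task(
            asyncio.wait_for(make_rollout(slot), per_rollout_timeout)
        )

    pending = {launch(slot): slot for slot in range(n)}
    results, errors = {}, []
    retries_left = max_retries  # assumption: one budget shared by the group

    try:
        while pending:
            # Wake up on the first finished task instead of waiting for all.
            done, _ = await asyncio.wait(
                set(pending), return_when=asyncio.FIRST_COMPLETED
            )
            for task in done:
                slot = pending.pop(task)
                exc = task.exception()
                if exc is None:
                    results[slot] = task.result()
                    continue
                errors.append(exc)
                if retries_left == 0:
                    raise exc  # budget exhausted: give up on the whole group
                retries_left -= 1
                pending[launch(slot)] = slot  # retry immediately
    finally:
        # On failure, cancel whatever is still in flight so no partial group
        # is returned. On success, pending is already empty.
        if pending:
            for task in pending:
                task.cancel()
            await asyncio.gather(*pending, return_exceptions=True)

    return [results[slot] for slot in range(n)], errors


if __name__ == "__main__":
    attempts = {}

    async def flaky_rollout(slot):
        # Hypothetical rollout: slot 1 fails on its first attempt only.
        attempts[slot] = attempts.get(slot, 0) + 1
        await asyncio.sleep(0)
        if slot == 1 and attempts[slot] == 1:
            raise RuntimeError("sandbox flake")
        return slot * 10

    print(asyncio.run(run_with_retries(flaky_rollout, 3, 2, 1.0)))
```

Because the loop wakes on the first completed task, a failed slot is relaunched while its siblings are still running. The finally block is what keeps an exhausted budget from leaving orphaned tasks behind.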

Fields:

execute(env_group_builder, policy)

Run rollouts with automatic retry on individual trajectory failures.

Creates environments, launches all rollouts concurrently, and retries any that fail (or time out) by creating a fresh environment. Uses asyncio.wait(FIRST_COMPLETED) so retries begin immediately upon detecting a failure.

Parameters:

Returns: RolloutResult – Result containing the successfully completed trajectories, surviving environments, and a list of any errors encountered (including retried ones).

Raises:

  • Exception: Re-raises the failing exception when the retry budget is exhausted, after cancelling all remaining in-flight tasks.