tinker_cookbook.rl.EnvGroupBuilder

class tinker_cookbook.rl.EnvGroupBuilder(ABC)

Builds a group of environments. The group is used in the following ways:

Algorithms like GRPO will center rewards across the group.
The reward function (compute_group_rewards) has access to the trajectories from the whole group, even though many reward functions will evaluate each one independently.
For example, this enables us to use pairwise reward models that look at a pair of trajectories at a time. With such a reward model, we effectively have a multi-agent environment, where the agents are playing a zero-sum game.
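To make the reward centering mentioned above concrete, here is a minimal sketch. The helper name center_rewards is hypothetical and is not part of tinker_cookbook. It only subtracts the group mean, as described above. Some GRPO variants also divide by the group's standard deviation, which is omitted here.

```python
def center_rewards(rewards: list[float]) -> list[float]:
    """Subtract the group mean so the centered values sum to zero."""
    if not rewards:
        return []
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


# One total reward per trajectory in a group of four rollouts.
group_rewards = [1.0, 0.0, 0.0, 1.0]
advantages = center_rewards(group_rewards)
```

Because the baseline is computed within the group, a trajectory is only credited for doing better than its siblings on the same task.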

In practice, groups are used in two ways:

To define a multi-agent environment.
As part of the algorithm (e.g., GRPO) when dealing with single-agent tasks.

Picklability: Implementations must be picklable (via the standard pickle module) to support distributed rollout execution, where builders are serialized and sent to remote workers. Avoid storing live network connections, file handles, or other unpicklable objects as fields. Use get_renderer() to create Renderers (which are automatically pickle-safe). Where possible, store configuration strings (model names, connection params) and construct heavy objects in make_envs(). See HarborEnvGroupBuilder for a reference implementation of the lazy-construction pattern.
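The lazy-construction idea can be sketched as follows. This is not the HarborEnvGroupBuilder code: FakeSandboxClient, FakeEnv, and LazyGroupBuilder are hypothetical stand-ins, and the builder deliberately does not subclass EnvGroupBuilder so that the snippet runs without the library installed. The real make_envs() signature may differ from this simplified synchronous one.

```python
import pickle
from dataclasses import dataclass
from typing import Sequence


class FakeSandboxClient:
    """Hypothetical stand-in for a heavy resource that cannot be pickled."""

    def __init__(self, url: str) -> None:
        self.url = url

    def __reduce__(self):
        # Simulate a live connection that refuses to be serialized.
        raise TypeError("live connections cannot be pickled")


@dataclass(frozen=True)
class FakeEnv:
    """Hypothetical environment that holds the live client."""

    client: FakeSandboxClient
    task_id: str


@dataclass(frozen=True)
class LazyGroupBuilder:
    """Stores only plain configuration; heavy objects are built on demand."""

    sandbox_url: str
    task_id: str
    group_size: int

    def make_envs(self) -> Sequence[FakeEnv]:
        # The client is created here, on whichever worker runs the rollouts.
        client = FakeSandboxClient(self.sandbox_url)
        return [FakeEnv(client, self.task_id) for _ in range(self.group_size)]


builder = LazyGroupBuilder("https://sandbox.example", "task-0", group_size=4)

# The builder survives a pickle round trip because it holds only strings and ints.
restored = pickle.loads(pickle.dumps(builder))
envs = restored.make_envs()
```

The environments returned by make_envs() hold the live client and would fail to pickle, which is fine here because only the builder is serialized.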

make_envs()

Create the environments for this group.

Returns: Sequence[Env] – The environments to run rollouts in.

Abstract method.

compute_group_rewards(trajectory_group, env_group)

Compute a final reward for each trajectory that depends on the whole group.

This is called after all rollouts in the group complete. The total reward for each trajectory is the sum of the per-timestep rewards (from Env.step) plus the final group reward returned here.
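As a small worked example of the sum described above (the helper name total_reward is hypothetical and only illustrates the arithmetic):

```python
def total_reward(per_step_rewards: list[float], group_reward: float) -> float:
    """Per-timestep rewards from Env.step plus the final group reward."""
    return sum(per_step_rewards) + group_reward


# Three steps with a reward only on the last one, plus a group reward of 0.5.
total = total_reward([0.0, 0.0, 1.0], group_reward=0.5)
```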

Override this when the reward depends on comparing trajectories within the group (e.g., pairwise reward models). The default implementation returns (0.0, {}) for every trajectory, so only per-timestep rewards are used.

Parameters:

trajectory_group (list[Trajectory]) – The completed trajectories, one per environment in the group.
env_group (Sequence[Env]) – The corresponding environments (same order as trajectory_group).

Returns: list[tuple[float, Metrics]] – A list of (reward, metrics) pairs, one per trajectory. The reward is added to the per-timestep total; the metrics dict is merged into training logs.
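One way an override could look for a pairwise comparison is sketched below, written as a free function so it runs without the library. FakeTrajectory, prefer_first, and the "pairwise_score" metric name are hypothetical, and the Metrics alias is an assumption about the real type. A real judge would call a reward model rather than compare lengths.

```python
from dataclasses import dataclass
from itertools import combinations

Metrics = dict[str, float]  # assumed shape of the library's Metrics type


@dataclass
class FakeTrajectory:
    """Hypothetical stand-in for a completed trajectory."""

    answer_length: int


def prefer_first(a: FakeTrajectory, b: FakeTrajectory) -> bool:
    """Toy pairwise judge: the shorter answer wins."""
    return a.answer_length < b.answer_length


def compute_group_rewards(trajectory_group, env_group) -> list[tuple[float, Metrics]]:
    n = len(trajectory_group)
    scores = [0.0] * n
    # Compare every unordered pair once; the winner gains what the loser gives up.
    for i, j in combinations(range(n), 2):
        if prefer_first(trajectory_group[i], trajectory_group[j]):
            scores[i] += 1.0
            scores[j] -= 1.0
        elif prefer_first(trajectory_group[j], trajectory_group[i]):
            scores[j] += 1.0
            scores[i] -= 1.0
    denom = max(n - 1, 1)  # scale to [-1, 1]; guards against a group of one
    return [(s / denom, {"pairwise_score": s / denom}) for s in scores]


group = [FakeTrajectory(10), FakeTrajectory(30), FakeTrajectory(20)]
results = compute_group_rewards(group, env_group=[None] * len(group))
rewards = [reward for reward, _ in results]
```

The rewards sum to zero across the group, which matches the zero-sum framing of pairwise reward models.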

cleanup()

Clean up resources created by make_envs().

Called after rollouts and reward computation complete, regardless of success or failure. Override this to release expensive resources like cloud sandboxes, remote browsers, etc.

Default is a no-op. Implementations should be idempotent (safe to call multiple times) and handle exceptions internally, as do_group_rollout does not catch exceptions from this method.

Returns: None
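A cleanup() that is idempotent and handles its own exceptions might look like the sketch below. FakeSandbox and SandboxGroupBuilder are hypothetical stand-ins that do not subclass EnvGroupBuilder, and the synchronous method signatures are an assumption.

```python
import logging

logger = logging.getLogger(__name__)


class FakeSandbox:
    """Hypothetical expensive resource that errors if closed twice."""

    def __init__(self) -> None:
        self.closed = False

    def close(self) -> None:
        if self.closed:
            raise RuntimeError("sandbox already closed")
        self.closed = True


class SandboxGroupBuilder:
    """Tracks the resources it creates so cleanup() can release them."""

    def __init__(self, group_size: int) -> None:
        self.group_size = group_size
        self._sandboxes: list[FakeSandbox] = []

    def make_envs(self) -> list[FakeSandbox]:
        self._sandboxes = [FakeSandbox() for _ in range(self.group_size)]
        return list(self._sandboxes)

    def cleanup(self) -> None:
        # Detach the list first so a repeated call finds nothing left to release.
        sandboxes, self._sandboxes = self._sandboxes, []
        for sandbox in sandboxes:
            try:
                sandbox.close()
            except Exception:
                # Log and continue so one failure does not leak the other sandboxes.
                logger.exception("Failed to close sandbox; continuing cleanup")


builder = SandboxGroupBuilder(group_size=2)
sandboxes = builder.make_envs()
builder.cleanup()
builder.cleanup()  # second call is a no-op
```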

logging_tags()

Return tags used to aggregate metrics in training logs.

Tags let the training loop group metrics (rewards, episode lengths, etc.) by environment type. Return a short list of names at different levels of granularity.

Returns: list[str] – Tag strings for this environment group. Default is an empty list.

def logging_tags(self) -> list[str]:
    return ["gsm", "math", "rlvr"]