tinker_cookbook.rl.trajectory_to_data

tinker_cookbook.rl.trajectory_to_data(traj, traj_advantage)

Return one or more Datum objects corresponding to the trajectory.

If the sequence grows only by appending, i.e., each successive observation contains the previous observation followed by its action as a prefix, then the whole trajectory is returned as a single Datum. Whenever an observation is not an extension of the sequence accumulated so far, a new Datum is started.

For example, let O1, O2, O3 denote chunks of observation tokens, and let A1, A2, A3 denote actions (chunks of action tokens).

Suppose ob_ac_pairs is as follows:

(O1, A1) (O1+A1+O2, A2) (O3, A3)

The first two observation-action pairs are merged into a single Datum, because the second observation has O1+A1 as a prefix. The last observation-action pair goes into a separate Datum, because O3 does not extend the preceding sequence.
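The merging rule above can be sketched in a few lines of plain Python. This is a minimal illustration, not the library's implementation: the helper name group_by_prefix is hypothetical, and tokens are modelled as plain lists of integers rather than tinker types.

```python
from typing import List, Tuple

Tokens = List[int]


def group_by_prefix(ob_ac_pairs: List[Tuple[Tokens, Tokens]]) -> List[Tokens]:
    """Hypothetical helper: merge (observation, action) pairs into token sequences.

    A new sequence is started whenever an observation does not have the
    sequence accumulated so far as a prefix.
    """
    sequences: List[Tokens] = []
    current: Tokens = []
    for ob, ac in ob_ac_pairs:
        # If the observation does not extend the accumulated sequence,
        # close the current sequence and start a fresh one.
        if current and ob[: len(current)] != current:
            sequences.append(current)
            current = []
        # Only the observation tokens not already accumulated are new.
        delta_ob = ob[len(current):]
        current = current + delta_ob + ac
    if current:
        sequences.append(current)
    return sequences


# Arbitrary token ids standing in for the chunks in the example above.
O1, A1, O2, A2, O3, A3 = [1, 2], [3], [4, 5], [6], [7, 8], [9]
pairs = [(O1, A1), (O1 + A1 + O2, A2), (O3, A3)]
groups = group_by_prefix(pairs)
print(groups)
```

Under these assumptions the first two pairs collapse into the sequence O1+A1+O2+A2 and the third pair yields O3+A3, so each resulting sequence would correspond to one Datum.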

Parameters:

traj (Trajectory) – A single trajectory containing transitions (observation-action pairs).
traj_advantage (float) – The scalar advantage to assign to all action tokens in this trajectory.

Returns: list[tinker.Datum] – One or more training datums, each containing model input, targets, sampled log-probs, advantages, and masks.
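To illustrate how a single scalar advantage might be spread over a merged sequence, the sketch below builds per-token advantages and a mask for O1+A1+O2+A2. It rests on assumptions the description above does not state: observation tokens are given zero advantage and zero mask, the helper name per_token_signals is hypothetical, and any shifting between model input and targets is ignored.

```python
from typing import List, Tuple

Segment = Tuple[str, List[int]]  # ("observation" | "action", token ids)


def per_token_signals(
    segments: List[Segment], traj_advantage: float
) -> Tuple[List[float], List[float]]:
    """Hypothetical helper: per-token advantages and mask for one merged sequence."""
    advantages: List[float] = []
    mask: List[float] = []
    for kind, tokens in segments:
        is_action = kind == "action"
        # Assumption: only action tokens carry the trajectory advantage.
        advantages.extend([traj_advantage if is_action else 0.0] * len(tokens))
        # Assumption: the mask selects action tokens only.
        mask.extend([1.0 if is_action else 0.0] * len(tokens))
    return advantages, mask


segments = [
    ("observation", [1, 2]),  # O1
    ("action", [3]),          # A1
    ("observation", [4, 5]),  # O2
    ("action", [6]),          # A2
]
advantages, mask = per_token_signals(segments, traj_advantage=0.5)
print(advantages)
print(mask)
```

Every action token receives the same value, which matches the description of traj_advantage as a scalar assigned to all action tokens in the trajectory.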

Referenced by

tinker_cookbook.rl.InitialObservationOverflow