tinker_cookbook.rl.trajectory_to_data
tinker_cookbook.rl.trajectory_to_data(traj, traj_advantage)
Return one or more Datum objects corresponding to the trajectory.
If the sequence grows only by appending, i.e., each successive observation contains the previous observation+action as a prefix, then the whole trajectory fits in a single Datum. An observation that is not an extension of the previous sequence starts a new Datum.
For example, let O1, O2, O3 denote chunks of observation tokens, and let A1, A2, A3 denote the corresponding actions.
Suppose ob_ac_pairs is as follows.
(O1, A1) (O1+A1+O2, A2) (O3, A3)
Then the first two observation-action pairs are merged into a single Datum, because the second observation begins with O1+A1. The last observation-action pair goes into a separate Datum, because O3 does not extend the preceding sequence.
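The merging rule can be sketched in plain Python, with lists of token ids standing in for observations and actions. This is only an illustration of the prefix check described above, not the library's implementation: the helper name group_by_prefix and the token values are hypothetical, and no tinker types are used.

```python
from typing import List, Tuple

Tokens = List[int]


def group_by_prefix(ob_ac_pairs: List[Tuple[Tokens, Tokens]]) -> List[Tokens]:
    """Group (observation, action) pairs into full token sequences.

    Hypothetical helper: a pair is folded into the current sequence when its
    observation starts with everything accumulated so far; otherwise the
    current sequence is closed and a new one is started.
    """
    sequences: List[Tokens] = []
    current: Tokens = []
    for ob, ac in ob_ac_pairs:
        if current and ob[: len(current)] == current:
            # The observation extends the running sequence: keep growing it.
            current = ob + ac
        else:
            # Not an extension: close the running sequence and start afresh.
            if current:
                sequences.append(current)
            current = ob + ac
    if current:
        sequences.append(current)
    return sequences


# Made-up token ids mirroring the example: O1=[1, 2], A1=[3], O2=[4, 5],
# A2=[6], O3=[7], A3=[8].
pairs = [
    ([1, 2], [3]),           # (O1, A1)
    ([1, 2, 3, 4, 5], [6]),  # (O1+A1+O2, A2)
    ([7], [8]),              # (O3, A3)
]
merged = group_by_prefix(pairs)
```

Here merged holds two sequences, one for the first two pairs and one for the last pair, which matches the two Datum objects in the example.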
Parameters:
- traj (Trajectory) – A single trajectory containing transitions (observation-action pairs).
- traj_advantage (float) – The scalar advantage to assign to all action tokens in this trajectory.
Returns: list[tinker.Datum] – One or more training datums, each containing model input, targets, sampled log-probs, advantages, and masks.
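To picture how a scalar traj_advantage could spread over one merged sequence, the sketch below builds per-token advantages and a mask from labelled segments. It is an assumption-laden illustration: the helper per_token_advantages is hypothetical, observation tokens are assumed to get advantage 0 and mask 0, and the shift between model input and targets that a real Datum would need for next-token prediction is not shown.

```python
from typing import List, Tuple


def per_token_advantages(
    segments: List[Tuple[List[int], bool]], traj_advantage: float
) -> Tuple[List[int], List[float], List[float]]:
    """Flatten (tokens, is_action) segments into tokens, advantages, and mask.

    Hypothetical helper: action tokens receive the trajectory advantage and
    mask 1.0; observation tokens are assumed to receive 0.0 for both.
    """
    tokens: List[int] = []
    advantages: List[float] = []
    mask: List[float] = []
    for seg_tokens, is_action in segments:
        tokens.extend(seg_tokens)
        advantages.extend([traj_advantage if is_action else 0.0] * len(seg_tokens))
        mask.extend([1.0 if is_action else 0.0] * len(seg_tokens))
    return tokens, advantages, mask


# Made-up token ids for a merged sequence O1 + A1 + O2 + A2.
segments = [([1, 2], False), ([3], True), ([4, 5], False), ([6], True)]
tokens, advantages, mask = per_token_advantages(segments, traj_advantage=0.5)
```

With these inputs, only the positions of the two action tokens carry the advantage 0.5 and a mask of 1.0; every observation position is zero in both lists.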