
tinker_cookbook.distillation.sdft.Config

class tinker_cookbook.distillation.sdft.Config()

Configuration for SDFT training.

Key parameters:

  • topk: Number of top tokens used for distillation (default 20). Set to 0 to use the importance-sampling fallback instead. In practice, topk=20 matches full-vocabulary KL.

  • learning_rate: For LoRA, use 5e-4 to 1e-3. Because on-policy generation yields more completion tokens per step, the top-K cross-entropy (CE) loss produces larger gradients than SFT at the same learning rate, so prefer the lower end of this range.

  • teacher_sync_every: Optional interval for periodically hard-syncing (copying) the student weights into the teacher, which approximates an EMA teacher. None keeps a static, frozen teacher, which works comparably to EMA in our experiments.
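
To make the topk parameter more concrete, here is a minimal NumPy sketch of a top-K cross-entropy distillation loss. It is an illustration only, not the library's implementation: the function names (log_softmax, topk_ce_loss) are hypothetical, and renormalizing the teacher distribution over its top-K tokens is an assumption made for this example.

```python
import numpy as np

def log_softmax(logits):
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def topk_ce_loss(student_logits, teacher_logits, k=20):
    """Cross-entropy of the student against the teacher's top-k distribution.

    Hypothetical helper: the teacher distribution is restricted to its k most
    likely tokens per position and renormalized (an assumption of this sketch).
    """
    teacher_logp = log_softmax(teacher_logits)
    student_logp = log_softmax(student_logits)
    # Indices of the k largest teacher log-probabilities at each position.
    topk_idx = np.argpartition(-teacher_logp, k - 1, axis=-1)[..., :k]
    t_top = np.take_along_axis(teacher_logp, topk_idx, axis=-1)
    s_top = np.take_along_axis(student_logp, topk_idx, axis=-1)
    # Renormalize the teacher probabilities over the selected tokens.
    t_prob = np.exp(t_top)
    t_prob = t_prob / t_prob.sum(axis=-1, keepdims=True)
    # Mean over positions of the per-position cross-entropy.
    return float(-(t_prob * s_top).sum(axis=-1).mean())

rng = np.random.default_rng(0)
student = rng.normal(size=(4, 50))  # (positions, vocab)
teacher = rng.normal(size=(4, 50))
loss = topk_ce_loss(student, teacher, k=20)
```

When k equals the vocabulary size, this sketch reduces to the ordinary full-vocabulary cross-entropy, which differs from the KL divergence only by the teacher's entropy (a constant with respect to the student).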

See main for the training loop.
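
As a rough illustration of how a setting like teacher_sync_every could gate teacher updates inside a training loop, the sketch below uses a hypothetical helper, should_sync_teacher. The exact step convention (syncing on positive multiples of the interval) is an assumption of this example and may differ from what main actually does.

```python
from typing import Optional

def should_sync_teacher(step: int, teacher_sync_every: Optional[int]) -> bool:
    """Return True on steps where student weights would be copied into the teacher.

    Hypothetical helper: None means the teacher stays frozen for the whole run.
    """
    if teacher_sync_every is None:
        return False
    # Assumed convention: sync on every positive multiple of the interval.
    return step > 0 and step % teacher_sync_every == 0

# With an interval of 3, steps 1..10 would trigger a sync at these steps.
sync_steps = [s for s in range(1, 11) if should_sync_teacher(s, 3)]
```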

Fields: