
Tutorial 05-1: Export a Merged HuggingFace Model

Prerequisites

Run it interactively

curl -O https://raw.githubusercontent.com/thinking-machines-lab/tinker-cookbook/main/tutorials/501_export_hf.py && uv run marimo edit 501_export_hf.py

After training a LoRA adapter with Tinker, you typically want a standalone model you can deploy anywhere. This tutorial shows how to merge your LoRA adapter into the base model, producing a complete HuggingFace model directory.

What merging does: During LoRA training, Tinker only updates small low-rank matrices (the adapter). The base model weights stay frozen. Merging adds the adapter deltas back into the base weights: W_merged = W_base + (B @ A) * (alpha / rank). The result is a normal model with no LoRA dependency.
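That one matrix equation is the whole merge, applied per adapted layer. Here is a minimal sketch of it with toy dimensions and NumPy arrays standing in for the real model weights:

```python
import numpy as np

# Toy dimensions: a d_out x d_in base weight with a rank-r LoRA adapter
d_out, d_in, rank, alpha = 8, 8, 2, 16

rng = np.random.default_rng(0)
W_base = rng.normal(size=(d_out, d_in))
B = rng.normal(size=(d_out, rank))  # LoRA "up" projection
A = rng.normal(size=(rank, d_in))   # LoRA "down" projection

# The merge: fold the low-rank delta into the base weight
W_merged = W_base + (B @ A) * (alpha / rank)

# Sanity check: the merged matrix gives the same activations as
# running the base weight plus the adapter side-path
x = rng.normal(size=(d_in,))
assert np.allclose(W_merged @ x, W_base @ x + (B @ (A @ x)) * (alpha / rank))
```

Because the delta is folded in, inference needs no LoRA-aware code path and pays no extra matmul for the adapter.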

Tinker checkpoint           Merged HuggingFace model
+-------------------+       +-----------------------------+
| adapter weights   |  -->  | model shards (.safetensors) |
| adapter config    |  -->  | config.json                 |
+-------------------+       | tokenizer files ...         |
      + base model          +-----------------------------+
      (from HF Hub)

Setup: create a checkpoint

First we need a Tinker checkpoint to export. We create a training client, run one step of SFT, and save the weights. In practice, you would use a checkpoint from a real training run.

import tinker
from tinker_cookbook import renderers
from tinker_cookbook.supervised.data import conversation_to_datum
from tinker_cookbook.tokenizer_utils import get_tokenizer

BASE_MODEL = "Qwen/Qwen3.5-4B"

service_client = tinker.ServiceClient()
training_client = await service_client.create_lora_training_client_async(
    base_model=BASE_MODEL, rank=16
)

# Build a minimal training example
_tokenizer = get_tokenizer(BASE_MODEL)
_renderer = renderers.get_renderer("qwen3", _tokenizer)
_messages = [
    {"role": "user", "content": "What is Tinker?"},
    {"role": "assistant", "content": "Tinker is a cloud training API for LLM fine-tuning."},
]
_datum = conversation_to_datum(_messages, _renderer, max_length=512)

# One training step + save
_fwd = await training_client.forward_backward_async([_datum], loss_fn="cross_entropy")
_opt = await training_client.optim_step_async(tinker.AdamParams(learning_rate=1e-4))
await _fwd.result_async()
await _opt.result_async()

_save_result = training_client.save_weights_for_sampler(name="export-tutorial")
sampler_path = _save_result.result().path
print(f"Base model:  {BASE_MODEL}")
print(f"Checkpoint:  {sampler_path}")
Output
Base model:  Qwen/Qwen3.5-4B
Checkpoint:  tinker://889f30af-10c9-59e2-b80e-a6bb5166b2d2:train:0/sampler_weights/export-tutorial

Step 1: Download the checkpoint

Use weights.download() to fetch a Tinker checkpoint to local disk. The tinker_path follows the format tinker://<run_id>/sampler_weights/<name>.

from tinker_cookbook import weights

adapter_dir = weights.download(
    tinker_path=sampler_path,
    output_dir="/tmp/tinker-export-tutorial/adapter",
)
print(f"Adapter downloaded to: {adapter_dir}")
Output
Adapter downloaded to: /tmp/tinker-export-tutorial/adapter
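The path components can be pulled apart with plain string handling. The helper below is illustrative only (it is not part of the tinker API) and assumes the documented `tinker://<run_id>/sampler_weights/<name>` layout:

```python
def parse_tinker_path(tinker_path: str) -> dict:
    """Split a tinker:// checkpoint path into its components.

    Illustrative helper, not part of the tinker API. Assumes the
    documented layout tinker://<run_id>/<kind>/<name>.
    """
    assert tinker_path.startswith("tinker://"), "not a tinker path"
    run_id, kind, name = tinker_path.removeprefix("tinker://").split("/")
    return {"run_id": run_id, "kind": kind, "name": name}

parts = parse_tinker_path(
    "tinker://889f30af-10c9-59e2-b80e-a6bb5166b2d2:train:0/sampler_weights/export-tutorial"
)
print(parts)
```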

Step 2: Merge the adapter into a full model

weights.build_hf_model() downloads the base model from HuggingFace Hub, applies the LoRA deltas, and saves the merged result.

OUTPUT_PATH = "/tmp/tinker-export-tutorial/merged_model"

weights.build_hf_model(
    base_model=BASE_MODEL,
    adapter_path=adapter_dir,
    output_path=OUTPUT_PATH,
)
print(f"Merged model saved to: {OUTPUT_PATH}")
Output
Merged model saved to: /tmp/tinker-export-tutorial/merged_model

Step 3: Inspect the output

The output directory is a standard HuggingFace model: it contains the config, tokenizer files, and safetensors shards.

import os

for _f in sorted(os.listdir(OUTPUT_PATH)):
    _size_mb = os.path.getsize(os.path.join(OUTPUT_PATH, _f)) / 1e6
    print(f"  {_f:45s} {_size_mb:>8.1f} MB")
Output
  chat_template.jinja                                0.0 MB
  config.json                                        0.0 MB
  model-00001-of-00002.safetensors                5329.4 MB
  model-00002-of-00002.safetensors                3990.4 MB
  model.safetensors.index.json                       0.1 MB
  processor_config.json                              0.0 MB
  tokenizer.json                                    20.0 MB
  tokenizer_config.json                              0.0 MB
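The index file, model.safetensors.index.json, maps each tensor name to the shard that stores it, which is how loaders find weights across shards. A minimal sketch of reading it (the dict below is a made-up stand-in for the real file's contents, which follow the same two-key layout):

```python
# Stand-in for json.load(open(f"{OUTPUT_PATH}/model.safetensors.index.json"));
# the tensor names and byte count here are made up for illustration.
index = {
    "metadata": {"total_size": 9_319_800_832},
    "weight_map": {
        "model.embed_tokens.weight": "model-00001-of-00002.safetensors",
        "lm_head.weight": "model-00002-of-00002.safetensors",
    },
}

# Every distinct shard referenced by the weight map should exist on disk
shards = sorted(set(index["weight_map"].values()))
total_gb = index["metadata"]["total_size"] / 1e9
print(f"{len(shards)} shards, {total_gb:.1f} GB of weights")
```

A quick consistency check is to compare `shards` against the .safetensors files actually present in the output directory.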

Step 4: Verify the output

The merged model is a standard HuggingFace model — you can load it with transformers, serve it with vLLM, or deploy with any HF-compatible framework:

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(OUTPUT_PATH)
model = AutoModelForCausalLM.from_pretrained(OUTPUT_PATH, device_map="auto")

inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output[0], skip_special_tokens=True))
We can also check the merged config programmatically:

import json

# Verify the config is valid
with open(f"{OUTPUT_PATH}/config.json") as _f:
    _config = json.load(_f)
# Some models nest text params under text_config (e.g. vision-language models)
_tc = _config.get("text_config", _config)
print(f"Architecture:    {_config.get('architectures', ['unknown'])[0]}")
print(f"Hidden size:     {_tc.get('hidden_size', 'unknown')}")
print(f"Num layers:      {_tc.get('num_hidden_layers', 'unknown')}")
print(f"Vocab size:      {_tc.get('vocab_size', 'unknown')}")
Output
Architecture:    Qwen3_5ForConditionalGeneration
Hidden size:     2560
Num layers:      32
Vocab size:      248320

Next steps