Tutorial 01: Hello Tinker

Run it interactively

curl -O https://raw.githubusercontent.com/thinking-machines-lab/tinker-cookbook/main/tutorials/101_hello_tinker.py && uv run marimo edit 101_hello_tinker.py

Tinker is a remote GPU service for LLM training and inference. You write training loops in Python on your local machine; Tinker executes the heavy GPU operations (forward passes, backpropagation, sampling) on remote workers.

Your machine (CPU)                    Tinker Service (GPU)
+-----------------------+             +------------------------+
| Python training loop  |  -------->  | Forward/backward pass  |
| Data preparation      |  <--------  | Optimizer steps        |
| Evaluation logic      |             | Text generation        |
+-----------------------+             +------------------------+

You control the logic. Tinker runs the compute.

import warnings

warnings.filterwarnings("ignore", message="IProgress not found")

import tinker
from tinker import types

The client hierarchy

The entry point to Tinker is the ServiceClient. From it, you create specialized clients:

  • SamplingClient -- generates text from a model (inference)
  • TrainingClient -- runs forward/backward passes and optimizer steps (training)

Both talk to the same remote GPU workers. Let's start with the ServiceClient.

# Create a ServiceClient. This reads TINKER_API_KEY from your environment.
service_client = tinker.ServiceClient()

# Check what models are available
capabilities = await service_client.get_server_capabilities_async()
print("Available models:")
for model in capabilities.supported_models:
    print(f"  - {model.model_name}")
Output
Available models:
  - deepseek-ai/DeepSeek-V3.1
  - deepseek-ai/DeepSeek-V3.1-Base
  - moonshotai/Kimi-K2-Thinking
  - moonshotai/Kimi-K2.5
  - moonshotai/Kimi-K2.5:peft:131072
  - meta-llama/Llama-3.1-70B
  - meta-llama/Llama-3.1-8B
  - meta-llama/Llama-3.1-8B-Instruct
  - meta-llama/Llama-3.2-1B
  - meta-llama/Llama-3.2-3B
  - meta-llama/Llama-3.3-70B-Instruct
  - nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16
  - nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16
  - nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16:peft:262144
  - Qwen/Qwen3-235B-A22B-Instruct-2507
  - Qwen/Qwen3-30B-A3B
  - Qwen/Qwen3-30B-A3B-Base
  - Qwen/Qwen3-30B-A3B-Instruct-2507
  - Qwen/Qwen3-32B
  - Qwen/Qwen3-4B-Instruct-2507
  - Qwen/Qwen3-8B
  - Qwen/Qwen3-8B-Base
  - Qwen/Qwen3-VL-235B-A22B-Instruct
  - Qwen/Qwen3-VL-30B-A3B-Instruct
  - Qwen/Qwen3.5-27B
  - Qwen/Qwen3.5-35B-A3B
  - Qwen/Qwen3.5-397B-A17B
  - Qwen/Qwen3.5-397B-A17B:peft:262144
  - Qwen/Qwen3.5-4B
  - openai/gpt-oss-120b
  - openai/gpt-oss-120b:peft:131072
  - openai/gpt-oss-20b

Sampling from a model

Let's create a SamplingClient to generate text. We will use Qwen/Qwen3-4B-Instruct-2507, a compact model that keeps costs low.

The sampling workflow is:

  1. Create a SamplingClient with a base model name
  2. Encode your prompt into tokens using the model's tokenizer
  3. Call sample() with the prompt and sampling parameters
  4. Decode the returned tokens back into text

MODEL_NAME = "Qwen/Qwen3-4B-Instruct-2507"

# Create a sampling client -- this connects to a remote GPU worker
sampling_client = await service_client.create_sampling_client_async(base_model=MODEL_NAME)

# Get the tokenizer for encoding/decoding text
tokenizer = sampling_client.get_tokenizer()
# Encode a prompt into tokens
prompt_text = "The three largest cities in the world by population are"
prompt = types.ModelInput.from_ints(tokenizer.encode(prompt_text))

# Sample a completion
params = types.SamplingParams(max_tokens=50, temperature=0.7, stop=["\n"])
result = await sampling_client.sample_async(prompt=prompt, sampling_params=params, num_samples=1)

# Decode and print
completion_tokens = result.sequences[0].tokens
print(prompt_text + tokenizer.decode(completion_tokens))
Output
The three largest cities in the world by population are Tokyo, Shanghai, and Delhi, with Tokyo having the largest population. Let's say that the population of Tokyo is 10 million more than the population of Shanghai, and the population of Shanghai is 5 million more than the population of Delhi.

Inspecting the response

The sample() call returns a SampleResponse containing a list of SampledSequence objects. Each sequence has:

  • tokens -- the generated token IDs
  • logprobs -- log probability of each generated token (if requested)
  • stop_reason -- why generation stopped (e.g., hit max tokens, hit a stop string)

_seq = result.sequences[0]
print(f"Stop reason:    {_seq.stop_reason}")
print(f"Tokens generated: {len(_seq.tokens)}")
print(f"Token IDs:      {_seq.tokens[:10]}...")
print(f"Log probs:      {_seq.logprobs[:10]}...")  # first 10 of 50
Output
Stop reason:    length
Tokens generated: 50
Token IDs:      [26194, 11, 37047, 11, 323, 21996, 11, 448, 26194, 3432]...
Log probs:      [-0.5095227956771851, -0.004706732928752899, -2.459627389907837, -0.0007908792467787862, 0.0, -0.02780775912106037, -4.47653865814209, -0.09116745740175247, -0.6356776356697083, -0.24344521760940552]...
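Each logprob is the natural log of the probability the model assigned to that token given everything before it, so exponentiating recovers the probability. A quick sketch using (rounded) values from the output above:

```python
import math

# exp(logprob) converts a log probability back into a probability.
# A logprob of 0.0 means the model assigned probability 1.0 to that token.
logprobs = [-0.5095, -0.0047, -2.4596, -0.0008, 0.0]
probs = [math.exp(lp) for lp in logprobs]
print([round(p, 3) for p in probs])
```

Very negative values (like the -6.46 above) mark tokens the model found surprising in context.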

You can also generate multiple samples at once by setting num_samples. Each sample is an independent completion from the same prompt.

result_1 = await sampling_client.sample_async(
    prompt=prompt,
    sampling_params=types.SamplingParams(max_tokens=50, temperature=0.9, stop=["\n"]),
    num_samples=3,
)
for i, _seq in enumerate(result_1.sequences):
    text = tokenizer.decode(_seq.tokens)
    print(f"Sample {i}: {text}")
Output
Sample 0: : Tokyo, Delhi, and Shanghai. Based on this information, which of the following is true?


Sample 1:  Beijing, Tokyo, and Delhi. The population of Beijing is 22 million, Tokyo is 13 million, and Delhi is 16 million. What is the average of the three cities' populations?


Sample 2:  London, Tokyo, and Shanghai. What are the city's populations in millions of people?

What about training?

So far we have only done inference. The real power of Tinker is training -- running forward/backward passes and optimizer steps on remote GPUs while you control the training loop locally.

The workflow looks like this:

  1. Create a TrainingClient with service_client.create_lora_training_client()
  2. Prepare training data as Datum objects (input tokens + loss targets)
  3. Call training_client.forward_backward() to compute gradients
  4. Call training_client.optim_step() to update weights
  5. Save weights and create a SamplingClient to evaluate the trained model

We will walk through this in the next tutorial.
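As a small preview of step 2, the core idea behind a training example is pairing input tokens with per-token loss weights, so the model learns only from the completion and not from the prompt. A sketch of that idea in plain Python (the exact Datum fields come in the next tutorial; the names and token IDs below are illustrative, not Tinker's API):

```python
# Illustrative sketch: pair each input token with a loss weight.
# Weight 0.0 on prompt tokens (no gradient), 1.0 on completion tokens.
prompt_tokens = [101, 2054, 2003]      # hypothetical token IDs for the prompt
completion_tokens = [1996, 3437, 102]  # hypothetical token IDs for the target

input_tokens = prompt_tokens + completion_tokens
loss_weights = [0.0] * len(prompt_tokens) + [1.0] * len(completion_tokens)

assert len(input_tokens) == len(loss_weights)
print(list(zip(input_tokens, loss_weights)))
```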

Next steps