tinker_cookbook.stores.EvalStore

class tinker_cookbook.stores.EvalStore()

Manages evaluation runs across checkpoints.

All file I/O goes through the Storage protocol, making this backend-agnostic (local disk, S3, GCS).

Pickle-serializable when freshly constructed.
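
To illustrate what routing all file I/O through a storage protocol buys, here is a minimal sketch. The method names (read_text, write_text, exists), the InMemoryStorage class, and the save_note helper are all assumptions made for illustration; the actual Storage protocol in tinker_cookbook may expose a different interface.

```python
from typing import Protocol


class Storage(Protocol):
    """Hypothetical minimal storage interface; the real protocol may differ."""

    def read_text(self, path: str) -> str: ...
    def write_text(self, path: str, data: str) -> None: ...
    def exists(self, path: str) -> bool: ...


class InMemoryStorage:
    """Toy backend that keeps file contents in a dict keyed by path."""

    def __init__(self) -> None:
        self.files: dict[str, str] = {}

    def read_text(self, path: str) -> str:
        if path not in self.files:
            raise FileNotFoundError(path)
        return self.files[path]

    def write_text(self, path: str, data: str) -> None:
        self.files[path] = data

    def exists(self, path: str) -> bool:
        return path in self.files


def save_note(storage: Storage, run_id: str, text: str) -> str:
    """Code written against the protocol never touches a concrete backend."""
    path = f"{run_id}/note.txt"
    storage.write_text(path, text)
    return path


backend = InMemoryStorage()
note_path = save_note(backend, "run-001", "hello")
```

Swapping InMemoryStorage for a local-disk or cloud-backed class would leave save_note unchanged, which is the property the description above refers to as backend-agnostic.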

url(path)

Return a human-readable URI for a path within this eval store.

Parameters:

path (str)

Returns: str

create_run(model_name, benchmarks, checkpoint_path, checkpoint_name, config, run_id)

Create a new evaluation run and return its run_id.

Parameters:

model_name (str)
benchmarks (list[str])
checkpoint_path (str | None)
checkpoint_name (str | None)
config (dict | None)
run_id (str | None)

Returns: str

run_dir(run_id)

Return the filesystem path for a run, for backward compatibility with BenchmarkConfig.save_dir.

Only works with LocalStorage (returns a local path string). For cloud backends, use url() on the storage directly.

Parameters:

run_id (str)

Returns: str

finalize_run(run_id)

Collect scores from benchmark results and update metadata.

Parameters:

run_id (str)

Returns: RunMetadata
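
As a rough picture of what "collect scores and update metadata" could look like, the sketch below uses plain dicts. The "scores" key, the "score" field, and the collect_scores helper are hypothetical; the real RunMetadata and BenchmarkResult types may be shaped differently.

```python
# Hypothetical shapes: the real RunMetadata and BenchmarkResult fields may differ.
def collect_scores(metadata: dict, results: dict) -> dict:
    """Return a copy of metadata with one score per benchmark that has a result."""
    updated = dict(metadata)
    updated["scores"] = {
        name: result["score"] for name, result in results.items() if result is not None
    }
    return updated


run_meta = {"run_id": "run-a", "benchmarks": ["bench-1", "bench-2"]}
bench_results = {"bench-1": {"score": 0.75}, "bench-2": None}  # bench-2 has no result yet
final_meta = collect_scores(run_meta, bench_results)
```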

list_runs()

List all evaluation runs, most recent first.

Returns: list[RunMetadata]

read_run(run_id)

Load metadata for a specific run. Raises FileNotFoundError if the run's metadata is missing.

Parameters:

run_id (str)

Returns: RunMetadata

list_benchmarks(run_id)

List benchmark names that have results for a run.

Parameters:

run_id (str)

Returns: list[str]

read_result(run_id, benchmark)

Get the aggregated result for a benchmark in a run.

Parameters:

run_id (str)
benchmark (str)

Returns: BenchmarkResult | None

read_trajectories(run_id, benchmark, correct_only, incorrect_only, errors_only)

Get trajectories with optional filtering.

Parameters:

run_id (str)
benchmark (str)
correct_only (bool)
incorrect_only (bool)
errors_only (bool)

Returns: list[StoredTrajectory]
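
The three boolean flags can be thought of as independent predicates applied in sequence. The sketch below is an assumption about how such filtering might behave, not the library's implementation; the Traj class and its correct and error fields are hypothetical stand-ins for StoredTrajectory.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Traj:
    """Hypothetical stand-in for StoredTrajectory; field names are assumptions."""
    idx: int
    correct: bool
    error: Optional[str] = None


def filter_trajectories(trajs, correct_only=False, incorrect_only=False, errors_only=False):
    """Apply each enabled flag as an extra filter; no flags means no filtering."""
    out = list(trajs)
    if correct_only:
        out = [t for t in out if t.correct]
    if incorrect_only:
        out = [t for t in out if not t.correct]
    if errors_only:
        out = [t for t in out if t.error is not None]
    return out


sample = [Traj(0, True), Traj(1, False), Traj(2, False, error="timeout")]
correct_ids = [t.idx for t in filter_trajectories(sample, correct_only=True)]
failed_with_error = [
    t.idx for t in filter_trajectories(sample, incorrect_only=True, errors_only=True)
]
```

Under this reading, passing both correct_only and incorrect_only yields an empty list; whether the real method treats that combination the same way is not stated in the reference.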

read_single_trajectory(run_id, benchmark, idx)

Get a single trajectory by index. This is an O(n) scan that loads all trajectories.

Parameters:

run_id (str)
benchmark (str)
idx (int)

Returns: StoredTrajectory | None
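
Because each lookup loads every trajectory, several point lookups cost several full scans. The sketch below counts parsed lines to make that concrete. It assumes idx is a position in the file, and load_all and read_single are illustrative helpers rather than library functions.

```python
import io
import json

# Five fake trajectory records, one JSON object per line.
jsonl = "\n".join(json.dumps({"idx": i, "reward": i % 2}) for i in range(5)) + "\n"
parse_count = 0


def load_all(text):
    """Parse every line, counting how many lines were parsed."""
    global parse_count
    rows = []
    for line in io.StringIO(text):
        parse_count += 1
        rows.append(json.loads(line))
    return rows


def read_single(text, idx):
    """Load everything, then pick one entry; returns None when idx is out of range."""
    rows = load_all(text)
    return rows[idx] if 0 <= idx < len(rows) else None


# Three point lookups parse the whole file three times.
for i in (0, 2, 4):
    read_single(jsonl, i)
repeated_cost = parse_count

# Loading once and indexing locally parses the file a single time.
parse_count = 0
rows = load_all(jsonl)
picked = [rows[i] for i in (0, 2, 4)]
batched_cost = parse_count

missing = read_single(jsonl, 99)
```

When many trajectories from the same benchmark are needed, calling read_trajectories once and indexing the returned list is likely the cheaper pattern.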

read_summary(run_id)

Read the combined summary for a run, or None if missing.

Parameters:

run_id (str)

Returns: dict[str, Any] | None

write_result(run_id, result)

Save a benchmark result.

Parameters:

run_id (str)
result (BenchmarkResult)

Returns: None

write_trajectory(run_id, benchmark, traj)

Append one trajectory to the JSONL file.

Parameters:

run_id (str)
benchmark (str)
traj (StoredTrajectory)

Returns: None
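
Appending to a JSONL file means writing one JSON object per line without rewriting earlier lines. The following is a generic sketch of that pattern on local disk; the append_jsonl helper, the file name, and the record fields are assumptions, and the real method writes through the Storage protocol rather than open().

```python
import json
import os
import tempfile


def append_jsonl(path: str, record: dict) -> None:
    # Mode "a" adds to the end of the file; existing lines are never rewritten.
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")


with tempfile.TemporaryDirectory() as tmp:
    traj_path = os.path.join(tmp, "trajectories.jsonl")  # hypothetical file name
    append_jsonl(traj_path, {"idx": 0, "reward": 1.0})
    append_jsonl(traj_path, {"idx": 1, "reward": 0.0})
    # Read the file back: one parsed record per line, in write order.
    with open(traj_path, encoding="utf-8") as f:
        loaded = [json.loads(line) for line in f]
```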

write_summary(run_id, results)

Save a combined summary.

Parameters:

run_id (str)
results (dict[str, BenchmarkResult])

Returns: None

delete_run(run_id)

Delete all data for a run. Idempotent (no error if already gone).

Removes metadata, summary, and all benchmark result/trajectory files. The runs.jsonl index is append-only and not modified; list_runs() checks for metadata.json existence so deleted runs are excluded.

Parameters:

run_id (str)

Returns: None
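
The interaction between the append-only runs.jsonl index and the metadata.json existence check can be sketched with a dict standing in for the storage backend. The path layout and the three functions below are simplified assumptions for illustration, not the library's code; the sketch also assumes that index order reflects creation order.

```python
import json

files: dict[str, str] = {}  # stand-in for a storage backend: path -> contents


def create_run(run_id: str, model_name: str) -> str:
    files[f"{run_id}/metadata.json"] = json.dumps(
        {"run_id": run_id, "model_name": model_name}
    )
    # The index only ever grows: one JSON line per created run.
    files["runs.jsonl"] = files.get("runs.jsonl", "") + json.dumps({"run_id": run_id}) + "\n"
    return run_id


def delete_run(run_id: str) -> None:
    # Drop every file under the run's prefix; a repeat call finds nothing to remove.
    for path in [p for p in files if p.startswith(f"{run_id}/")]:
        del files[path]


def list_runs() -> list[str]:
    ids = [json.loads(line)["run_id"] for line in files.get("runs.jsonl", "").splitlines()]
    # Newest first, skipping index entries whose metadata.json no longer exists.
    return [r for r in reversed(ids) if f"{r}/metadata.json" in files]


create_run("run-a", "model-x")
create_run("run-b", "model-x")
delete_run("run-a")
delete_run("run-a")  # second call is a no-op
remaining = list_runs()
index_lines = files["runs.jsonl"].count("\n")
```

After deletion the index still holds both entries, but only run-b is listed, which matches the behavior described above.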

alist_runs()

Async version of list_runs.

Returns: list[RunMetadata]

aread_trajectories(run_id, benchmark, **kw)

Async version of read_trajectories.

Parameters:

run_id (str)
benchmark (str)
**kw (Any)

Returns: list[StoredTrajectory]

aread_result(run_id, benchmark)

Async version of read_result.

Parameters:

run_id (str)
benchmark (str)

Returns: BenchmarkResult | None
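
One common way to offer async counterparts of blocking methods is to run the sync call in a worker thread. The reference does not say how the a-prefixed methods are implemented, so the sketch below is only an assumption about the general pattern; TinyStore and its data are hypothetical.

```python
import asyncio


class TinyStore:
    """Hypothetical stand-in; not the real EvalStore."""

    def __init__(self):
        self._results = {("run-a", "bench-1"): {"score": 0.5}}

    def read_result(self, run_id, benchmark):
        # Blocking lookup; returns None when no result is stored.
        return self._results.get((run_id, benchmark))

    async def aread_result(self, run_id, benchmark):
        # Run the blocking call in a worker thread so the event loop stays free.
        return await asyncio.to_thread(self.read_result, run_id, benchmark)


async def main():
    store = TinyStore()
    # Both reads are scheduled concurrently and awaited together.
    return await asyncio.gather(
        store.aread_result("run-a", "bench-1"),
        store.aread_result("run-a", "missing"),
    )


found, absent = asyncio.run(main())
```

The practical benefit is that several reads can be issued concurrently with asyncio.gather instead of one after another.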