tinker_cookbook.stores.EvalStore
class tinker_cookbook.stores.EvalStore()
Manages evaluation runs across checkpoints.
All file I/O goes through the Storage protocol, making this
backend-agnostic (local disk, S3, GCS).
Pickle-serializable when freshly constructed.
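As an illustration of what "backend-agnostic" means here, the following is a minimal sketch, not the library's actual Storage protocol. The method names read_text, write_text, and exists, the InMemoryStorage class, and the save_note helper are all assumptions made up for this example.

```python
from typing import Protocol


class Storage(Protocol):
    """Hypothetical minimal storage interface; the real protocol may differ."""

    def read_text(self, path: str) -> str: ...
    def write_text(self, path: str, data: str) -> None: ...
    def exists(self, path: str) -> bool: ...


class InMemoryStorage:
    """Toy backend that keeps files in a dict, standing in for disk, S3, or GCS."""

    def __init__(self) -> None:
        self._files: dict[str, str] = {}

    def read_text(self, path: str) -> str:
        if path not in self._files:
            raise FileNotFoundError(path)
        return self._files[path]

    def write_text(self, path: str, data: str) -> None:
        self._files[path] = data

    def exists(self, path: str) -> bool:
        return path in self._files


def save_note(storage: Storage, run_id: str, note: str) -> str:
    """Write through the protocol only, so any backend works unchanged."""
    path = f"{run_id}/note.txt"
    storage.write_text(path, note)
    return path


storage = InMemoryStorage()
note_path = save_note(storage, "run-001", "hello")
```

Because save_note only calls methods declared on the protocol, swapping the in-memory backend for a disk or cloud backend would not require changing it.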
url(path)
create_run(model_name, benchmarks, checkpoint_path, checkpoint_name, config, run_id)
Create a new evaluation run and return its run_id.
Parameters:
- model_name (str)
- benchmarks (list[str])
- checkpoint_path (str | None)
- checkpoint_name (str | None)
- config (dict | None)
- run_id (str | None)
Returns: str
run_dir(run_id)
Return the filesystem path of a run, for backward compatibility with BenchmarkConfig.save_dir.
Only works with LocalStorage (returns a local path string).
For cloud backends, use url() on the storage directly.
Parameters:
- run_id (str)
Returns: str
finalize_run(run_id)
Collect scores from benchmark results and update metadata.
Parameters:
- run_id (str)
Returns: RunMetadata
list_runs()
List all evaluation runs, most recent first.
Returns: list[RunMetadata]
read_run(run_id)
Load metadata for a specific run. Raises FileNotFoundError if missing.
Parameters:
- run_id (str)
Returns: RunMetadata
list_benchmarks(run_id)
read_result(run_id, benchmark)
Get the aggregated result for a benchmark.
Parameters:
- run_id (str)
- benchmark (str)
Returns: BenchmarkResult | None
read_trajectories(run_id, benchmark, correct_only, incorrect_only, errors_only)
Get trajectories with optional filtering.
Parameters:
- run_id (str)
- benchmark (str)
- correct_only (bool)
- incorrect_only (bool)
- errors_only (bool)
Returns: list[StoredTrajectory]
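The three boolean flags can be pictured as independent filters. The sketch below is only an illustration over plain dicts: the field names "correct" and "error", the False defaults, and the way the flags combine are assumptions, not documented behavior of StoredTrajectory or read_trajectories.

```python
from typing import Any


def filter_trajectories(
    trajectories: list[dict[str, Any]],
    correct_only: bool = False,
    incorrect_only: bool = False,
    errors_only: bool = False,
) -> list[dict[str, Any]]:
    """Keep the trajectories that pass every enabled filter."""
    kept = []
    for traj in trajectories:
        # "correct" and "error" are hypothetical field names.
        if correct_only and not traj.get("correct"):
            continue
        if incorrect_only and traj.get("correct"):
            continue
        if errors_only and traj.get("error") is None:
            continue
        kept.append(traj)
    return kept


sample = [
    {"idx": 0, "correct": True, "error": None},
    {"idx": 1, "correct": False, "error": None},
    {"idx": 2, "correct": False, "error": "timeout"},
]
correct = filter_trajectories(sample, correct_only=True)
errored = filter_trajectories(sample, errors_only=True)
```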
read_single_trajectory(run_id, benchmark, idx)
Get a single trajectory by index. This is an O(n) scan that loads all trajectories.
Parameters:
- run_id (str)
- benchmark (str)
- idx (int)
Returns: StoredTrajectory | None
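To make the O(n) cost concrete, here is a rough sketch of a lookup that parses a whole JSONL file before picking one record. It assumes idx is a zero-based position in the file and that None is returned when the index is out of range; the helper name read_single and the record contents are hypothetical.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Optional


def read_single(jsonl_path: Path, idx: int) -> Optional[dict[str, Any]]:
    """Load every line, then return the record at position idx, or None."""
    with jsonl_path.open() as f:
        records = [json.loads(line) for line in f if line.strip()]
    if 0 <= idx < len(records):
        return records[idx]
    return None


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "trajectories.jsonl"
    path.write_text('{"answer": "a"}\n{"answer": "b"}\n')
    second = read_single(path, 1)
    missing = read_single(path, 5)
```

When many trajectories from the same benchmark are needed, loading them once with read_trajectories is likely cheaper than repeated single lookups.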
read_summary(run_id)
Read the combined summary for a run, or None if missing.
Parameters:
- run_id (str)
Returns: dict[str, Any] | None
write_result(run_id, result)
write_trajectory(run_id, benchmark, traj)
Append one trajectory to the JSONL file.
Parameters:
- run_id (str)
- benchmark (str)
- traj (StoredTrajectory)
Returns: None
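Appending to a JSONL file means writing one JSON object per line without rewriting earlier lines. A minimal local-disk sketch follows; the directory layout, the benchmark name "my_benchmark", and the append_jsonl helper are assumptions for illustration, and the real method goes through the Storage protocol rather than pathlib.

```python
import json
import tempfile
from pathlib import Path
from typing import Any


def append_jsonl(path: Path, record: dict[str, Any]) -> None:
    """Append one record as a single JSON line, creating the file if needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical layout: <root>/<run_id>/<benchmark>/trajectories.jsonl
    path = Path(tmp) / "run-001" / "my_benchmark" / "trajectories.jsonl"
    append_jsonl(path, {"idx": 0, "correct": True})
    append_jsonl(path, {"idx": 1, "correct": False})
    lines = path.read_text(encoding="utf-8").splitlines()
```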
write_summary(run_id, results)
Save a combined summary.
Parameters:
- run_id (str)
- results
Returns: None
delete_run(run_id)
Delete all data for a run. Idempotent (no error if already gone).
Removes metadata, summary, and all benchmark result/trajectory files.
The runs.jsonl index is append-only and not modified; list_runs()
checks for metadata.json existence so deleted runs are excluded.
Parameters:
- run_id (str)
Returns: None
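The combination of an append-only index and an existence check can be sketched as follows. This is a simplified local-disk illustration under assumptions: the helpers register_run and list_run_ids, the one-directory-per-run layout, and the contents of each index line are hypothetical, and ordering assumes that append order is chronological.

```python
import json
import shutil
import tempfile
from pathlib import Path


def register_run(root: Path, run_id: str) -> None:
    """Write metadata.json for the run and append its id to the index."""
    run_dir = root / run_id
    run_dir.mkdir(parents=True, exist_ok=True)
    (run_dir / "metadata.json").write_text(json.dumps({"run_id": run_id}))
    with (root / "runs.jsonl").open("a") as f:
        f.write(json.dumps({"run_id": run_id}) + "\n")


def delete_run(root: Path, run_id: str) -> None:
    """Remove the run's files; the index is left untouched. Idempotent."""
    shutil.rmtree(root / run_id, ignore_errors=True)


def list_run_ids(root: Path) -> list[str]:
    """Read the index, skipping entries whose metadata.json is gone."""
    index = root / "runs.jsonl"
    if not index.exists():
        return []
    ids = [json.loads(line)["run_id"] for line in index.read_text().splitlines() if line]
    live = [rid for rid in ids if (root / rid / "metadata.json").exists()]
    return list(reversed(live))  # most recent first


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    register_run(root, "run-a")
    register_run(root, "run-b")
    delete_run(root, "run-a")
    delete_run(root, "run-a")  # second call is a no-op
    remaining = list_run_ids(root)
    index_lines = (root / "runs.jsonl").read_text().splitlines()
```

The index still holds both entries after the delete, but only the run with a surviving metadata.json is listed.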
alist_runs()
Async version of list_runs.
Returns: list[RunMetadata]
aread_trajectories(run_id, benchmark, **kw)
Async version of read_trajectories.
Parameters:
- run_id (str)
- benchmark (str)
- **kw
Returns: list[StoredTrajectory]
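One common way to offer an async variant of a blocking read is to run it in a worker thread and forward the keyword arguments. The sketch below shows that pattern with a stand-in function; whether EvalStore implements its async methods this way is not stated in this reference, and read_trajectories_sync and the returned dict shape are hypothetical.

```python
import asyncio
import time


def read_trajectories_sync(run_id: str, benchmark: str, **kw: bool) -> list[dict]:
    """Stand-in for a blocking read; sleeps briefly to mimic I/O."""
    time.sleep(0.01)
    return [{"run_id": run_id, "benchmark": benchmark, "filters": kw}]


async def aread_trajectories(run_id: str, benchmark: str, **kw: bool) -> list[dict]:
    """Run the blocking call in a worker thread and forward the keyword filters."""
    return await asyncio.to_thread(read_trajectories_sync, run_id, benchmark, **kw)


async def main() -> list[list[dict]]:
    # Two reads proceed concurrently instead of back to back.
    return await asyncio.gather(
        aread_trajectories("run-001", "bench_a", correct_only=True),
        aread_trajectories("run-001", "bench_b"),
    )


results = asyncio.run(main())
```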