Introduction
Evals (evaluations) let you measure how well your agent or task function performs across a set of test cases. Rather than checking a single output manually, you define a Dataset of inputs (with optional expected outputs), run your function over all of them, and score each result with Evaluators. This is a one-to-one TypeScript port of Pydantic AI's evals module.

Key benefits:

- Systematic testing - catch regressions across many input scenarios
- Quantitative scoring - go beyond pass/fail with numeric scores and summary statistics
- LLM-as-judge - use a language model to evaluate subjective qualities like helpfulness or accuracy
- Concurrency - run cases in parallel with configurable limits
- Immutable data - all dataset transformations return new objects
Quick start
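As an end-to-end sketch of the workflow, the following uses minimal local stand-ins for `Case`, `EvalScore`, and the evaluate loop that mirror the shapes described in this document; the real module's names and signatures may differ:

```typescript
// Minimal stand-ins mirroring the documented shapes (assumptions, not the real API).
interface Case<I, E> { name?: string; inputs: I; expectedOutput?: E }
interface EvalScore { evaluatorName: string; value: boolean | number }

// Run a task over every case and score each result with a strict
// equals-expected check, like Dataset.evaluate() with equalsExpected().
async function evaluateAll<I, E>(
  cases: readonly Case<I, E>[],
  task: (inputs: I) => Promise<E>,
): Promise<EvalScore[]> {
  const scores: EvalScore[] = [];
  for (const c of cases) {
    const output = await task(c.inputs);
    scores.push({ evaluatorName: "equalsExpected", value: output === c.expectedOutput });
  }
  return scores;
}

// Example: a trivial "task" that uppercases its input.
const quickStartCases: Case<string, string>[] = [
  { name: "simple", inputs: "hello", expectedOutput: "HELLO" },
  { name: "mixed-case", inputs: "MiXeD", expectedOutput: "MIXED" },
];
```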
Dataset and case
A Dataset is an immutable collection of Cases. Each Case has:

| Field | Type | Required | Description |
|---|---|---|---|
| inputs | TInput | Yes | The input passed to your task function |
| name | string | No | Human-readable label for this case |
| expectedOutput | TExpected | No | Reference output for evaluators |
| metadata | Record<string, unknown> | No | Arbitrary metadata attached to this case |
| evaluators | Evaluator[] | No | Per-case evaluators (added to dataset evaluators) |
Creating datasets
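For illustration, constructing a dataset might look like this; the `Dataset` class below is a minimal stand-in, and the module's actual constructor or factory function may differ:

```typescript
interface Case<I, E> { name?: string; inputs: I; expectedOutput?: E }

// Stand-in: a Dataset holds a readonly list of cases (assumed shape).
class Dataset<I, E> {
  constructor(readonly cases: readonly Case<I, E>[]) {}
}

const capitals = new Dataset([
  { name: "capital-fr", inputs: "France", expectedOutput: "Paris" },
  { name: "capital-jp", inputs: "Japan", expectedOutput: "Tokyo" },
]);
```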
Immutable transformations
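A sketch of the immutability guarantee: transformations return a new Dataset and never mutate the original. The method names here (`addCase`, `filterCases`) are illustrative assumptions:

```typescript
interface Case<I> { name?: string; inputs: I }

// Stand-in Dataset whose transformations return new objects,
// matching the "immutable data" guarantee described above.
class Dataset<I> {
  constructor(readonly cases: readonly Case<I>[]) {}
  addCase(c: Case<I>): Dataset<I> {
    return new Dataset([...this.cases, c]); // original is untouched
  }
  filterCases(pred: (c: Case<I>) => boolean): Dataset<I> {
    return new Dataset(this.cases.filter(pred));
  }
}

const base = new Dataset([{ inputs: 1 }, { inputs: 2 }]);
const extended = base.addCase({ inputs: 3 });
```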
Serialization
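A hypothetical round-trip: serialize the case data to JSON and back. Evaluators are functions, so this sketch deliberately leaves them out; the module's actual serialization helpers may have different names:

```typescript
interface Case<I, E> { name?: string; inputs: I; expectedOutput?: E }

// Serialize the plain case data to a JSON string.
function toJson<I, E>(cases: readonly Case<I, E>[]): string {
  return JSON.stringify({ cases });
}

// Parse a previously serialized dataset back into case objects.
function fromJson<I, E>(json: string): Case<I, E>[] {
  return (JSON.parse(json) as { cases: Case<I, E>[] }).cases;
}

const original: Case<string, string>[] = [
  { name: "capital-fr", inputs: "France", expectedOutput: "Paris" },
];
const restored = fromJson<string, string>(toJson(original));
```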
Built-in evaluators
All built-in evaluators implement the Evaluator interface and return an EvalScore.
Case-level evaluators
| Evaluator | Description | Score type |
|---|---|---|
| equalsExpected() | Strict equality between output and expectedOutput | boolean |
| equals(value) | Strict equality between output and a fixed value | boolean |
| contains(str) | String output contains substring | boolean |
| isInstance(type) | typeof output === typeName | boolean |
| maxDuration(sec) | Task completed within the time limit | boolean |
| hasMatchingSpan(pred) | Span tree contains a node matching predicate | boolean |
| isValidSchema(schema) | Output validates against a Zod schema | boolean |
| custom(name, fn) | User-defined evaluator function | any |
EvalScore
Every evaluator returns an EvalScore:
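A plausible shape, inferred from the evaluator table above; this is an assumption, so check the module's actual type definition:

```typescript
// Assumed shape of EvalScore (verify against the real type).
interface EvalScore {
  evaluatorName: string;            // which evaluator produced the score
  value: boolean | number | string; // boolean for checks, number for graded scores
  reason?: string;                  // optional explanation, e.g. from an LLM judge
}

const score: EvalScore = { evaluatorName: "equalsExpected", value: true };
```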
LLM-as-judge
For subjective qualities (helpfulness, accuracy, tone), use an LLM as the evaluator.

Helper functions
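A one-off judge call might look like the following sketch. `judgeOutput`, its signature, and the stubbed model call are all illustrative assumptions, not this module's actual exports; a real judge would send the output and rubric to a language model and parse its verdict:

```typescript
interface JudgeResult { pass: boolean; reason: string }

// Stub standing in for a real LLM call (placeholder heuristic only).
async function judgeOutput(output: string, rubric: string): Promise<JudgeResult> {
  const pass = output.trim().length > 0;
  return { pass, reason: `Checked against rubric: ${rubric}` };
}
```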
Helper functions support one-off judge calls outside of a dataset.

Custom evaluators
Implement the Evaluator interface for full control:
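As a sketch, with interface members assumed from this document's descriptions rather than taken from the actual source:

```typescript
// Assumed interface shapes (align with the module's real definitions).
interface EvaluatorContext<I, O> { inputs: I; output: O }
interface EvalScore { evaluatorName: string; value: boolean | number }
interface Evaluator<I, O> {
  name: string;
  evaluate(ctx: EvaluatorContext<I, O>): EvalScore;
}

// A custom evaluator: pass if the output is no longer than the input.
const conciseness: Evaluator<string, string> = {
  name: "conciseness",
  evaluate: (ctx) => ({
    evaluatorName: "conciseness",
    value: ctx.output.length <= ctx.inputs.length,
  }),
};
```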
custom() factory:
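A minimal sketch of what the factory could look like; the signature is an assumption:

```typescript
interface EvalScore { evaluatorName: string; value: boolean | number }

// Wrap a plain scoring function into an evaluator object.
function custom<O>(name: string, fn: (output: O) => boolean | number) {
  return {
    name,
    evaluate: (ctx: { output: O }): EvalScore => ({
      evaluatorName: name,
      value: fn(ctx.output),
    }),
  };
}

const nonEmpty = custom<string>("nonEmpty", (out) => out.trim().length > 0);
```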
Accessing context
The EvaluatorContext gives evaluators access to:
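Given that this is a port of Pydantic AI's evals, the context plausibly carries fields like the following; all names here are assumptions to be verified against the actual type:

```typescript
// Assumed EvaluatorContext fields, modeled on Pydantic AI's evals.
interface EvaluatorContext<I, O, E> {
  name?: string;                      // case name
  inputs: I;                          // the case inputs
  output: O;                          // what the task returned
  expectedOutput?: E;                 // reference output, if the case provides one
  metadata?: Record<string, unknown>; // case metadata
  durationMs?: number;                // how long the task took
  spanTree?: unknown;                 // captured OTel spans, when available
}

const ctx: EvaluatorContext<string, string, string> = {
  inputs: "France",
  output: "Paris",
  expectedOutput: "Paris",
};
```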
Report-level evaluators
Report evaluators run once after all cases complete and receive the full CaseResult[] array. Use them for aggregate metrics.
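For example, a pass-rate aggregate over all case results; `CaseResult` here is a minimal stand-in for the module's real type:

```typescript
interface CaseResult { name: string; scores: { value: boolean | number }[] }

// Report-level evaluator: fraction of scores that are strictly `true`.
function passRate(results: CaseResult[]): number {
  const checks = results.flatMap((r) => r.scores.map((s) => s.value === true));
  return checks.length === 0 ? 0 : checks.filter(Boolean).length / checks.length;
}
```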
Experiment runner
Use Dataset.evaluate() as your primary API. The runExperiment() function is a thin wrapper for cases where you want to merge extra evaluators at call time:
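The wrapper relationship can be sketched like this, with stand-in `Dataset` and `Evaluator` shapes and an assumed options bag; the real signatures may differ:

```typescript
interface Evaluator<O> { name: string; evaluate: (output: O) => boolean }

// Stand-in Dataset.evaluate() that accepts extra evaluators via options.
class Dataset<I, O> {
  constructor(
    readonly cases: readonly { inputs: I }[],
    readonly evaluators: readonly Evaluator<O>[] = [],
  ) {}
  async evaluate(
    task: (inputs: I) => Promise<O>,
    options: { evaluators?: Evaluator<O>[] } = {},
  ) {
    const evaluators = [...this.evaluators, ...(options.evaluators ?? [])];
    const results: { name: string; value: boolean }[][] = [];
    for (const c of this.cases) {
      const output = await task(c.inputs);
      results.push(evaluators.map((e) => ({ name: e.name, value: e.evaluate(output) })));
    }
    return results;
  }
}

// runExperiment() as a thin wrapper that merges call-time evaluators.
function runExperiment<I, O>(
  dataset: Dataset<I, O>,
  task: (inputs: I) => Promise<O>,
  extraEvaluators: Evaluator<O>[] = [],
) {
  return dataset.evaluate(task, { evaluators: extraEvaluators });
}
```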
Evaluate options
| Option | Type | Default | Description |
|---|---|---|---|
| maxConcurrency | number | 5 | Maximum concurrent case evaluations |
| maxRetries | number | 1 | Maximum task retry attempts per case |
| onCaseComplete | function | - | Callback invoked after each case completes |
Span-based evaluation
When your task function captures OTel spans, you can evaluate them with hasMatchingSpan:
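The essence of the check is a depth-first search over the span tree for any node matching a predicate. The `SpanNode` shape below is a stand-in; the module's real span tree type may differ:

```typescript
// Stand-in span node (assumed shape).
interface SpanNode {
  name: string;
  attributes: Record<string, unknown>;
  children: SpanNode[];
}

// Depth-first search for any span matching the predicate, which is
// what a hasMatchingSpan-style check boils down to.
function spanTreeMatches(root: SpanNode, pred: (s: SpanNode) => boolean): boolean {
  return pred(root) || root.children.some((c) => spanTreeMatches(c, pred));
}

const trace: SpanNode = {
  name: "agent-run",
  attributes: {},
  children: [{ name: "tool-call", attributes: { tool: "search" }, children: [] }],
};
```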