Evals
Evals assess how well TAs handled student interactions during office hour simulations. An eval brings together models, rubrics, and configuration flags to run automated scoring across simulation transcripts, producing actionable feedback on TA performance.
What is an Eval?
An eval in Glow is an automated assessment pipeline that scores simulation conversations. Each eval combines:
- Models: the LLM configurations used as evaluator judges (via model_ids)
- Model Rubrics: the scoring criteria applied during evaluation (via model_rubric_ids)
- Model Flags: behavioral flags that tag specific patterns in conversations (via model_flag_ids)
- Model Positions: the ordering of models in multi-judge setups (via model_position_ids)
- Configuration Flags: control whether the eval is active, dynamic, or uses groups
For example, a university might run an eval called “Fall 2025 TA Assessment” that uses GPT-4o as the evaluator model, applies the “Communication Skills” and “Policy Knowledge” rubrics, and flags instances of “Incorrect Information” or “Inappropriate Tone”.
How it Connects
Eval
|
+-- Models (evaluator LLMs via model_ids)
+-- Model Rubrics (scoring criteria via model_rubric_ids)
+-- Model Flags (behavioral tags via model_flag_ids)
+-- Model Positions (judge ordering via model_position_ids)
+-- Departments (scoping via department_ids)
- Models serve as the evaluator judges. The eval uses these models to read simulation transcripts and produce scores. These are separate from the models that powered the simulation; they are the models that grade it.
- Rubrics connect through model_rubric_ids (the eval_model_rubrics_junction). Each rubric’s standard groups and standards define what the evaluator model scores.
- Model Flags tag specific patterns in conversations (e.g., “Gave Incorrect Deadline”, “Used Empathetic Language”). These are distinct from the eval-level configuration flags.
- Departments scope the eval to specific organizational units, enabling department-level reporting.
Create an eval
Via the CLI
Calls below use $GLOW_INSTANCE_URL and $GLOW_TOKEN; see Authentication to export them once.
glow evals create --body '{
"evals": [{
"name": "Fall 2025 TA Assessment",
"description": "Automated scoring of TA office hour simulations for the fall semester",
"model_ids": ["GPT4O_MODEL_UUID"],
"model_rubric_ids": ["COMM_SKILLS_RUBRIC_UUID", "POLICY_KNOWLEDGE_RUBRIC_UUID"]
}]
}'
Via the API
curl -X POST "$GLOW_INSTANCE_URL/eval/create" \
-H "Authorization: Bearer $GLOW_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"evals": [{
"name": "Spring 2026 CS Department Eval",
"description": "Evaluate TA performance across all CS course simulations",
"department_ids": ["CS_DEPT_UUID"],
"model_ids": ["EVALUATOR_MODEL_UUID"],
"model_rubric_ids": ["DE_ESCALATION_RUBRIC_UUID"]
}]
}'
Wiring models, rubrics, and flags
Attaching Evaluator Models
Evals reference models through model_ids. These models act as judges — they read simulation transcripts and score TA performance against the attached rubrics. You can attach multiple models to compare scoring consistency across different LLMs.
glow evals update --body '{
"evals": [{
"eval_id": "EVAL_UUID",
"model_ids": ["GPT4O_MODEL_UUID", "CLAUDE_MODEL_UUID"]
}]
}'
Attaching Rubrics
Use model_rubric_ids to attach rubrics that define the scoring criteria. Each rubric’s standard groups and standards become the dimensions on which the evaluator model scores the conversation.
For example, attaching both “Communication Skills” and “De-escalation” rubrics means every simulation transcript will be scored on active listening, empathy, clarity, conflict resolution, and other standards defined in those rubrics.
Behavioral Flags
Use model_flag_ids to define specific behaviors the evaluator should tag. Unlike rubric standards (which are scored on a point scale), flags are binary markers that call out notable patterns:
- “Provided Incorrect Information”
- “Successfully De-escalated Conflict”
- “Referred Student to Appropriate Resource”
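Rubrics and flags can be attached in one update call. The following is a minimal sketch, assuming that glow evals update accepts model_rubric_ids and model_flag_ids in the same evals array shape as the model_ids example above. The UUID placeholders and the GLOW_SEND switch are hypothetical; the switch exists only so the body can be reviewed before anything is sent.

```shell
# Placeholder IDs; replace with real UUIDs from your instance.
EVAL_ID="EVAL_UUID"
RUBRIC_ID="DE_ESCALATION_RUBRIC_UUID"
FLAG_ID="INCORRECT_INFO_FLAG_UUID"

# Assumption: update takes model_rubric_ids and model_flag_ids alongside
# eval_id, mirroring the create and model_ids examples in this guide.
BODY=$(printf '{"evals": [{"eval_id": "%s", "model_rubric_ids": ["%s"], "model_flag_ids": ["%s"]}]}' \
  "$EVAL_ID" "$RUBRIC_ID" "$FLAG_ID")

# Always print the body so it can be reviewed.
printf '%s\n' "$BODY"

# GLOW_SEND is a hypothetical opt-in switch for this sketch, not a Glow setting.
if [ "${GLOW_SEND:-0}" = "1" ]; then
  glow evals update --body "$BODY"
fi
```

This guide does not say whether an update replaces the existing ID lists or appends to them, so it is worth confirming that in the API reference before running this against an active eval.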
Running an eval
Evals support several configuration flags that control how runs behave — whether the eval is active, dynamic (on-the-fly scoring as attempts complete), and whether results are bucketed into groups.
| Flag Section | Purpose |
|---|---|
| active_flags | Whether the eval is currently active and running |
| dynamic_flags | Whether the eval uses dynamic mode for on-the-fly scoring |
| groups_flags | Whether the eval organizes results into groups |
These flags appear as sections in the GET response, each containing the current flag config and available options. The list response also surfaces these as boolean fields:
| Field | Description |
|---|---|
| is_inactive | Whether the eval is currently inactive |
| is_dynamic | Whether the eval uses dynamic mode |
| use_groups | Whether the eval uses groups |
| num_runs | Total number of eval runs completed |
| num_groups | Number of eval groups |
The draft cycle
Evals support a draft workflow for staging changes before they affect scoring pipelines. This is important because modifying an active eval can change how future simulation runs are scored.
# Create or update an eval draft
glow evals draft --body '{
"name": "Fall 2025 TA Assessment v2",
"description": "Added de-escalation rubric and behavioral flags",
"model_ids": ["EVALUATOR_MODEL_UUID"],
"rubric_ids": ["COMM_SKILLS_UUID", "DE_ESCALATION_UUID"]
}'
# List your drafts
glow evals list
The draft endpoint uses PATCH semantics with expected_version for optimistic concurrency. The draft request supports model_ids, rubric_ids, flag_ids, and department_ids. The response includes draft_id, new_version, and a form_state with the resolved state of all fields.
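With optimistic concurrency, the client sends the version it last read, and a write based on a stale version is expected to be rejected rather than silently overwriting someone else's changes. The sketch below shows one way to carry that version in a draft request. It assumes expected_version is a top-level integer field of the request body; draft_body and GLOW_SEND are hypothetical names used only for this example, and how a rejection is reported is not covered in this guide.

```shell
# Hypothetical helper: build a draft body that carries the last-seen version.
# Assumption: expected_version is a top-level integer in the draft request.
# Arguments: name, expected_version, rubric_id
draft_body() {
  printf '{"name": "%s", "expected_version": %s, "rubric_ids": ["%s"]}' "$1" "$2" "$3"
}

# 3 stands in for the new_version returned by the previous draft response.
BODY=$(draft_body "Fall 2025 TA Assessment v2" 3 "DE_ESCALATION_UUID")
printf '%s\n' "$BODY"

# GLOW_SEND is a hypothetical opt-in switch for this sketch, not a Glow setting.
if [ "${GLOW_SEND:-0}" = "1" ]; then
  curl -X PATCH "$GLOW_INSTANCE_URL/eval/draft" \
    -H "Authorization: Bearer $GLOW_TOKEN" \
    -H "Content-Type: application/json" \
    -d "$BODY"
fi
```

After a successful call, the new_version in the response would become the expected_version for the next edit.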
Search & filter
glow evals search --body '{"search": "fall"}'
glow evals search --body '{
"filter_department_ids": ["CS_DEPT_UUID"],
"page_size": 25
}'
Bulk operations
Bulk delete and update follow the canonical all-matching shape: either explicit IDs, or all: true with flat filter fields, optional excluded_ids, and (for updates) a patch body. This shape was proven out on the persona surface first; evals follow it.
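Because a filter-based delete removes every eval that matches, it can help to preview the match set with the same filter first. This is a sketch under the assumption that search and bulk delete interpret filter_department_ids identically; the department ID is a placeholder and GLOW_SEND is a hypothetical switch used only in this example.

```shell
# Flat filter fields shared by the preview and the delete (placeholder ID).
FILTER='"filter_department_ids": ["dept-finished-semester"]'

# Reuse the same filter in both bodies so they cannot drift apart.
SEARCH_BODY="{${FILTER}, \"page_size\": 25}"
DELETE_BODY="{\"all\": true, ${FILTER}}"

printf 'preview: %s\n' "$SEARCH_BODY"
printf 'delete:  %s\n' "$DELETE_BODY"

# GLOW_SEND is a hypothetical opt-in switch for this sketch, not a Glow setting.
if [ "${GLOW_SEND:-0}" = "1" ]; then
  # Inspect the matches first.
  glow evals search --body "$SEARCH_BODY"
  # Uncomment once the preview looks right:
  # glow evals delete --body "$DELETE_BODY"
fi
```

Search results are paged (page_size is 25 here), so a single preview call may not show every eval the delete would remove.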
Delete by explicit IDs:
glow evals delete --body '{"eval_ids": ["eval-1", "eval-2"]}'
Delete all matching a filter (with exclusions):
glow evals delete --body '{
"all": true,
"filter_department_ids": ["dept-finished-semester"],
"excluded_ids": ["eval-still-running-reports"]
}'
Bulk update via patch:
glow evals update --body '{
"all": true,
"filter_department_ids": ["dept-finished-semester"],
"patch": { "is_inactive": true }
}'
Common Operations
| Task | CLI | API Endpoint |
|---|---|---|
| List all evals | glow evals search | POST /eval/search |
| Get eval details | glow evals get --body '{"eval_id": "..."}' | POST /eval/get |
| Create eval | glow evals create --body '{...}' | POST /eval/create |
| Update eval | glow evals update --body '{"eval_id": "...", ...}' | POST /eval/update |
| Duplicate eval | — | POST /eval/duplicate |
| Delete eval(s) | glow evals delete --body '{"eval_ids": ["..."]}' | POST /eval/delete |
| Bulk delete (filter) | glow evals delete --body '{"all": true, "filter_…": "…"}' | POST /eval/delete |
| Export to CSV | glow evals export | POST /eval/export |
| Stage a draft | glow evals draft --body '{...}' | PATCH /eval/draft |
| List drafts | glow evals list | POST /eval/drafts |
Related
- Evals API Reference — full endpoint and type documentation
- Evals CLI Reference — all CLI commands
- Rubrics Guide — define scoring criteria used by evals
- Models Guide — configure the evaluator models that power evals
- Providers Guide — set up LLM backends for evaluator models