Evals

Evals assess how well TAs handled student interactions during office hour simulations. An eval brings together models, rubrics, and configuration flags to run automated scoring across simulation transcripts, producing actionable feedback on TA performance.

What is an Eval?

An eval in Glow is an automated assessment pipeline that scores simulation conversations. Each eval combines:

Models — the LLM configurations used as evaluator judges (via model_ids)
Model Rubrics — the scoring criteria applied during evaluation (via model_rubric_ids)
Model Flags — behavioral flags that tag specific patterns in conversations (via model_flag_ids)
Model Positions — ordering of models in multi-judge setups (via model_position_ids)
Configuration Flags — control whether the eval is active, dynamic, or uses groups

For example, a university might run an eval called “Fall 2025 TA Assessment” that uses GPT-4o as the evaluator model, applies the “Communication Skills” and “Policy Knowledge” rubrics, and flags instances of “Incorrect Information” or “Inappropriate Tone”.

How it Connects


Eval
  |
  +-- Models (evaluator LLMs via model_ids)
  +-- Model Rubrics (scoring criteria via model_rubric_ids)
  +-- Model Flags (behavioral tags via model_flag_ids)
  +-- Model Positions (judge ordering via model_position_ids)
  +-- Departments (scoping via department_ids)

Models serve as the evaluator judges. The eval uses these models to read simulation transcripts and produce scores. These are separate from the models that powered the simulation — they are the models that grade it.
Rubrics connect through model_rubric_ids (the eval_model_rubrics_junction). Each rubric’s standard groups and standards define what the evaluator model scores.
Model Flags tag specific patterns in conversations (e.g., “Gave Incorrect Deadline”, “Used Empathetic Language”). These are distinct from the eval-level configuration flags.
Model Positions set the order of the evaluator models in multi-judge setups (via model_position_ids).
Departments scope the eval to specific organizational units, enabling department-level reporting.
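The relationships above can be pictured as the ID lists a single eval carries. The following is a minimal Python sketch, not Glow's own data model: EvalSpec and to_create_body are hypothetical helper names, and it assumes the create body accepts every list named in the diagram (the examples in this guide only show model_ids, model_rubric_ids, and department_ids).

```python
import json
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class EvalSpec:
    """Hypothetical client-side view of one eval and its attachments."""
    name: str
    description: str = ""
    model_ids: List[str] = field(default_factory=list)           # evaluator judges
    model_rubric_ids: List[str] = field(default_factory=list)    # scoring criteria
    model_flag_ids: List[str] = field(default_factory=list)      # behavioral tags
    model_position_ids: List[str] = field(default_factory=list)  # judge ordering
    department_ids: List[str] = field(default_factory=list)      # scoping

    def to_create_body(self) -> Dict[str, list]:
        """Wrap the spec in the {"evals": [...]} envelope, dropping empty fields."""
        entry = {key: value for key, value in vars(self).items() if value}
        return {"evals": [entry]}


spec = EvalSpec(
    name="Fall 2025 TA Assessment",
    model_ids=["GPT4O_MODEL_UUID"],
    model_rubric_ids=["COMM_SKILLS_RUBRIC_UUID", "POLICY_KNOWLEDGE_RUBRIC_UUID"],
)
print(json.dumps(spec.to_create_body(), indent=2))
```

Dropping empty lists keeps the body close to the hand-written examples in this guide, which only send the fields they set.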

Create an eval

Via the CLI

The calls below use $GLOW_INSTANCE_URL and $GLOW_TOKEN; see Authentication to export them once.


glow evals create --body '{
  "evals": [{
    "name": "Fall 2025 TA Assessment",
    "description": "Automated scoring of TA office hour simulations for the fall semester",
    "model_ids": ["GPT4O_MODEL_UUID"],
    "model_rubric_ids": ["COMM_SKILLS_RUBRIC_UUID", "POLICY_KNOWLEDGE_RUBRIC_UUID"]
  }]
}'

Via the API


curl -X POST "$GLOW_INSTANCE_URL/eval/create" \
  -H "Authorization: Bearer $GLOW_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "evals": [{
      "name": "Spring 2026 CS Department Eval",
      "description": "Evaluate TA performance across all CS course simulations",
      "department_ids": ["CS_DEPT_UUID"],
      "model_ids": ["EVALUATOR_MODEL_UUID"],
      "model_rubric_ids": ["DE_ESCALATION_RUBRIC_UUID"]
    }]
  }'
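For scripts that do not shell out to curl, the same request can be assembled with Python's standard library. This is a sketch under assumptions: build_create_request is a hypothetical helper, the fallback URL and token are placeholders, and only the endpoint path, headers, and body shape are taken from the curl call above. The request is built but not sent, so the snippet runs without a Glow instance.

```python
import json
import os
import urllib.request
from typing import Any, Dict, List


def build_create_request(
    base_url: str, token: str, evals: List[Dict[str, Any]]
) -> urllib.request.Request:
    """Build (but do not send) the POST /eval/create request."""
    payload = json.dumps({"evals": evals}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/eval/create",
        data=payload,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_create_request(
    # Placeholder fallbacks so the sketch runs even when the variables are unset.
    os.environ.get("GLOW_INSTANCE_URL") or "https://glow.example.edu",
    os.environ.get("GLOW_TOKEN") or "PLACEHOLDER_TOKEN",
    [{
        "name": "Spring 2026 CS Department Eval",
        "department_ids": ["CS_DEPT_UUID"],
        "model_ids": ["EVALUATOR_MODEL_UUID"],
        "model_rubric_ids": ["DE_ESCALATION_RUBRIC_UUID"],
    }],
)
print(request.get_method(), request.full_url)
# To send it, pass the request to urllib.request.urlopen(request).
```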

Wiring models, rubrics, and flags

Attaching Evaluator Models

Evals reference models through model_ids. These models act as judges — they read simulation transcripts and score TA performance against the attached rubrics. You can attach multiple models to compare scoring consistency across different LLMs.


glow evals update --body '{
  "evals": [{
    "eval_id": "EVAL_UUID",
    "model_ids": ["GPT4O_MODEL_UUID", "CLAUDE_MODEL_UUID"]
  }]
}'
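Once two judges have scored the same transcript, one simple consistency check is the per-standard spread between their scores. This guide does not describe the score output format, so the input shape below (judge ID mapped to standard name mapped to score) and the sample numbers are purely illustrative, and score_spread is a hypothetical helper.

```python
from typing import Dict


def score_spread(scores_by_judge: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Return max minus min score per standard, over standards every judge scored."""
    per_judge = list(scores_by_judge.values())
    if not per_judge:
        return {}
    # Only compare standards that all judges produced a score for.
    shared = set(per_judge[0]).intersection(*per_judge[1:])
    return {
        standard: max(s[standard] for s in per_judge) - min(s[standard] for s in per_judge)
        for standard in sorted(shared)
    }


# Illustrative scores only; real values would come from eval run results.
scores = {
    "GPT4O_MODEL_UUID": {"active_listening": 4, "empathy": 3, "clarity": 5},
    "CLAUDE_MODEL_UUID": {"active_listening": 4, "empathy": 5, "clarity": 4},
}
print(score_spread(scores))
```

A spread of zero means the judges agree on that standard; larger values point at standards whose rubric wording may need tightening.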

Attaching Rubrics

Use model_rubric_ids to attach rubrics that define the scoring criteria. Each rubric’s standard groups and standards become the dimensions on which the evaluator model scores the conversation.

For example, attaching both “Communication Skills” and “De-escalation” rubrics means every simulation transcript will be scored on active listening, empathy, clarity, conflict resolution, and other standards defined in those rubrics.

Behavioral Flags

Use model_flag_ids to define specific behaviors the evaluator should tag. Unlike rubric standards (which are scored on a point scale), flags are binary markers that call out notable patterns:

“Provided Incorrect Information”
“Successfully De-escalated Conflict”
“Referred Student to Appropriate Resource”
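Flags like the ones listed above could be attached with the same update envelope shown for evaluator models. A minimal sketch, assuming the update call accepts model_flag_ids alongside eval_id; build_flag_update and the flag UUID placeholders are hypothetical, and whether the list replaces or extends existing flags is not stated in this guide.

```python
import json
from typing import Dict, List


def build_flag_update(eval_id: str, flag_ids: List[str]) -> Dict[str, list]:
    """Build an update body that sets model_flag_ids on one eval."""
    if not flag_ids:
        raise ValueError("provide at least one flag ID")
    # De-duplicate while keeping the caller's order.
    unique_ids = list(dict.fromkeys(flag_ids))
    return {"evals": [{"eval_id": eval_id, "model_flag_ids": unique_ids}]}


body = build_flag_update(
    "EVAL_UUID",
    ["INCORRECT_INFO_FLAG_UUID", "DE_ESCALATED_FLAG_UUID", "INCORRECT_INFO_FLAG_UUID"],
)
print(json.dumps(body))
```

The printed JSON is what would be passed to glow evals update --body.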

Running an eval

Evals support several configuration flags that control how runs behave: whether the eval is active, whether it is dynamic (scoring on the fly as attempts complete), and whether results are bucketed into groups.

Flag Section	Purpose
`active_flags`	Whether the eval is currently active and running
`dynamic_flags`	Whether the eval uses dynamic mode for on-the-fly scoring
`groups_flags`	Whether the eval organizes results into groups

These flags appear as sections in the response from /eval/get, each containing the current flag configuration and the available options. The list response also surfaces them as summary fields (three booleans and two counts):

Field	Description
`is_inactive`	Whether the eval is currently inactive
`is_dynamic`	Whether the eval uses dynamic mode
`use_groups`	Whether the eval uses groups
`num_runs`	Total number of eval runs completed
`num_groups`	Number of eval groups
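The summary fields above are enough for a quick inventory of a list response. A small sketch, assuming each row is a dictionary carrying the fields from the table; the summarize helper and the sample rows are hypothetical.

```python
from typing import Any, Dict, List


def summarize(evals: List[Dict[str, Any]]) -> Dict[str, int]:
    """Count evals by state using the list-response summary fields."""
    active = [e for e in evals if not e.get("is_inactive", False)]
    return {
        "active": len(active),
        "inactive": len(evals) - len(active),
        "dynamic": sum(1 for e in active if e.get("is_dynamic", False)),
        "grouped": sum(1 for e in active if e.get("use_groups", False)),
        "total_runs": sum(e.get("num_runs", 0) for e in evals),
    }


# Illustrative rows only; real rows come from the list response.
rows = [
    {"is_inactive": False, "is_dynamic": True, "use_groups": False, "num_runs": 12, "num_groups": 0},
    {"is_inactive": False, "is_dynamic": False, "use_groups": True, "num_runs": 3, "num_groups": 2},
    {"is_inactive": True, "is_dynamic": True, "use_groups": False, "num_runs": 40, "num_groups": 0},
]
print(summarize(rows))
```

Note that is_inactive is the inverse of the "active" state, so the sketch negates it before counting.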

The draft cycle

Evals support a draft workflow for staging changes before they affect scoring pipelines. This is important because modifying an active eval can change how future simulation runs are scored.


# Create or update an eval draft
glow evals draft --body '{
  "name": "Fall 2025 TA Assessment v2",
  "description": "Added de-escalation rubric and behavioral flags",
  "model_ids": ["EVALUATOR_MODEL_UUID"],
  "rubric_ids": ["COMM_SKILLS_UUID", "DE_ESCALATION_UUID"]
}'
 
# List your drafts
glow evals list

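The draft endpoint uses PATCH semantics, so a request carries only the fields being changed, and it takes expected_version for optimistic concurrency. The draft request supports model_ids, rubric_ids, flag_ids, and department_ids; note that these draft field names differ from the model_rubric_ids and model_flag_ids used by create and update. The response includes draft_id, new_version, and a form_state with the resolved state of all fields.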
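The version handshake can be sketched in Python as follows. Everything beyond the field names expected_version, draft_id, new_version, and form_state is an assumption made for illustration: integer versions starting at 0, the RuntimeError raised on a stale version, and the FakeDraftEndpoint class, which is an in-memory stand-in and not Glow's implementation.

```python
from typing import Any, Callable, Dict

DraftSender = Callable[[Dict[str, Any]], Dict[str, Any]]


def patch_draft(send: DraftSender, changes: Dict[str, Any], expected_version: int) -> Dict[str, Any]:
    """Send only the changed fields plus the version the caller last saw."""
    return send({**changes, "expected_version": expected_version})


class FakeDraftEndpoint:
    """In-memory stand-in used only to make this sketch runnable."""

    def __init__(self) -> None:
        self.version = 0
        self.form_state: Dict[str, Any] = {}

    def __call__(self, body: Dict[str, Any]) -> Dict[str, Any]:
        fields = dict(body)
        if fields.pop("expected_version") != self.version:
            raise RuntimeError("stale expected_version; re-read the draft first")
        self.form_state.update(fields)  # PATCH semantics: untouched fields persist
        self.version += 1
        return {
            "draft_id": "DRAFT_UUID",
            "new_version": self.version,
            "form_state": dict(self.form_state),
        }


endpoint = FakeDraftEndpoint()
first = patch_draft(endpoint, {"name": "Fall 2025 TA Assessment v2"}, expected_version=0)
# Feed new_version from the previous response into the next request.
second = patch_draft(
    endpoint,
    {"rubric_ids": ["COMM_SKILLS_UUID", "DE_ESCALATION_UUID"]},
    expected_version=first["new_version"],
)
print(second["new_version"], sorted(second["form_state"]))
```

The point of the pattern is that a client holding an outdated version is rejected instead of silently overwriting someone else's edits; the client should re-read the draft and reapply its changes on top.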

Search & filter


glow evals search --body '{"search": "fall"}'
 
glow evals search --body '{
  "filter_department_ids": ["CS_DEPT_UUID"],
  "page_size": 25
}'
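When search bodies are generated by a script, it helps to omit unset fields instead of sending nulls. A minimal sketch using only the three fields shown above; build_search_body is a hypothetical helper, and combining a text search with a department filter in one body is an assumption this guide does not confirm.

```python
import json
from typing import Any, Dict, List, Optional


def build_search_body(
    search: Optional[str] = None,
    department_ids: Optional[List[str]] = None,
    page_size: Optional[int] = None,
) -> Dict[str, Any]:
    """Build a search body that contains only the fields the caller set."""
    body: Dict[str, Any] = {}
    if search:
        body["search"] = search
    if department_ids:
        body["filter_department_ids"] = list(department_ids)
    if page_size is not None:
        if page_size < 1:
            raise ValueError("page_size must be at least 1")
        body["page_size"] = page_size
    return body


print(json.dumps(build_search_body(search="fall")))
print(json.dumps(build_search_body(department_ids=["CS_DEPT_UUID"], page_size=25)))
```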

Bulk operations

Bulk delete and bulk update share one request shape: either a list of explicit IDs, or all: true with flat filter fields and an optional excluded_ids list. Bulk update additionally takes a patch body. The persona endpoints are the reference implementation of this shape; evals follow it unchanged.

Delete by explicit IDs:


glow evals delete --body '{"eval_ids": ["eval-1", "eval-2"]}'

Delete all matching a filter (with exclusions):


glow evals delete --body '{
  "all": true,
  "filter_department_ids": ["dept-finished-semester"],
  "excluded_ids": ["eval-still-running-reports"]
}'

Bulk update via patch:


glow evals update --body '{
  "all": true,
  "filter_department_ids": ["dept-finished-semester"],
  "patch": { "is_inactive": true }
}'
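Because an all-matching delete is destructive, a script may want to validate the body before sending it. The sketch below encodes the two modes described above; build_bulk_body is a hypothetical helper, and two of its rules are this sketch's own safety choices, not documented Glow behavior: it refuses all: true with no filters, and it rejects excluded_ids in explicit-ID mode.

```python
import json
from typing import Any, Dict, List, Optional


def build_bulk_body(
    eval_ids: Optional[List[str]] = None,
    filters: Optional[Dict[str, Any]] = None,
    excluded_ids: Optional[List[str]] = None,
    patch: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Build a bulk delete/update body in explicit-ID or all-matching mode."""
    if eval_ids and filters is not None:
        raise ValueError("use explicit IDs or all-matching filters, not both")
    if eval_ids:
        if excluded_ids:
            raise ValueError("excluded_ids only applies with all: true")
        body: Dict[str, Any] = {"eval_ids": list(eval_ids)}
    elif filters:
        body = {"all": True, **filters}  # filter fields stay flat, not nested
        if excluded_ids:
            body["excluded_ids"] = list(excluded_ids)
    else:
        raise ValueError("provide eval_ids or at least one filter field")
    if patch is not None:
        body["patch"] = dict(patch)  # only meaningful for bulk update
    return body


delete_body = build_bulk_body(
    filters={"filter_department_ids": ["dept-finished-semester"]},
    excluded_ids=["eval-still-running-reports"],
)
update_body = build_bulk_body(
    filters={"filter_department_ids": ["dept-finished-semester"]},
    patch={"is_inactive": True},
)
print(json.dumps(delete_body))
print(json.dumps(update_body))
```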

Common Operations

Task	CLI	API Endpoint
List all evals	`glow evals search`	`POST /eval/search`
Get eval details	`glow evals get --body '{"eval_id": "..."}'`	`POST /eval/get`
Create eval	`glow evals create --body '{...}'`	`POST /eval/create`
Update eval	`glow evals update --body '{"evals": [{"eval_id": "...", ...}]}'`	`POST /eval/update`
Duplicate eval	—	`POST /eval/duplicate`
Delete eval(s)	`glow evals delete --body '{"eval_ids": ["..."]}'`	`POST /eval/delete`
Bulk delete (filter)	`glow evals delete --body '{"all": true, "filter_…": "…"}'`	`POST /eval/delete`
Export to CSV	`glow evals export`	`POST /eval/export`
Stage a draft	`glow evals draft --body '{...}'`	`PATCH /eval/draft`
List drafts	`glow evals list`	`POST /eval/drafts`

Evals API Reference — full endpoint and type documentation
Evals CLI Reference — all CLI commands
Rubrics Guide — define scoring criteria used by evals
Models Guide — configure the evaluator models that power evals
Providers Guide — set up LLM backends for evaluator models