Evals

Evals assess how well TAs handled student interactions during office hour simulations. An eval brings together models, rubrics, and configuration flags to run automated scoring across simulation transcripts, producing actionable feedback on TA performance.


What is an Eval?

An eval in Glow is an automated assessment pipeline that scores simulation conversations. Each eval combines:

  • Models — the LLM configurations used as evaluator judges (via model_ids)
  • Model Rubrics — the scoring criteria applied during evaluation (via model_rubric_ids)
  • Model Flags — behavioral flags that tag specific patterns in conversations (via model_flag_ids)
  • Model Positions — ordering of models in multi-judge setups (via model_position_ids)
  • Configuration Flags — control whether the eval is active, dynamic, or uses groups

For example, a university might run an eval called “Fall 2025 TA Assessment” that uses GPT-4o as the evaluator model, applies the “Communication Skills” and “Policy Knowledge” rubrics, and flags instances of “Incorrect Information” or “Inappropriate Tone”.

How it Connects

Eval
|
+-- Models (evaluator LLMs via model_ids)
+-- Model Rubrics (scoring criteria via model_rubric_ids)
+-- Model Flags (behavioral tags via model_flag_ids)
+-- Model Positions (judge ordering via model_position_ids)
+-- Departments (scoping via department_ids)

  • Models serve as the evaluator judges. The eval uses these models to read simulation transcripts and produce scores. These are separate from the models that powered the simulation — they are the models that grade it.
  • Rubrics connect through model_rubric_ids (the eval_model_rubrics_junction). Each rubric’s standard groups and standards define what the evaluator model scores.
  • Model Flags tag specific patterns in conversations (e.g., “Gave Incorrect Deadline”, “Used Empathetic Language”). These are distinct from the eval-level configuration flags.
  • Departments scope the eval to specific organizational units, enabling department-level reporting.

Create an eval

Via the CLI

The calls below use $GLOW_INSTANCE_URL and $GLOW_TOKEN; see Authentication to export them once.

glow evals create --body '{
  "evals": [{
    "name": "Fall 2025 TA Assessment",
    "description": "Automated scoring of TA office hour simulations for the fall semester",
    "model_ids": ["GPT4O_MODEL_UUID"],
    "model_rubric_ids": ["COMM_SKILLS_RUBRIC_UUID", "POLICY_KNOWLEDGE_RUBRIC_UUID"]
  }]
}'

Via the API

curl -X POST "$GLOW_INSTANCE_URL/eval/create" \
  -H "Authorization: Bearer $GLOW_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "evals": [{
      "name": "Spring 2026 CS Department Eval",
      "description": "Evaluate TA performance across all CS course simulations",
      "department_ids": ["CS_DEPT_UUID"],
      "model_ids": ["EVALUATOR_MODEL_UUID"],
      "model_rubric_ids": ["DE_ESCALATION_RUBRIC_UUID"]
    }]
  }'
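The same create call can be assembled from Python. The sketch below only builds the request object and does not send it. The helper name build_create_request and the placeholder URL and token are illustrative assumptions; the /eval/create path, the Bearer header, and the evals body shape are taken from the curl example above.

```python
import json
import urllib.request


def build_create_request(instance_url: str, token: str, evals: list) -> urllib.request.Request:
    """Build (but do not send) a POST /eval/create request."""
    body = json.dumps({"evals": evals}).encode("utf-8")
    return urllib.request.Request(
        url=instance_url.rstrip("/") + "/eval/create",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


if __name__ == "__main__":
    req = build_create_request(
        "https://glow.example.edu",  # placeholder for $GLOW_INSTANCE_URL
        "TOKEN",                     # placeholder for $GLOW_TOKEN
        [{
            "name": "Spring 2026 CS Department Eval",
            "department_ids": ["CS_DEPT_UUID"],
            "model_ids": ["EVALUATOR_MODEL_UUID"],
            "model_rubric_ids": ["DE_ESCALATION_RUBRIC_UUID"],
        }],
    )
    print(req.get_method(), req.full_url)
    # To actually send it: urllib.request.urlopen(req)
```

Keeping request construction separate from sending makes the payload easy to inspect or log before anything reaches the instance.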

Wiring models, rubrics, and flags

Attaching Evaluator Models

Evals reference models through model_ids. These models act as judges — they read simulation transcripts and score TA performance against the attached rubrics. You can attach multiple models to compare scoring consistency across different LLMs.

glow evals update --body '{
  "evals": [{
    "eval_id": "EVAL_UUID",
    "model_ids": ["GPT4O_MODEL_UUID", "CLAUDE_MODEL_UUID"]
  }]
}'

Attaching Rubrics

Use model_rubric_ids to attach rubrics that define the scoring criteria. Each rubric’s standard groups and standards become the dimensions on which the evaluator model scores the conversation.

For example, attaching both “Communication Skills” and “De-escalation” rubrics means every simulation transcript will be scored on active listening, empathy, clarity, conflict resolution, and other standards defined in those rubrics.

Behavioral Flags

Use model_flag_ids to define specific behaviors the evaluator should tag. Unlike rubric standards (which are scored on a point scale), flags are binary markers that call out notable patterns:

  • “Provided Incorrect Information”
  • “Successfully De-escalated Conflict”
  • “Referred Student to Appropriate Resource”
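As a minimal sketch, the three kinds of attachment might be combined into one update body as follows. The field names model_ids, model_rubric_ids, and model_flag_ids and the evals wrapper come from this page. The helper build_update_body and the flag UUID are hypothetical, and whether the server leaves omitted fields unchanged is an assumption, not documented behavior.

```python
import json


def build_update_body(eval_id, model_ids=None, model_rubric_ids=None, model_flag_ids=None):
    """Return a body for `glow evals update`, including only the fields given."""
    entry = {"eval_id": eval_id}
    optional = {
        "model_ids": model_ids,
        "model_rubric_ids": model_rubric_ids,
        "model_flag_ids": model_flag_ids,
    }
    for field, ids in optional.items():
        if ids is not None:
            entry[field] = list(ids)
    return {"evals": [entry]}


body = build_update_body(
    "EVAL_UUID",
    model_ids=["GPT4O_MODEL_UUID", "CLAUDE_MODEL_UUID"],
    model_flag_ids=["INCORRECT_INFO_FLAG_UUID"],  # placeholder ID
)
print(json.dumps(body, indent=2))
```

The printed JSON can be passed to glow evals update --body.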

Running an eval

Evals support several configuration flags that control how runs behave: whether the eval is active, whether it is dynamic (scoring on the fly as attempts complete), and whether results are bucketed into groups.

| Flag Section | Purpose |
| --- | --- |
| active_flags | Whether the eval is currently active and running |
| dynamic_flags | Whether the eval uses dynamic mode for on-the-fly scoring |
| groups_flags | Whether the eval organizes results into groups |

These flags appear as sections in the response from glow evals get (POST /eval/get), each containing the current flag config and the available options. The list response also surfaces them as boolean fields, together with run and group counts:

| Field | Description |
| --- | --- |
| is_inactive | Whether the eval is currently inactive |
| is_dynamic | Whether the eval uses dynamic mode |
| use_groups | Whether the eval uses groups |
| num_runs | Total number of eval runs completed |
| num_groups | Number of eval groups |
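To illustrate how the list fields might be consumed, the sketch below picks out evals that are both active and dynamic. The flag and count field names are the ones in the table above. The sample rows, the name field, and the helper active_dynamic_evals are invented for illustration, and the real list response may wrap rows in an envelope not shown here.

```python
def active_dynamic_evals(rows):
    """Return names of evals that are active (not is_inactive) and dynamic."""
    return [
        row["name"]
        for row in rows
        if not row["is_inactive"] and row["is_dynamic"]
    ]


# Invented sample rows using the documented list fields.
rows = [
    {"name": "Fall 2025 TA Assessment", "is_inactive": False, "is_dynamic": True,
     "use_groups": False, "num_runs": 12, "num_groups": 0},
    {"name": "Spring 2025 Pilot", "is_inactive": True, "is_dynamic": True,
     "use_groups": True, "num_runs": 40, "num_groups": 3},
    {"name": "Spring 2026 CS Department Eval", "is_inactive": False, "is_dynamic": False,
     "use_groups": True, "num_runs": 0, "num_groups": 2},
]

print(active_dynamic_evals(rows))
```

Note the inverted polarity: the flag section is named active_flags, but the list field is is_inactive, so an active eval has is_inactive set to false.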

The draft cycle

Evals support a draft workflow for staging changes before they affect scoring pipelines. This is important because modifying an active eval can change how future simulation runs are scored.

# Create or update an eval draft
glow evals draft --body '{
  "name": "Fall 2025 TA Assessment v2",
  "description": "Added de-escalation rubric and behavioral flags",
  "model_ids": ["EVALUATOR_MODEL_UUID"],
  "rubric_ids": ["COMM_SKILLS_UUID", "DE_ESCALATION_UUID"]
}'

# List your drafts
glow evals list

The draft endpoint uses PATCH semantics with expected_version for optimistic concurrency. The draft request supports model_ids, rubric_ids, flag_ids, and department_ids. The response includes draft_id, new_version, and a form_state with the resolved state of all fields.
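Optimistic concurrency with expected_version generally works like this: the client sends the version it last saw, and the write is rejected if the stored version has since moved on. The sketch below simulates that pattern with an in-memory store. It is not Glow's implementation: the DraftStore class, the VersionConflict exception, and the patch_with_retry helper are hypothetical, and how the real endpoint reports a conflict is not covered on this page.

```python
class VersionConflict(Exception):
    """Raised when expected_version does not match the stored version."""


class DraftStore:
    """In-memory stand-in for a draft endpoint with PATCH semantics."""

    def __init__(self):
        self.version = 1
        self.form_state = {"name": "Fall 2025 TA Assessment", "rubric_ids": []}

    def patch(self, expected_version, changes):
        if expected_version != self.version:
            raise VersionConflict(f"expected {expected_version}, found {self.version}")
        self.form_state.update(changes)  # PATCH: only the supplied fields change
        self.version += 1
        return {"new_version": self.version, "form_state": dict(self.form_state)}


def patch_with_retry(store, changes, known_version, max_attempts=3):
    """Retry a patch, refreshing the known version after each conflict."""
    for _ in range(max_attempts):
        try:
            return store.patch(known_version, changes)
        except VersionConflict:
            known_version = store.version  # re-read the current version
    raise VersionConflict("gave up after repeated conflicts")


store = DraftStore()
store.patch(1, {"name": "Fall 2025 TA Assessment v2"})  # another editor: version 1 -> 2
result = patch_with_retry(store, {"rubric_ids": ["COMM_SKILLS_UUID"]}, known_version=1)
print(result["new_version"], result["form_state"])
```

The first attempt fails because the client still holds version 1, and the retry succeeds against version 2. A real client would typically re-read form_state and reconcile the other editor's changes before retrying, rather than resubmitting blindly.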


Search & filter

glow evals search --body '{"search": "fall"}'

glow evals search --body '{
  "filter_department_ids": ["CS_DEPT_UUID"],
  "page_size": 25
}'

Bulk operations

Bulk delete and update follow the canonical all-matching shape: pass explicit IDs, or pass all: true with flat filter fields and optional excluded_ids. Bulk update additionally takes a patch body. This shape was first proven out on the persona surface, and evals follow it.

Delete by explicit IDs:

glow evals delete --body '{"eval_ids": ["eval-1", "eval-2"]}'

Delete all matching a filter (with exclusions):

glow evals delete --body '{
  "all": true,
  "filter_department_ids": ["dept-finished-semester"],
  "excluded_ids": ["eval-still-running-reports"]
}'

Bulk update via patch:

glow evals update --body '{
  "all": true,
  "filter_department_ids": ["dept-finished-semester"],
  "patch": { "is_inactive": true }
}'
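The all-matching shape can be read as a selection rule: explicit IDs are used as given; otherwise all: true selects everything that matches the filters, minus excluded_ids. The sketch below models that rule over in-memory records. It is an interpretation for illustration only: the function resolve_targets is hypothetical, and the assumptions that filter_department_ids matches any eval sharing at least one department, and that an empty filter matches everything, are not confirmed by this page.

```python
def resolve_targets(evals, body):
    """Return the eval IDs a bulk delete/update body would select."""
    if "eval_ids" in body:
        return list(body["eval_ids"])  # explicit IDs: no filtering
    if not body.get("all"):
        return []  # neither explicit IDs nor all: true -> nothing selected
    wanted = set(body.get("filter_department_ids", []))
    excluded = set(body.get("excluded_ids", []))
    selected = []
    for ev in evals:
        # Assumption: an empty filter matches every eval.
        matches = not wanted or wanted & set(ev["department_ids"])
        if matches and ev["eval_id"] not in excluded:
            selected.append(ev["eval_id"])
    return selected


# Invented records for illustration.
evals = [
    {"eval_id": "eval-1", "department_ids": ["dept-finished-semester"]},
    {"eval_id": "eval-still-running-reports", "department_ids": ["dept-finished-semester"]},
    {"eval_id": "eval-3", "department_ids": ["dept-other"]},
]
body = {
    "all": True,
    "filter_department_ids": ["dept-finished-semester"],
    "excluded_ids": ["eval-still-running-reports"],
}
print(resolve_targets(evals, body))
```

Previewing the selection this way, for example by running the same filters through glow evals search first, is a cheap safeguard before a bulk delete.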

Common Operations

| Task | CLI | API Endpoint |
| --- | --- | --- |
| List all evals | glow evals search | POST /eval/search |
| Get eval details | glow evals get --body '{"eval_id": "..."}' | POST /eval/get |
| Create eval | glow evals create --body '{...}' | POST /eval/create |
| Update eval | glow evals update --body '{"eval_id": "...", ...}' | POST /eval/update |
| Duplicate eval | | POST /eval/duplicate |
| Delete eval(s) | glow evals delete --body '{"eval_id": "..."}' | POST /eval/delete |
| Bulk delete (filter) | glow evals delete --body '{"all": true, "filter_…": "…"}' | POST /eval/delete |
| Export to CSV | glow evals export | POST /eval/export |
| Stage a draft | glow evals draft --body '{...}' | PATCH /eval/draft |
| List drafts | glow evals list | POST /eval/drafts |
