Voice pipeline

Glow’s voice pipeline lets a learner hold a spoken conversation with a persona (audio in, audio out) by placing the existing text generation loop between server-side STT and TTS stages. It is not a separate “voice model” or a custom realtime runtime: the same attempt/generate call that drives text chat is reused, with an audio stage added at either end.

End-to-end flow

The server-side loop is intentionally a chain of small primitives — each step persists what it produced, so any step can be replayed or inspected independently of the rest.


mic / file
   │  multipart POST  /attempt/audio_upload
   │  (or socket.io frame: attempt.chat_speak — PCM16 into the
   │   conversation's inbound_queue, no DB write)
   ▼
uploads_entry  →  audios_resource  →  audios_entry
   │  (resource handle: audios_id — canonical plural form,
   │   matches images_id / videos_id)
   ▼
/attempt/generate  with  audios_id  +  modalities=["text"]
   │
   ├──► STT executor       (app/infra/generation/stt.py)
   │       AsyncOpenAI → /v1/audio/transcriptions
   │       emits attempt.generate.text.complete with the transcript
   │
   ├──► LLM completion     (app/infra/generation/execute.py — text)
   │       same path as keyboard chat
   │
   └──► TTS executor       (app/infra/generation/tts.py)
           litellm.aspeech → audio bytes
           media_upload_impl writes  uploads_entry + audios_resource
           emits attempt.generate.audio.complete with the new audios_id
                       │
                       ▼
   Attempt_Chat_Message tool call receives audios_id in
   produced_media and forwards it onto the message row
                       │
                       ▼
   Client plays via  POST /attempt/audio_download  (range-supported)

Every audio asset — inbound mic capture, intermediate STT input, TTS output — lives behind the same audios_id resource handle described in Media, so the rest of the system (audit, search, message attachments) does not need to know “voice” is in play.
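
The chain-of-primitives idea can be sketched as a toy in Python. Everything in it is hypothetical: the `Store` class, the row keys, and the injected `stt` / `llm` / `tts` callables are illustrative stand-ins, not Glow's executors or tables. The only point is that each stage writes its output before the next stage reads it, so a later stage can be replayed from the persisted value.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class Store:
    """Hypothetical stand-in for the rows each step persists."""
    rows: Dict[str, object] = field(default_factory=dict)


def run_voice_turn(
    store: Store,
    audio_in: bytes,
    stt: Callable[[bytes], str],
    llm: Callable[[str], str],
    tts: Callable[[str], bytes],
) -> bytes:
    # Each stage persists its result before the next stage runs.
    transcript = stt(audio_in)
    store.rows["transcript"] = transcript

    reply_text = llm(transcript)
    store.rows["reply_text"] = reply_text

    reply_audio = tts(reply_text)
    store.rows["reply_audio"] = reply_audio
    return reply_audio


def replay_tts(store: Store, tts: Callable[[str], bytes]) -> bytes:
    # Re-run only the last stage from the persisted reply text.
    reply_audio = tts(str(store.rows["reply_text"]))
    store.rows["reply_audio"] = reply_audio
    return reply_audio
```

Because the stages are passed in as callables, the same skeleton can be exercised with stubs in a test and with real executors in production.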

Audio upload

POST /attempt/audio_upload is the one entry point for any audio the server needs to reason about. From the route (core/app/routes/attempt/audio_upload.py) it accepts three shapes:

Shape	Inputs	Use
File only	multipart `file`	Browser `MediaRecorder` blob, CLI file path
`upload_id` only	`?upload_id=<uuid>`	Promote bytes already written by the realtime adapter
`upload_id` + file	both	Client pre-reserved a slot via `/attempt/audio/new` and is filling it
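
As a rough illustration of how the three shapes can be told apart, the sketch below classifies a request from its two inputs. The function name and the returned labels are made up for this example, and raising an error when neither input is present is an assumption; the source does not say how the route handles that case.

```python
from typing import Optional


def classify_upload(has_file: bool, upload_id: Optional[str]) -> str:
    """Map request inputs to one of the three shapes in the table above."""
    if has_file and upload_id:
        # Client pre-reserved a slot and is now filling it.
        return "fill_reserved_slot"
    if has_file:
        # Plain multipart upload (MediaRecorder blob, CLI file path).
        return "file_only"
    if upload_id:
        # Bytes were already written by the realtime adapter.
        return "promote_existing"
    # Assumption: a request with neither input is rejected.
    raise ValueError("expected a multipart file, an upload_id, or both")
```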

Allowed MIME types (parameters stripped before the check, so audio/webm;codecs=opus from MediaRecorder normalizes to audio/webm):


audio/mpeg  audio/mp3  audio/wav  audio/ogg  audio/webm
audio/flac  audio/aac  audio/x-m4a  audio/mp4  audio/x-wav
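
The parameter-stripping check can be sketched in a few lines of Python. The helper names are hypothetical, and lowercasing the type is an assumption added here (MIME types are case-insensitive); the allowlist itself is copied from the list above.

```python
ALLOWED_AUDIO_TYPES = frozenset({
    "audio/mpeg", "audio/mp3", "audio/wav", "audio/ogg", "audio/webm",
    "audio/flac", "audio/aac", "audio/x-m4a", "audio/mp4", "audio/x-wav",
})


def normalize_mime(content_type: str) -> str:
    """Drop parameters such as ';codecs=opus' and normalize case."""
    return content_type.split(";", 1)[0].strip().lower()


def is_allowed_audio(content_type: str) -> bool:
    """Check the bare media type against the allowlist."""
    return normalize_mime(content_type) in ALLOWED_AUDIO_TYPES
```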

The response is always {audio_id, audios_id, upload_id}. Clients only need audios_id — that is the canonical resource handle that gets passed to /attempt/generate and ends up on the message row.

STT

The STT executor lives at core/app/infra/generation/stt.py and is dispatched as a modality inside execute_generation (the same engine that runs text completions and image generation).

Implementation notes, taken verbatim from the source:

The provider is OpenAI-compatible via AsyncOpenAI pointed at whatever base_url the artifact’s resolved LLM provider record exposes. The default deployment routes through the LiteLLM proxy to gpt-4o-transcribe, but any /v1/audio/transcriptions endpoint works.
The OpenAI SDK requires /v1 on base_url, so the executor appends it when missing. This does not hardcode our proxy: every OpenAI-compatible STT endpoint (direct OpenAI, Azure, vLLM, etc.) exposes the same path.
litellm.atranscription is intentionally not used because litellm rewrites response_format to verbose_json, which the gpt-4o-transcribe family rejects.
Input is an audios_id; the executor resolves down through the canonical join (audios_resource → audios_audios_connection → audios_entry → audio_uploads_entry → uploads_entry) to the on-disk path before opening the file.
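
A minimal sketch of the /v1 handling and the transcription call, assuming the file path has already been resolved from the audios_id. The function names `ensure_v1` and `transcribe` are hypothetical and this is not the code in stt.py; it only uses the public `AsyncOpenAI` client and its `audio.transcriptions.create` method.

```python
def ensure_v1(base_url: str) -> str:
    """Append /v1 to an OpenAI-compatible base URL when it is missing."""
    trimmed = base_url.rstrip("/")
    return trimmed if trimmed.endswith("/v1") else trimmed + "/v1"


async def transcribe(
    path: str,
    base_url: str,
    api_key: str,
    model: str = "gpt-4o-transcribe",
) -> str:
    # Imported lazily so the URL helper works without the openai package.
    from openai import AsyncOpenAI

    client = AsyncOpenAI(base_url=ensure_v1(base_url), api_key=api_key)
    with open(path, "rb") as audio_file:
        # response_format is left at the SDK default rather than verbose_json.
        result = await client.audio.transcriptions.create(
            model=model,
            file=audio_file,
        )
    return result.text
```

Trailing slashes are trimmed before the check so that `http://host/v1/` and `http://host/v1` resolve to the same base URL.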

On success the transcript is emitted as attempt.generate.text.complete — exactly the same event keyboard chat produces — so downstream consumers do not branch on “was this typed or spoken?”.

Voice session control

Three small operations sit on top of the audio loop to give the client a session-style API:

Op	Route	WS event	What it does
`chat_voice`	`POST /attempt/chat_voice`	—	Opens an `attempt_conversation` row for a chat and returns `{chat_id, attempt_id, conversation_id, group_id}`. Does not trigger AI generation — the client calls `/attempt/generate` separately with `modalities` + `conversation_id`.
`chat_speak`	`POST /attempt/chat_speak`	`attempt.chat_speak`	Pure data primitive. Pushes base64-decoded PCM16 bytes into the session’s `inbound_queue`. No DB, no AI. Keyed on `conversation_id` (or `chat_id` to resolve it).
`chat_silence`	`POST /attempt/chat_silence`	—	Records `attempt_conversation_completion`, runs `cleanup_audio_session`, and emits `attempt.generate.audio.session_complete` + `attempt.chat.voice_ended` (which closes the mic UI). Silence on a non-existent session is a no-op `{stopped: false}`.

There is no server-side VAD. The client decides when to stop speaking and posts chat_silence. Frames in chat_speak are not buffered to disk — they live in the conversation’s in-memory inbound_queue until the realtime adapter consumes them.

TTS

The TTS executor lives at core/app/infra/generation/tts.py. It takes the last user-role message off the dispatch as input text, calls litellm.aspeech with the persona’s voice, and runs the synthesized bytes through the same media_upload_impl pipeline used by image/video generation.

Voice selection comes from llm_config.voice (defaulting to "alloy"), which in turn comes from the persona’s voice_id.
The model name is prefixed with openai/ when routing through a custom proxy with api_base set — without the prefix, litellm treats a name like glow-audio as an unknown provider and errors locally before any network hop. Same shape as the text-completion path in execute.py.
The MP3 bytes are persisted as uploads_entry → audios_resource → audios_entry, and the new audios_id is bubbled onto ArtifactGenerateResponse.produced_media so the LLM’s Attempt_Chat_Message tool call can attach it to the message row.
The media upload is made with attribute_to_run=False: the tool-call audit layer writes the run ↔ message ↔ upload junction itself, and double attribution would create duplicate assistant messages.
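
The voice default and the openai/ prefix rule can be sketched as a small helper that builds the keyword arguments for `litellm.aspeech`. The helper name and the check for an already-prefixed model are assumptions made for this example; the actual executor may arrange this differently.

```python
from typing import Optional

DEFAULT_VOICE = "alloy"


def speech_kwargs(
    model: str,
    text: str,
    voice: Optional[str] = None,
    api_base: Optional[str] = None,
) -> dict:
    """Build keyword arguments for a litellm.aspeech call."""
    # Behind a custom proxy, litellm needs a provider prefix to route the
    # call; a bare name like "glow-audio" would fail before any network hop.
    if api_base and not model.startswith("openai/"):
        model = f"openai/{model}"
    kwargs = {"model": model, "input": text, "voice": voice or DEFAULT_VOICE}
    if api_base:
        kwargs["api_base"] = api_base
    return kwargs
```

A caller would then run something like `await litellm.aspeech(**speech_kwargs("glow-audio", text, api_base=proxy_url))`, where `proxy_url` is a placeholder for the deployment's proxy address.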

Audio download

POST /attempt/audio_download (route at core/app/routes/attempt/audio_download.py) takes {audio_id} and streams the bytes via create_range_response, so an HTML <audio> element can scrub / resume. The route runs through the same run_artifact_operation_with_audit wrapper as other attempt ops, so playback is auditable per group.
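
To show what range support involves, here is a simplified sketch of parsing a single-range `Range` header and building the matching `Content-Range` value. It is not the implementation of create_range_response: the function names are hypothetical, multi-range requests are reduced to their first range, and an unsatisfiable range returns None here instead of producing a 416 response.

```python
from typing import Optional, Tuple


def parse_range(header: Optional[str], size: int) -> Optional[Tuple[int, int]]:
    """Return an inclusive (start, end) byte span, or None for the full body."""
    if not header or not header.startswith("bytes=") or size <= 0:
        return None
    spec = header[len("bytes="):].split(",", 1)[0].strip()
    start_s, _, end_s = spec.partition("-")
    try:
        if start_s == "":
            # Suffix form "bytes=-N": the last N bytes of the file.
            length = int(end_s)
            if length <= 0:
                return None
            start, end = max(size - length, 0), size - 1
        else:
            start = int(start_s)
            end = int(end_s) if end_s else size - 1
    except ValueError:
        return None
    end = min(end, size - 1)  # clamp to the last byte
    if start > end:
        return None
    return start, end


def content_range(start: int, end: int, size: int) -> str:
    """Format the Content-Range header value for a 206 response."""
    return f"bytes {start}-{end}/{size}"
```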

For the resource-handle ladder (audio_id vs. audios_id vs. raw upload_id) and the range-streaming machinery, see Media.

Current state

The voice pipeline is wired end-to-end, but the canonical event hub is only partially integrated.

Server canonical-path gap. Per the in-progress canonical-path rework, voice events currently skip the wrap_emit_with_stream_bridge dual-emit that artifact operations flow through. The legacy attempt.generate.text.complete / attempt.generate.audio.complete events still fire and clients still receive them, but voice has not yet been wired into the same selective-rename + stream-bridge machinery that powers the artifact ops. Mapping entries are pending; until they land, voice events live on a parallel path rather than the canonical hub.

The pipeline is driven entirely through the HTTP / socket.io API (audio_upload + chat_voice / chat_speak / chat_silence), so any client can run it. There is no dedicated CLI mic-capture command — the glow attempts chat voice|speak|silence commands are thin calls to those same endpoints.

The canonical-path gap doesn’t block the loop; it’s wiring cleanup, not a missing feature.

Chat — where voice surfaces in the learner UX (and the CLI deferral note)
Media — the audios_id resource ladder, the audio_upload_id message pattern, and range-streamed download
Generation — the text generate call that the voice loop wraps with STT in front and TTS behind