Voice pipeline
Glow’s voice pipeline lets a learner hold a spoken conversation with
a persona — audio in, audio out — by stitching the existing text
generation loop between server-side STT and TTS
stages. It is not a separate “voice model” or a custom realtime
runtime: the same attempt/generate call that drives text chat is
reused, with audio modalities tacked on either end.
End-to-end flow
The server-side loop is intentionally a chain of small primitives — each step persists what it produced, so any step can be replayed or inspected independently of the rest.
mic / file
   │  multipart POST /attempt/audio_upload
   │  (or socket.io frame: attempt.chat_speak, PCM16 into the
   │   conversation's inbound_queue, no DB write)
   ▼
uploads_entry → audios_resource → audios_entry
   │  (resource handle: audios_id, the canonical plural form,
   │   matches images_id / videos_id)
   ▼
/attempt/generate with audios_id + modalities=["text"]
   │
   ├──► STT executor (app/infra/generation/stt.py)
   │       AsyncOpenAI → /v1/audio/transcriptions
   │       emits attempt.generate.text.complete with the transcript
   │
   ├──► LLM completion (app/infra/generation/execute.py, text)
   │       same path as keyboard chat
   │
   └──► TTS executor (app/infra/generation/tts.py)
           litellm.aspeech → audio bytes
           media_upload_impl writes uploads_entry + audios_resource
           emits attempt.generate.audio.complete with the new audios_id
   │
   ▼
Attempt_Chat_Message tool call receives audios_id in
produced_media and forwards it onto the message row
   │
   ▼
Client plays via POST /attempt/audio_download (range-supported)

Every audio asset (inbound mic capture, intermediate STT input, TTS
output) lives behind the same audios_id resource handle described
in Media, so the rest of the system (audit, search, message
attachments) does not need to know “voice” is in play.
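The chain above can be modelled as a toy pipeline. This is a minimal sketch with stub stages, not the real executors: the VoiceTurn class, the stage function names, and the fake ids are all hypothetical. Only the two event names and the idea of forwarding the new audios_id through produced_media come from the flow described above.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class VoiceTurn:
    """Hypothetical record of what each stage produced during one voice turn."""
    audios_id: str
    events: list = field(default_factory=list)
    produced_media: list = field(default_factory=list)


async def stt_stage(turn: VoiceTurn) -> str:
    # Stub: a real STT executor would transcribe the audio behind audios_id.
    transcript = f"<transcript of {turn.audios_id}>"
    turn.events.append(("attempt.generate.text.complete", transcript))
    return transcript


async def llm_stage(transcript: str) -> str:
    # Stub: the text completion, same path as keyboard chat.
    return f"reply to: {transcript}"


async def tts_stage(turn: VoiceTurn, reply: str) -> str:
    # Stub: a real TTS executor would synthesize audio and upload it.
    new_audios_id = f"tts-for-{turn.audios_id}"
    turn.events.append(("attempt.generate.audio.complete", new_audios_id))
    turn.produced_media.append(new_audios_id)
    return new_audios_id


async def run_voice_turn(audios_id: str) -> VoiceTurn:
    """Run STT, then the LLM, then TTS, recording each stage's output."""
    turn = VoiceTurn(audios_id=audios_id)
    transcript = await stt_stage(turn)
    reply = await llm_stage(transcript)
    await tts_stage(turn, reply)
    return turn
```

Calling asyncio.run(run_voice_turn("a1")) yields a turn whose events list holds the text event followed by the audio event, which is the order a client would observe.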
Audio upload
POST /attempt/audio_upload is the one entry point for any audio
the server needs to reason about. From the route (core/app/routes/attempt/audio_upload.py)
it accepts three shapes:
| Shape | Inputs | Use |
|---|---|---|
| File only | multipart file | Browser MediaRecorder blob, CLI file path |
| upload_id only | ?upload_id=<uuid> | Promote bytes already written by the realtime adapter |
| upload_id + file | both | Client pre-reserved a slot via /attempt/audio/new and is filling it |
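The three shapes in the table reduce to a small decision on which inputs are present. The sketch below is illustrative only: classify_upload is a hypothetical helper, and rejecting a request that carries neither input is an assumption, since the table does not say what the route does in that case.

```python
from typing import Optional


def classify_upload(has_file: bool, upload_id: Optional[str]) -> str:
    """Map the request inputs onto one of the three accepted shapes."""
    if has_file and upload_id:
        return "upload_id + file"   # fill a pre-reserved slot
    if has_file:
        return "file only"          # fresh upload
    if upload_id:
        return "upload_id only"     # promote bytes already on the server
    # Assumption: a request with neither input is rejected.
    raise ValueError("expected a file, an upload_id, or both")
```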
Allowed MIME types (parameters stripped before the check, so
audio/webm;codecs=opus from MediaRecorder normalizes to audio/webm):
audio/mpeg audio/mp3 audio/wav audio/ogg audio/webm
audio/flac audio/aac audio/x-m4a audio/mp4 audio/x-wav

The response is always {audio_id, audios_id, upload_id}. Clients
only need audios_id: that is the canonical resource handle that
gets passed to /attempt/generate and ends up on the message row.
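The MIME check described in this section (strip parameters, then compare against the allow-list) can be sketched as follows. The function names are hypothetical, and lowercasing the media type is an assumption; the allow-list itself is copied from the list above.

```python
ALLOWED_AUDIO_TYPES = frozenset({
    "audio/mpeg", "audio/mp3", "audio/wav", "audio/ogg", "audio/webm",
    "audio/flac", "audio/aac", "audio/x-m4a", "audio/mp4", "audio/x-wav",
})


def normalize_mime(content_type: str) -> str:
    """Drop parameters such as ';codecs=opus' and lowercase the media type."""
    # Assumption: comparison is case-insensitive, so we lowercase here.
    return content_type.split(";", 1)[0].strip().lower()


def is_allowed_audio(content_type: str) -> bool:
    """True when the parameter-free media type is on the allow-list."""
    return normalize_mime(content_type) in ALLOWED_AUDIO_TYPES
```

With this shape, a MediaRecorder blob tagged audio/webm;codecs=opus passes the check, while a video/mp4 upload does not.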
STT
The STT executor lives at core/app/infra/generation/stt.py and is
dispatched as a modality inside execute_generation (the same engine
that runs text completions and image generation).
Implementation notes, taken verbatim from the source:
- The provider is OpenAI-compatible via AsyncOpenAI pointed at whatever base_url the artifact’s resolved LLM provider record exposes. The default deployment routes through the LiteLLM proxy to gpt-4o-transcribe, but any /v1/audio/transcriptions endpoint works.
- The SDK boundary requires /v1 on base_url; the executor appends it when missing. This is not hardcoding our proxy: every OpenAI-compatible STT endpoint (direct OpenAI, Azure, vLLM, etc.) exposes the same path.
- litellm.atranscription is intentionally not used because the SDK rewrites response_format to verbose_json, which the gpt-4o-transcribe family rejects.
- Input is an audios_id; the executor resolves down through the canonical join (audios_resource → audios_audios_connection → audios_entry → audio_uploads_entry → uploads_entry) to the on-disk path before opening the file.
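The /v1 rule and the transcription call from the notes above can be sketched as follows. This is not the contents of stt.py: ensure_v1 and transcribe are hypothetical names, and the sketch assumes the on-disk path has already been resolved from the audios_id. The AsyncOpenAI client and its audio.transcriptions.create method are the standard OpenAI Python SDK calls.

```python
def ensure_v1(base_url: str) -> str:
    """Append /v1 to base_url unless it is already the last path segment."""
    trimmed = base_url.rstrip("/")
    return trimmed if trimmed.endswith("/v1") else trimmed + "/v1"


async def transcribe(path: str, base_url: str, api_key: str,
                     model: str = "gpt-4o-transcribe") -> str:
    """Send one audio file to an OpenAI-compatible transcription endpoint."""
    # Imported lazily so ensure_v1 can be used without the openai package.
    from openai import AsyncOpenAI

    client = AsyncOpenAI(base_url=ensure_v1(base_url), api_key=api_key)
    with open(path, "rb") as audio_file:
        result = await client.audio.transcriptions.create(
            model=model, file=audio_file
        )
    return result.text
```

Normalizing the URL in one helper keeps the same code path valid for a proxy URL without the suffix and for a direct endpoint that already includes it.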
On success the transcript is emitted as attempt.generate.text.complete
— exactly the same event keyboard chat produces — so downstream
consumers do not branch on “was this typed or spoken?”.
Voice session control
Three small operations sit on top of the audio loop to give the client a session-style API:
| Op | Route | WS event | What it does |
|---|---|---|---|
| chat_voice | POST /attempt/chat_voice | - | Opens an attempt_conversation row for a chat and returns {chat_id, attempt_id, conversation_id, group_id}. Does not trigger AI generation; the client calls /attempt/generate separately with modalities + conversation_id. |
| chat_speak | POST /attempt/chat_speak | attempt.chat_speak | Pure data primitive. Pushes base64-decoded PCM16 bytes into the session’s inbound_queue. No DB, no AI. Keyed on conversation_id (or chat_id to resolve it). |
| chat_silence | POST /attempt/chat_silence | - | Records attempt_conversation_completion, runs cleanup_audio_session, and emits attempt.generate.audio.session_complete + attempt.chat.voice_ended (which closes the mic UI). Silence on a non-existent session is a no-op {stopped: false}. |
There is no server-side VAD. The client decides when to stop
speaking and posts chat_silence. Frames in chat_speak are not
buffered to disk — they live in the conversation’s in-memory
inbound_queue until the realtime adapter consumes them.
TTS
The TTS executor lives at core/app/infra/generation/tts.py. It
takes the last user-role message off the dispatch as input text,
calls litellm.aspeech with the persona’s voice, and runs the
synthesized bytes through the same media_upload_impl pipeline used
by image/video generation.
- Voice selection comes from llm_config.voice (defaulting to "alloy"), which in turn comes from the persona’s voice_id.
- The model name is prefixed with openai/ when routing through a custom proxy with api_base set. Without the prefix, litellm treats a name like glow-audio as an unknown provider and errors locally before any network hop. Same shape as the text-completion path in execute.py.
- The MP3 bytes are persisted as uploads_entry → audios_resource → audios_entry, and the new audios_id is bubbled onto ArtifactGenerateResponse.produced_media so the LLM’s Attempt_Chat_Message tool call can attach it to the message row.
- attribute_to_run=False on the media upload: the tool-call audit layer writes the run ↔ message ↔ upload junction itself; double attribution would create duplicate assistant messages.
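The voice default and the openai/ prefix rule from the notes above can be sketched as a small argument builder. The helper names are hypothetical and this is not the code in tts.py; the resulting dictionary uses the model, input, voice, and api_base keyword names that litellm.aspeech accepts, but the call itself is left out so the sketch has no network dependency.

```python
from typing import Optional

DEFAULT_VOICE = "alloy"


def resolve_tts_model(model: str, api_base: Optional[str]) -> str:
    """Prefix the model with 'openai/' when a custom api_base is in use."""
    if api_base and not model.startswith("openai/"):
        return f"openai/{model}"
    return model


def build_speech_kwargs(text: str, model: str,
                        voice: Optional[str] = None,
                        api_base: Optional[str] = None) -> dict:
    """Assemble keyword arguments for a speech-synthesis call."""
    kwargs = {
        "model": resolve_tts_model(model, api_base),
        "input": text,
        "voice": voice or DEFAULT_VOICE,  # fall back when no persona voice
    }
    if api_base:
        kwargs["api_base"] = api_base
    return kwargs
```

For a proxy deployment, build_speech_kwargs("hello", "glow-audio", api_base="http://proxy:4000") produces the model name openai/glow-audio, which avoids the local unknown-provider error.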
Audio download
POST /attempt/audio_download (route at core/app/routes/attempt/audio_download.py)
takes {audio_id} and streams the bytes via
create_range_response, so an HTML <audio> element can scrub /
resume. The route runs through the same run_artifact_operation_with_audit
wrapper as other attempt ops, so playback is auditable per group.
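Range-based streaming is what lets a player scrub or resume. The sketch below shows how a single-range Range header maps to a byte span and a Content-Range value under standard HTTP semantics. It does not reproduce create_range_response, whose internals are not shown here; parse_range and content_range are hypothetical helpers.

```python
from typing import Optional, Tuple


def parse_range(header: Optional[str], size: int) -> Tuple[int, int]:
    """Return the inclusive (start, end) byte span for a single-range request."""
    if not header or not header.startswith("bytes="):
        return 0, size - 1                       # no range: whole file
    first, _, last = header[len("bytes="):].partition("-")
    if first == "":                              # suffix form: bytes=-N
        start, end = max(size - int(last), 0), size - 1
    else:
        start = int(first)
        end = min(int(last), size - 1) if last else size - 1
    if start > end or start >= size:
        raise ValueError("range not satisfiable")  # would map to HTTP 416
    return start, end


def content_range(start: int, end: int, size: int) -> str:
    """Format the Content-Range header for a 206 response."""
    return f"bytes {start}-{end}/{size}"
```

A resume request such as bytes=500- on a 1000-byte file resolves to the span (500, 999), and the response would carry Content-Range: bytes 500-999/1000.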
For the resource-handle ladder (audio_id vs. audios_id vs. raw
upload_id) and the range-streaming machinery, see Media.
Current state
The voice pipeline is wired end-to-end, but the canonical event hub is only partially integrated.
- Server canonical-path gap. Per the in-progress canonical-path rework, voice events currently skip the wrap_emit_with_stream_bridge dual-emit that artifact operations flow through. The legacy attempt.generate.text.complete / attempt.generate.audio.complete events still fire and clients still receive them, but voice has not yet been wired into the same selective-rename + stream-bridge machinery that powers the artifact ops. Mapping entries are pending; until they land, voice events live on a parallel path rather than the canonical hub.
The pipeline is driven entirely through the HTTP / socket.io API
(audio_upload + chat_voice / chat_speak / chat_silence), so any
client can run it. There is no dedicated CLI mic-capture command — the
glow attempts chat voice|speak|silence commands are thin calls to
those same endpoints.
The canonical-path gap doesn’t block the loop; it is wiring-cleanup work, not a missing feature.
Related
- Chat: where voice surfaces in the learner UX (and the CLI deferral note)
- Media: the audios_id resource ladder, the audio_upload_id message pattern, and range-streamed download
- Generation: the text generate call that the voice loop wraps with STT in front and TTS behind