Voice pipeline

Glow’s voice pipeline lets a learner hold a spoken conversation with a persona — audio in, audio out — by stitching the existing text generation loop between server-side STT and TTS stages. It is not a separate “voice model” or a custom realtime runtime: the same attempt/generate call that drives text chat is reused, with audio modalities tacked on either end.

End-to-end flow

The server-side loop is intentionally a chain of small primitives — each step persists what it produced, so any step can be replayed or inspected independently of the rest.

mic / file
   │  multipart POST /attempt/audio_upload
   │  (or socket.io frame: attempt.chat_speak, PCM16 into the
   │   conversation's inbound_queue, no DB write)
uploads_entry → audios_resource → audios_entry
   │  (resource handle: audios_id, canonical plural form,
   │   matches images_id / videos_id)
/attempt/generate with audios_id + modalities=["text"]
   ├──► STT executor (app/infra/generation/stt.py)
   │      AsyncOpenAI → /v1/audio/transcriptions
   │      emits attempt.generate.text.complete with the transcript
   ├──► LLM completion (app/infra/generation/execute.py, text)
   │      same path as keyboard chat
   └──► TTS executor (app/infra/generation/tts.py)
          litellm.aspeech → audio bytes
          media_upload_impl writes uploads_entry + audios_resource
          emits attempt.generate.audio.complete with the new audios_id
Attempt_Chat_Message tool call receives audios_id in produced_media
   and forwards it onto the message row
Client plays via POST /attempt/audio_download (range-supported)

Every audio asset — inbound mic capture, intermediate STT input, TTS output — lives behind the same audios_id resource handle described in Media, so the rest of the system (audit, search, message attachments) does not need to know “voice” is in play.


Audio upload

POST /attempt/audio_upload is the one entry point for any audio the server needs to reason about. From the route (core/app/routes/attempt/audio_upload.py) it accepts three shapes:

| Shape | Inputs | Use |
| --- | --- | --- |
| File only | multipart file | Browser MediaRecorder blob, CLI file path |
| upload_id only | ?upload_id=<uuid> | Promote bytes already written by the realtime adapter |
| upload_id + file | both | Client pre-reserved a slot via /attempt/audio/new and is filling it |
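
The three request shapes can be told apart from two facts: whether a file part is present and whether an upload_id was supplied. The sketch below is illustrative only; the function name, the returned labels, and the rejection of an empty request are assumptions, not code from the route.

```python
from typing import Optional


def classify_upload(has_file: bool, upload_id: Optional[str]) -> str:
    """Return which of the three documented request shapes applies.

    The labels are hypothetical names chosen for this sketch.
    """
    if has_file and upload_id:
        return "fill_reserved_slot"  # upload_id + file
    if has_file:
        return "file_only"           # multipart file only
    if upload_id:
        return "promote_existing"    # upload_id only
    # Assumption: a request carrying neither input is rejected.
    raise ValueError("expected a file, an upload_id, or both")
```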

Allowed MIME types (parameters stripped before the check, so audio/webm;codecs=opus from MediaRecorder normalizes to audio/webm):

audio/mpeg audio/mp3 audio/wav audio/ogg audio/webm audio/flac audio/aac audio/x-m4a audio/mp4 audio/x-wav
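
A minimal sketch of the parameter-stripping check described above. The allowlist is copied from this page; the helper names and the lowercasing step are assumptions made for illustration.

```python
ALLOWED_AUDIO_TYPES = frozenset({
    "audio/mpeg", "audio/mp3", "audio/wav", "audio/ogg", "audio/webm",
    "audio/flac", "audio/aac", "audio/x-m4a", "audio/mp4", "audio/x-wav",
})


def normalize_mime(content_type: str) -> str:
    """Drop parameters such as ';codecs=opus' from a Content-Type value.

    Lowercasing is an assumption of this sketch, not documented behavior.
    """
    return content_type.split(";", 1)[0].strip().lower()


def is_allowed_audio(content_type: str) -> bool:
    """Check the normalized media type against the allowlist."""
    return normalize_mime(content_type) in ALLOWED_AUDIO_TYPES
```

For example, normalize_mime("audio/webm;codecs=opus") yields "audio/webm", which passes the check, while "video/mp4" does not.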

The response is always {audio_id, audios_id, upload_id}. Clients only need audios_id — that is the canonical resource handle that gets passed to /attempt/generate and ends up on the message row.
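
To illustrate that audios_id is the only field a client has to carry forward, the sketch below builds a request body for /attempt/generate from an upload response. The body layout and the helper name are assumptions; only the field names audios_id, modalities, and conversation_id come from this page.

```python
from typing import Any, Dict, List, Optional


def build_generate_request(
    upload_response: Dict[str, Any],
    modalities: List[str],
    conversation_id: Optional[str] = None,
) -> Dict[str, Any]:
    """Build a hypothetical /attempt/generate body from an upload response."""
    body: Dict[str, Any] = {
        # audio_id and upload_id in the response are ignored on purpose.
        "audios_id": upload_response["audios_id"],
        "modalities": list(modalities),
    }
    if conversation_id is not None:
        body["conversation_id"] = conversation_id
    return body
```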


STT

The STT executor lives at core/app/infra/generation/stt.py and is dispatched as a modality inside execute_generation (the same engine that runs text completions and image generation).

Implementation notes, taken verbatim from the source:

  • The provider is OpenAI-compatible via AsyncOpenAI pointed at whatever base_url the artifact’s resolved LLM provider record exposes. The default deployment routes through the LiteLLM proxy to gpt-4o-transcribe, but any /v1/audio/transcriptions endpoint works.
  • The SDK boundary requires /v1 on base_url; the executor appends it when missing. This is not hardcoding our proxy — every OpenAI-compatible STT endpoint (direct OpenAI, Azure, vLLM, etc.) exposes the same path.
  • litellm.atranscription is intentionally not used because the SDK rewrites response_format to verbose_json, which the gpt-4o-transcribe family rejects.
  • Input is an audios_id; the executor resolves down through the canonical join (audios_resource → audios_audios_connection → audios_entry → audio_uploads_entry → uploads_entry) to the on-disk path before opening the file.
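
The /v1 handling and the direct SDK call in the notes above can be sketched as follows. This is not the executor's code: ensure_v1 and transcribe are hypothetical names, and the sketch assumes the openai Python SDK is installed and that the file path has already been resolved from the audios_id.

```python
def ensure_v1(base_url: str) -> str:
    """Append /v1 when the base URL does not already end with it."""
    trimmed = base_url.rstrip("/")
    return trimmed if trimmed.endswith("/v1") else trimmed + "/v1"


async def transcribe(path: str, base_url: str, api_key: str, model: str) -> str:
    """Send one audio file to an OpenAI-compatible transcription endpoint."""
    # Imported lazily so ensure_v1 stays usable without the SDK installed.
    from openai import AsyncOpenAI

    client = AsyncOpenAI(base_url=ensure_v1(base_url), api_key=api_key)
    with open(path, "rb") as audio_file:
        result = await client.audio.transcriptions.create(
            model=model,
            file=audio_file,
            # Plain json rather than verbose_json, per the note above.
            response_format="json",
        )
    return result.text
```

Calling the SDK directly keeps response_format under the caller's control, which is the reason the notes give for avoiding litellm.atranscription.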

On success the transcript is emitted as attempt.generate.text.complete — exactly the same event keyboard chat produces — so downstream consumers do not branch on “was this typed or spoken?”.


Voice session control

Three small operations sit on top of the audio loop to give the client a session-style API:

| Op | Route | WS event | What it does |
| --- | --- | --- | --- |
| chat_voice | POST /attempt/chat_voice | | Opens an attempt_conversation row for a chat and returns {chat_id, attempt_id, conversation_id, group_id}. Does not trigger AI generation; the client calls /attempt/generate separately with modalities + conversation_id. |
| chat_speak | POST /attempt/chat_speak | attempt.chat_speak | Pure data primitive. Pushes base64-decoded PCM16 bytes into the session’s inbound_queue. No DB, no AI. Keyed on conversation_id (or chat_id to resolve it). |
| chat_silence | POST /attempt/chat_silence | | Records attempt_conversation_completion, runs cleanup_audio_session, and emits attempt.generate.audio.session_complete + attempt.chat.voice_ended (which closes the mic UI). Silence on a non-existent session is a no-op {stopped: false}. |

There is no server-side VAD. The client decides when to stop speaking and posts chat_silence. Frames in chat_speak are not buffered to disk — they live in the conversation’s in-memory inbound_queue until the realtime adapter consumes them.


TTS

The TTS executor lives at core/app/infra/generation/tts.py. It takes the last user-role message off the dispatch as input text, calls litellm.aspeech with the persona’s voice, and runs the synthesized bytes through the same media_upload_impl pipeline used by image/video generation.

  • Voice selection comes from llm_config.voice (defaulting to "alloy"), which in turn comes from the persona’s voice_id.
  • The model name is prefixed with openai/ when routing through a custom proxy with api_base set — without the prefix, litellm treats a name like glow-audio as an unknown provider and errors locally before any network hop. Same shape as the text-completion path in execute.py.
  • The MP3 bytes are persisted as uploads_entry → audios_resource → audios_entry, and the new audios_id is bubbled onto ArtifactGenerateResponse.produced_media so the LLM’s Attempt_Chat_Message tool call can attach it to the message row.
  • The media upload is called with attribute_to_run=False: the tool-call audit layer writes the run ↔ message ↔ upload junction itself, and double attribution would create duplicate assistant messages.
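
The voice default and the openai/ prefix rule from the notes above can be expressed as two small helpers. Both function names are hypothetical and the sketch only mirrors the rules as stated here; it is not the executor's code.

```python
from typing import Optional


def resolve_voice(voice: Optional[str]) -> str:
    """Fall back to the documented default voice when none is configured."""
    return voice or "alloy"


def resolve_tts_model(model: str, api_base: Optional[str]) -> str:
    """Prefix the model with 'openai/' when a custom api_base is set.

    Without the prefix, a proxy-side alias such as 'glow-audio' is treated
    as an unknown provider before any request is sent.
    """
    if api_base and not model.startswith("openai/"):
        return f"openai/{model}"
    return model
```

With a proxy configured, resolve_tts_model("glow-audio", "http://proxy:4000") returns "openai/glow-audio"; with no api_base the name passes through unchanged.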

Audio download

POST /attempt/audio_download (route at core/app/routes/attempt/audio_download.py) takes {audio_id} and streams the bytes via create_range_response, so an HTML <audio> element can scrub / resume. The route runs through the same run_artifact_operation_with_audit wrapper as other attempt ops, so playback is auditable per group.
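
Scrubbing and resuming rely on standard HTTP byte ranges. The sketch below shows how a single-range Range header maps to an inclusive byte span and a Content-Range value. It is a generic illustration, not the implementation of create_range_response; both function names are hypothetical.

```python
from typing import Optional, Tuple


def parse_range(header: Optional[str], size: int) -> Optional[Tuple[int, int]]:
    """Parse a single-range 'bytes=' header into an inclusive (start, end).

    Returns None when there is no usable range. Simplification: a real
    server would answer an unsatisfiable range with status 416.
    """
    if not header or not header.startswith("bytes=") or size <= 0:
        return None
    spec = header[len("bytes="):].strip()
    if "," in spec or "-" not in spec:
        return None  # multi-range requests are out of scope for this sketch
    first, last = spec.split("-", 1)
    try:
        if first == "":
            # Suffix form 'bytes=-N': the final N bytes of the file.
            length = int(last)
            if length <= 0:
                return None
            return max(size - length, 0), size - 1
        start = int(first)
        end = int(last) if last else size - 1
    except ValueError:
        return None
    if start >= size or end < start:
        return None
    return start, min(end, size - 1)


def content_range(start: int, end: int, size: int) -> str:
    """Format the Content-Range header value for a 206 response."""
    return f"bytes {start}-{end}/{size}"
```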

For the resource-handle ladder (audio_id vs. audios_id vs. raw upload_id) and the range-streaming machinery, see Media.


Current state

The voice pipeline is wired end-to-end, but its integration with the canonical event hub is only partial.

  • Server canonical-path gap. Per the in-progress canonical-path rework, voice events currently skip the wrap_emit_with_stream_bridge dual-emit that artifact operations flow through. The legacy attempt.generate.text.complete / attempt.generate.audio.complete events still fire and clients still receive them, but voice has not yet been wired into the same selective-rename + stream-bridge machinery that powers the artifact ops. Mapping entries are pending; until they land, voice events live on a parallel path rather than the canonical hub.

The pipeline is driven entirely through the HTTP / socket.io API (audio_upload + chat_voice / chat_speak / chat_silence), so any client can run it. There is no dedicated CLI mic-capture command — the glow attempts chat voice|speak|silence commands are thin calls to those same endpoints.

The canonical-path gap does not block the loop: it is wiring cleanup, not a missing feature.


  • Chat — where voice surfaces in the learner UX (and the CLI deferral note)
  • Media — the audios_id resource ladder, the audio_upload_id message pattern, and range-streamed download
  • Generation — the text generate call that the voice loop wraps with STT in front and TTS behind