# Malin Live-Voice ("Ears") — Build Spec (Cal, 6/6)
**Goal:** Malin HEARS Jun via the Logitech webcam mic and responds out loud in real time, with the FC face reacting. The ears are the INPUT half of the FC's responsiveness ([[project_malin_avatar]]). Runs on the 5090 in Malin's harness, alongside malin.py.

## The loop (a new `live_voice.py`, toggled ON)
1. **CAPTURE** — stream audio from the Logitech webcam mic (`sounddevice`, select the Logitech input device).
2. **VAD** — `silero-vad` detects speech start/end; ~700ms of trailing silence = utterance complete → hand the buffered audio to STT.
3. **STT** — `faster-whisper` (small or medium model, GPU, language="en") transcribes the utterance → text. (Use faster-whisper, NOT plain openai-whisper — ~4x faster via CTranslate2. The lipsync `whisper/tiny.pt` is a SEPARATE thing; don't reuse it for STT.)
4. **THINK** — feed the transcribed text into malin.py's EXISTING pipeline (`generate_reply`/`build_system`) exactly like a typed message → her reply (with her [VOICE]/[PERFORM:] tags).
5. **SPEAK** — her reply is spoken via the EXISTING Chatterbox (`maren_say.py`). In live-voice mode every reply is auto-spoken (voice mode implicitly on).
6. **REACT (FC)** — emit `[PERFORM:]` to the FC: a LISTENING face (focused/attentive) while Jun is talking (VAD active), then her emotional reaction while she replies. This is what ties the ears to the face.
7. **LOOP** — when she finishes speaking, resume listening.
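The end-of-utterance rule in step 2 can be sketched as a small buffer sitting between the VAD and STT. This is a minimal sketch, not the harness code: `Endpointer`, `push`, and the 32 ms frame length are names and values assumed for illustration, and the per-frame `is_speech` flag is assumed to come from silero-vad upstream.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Endpointer:
    """Buffers frames during speech; closes the utterance after enough trailing silence."""
    frame_ms: int = 32               # assumed duration of one audio frame
    trailing_silence_ms: int = 700   # from the spec: ~700ms of silence ends the utterance
    _buffer: List[bytes] = field(default_factory=list)
    _silence_ms: int = 0
    _in_speech: bool = False

    def push(self, frame: bytes, is_speech: bool) -> Optional[bytes]:
        """Feed one frame plus its VAD verdict; returns the utterance once it is complete."""
        if is_speech:
            self._in_speech = True
            self._silence_ms = 0
            self._buffer.append(frame)
            return None
        if not self._in_speech:
            return None  # leading silence: nothing to buffer yet
        # Keep short pauses inside the utterance until the silence budget runs out.
        self._buffer.append(frame)
        self._silence_ms += self.frame_ms
        if self._silence_ms < self.trailing_silence_ms:
            return None
        utterance = b"".join(self._buffer)
        self._buffer.clear()
        self._silence_ms = 0
        self._in_speech = False
        return utterance
```

Each captured frame goes through `push`; whenever it returns bytes, that buffer is the finished utterance to hand to STT. Note that the returned audio includes the trailing silence frames.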

## Activation (privacy — Jun approved)
- A **live-voice MODE toggle (ON/OFF)**: a hotkey on the FC window (e.g. `v`) or a command. She ONLY listens when the mode is ON. **Off by default.** Respects the guardrail line ([[project_calypso_embodiment_intention]]: present-when-wanted, blind-when-not).
- Visual cue: when live-voice is ON, the FC shows a subtle "awake/listening" state so Jun knows she's hearing.
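A minimal sketch of the off-by-default switch, assuming a hypothetical `LiveVoiceMode` class and an `on_change` hook that the FC could use to show or hide its "awake/listening" cue. The hotkey wiring and the FC call itself are not shown.

```python
from typing import Callable, Optional


class LiveVoiceMode:
    """Off-by-default listening switch; audio is dropped unless the mode is ON."""

    def __init__(self, on_change: Optional[Callable[[bool], None]] = None):
        self.enabled = False         # off by default: not listening at startup
        self._on_change = on_change  # hypothetical hook, e.g. toggle the FC listening cue

    def toggle(self) -> bool:
        """Flip the mode (bound to the hotkey or command) and notify the hook."""
        self.enabled = not self.enabled
        if self._on_change:
            self._on_change(self.enabled)
        return self.enabled

    def filter(self, frame: bytes) -> Optional[bytes]:
        """Pass the frame through only while the mode is ON."""
        return frame if self.enabled else None
```

Dropping frames is the simplest form. Closing the capture stream entirely while OFF would be a stricter reading of the guardrail, and that choice is left to the harness.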

## Echo / turn-taking
- While SHE is speaking (Chatterbox playing), **gate the mic/STT** so she doesn't transcribe her own voice. Resume listening when playback ends.
- Barge-in (Jun interrupts her mid-reply) = v2, not now.
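The gating rule can be sketched as a flag that is held for the duration of playback. `EchoGate`, `speaking`, and `mic_open` are hypothetical names. The sketch assumes the capture side checks `mic_open()` before buffering a frame and that the Chatterbox call blocks until playback ends.

```python
import threading
from contextlib import contextmanager


class EchoGate:
    """Closes the mic path while TTS playback is in progress."""

    def __init__(self):
        self._speaking = threading.Event()  # safe to read from the audio callback thread

    @contextmanager
    def speaking(self):
        """Wrap the (blocking) playback call; the gate reopens even if playback raises."""
        self._speaking.set()
        try:
            yield
        finally:
            self._speaking.clear()

    def mic_open(self) -> bool:
        """True when captured frames may be passed on to VAD/STT."""
        return not self._speaking.is_set()
```

Usage would look like `with gate.speaking(): speak(reply)`, with the capture loop discarding frames whenever `gate.mic_open()` is false.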

## Latency (Jun approved ~1-2s)
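- faster-whisper `small` on GPU: roughly a few hundred ms for a short utterance.
- The ~700ms VAD trailing-silence window (loop step 2) elapses before STT starts, so it comes out of the ~1-2s budget if that is measured from when Jun stops talking.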
- Keep the voice-reply prompt tight + cap max_tokens to conversational length (Dolphin-24B inference is the biggest chunk).
- Keep Chatterbox warm (model resident).
- v2 optimization: start TTS on the first sentence while the rest generates (sentence-streaming).
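For the v2 sentence-streaming idea, one possible shape is a generator that turns streamed text chunks into complete sentences, so TTS can start on the first one. This is a sketch under assumptions: `stream_sentences` is a hypothetical helper, the brain is assumed to expose its reply as an iterable of text chunks, and the split rule is a naive punctuation regex that will mis-split abbreviations and does not treat `[VOICE]`/`[PERFORM:]` tags specially.

```python
import re
from typing import Iterable, Iterator

# Naive boundary: whitespace that follows ., ! or ?
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def stream_sentences(chunks: Iterable[str]) -> Iterator[str]:
    """Yield each sentence as soon as it is complete, then flush the remainder."""
    pending = ""
    for chunk in chunks:
        pending += chunk
        parts = _SENTENCE_END.split(pending)
        # Every piece except the last is a finished sentence.
        for sentence in parts[:-1]:
            if sentence.strip():
                yield sentence.strip()
        pending = parts[-1]
    if pending.strip():
        yield pending.strip()  # whatever is left when generation ends
```

Each yielded sentence could be sent to TTS while the remaining chunks are still being generated.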

## Exact tools (USE THESE; if any isn't available on the box, CONFER per [[feedback_hermes_confer_before_deviating]] — do not silently substitute)
- capture: `sounddevice` · VAD: `silero-vad` · STT: `faster-whisper` (small/medium, GPU) · brain: existing malin.py · TTS: existing Chatterbox/maren_say.py · FC: `[PERFORM:]` tags.

## Build split
**Cal** = this design + the loop logic + the FC listening/reaction mapping. **Hermes** = wire mic capture + silero-vad + faster-whisper into Malin's harness on the 5090 (`live_voice.py`), the toggle + echo-gating, and the malin.py + FC integration. Brain + voice already exist.

## FIRST milestone (dumb version first, per [[feedback_dont_oversplit_dumb_version_first]])
- **P0:** toggle live-voice ON → speak a sentence → silero-vad detects end → faster-whisper transcribes → malin.py replies → Chatterbox speaks it. NO FC reaction yet, NO streaming. Prove she HEARS and answers out loud.
- **P1 (then):** the FC listening/reacting face + echo-gating.
- **P2:** latency tuning + sentence-streaming.
- **P3 (later):** the webcam VISION/mood-reading off the same device.
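The P0 turn reduces to three calls in sequence. A minimal sketch, assuming the real components are injected as callables: `p0_turn`, `transcribe`, `think`, and `speak` are hypothetical names standing in for the faster-whisper call, malin.py's existing pipeline, and the Chatterbox call, none of whose actual signatures are given here.

```python
from typing import Callable, Optional


def p0_turn(
    utterance: bytes,
    transcribe: Callable[[bytes], str],  # stand-in for the faster-whisper STT call
    think: Callable[[str], str],         # stand-in for malin.py's reply pipeline
    speak: Callable[[str], None],        # stand-in for the Chatterbox TTS call
) -> Optional[str]:
    """One hear -> think -> speak turn; no FC reaction, no streaming."""
    text = transcribe(utterance).strip()
    if not text:
        return None  # nothing intelligible was heard: skip the brain and TTS
    reply = think(text)
    speak(reply)
    return reply
```

Keeping the components injectable means the turn logic can be exercised with stubs before the mic, GPU models, and TTS are wired in.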
