# Malin MIL Live-Test — Hit List (Jun's notes, 6/7)

Collecting Jun's test observations silently; compile → dispatch to Hermes after he's done. (Don't reply per-note.)

| # | observation | diagnosis (Cal) | proposed fix |
|---|-------------|-----------------|--------------|
| 1 | Told her about his ex contacting him (heavy emotional content) → she got STUCK looping the **sad** (or comparable) mood-MIL for ~**3 min**; status went **AMBER** during, kept looping, then **self-recovered** ("back to normal"). | MIL loop-during-speech isn't TERMINATING — the loop has no/failed exit. Likely the `speech_end` event didn't fire or wasn't received (long generation? blocked?), so the mood-MIL looped indefinitely. AMBER = supervisor flagged the prolonged/blocked state (heartbeat or stuck phase). | Add a HARD exit to the MIL loop: bound it to actual speech duration + a max-loop timeout (e.g. cap N loops / X seconds) that forces offset→neutral even if `speech_end` is missed. Ensure `speech_end` reliably fires at TTS-playback-end. Investigate why AMBER (heartbeat starved during loop?). |
| 2 | **No background** — she's missing her grey background again. Jun: "Hermes defaults to that when he's fixing her." | Hermes's restart/reboot path doesn't re-composite the grey bg (fc_float_window bg = neutral frame under her). Recurring regression (lost on the float-switch before, too). | Bake the bg into the STANDARD startup/restart so it's never dropped; add "bg present" to the startup verification checklist. |
| 3 | **Turn-taking / sentence-split (BIG)** — reply timing wildly variable; she treats a mid-sentence PAUSE as end-of-turn → replies to half a sentence, confused (not enough content yet). Vowel-elongation workaround ("liiiike") still failed. Sentence split in two: 2nd half queued right after the 1st → both out of context. His clarifying interjection mid-reply was backlogged/lost ("she was busy"). | VAD endpointing too aggressive (min_silence_ms=250 ends the turn on natural pauses). No barge-in: new user speech queues serially BEHIND her in-flight reply instead of interrupting it → stale out-of-context responses + lost interjections. | (a) Raise end-of-turn silence threshold (250 → ~700-1000ms) and/or add semantic endpointing so pauses don't end turns. (b) BARGE-IN: new user speech cancels/supersedes the in-flight or queued reply, never backlogs. (c) Coalesce rapid consecutive utterances into ONE turn; latest wins; drop stale queued replies. (d) Utterance-aggregation: wait a beat after detected end-of-speech for a continuation before responding. |
| 4 | **Emotions don't LAND on short replies** — the FC emotion runs ONLY while she speaks, so a short sentence cuts the emotion off mid-movement and snaps to neutral, instead of rendering fully then easing into neutral. (Jun: not the renders — the timing.) | Emotion display is hard-coupled to speech duration + hard-cuts to neutral at speech_end. Short speech → emotion truncated mid-onset/movement → abrupt neutral. **This is the FLIP SIDE of #1** (there it stuck looping too long; here it's cut too short) — same emotion-timing subsystem. | Decouple from speech length: give each emotion a MINIMUM play-through (full onset + brief peak) regardless of how short the speech is, and ALWAYS play the OFFSET ease-out into neutral (never hard-cut). Net design for the subsystem = FLOOR (don't truncate, #4) + CEILING (don't loop forever, #1) + graceful OFFSET every time. |
| 5 | **Not yet practical for natural convo (V1 caveat).** Wait time >3s reads as an issue, but Jun says it's AMPLIFIED by the abrupt transitions. **Jun's proposed design (THE fix for #1+#4):** once an emotion fires, the MIL keeps firing REGARDLESS of whether she's speaking, until EITHER (a) a time limit (default **10s** now; per-emotion expiration later) → SLOW ease back to neutral, OR (b) a new emotion is activated by conversation flow (supersedes it). | **This is the canonical solution — implement it:** emotion duration is TIME-based (10s default) + interruptible by the next emotion, fully DECOUPLED from speech; on timeout, offset ease → neutral; never speech-coupled, never hard-cut. Latency (>3s) is a separate V1-known item, but smoother transitions cut the perceived awkwardness now. |
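
Sketch for #3 (b)-(d), a minimal illustration and not Malin's actual pipeline: fragments that arrive within a short continuation window are coalesced into ONE turn, and every new user utterance bumps a turn id so any in-flight or queued reply for an older id is dropped instead of spoken. `TurnManager`, its method names, and the 0.8s window are all hypothetical; timestamps are plain seconds passed in by the caller.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class TurnManager:
    """Coalesces utterance fragments and marks superseded replies as stale."""
    continuation_window_s: float = 0.8  # assumed "wait a beat" length, tunable
    _fragments: List[str] = field(default_factory=list)
    _last_speech_end: Optional[float] = None
    _turn_id: int = 0  # bumped on every new user utterance

    def on_user_speech(self, text: str, end_time: float) -> None:
        """New speech always wins: it invalidates replies to earlier turn ids."""
        self._turn_id += 1
        self._fragments.append(text)
        self._last_speech_end = end_time

    def poll_ready_turn(self, now: float) -> Optional[Tuple[int, str]]:
        """Return (turn_id, merged_text) once the continuation window has passed."""
        if self._last_speech_end is None:
            return None
        if now - self._last_speech_end < self.continuation_window_s:
            return None  # still waiting for a possible continuation
        merged = " ".join(self._fragments)
        self._fragments.clear()
        self._last_speech_end = None
        return self._turn_id, merged

    def is_stale(self, reply_turn_id: int) -> bool:
        """A reply is stale if the user has spoken since it was requested."""
        return reply_turn_id != self._turn_id


tm = TurnManager()
tm.on_user_speech("so my ex", end_time=1.0)
tm.on_user_speech("texted me yesterday", end_time=1.6)  # arrives inside the window
ready = tm.poll_ready_turn(now=2.5)                      # both halves, one turn
tm.on_user_speech("wait, I mean last week", end_time=4.0)
```

The reply generator would check `is_stale(turn_id)` before speaking (and again before each queued chunk), which is what turns the "she was busy" backlog into a barge-in.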

## CONSOLIDATION (for the dispatch)
**Items #1, #4, #5 are ONE fix = the emotion-timing redesign** (Jun's design): emotion fires → time-based MIL hold (10s default, decoupled from speech) → ease to neutral on timeout OR switch on new emotion. Solves the stuck-loop (#1 ceiling), the short-reply truncation (#4 floor), and the abrupt transitions (#5). This is the headline FC v1.1-ish fix. NOTE: the hold length was later refined under #7 (speech + ~2s tail, not a fixed 10s).
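
Sketch of the time-based hold as a tiny state machine, for the dispatch. This is an illustration under assumptions, not the logic Hermes built: `EmotionHold`, the phase names, and the method names are hypothetical, and the clock is passed in as plain seconds. The timeout doubles as the #1 ceiling (the hold ends even if `speech_end` never arrives) and exit always goes through the offset phase, never a hard cut.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class EmotionHold:
    default_hold_s: float = 10.0  # default time limit from Jun's design
    per_emotion_hold_s: Dict[str, float] = field(default_factory=dict)  # "later"
    emotion: Optional[str] = None
    started_at: float = 0.0
    phase: str = "neutral"  # "neutral" | "hold" | "offset"

    def fire(self, emotion: str, now: float) -> None:
        """A new emotion supersedes whatever is playing and restarts the timer."""
        self.emotion = emotion
        self.started_at = now
        self.phase = "hold"

    def tick(self, now: float) -> str:
        """Advance the clock; on timeout move to the offset ease-out."""
        if self.phase == "hold":
            limit = self.per_emotion_hold_s.get(self.emotion, self.default_hold_s)
            if now - self.started_at >= limit:
                self.phase = "offset"
        return self.phase

    def offset_finished(self) -> None:
        """Called when the offset clip has played through to neutral."""
        if self.phase == "offset":
            self.emotion = None
            self.phase = "neutral"


hold = EmotionHold()
hold.fire("sad", now=0.0)
phase_mid = hold.tick(now=3.0)       # still holding, speech state is irrelevant
hold.fire("happy", now=4.0)          # conversation flow activates a new emotion
phase_late = hold.tick(now=14.0)     # 10s after the supersede
```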

## QUEUED — MIL SEAMLESS LOOP (proper render) — Jun's call 6/7, HIGHER PRI than the tray
**#7 — Proper loop-matched MILs.** Jun watched the montage + caught it: current MILs are truncated expression-halves (end abruptly, don't loop) — "nothing different than the regular expressions." Render each MIL (happy_idle, pensive_idle, content_idle) as a SEAMLESSLY LOOPING settled-mood clip via **FLF2V** (same settled-mood frame as BOTH first AND last → subtle idle motion returns to start → loop-closed, no seam). Subtle idle life only (breath/blink/drift), staying IN-mood, NOT an expression arc. ~3-5s/loop. VERIFY looped playback has no seam-jump (first==last frame). These loop under the 10s time-based hold (#5). Jun chose PROPER over ping-pong ("won't know if it works unless we do the proper version"). Render offline (Malin soft-off). Sequence BEFORE #6. **REFINED 6/7 (Jun): render ONE emotion first as a full end-to-end DEMO clip** — neutral → onset → loop-matched MIL (looped, with her TTS voice over it saying a line) → offset/tail → neutral, assembled into ONE video posted to the group — to validate the whole flow before rendering all 3.
  - **REFINED 6/7 (Jun's clearer format — supersedes the fixed-10s hold):** target sequence = **FIRST HALF of expression (onset) → seamless MIL loop WHILE she talks (covers speech duration, however long) → MIL continues ~2 more seconds after she stops → SECOND HALF of expression (offset) → clean transition to neutral/default.** So MIL hold duration = speech_length + ~2s tail (SPEECH-ADAPTIVE, not fixed 10s), always resolving through the offset (works for a 2s reply AND a 20s reply). Requires: the expression split into onset/offset HALVES + the proper seamless MIL bridging them. This is the demo + integration target when the render resumes. NOTE: this changes the emotion-timing logic Hermes built (fixed-10s → speech+2s-tail).
  - **Cal's implementation notes (the real challenges):** (i) THE BRIDGES — onset-last-frame, the MIL loop, and offset-first-frame must all share ONE matched "settled-mood" anchor pose, or the jump just moves from the loop seam to the joins. Render all three around one common anchor frame. (ii) Always let the ONSET fully complete before the MIL even on a 1-word reply (don't truncate the reaction). (iii) The 2s tail should be tunable. Jun + Cal aligned on this design 6/7.
  - **MAJOR REFINEMENT 6/7 (Jun) — FIRE THE ONSET AT COMPREHENSION (the expression LEADS):** the eye-glow already fires at comprehension (she "knows" the emotion before she speaks), so fire the expression ONSET right then — she SITS in the expression while the reply generates + TTS renders. This FILLS the latency naturally (solves the >3s dead-wait, #5 — the wait becomes *her reacting*, like a real person whose face moves before they talk). **ENABLER:** decide the emotion EARLY — emit the [PERFORM:emotion] tag FIRST/LEADING (parse it off the generation stream within the first tokens) instead of trailing, OR a fast emotion-classify on Jun's input. Then the onset fires at comprehension while the rest of the reply generates. **FULL FLOW:** COMPREHENSION (eye-glow + onset) → hold expression through generation + speech (MIL loop while talking) → ~2s tail → offset → neutral. Folds the latency-fix (#5) INTO the emotion-timing design.
  - **DEMO FORMAT (Jun 6/7): pre-rendered EXAMPLE VIDEO, NOT a live test.** TWO instances — a SHORT answer + a LONG answer — to validate the speech-adaptive timing for both lengths. Jun watches → tweak the design → THEN the live coding (VALIDATE-BEFORE-CODE). Render + assemble the two example videos only; no live integration until Jun approves the mockup. Tonight offline, Malin OFF.
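
Sketch of the speech-adaptive sequence above as a timeline planner, useful for assembling the SHORT and LONG example videos. All names and clip lengths here are hypothetical. Two assumptions are baked in and labeled in the code: `lead_s` stands for the generation + TTS wait between comprehension and the first spoken word, and the MIL is rounded UP to whole loops so it ends on the shared anchor frame (so the real hold can run slightly past speech + tail).

```python
import math
from typing import List, Tuple

Segment = Tuple[str, float, float]  # (name, start_s, duration_s), t=0 at comprehension


def plan_emotion_timeline(speech_s: float, onset_s: float, loop_s: float,
                          offset_s: float, tail_s: float = 2.0,
                          lead_s: float = 0.0) -> List[Segment]:
    # Point where the hold may end: speech start + speech length + tunable tail.
    hold_end = lead_s + speech_s + tail_s
    # The onset always plays in full, even on a 1-word reply.
    mil_needed = max(0.0, hold_end - onset_s)
    # ASSUMPTION: whole loops only, so the MIL hands off on the anchor pose.
    n_loops = max(1, math.ceil(mil_needed / loop_s))
    mil_s = n_loops * loop_s
    return [
        ("onset", 0.0, onset_s),
        ("mil_loop", onset_s, mil_s),
        ("offset", onset_s + mil_s, offset_s),
        ("neutral", onset_s + mil_s + offset_s, 0.0),
    ]


# Hypothetical clip lengths: 1s onset, 4s loop, 1s offset.
short_reply = plan_emotion_timeline(speech_s=2.0, onset_s=1.0, loop_s=4.0, offset_s=1.0)
long_reply = plan_emotion_timeline(speech_s=20.0, onset_s=1.0, loop_s=4.0, offset_s=1.0)
led_reply = plan_emotion_timeline(speech_s=2.0, onset_s=1.0, loop_s=4.0, offset_s=1.0,
                                  lead_s=3.0)  # onset fired 3s before she speaks
```

If the whole-loop rounding makes the tail feel too long on short replies, a shorter loop clip is the knob to turn rather than cutting the loop mid-cycle.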

## QUEUED FEATURE (after the MIL render + Jun's re-test)
**#6 — System tray switch for Malin's state** (Jun, 6/7): so Jun self-manages her without interrupting Hermes (today he had to ping Hermes to reboot a stalled Malin). Tray menu: ALL-ON/HOME · RESTART/REBOOT (self-service) · SOFT KILL (graceful) · HARD KILL (END+latch) · GREEN/AMBER/RED status indicator. Wire to existing hotkey/supervisor/kill-latch semantics. Interim manual control while the always-on/run-on-boot design is DEFERRED until she's reliable. **✅ BUILT + TESTED 6/7** (malin_live_tray.py + Start_Malin_Tray_Switch.cmd) — Jun self-serves the bring-up now.

**#8 — Cal room-voice** (Jun 6/7, "let's make that happen"): Cal speaks through Jun's **5090 speakers** for hands-free status announcements ("she's ready", "render done") while he works on another task. Cal generates her Kokoro voice → a small always-on PLAYER on the 5090 (HTTP endpoint or watched folder, no GPU) plays it through the 5090's speakers (same ones Malin uses). MUTE toggle for company/focus. OPEN Q: network path Cal's Mac → 5090 (Hermes confirms; one-way-net caveat — Mac→5090 may work even though 5090→Mac is blocked). Hermes builds the 5090-side player; Cal does the audio-POST. Connects to [[project_calypso_embodiment_intention]]. QUEUE after the current Malin re-test settles.
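
Sketch of the watched-folder variant of the 5090-side player, stdlib only. Everything named here is hypothetical: `drain_inbox`, the inbox folder, and the convention that Cal names clips with a sortable prefix so they play in order. Actual audio output is left to an injected `play` callable, since the playback library is not decided in these notes. Discarding clips while muted (rather than holding them) is an assumption, chosen so stale announcements don't pile up.

```python
import time
from pathlib import Path
from typing import Callable, List


def drain_inbox(inbox: Path, play: Callable[[Path], None], muted: bool) -> List[str]:
    """Play every queued .wav in name order, then delete it.

    ASSUMPTION: while muted, clips are dropped instead of held for later.
    """
    played: List[str] = []
    for clip in sorted(inbox.glob("*.wav"), key=lambda p: p.name):
        if not muted:
            play(clip)
            played.append(clip.name)
        clip.unlink()
    return played


def run_player(inbox: Path, play: Callable[[Path], None],
               is_muted: Callable[[], bool], poll_s: float = 0.5) -> None:
    """Always-on loop: poll the folder, play what arrived, sleep, repeat."""
    while True:
        drain_inbox(inbox, play, is_muted())
        time.sleep(poll_s)
```

The HTTP-endpoint option would only change how clips land in the inbox; the mute gate and ordering stay the same.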

**#9 (reliability) — SPLIT status sign + bring-up GPU guard (Jun's design 6/7).** The GPU-contention crash left her zombie-GREEN (process up, models NOT loaded → unresponsive). FIX:
- (a) **SPLIT the status indicator** down the middle: **LEFT = senses** (mic/speaker/cam; green=on/available, red=off/killed). **RIGHT = BRAIN** — green ONLY when models loaded AND a real inference ping (STT→brain→TTS or at least model-loaded+respond) passes; AMBER=loading; RED=not-plugged-in. Both green = truly ready; right-red = brain not plugged in (the zombie-green scenario, now visible at a glance).
- (b) **PREVENTION:** the bring-up (HOME / tray ALL-ON) CHECKS free VRAM / pauses any running render BEFORE loading the brain, so it loads clean instead of crashing into a render.
- Synergy: once #8 room-voice is in, Cal audibly announces "brain online" when truly ready.

QUEUE after the current re-test settles.
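
Sketch of the #9 logic, as a hedged illustration only. The function names, the injected callables, and the VRAM numbers are hypothetical; how free VRAM is actually queried and how a render is paused are left to the caller. Two assumptions: the LEFT light is red if ANY sense is off, and a brain that is loaded but fails the inference ping shows red (not amber) unless it is still loading.

```python
from enum import Enum
from typing import Callable


class Light(Enum):
    GREEN = "green"
    AMBER = "amber"
    RED = "red"


def senses_light(mic_on: bool, speaker_on: bool, cam_on: bool) -> Light:
    """LEFT half: green only when every sense is on/available (assumption)."""
    return Light.GREEN if (mic_on and speaker_on and cam_on) else Light.RED


def brain_light(models_loaded: bool, ping_ok: bool, loading: bool) -> Light:
    """RIGHT half: green needs loaded models AND a passing inference ping."""
    if models_loaded and ping_ok:
        return Light.GREEN
    if loading:
        return Light.AMBER
    return Light.RED


def guarded_bringup(get_free_vram_gb: Callable[[], float], needed_gb: float,
                    render_running: Callable[[], bool],
                    pause_render: Callable[[], None],
                    load_brain: Callable[[], None]) -> bool:
    """Check VRAM (pausing a render if needed) BEFORE loading the brain."""
    if get_free_vram_gb() < needed_gb and render_running():
        pause_render()
    if get_free_vram_gb() < needed_gb:
        return False  # stay brain-RED instead of crashing into the render
    load_brain()
    return True


# The zombie-GREEN case: process up, senses fine, models never loaded.
zombie = (senses_light(True, True, True), brain_light(False, False, loading=False))
```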

**#10 — LIP-SYNC (Jun 6/7, curiosity → greenlit):** her mouth doesn't move to the TTS words (audio just plays over the held expression). Add audio-driven lip-sync (Wav2Lip / MuseTalk / LatentSync on the 5090) to the TALKING section — mouth syncs to the words while eyes/expression hold the mood (model animates mouth region, preserves the rest). PRE-RENDERED on the example first; LIVE real-time lip-sync (MuseTalk-class) = bigger future lift. The detail that makes her feel like she's actually talking. Folds into the next example re-build (with eyes-open MIL + frame-trim).

**#11 — CAL TWO-WAY ROOM-VOICE / HEAR-YOU (Jun 6/7 "make it so") — BUILDING:** Cal hears Jun via the 5090 mic. DELIBERATE / on-demand (privacy-by-design, NOT always-on). Flow: Jun summons (v1 = texts "Cal, listen"; later button/wake-phrase) → Cal opens a listen window → 5090 mic capture (~10s/until-silence) → whisper STT → transcript → Cal responds via room-voice + Telegram. 5090-side: a "Cal listen" endpoint, request/response (Mac polls, since 5090→Mac blocked); GRAB-AND-RELEASE the mic, coordinate w/ Malin's STT. Calypso embodiment milestone (two-way presence) — [[project_calypso_embodiment_intention]]. Builds alongside the MIL re-render (STT light, no GPU conflict). Pairs with #8 room-voice (output, DONE 6/7). **ENDPOINTING (Alexa-style, Jun 6/7):** start-timeout ~8s (no speech → close "didn't hear"); once speaking, FOLLOW via silero VAD (reuse Malin's) with NO cap; end-of-speech = ~1.5s silence → close + transcribe + report (tunable, longer than Malin's 900ms so it won't cut mid-explanation); hard max-cap ~60s so it never hangs. Handles quick "yep" AND long ramble.
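
Sketch of the Alexa-style endpointing as a pure function over per-frame VAD decisions, so the thresholds can be tuned without a mic. It is an illustration, not the 5090-side endpoint: `listen_window`, the 100ms frame size, and the reason strings are hypothetical, and the booleans would come from silero VAD in the real build. Reading of the spec assumed here: "NO cap" means no fixed-length capture once speech starts, while the ~60s hard max (counted from window open, an assumption) still applies.

```python
from typing import Iterable, Tuple


def listen_window(vad_frames: Iterable[bool], frame_ms: int = 100,
                  start_timeout_ms: int = 8000, end_silence_ms: int = 1500,
                  max_cap_ms: int = 60000) -> Tuple[str, int]:
    """Return (close_reason, elapsed_ms) for one listen window.

    vad_frames: True where the VAD detected speech in that frame.
    """
    elapsed = 0
    silence = 0
    heard = False
    for is_speech in vad_frames:
        elapsed += frame_ms
        if is_speech:
            heard = True
            silence = 0
        else:
            silence += frame_ms
        if elapsed >= max_cap_ms:
            return ("max_cap", elapsed)        # never hangs
        if not heard and elapsed >= start_timeout_ms:
            return ("didnt_hear", elapsed)     # nobody spoke
        if heard and silence >= end_silence_ms:
            return ("end_of_speech", elapsed)  # close + transcribe + report
    return ("stream_ended", elapsed)


quick_yep = listen_window([False] * 10 + [True] * 5 + [False] * 15)
with_pause = listen_window([True] * 10 + [False] * 10 + [True] * 10 + [False] * 15)
```

In `with_pause` the 1s mid-sentence gap is under the 1.5s threshold, so the window stays open through it, which is the behaviour Malin's own turn-taking was missing in #3.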

## Notes
- She self-recovered (not a hard hang) — but a 3-min stuck-loop is a real defect. High-priority MIL fix.
- Watch for: does it recur on ANY long/emotional reply, or only sad? Is AMBER caused by the loop blocking the heartbeat thread?