# FC Live Interaction — Issues & Solutions
*Calypso, 2026-06-08. Anticipating the real problems Jun will hit running Malin's animated face (FC) in live, real-time conversation. Companion to the Method A emotion-unit template (POC-locked 6/8).*

## The reframe: a state machine, not a clip player

A pre-rendered library plays fixed-length clips. Live conversation is variable-length and unpredictable. Every problem below comes from that mismatch. The fix is to drive the avatar as a **real-time state machine** reacting to conversation **events**, where clips are the *vocabulary*, not the timeline.

**States:** `listening-idle` · `thinking-idle` · `speaking(emotion)` = onset → **loopable sustain** → offset · `bridge(emotionA→emotionB)` · `safe-neutral`
**Events:** user-speech-start · user-speech-end · malin-audio-start · malin-audio-end · emotion-tag · **barge-in** · missing-asset

The hot path only **selects + plays** pre-baked clips — no live diffusion (see #9).
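
As a minimal sketch of this reframe, the states and events listed above could be encoded as enums with a lookup table for transitions. The table entries and the `step` function are illustrative assumptions, not the locked design. The state that follows a bridge is left out because no bridge-finished event is defined above.

```python
from enum import Enum, auto


class State(Enum):
    LISTENING_IDLE = auto()
    THINKING_IDLE = auto()
    SPEAKING = auto()      # onset -> loopable sustain -> offset
    BRIDGE = auto()        # emotionA -> emotionB
    SAFE_NEUTRAL = auto()


class Event(Enum):
    USER_SPEECH_START = auto()
    USER_SPEECH_END = auto()
    MALIN_AUDIO_START = auto()
    MALIN_AUDIO_END = auto()
    EMOTION_TAG = auto()
    BARGE_IN = auto()
    MISSING_ASSET = auto()


# Assumed transitions, for illustration only.
TRANSITIONS = {
    (State.LISTENING_IDLE, Event.USER_SPEECH_END): State.THINKING_IDLE,
    (State.THINKING_IDLE, Event.USER_SPEECH_START): State.LISTENING_IDLE,
    (State.THINKING_IDLE, Event.MALIN_AUDIO_START): State.SPEAKING,
    (State.SPEAKING, Event.MALIN_AUDIO_END): State.LISTENING_IDLE,
    (State.SPEAKING, Event.BARGE_IN): State.BRIDGE,
    (State.SAFE_NEUTRAL, Event.USER_SPEECH_END): State.THINKING_IDLE,
    (State.SAFE_NEUTRAL, Event.MALIN_AUDIO_START): State.SPEAKING,
}


def step(state: State, event: Event) -> State:
    """Return the next state; unknown (state, event) pairs keep the current state."""
    if event is Event.MISSING_ASSET:
        return State.SAFE_NEUTRAL  # a missing asset degrades to safe-neutral from anywhere
    return TRANSITIONS.get((state, event), state)
```

Keeping the transitions in a plain table means the hot path is a dictionary lookup plus a clip selection, which fits the select-and-play constraint.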

---

## A. TIMING / SYNC — the hardest class

**1. Variable speech length vs fixed clip length** *(must-solve now)*
- *Problem:* her spoken reply runs 2–30 s, but the sustained-hold clip has a fixed length. The expression must not end mid-sentence or freeze once the clip runs out.
- *Solution (Jun's unification, 6/8):* the sustained hold = the **Foundation-Four SHUFFLE-BAG, tinted by emotion**: short anchored *variations* of the expression (e.g. several micro-smiles), stitched in weighted-random order and all sharing the emotion's anchor, so the hold lasts for ANY duration without visibly looping. This is the exact engine already proven for the idle FF. Flow: onset → shuffle-bag (held until the beat ends / audio stops) → exit via offset → FF idle, OR (V2) via the **GASKET** (Jun's term for the interrupt bridge into the next emotion). So the MILs "work" for free: they inherit the FF mechanism. This solves the variable-length hold AND the repetition tell (#7) in one move. *Dumb-version-first:* a single looped sustain is a bag of one; prove that seam, then add 2–3 variations. **#1 real-time requirement.**
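
The sustain mechanism can be sketched as a small shuffle bag that keeps drawing variations until the requested duration is covered. This is an assumption-laden illustration: `Variation`, `ShuffleBag`, `fill_sustain`, and the integer `copies` weighting are hypothetical names, and the real FF engine may weight or stitch differently.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Variation:
    name: str          # hypothetical clip id, e.g. "smile_micro_a"
    duration_s: float
    copies: int = 1    # integer weight: tickets this clip gets in each bag


class ShuffleBag:
    """Draw variations without replacement; refill and reshuffle when empty."""

    def __init__(self, variations, rng=None):
        if not variations:
            raise ValueError("need at least one variation")
        if any(v.duration_s <= 0 or v.copies < 1 for v in variations):
            raise ValueError("durations must be positive and copies >= 1")
        self._variations = list(variations)
        self._rng = rng or random.Random()
        self._bag = []

    def draw(self) -> Variation:
        if not self._bag:
            self._bag = [v for v in self._variations for _ in range(v.copies)]
            self._rng.shuffle(self._bag)
        return self._bag.pop()


def fill_sustain(bag: ShuffleBag, target_s: float) -> list:
    """Return clips whose total duration first reaches or exceeds target_s."""
    playlist, total = [], 0.0
    while total < target_s:
        clip = bag.draw()
        playlist.append(clip)
        total += clip.duration_s
    return playlist
```

A bag holding a single `Variation` reduces to the plain looped sustain, so the dumb version and the later 2–3 variation version share one code path. When the end time is not known in advance, `draw()` can be called one clip at a time until the audio stops.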

**2. The thinking/latency gap** *(must-solve now)*
- *Problem:* LLM + TTS + emotion-pick take seconds. A frozen/dead-neutral face during that gap reads as "broken."
- *Solution:* a dedicated **`thinking-idle`** living loop — small considering-glance, head tilt, breath, micro-nod. Fires the instant the user stops, holds until her audio is ready. Latency reads as "she's thinking," not "she froze."

**3. A/V sync (voice ↔ face)** *(must-solve now)*
- *Problem:* voice (:1250) and animation run separately; a mouth or expression that starts out of step with the audio reads as dubbed.
- *Solution:* event handshake — trigger onset+speaking on **malin-audio-start** (not text-ready); loop sustain for the **measured WAV duration**; release offset on **malin-audio-end**. Audio is the clock.
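
A rough sketch of "audio is the clock": measure the WAV with Python's standard `wave` module, then lay the onset, sustain, and offset segments on a timeline whose zero is malin-audio-start. `plan_speaking` and `make_silence_wav` are hypothetical helpers, and the sketch assumes the onset plays inside the audio window while the offset starts at malin-audio-end.

```python
import io
import wave


def wav_duration_s(source) -> float:
    """Measure a WAV's duration; source is a file path or a binary file object."""
    with wave.open(source, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def plan_speaking(audio_s: float, onset_s: float, offset_s: float) -> list:
    """Return (segment, start_s, end_s) tuples, with t=0 at malin-audio-start."""
    sustain_start = min(onset_s, audio_s)  # very short replies get no sustain
    return [
        ("onset", 0.0, sustain_start),
        ("sustain", sustain_start, audio_s),
        ("offset", audio_s, audio_s + offset_s),
    ]


def make_silence_wav(seconds: float, rate: int = 16000) -> io.BytesIO:
    """Demo helper: build an in-memory mono 16-bit WAV of silence."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(seconds * rate))
    buf.seek(0)
    return buf
```

For example, `plan_speaking(wav_duration_s(make_silence_wav(5.0)), 0.5, 0.5)` yields an onset over 0.0–0.5 s, a sustain over 0.5–5.0 s, and an offset over 5.0–5.5 s. In the live engine the sustain end would still be released by the actual malin-audio-end event, with the measured duration serving as the plan.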

## B. SPEECH / MOUTH

**4. Lip-sync during speech** *(biggest architecture fork — start dumb)*
- *Problem:* pre-rendered emotion clips don't move the mouth with the words.
- *Options:* (a) audio-driven phoneme mouth layer (InfiniteTalk / Audio2Face): most real, most complex; (b) **amplitude-gated generic mouth motion**: moves when audio is loud, not phoneme-accurate, cheap, reads as "talking"; (c) stylized VTuber mouth-flaps, where the voice carries the words.
- *Recommendation:* **ship (b) first** (dumb version that works), upgrade to (a) only if literal sync matters.
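
Option (b) can be sketched as a per-video-frame RMS gate over the reply audio. The function name, the `gate` and `full` thresholds, and the mono 16-bit little-endian PCM input are assumptions chosen for illustration; real values would need tuning against the TTS output.

```python
import math
import struct


def mouth_openness(pcm16: bytes, sample_rate: int, fps: int = 30,
                   gate: float = 0.02, full: float = 0.3) -> list:
    """Return one mouth-open value in [0, 1] per video frame.

    Below `gate` (normalized RMS) the mouth stays closed; at or above
    `full` it is fully open; in between it scales linearly.
    """
    n = len(pcm16) // 2
    samples = struct.unpack(f"<{n}h", pcm16[: n * 2])
    n_frames = (n * fps + sample_rate - 1) // sample_rate  # ceiling division
    values = []
    for f in range(n_frames):
        # Index windows from the frame number so rounding never accumulates.
        window = samples[f * sample_rate // fps:(f + 1) * sample_rate // fps]
        if not window:
            values.append(0.0)
            continue
        rms = math.sqrt(sum(s * s for s in window) / len(window)) / 32768.0
        if rms < gate:
            values.append(0.0)
        else:
            values.append(min(1.0, (rms - gate) / (full - gate)))
    return values
```

The returned list can be computed once per reply, before playback, and then indexed by frame number, so nothing heavier than a list lookup runs during speech.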

## C. EMOTION CORRECTNESS

**5. Wrong-emotion picks** *(must-solve now — Jun flagged this earlier)*
- *Problem:* the model's `[PERFORM:emotion]` tag doesn't fit what she's saying → sours the relationship.
- *Solution:* a light guard between LLM and player — cross-check the response text's sentiment against the chosen emotion; on mismatch/low-confidence, fall back to **safe warm-neutral** or the nearest validated emotion. Plus: keep the taxonomy trimmed to emotions the model picks reliably (the legibility pass already does this). Plus an **intensity dial** (subtle vs full) so a near-miss isn't jarring.
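
The guard might look roughly like the sketch below. Everything in it is a placeholder: the emotion names, the valence table, and especially the toy word lists, which stand in for whatever sentiment check is actually used. It shows only the shape of the check, tag in, validated tag or fallback out.

```python
import re

SAFE_FALLBACK = "warm-neutral"

# Hypothetical taxonomy: emotion tag -> expected valence (+1, 0, -1).
EMOTION_VALENCE = {"joy": 1, "amused": 1, "concern": -1, "sad": -1, "warm-neutral": 0}

# Toy lexicon standing in for a real sentiment model.
POSITIVE = {"great", "wonderful", "love", "glad", "congratulations", "happy"}
NEGATIVE = {"sorry", "loss", "sad", "awful", "hurt", "terrible"}


def text_valence(text: str) -> int:
    """Return +1, 0, or -1 from a crude positive-minus-negative word count."""
    words = re.findall(r"[a-z']+", text.lower())
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return (score > 0) - (score < 0)


def guard_emotion(tag: str, text: str) -> str:
    """Pass the tag through unless it is unknown or contradicts the text."""
    expected = EMOTION_VALENCE.get(tag)
    if expected is None:
        return SAFE_FALLBACK          # tag outside the trimmed taxonomy
    if expected * text_valence(text) < 0:
        return SAFE_FALLBACK          # opposite signs: clear mismatch
    return tag
```

This version only blocks outright contradictions and lets neutral text pass. A low-confidence threshold or a nearest-validated-emotion lookup would slot in at the same two return points.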

**6. Interrupt / barge-in** *(V2, Jun's emotion→emotion insight)*
- *Problem:* user interrupts mid-expression; she must stop + redirect emotionally, NOT reset to neutral first.
- *Solution:* **VAD** on the mic during her speech → detect barge-in → duck/stop her audio → jump via an **emotion→emotion bridge** (or to listening-idle). Engine must interrupt the current clip at any frame and launch the bridge from the nearest valid exit. Build the **likely** emotion pairs first (Jun), defer rare ones.
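
One way to picture the interrupt step: each clip carries a sorted list of frames where it is safe to leave, and on barge-in the engine picks the next such frame plus a bridge for the emotion pair. The sketch reads "nearest valid exit" as the first exit at or after the current frame, since playback only moves forward. The `BRIDGES` table and clip names are hypothetical.

```python
from bisect import bisect_left

# Hypothetical pre-rendered bridges keyed by (from_emotion, to_emotion).
BRIDGES = {("joy", "concern"): "bridge_joy_concern"}


def next_exit_frame(exit_frames, current_frame):
    """Return the first exit frame >= current_frame, or None if none remain.

    exit_frames must be sorted in ascending order.
    """
    i = bisect_left(exit_frames, current_frame)
    return exit_frames[i] if i < len(exit_frames) else None


def plan_barge_in(current_emotion, next_emotion, exit_frames, current_frame):
    """Return (frame to leave the current clip, clip or state to play next)."""
    exit_at = next_exit_frame(exit_frames, current_frame)
    if exit_at is None:
        exit_at = current_frame  # no exit left: cut immediately
    # Pairs without a pre-rendered bridge fall back to listening-idle.
    target = BRIDGES.get((current_emotion, next_emotion), "listening-idle")
    return exit_at, target
```

Because unbuilt pairs resolve to `listening-idle`, the likely pairs can ship first and rare ones can be added to the table later without touching the engine.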

## D. ANTI-ROBOTIC

**7. Repetition tells** *(later)* — 2–3 variants per emotion, random-select; randomize idle micro-motion + onset/offset timing.

**8. Clip-to-clip pop** *(must-solve now — generalizes the loop-seam glitch)*
- *Problem:* clip-A end frame ≠ clip-B start frame → visual pop at every handoff (the same aspect-ratio glitch, between clips).
- *Solution:* a **canonical shared neutral ANCHOR frame** — every unit starts AND ends on the identical neutral frame, so any clip hands off to any other with zero pop. The loop-seam fix generalizes to "everything shares the anchor."
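
The anchor rule lends itself to an offline check run over the library before anything ships. The sketch below assumes each clip is available as a list of decoded frames in raw bytes and compares digests of the first and last frame against the anchor. Exact equality is an assumption that only holds for lossless frames; lossy encodes would need a tolerance-based comparison instead.

```python
import hashlib


def digest(frame: bytes) -> str:
    """Fingerprint one decoded frame."""
    return hashlib.sha256(frame).hexdigest()


def clips_off_anchor(clips: dict, anchor: bytes) -> list:
    """Return sorted names of clips that do not start AND end on the anchor.

    clips maps a clip name to its list of frames (raw bytes per frame).
    """
    want = digest(anchor)
    bad = []
    for name, frames in clips.items():
        if not frames or digest(frames[0]) != want or digest(frames[-1]) != want:
            bad.append(name)
    return sorted(bad)
```

An empty result means any clip in the set can hand off to any other without a pop at the join.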

## E. SYSTEM

**9. Compute contention on the 5090** *(design constraint)* — PRE-RENDER everything (units + bridges); the live engine only selects + plays. No live diffusion in the hot path; it competes with LLM + TTS + vision for VRAM.

**10. Failure / edge states** *(must-solve now)* — missing tag / missing clip / audio fail → always fall to the **living neutral/listening idle**. Never a black screen or freeze. Graceful degradation.
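
The fallback rule can be sketched as a single resolver that every clip request passes through. The `library` mapping, the clip paths, and the injectable `exists` check are assumptions for illustration. The one hard requirement is that the fallback clip itself is always present.

```python
import os

FALLBACK_CLIP = "listening-idle"


def resolve_clip(tag, library, audio_ok=True, exists=os.path.exists):
    """Return a playable clip path, degrading to the living idle on any failure.

    library maps a tag to a file path and must contain FALLBACK_CLIP.
    """
    fallback = library[FALLBACK_CLIP]
    if not audio_ok or not tag:
        return fallback               # audio failed or the tag is missing
    path = library.get(tag)
    if path is None or not exists(path):
        return fallback               # unknown tag or clip file not on disk
    return path
```

Routing every request through one function keeps the never-frozen guarantee in a single place instead of scattering checks across the player.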

---

## Recommended build order (dumb-version-first)
1. **Seamless sustain loop** (#1) + **audio-clocked handshake** (#3) + **shared neutral anchor** (#8) — the real-time spine.
2. **thinking-idle** (#2) + **safe-neutral fallback** (#10) — never-frozen guarantee.
3. **amplitude-gated mouth** (#4b) — "she's talking" for cheap.
4. **emotion guard + intensity dial** (#5) — stop the souring.
5. **barge-in + priority emotion bridges** (#6) — V2 interrupt handling.
6. Polish: variants (#7), phoneme lip-sync (#4a) if needed.

*Everything in 1–2 is what makes her feel alive instead of like a video on a loop. Start there.*
