# Floating Desktop Companion Survey — for Malin (layered-2D stylized blink/breathe avatar)
**Calypso research deep-dive, 6/5. Input for V1.1 + Hermes's feasibility pass via the bridge.**

## (A) Notable builds

| Project | Stack | What's good |
|---|---|---|
| **Project AIRI** (`moeru-ai/airi`, ~40k ⭐) | TS/Vue + **Electron**; Three.js (VRM) + Live2D; WebGPU/WebAudio | Reference architecture: ships **Auto-blink / Auto-look-at / Idle eye movement** as named primitives; multi-provider TTS incl. **local Kokoro** (our stack). |
| **Open-LLM-VTuber** (~10k ⭐) | **Python** + JS frontend, local-first | Closest to ours: explicit **"Desktop Pet Mode"** = transparent bg + global top-most + **mouse click-through**; fully offline; emotion→expression from a Python backend. |
| **amica** (`semperai/amica`) | TS, Three.js + three-vrm | Clean emotion-engine + VAD (Silero) + Whisper + pluggable TTS. 3D/VRM, not 2D. |
| **Desktop Mate** (Steam, Unity 3D VRM) | Unity, VRM | Commercial benchmark for *presence* (reacts to cursor, head-pats). Cautionary — see AVOID. |
| **Shimeji** (Java) | **Layered-PNG sprite-swap** | The OG layered-2D: multi-frame sprite sets, drag, rare one-off animations. |
| **VPet** (`LorisYounger/VPet`) | C#/WPF, Steam Workshop mods | Moddable, plugin-extensible desktop pet on Windows. |
| **VTube Studio** | Unity + Live2D Cubism | Canonical **idle-life + physics** reference (auto-blink, breath, sway, phoneme lipsync). |
| **pixi-live2d-display** (`guansss`) | PixiJS plugin, web | If we ever go Live2D-in-a-window: unified Cubism API, cursor-focus, hit-areas. |
| **AITuberKit / aituber-onair** | Next.js/TS | Support VRM, Live2D, **AND animated-PNG** — validates simple PNG avatars as a legit first tier. |
| **z-waif / Mate-Engine / ZcChat** | Python / VRM / C++ | Smaller actively-maintained alts; z-waif = "fully local AI waifu," MIT. |

## (B) BORROW
1. **Live2D's breath math as a sprite-driver (no Live2D needed).** Decorrelated sine waves at mismatched periods — head `ParamAngleX` **6.53s**, `ParamAngleY` **3.53s**, `ParamBreath` **3.23s**, weight ~0.5. Drive layered-PNG transforms (subtle torso vertical-scale, few-px head bob) with these prime-ish-mismatched periods — that mismatch is *why* it never visibly loops.
2. **Auto-blink = randomized countdown, not a fixed timer** (~2–6s between blinks, occasional double-blinks). Regular blinking is a tell. *(We already did randomized 3–5s; widen to 2–6s and add double-blinks.)*
3. **Three idle primitives** (AIRI/Open-LLM-VTuber): auto-blink, idle eye-drift, **look-at = gaze follows cursor**. Gaze-to-cursor is the cheapest "she's aware of me" win; works on swappable eye layers.
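4. **Desktop-pet window recipe** (Open-LLM-VTuber): transparent bg + global top-most + **click-through**. Windows: **WS_EX_LAYERED + per-pixel ARGB (UpdateLayeredWindow)** for smooth edges. Layered windows hit-test on alpha, so fully transparent (alpha-0) pixels already let the mouse through; add **WS_EX_TRANSPARENT** to make the whole window click-through. A window region still works for hard-edged cutouts. Caveat on **WM_NCHITTEST**: returning `HTTRANSPARENT` only forwards to windows in the same thread, so it does not pass clicks to other apps by itself.
5. **Premultiplied alpha to kill the halo/fringe**: THE #1 transparency-fringe fix (RGB is scaled by alpha, so fully transparent pixels become black and no hidden color bleeds into soft edges under filtering) + pad sprite edges + linear filtering. **Bake into the asset pipeline**: PNG itself stores straight (non-premultiplied) alpha, so either premultiply at load time or make sure the loader knows the export is already premultiplied. UpdateLayeredWindow expects premultiplied ARGB either way. *(This is the exact fix for the magenta fringe Hermes hit → makes the true transparent cutout viable.)*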
6. **Asymmetry on purpose.** The eye reads bilateral symmetry as uncanny/dead. Offset left/right micro-motions (one shoulder lower, head float lags breath by a beat). Trivial with layered PNG.
7. **Subprocess-isolated TTS worker** (Reachy Mini): main app ↔ TTS worker over JSON on stdin/stdout — decouples orchestrator from synth, hot-swap Kokoro without touching the renderer.
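
A minimal sketch of the premultiply step from item 5, in pure Python so the arithmetic is visible. Assumptions: pixels arrive as 8-bit straight-alpha `(R, G, B, A)` tuples, and the helper names (`premultiply`, `blend_premultiplied`) are hypothetical, not from any of the projects above. A real pipeline would likely do this vectorized in the image loader.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int, int]  # (R, G, B, A), 8 bits per channel


def premultiply(pixel: Pixel) -> Pixel:
    """Scale RGB by alpha (rounded), so alpha-0 pixels become pure black."""
    r, g, b, a = pixel
    return (
        (r * a + 127) // 255,
        (g * a + 127) // 255,
        (b * a + 127) // 255,
        a,
    )


def premultiply_image(pixels: List[Pixel]) -> List[Pixel]:
    """Apply premultiply() to a flat list of straight-alpha pixels."""
    return [premultiply(p) for p in pixels]


def blend_premultiplied(src: Pixel, dst_rgb: Tuple[int, int, int]) -> Tuple[int, int, int]:
    """Composite a premultiplied pixel over an opaque background.

    out = src_rgb + dst_rgb * (1 - alpha); the source RGB is NOT scaled
    again, which is the whole point of premultiplying up front.
    """
    r, g, b, a = src
    inv = 255 - a
    return tuple(
        min(255, c + (d * inv + 127) // 255)
        for c, d in zip((r, g, b), dst_rgb)
    )


# A fully transparent magenta texel no longer carries any magenta.
hidden_magenta = premultiply((255, 0, 255, 0))
over_white = blend_premultiplied(hidden_magenta, (255, 255, 255))
```

Because the hidden RGB of transparent texels is zeroed before any filtering happens, linear sampling across a sprite edge has no key color left to pull in.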

## (C) UPGRADE
- **Lip-sync: skip phonemes, use volume-envelope WITH easing.** Drive mouth-open from the **TTS audio RMS envelope with attack/decay smoothing** + slight `mouthForm` width modulation. ~90% of phoneme realism for ~5% of the work; sidesteps "uncanny lipsync."
- **Stylized = our moat.** Animated-PNG is a first-class tier (AITuberKit). Photoreal/3D pays an uncanny tax on every micro-motion; a stylized character's breathing reads as *charm*, not "a photo that twitches." *(Validates Jun's anti-photoreal direction.)*
- **State-aware idle, not one loop.** Shimeji's trick: rare/occasional "alive beats" (a sigh, a glance away, a stretch) fired stochastically over a calm baseline. Kills the repetitive-loop tell.
- **Reactivity is the feeling, not the rig.** Responsiveness to the user is what flips "alive" on. Gaze-to-cursor + a reaction when she speaks/receives a message beats any amount of fancier idle.
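
The volume-envelope lip-sync above can be sketched as an RMS follower with separate attack and decay smoothing. Everything numeric here is an assumption to tune by eye: the 30 fps frame rate, the attack/decay time constants, and the `gain`. The function names are hypothetical, and `mouthForm` width modulation is left out.

```python
import math
from typing import List, Sequence


def frame_rms(samples: Sequence[float]) -> float:
    """Root-mean-square level of one audio frame (samples in -1..1)."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def mouth_open_curve(
    samples: Sequence[float],
    sample_rate: int,
    fps: int = 30,            # assumed render rate
    attack_s: float = 0.03,   # assumed: mouth opens fast
    decay_s: float = 0.12,    # assumed: mouth closes slower
    gain: float = 4.0,        # assumed: maps speech RMS into 0..1
) -> List[float]:
    """Return one mouth-open value in 0..1 per animation frame."""
    hop = max(1, sample_rate // fps)
    dt = hop / sample_rate
    # One-pole smoothing coefficients derived from the time constants.
    attack = 1.0 - math.exp(-dt / attack_s)
    decay = 1.0 - math.exp(-dt / decay_s)

    level = 0.0
    curve: List[float] = []
    for start in range(0, len(samples), hop):
        target = min(1.0, gain * frame_rms(samples[start:start + hop]))
        coeff = attack if target > level else decay
        level += coeff * (target - level)
        curve.append(level)
    return curve
```

Usage: decode the TTS output to mono floats, call `mouth_open_curve(samples, sample_rate)` once per utterance, then index the result by playback time to pick the mouth layer or scale.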

## (D) AVOID
- **The "photo that twitches."** Still = dead; one symmetric looping micro-motion = uncanny. The failure is the *middle* — commit to layered/asymmetric/multi-period, or keep her mostly still with deliberate beats.
- **Over-animation / drift-sickness.** Physics is the #1 cause of jitter/"jelly"/lag (fix = **damping**, use delay sparingly). Constant desktop-wandering induces low-grade motion-sickness. **Keep her parked, motion internal.** *(Exactly the nausea Jun caught.)*
- **GPU/CPU hog.** Desktop Mate ≈ **15% CPU / 50% GPU / 500MB RAM** (Unity 3D) → why people quit an always-on app. Layered-2D is our edge: low idle frame budget, render-on-change, pause when occluded/unfocused.
- **Paywalled personality / no modding** (Desktop Mate's top complaints). The character + behavior is the product, not cosmetics.
- **Intrusiveness.** Top-most that steals focus / pushes windows / floats over fullscreen = fast uninstall. (Electron has known always-on-top+transparent bugs — test occlusion explicitly.)
- **Hard metronome timing** on anything. Regularity is the single biggest "it's a program" tell.
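
One way to keep blink timing off the metronome, as a rough sketch: draw each gap from a uniform 2–6 s range and occasionally follow a blink with a quick second one. The double-blink probability and the 0.25 s spacing are guesses, not measured values, and `blink_times` is a hypothetical helper.

```python
import random
from typing import List, Optional


def blink_times(
    duration_s: float,
    min_gap_s: float = 2.0,
    max_gap_s: float = 6.0,
    double_blink_prob: float = 0.15,  # assumed, tune by eye
    double_gap_s: float = 0.25,       # assumed spacing of a double-blink
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Return blink start times (seconds) covering [0, duration_s)."""
    rng = rng or random.Random()
    times: List[float] = []
    t = rng.uniform(min_gap_s, max_gap_s)
    while t < duration_s:
        times.append(t)
        # Occasionally chain a second blink right after the first.
        if rng.random() < double_blink_prob and t + double_gap_s < duration_s:
            t += double_gap_s
            times.append(t)
        # Fresh random gap every time: no two intervals are tied together.
        t += rng.uniform(min_gap_s, max_gap_s)
    return times


# Example: precompute a minute of blinks with a fixed seed for repeatability.
schedule = blink_times(60.0, rng=random.Random(42))
```

In a live loop the same idea works as a countdown: draw a gap, count it down to zero, blink, draw again.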

## (E) Audio plumbing
- **Canonical local loop** (Reachy Mini / Open-LLM-VTuber / amica converge): `mic → Silero VAD → faster-whisper STT → LLM → TTS → speaker`. Malin's inbound is text/Telegram, so drop the mic → VAD → STT front end; the live part is **LLM → Kokoro TTS → playback → mouth-drive**.
- **AIRI already runs local Kokoro** as first-class TTS → validates our af_heart+af_nicole stack on an on-screen avatar; reuse the same Kokoro endpoint that feeds Calypso's voice.
- **Drive the mouth from the audio we already generate:** RMS envelope of the Kokoro WAV → smoothed amplitude → mouth-open. No viseme/phoneme model for v1.
- **Isolate the synth in a subprocess worker** (JSON stdin/stdout) so the orchestrator stays renderer-agnostic. fastrtc-style streaming is overkill for v1.
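
A minimal sketch of the subprocess-worker pattern, assuming newline-delimited JSON as the framing. The message fields (`id`, `text`, `ok`, `chars`) and the `TtsWorker` class are hypothetical, and the worker here is a stub that only echoes metadata: the actual Kokoro call would go where the comment marks it.

```python
import json
import subprocess
import sys

# Stand-in worker source. A real worker would load the TTS engine once at
# startup and synthesize audio where the placeholder comment sits.
WORKER_SOURCE = r'''
import json, sys
for line in sys.stdin:
    request = json.loads(line)
    # Placeholder for synthesis; this stub only reports what it received.
    reply = {"id": request["id"], "ok": True, "chars": len(request["text"])}
    sys.stdout.write(json.dumps(reply) + "\n")
    sys.stdout.flush()
'''


class TtsWorker:
    """Owns one worker process and talks to it one JSON line at a time."""

    def __init__(self) -> None:
        self._proc = subprocess.Popen(
            [sys.executable, "-u", "-c", WORKER_SOURCE],
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            text=True,
            bufsize=1,  # line-buffered pipes
        )
        self._next_id = 0

    def request(self, text: str) -> dict:
        """Send one request and block until the matching reply line arrives."""
        self._next_id += 1
        message = {"id": self._next_id, "text": text}
        self._proc.stdin.write(json.dumps(message) + "\n")
        self._proc.stdin.flush()
        return json.loads(self._proc.stdout.readline())

    def close(self) -> int:
        """Closing stdin ends the worker's read loop; returns its exit code."""
        self._proc.stdin.close()
        code = self._proc.wait(timeout=5)
        self._proc.stdout.close()
        return code
```

The orchestrator only ever sees JSON, so swapping the synth means replacing the worker script and nothing on the renderer side.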

## (F) The single most important lesson
**Aliveness = de-correlated, asymmetric micro-motion + reactivity — NOT more animation.** Three sine waves at mismatched periods (breath ~3.2s, head-Y ~3.5s, head-X ~6.5s) + randomized blinks + gaze that drifts and snaps to the cursor + rare stochastic beats — that's the whole illusion, nearly free in layered-2D. The fatal mistakes are the *opposite* of effort: a single symmetric loop (uncanny), a metronome (mechanical), constant drift (nauseating). **Build the dumb version first** — breath + random blink + cursor-gaze on swappable PNGs, premultiplied alpha, click-through window — confirm it reads alive for a full day, *then* add speaking/reacting. Stylized is the cheat code: every imperfect motion reads as character, not glitch.
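
The "dumb version" idle pose can be sketched in a few lines, using the usual `offset + peak * sin(2πt / period)` form for each wave. Only the three periods come from the survey; the peaks, the gaze limits, and every name below (`Wave`, `gaze_offset`, `idle_pose`) are illustrative assumptions in layer-pixel units.

```python
import math
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class Wave:
    offset: float
    peak: float
    period_s: float

    def at(self, t: float) -> float:
        """Value of the sine at time t (seconds)."""
        return self.offset + self.peak * math.sin(2 * math.pi * t / self.period_s)


# Periods from the survey; offsets and peaks are guesses to tune by eye.
BREATH = Wave(offset=1.0, peak=0.01, period_s=3.23)  # torso vertical scale
HEAD_Y = Wave(offset=0.0, peak=2.0, period_s=3.53)   # head bob, px
HEAD_X = Wave(offset=0.0, peak=1.5, period_s=6.53)   # head sway, px


def gaze_offset(
    cursor: Tuple[float, float],
    eye_center: Tuple[float, float],
    max_px: float = 3.0,        # assumed pupil travel limit
    falloff_px: float = 400.0,  # assumed distance for full deflection
) -> Tuple[float, float]:
    """Pupil offset pointing at the cursor, clamped to max_px."""
    dx = cursor[0] - eye_center[0]
    dy = cursor[1] - eye_center[1]
    dist = math.hypot(dx, dy)
    if dist == 0.0:
        return (0.0, 0.0)
    reach = max_px * min(1.0, dist / falloff_px)
    return (dx / dist * reach, dy / dist * reach)


def idle_pose(
    t: float,
    cursor: Tuple[float, float],
    eye_center: Tuple[float, float],
) -> Dict[str, float]:
    """Per-frame transforms for the torso, head, and pupil layers."""
    gx, gy = gaze_offset(cursor, eye_center)
    return {
        "torso_scale_y": BREATH.at(t),
        "head_dx": HEAD_X.at(t),
        "head_dy": HEAD_Y.at(t),
        "pupil_dx": gx,
        "pupil_dy": gy,
    }
```

Because the three periods share no small common multiple, the combined pose takes a long time to repeat. Blink timing and the rare stochastic beats would layer on top of this as separate, randomized events.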

## Key sources
AIRI `github.com/moeru-ai/airi` · Open-LLM-VTuber `github.com/Open-LLM-VTuber/Open-LLM-VTuber` · amica `github.com/semperai/amica` · `github.com/proj-airi/awesome-ai-vtubers` · Live2D Breath SDK (sine params) `docs.live2d.com/en/cubism-sdk-manual/breath/` · Live2D Lip-sync (volume vs phoneme) `docs.live2d.com/en/cubism-sdk-manual/lipsync/` · VTS Lipsync wiki · WS_EX_LAYERED / SetLayeredWindowAttributes (MS Learn) · Electron click-through #1335 · Premultiplied-alpha fringe (Courrèges blog) · idle-animation asymmetry (mocaponline / animschool) · VTS physics/damping · Desktop Mate Steam + negative reviews · Reachy Mini conversation app + Jetson assistant (subprocess-TTS) · pixi-live2d-display

*(GitHub stars are order-of-magnitude, as-fetched. Architecture/technique findings are well-corroborated across sources.)*
