Project: ESP32 Voice Assistant with Emotional Display (XiaoZhi)
Purpose
Build (and document) a voice-assistant device on ESP32 hardware with a display that reacts with emotions (emoji/face changes) while responses are spoken. Follow the XiaoZhi platform concept: the ESP32 handles audio I/O and the UI, while a server does STT/LLM/TTS.
Scope
- Deliverable: a working demo device + project documentation page in this KB.
- In scope:
- ESP32-based client device with mic + speaker + display
- Backend (self-hosted or cloud) for STT + LLM + TTS
- “Emotion tag → emoji/face” UI behavior
- Optional: MCP/tool calls for doing actions (weather, smart home, etc.)
- Out of scope (for first iteration):
- Custom PCB
- Production casing
- Mobile app
Architecture
Client (ESP32 device)
- Wake word detection + voice activity detection (depending on XiaoZhi build)
- Captures microphone audio → streams to server
- Receives response (text + emotion tag + audio) → plays audio + renders emoji face
Server (self-hosted recommended for privacy)
- STT: e.g., Whisper/FunASR
- LLM: local via Ollama or compatible API
- TTS: e.g., CosyVoice / EdgeTTS
- Returns: transcript + response + emotion tag + audio
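The server's return payload can be sketched as a small JSON message the client decodes before playback. This is a minimal illustration; the field names (`transcript`, `response`, `emotion`, `audio`) are assumptions for this page, not the actual XiaoZhi wire format.

```python
import base64
import json

# Hypothetical response schema; field names are illustrative only.
VALID_EMOTIONS = {"HAPPY", "SAD", "CONFUSED", "NEUTRAL", "ANGRY", "THINKING"}

def parse_server_response(raw: bytes) -> dict:
    """Decode one JSON response from the backend into client-ready fields."""
    msg = json.loads(raw)
    emotion = msg.get("emotion", "NEUTRAL")
    if emotion not in VALID_EMOTIONS:
        emotion = "NEUTRAL"  # fall back rather than crash the UI
    return {
        "transcript": msg["transcript"],         # what the user said (STT)
        "response": msg["response"],             # LLM reply text
        "emotion": emotion,                      # drives the face asset
        "audio": base64.b64decode(msg["audio"])  # TTS audio bytes
    }

raw = json.dumps({
    "transcript": "what's the weather",
    "response": "It's sunny today!",
    "emotion": "HAPPY",
    "audio": base64.b64encode(b"\x00\x01").decode(),
}).encode()
print(parse_server_response(raw)["emotion"])  # → HAPPY
```

Falling back to NEUTRAL on an unknown tag keeps a misbehaving LLM from crashing the display code.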
Optional tool layer (MCP)
- MCP servers expose actions (weather, home automation, internal APIs)
Prereqs
- A working Docusaurus KB workflow (repo + deployments)
- Basic ESP32 development tooling (PlatformIO/ESP-IDF) on at least one dev machine
- Team access to:
- Discord
- docs.aurbotstem.com (private)
Inputs
Reference (source)
Recommended hardware options
Choose one track:
Track A (closest to article / premium):
- Elecrow CrowPanel Advance (ESP32-P4 + companion ESP32-C6) with built-in mic/speaker/display
Track B (budget / common):
- ESP32-S3 board + small SPI/I2C display + I2S mic + I2S amp/speaker
Network
- Wi‑Fi access for device and server
Procedure
Step 1 — Decide build track (A or B)
- Track A is faster to get a polished UI.
- Track B is cheaper and more flexible but requires wiring and more driver work.
Step 2 — Stand up backend (self-hosted)
The backend must provide three things:
- Speech-to-text (STT)
- LLM inference
- Text-to-speech (TTS)
Minimum acceptance:
- Send a short audio clip → get text
- Send text → get response text
- Convert response text → audio
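The three acceptance checks above chain into one pipeline. A sketch with injectable stage functions lets each check run against a stub first and the real service (Whisper/FunASR, Ollama, CosyVoice/EdgeTTS) later; the stubs here are stand-ins, not real APIs.

```python
from typing import Callable

def run_pipeline(
    audio_in: bytes,
    stt: Callable[[bytes], str],
    llm: Callable[[str], str],
    tts: Callable[[str], bytes],
) -> tuple[str, str, bytes]:
    """Chain the three backend stages; each one is swappable."""
    transcript = stt(audio_in)   # acceptance 1: audio clip -> text
    reply = llm(transcript)      # acceptance 2: text -> response text
    audio_out = tts(reply)       # acceptance 3: response text -> audio
    return transcript, reply, audio_out

# Stub stages stand in for the real services during bring-up.
t, r, a = run_pipeline(
    b"...pcm...",
    stt=lambda pcm: "hello",
    llm=lambda text: f"You said: {text}",
    tts=lambda text: text.encode(),
)
print(r)  # → You said: hello
```

Swapping a stub for the real service isolates failures to one stage at a time.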
Step 3 — Bring up ESP32 client
- Confirm mic capture
- Confirm speaker playback
- Confirm display rendering
- Implement/enable the “emotion tag → face asset” mapping
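The "emotion tag → face asset" mapping is just a lookup table with a safe default. The asset paths below are hypothetical placeholders; substitute your actual sprite/bitmap names (on the device this would live in the firmware, but the logic is the same).

```python
# Hypothetical asset paths; swap in your real face images.
FACE_ASSETS = {
    "HAPPY": "faces/happy.png",
    "SAD": "faces/sad.png",
    "CONFUSED": "faces/confused.png",
    "NEUTRAL": "faces/neutral.png",
    "ANGRY": "faces/angry.png",
    "THINKING": "faces/thinking.png",
}

def face_for(tag: str) -> str:
    """Map a server emotion tag to a display asset; unknown tags -> neutral."""
    return FACE_ASSETS.get(tag.strip().upper(), FACE_ASSETS["NEUTRAL"])

print(face_for("happy"))    # → faces/happy.png
print(face_for("excited"))  # → faces/neutral.png
```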
Step 4 — Add MCP tools (optional but high-value)
Start with a tiny set:
- Weather
- Home Assistant action
- “Set backlight” / “volume” / “status” actions
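The tool set above boils down to a name-to-handler dispatch. The registry below is illustrative only (real MCP servers expose tools over the Model Context Protocol, and the handlers here are stubs), but it shows the fail-soft shape worth keeping: an unknown tool should produce a spoken error, not a crash.

```python
# Illustrative tool registry; handlers are stubs, not real integrations.
def get_weather(city: str = "local") -> str:
    return f"Weather for {city}: (stub)"

def set_volume(level: int) -> str:
    level = max(0, min(100, level))  # clamp to a safe range
    return f"Volume set to {level}"

TOOLS = {"weather": get_weather, "set_volume": set_volume}

def call_tool(name: str, **kwargs) -> str:
    if name not in TOOLS:
        return f"Unknown tool: {name}"  # fail soft, keep the assistant talking
    return TOOLS[name](**kwargs)

print(call_tool("set_volume", level=130))  # → Volume set to 100
```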
Step 5 — Documentation (team standard)
- Keep a running changelog
- Add screenshots/photos
- Document:
- hardware used
- wiring (if Track B)
- backend compose files / environment
- troubleshooting + known issues
Config
Emotion tagging
- Define a small stable set:
- HAPPY, SAD, CONFUSED, NEUTRAL, ANGRY, THINKING
- Ensure the LLM response includes one of these tags.
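One way to enforce this is to prompt the LLM to end each reply with a bracketed tag (e.g. "Nice to meet you! [HAPPY]") and parse it off before TTS. That convention is an assumption for this page, not part of XiaoZhi; a minimal parser with a NEUTRAL fallback:

```python
import re

EMOTIONS = {"HAPPY", "SAD", "CONFUSED", "NEUTRAL", "ANGRY", "THINKING"}
# Assumes the LLM is prompted to end replies with a bracketed tag.
TAG_RE = re.compile(r"\[([A-Z]+)\]\s*$")

def split_reply(text: str) -> tuple[str, str]:
    """Return (spoken_text, emotion); default to NEUTRAL on a bad/missing tag."""
    m = TAG_RE.search(text)
    if m and m.group(1) in EMOTIONS:
        return text[:m.start()].rstrip(), m.group(1)
    return text.strip(), "NEUTRAL"

print(split_reply("I'm not sure I understood that. [CONFUSED]"))
```

Stripping the tag before TTS matters; otherwise the device will say "open bracket confused close bracket" aloud.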
Privacy
- If self-hosted: ensure audio does not leave the LAN.
Verification
- Device responds to voice prompt end-to-end (speak → hear reply)
- Display changes expression based on response tag
- End-to-end latency acceptable (roughly under 3–5 s on LAN for the first iteration)
- (Optional) MCP action can be triggered via voice and verified
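For the latency check, timing each stage separately (STT, LLM, TTS, playback) shows where the budget goes. A minimal server-side sketch, with a `time.sleep` standing in for a real stage call:

```python
import time
from contextlib import contextmanager

timings: dict[str, float] = {}

@contextmanager
def timed(stage: str):
    """Record the wall-clock duration of one pipeline stage in `timings`."""
    start = time.monotonic()
    try:
        yield
    finally:
        timings[stage] = time.monotonic() - start

# Wrap each real stage (stt, llm, tts, ...) the same way.
with timed("stt"):
    time.sleep(0.01)  # stand-in for the real STT call

total = sum(timings.values())
print(f"total: {total:.2f}s, budget ok: {total < 5.0}")
```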
Rollback
- Disable MCP tools first if instability occurs
- Fall back to cloud backend if local inference is too slow
- Revert to a minimal UI (single neutral face) if emotion assets cause crashes
Troubleshooting
- No audio capture: verify mic wiring/I2S config and sample rate
- Crackly audio: check I2S format, amplifier gain, speaker power
- Wi‑Fi instability: confirm antenna, power supply, and (if applicable) SDIO bus settings
- Wrong face/emotion: enforce strict tag list and validate parser
References
- XiaoZhi concept explained in the XDA article (client-server split; emotion display)
- Elecrow port (mentioned in article): https://github.com/Elecrow-RD/CrowPanel-Advanced-7inch-ESP32-P4-HMI-AI-Display-1024x600-IPS-Touch-Screen/tree/master/example/V1.0/idf-code/7_9_10.1_P4_HMI_AI
Changelog
- 2026-02-13: Created project page from XDA reference.