Project: ESP32 Voice Assistant with Emotional Display (XiaoZhi)

Purpose

Build (and document) a voice assistant device using ESP32 hardware + a display that reacts with emotions (emoji/face changes) while speaking responses. The build follows the XiaoZhi platform concept: the ESP32 handles audio I/O + UI, while a server does STT/LLM/TTS.

Scope

  • Deliverable: a working demo device + project documentation page in this KB.
  • In scope:
    • ESP32-based client device with mic + speaker + display
    • Backend (self-hosted or cloud) for STT + LLM + TTS
    • “Emotion tag → emoji/face” UI behavior
    • Optional: MCP/tool calls for doing actions (weather, smart home, etc.)
  • Out of scope (for first iteration):
    • Custom PCB
    • Production casing
    • Mobile app

Architecture

Client (ESP32 device)

  • Wake word detection + voice activity detection (depending on XiaoZhi build)
  • Captures microphone audio → streams to server
  • Receives response (text + emotion tag + audio) → plays audio + renders emoji face
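The client's handling of a server reply can be sketched as a small parser. The JSON field names below (`text`, `emotion`) are assumptions for illustration; the actual XiaoZhi wire format may differ, and audio typically arrives as a separate stream.

```python
import json

# Tag set from the Config section below; unknown tags fall back to NEUTRAL.
ALLOWED_EMOTIONS = {"HAPPY", "SAD", "CONFUSED", "NEUTRAL", "ANGRY", "THINKING"}

def parse_reply(frame: str) -> dict:
    """Parse a server reply frame into text + a validated emotion tag."""
    msg = json.loads(frame)
    emotion = str(msg.get("emotion", "NEUTRAL")).upper()
    if emotion not in ALLOWED_EMOTIONS:
        emotion = "NEUTRAL"  # never crash the UI on an unexpected tag
    return {"text": msg.get("text", ""), "emotion": emotion}
```

Normalizing and validating the tag on the client keeps the display robust even if the server or LLM misbehaves.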

Server (self-hosted recommended for privacy)

  • STT: e.g., Whisper/FunASR
  • LLM: local via Ollama or compatible API
  • TTS: e.g., CosyVoice / EdgeTTS
  • Returns: transcript + response + emotion tag + audio
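The server-side flow is a simple pipeline: STT → LLM → TTS, returning all four fields above. A minimal orchestration sketch, with stub backends standing in for Whisper/Ollama/TTS (the stubs and field names are illustrative, not a fixed API):

```python
from typing import Callable

def run_turn(audio: bytes,
             stt: Callable[[bytes], str],
             llm: Callable[[str], tuple[str, str]],
             tts: Callable[[str], bytes]) -> dict:
    """One conversation turn: audio in -> transcript, reply, emotion, audio out."""
    transcript = stt(audio)
    reply, emotion = llm(transcript)  # LLM returns (response text, emotion tag)
    return {
        "transcript": transcript,
        "response": reply,
        "emotion": emotion,
        "audio": tts(reply),
    }

# Stub backends to show the data flow end-to-end:
result = run_turn(
    b"\x00\x01",
    stt=lambda a: "what's the weather",
    llm=lambda t: ("It looks sunny today.", "HAPPY"),
    tts=lambda r: b"<wav bytes>",
)
```

Keeping the three stages behind plain callables makes it easy to swap Whisper for FunASR, or a local Ollama model for a cloud API, without touching the pipeline.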

Optional tool layer (MCP)

  • MCP servers expose actions (weather, home automation, internal APIs)

Prereqs

  • A working Docusaurus KB workflow (repo + deployments)
  • Basic ESP32 development tooling (PlatformIO/ESP-IDF) on at least one dev machine
  • Team access to:
    • Discord
    • docs.aurbotstem.com (private)

Inputs

Reference (source)

Choose one track:

Track A (closest to article / premium):

  • Elecrow CrowPanel Advance (ESP32-P4 + companion ESP32-C6) with built-in mic/speaker/display

Track B (budget / common):

  • ESP32-S3 board + small SPI/I2C display + I2S mic + I2S amp/speaker

Network

  • Wi‑Fi access for device and server

Procedure

Step 1 — Decide build track (A or B)

  • Track A is faster to get a polished UI.
  • Track B is cheaper and more flexible but requires wiring and more driver work.

Step 2 — Stand up backend (self-hosted)

The backend must provide three things:

  1. Speech-to-text (STT)
  2. LLM inference
  3. Text-to-speech (TTS)

Minimum acceptance:

  • Send a short audio clip → get text
  • Send text → get response text
  • Convert response text → audio

Step 3 — Bring up ESP32 client

  • Confirm mic capture
  • Confirm speaker playback
  • Confirm display rendering
  • Implement/enable the “emotion tag → face asset” mapping
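The "emotion tag → face asset" mapping can be a single lookup table with a neutral fallback. The asset filenames below are hypothetical; the real names depend on the display library and assets used:

```python
# Hypothetical asset filenames — substitute whatever the UI actually ships.
FACE_ASSETS = {
    "HAPPY": "face_happy.png",
    "SAD": "face_sad.png",
    "CONFUSED": "face_confused.png",
    "NEUTRAL": "face_neutral.png",
    "ANGRY": "face_angry.png",
    "THINKING": "face_thinking.png",
}

def face_for(tag: str) -> str:
    """Map an emotion tag to a face asset, defaulting to the neutral face."""
    return FACE_ASSETS.get(tag.strip().upper(), FACE_ASSETS["NEUTRAL"])
```

The same table (ported to the firmware language) doubles as the rollback path: point every key at the neutral asset to get the minimal single-face UI.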

Step 4 — Add MCP tools (optional but high-value)

Start with a tiny set:

  • Weather
  • Home Assistant action
  • “Set backlight” / “volume” / “status” actions
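The dispatch logic behind a tiny tool set can be sketched as a registry. A real MCP server would expose these over the MCP protocol; the tool name `set_volume` and its behavior here are illustrative only:

```python
from typing import Callable

TOOLS: dict[str, Callable[..., str]] = {}

def tool(name: str):
    """Decorator that registers a callable under a tool name."""
    def register(fn):
        TOOLS[name] = fn
        return fn
    return register

@tool("set_volume")
def set_volume(level: int) -> str:
    level = max(0, min(100, level))  # clamp to a safe range
    return f"volume set to {level}"

def dispatch(name: str, **kwargs) -> str:
    """Route a named tool call; fail soft on unknown tools."""
    if name not in TOOLS:
        return f"unknown tool: {name}"
    return TOOLS[name](**kwargs)
```

Failing soft on unknown tool names matters here: per the Rollback section, tools are the first thing to disable when the system misbehaves.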

Step 5 — Documentation (team standard)

  • Keep a running changelog
  • Add screenshots/photos
  • Document:
    • hardware used
    • wiring (if Track B)
    • backend compose files / environment
    • troubleshooting + known issues

Config

Emotion tagging

  • Define a small stable set:
    • HAPPY, SAD, CONFUSED, NEUTRAL, ANGRY, THINKING
  • Ensure the LLM response includes one of these tags.
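One simple convention (an assumption, not part of the XiaoZhi spec) is to prompt the LLM to prefix every reply with a bracketed tag, e.g. `[HAPPY] Glad to help!`, then strip and validate it server-side:

```python
import re

EMOTIONS = {"HAPPY", "SAD", "CONFUSED", "NEUTRAL", "ANGRY", "THINKING"}

def split_tagged_reply(reply: str) -> tuple[str, str]:
    """Extract a leading [TAG] from an LLM reply; fall back to NEUTRAL."""
    m = re.match(r"\s*\[([A-Z]+)\]\s*(.*)", reply, re.DOTALL)
    if m and m.group(1) in EMOTIONS:
        return m.group(1), m.group(2)
    return "NEUTRAL", reply.strip()  # untagged or unknown tag: safe default
```

Validating against the fixed tag set here is what the Troubleshooting entry for "wrong face/emotion" relies on.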

Privacy

  • If self-hosted: ensure audio does not leave the LAN.
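A quick sanity check is to verify that every configured backend endpoint resolves to a private (RFC 1918) or loopback address before the client streams audio to it:

```python
import ipaddress
import socket

def is_lan_host(host: str) -> bool:
    """True if the host resolves to a private or loopback address."""
    addr = ipaddress.ip_address(socket.gethostbyname(host))
    return addr.is_private or addr.is_loopback
```

This catches accidental misconfiguration (e.g. a leftover cloud STT URL) but is not a substitute for firewall rules blocking outbound traffic from the device.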

Verification

  • Device responds to voice prompt end-to-end (speak → hear reply)
  • Display changes expression based on response tag
  • End-to-end latency acceptable (target roughly 3–5 s on LAN for the first iteration)
  • (Optional) MCP action can be triggered via voice and verified

Rollback

  • Disable MCP tools first if instability occurs
  • Fall back to cloud backend if local inference is too slow
  • Revert to a minimal UI (single neutral face) if emotion assets cause crashes

Troubleshooting

  • No audio capture: verify mic wiring/I2S config and sample rate
  • Crackly audio: check I2S format, amplifier gain, speaker power
  • Wi‑Fi instability: confirm antenna, power supply, and (if applicable) SDIO bus settings
  • Wrong face/emotion: enforce strict tag list and validate parser

References

Changelog

  • 2026-02-13: Created project page from XDA reference.