Project: ESP32 Voice Assistant with Emotional Display (XiaoZhi)

Purpose

Build (and document) a voice assistant device using ESP32 hardware + a display that reacts with emotions (emoji/face changes) while speaking responses. The build follows the XiaoZhi platform concept: the ESP32 handles audio I/O + UI, while a server does STT/LLM/TTS.

Scope

  • Deliverable: a working demo device + project documentation page in this KB.
  • In scope:
    • ESP32-based client device with mic + speaker + display
    • Backend (self-hosted or cloud) for STT + LLM + TTS
    • “Emotion tag → emoji/face” UI behavior
    • Optional: MCP/tool calls for doing actions (weather, smart home, etc.)
  • Out of scope (for first iteration):
    • Custom PCB
    • Production casing
    • Mobile app

Architecture

Client (ESP32 device)

  • Wake word detection + voice activity detection (depending on XiaoZhi build)
  • Captures microphone audio → streams to server
  • Receives response (text + emotion tag + audio) → plays audio + renders emoji face
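The client's handling of a server reply can be sketched as a small parser. The JSON field names below (`text`, `emotion`) are assumptions for illustration; the actual XiaoZhi wire format may differ, and audio typically arrives as a separate stream.

```python
import json

# Tag set from the Config section below; unknown tags fall back to NEUTRAL.
ALLOWED_EMOTIONS = {"HAPPY", "SAD", "CONFUSED", "NEUTRAL", "ANGRY", "THINKING"}

def parse_reply(frame: str) -> dict:
    """Parse a server reply frame into text + a validated emotion tag."""
    msg = json.loads(frame)
    emotion = str(msg.get("emotion", "NEUTRAL")).upper()
    if emotion not in ALLOWED_EMOTIONS:
        emotion = "NEUTRAL"  # never crash the UI on an unexpected tag
    return {"text": msg.get("text", ""), "emotion": emotion}
```

Normalizing and validating the tag on the client keeps the display robust even if the server or LLM misbehaves.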

Server (self-hosted recommended for privacy)

  • STT: e.g., Whisper/FunASR
  • LLM: local via Ollama or compatible API
  • TTS: e.g., CosyVoice / EdgeTTS
  • Returns: transcript + response + emotion tag + audio
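The server-side flow is a simple pipeline: STT → LLM → TTS, returning all four fields above. A minimal orchestration sketch, with stub backends standing in for Whisper/Ollama/TTS (the stubs and field names are illustrative, not a fixed API):

```python
from typing import Callable

def run_turn(audio: bytes,
             stt: Callable[[bytes], str],
             llm: Callable[[str], tuple[str, str]],
             tts: Callable[[str], bytes]) -> dict:
    """One conversation turn: audio in -> transcript, reply, emotion, audio out."""
    transcript = stt(audio)
    reply, emotion = llm(transcript)  # LLM returns (response text, emotion tag)
    return {
        "transcript": transcript,
        "response": reply,
        "emotion": emotion,
        "audio": tts(reply),
    }

# Stub backends to show the data flow end-to-end:
result = run_turn(
    b"\x00\x01",
    stt=lambda a: "what's the weather",
    llm=lambda t: ("It looks sunny today.", "HAPPY"),
    tts=lambda r: b"<wav bytes>",
)
```

Keeping the three stages behind plain callables makes it easy to swap Whisper for FunASR, or a local Ollama model for a cloud API, without touching the pipeline.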

Optional tool layer (MCP)

  • MCP servers expose actions (weather, home automation, internal APIs)

Prereqs

  • A working Docusaurus KB workflow (repo + deployments)
  • Basic ESP32 development tooling (PlatformIO/ESP-IDF) on at least one dev machine
  • Team access to:
    • Discord
    • docs.aurbotstem.com (private)

Inputs

Reference (source)

Choose one track:

Track A (closest to article / premium):

  • Elecrow CrowPanel Advance (ESP32-P4 + companion ESP32-C6) with built-in mic/speaker/display

Track B (budget / common):

  • ESP32-S3 board + small SPI/I2C display + I2S mic + I2S amp/speaker

Network

  • Wi‑Fi access for device and server

Procedure

Step 1 — Decide build track (A or B)

  • Track A is faster to get a polished UI.
  • Track B is cheaper and more flexible but requires wiring and more driver work.

Step 2 — Stand up backend (self-hosted)

The backend must provide three things:

  1. Speech-to-text (STT)
  2. LLM inference
  3. Text-to-speech (TTS)

Minimum acceptance:

  • Send a short audio clip → get text
  • Send text → get response text
  • Convert response text → audio

Step 3 — Bring up ESP32 client

  • Confirm mic capture
  • Confirm speaker playback
  • Confirm display rendering
  • Implement/enable the “emotion tag → face asset” mapping
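The "emotion tag → face asset" mapping can be a single lookup table with a neutral fallback. The asset filenames below are hypothetical; the real names depend on the display library and assets used:

```python
# Hypothetical asset filenames — substitute whatever the UI actually ships.
FACE_ASSETS = {
    "HAPPY": "face_happy.png",
    "SAD": "face_sad.png",
    "CONFUSED": "face_confused.png",
    "NEUTRAL": "face_neutral.png",
    "ANGRY": "face_angry.png",
    "THINKING": "face_thinking.png",
}

def face_for(tag: str) -> str:
    """Map an emotion tag to a face asset, defaulting to the neutral face."""
    return FACE_ASSETS.get(tag.strip().upper(), FACE_ASSETS["NEUTRAL"])
```

The same table (ported to the firmware language) doubles as the rollback path: point every key at the neutral asset to get the minimal single-face UI.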

Step 4 — Add MCP tools (optional but high-value)

Start with a tiny set:

  • Weather
  • Home Assistant action
  • “Set backlight” / “volume” / “status” actions
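The dispatch logic behind a tiny tool set can be sketched as a registry. A real MCP server would expose these over the MCP protocol; the tool name `set_volume` and its behavior here are illustrative only:

```python
from typing import Callable

TOOLS: dict[str, Callable[..., str]] = {}

def tool(name: str):
    """Decorator that registers a callable under a tool name."""
    def register(fn):
        TOOLS[name] = fn
        return fn
    return register

@tool("set_volume")
def set_volume(level: int) -> str:
    level = max(0, min(100, level))  # clamp to a safe range
    return f"volume set to {level}"

def dispatch(name: str, **kwargs) -> str:
    """Route a named tool call; fail soft on unknown tools."""
    if name not in TOOLS:
        return f"unknown tool: {name}"
    return TOOLS[name](**kwargs)
```

Failing soft on unknown tool names matters here: per the Rollback section, tools are the first thing to disable when the system misbehaves.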

Step 5 — Documentation (team standard)

  • Keep a running changelog
  • Add screenshots/photos
  • Document:
    • hardware used
    • wiring (if Track B)
    • backend compose files / environment
    • troubleshooting + known issues

Config

Emotion tagging

  • Define a small stable set:
    • HAPPY, SAD, CONFUSED, NEUTRAL, ANGRY, THINKING
  • Ensure the LLM response includes one of these tags.
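One simple convention (an assumption, not part of the XiaoZhi spec) is to prompt the LLM to prefix every reply with a bracketed tag, e.g. `[HAPPY] Glad to help!`, then strip and validate it server-side:

```python
import re

EMOTIONS = {"HAPPY", "SAD", "CONFUSED", "NEUTRAL", "ANGRY", "THINKING"}

def split_tagged_reply(reply: str) -> tuple[str, str]:
    """Extract a leading [TAG] from an LLM reply; fall back to NEUTRAL."""
    m = re.match(r"\s*\[([A-Z]+)\]\s*(.*)", reply, re.DOTALL)
    if m and m.group(1) in EMOTIONS:
        return m.group(1), m.group(2)
    return "NEUTRAL", reply.strip()  # untagged or unknown tag: safe default
```

Validating against the fixed tag set here is what the Troubleshooting entry for "wrong face/emotion" relies on.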

Privacy

  • If self-hosted: ensure audio does not leave the LAN.
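A quick sanity check is to verify that every configured backend endpoint resolves to a private (RFC 1918) or loopback address before the client streams audio to it:

```python
import ipaddress
import socket

def is_lan_host(host: str) -> bool:
    """True if the host resolves to a private or loopback address."""
    addr = ipaddress.ip_address(socket.gethostbyname(host))
    return addr.is_private or addr.is_loopback
```

This catches accidental misconfiguration (e.g. a leftover cloud STT URL) but is not a substitute for firewall rules blocking outbound traffic from the device.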

Verification

  • Device responds to voice prompt end-to-end (speak → hear reply)
  • Display changes expression based on response tag
  • End-to-end latency acceptable (target roughly 3–5 s on LAN for the first iteration)
  • (Optional) MCP action can be triggered via voice and verified

Rollback

  • Disable MCP tools first if instability occurs
  • Fall back to cloud backend if local inference is too slow
  • Revert to a minimal UI (single neutral face) if emotion assets cause crashes

Troubleshooting

  • No audio capture: verify mic wiring/I2S config and sample rate
  • Crackly audio: check I2S format, amplifier gain, speaker power
  • Wi‑Fi instability: confirm antenna, power supply, and (if applicable) SDIO bus settings
  • Wrong face/emotion: enforce strict tag list and validate parser

References

Changelog

  • 2026-02-13: Created project page from XDA reference.