You have already seen a photo of the speaker above. Aigiz and his team of enthusiasts have come to know this smart speaker from many different perspectives over its years of evolution:
The smart speaker device is crucial for adoption: it has to be cost-effective, powerful, and sturdy enough. Only then can it be placed in many families and kindergartens for the children to talk to.
Under the hood, it is essentially a simple PCB powered by an ESP32-S3 microcontroller, complete with a microphone, a few LEDs, and buttons. The ESP32-S3 is an inexpensive, compact microcontroller typically favored by hobbyists for building remote temperature sensors and IoT automation.
It has a tiny amount of memory (just 512 KB of RAM for both data and instructions, 8 MB of PSRAM, and 16 MB of flash storage) and only two cores.
Despite these limitations, Aigiz and his team have successfully transformed this board into a functional, Wi-Fi-connected smart speaker. The device runs:
A small, specialized wake-word detection model (a machine learning model trained to detect a specific word or phrase).
An audio processing pipeline.
A streaming client to maintain continuous communication with the “brains” of the system.
The software itself is native C++ code developed using the ESP-IDF framework, with firmware updates deployed over-the-air (OTA).
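To make the moving parts concrete, here is a toy sketch of how these three pieces fit together. It is written in Python purely for readability (the real firmware is native C++ on ESP-IDF), and every name and threshold in it is illustrative rather than taken from the Homai codebase:

```python
import asyncio
import random

def wake_word_detected(frame: bytes) -> bool:
    # Stand-in for the tiny on-device wake-word model; the real device
    # runs a specialized ML model, not a random trigger.
    return random.random() < 0.02

async def mic_frames():
    # Stand-in for the audio pipeline: ~20 ms frames of 16 kHz mono audio.
    while True:
        await asyncio.sleep(0.02)
        yield bytes(640)  # 320 samples * 2 bytes per sample

async def device_loop(stream_to_server) -> None:
    """Idle until the wake word fires, then hand audio to the streaming client."""
    async for frame in mic_frames():
        if wake_word_detected(frame):
            await stream_to_server(frame)
            return  # toy version: stop after one interaction

async def main():
    async def fake_stream(frame: bytes):
        print(f"streaming {len(frame)} bytes to the server...")
    await device_loop(fake_stream)

asyncio.run(main())
```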
Homai Server is a more traditional application. At its heart is an event-driven coordination server written in Go. It maintains connections with all the devices via WebSockets and manages their state through stages like “listening”, “running ASR”, or “sending response”.
This backend is also responsible for resilience in the face of failures, for authentication, and for managing jobs for the machine learning models and external services.
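To illustrate the coordination pattern, here is a minimal sketch in Python using the `websockets` library. The production server is written in Go, and the message format below is invented for the example; only the state names come from the description above:

```python
import asyncio
import enum
import json

import websockets  # pip install websockets

class DeviceState(enum.Enum):
    LISTENING = "listening"
    RUNNING_ASR = "running ASR"
    SENDING_RESPONSE = "sending response"

async def handle_device(ws):
    """One coroutine per connected speaker, tracking its conversational state."""
    state = DeviceState.LISTENING
    async for message in ws:
        event = json.loads(message)  # hypothetical JSON message format
        if state is DeviceState.LISTENING and event.get("type") == "audio_done":
            state = DeviceState.RUNNING_ASR
            # In the real system this would enqueue a job for a GPU agent
            # and await its result; here we fake the transcription.
            transcript = "(transcribed speech)"
            state = DeviceState.SENDING_RESPONSE
            await ws.send(json.dumps({"type": "reply", "text": transcript}))
            state = DeviceState.LISTENING

async def main():
    async with websockets.serve(handle_device, "0.0.0.0", 8765):
        await asyncio.Future()  # run forever

if __name__ == "__main__":
    asyncio.run(main())
```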
If the backend is the heart, then the GPU servers running custom machine learning models are the muscles.
There are custom fine-tuned machine learning models per language. Speech recognition is currently based on Wav2Vec2-BERT, while text-to-speech runs on VITS (a conditional variational autoencoder with adversarial learning). However, these architectures are an implementation detail; they can change rapidly, tracking the current state of the art in speech and language technology.
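Homai's fine-tuned per-language checkpoints are not public, but the same architecture families are available through Hugging Face `transformers`. A minimal sketch, where the ASR model id is a hypothetical placeholder and the TTS example uses a public MMS checkpoint of the VITS architecture as a stand-in:

```python
import torch
from transformers import pipeline, VitsModel, AutoTokenizer

# Speech recognition: a Wav2Vec2-BERT encoder fine-tuned with a CTC head.
# "your-org/w2v-bert-2.0-bashkir-ctc" is a made-up checkpoint name.
asr = pipeline(
    "automatic-speech-recognition",
    model="your-org/w2v-bert-2.0-bashkir-ctc",
)
print(asr("sample.wav")["text"])

# Text-to-speech: VITS, here via a public MMS checkpoint.
tts_model = VitsModel.from_pretrained("facebook/mms-tts-eng")
tts_tokenizer = AutoTokenizer.from_pretrained("facebook/mms-tts-eng")
inputs = tts_tokenizer("Hello from a VITS model", return_tensors="pt")
with torch.no_grad():
    waveform = tts_model(**inputs).waveform  # (batch, samples)
```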
These GPU servers run Python-based agents. They continuously pull jobs from the backend server, run them through the GPU pipelines, and push the results back to the server. Multiple GPU servers can run in parallel to provide redundancy and load balancing.
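The agent loop itself is conceptually simple. A minimal sketch, assuming a plain HTTP job API on the backend; the endpoint names and payload shapes here are invented for illustration:

```python
import time
import requests  # pip install requests

BACKEND = "https://backend.example.com"  # placeholder URL

def run_pipeline(job: dict) -> dict:
    # Stand-in for the actual GPU work (ASR, TTS, ...).
    return {"job_id": job["id"], "result": "..."}

def agent_loop() -> None:
    while True:
        # Long-poll the backend for the next job. Several agents can do
        # this in parallel, which gives redundancy and load balancing.
        resp = requests.get(f"{BACKEND}/jobs/next", timeout=30)
        if resp.status_code == 204:  # no work available right now
            time.sleep(1)
            continue
        job = resp.json()
        result = run_pipeline(job)
        requests.post(f"{BACKEND}/jobs/{job['id']}/result", json=result)

if __name__ == "__main__":
    agent_loop()
```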
There is also a special type of server responsible for cultural intelligence and overall conversation: the agentic server. It is likewise written in Python. It keeps track of conversations, detects intents, and integrates with third-party information sources.
For example, when Homai is used in classes, teachers frequently prepare class notes and exercises and upload them to their profiles via a dedicated website for pedagogues. The associated Homai device can then refer to that material when teachers mention it during class. The agentic server implements all the required functionality for that, along with managing other language-specific bits of knowledge.
As you may have already guessed, this part is implemented as a specialised, advanced RAG system. It uses patterns and practices similar to the ones discussed in the Enterprise RAG Challenge (see How I Won the Enterprise RAG Challenge).
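At its core, the retrieval step can be pictured like this. A minimal sketch assuming embedding-based retrieval over pre-chunked class notes; the model choice, chunking, and sample data are all illustrative, and the production pipeline is considerably more advanced:

```python
import numpy as np
from sentence_transformers import SentenceTransformer  # pip install sentence-transformers

model = SentenceTransformer("all-MiniLM-L6-v2")

# Teacher-uploaded notes, already split into chunks.
chunks = [
    "Exercise 3: count from one to ten in Bashkir.",
    "Vocabulary list: family members.",
    "Grammar note: plural suffixes.",
]
chunk_vecs = model.encode(chunks, normalize_embeddings=True)

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k note chunks most similar to what the teacher said."""
    q = model.encode([query], normalize_embeddings=True)[0]
    scores = chunk_vecs @ q  # cosine similarity (vectors are normalized)
    return [chunks[i] for i in np.argsort(-scores)[:k]]

print(retrieve("let's do the counting exercise"))
```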
Technology-wise, the project uses Nix and NixOS to manage multiple deployments (and deployment stages) and to connect them via a private, secure network. In addition to the servers mentioned above, there are also components for observability and logging, SSL termination, serving content and APIs, and managing firmware updates.
AI gets no credit on this team slide
Advancements in AI, particularly through open-source research and the release of powerful, multimodal language models, have made it technically viable to capture speech and preserve cultural heritage effectively. Collaborative efforts from linguists worldwide have further lowered the barriers to training custom speech recognition and text-to-speech models, enabling even individual researchers to accomplish this.
However, speech recognition and generation alone are insufficient for cultural preservation; a smart assistant requires intelligence and cultural insight. Recent breakthroughs in LLMs, advanced RAG, and reasoning architectures helped here.
Our LLM Benchmarks, the Enterprise RAG Challenge, and the insights collected in AI Cases contributed to this progress. They helped us make efficient design decisions and, in turn, drew inspiration from the successes of Homai. The power of international collaboration between talented teams became a source of inspiration and motivation for the “AI Strategy & Research Hub” at TIMETOACT GROUP Austria. Its purpose is to coordinate practical AI R&D in the community and to push the state of the art forward together.