Building Crito Voice AI: what actually worked in production
Crito Voice AI is the AI phone receptionist in the Crito suite at Skyware IT. I built the first production version independently, end to end, for the company. It picks up the phone for a hotel, handles a real conversation with a guest, books rooms, answers questions about check-in times and policies, and transfers to a human when the conversation needs one. The work taught me more about real-time systems than anything else I have shipped.
I want to write down the things that mattered most while they are still fresh, because almost every important decision in the system came from an actual failure in production, not from reading a tutorial about voice AI. The model is the easy part. Everything around the model is where the work lives.
Latency is the actual product
People can feel the difference between a 600-millisecond response and a 900-millisecond response, even if they cannot tell you why. Below 600 milliseconds, the agent sounds alive. Above 900 milliseconds, the conversation feels broken in a way that makes guests hang up. The first version I built was correct and slow. It produced perfect answers, and nobody could stand talking to it. The second version was correct and fast, and people stopped noticing it was a bot.
Most of the speed improvements did not come from a smarter model. They came from streaming text to speech instead of waiting for full responses, from processing partial transcripts as soon as they arrived, from running tool calls in parallel where it was safe, and from pre-warming connections to every service in the pipeline before the call connected. None of these are exotic ideas, but doing all of them at once is what made the agent feel natural.
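Two of the ideas above can be sketched in a few lines. This is a minimal illustration, not the production code: the token stream is faked, the tool functions (check_availability, get_policy) and their return values are hypothetical, and a real system would hand each sentence to a streaming TTS client instead of collecting it in a list.

```python
import asyncio
from typing import AsyncIterator

SENTENCE_END = (".", "?", "!")

async def sentence_chunks(tokens: AsyncIterator[str]) -> AsyncIterator[str]:
    """Yield each sentence as soon as it is complete, so speech synthesis
    can start before the model has finished the whole reply.
    Naive on purpose: abbreviations such as "Dr." would split early."""
    buffer = ""
    async for token in tokens:
        buffer += token
        if buffer.rstrip().endswith(SENTENCE_END):
            yield buffer.strip()
            buffer = ""
    if buffer.strip():
        yield buffer.strip()

async def fake_model_stream() -> AsyncIterator[str]:
    # Stand-in for a streaming model response (assumption for the demo).
    for token in ["Check-in ", "is ", "at ", "3 pm. ", "Would ", "you ", "like ", "a ", "room?"]:
        await asyncio.sleep(0)
        yield token

async def collect_sentences() -> list[str]:
    return [s async for s in sentence_chunks(fake_model_stream())]

# Hypothetical read-only tools: safe to run side by side because
# neither one changes state.
async def check_availability(day: str) -> dict:
    await asyncio.sleep(0.05)
    return {"date": day, "rooms_free": 3}

async def get_policy(topic: str) -> dict:
    await asyncio.sleep(0.05)
    return {"topic": topic, "text": "Check-in from 3 pm"}

async def run_read_only_tools() -> list[dict]:
    # gather() runs both lookups concurrently and keeps argument order.
    return await asyncio.gather(
        check_availability("2025-01-10"),
        get_policy("check-in"),
    )

if __name__ == "__main__":
    print(asyncio.run(collect_sentences()))
    print(asyncio.run(run_read_only_tools()))
```

The parallel path is only safe for lookups. Anything that writes, such as creating a booking, should still run on its own and in order.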
The tool layer is where the system lives or dies
The model is responsible for understanding what the caller wants and deciding which action to take. Everything that actually happens (the availability check, the booking creation, the payment link, the transfer to the front desk) runs through a tool layer that I wrote by hand. Each tool is a small, strictly typed function with clear input validation, clear failure modes, and a clear contract for what it does. The model picks the right tool most of the time, and when it does not, the system has to behave well anyway.
I have learned to treat tool design with at least as much care as the prompt. A clean tool signature gives the model fewer ways to make a mistake. A clean error response gives the model a fair chance to recover. A vague tool with optional fields and ambiguous behaviour will produce confusing conversations no matter how good the model is at the language layer.
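As a rough illustration of what a clean signature and a clean error response can look like, here is a sketch of the argument parsing for an availability tool. The field names, error codes, and the guest limit of 8 are assumptions made for the example, not the actual Crito tool contract.

```python
from dataclasses import dataclass
from datetime import date
from typing import Union

@dataclass(frozen=True)
class AvailabilityRequest:
    # Every field is required: no optional inputs for the model to guess at.
    check_in: date
    check_out: date
    guests: int

@dataclass(frozen=True)
class ToolError:
    code: str      # stable identifier for logs and tests
    message: str   # plain sentence the model can act on or relay

MAX_GUESTS = 8  # hypothetical limit for the example

def parse_availability_args(raw: dict) -> Union[AvailabilityRequest, ToolError]:
    """Validate model-supplied arguments before any backend is touched."""
    try:
        check_in = date.fromisoformat(raw["check_in"])
        check_out = date.fromisoformat(raw["check_out"])
        guests = int(raw["guests"])
    except KeyError as exc:
        return ToolError("missing_field", f"Missing required field: {exc.args[0]}")
    except (TypeError, ValueError):
        return ToolError(
            "bad_format",
            "Dates must be YYYY-MM-DD and guests must be a whole number.",
        )
    if check_out <= check_in:
        return ToolError("bad_range", "check_out must be after check_in.")
    if not 1 <= guests <= MAX_GUESTS:
        return ToolError("bad_guests", f"guests must be between 1 and {MAX_GUESTS}.")
    return AvailabilityRequest(check_in, check_out, guests)
```

Returning a ToolError value instead of raising keeps the failure inside the conversation: the model sees a specific message, can ask the caller for the missing detail, and can try the call again.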
Log everything that is hard to reproduce later
Real-time voice calls are almost impossible to debug from a single complaint. By the time someone tells you a call went wrong, the call is over, the audio is gone, and the model has long forgotten the context. The only way I have found to stay sane is to log everything that would be hard to reconstruct later. That means full transcripts, every tool call with its arguments and result, latency on every step of the pipeline, and the reason every call ended.
With that data in one place, I can replay a flaky call from start to finish and see exactly where the assistant lost the thread. Without it, every investigation is guesswork. The cost of building the logging layer felt high at the time. Six months in, it has paid for itself many times over.
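A per-call log along these lines does not need to be elaborate. The sketch below assumes an append-only list of events written out as JSON lines; the event kinds, field names, and the slow_steps helper are illustrative choices, not the real schema.

```python
import json
import time
from dataclasses import asdict, dataclass, field

@dataclass
class CallEvent:
    kind: str    # e.g. "transcript", "tool_call", "latency", "call_end"
    data: dict   # payload specific to the event kind
    ts: float = field(default_factory=time.time)

@dataclass
class CallLog:
    call_id: str
    events: list = field(default_factory=list)

    def record(self, kind: str, **data) -> None:
        """Append one event in arrival order so the call can be replayed."""
        self.events.append(CallEvent(kind, data))

    def to_jsonl(self) -> str:
        """One JSON object per line, ready to ship to any log store."""
        return "\n".join(
            json.dumps({"call_id": self.call_id, **asdict(event)})
            for event in self.events
        )

    def slow_steps(self, threshold_ms: float) -> list:
        """Names of pipeline steps whose latency exceeded the threshold."""
        return [
            event.data["step"]
            for event in self.events
            if event.kind == "latency" and event.data["ms"] > threshold_ms
        ]

if __name__ == "__main__":
    log = CallLog("call-0001")
    log.record("transcript", speaker="caller", text="Do you have a room tonight?")
    log.record("tool_call", name="check_availability", args={"guests": 2}, result={"rooms_free": 3})
    log.record("latency", step="llm", ms=950)
    log.record("call_end", reason="caller_hung_up")
    print(log.to_jsonl())
    print(log.slow_steps(600))
```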
Voice AI is not just chat with audio in front of it
People do not talk to a phone agent the way they type to a chatbot. They interrupt. They start a sentence, change their mind halfway through, and expect the agent to follow. They give half answers and assume context. They go quiet for two seconds because they are thinking, then expect the agent to wait. None of that is something you handle by adding more capacity to the language model. You handle it by building a conversation loop that supports barge-in, partial hypotheses, and the difference between a silence that means thinking and a silence that means the network dropped a packet.
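One way to separate the two kinds of silence is to track speech and audio packets independently: if frames are still arriving but contain no speech, the caller is probably thinking, and if frames have stopped arriving, the line is the more likely problem. The sketch below assumes that approach. The thresholds and function names are placeholders chosen for illustration, not tuned production values.

```python
from enum import Enum

class Silence(Enum):
    NONE = "none"          # caller spoke recently
    THINKING = "thinking"  # line is alive, caller is quiet
    DROPPED = "dropped"    # no audio frames are arriving at all

def classify_silence(
    ms_since_speech: int,
    ms_since_audio_packet: int,
    thinking_ms: int = 700,        # assumed threshold
    packet_timeout_ms: int = 500,  # assumed threshold
) -> Silence:
    """Check the transport first: missing packets are a network problem,
    not a pause, regardless of when speech was last detected."""
    if ms_since_audio_packet >= packet_timeout_ms:
        return Silence.DROPPED
    if ms_since_speech >= thinking_ms:
        return Silence.THINKING
    return Silence.NONE

def should_barge_in(
    agent_speaking: bool,
    caller_speech_ms: int,
    min_speech_ms: int = 200,  # assumed: ignore coughs and line noise
) -> bool:
    """Stop the agent's playback once the caller has clearly started talking."""
    return agent_speaking and caller_speech_ms >= min_speech_ms
```

A THINKING result would tell the loop to keep waiting, while DROPPED would trigger a reconnect or a "Are you still there?" prompt.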
What I am working on next
Evaluation is the next big piece. Right now I still review failed calls by hand. That works at our current volume, but it does not scale. The next step is a structured eval suite that scores every call against a few axes, such as booking success rate, transfer appropriateness, and conversational flow, and flags regressions before they reach production. Getting that right is a real engineering problem on its own, and it is the work that will let the product keep improving without me reviewing every recording.
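The regression check at the end of such a suite could be as simple as the sketch below. It assumes each call has already been scored between 0 and 1 on the three axes named above. How those scores are produced is the hard part and is not shown, and the tolerance value is an arbitrary placeholder.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass(frozen=True)
class CallScore:
    booking_success: float       # 1.0 if the booking went through, else 0.0
    transfer_appropriate: float  # 1.0 if the transfer decision was right
    conversational_flow: float   # 0.0 to 1.0, from a rubric or a grader

AXES = ("booking_success", "transfer_appropriate", "conversational_flow")

def aggregate(scores: list) -> dict:
    """Average every axis across a batch of scored calls."""
    return {axis: mean(getattr(score, axis) for score in scores) for axis in AXES}

def regressions(baseline: dict, candidate: dict, tolerance: float = 0.02) -> list:
    """Axes where the candidate build is worse than baseline by more than
    the tolerance. An empty list means nothing regressed."""
    return [axis for axis in AXES if candidate[axis] < baseline[axis] - tolerance]

if __name__ == "__main__":
    baseline = aggregate([CallScore(1.0, 1.0, 0.8), CallScore(1.0, 1.0, 0.8)])
    candidate = aggregate([CallScore(1.0, 0.0, 0.8), CallScore(1.0, 1.0, 0.8)])
    print(regressions(baseline, candidate))
```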