Deepgram’s “State of Voice AI” report shows the industry is broadly bullish about using AI for voice interfaces. While the survey contains useful insights, the main takeaway is clear: AI is bringing voice back into the conversation.
This shouldn’t be news to anyone in the industry.
The idea is simple: AI will make voice matter again, sparking fresh revenue growth. But how?
Voice Is Not Text
Voice is not text. Not from a technology perspective, not from a use case perspective, nor in its regulatory treatment. This may be obvious to many, but it needs to be repeated for context.
For this discussion, we’ll focus only on the technical aspects of a call. As I usually do in big-picture conversations, I’ll paint in broad strokes, adding resolution to the picture only when absolutely necessary.
How Voice Works, the Substack Version
A voice call is computationally intensive. It is a real-time operation. As in, any lag, and people are asking “Hello?” “Are you still there?” over each other. This also means voice is an expensive medium. Connecting two endpoints at the network level is intensive and often thankless work. Just ask the wireless operators (who are still pushing Wi-Fi calling because a voice call is just too expensive on the cellular network).
To Understand Voice, Understand How It’s Billed
Because setting up a call takes a ton of resources, voice billing is fundamentally different from texting, where every text is basically the same.
Voice billing has two key components: 1) the initial billing increment and 2) every successive increment after that.
In retail—the plans you and I buy from wireless carriers—it’s typically a sixty-and-sixty model. That means calls are billed by the minute, rounded up. Whether your call lasts six seconds or fifty-nine seconds, you’re billed for a full minute. If your call hits sixty-one seconds, you’re billed for two minutes.
Of course, in today’s world of unlimited talk-and-text plans, most consumers rarely notice. But in wholesale and A2P markets, this billing model still significantly impacts revenue and costs.
In wholesale markets, billing increments can go as low as six-and-one. That means the initial billing increment is six seconds, with subsequent increments billed per second. If a call lasts four seconds, it’s billed at six seconds. But if it lasts seven seconds, it’s billed exactly at seven seconds.
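The rounding logic above is easy to get wrong, so here is a minimal sketch of how both models compute billed time. The helper name and parameters are my own illustration, not any carrier’s actual rating engine:

```python
def billed_seconds(duration_s: int, initial_s: int, increment_s: int) -> int:
    """Round a call's duration up per the billing increments.

    initial_s:   the initial billing increment (minimum billed time)
    increment_s: the size of each successive increment after the first
    """
    if duration_s <= 0:
        return 0  # assume zero-duration calls aren't billed
    if duration_s <= initial_s:
        return initial_s  # any call shorter than the initial increment bills at it
    # Bill the initial increment, then round the remainder up to whole increments
    extra = duration_s - initial_s
    increments = -(-extra // increment_s)  # ceiling division
    return initial_s + increments * increment_s

# Retail "sixty-and-sixty": billed by the minute, rounded up
assert billed_seconds(6, 60, 60) == 60     # 6-second call -> one full minute
assert billed_seconds(61, 60, 60) == 120   # 61-second call -> two minutes

# Wholesale "six-and-one": 6-second minimum, then per-second billing
assert billed_seconds(4, 6, 1) == 6        # 4-second call -> billed at 6 seconds
assert billed_seconds(7, 6, 1) == 7        # 7-second call -> billed exactly
```

The same function covers both markets; only the two increment parameters change, which is exactly why the initial increment matters so much to the economics of short calls.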
This billing model works well for both parties. The buyer avoids paying for more talk time than they use. The seller, thanks to the initial billing increment, ensures the network setup time is worth at least something.
Sellers rarely charge for disconnected calls; however, most include contractual clauses ensuring that frequent disconnects incur a fee for network overhead.
This matters because short-duration calls have become the scourge of the industry. Every brief call still demands significant computational resources and incurs substantial network costs—without generating meaningful revenue.
Think of every call set up like a handshake before a conversation. If you are only shaking hands and having no conversations, it’s a lot of work for nothing (unless you’re in politics). AI shifts this balance.
By enhancing the user experience through automation, real-time intelligence, and improved voice interactions, AI encourages longer, more productive calls. In other words, AI doesn’t just make voice appealing again; it turns costly short calls into valuable, long-duration conversations.
We’ve Been Here Before (Sort Of)
The telecom industry has chased automated, personalized voice interactions for decades. Remember VoiceXML? It promised automated call handling, reduced costs, and smarter interactions. But VoiceXML fell short—too rigid, too scripted, too cumbersome for practical, everyday use.
In the voice world, VoiceXML went the way Expert Systems did in the AI world. Reality is just too diverse, too complex, and too unpredictable to codify all possible rules in neat text. Don’t believe me? How many times have you enjoyed getting an IVR during a customer service call and just ended up yelling “Representative!!” at the machine on the other end? No? Just me? Ok.
AI, however, represents a genuine leap. It delivers what VoiceXML promised but never quite achieved: dynamic, conversational, adaptable interactions. Rather than forcing users into rigid menu-driven scripts, AI adapts naturally to human behavior, bringing voice interactions closer to genuine conversations.
Finally
Voice has been “dying” for decades—yet it’s still here. From VoiceXML to AI, we keep reimagining what voice interactions can be. But this time feels different. As Naval Ravikant says, “AI is eating UI.” Rather than users adapting to rigid structures like buttons, menus, and screens, AI lets technology adapt to natural human behaviors—speaking, typing, or gesturing.
“Computers,” Naval says, “are learning our language, rather than us having to learn theirs.” This means user interfaces are becoming simpler, more intuitive, almost invisible. As AI makes voice interactions richer and calls longer, revenue opportunities will expand accordingly.
The technology is ready; networks are ready; businesses say they’re ready. Now all that’s left is to see how fast the revenue arrives.