The Forge Event 1: Making AI Voice Conversational Interfaces Work for Everyone

By Damian Calderon

February 20, 2025

Voice AI is one of the fastest-growing areas in artificial intelligence, with rapid advancements in speech recognition, synthesis, and agentic interactions. As noted in a16z’s AI Voice Update 2025, voice-first interfaces are becoming an integral part of AI-driven applications, from virtual assistants to customer service automation. The promise? Seamless, intuitive communication. The reality? Voice AI still struggles with accuracy, especially when dealing with diverse names, accents, and real-world variability.

At Arionkoder, we experienced this firsthand while working on an AI-driven proof of concept for our Healthcare Agents Initiative. The goal was simple: enable an agent to call patients and confirm their identity. But a real-world test exposed a major flaw—when asking for a patient’s last name, our system, powered by state-of-the-art tech like Bland.ai, couldn’t correctly interpret a Portuguese last name spoken with a Portuguese accent. A simple task became a frustrating roadblock.

This problem wasn’t just an edge case; it highlighted broader challenges with precision in voice-based interactions. Confirming names, street addresses, or alphanumeric codes over the phone is already difficult for humans—AI struggles even more. Instead of brushing this aside, we turned it into an opportunity. This became the foundation of The Forge Event 1, our first event under the new ‘The Forge’ format.

Turning Frustration Into a Spike-Worthy Challenge

The Forge was created to provide flexible, focused spaces for rapid experimentation. Instead of a traditional hackathon, we launched a technical spike—a short, intense period of exploration to tackle a single problem and validate potential solutions.

The challenge: improve the accuracy and flexibility of voice interactions when dealing with diverse names, accents, and precise spellings. Participants didn’t start from scratch. We provided:

  • A pre-built repository to fast-track experimentation.
  • Background material on voice processing challenges.
  • Real-world failure cases from our own AI tests.

From there, teams picked from a list of specific challenges to work on, such as:

  • Recognizing accents early and adapting to them.
  • Making the agent use a spelling tool when needed (a minimal sketch follows this list).
  • Developing fallback strategies when speech recognition fails.
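As one concrete example of what a “spelling tool” could look like, here is a minimal Python sketch (illustrative, not any team’s code) that echoes a surname back letter by letter using the NATO phonetic alphabet, so a caller can confirm each letter:

```python
# Minimal "spelling tool" sketch: read a surname back letter by letter
# using the NATO phonetic alphabet. Names and structure are illustrative.
NATO = {
    "a": "Alfa", "b": "Bravo", "c": "Charlie", "d": "Delta",
    "e": "Echo", "f": "Foxtrot", "g": "Golf", "h": "Hotel",
    "i": "India", "j": "Juliett", "k": "Kilo", "l": "Lima",
    "m": "Mike", "n": "November", "o": "Oscar", "p": "Papa",
    "q": "Quebec", "r": "Romeo", "s": "Sierra", "t": "Tango",
    "u": "Uniform", "v": "Victor", "w": "Whiskey", "x": "X-ray",
    "y": "Yankee", "z": "Zulu",
}

def spell_back(name: str) -> str:
    """Turn 'Araújo' into 'A as in Alfa, R as in Romeo, ...'."""
    parts = []
    for ch in name.lower():
        if ch in NATO:
            parts.append(f"{ch.upper()} as in {NATO[ch]}")
        else:
            parts.append(ch)  # keep accents and hyphens verbatim
    return ", ".join(parts)

print(spell_back("Araújo"))
# A as in Alfa, R as in Romeo, A as in Alfa, ú, J as in Juliett, O as in Oscar
```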

This wasn’t about perfecting a final product—it was about exploring what works and what doesn’t, rapidly.

The Results

Team 1: Elisa Lovers (Winner)

Solution: The team demoed real-time voice recognition and generation built on OpenAI’s Realtime API over a WebSocket connection. They showed the system recognizing and generating audio in real time, and switching languages and accents on the fly using a native audio model.
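To make the setup concrete, here is a minimal sketch (not the team’s actual code) of opening a Realtime API session over WebSocket in Python. It assumes the `websockets` package and an `OPENAI_API_KEY` environment variable; note that the header keyword argument name varies between `websockets` versions.

```python
import asyncio
import json
import os

import websockets  # pip install websockets

REALTIME_URL = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"

async def confirm_name_session() -> None:
    headers = {
        "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
        "OpenAI-Beta": "realtime=v1",
    }
    # `additional_headers` in websockets >= 14; older versions use `extra_headers`
    async with websockets.connect(REALTIME_URL, additional_headers=headers) as ws:
        # Configure the session: audio in and out, plus task instructions
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "modalities": ["audio", "text"],
                "instructions": (
                    "Ask the caller for their last name, repeat it back, "
                    "and ask them to spell it if you are not sure."
                ),
            },
        }))
        # In a full client you would stream microphone audio via
        # input_audio_buffer.append events; here we just print transcripts
        async for message in ws:
            event = json.loads(message)
            if event["type"] == "response.audio_transcript.delta":
                print(event["delta"], end="", flush=True)

asyncio.run(confirm_name_session())
```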

Positive aspects: Real-time recognition and generation worked smoothly, and the live switching between languages and accents showed the flexibility of the approach. The demo was well structured and easy to follow, and the team added a Telegram chatbot to assist users further.

Areas for improvement: The solution struggled with long conversations, with the model gradually losing track of context—possibly a consequence of the high cost of the real-time model. The team deployed on ECR/EC2 and managed WebSocket connections themselves, which could limit large-scale deployment.

Technical libraries and tools used:

  • OpenAI API
  • OpenAI Realtime API
  • WebSocket
  • AWS ECR/EC2
  • Telegram API

Team 2: Skillet

Solution: The team demoed surname recognition built on Vosk, a multi-language speech recognition engine, using models downloaded from the Vosk site. They showed the system recognizing surnames in different languages and adapting to the user’s language.
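For reference, the Vosk pipeline looks roughly like the sketch below (again, not the team’s code), which transcribes a recorded surname from a 16 kHz mono WAV file; the model path and file name are placeholders.

```python
import json
import wave

from vosk import Model, KaldiRecognizer  # pip install vosk

# Path to a model downloaded from https://alphacephei.com/vosk/models
model = Model("vosk-model-small-pt-0.3")  # placeholder: a small Portuguese model

wf = wave.open("surname.wav", "rb")  # expects 16 kHz, mono, 16-bit PCM
rec = KaldiRecognizer(model, wf.getframerate())
rec.SetWords(True)  # include per-word timing and confidence in results

while True:
    data = wf.readframes(4000)
    if len(data) == 0:
        break
    if rec.AcceptWaveform(data):
        print(json.loads(rec.Result()).get("text", ""))

# Flush whatever is left in the recognizer's buffer
print(json.loads(rec.FinalResult()).get("text", ""))
```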

Positive aspects: The solution recognized surnames across languages and adapted to the user’s language, demonstrating the flexibility of the approach. The demo was well structured and easy to follow; rather than deploying to the cloud, the team ran it from a local machine for the demo.

Areas for improvement: The solution struggled with surnames containing double letters or accents, likely due to limitations of the Vosk transcription model. The team noted that AWS Polly could handle voice generation, but at a cost that might constrain large-scale deployment.

Technical libraries and tools used:

  • Vosk
  • Vosk models
  • AWS Polly (optional)

Team 3: Surename

Solution: The team demoed last-name recognition using a native voice model and prompt engineering, forked from an OpenAI voice demo and modified with Cursor. They showed the system recognizing last names in different languages and guiding users to speak slowly or restate the name with different emphasis.
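The team’s actual prompt wasn’t shared, but a hypothetical prompt in the same spirit might look like this:

```python
# Hypothetical instructions for a prompt-driven name-confirmation agent;
# illustrative only, not the team's actual prompt.
NAME_CONFIRMATION_PROMPT = """\
You are a phone agent confirming a patient's last name.
1. Ask for the last name and repeat back exactly what you heard.
2. If the caller corrects you, ask them to say it again, slowly.
3. If it still does not match, ask them to spell it letter by letter.
4. Read the spelled letters back before marking the name as confirmed.
"""
```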

Positive aspects: The solution recognized last names across languages and could coach users to speak slowly or restate the name, showing the flexibility of the prompt-driven approach. The demo was well structured and easy to follow, and the team used Gemini Deep Research to investigate and test their solution.

Areas for improvement: The solution struggled with last names containing accents or double letters, likely due to language bias in the model. The team also found another model that could understand video but ran out of time to show it, leaving a potential avenue for future work.

Technical libraries and tools used:

  • OpenAI API
  • Cursor
  • Gemini Deep Research
  • Native voice model (not specified which one)

Overall, all three teams presented impressive solutions, and the winner, Elisa Lovers, demonstrated the most well-rounded approach, recognizing and generating audio in real time.

Why This Matters

Voice interfaces are growing, but their shortcomings often go unnoticed until they create friction for real users. If AI-powered voice interactions can’t reliably handle something as fundamental as people’s names, how can they be trusted in critical applications? By exploring this challenge early, we’re ensuring that Voice AI solutions work for real users in real contexts—not just in perfect test conditions.

With the growth of voice agents expected to accelerate in 2025 and beyond, addressing these gaps will be essential. The Forge will continue to serve as a playground for solving real AI problems, where we can experiment, iterate, and refine solutions that impact both our products and the broader AI landscape.

Stay tuned for our next event, where we will tackle another real challenge in AI to foster collaboration and innovation among our teams.