Revolutionizing Conversational AI: The World’s Fastest Voice Bot

In the ever-evolving field of artificial intelligence, speed is paramount, particularly for voice AI interfaces. Daily, in partnership with Cerebrium, has reached a remarkable milestone by developing a voice bot that boasts a voice-to-voice response time as low as 500 milliseconds. This blog delves into the technological innovations and architectural strategies that have made this groundbreaking achievement possible.

Demo of the AI voice bot: https://fastvoiceagent.cerebrium.ai

The bot replies as quickly as a human would, in a natural-sounding voice.

The Importance of Speed in Voice AI

For natural and seamless conversations, humans typically expect response times around 500 milliseconds. Delays longer than 800 milliseconds can disrupt the conversational flow and feel unnatural. Achieving such rapid response times in AI systems requires meticulous optimization across various technological components.

Core Components and Architecture

To construct this high-speed voice bot, Daily and Cerebrium employed cutting-edge AI models and optimized their deployment within a highly efficient network architecture. Here are the key elements:

  • WebRTC for Audio Transmission: WebRTC (Web Real-Time Communication) is utilized to transmit audio from the user’s device to the cloud, ensuring minimal latency and high reliability.
  • Deepgram’s Models: Deepgram provides fast transcription (speech-to-text) and text-to-speech (TTS) models, both optimized for low latency. Deepgram’s Nova-2 transcription model can deliver transcript fragments in as little as 100 milliseconds, while their Aura voice model achieves a time to first byte as low as 80 milliseconds.
  • Llama 3 LLM: The Llama 3 70B model, a highly capable large language model (LLM), is used for natural language processing. Running on NVIDIA H100 hardware, it can deliver a median time-to-first-token latency of 80 milliseconds. (See our earlier blog post about Llama 3.)
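
To see how these figures add up, here is a rough back-of-the-envelope latency budget. Only the transcription, LLM, and TTS numbers come from the components above; the network and pipeline overhead values are illustrative assumptions, not measured figures.

```python
# Rough voice-to-voice latency budget (milliseconds).
# STT, LLM, and TTS figures are the ones quoted above; the network and
# pipeline overhead entries are illustrative assumptions.
budget_ms = {
    "webrtc_audio_in": 40,           # assumed: user device -> cloud over WebRTC
    "stt_first_fragment": 100,       # Deepgram Nova-2 transcript fragment
    "llm_time_to_first_token": 80,   # Llama 3 70B on NVIDIA H100
    "tts_time_to_first_byte": 80,    # Deepgram Aura
    "webrtc_audio_out": 40,          # assumed: cloud -> user device
    "pipeline_overhead": 60,         # assumed: VAD, end-pointing, data piping
}

total = sum(budget_ms.values())
print(f"Estimated voice-to-voice latency: ~{total} ms")  # ~400 ms, inside the ~500 ms target
```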

Benefits of Self-Hosting

A significant strategy employed is self-hosting the AI models and bot code within the same infrastructure. This approach offers several advantages:

  • Latency Reduction: Running transcription, LLM, and TTS models on the same hardware avoids the latency overhead associated with external network requests, saving 50-200 milliseconds per interaction.
  • Enhanced Control: Self-hosting allows for precise tuning of latency-critical parameters such as voice activity detection and phrase end-pointing.
  • Operational Efficiency: Efficient data piping between models ensures rapid processing of each conversational loop.
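
As a minimal illustration of that last point, one conversational turn in a single self-hosted process might look like the sketch below. The three stage functions are placeholders standing in for the real models, not any particular library's API; the point is simply that no stage has to cross the network.

```python
# Sketch of one conversational turn with all models co-located in one process.
# The stage functions are placeholders for the self-hosted STT, LLM, and TTS models.

def transcribe(audio: bytes) -> str:
    return "what's the weather like today?"   # placeholder for the local speech-to-text model

def generate(prompt: str) -> str:
    return "It looks sunny all afternoon."    # placeholder for the local Llama 3 LLM

def synthesize(text: str) -> bytes:
    return text.encode("utf-8")               # placeholder for the local text-to-speech model

def conversational_turn(audio: bytes) -> bytes:
    text = transcribe(audio)      # gated by voice activity detection and phrase end-pointing
    reply = generate(text)        # no external network request between stages
    return synthesize(reply)      # audio is streamed back to the user over WebRTC

if __name__ == "__main__":
    print(conversational_turn(b"\x00" * 320))
```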

Overcoming Technical Challenges

Achieving low latency requires addressing several technical challenges:

  • AI Model Performance: Ensuring that AI models generate output faster than human speech while maintaining high quality.
  • Network Optimization: Minimizing the time taken for audio data to travel from the user’s device to the cloud and back.
  • GPU Management: Efficiently managing GPU infrastructure to handle the computational demands of AI models.

Looking Forward

The development of the world’s fastest voice bot represents a significant leap in conversational AI, but the journey continues. With ongoing advancements in AI models and network technologies, further improvements in speed and reliability are anticipated. As AI technology evolves, we can expect even more responsive and natural interactions with voice bots, enhancing user experiences across various applications.

Key Takeaway

The collaboration between Daily and Cerebrium has set a new standard in voice AI by achieving unprecedented response times. By leveraging state-of-the-art AI models, optimized network architecture, and self-hosting strategies, they have created a system that meets and exceeds human expectations for conversational speed. This innovation paves the way for new possibilities in real-time voice applications, setting the stage for future advancements in AI-driven communication.

RAG with Llama 3 (Ollama), LlamaIndex, Streamlit

Building a robust RAG application involves a lot of moving parts. The architecture diagram below illustrates the key components and how they interact with each other, followed by detailed descriptions of each component. We've used:

– LlamaIndex for orchestration

– Streamlit for creating a Chat UI

– Meta AI’s Llama3 as the LLM

– “BAAI/bge-large-en-v1.5” for embedding generation
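
As a rough sketch of how these pieces are wired together, LlamaIndex's global Settings can point the whole pipeline at the locally served Llama 3 and the BGE embedding model. The import paths assume a recent llama-index release with the Ollama and HuggingFace embedding integrations installed.

```python
from llama_index.core import Settings
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.ollama import Ollama

# Use the local models everywhere by default.
Settings.llm = Ollama(model="llama3", request_timeout=120.0)   # Llama 3 served by Ollama
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-large-en-v1.5")
```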

1. Custom knowledge base

Custom Knowledge Base: A collection of relevant and up-to-date information that serves as a foundation for RAG. It can be a database, a set of documents, or a combination of both. In this case, it's a PDF you provide, which serves as the source of truth for answering user queries.

2. Chunking

Chunking is the process of breaking down a large input text into smaller pieces. This ensures that the text fits the input size of the embedding model and improves retrieval efficiency.

The following code loads PDF documents from a directory specified by the user using LlamaIndex's SimpleDirectoryReader.
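
A sketch of that loading step, together with an explicit chunking pass using SentenceSplitter (the directory path and chunk sizes are illustrative choices):

```python
from llama_index.core import SimpleDirectoryReader
from llama_index.core.node_parser import SentenceSplitter

# Load every PDF from a user-specified directory.
documents = SimpleDirectoryReader(
    input_dir="./docs",          # assumed location of the user's PDF(s)
    required_exts=[".pdf"],
    recursive=True,
).load_data()

# Break the documents into smaller chunks (nodes) so each piece fits the
# embedding model's input size; chunk_size and chunk_overlap are illustrative.
splitter = SentenceSplitter(chunk_size=512, chunk_overlap=64)
nodes = splitter.get_nodes_from_documents(documents)
print(f"Loaded {len(documents)} documents, produced {len(nodes)} chunks")
```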

3. Embeddings model

An embedding model represents text data as numerical vectors that can be fed into machine learning models; it is responsible for converting text into these vectors. Here we use BAAI/bge-large-en-v1.5.
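
A minimal sketch of loading that model through LlamaIndex's HuggingFace embedding integration:

```python
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# Embedding model that converts text chunks (and user queries) into vectors.
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-large-en-v1.5")

vector = embed_model.get_text_embedding("Retrieval Augmented Generation")
print(len(vector))  # 1024-dimensional vector for bge-large-en-v1.5
```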

4. Vector databases

A collection of pre-computed vector representations of text data for fast retrieval and similarity search, with capabilities like CRUD operations, metadata filtering, and horizontal scaling. By default, LlamaIndex uses a simple in-memory vector store that’s great for quick experimentation.
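
Continuing from the chunks produced in step 2, a sketch of building the default in-memory index and persisting it to disk (the persist directory is an arbitrary choice):

```python
from llama_index.core import VectorStoreIndex

# Build the index over the chunked documents; LlamaIndex keeps the vectors
# in its simple in-memory store by default.
index = VectorStoreIndex(nodes, embed_model=embed_model)

# Optionally persist the index so it doesn't have to be rebuilt on every run.
index.storage_context.persist(persist_dir="./storage")
```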

5. User chat interface

A user-friendly interface that allows users to interact with the RAG system, providing an input query and receiving the generated output. We built a Streamlit app for this; the code can be found in app.py.
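
The full app.py isn't reproduced here, but a minimal Streamlit chat loop along these lines would look like the sketch below; the call to the query engine from step 6 is stubbed out with a placeholder reply.

```python
import streamlit as st

st.title("Chat with your docs")

# Keep the conversation history across Streamlit reruns.
if "messages" not in st.session_state:
    st.session_state.messages = []

for msg in st.session_state.messages:
    with st.chat_message(msg["role"]):
        st.write(msg["content"])

if prompt := st.chat_input("Ask a question about your document"):
    st.session_state.messages.append({"role": "user", "content": prompt})
    with st.chat_message("user"):
        st.write(prompt)

    answer = "…"  # placeholder; the real app calls query_engine.query(prompt)
    st.session_state.messages.append({"role": "assistant", "content": answer})
    with st.chat_message("assistant"):
        st.write(answer)
```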

6. Query engine

The query engine takes a query string, uses it to fetch relevant context, and then sends both as a prompt to the LLM to generate a final natural-language response. The LLM used here is Llama 3, served locally thanks to Ollama. The final response is displayed in the user interface.
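
Continuing from the index built above, a sketch of wiring the locally served Llama 3 into a LlamaIndex query engine (this assumes the llama-index-llms-ollama integration is installed and `ollama pull llama3` has already been run):

```python
from llama_index.llms.ollama import Ollama

# Llama 3 served locally by Ollama on its default port.
llm = Ollama(model="llama3", request_timeout=120.0)

# The query engine retrieves the top matching chunks and sends them,
# together with the question, to the LLM.
query_engine = index.as_query_engine(llm=llm, similarity_top_k=3)

response = query_engine.query("What is this document about?")
print(response)
```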

7. Prompt template

A custom prompt template is used to refine the response from the LLM and to include the retrieved context as well.
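
The exact template used in the original app isn't shown here, so the wording below is an illustrative stand-in; the mechanism for applying it is LlamaIndex's update_prompts on the query engine.

```python
from llama_index.core import PromptTemplate

# Illustrative QA prompt; {context_str} and {query_str} are filled in by LlamaIndex.
qa_prompt = PromptTemplate(
    "Context information is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Given the context information above, answer the query concisely. "
    "If the answer is not in the context, say you don't know.\n"
    "Query: {query_str}\n"
    "Answer: "
)

# Swap the custom template into the query engine's response synthesizer.
query_engine.update_prompts({"response_synthesizer:text_qa_template": qa_prompt})
```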

Conclusion

In this studio, we developed a Retrieval Augmented Generation (RAG) application that allows you to “Chat with your docs.” Throughout this process, we learned about LlamaIndex, the go-to library for building RAG applications, and Ollama for serving LLMs locally; in this case, we served Llama 3, which was recently released by Meta AI.

We also explored the concept of prompt engineering to refine and steer the responses of our LLM. These techniques can similarly be applied to anchor your LLM to various knowledge bases, such as documents, PDFs, videos, and more.