
Building Hindi Voice Agents: Using Sarvam Saaras (STT) and Bulbul (TTS)

Building voice agents for the Indian market presents a unique set of challenges that Western models often fail to address. The primary hurdles are code-mixing (Hinglish, Tanglish, etc.) and the sheer diversity of accents.

Most developers default to OpenAI's Whisper for STT and ElevenLabs for TTS. While these are excellent general-purpose tools, they struggle with the nuances of Indian conversational speech.

Enter Sarvam AI. Their full-stack platform—comprising Saaras (Speech-to-Text) and Bulbul (Text-to-Speech)—is built specifically to handle these edge cases.

In this guide, we'll explore how to architect a Hindi/Hinglish voice agent using Sarvam's stack.

The Stack: Saaras + Bulbul

1. The Ear: Saaras (STT)

Saaras is Sarvam's flagship ASR model. Unlike Whisper, which is trained on global data, Saaras is fine-tuned on thousands of hours of Indian vernacular speech.

Key Advantages for Agents:

  • Hinglish Native: It doesn't just "tolerate" code-mixing; it expects it. It correctly transcribes "Main kal market ja raha hoon" without trying to force it into pure Hindi script or pure English translation.
  • Streaming Support: Essential for voice agents. Saaras supports WebSocket-based real-time transcription, which is crucial for keeping round-trip latency low (sub-500ms); a minimal connection sketch follows this list.
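
To make the streaming point concrete, here is a minimal sketch of the client loop. The endpoint URI, auth handshake, and message schema below are assumptions for illustration only; Sarvam's actual streaming contract may differ, so check their docs.

import asyncio
import json
import websockets  # pip install websockets

async def stream_transcribe(audio_chunks):
    # Hypothetical endpoint and handshake; consult Sarvam's docs for the real contract.
    uri = "wss://api.sarvam.ai/speech-to-text/stream"
    async with websockets.connect(uri) as ws:
        # Assumed auth step: send the API key as the first message.
        await ws.send(json.dumps({"api_key": "YOUR_API_KEY"}))
        for chunk in audio_chunks:       # e.g. 20-100 ms PCM frames from the mic
            await ws.send(chunk)         # raw audio bytes as they are captured
            msg = json.loads(await ws.recv())
            if msg.get("transcript"):    # partial or final hypothesis
                print(msg["transcript"])

# asyncio.run(stream_transcribe(mic_frames()))  # mic_frames() is your audio source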

2. The Voice: Bulbul (TTS)

Bulbul is designed to sound like a native Indian speaker, not a Westerner trying to speak Hindi.

Key Advantages for Agents:

  • Contextual Prosody: It understands the "flow" of Indian speech patterns.
  • Low Latency: Optimized for real-time generation, reducing the dreaded "awkward silence" in voice conversations.

Architecture of a Hindi Voice Agent

A typical voice agent pipeline looks like this:

  1. Input: User speaks (Audio Stream).
  2. Transcribe (Saaras): Audio -> Text (Hinglish).
  3. Brain (LLM): Text -> Response Text. (You can use Sarvam-1 or GPT-4o here).
  4. Speak (Bulbul): Response Text -> Audio.
  5. Output: Audio played back to user.
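
Expressed in code, one conversational turn is a straight chain of these calls. Here is a skeleton with placeholder helpers; each is filled in by the corresponding step in the implementation section below.

def transcribe(audio_bytes: bytes) -> str:
    ...  # Saaras call (Step 1 below)

def generate_reply(user_text: str) -> str:
    ...  # LLM call (Step 2 below)

def synthesize(reply_text: str) -> bytes:
    ...  # Bulbul call (Step 3 below)

def handle_turn(audio_bytes: bytes) -> bytes:
    """One turn: user audio in, agent audio out."""
    user_text = transcribe(audio_bytes)     # 2. Audio -> Hinglish text
    reply_text = generate_reply(user_text)  # 3. Text -> response text
    return synthesize(reply_text)           # 4. Response text -> audio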

Implementation: Using the API

Note: This is a conceptual implementation based on standard API patterns for Sarvam AI.

Step 1: Transcribing with Saaras

To use Saaras, you would typically make a POST request to their transcription endpoint.

import requests

url = "https://api.sarvam.ai/speech-to-text"
# Don't set Content-Type yourself for multipart uploads: requests
# generates it (including the boundary) when you pass files=.
headers = {
    "Authorization": "Bearer YOUR_API_KEY"
}

data = {
    'language_code': 'hi-IN',  # Explicitly targeting Hindi/India
    'model': 'saaras-v1'
}

# Saaras handles standard audio formats (wav, mp3, flac)
with open('audio.wav', 'rb') as audio_file:
    files = {
        'file': ('audio.wav', audio_file, 'audio/wav')
    }
    response = requests.post(url, headers=headers, files=files, data=data)

response.raise_for_status()
print(response.json()['transcript'])

Handling the Output: Depending on configuration, Saaras returns text either in the native script of the dominant language (Devanagari for Hindi) or in a romanized form. For a voice agent, you usually want the text in the script your LLM handles best: for GPT-4-class models, Devanagari works well, but romanized Hindi (Hinglish) is also very effective for casual conversation.
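
If you need to normalize between scripts before the LLM call, the open-source indic-transliteration package does mechanical Devanagari-to-roman conversion. A small sketch; note that schemes like ITRANS produce a rough ASCII romanization, not the casual Hinglish spelling a human would type.

from indic_transliteration import sanscript
from indic_transliteration.sanscript import transliterate  # pip install indic-transliteration

devanagari = "मैं कल मार्केट जा रहा हूँ"
# Mechanical romanization: a consistent input format for the LLM,
# though not natural casual spelling.
roman = transliterate(devanagari, sanscript.DEVANAGARI, sanscript.ITRANS)
print(roman)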

Step 2: The "Brain" (LLM)

For the reasoning layer, you need a model that understands the cultural context.

  • Option A: GPT-4o / Claude 3.5 Sonnet. Great general reasoning. You must prompt them to "reply in natural, conversational Hinglish."
  • Option B: Sarvam-1. A 2B parameter model trained specifically on Indic datasets. It is far cheaper and faster for this specific task than a massive general model.

Example Prompt for the LLM:

"You are a helpful assistant for Indian users. Reply in a mix of Hindi and English (Hinglish) that sounds natural to a young professional in Delhi. Keep responses concise for voice output."

Step 3: Speaking with Bulbul

Once you have the text response, send it to Bulbul.

url = "https://api.sarvam.ai/text-to-speech"
data = {
    'text': "Haan, main samajh gaya. Kal subah milte hain!",
    'speaker_id': 'meera-hindi', # Hypothetical speaker ID
    'model': 'bulbul-v1'
}

response = requests.post(url, headers=headers, json=data)

with open('output.mp3', 'wb') as f:
    f.write(response.content)

Optimizing for Latency

In a real-time voice agent, latency is the enemy. If the user says "Hello" and waits 3 seconds for a "Hi", the illusion breaks.

  1. Use Streaming APIs: Don't wait for the user to finish a whole paragraph. Send audio chunks to Saaras.
  2. Speculative Execution: If the user's intent is obvious ("Stop", "Cancel"), trigger the action immediately without a full LLM round-trip.
  3. VAD (Voice Activity Detection): Use a local VAD engine (like Silero VAD) on the device to detect when the user stops speaking, rather than relying on the cloud API to decide end-of-speech; a minimal Silero sketch follows this list.
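
For the VAD step, Silero VAD loads via torch.hub and runs comfortably on CPU. A minimal file-based sketch; a live agent would feed microphone frames through the VADIterator helper instead of processing a whole file.

import torch  # pip install torch torchaudio

model, utils = torch.hub.load('snakers4/silero-vad', 'silero_vad', trust_repo=True)
(get_speech_timestamps, save_audio, read_audio, VADIterator, collect_chunks) = utils

wav = read_audio('audio.wav', sampling_rate=16000)
# Sample-indexed speech segments; silence after the last segment
# tells you the user has stopped speaking.
speech_timestamps = get_speech_timestamps(wav, model, sampling_rate=16000)
print(speech_timestamps)  # e.g. [{'start': 1504, 'end': 28032}, ...]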

Conclusion

Building for India requires Indian tools. While global giants like OpenAI and Google are catching up, Sarvam AI offers a specialized, "sovereign" stack that just works better for the messy, beautiful reality of Indian languages. By combining Saaras for accurate transcription and Bulbul for natural speech, you can create voice experiences that feel truly local, not just translated.