🛠️ How to Build an AI Voice Agent: Step-by-Step Guide (2026)
Complete guide to building an AI voice agent in 2026. Covers SIP setup, choosing AI models, prompt engineering, telephony integration, deployment, and ongoing optimization.
There are two paths:
- Path 1 (DIY): Build from scratch using Gemini Live, Asterisk, and custom code. Expect 4-8 weeks and $30,000-100,000 in development cost, in exchange for maximum control.
- Path 2 (Turnkey): Use a platform like TalkC.ai. Same-day deployment with no development work. The trade-off is less customization for faster ROI.
Architecture Overview
Every AI voice agent has these components:
- Telephony layer: SIP trunk connects phone calls to your system
- PBX (Private Branch Exchange): Routes calls (we use Asterisk)
- Audio bridge: Streams audio between PBX and AI
- AI model: Understands speech and generates responses (Gemini Live)
- Knowledge base: Custom info your AI knows
- Database: Stores call logs, transcripts, customer data
- Dashboard: UI for monitoring and configuration
Step 1: Get a SIP Trunk
A SIP trunk is your phone connection. Options in different markets:
- Nepal: Buel (sip.buel.app), NTC, Ncell
- India: Tata Tele, Knowlarity, Exotel, Twilio India
- Global: Twilio, Vonage, Plivo, SignalWire
Give the provider your business name and use case. In return you'll get:
- SIP host (e.g., sip.buel.app)
- Username, password
- DID (Direct Inward Dialing) number — your phone number
Cost: $5-50/month per DID + per-minute charges.
Step 2: Set Up Asterisk PBX
Asterisk is the open-source PBX that handles call routing. Install on a Linux server:
sudo apt update
sudo apt install asterisk
sudo systemctl start asterisk
Configure SIP trunk in /etc/asterisk/pjsip.conf:
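; Outbound registration to the provider. The transport-udp and buel-auth
; sections referenced here must also be defined in this file.
[buel-trunk]
type=registration
transport=transport-udp
outbound_auth=buel-auth
server_uri=sip:sip.buel.app
client_uri=sip:YOUR_USER@sip.buel.app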
Step 3: Build the Audio Bridge
The audio bridge connects Asterisk to Gemini Live. We use Asterisk's AudioSocket protocol (TCP audio streaming).
In your dialplan (extensions.conf):
; Replace UUID with a valid UUID that identifies the call.
exten => _X.,1,Answer()
exten => _X.,n,AudioSocket(UUID,127.0.0.1:9092)
exten => _X.,n,Hangup()
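AudioSocket sends more than raw audio over the TCP connection: each message carries a 3-byte header (a 1-byte type and a 2-byte big-endian payload length) followed by the payload. Below is a minimal sketch of a parser for that framing. The function names (`createFrameParser`, `encodeAudioFrame`) are our own illustrative choices, not part of Asterisk or any library, and the type codes should be checked against the AudioSocket documentation for your Asterisk version.

```javascript
// AudioSocket message types (verify against your Asterisk version's docs).
const KIND_HANGUP = 0x00; // call ended
const KIND_UUID = 0x01;   // 16-byte call UUID, sent first
const KIND_AUDIO = 0x10;  // 16-bit signed linear PCM, 8 kHz, mono
const KIND_ERROR = 0xff;  // error reported by Asterisk

// Creates a stateful parser. feed() accepts arbitrary TCP chunks and returns
// every complete frame seen so far, keeping any partial frame for later.
function createFrameParser() {
  let pending = Buffer.alloc(0);
  return {
    feed(chunk) {
      pending = Buffer.concat([pending, chunk]);
      const frames = [];
      // Header: 1-byte type + 2-byte big-endian payload length.
      while (pending.length >= 3) {
        const length = pending.readUInt16BE(1);
        if (pending.length < 3 + length) break; // wait for the rest of the frame
        frames.push({ kind: pending[0], payload: pending.subarray(3, 3 + length) });
        pending = pending.subarray(3 + length);
      }
      return frames;
    }
  };
}

// Wraps raw PCM in an audio frame so it can be written back to Asterisk.
function encodeAudioFrame(pcm) {
  const header = Buffer.alloc(3);
  header.writeUInt8(KIND_AUDIO, 0);
  header.writeUInt16BE(pcm.length, 1);
  return Buffer.concat([header, pcm]);
}
```

In a real bridge you would create one parser per connection inside a `net.createServer` handler listening on port 9092, call `feed()` from the socket's `data` event, and write `encodeAudioFrame(...)` results back to the same socket.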
Write a Node.js service that:
- Listens on port 9092
- Receives PCM audio from Asterisk (16-bit, 8kHz mono)
- Upsamples to 16kHz for Gemini Live
- Streams to Gemini Live WebSocket API
- Receives AI audio response (24kHz)
- Downsamples to 8kHz for Asterisk
- Streams back to caller
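The two resampling steps in the list above can be sketched as follows. This is a deliberately simple illustration, assuming 16-bit little-endian mono PCM in Node.js Buffers: it upsamples by linear interpolation and downsamples by averaging each group of three samples. The function names are ours, and a production bridge would likely use a proper low-pass filter or a dedicated resampling library for better audio quality.

```javascript
// Upsample 16-bit mono PCM from 8 kHz to 16 kHz by linear interpolation:
// each input sample is followed by the midpoint to the next sample.
function upsample8kTo16k(pcm8k) {
  const n = Math.floor(pcm8k.length / 2); // number of 16-bit samples
  const out = Buffer.alloc(n * 4);
  for (let i = 0; i < n; i++) {
    const cur = pcm8k.readInt16LE(i * 2);
    // The last sample has no successor, so it is simply repeated.
    const next = i + 1 < n ? pcm8k.readInt16LE((i + 1) * 2) : cur;
    out.writeInt16LE(cur, i * 4);
    out.writeInt16LE(Math.round((cur + next) / 2), i * 4 + 2);
  }
  return out;
}

// Downsample 16-bit mono PCM from 24 kHz to 8 kHz by averaging every three
// samples (a crude box filter that limits aliasing somewhat).
function downsample24kTo8k(pcm24k) {
  const n = Math.floor(pcm24k.length / 6); // groups of three 16-bit samples
  const out = Buffer.alloc(n * 2);
  for (let i = 0; i < n; i++) {
    const sum =
      pcm24k.readInt16LE(i * 6) +
      pcm24k.readInt16LE(i * 6 + 2) +
      pcm24k.readInt16LE(i * 6 + 4);
    out.writeInt16LE(Math.round(sum / 3), i * 2);
  }
  return out;
}
```

Averaging is cheap enough to run per audio frame, which matters here because every extra millisecond in the bridge adds to the latency the caller hears.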
Step 4: Connect to Gemini Live
Get a Gemini API key from Google AI Studio. Connect to the Live API:
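import { GoogleGenAI } from '@google/genai';

// Read the API key from the environment rather than hard-coding it.
const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

const session = await ai.live.connect({
  model: 'gemini-3.1-flash-live-preview',
  config: {
    responseModalities: ['AUDIO'],
    speechConfig: {
      voiceConfig: { prebuiltVoiceConfig: { voiceName: 'Leda' } },
      languageCode: 'ne-NP'
    },
    systemInstruction: YOUR_PROMPT // your system prompt string (see Step 5)
  },
  // The SDK delivers server events through callbacks.
  callbacks: {
    onmessage: (message) => { /* forward the AI audio to the caller */ },
    onerror: (err) => console.error('Live session error:', err)
  }
});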
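Once the session is open, the bridge has to push caller audio in and pull AI audio out. The sketch below shows one way to do both, assuming the `session` object from the `@google/genai` SDK exposes `sendRealtimeInput` and that audio arrives as base64 `inlineData` parts under `serverContent.modelTurn`. That matches our understanding of the SDK, but check the current Live API reference before relying on it. The helper names are hypothetical.

```javascript
// Forward one chunk of 16 kHz, 16-bit mono PCM to an open Live session.
// The API expects base64-encoded audio with an explicit sample rate.
function sendCallerAudio(session, pcm16k) {
  session.sendRealtimeInput({
    audio: { data: pcm16k.toString('base64'), mimeType: 'audio/pcm;rate=16000' }
  });
}

// Collect the audio chunks (24 kHz PCM) carried by one server message.
// Returns an empty array for messages that contain no audio.
function extractModelAudio(message) {
  const parts = message?.serverContent?.modelTurn?.parts ?? [];
  return parts
    .filter((part) => part.inlineData?.data)
    .map((part) => Buffer.from(part.inlineData.data, 'base64'));
}
```

`extractModelAudio` is meant to be called from the session's `onmessage` callback; each returned Buffer can then be downsampled and streamed back to Asterisk.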
Step 5: Write Your System Prompt
The system prompt defines AI behavior. Critical sections:
- Identity: Who is the AI? "You are TalkC, a customer service agent for ABC Co."
- Language rules: Default language, when to switch
- Tone: Formal/casual, fillers to use
- Conversation flow: How to greet, when to ask clarifying questions
- Knowledge base: Facts about your business
- Guardrails: What NOT to discuss, when to escalate
- Call ending: How to politely end
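One way to keep these sections consistent across deployments is to assemble the prompt from named parts instead of editing one long string. The sketch below is only an illustration: the section keys, titles, and the `buildSystemPrompt` helper are our own naming, not a required format.

```javascript
// Section order mirrors the list above; titles become headings in the prompt.
const PROMPT_SECTIONS = [
  ['identity', 'Identity'],
  ['languageRules', 'Language rules'],
  ['tone', 'Tone'],
  ['conversationFlow', 'Conversation flow'],
  ['knowledgeBase', 'Knowledge base'],
  ['guardrails', 'Guardrails'],
  ['callEnding', 'Call ending']
];

// Builds one prompt string and fails fast if any section is missing or blank,
// so an incomplete prompt never reaches production.
function buildSystemPrompt(sections) {
  const missing = PROMPT_SECTIONS
    .filter(([key]) => typeof sections[key] !== 'string' || !sections[key].trim())
    .map(([key]) => key);
  if (missing.length > 0) {
    throw new Error(`Missing prompt sections: ${missing.join(', ')}`);
  }
  return PROMPT_SECTIONS
    .map(([key, title]) => `## ${title}\n${sections[key].trim()}`)
    .join('\n\n');
}
```

The returned string can be passed as `systemInstruction` when connecting to the model.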
Step 6: Add Knowledge Base
Your AI needs to know your business. Add to system prompt or use retrieval:
- Company info: hours, locations, services
- Pricing (or escalation rules)
- FAQs from your support team
- Product details
- Policy info (returns, cancellations)
Keep it under 30,000 characters for the prompt-based approach. Use retrieval-augmented generation (RAG) for larger knowledge bases.
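As a rough sketch of that decision, the helpers below check a knowledge base against the 30,000-character limit and, when it is too large, split it into paragraph-based chunks that a retrieval system could index. The function names and the 1,000-character default chunk size are assumptions for illustration; the embedding and search steps of RAG are not shown.

```javascript
const MAX_PROMPT_KB_CHARS = 30000;

// Decide whether the knowledge base fits directly in the system prompt.
function chooseKnowledgeStrategy(knowledgeText) {
  const chars = knowledgeText.length;
  return { chars, strategy: chars < MAX_PROMPT_KB_CHARS ? 'prompt' : 'rag' };
}

// Split text into chunks of whole paragraphs, each at most maxChars long.
// A single paragraph longer than maxChars is kept as one oversize chunk.
function chunkForRetrieval(text, maxChars = 1000) {
  const chunks = [];
  let current = '';
  for (const paragraph of text.split(/\n\s*\n/)) {
    const p = paragraph.trim();
    if (!p) continue;
    // Start a new chunk if adding this paragraph would exceed the limit.
    if (current && current.length + p.length + 2 > maxChars) {
      chunks.push(current);
      current = '';
    }
    current = current ? `${current}\n\n${p}` : p;
  }
  if (current) chunks.push(current);
  return chunks;
}
```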
Step 7: Handle Edge Cases
- Silence detection: If the caller doesn't speak, prompt them 2-3 times, then hang up.
- Barge-in: Allow the caller to interrupt the AI mid-sentence.
- Escalation: Detect frustration keywords and requests such as "let me speak to a manager," then transfer to a human.
- Background noise: Filter or ignore.
- Multi-language switching: Define when and how the AI should switch languages.
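Two of these cases, silence handling and escalation, reduce to small pieces of state and string matching. The sketch below assumes your bridge can detect a silence timeout and has access to a text transcript of the caller's turn. The phrase list, function names, and default of three nudges are illustrative only; real deployments would tune the phrases per language.

```javascript
// Phrases that suggest the caller wants a human (illustrative, English only).
const ESCALATION_PHRASES = [
  'speak to manager',
  'speak to a manager',
  'talk to a human',
  'real person'
];

// Returns true if the caller's transcript contains an escalation phrase.
function needsEscalation(transcript) {
  const text = transcript.toLowerCase();
  return ESCALATION_PHRASES.some((phrase) => text.includes(phrase));
}

// Tracks consecutive silence timeouts: nudge up to maxNudges times,
// then signal a hangup. Any caller speech resets the counter.
function createSilenceTracker(maxNudges = 3) {
  let nudges = 0;
  return {
    onCallerSpeech() {
      nudges = 0;
    },
    onSilenceTimeout() {
      if (nudges < maxNudges) {
        nudges += 1;
        return 'nudge';
      }
      return 'hangup';
    }
  };
}
```

Keyword matching is easy to reason about but misses paraphrases, so it works best as a safety net alongside an escalation rule in the system prompt.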
Step 8: Build the Dashboard
You need a UI for:
- Live call monitoring
- Call history with transcripts
- Configuration (prompt, voice, language)
- Analytics (volume, duration, sentiment)
- Knowledge base editing
Stack we recommend: Node.js (Fastify), PostgreSQL, Alpine.js frontend.
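The analytics view mostly comes down to aggregating call records. Here is a minimal in-memory sketch, assuming each record has a `durationSeconds` number and a `sentiment` label (both hypothetical field names). In the recommended stack the same aggregation would more likely be a SQL query against PostgreSQL.

```javascript
// Summarize a list of call records into the dashboard's headline numbers.
function summarizeCalls(calls) {
  const totalSeconds = calls.reduce((sum, call) => sum + call.durationSeconds, 0);
  const sentiment = {};
  for (const call of calls) {
    sentiment[call.sentiment] = (sentiment[call.sentiment] ?? 0) + 1;
  }
  return {
    volume: calls.length,
    // Avoid dividing by zero when there are no calls yet.
    avgDurationSeconds: calls.length > 0 ? totalSeconds / calls.length : 0,
    sentiment
  };
}
```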
Step 9: Deploy to Production
- VPS with 4 vCPU, 8GB RAM (DigitalOcean, Linode, AWS)
- Ubuntu 24.04 LTS
- PM2 for process management
- Nginx for SSL/reverse proxy
- PostgreSQL 16 for data
- Redis for sessions (optional)
- Monitoring: Prometheus + Grafana
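For the PM2 item, a process file keeps restart behavior and environment settings in one place. The following `ecosystem.config.js` is a sketch under assumptions: the app names, script paths, port numbers, and memory limit are placeholders to adapt to your own project.

```javascript
// ecosystem.config.js: start everything with `pm2 start ecosystem.config.js`
module.exports = {
  apps: [
    {
      name: 'voice-bridge',        // hypothetical name for the AudioSocket bridge
      script: './bridge.js',       // hypothetical entry point
      instances: 1,
      autorestart: true,           // restart the bridge if it crashes
      max_memory_restart: '1G',    // restart if memory use grows past 1 GB
      env: {
        NODE_ENV: 'production',
        AUDIOSOCKET_PORT: 9092     // matches the port used in the dialplan
      }
    },
    {
      name: 'dashboard-api',       // hypothetical name for the dashboard backend
      script: './dashboard.js',    // hypothetical entry point
      instances: 1,
      autorestart: true,
      env: {
        NODE_ENV: 'production',
        PORT: 3000                 // Nginx would reverse-proxy to this port
      }
    }
  ]
};
```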
Step 10: Test and Iterate
Make 50+ test calls. Listen to recordings. Refine prompts. Common issues:
- AI too verbose → add a prompt rule that limits response length
- AI repeats itself → add anti-repetition rules
- AI misunderstands accents → switch to a better STT model
- Latency too high → optimize audio pipeline, reduce prompt size
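To tell whether latency is actually improving between test rounds, it helps to look at percentiles rather than averages, since a few slow responses are what callers notice. Below is a small sketch using the nearest-rank method. It assumes you already record, for each turn, the milliseconds between the caller finishing and the first AI audio; how you measure that is up to your bridge.

```javascript
// Nearest-rank percentile: the smallest value such that at least p percent
// of the samples are less than or equal to it.
function percentile(values, p) {
  if (values.length === 0) return NaN;
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}

// Summarize response latencies (in milliseconds) from a batch of test calls.
function latencyReport(latenciesMs) {
  return {
    p50: percentile(latenciesMs, 50),
    p95: percentile(latenciesMs, 95),
    max: percentile(latenciesMs, 100)
  };
}
```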
Cost of Building From Scratch
| Item | Cost |
|---|---|
| Development (4-8 weeks) | $30,000-100,000 |
| VPS | $50-200/month |
| SIP trunk | $50-500/month |
| Gemini API | $0.05-0.10/min |
| Ongoing maintenance | $5,000-15,000/month |
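To turn the recurring rows of the table into a monthly figure, add the fixed costs to the per-minute Gemini charge multiplied by your call volume. The sketch below does that arithmetic for the low and high ends of each range. It leaves out the one-time development cost and any per-minute SIP charges, and the function and field names are our own.

```javascript
// Low and high ends of the recurring costs from the table above (USD).
const RECURRING_COSTS = {
  low: { vps: 50, sipTrunk: 50, geminiPerMinute: 0.05, maintenance: 5000 },
  high: { vps: 200, sipTrunk: 500, geminiPerMinute: 0.10, maintenance: 15000 }
};

// Estimate the monthly running cost range for a given number of call minutes.
function estimateMonthlyCost(minutesPerMonth) {
  const estimate = {};
  for (const [bound, c] of Object.entries(RECURRING_COSTS)) {
    estimate[bound] =
      c.vps + c.sipTrunk + c.maintenance + minutesPerMonth * c.geminiPerMinute;
  }
  return estimate;
}
```

For 10,000 minutes per month this works out to roughly $5,600 at the low end and $16,700 at the high end, with maintenance dominating both figures.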
The Turnkey Alternative
If you don't want to spend 6-12 weeks building and hardening your own stack, platforms like TalkC.ai give you all of this with same-day setup. You provide:
- SIP trunk credentials
- Knowledge base (plain text)
- Greeting message
- Language preference
They handle: PBX, audio bridge, AI integration, dashboard, monitoring.
Frequently Asked Questions
How long does it take to build an AI voice agent from scratch?
4-8 weeks for an MVP if you have an experienced full-stack engineer. Add another 2-4 weeks for production hardening, testing, and optimization.
What's the minimum technical skill needed?
Building from scratch requires: Linux administration, Node.js or Python, WebSocket programming, basic telephony knowledge (SIP/RTP), and prompt engineering. If you lack these, use a turnkey platform.
Can I use OpenAI Realtime instead of Gemini Live?
Yes. Both work similarly. Gemini Live has better multilingual support (especially Asian languages) and lower latency. OpenAI Realtime has stronger English voice quality.
How do I handle Indian/Nepali accents?
Use a multilingual STT (Gemini Live, AssemblyAI Universal). Avoid English-only STT engines. Train on accent samples if doing custom fine-tuning.
What's the best architecture for scale?
For 1000+ concurrent calls, deploy multiple voice-bridge instances behind a load balancer, share a PostgreSQL/Redis backend, rotate Gemini API keys, and monitor latency and audio buffer underruns.