AI Call Assistant: Chapter 1 - Building The Listening Layer
This post is the first of a three-part series on building an AI call assistant. It explains how to create the "listening layer" that streams live phone audio through Twilio and converts it to real-time transcripts using pluggable speech-to-text services.
Series Overview: We're building an AI-powered call assistant that acts as your personal representative handling incoming calls, answering questions from your knowledge base, and managing interactions while you stay in control. This three-part series breaks down the challenge into core components:
Chapter 1: The Listening Layer (this post) : Accepting inbound calls and performing real-time Speech-to-Text transcription
Chapter 2: The Intelligence Layer 🚧 : Understanding context with LLMs, managing conversation state, and generating intelligent responses
Chapter 3: The Voice Layer 🚧 : Speaking back to callers with low-latency Text-to-Speech
Every missed call is a missed opportunity. Whether you're a consultant fielding client inquiries, a professional managing appointments, or someone who simply can't answer every ring, you need a system that can listen, understand, and respond on your behalf.
But before our AI can think or speak, it must first hear.
This chapter tackles the foundational challenge: streaming live audio from phone calls into your system and converting speech to text in real time. We'll build a clean architecture so you can swap STT services (ElevenLabs, Mistral) without rewriting your telephony layer.
Twilio handles the complex telephony networking. Your FastAPI server receives the audio chunks and forwards them to a Realtime STT Provider via another WebSocket.
We're focusing on three major concepts: The Handshake, The Factory, and The Stream.
When Twilio receives a phone call to your registered number, it sends a webhook to your FastAPI server asking, "What should I do?"
You respond with TwiML (Twilio Markup Language). Normally, you might tell it to just <Record> a voicemail. But we want low-latency, real-time audio access. For that, we use <Connect><Stream>:
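A minimal sketch of that TwiML, generated in Python. The webhook route and the WebSocket URL here are placeholders you'd replace with your own ngrok domain and endpoint:

```python
def build_stream_twiml(ws_url: str) -> str:
    # <Connect><Stream> tells Twilio to open a persistent WebSocket to ws_url
    # and push the call's raw audio over it in real time.
    return (
        '<?xml version="1.0" encoding="UTF-8"?>'
        "<Response>"
        f'<Connect><Stream url="{ws_url}" /></Connect>'
        "</Response>"
    )

# Placeholder URL: point this at the /twilio/stream WebSocket endpoint below.
print(build_stream_twiml("wss://your-domain.ngrok.app/twilio/stream"))
```

Return this string from your voice webhook handler with a `Content-Type` of `application/xml` so Twilio parses it as TwiML.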
This tiny XML block is incredibly powerful. Once Twilio processes it, it establishes a persistent WebSocket connection to your server and starts firing raw, 8 kHz μ-law encoded audio at you.
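Some STT providers accept μ-law directly, but if yours expects 16-bit PCM you'll need to decode each byte per the G.711 μ-law rule. A minimal stdlib-only sketch (the function names are illustrative, not from the repo):

```python
import base64

def ulaw_byte_to_pcm16(u: int) -> int:
    """Decode one G.711 mu-law byte into a signed 16-bit PCM sample."""
    u = ~u & 0xFF                     # mu-law bytes are stored complemented
    magnitude = (((u & 0x0F) << 3) + 0x84) << ((u & 0x70) >> 4)
    return (0x84 - magnitude) if (u & 0x80) else (magnitude - 0x84)

def decode_twilio_payload(payload_b64: str) -> list[int]:
    """Twilio's media payload is base64-encoded mu-law; return PCM samples."""
    return [ulaw_byte_to_pcm16(b) for b in base64.b64decode(payload_b64)]
```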
Now Twilio is firing audio chunks at our WebSocket endpoint (/twilio/stream). How do we get them to our STT Provider?
Twilio sends JSON events. The ones we care about are start (call begins), media (here is some audio), and stop (call ends). We process these events in a high-speed router:
```python
async def handle_twilio_stream(ws: WebSocket) -> None:
    await ws.accept()
    stt = None  # To be instantiated on "start"
    while True:
        msg_text = await ws.receive_text()
        event = json.loads(msg_text)
        if event["event"] == "start":
            # 1. Instantiate the provider from the factory
            stt = SttFactory.get_client()
            # 2. Set up the callback for when the STT provider returns text
            stt.set_on_transcript(on_transcript)
            # 3. Start the STT receive loop in the background
            asyncio.create_task(stt.run_receive_loop())
        elif event["event"] == "media":
            # 4. Forward the base64 audio payload to the STT provider
            await stt.send_audio_base64(event["media"]["payload"], sample_rate=8000)
        elif event["event"] == "stop":
            break
```
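For reference, the JSON frames this router dispatches on look roughly like the sketch below. The shapes follow Twilio's Media Streams message format; the field values are abbreviated placeholders:

```python
import json

# A "media" frame roughly as Twilio sends it over the WebSocket
# (SIDs and values abbreviated for illustration).
media_frame = json.dumps({
    "event": "media",
    "streamSid": "MZxxxxxx",
    "media": {
        "track": "inbound",
        "chunk": "2",
        "timestamp": "100",
        "payload": "f39/f38=",  # base64-encoded mu-law audio bytes
    },
})

event = json.loads(media_frame)
# This is exactly the field the router forwards to the STT client.
audio_b64 = event["media"]["payload"]
```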
Notice how clean the handle_twilio_stream loop is. Twilio sends audio; we forward it. The STT client handles all the ugly API-specific logic for ElevenLabs or Mistral.
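The factory's internals aren't shown above, but the pattern can be sketched as an abstract interface plus a registry keyed by a config value. Class and method names here mirror the calls in the router; the registry and env-var mechanics are illustrative assumptions:

```python
import os
from abc import ABC, abstractmethod

class BaseSttClient(ABC):
    """Interface every STT provider adapter must satisfy."""

    @abstractmethod
    def set_on_transcript(self, callback) -> None: ...

    @abstractmethod
    async def send_audio_base64(self, payload: str, sample_rate: int) -> None: ...

    @abstractmethod
    async def run_receive_loop(self) -> None: ...

class SttFactory:
    _registry: dict = {}

    @classmethod
    def register(cls, name: str):
        # Decorator so each provider module self-registers on import.
        def wrap(client_cls):
            cls._registry[name] = client_cls
            return client_cls
        return wrap

    @classmethod
    def get_client(cls) -> BaseSttClient:
        # Swapping ElevenLabs for Mistral becomes a one-variable change.
        name = os.getenv("STT_PROVIDER", "elevenlabs")
        return cls._registry[name]()
```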
When the STT provider gets an answer, it fires the on_transcript callback:
```python
async def on_transcript(kind: str, text: str) -> None:
    # "kind" is either "partial" (still talking) or "committed" (paused thought)
    if kind == "partial":
        print(f"\rPARTIAL: {text}", end="", flush=True)
    else:
        print(f"\rCOMMITTED: {text}")
```
When you wire it all together, start up ngrok, and call your Twilio number, you will see a magical result stream directly into your terminal:
```
[CAxxxxxx] Twilio stream started
PARTIAL: Hello
PARTIAL: Hello can you
COMMITTED: Hello can you hear me?
PARTIAL: I'm calling
COMMITTED: I'm calling about the appointment.
[CAxxxxxx] Twilio stream stopped
```
Our AI assistant can finally hear the world.
But printing text to a terminal isn't an assistant; it's just a dictation machine. To make it useful, we have to teach it how to think.
In Chapter 2: The Intelligence Layer, we will take these "committed" transcripts, enrich them with our company's knowledge base, and pass them to an LLM (Gemini/Mistral) to generate intelligent, context-aware responses!
Coming soon. Star the GitHub repo to follow along!