I keep saying this: the next stage of agents isn’t “type better prompts”. It’s multi-modal—voice, video, avatars—where the agent feels less like a chat box and more like a teammate.
So here’s the thing I wanted but couldn’t find: React hooks + components for Microsoft Foundry Voice Live API.
This SDK helps you build real-time voice AI apps with:
- Azure video avatars
- Live2D / 3D avatars
- audio visualizers
- function calling
- TypeScript-first ergonomics
This post is the quick start. The GitHub repo has the more verbose examples, wiring, and “okay but how do I ship this?” details.

Why Voice Changes Everything
Text agents are great. They’re also… text. Voice unlocks:
- Natural conversations — No typing, no waiting. Talk to your agent like a colleague.
- Accessibility — Voice interfaces reach users who can’t or won’t type.
- Hands-free workflows — Field workers, drivers, surgeons—anyone whose hands are busy.
- Emotional context — Tone, pace, and pauses carry meaning that text loses.
- Avatar presence — Visual feedback builds trust and engagement.
Voice Live handles the hard parts: streaming speech in/out with a single session, plus optional avatars. When it works, it feels surprisingly “present”.
What We’re Building
You get two packages:
| Package | Purpose |
|---|---|
| @iloveagents/foundry-voice-live-react | React hooks and avatar components |
| @iloveagents/foundry-voice-live-proxy-node | Secure WebSocket proxy for production |
Here’s the architecture:
```
┌─────────────────────────────────────────┐
│             Your React App              │
└─────────────────────────────────────────┘
                    │
                    ▼
┌─────────────────────────────────────────┐
│  @iloveagents/foundry-voice-live-react  │
│   • useVoiceLive hook                   │
│   • VoiceLiveAvatar component           │
└─────────────────────────────────────────┘
                    │
        ┌───────────┴───────────┐
        ▼                       ▼
┌───────────────┐      ┌───────────────────┐
│  Direct API   │  OR  │   Proxy Server    │
│  (Dev only)   │      │   (Production)    │
└───────────────┘      └───────────────────┘
                    │
                    ▼
┌─────────────────────────────────────────┐
│   Microsoft Foundry Voice Live API      │
└─────────────────────────────────────────┘
```

The proxy is critical. Browser-based apps can’t safely hold API keys—anyone can inspect network traffic. The proxy authenticates server-side, so credentials never touch the client.
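Concretely, the two branches in the diagram differ only in the `connection` object you hand to the hook. A minimal sketch of the two shapes, matching the examples in this post (the proxy hostname here is hypothetical):

```typescript
// Direct connection: the browser holds the API key. Local dev only.
const devConnection = {
  resourceName: "your-foundry-resource",
  apiKey: "your-api-key", // never ship this to production clients
};

// Proxy connection: no credentials in the client. The proxy
// (hypothetical host shown) holds the key server-side.
const prodConnection = {
  proxyUrl: "wss://voice-proxy.example.com/ws",
};
```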
Quick Start
```shell
npm create vite@latest voice-agent -- --template react-ts
cd voice-agent
npm install @iloveagents/foundry-voice-live-react
```

Replace `src/App.tsx` with the code below, then `npm run dev`:
Voice-Only Agent
```tsx
import { useRef, useEffect } from "react";
import { useVoiceLive } from "@iloveagents/foundry-voice-live-react";

function App() {
  const audioRef = useRef<HTMLAudioElement>(null);
  const { connect, disconnect, connectionState, audioStream } = useVoiceLive({
    connection: {
      resourceName: "your-foundry-resource",
      apiKey: "your-api-key", // Dev only. Use the proxy in production.
    },
    session: {
      instructions: "You are a helpful assistant for iLoveAgents.",
    },
  });

  useEffect(() => {
    if (audioRef.current && audioStream) {
      audioRef.current.srcObject = audioStream;
    }
  }, [audioStream]);

  return (
    <div>
      <p>Status: {connectionState}</p>
      <button onClick={connect} disabled={connectionState === "connected"}>
        Start Conversation
      </button>
      <button onClick={disconnect} disabled={connectionState !== "connected"}>
        End
      </button>
      <audio ref={audioRef} autoPlay />
    </div>
  );
}

export default App;
```

On connect, the hook requests microphone access and starts streaming.
With Avatar
Want a face? Swap App.tsx for this:
```tsx
import { useVoiceLive, VoiceLiveAvatar } from "@iloveagents/foundry-voice-live-react";

function App() {
  const { videoStream, audioStream, connect, disconnect } = useVoiceLive({
    connection: {
      resourceName: "your-foundry-resource",
      apiKey: "your-api-key",
    },
    session: {
      instructions: "You are a friendly assistant with a visual presence.",
      voice: { name: "en-US-AvaMultilingualNeural", type: "azure-standard" },
      avatar: { character: "lisa", style: "casual-sitting" },
    },
  });

  return (
    <div>
      <VoiceLiveAvatar videoStream={videoStream} audioStream={audioStream} />
      <button onClick={connect}>Start</button>
      <button onClick={disconnect}>Stop</button>
    </div>
  );
}

export default App;
```

The avatar syncs lip movements with speech.
Production: Secure Proxy
Never ship API keys to the browser. Run the proxy server-side:
```shell
# Docker (recommended)
docker run -p 8080:8080 \
  -e FOUNDRY_RESOURCE_NAME=your-foundry-resource \
  -e FOUNDRY_API_KEY="your-api-key" \
  -e ALLOWED_ORIGINS="*" \
  ghcr.io/iloveagents/foundry-voice-live-proxy:latest
```

Or with npx for quick testing:
```shell
FOUNDRY_RESOURCE_NAME=your-foundry-resource \
FOUNDRY_API_KEY="your-api-key" \
ALLOWED_ORIGINS="*" \
npx @iloveagents/foundry-voice-live-proxy-node
```

Then change your connection config to use the proxy (no API key needed):
```ts
connection: {
  proxyUrl: "ws://localhost:8080/ws", // use wss:// in production
},
```

Pro tip: put the proxy behind auth, rate-limit it, and set `ALLOWED_ORIGINS` to your real origins rather than `*`.
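The rate-limiting part can be as small as a token bucket keyed by client IP, checked before you accept a WebSocket upgrade. A minimal sketch (this is my own illustration, not part of the proxy package):

```typescript
// Hypothetical per-IP token bucket to gate WebSocket upgrade requests.
type Bucket = { tokens: number; last: number };

class RateLimiter {
  private buckets = new Map<string, Bucket>();
  // capacity: burst size; refillPerSec: sustained connections per second.
  constructor(private capacity = 5, private refillPerSec = 1) {}

  // Returns true if this connection attempt is allowed.
  allow(ip: string, now = Date.now()): boolean {
    const b = this.buckets.get(ip) ?? { tokens: this.capacity, last: now };
    // Refill tokens for the elapsed time, capped at capacity.
    b.tokens = Math.min(
      this.capacity,
      b.tokens + ((now - b.last) / 1000) * this.refillPerSec
    );
    b.last = now;
    const ok = b.tokens >= 1;
    if (ok) b.tokens -= 1;
    this.buckets.set(ip, b);
    return ok;
  }
}

// Sketch of use in an http server's 'upgrade' handler:
//   if (!limiter.allow(req.socket.remoteAddress ?? "")) socket.destroy();
const limiter = new RateLimiter(5, 1);
```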
Function Calling
Voice agents can call tools. Swap App.tsx for this:
```tsx
import { useRef, useEffect } from "react";
import { useVoiceLive } from "@iloveagents/foundry-voice-live-react";

function App() {
  const audioRef = useRef<HTMLAudioElement>(null);
  const { connect, disconnect, connectionState, audioStream, sendEvent } = useVoiceLive({
    connection: {
      proxyUrl: "ws://localhost:8080/ws",
    },
    session: {
      instructions: "Help users check order status. Use the get_order_status tool when asked.",
      tools: [
        {
          type: "function",
          name: "get_order_status",
          description: "Look up the status of a customer order",
          parameters: {
            type: "object",
            properties: {
              order_id: { type: "string", description: "The order ID" },
            },
            required: ["order_id"],
          },
        },
      ],
      toolChoice: "auto",
    },
    // Handle tool calls from the AI
    toolExecutor: (name, args, callId) => {
      let result = {};
      if (name === "get_order_status") {
        const { order_id } = JSON.parse(args);
        // Mock order status lookup
        result = { order_id, status: "processing" };
      }
      // Send the result back to continue the conversation
      sendEvent({
        type: "conversation.item.create",
        item: {
          type: "function_call_output",
          call_id: callId,
          output: JSON.stringify(result),
        },
      });
      sendEvent({ type: "response.create" });
    },
  });

  // Connect the audio stream to the audio element
  useEffect(() => {
    if (audioRef.current && audioStream) {
      audioRef.current.srcObject = audioStream;
    }
  }, [audioStream]);

  return (
    <div>
      <p>Status: {connectionState}</p>
      <button onClick={connect} disabled={connectionState === "connected"}>
        Start Conversation
      </button>
      <button onClick={disconnect} disabled={connectionState !== "connected"}>
        End
      </button>
      <audio ref={audioRef} autoPlay />
    </div>
  );
}

export default App;
```
Try asking: “What’s the status of order 1345?” — the agent will call the tool and speak the result.
When the model calls the tool, you run it, send a function_call_output, then trigger response.create so it can speak.
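That two-event sequence is the whole protocol from your side, so it is easy to factor out. A small sketch of the pattern, with `buildToolResultEvents` as a hypothetical helper name (the event shapes are the ones used in the example above):

```typescript
// Events the session expects after a tool call, as used above.
type VoiceLiveEvent =
  | {
      type: "conversation.item.create";
      item: { type: "function_call_output"; call_id: string; output: string };
    }
  | { type: "response.create" };

// Package a tool result into the two events: first the
// function_call_output item, then response.create to resume speech.
function buildToolResultEvents(callId: string, result: unknown): VoiceLiveEvent[] {
  return [
    {
      type: "conversation.item.create",
      item: {
        type: "function_call_output",
        call_id: callId,
        output: JSON.stringify(result),
      },
    },
    { type: "response.create" },
  ];
}

// Inside a toolExecutor you could then write:
//   buildToolResultEvents(callId, { order_id, status: "processing" }).forEach(sendEvent);
```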
Run the Examples
Clone the repo to see it in action:
```shell
git clone https://github.com/iloveagents/foundry-voice-live.git
cd foundry-voice-live
just install

# Configure credentials
cp packages/proxy-node/.env.example packages/proxy-node/.env
cp examples/.env.example examples/.env
# Edit both .env files with your Foundry credentials

# Start proxy + examples
just dev
```

Open http://localhost:3001 to explore the voice-only, avatar, and function-calling examples.
Foundry Agent Service
Since v0.3.0, the SDK supports Foundry Agent Service—Microsoft’s recommended way to connect voice to agents built in Azure AI Foundry.
The difference from standard Voice Live: instead of sending instructions in the session config, you point to an agent you’ve already configured in the Foundry portal. The agent handles its own system prompt, tools, and grounding.
This is where it gets interesting. Your Foundry agent can use Foundry IQ for knowledge grounding—permission-aware RAG across SharePoint, Azure Blob, OneLake, and the web—plus agent memory for retaining context across conversations. All that richness, and the voice client stays dead simple:
```tsx
import { useRef, useEffect } from "react";
import { useVoiceLive, sessionConfig } from "@iloveagents/foundry-voice-live-react";

function App() {
  const audioRef = useRef<HTMLAudioElement>(null);
  const { connect, disconnect, connectionState, audioStream } = useVoiceLive({
    connection: {
      proxyUrl: "ws://localhost:8080/ws?agentName=MyAgent&projectName=my-project",
    },
    session: sessionConfig()
      .voice("en-US-AvaMultilingualNeural")
      .semanticVAD({ interruptResponse: true })
      .transcription()
      .build(),
  });

  useEffect(() => {
    if (audioRef.current && audioStream) {
      audioRef.current.srcObject = audioStream;
    }
  }, [audioStream]);

  return (
    <div>
      <p>Status: {connectionState}</p>
      <button onClick={connect}>Start</button>
      <button onClick={disconnect}>Stop</button>
      <audio ref={audioRef} autoPlay />
    </div>
  );
}

export default App;
```

This is actually pretty cool: the agent’s knowledge, memory, and tools all live server-side in Foundry. The React app just opens the voice channel. Complexity stays on the platform; the client stays simple.
Two auth paths, both handled by the proxy:
- Server-side (simplest) — `DefaultAzureCredential` acquires tokens automatically. Just `az login` for dev, managed identity in production. No app registration needed.
- Browser-side (MSAL) — Each user signs in with their own Entra ID identity. Pass the token as a URL param: `?token=${msalToken}`
Both work with avatars—add `.avatar('lisa', 'casual-sitting', { codec: 'h264' })` to your session config. The examples include all four combinations: voice, voice+MSAL, avatar, avatar+MSAL.
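For the browser-side path, composing the proxy URL is the only client-side wiring. A sketch under the query-parameter conventions shown above (`agentName`, `projectName`, `token`); how you obtain the token from MSAL is up to your app:

```typescript
// Build the proxy WebSocket URL for the MSAL path. The parameter names
// follow the examples in this post; the base URL is yours.
function buildProxyUrl(
  base: string,
  opts: { agentName: string; projectName: string; token?: string }
): string {
  const url = new URL(base);
  url.searchParams.set("agentName", opts.agentName);
  url.searchParams.set("projectName", opts.projectName);
  // Only the browser-side (MSAL) path passes a token; server-side auth
  // happens in the proxy via DefaultAzureCredential.
  if (opts.token) url.searchParams.set("token", opts.token);
  return url.toString();
}

// Usage sketch (msalToken comes from your MSAL acquire-token flow):
//   connection: {
//     proxyUrl: buildProxyUrl("wss://your-proxy.example.com/ws", {
//       agentName: "MyAgent",
//       projectName: "my-project",
//       token: msalToken,
//     }),
//   }
```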
Tradeoffs & Current Limitations
The honest bit (short version):
- Preview-ish surface — expect breaking changes as Voice Live evolves.
- Cost — voice + avatars can get pricey; watch usage.
Let’s Build Multi-Modal Agents Together
I built this because I think voice is where agents become genuinely useful—not just “wow demos”, but tools people actually want around. It’s MIT licensed and contributions are welcome.
What voice agent scenarios are you exploring? Customer support? Accessibility? Hands-free interfaces? I’d love to hear what you’re building.
Star the repo, try the examples, and let me know what’s missing.