Foundry Voice Live React SDK: Building Multi-Modal AI Agents

Open-source React SDK for Microsoft Foundry Voice Live API. Build real-time voice agents with avatars, secure WebSocket proxy, and production-ready patterns.

I keep saying this: the next stage of agents isn’t “type better prompts”. It’s multi-modal—voice, video, avatars—where the agent feels less like a chat box and more like a teammate.

So here’s the thing I wanted but couldn’t find: React hooks + components for Microsoft Foundry Voice Live API.

This SDK helps you build real-time voice AI apps with:

  • Azure video avatars
  • Live2D / 3D avatars
  • audio visualizers
  • function calling
  • TypeScript-first ergonomics

This post is the quick start. The GitHub repo has the more verbose examples, wiring, and “okay but how do I ship this?” details.


Why Voice Changes Everything

Text agents are great. They’re also… text. Voice unlocks:

  • Natural conversations — No typing, no waiting. Talk to your agent like a colleague.
  • Accessibility — Voice interfaces reach users who can’t or won’t type.
  • Hands-free workflows — Field workers, drivers, surgeons—anyone whose hands are busy.
  • Emotional context — Tone, pace, and pauses carry meaning that text loses.
  • Avatar presence — Visual feedback builds trust and engagement.

Voice Live handles the hard parts: streaming speech in/out with a single session, plus optional avatars. When it works, it feels surprisingly “present”.

What We’re Building

You get two packages:

| Package | Purpose |
| --- | --- |
| @iloveagents/foundry-voice-live-react | React hooks and avatar components |
| @iloveagents/foundry-voice-live-proxy-node | Secure WebSocket proxy for production |

Here’s the architecture:

```
┌─────────────────────────────────────────┐
│             Your React App              │
└─────────────────────────────────────────┘
┌─────────────────────────────────────────┐
│  @iloveagents/foundry-voice-live-react  │
│   • useVoiceLive hook                   │
│   • VoiceLiveAvatar component           │
└─────────────────────────────────────────┘
            ┌───────────┴───────────┐
            ▼                       ▼
    ┌───────────────┐      ┌───────────────────┐
    │  Direct API   │  OR  │   Proxy Server    │
    │  (Dev only)   │      │   (Production)    │
    └───────────────┘      └───────────────────┘
┌─────────────────────────────────────────┐
│   Microsoft Foundry Voice Live API      │
└─────────────────────────────────────────┘
```

The proxy is critical. Browser-based apps can’t safely hold API keys—anyone can inspect network traffic. The proxy authenticates server-side, so credentials never touch the client.

Quick Start

```shell
npm create vite@latest voice-agent -- --template react-ts
cd voice-agent
npm install @iloveagents/foundry-voice-live-react
```

Replace src/App.tsx with the code below, then npm run dev:

Voice-Only Agent

App.tsx
```tsx
import { useRef, useEffect } from "react";
import { useVoiceLive } from "@iloveagents/foundry-voice-live-react";

function App() {
  const audioRef = useRef<HTMLAudioElement>(null);
  const { connect, disconnect, connectionState, audioStream } = useVoiceLive({
    connection: {
      resourceName: 'your-foundry-resource',
      apiKey: 'your-api-key', // Dev only. Use proxy in production.
    },
    session: {
      instructions: 'You are a helpful assistant for iLoveAgents.',
    },
  });

  useEffect(() => {
    if (audioRef.current && audioStream) {
      audioRef.current.srcObject = audioStream;
    }
  }, [audioStream]);

  return (
    <div>
      <p>Status: {connectionState}</p>
      <button onClick={connect} disabled={connectionState === 'connected'}>
        Start Conversation
      </button>
      <button onClick={disconnect} disabled={connectionState !== 'connected'}>
        End
      </button>
      <audio ref={audioRef} autoPlay />
    </div>
  );
}

export default App;
```

On connect, the hook requests mic access and starts streaming.

With Avatar

Want a face? Swap App.tsx for this:

App.tsx
```tsx
import { useVoiceLive, VoiceLiveAvatar } from "@iloveagents/foundry-voice-live-react";

function App() {
  const { videoStream, audioStream, connect, disconnect } = useVoiceLive({
    connection: {
      resourceName: "your-foundry-resource",
      apiKey: "your-api-key",
    },
    session: {
      instructions: "You are a friendly assistant with a visual presence.",
      voice: { name: "en-US-AvaMultilingualNeural", type: "azure-standard" },
      avatar: { character: "lisa", style: "casual-sitting" },
    },
  });

  return (
    <div>
      <VoiceLiveAvatar videoStream={videoStream} audioStream={audioStream} />
      <button onClick={connect}>Start</button>
      <button onClick={disconnect}>Stop</button>
    </div>
  );
}

export default App;
```

The avatar syncs lip movements with speech.

Production: Secure Proxy

Never ship API keys to the browser. Run the proxy server-side:

```shell
# Docker (recommended)
docker run -p 8080:8080 \
  -e FOUNDRY_RESOURCE_NAME=your-foundry-resource \
  -e FOUNDRY_API_KEY="your-api-key" \
  -e ALLOWED_ORIGINS="*" \
  ghcr.io/iloveagents/foundry-voice-live-proxy:latest
```

Or with npx for quick testing:

```shell
FOUNDRY_RESOURCE_NAME=your-foundry-resource \
FOUNDRY_API_KEY="your-api-key" \
ALLOWED_ORIGINS="*" \
npx @iloveagents/foundry-voice-live-proxy-node
```

Then change your connection config to use the proxy (no API key needed):

```typescript
connection: {
  proxyUrl: 'ws://localhost:8080/ws', // use wss:// in production
}
```

Pro tip: put the proxy behind auth and rate-limit it.
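What that gate might look like is up to you; here's a minimal sketch of the idea — validate a bearer token, then rate-limit connection attempts per client. All the names and limits here are mine, not part of the proxy package, and the token check is a placeholder (use real JWT validation in practice):

```typescript
// Sketch only: hypothetical auth + rate-limit gate in front of the proxy.
// isValidToken is a stand-in -- swap in real JWT/Entra ID validation.
type Clock = () => number;

const WINDOW_MS = 60_000;              // sliding window: 1 minute
const MAX_CONNECTS_PER_WINDOW = 5;     // arbitrary example limit

const attempts = new Map<string, number[]>();

function isValidToken(token: string | null): boolean {
  return token !== null && token.length > 0; // placeholder check
}

function allowConnection(
  clientId: string,
  token: string | null,
  now: Clock = Date.now,
): boolean {
  if (!isValidToken(token)) return false;
  const t = now();
  // Keep only attempts inside the sliding window, then check the cap.
  const recent = (attempts.get(clientId) ?? []).filter((ts) => t - ts < WINDOW_MS);
  if (recent.length >= MAX_CONNECTS_PER_WINDOW) return false;
  recent.push(t);
  attempts.set(clientId, recent);
  return true;
}
```

You would call allowConnection during the WebSocket upgrade and reject the handshake when it returns false.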

Function Calling

Voice agents can call tools. Swap App.tsx for this:

App.tsx
```tsx
import { useRef, useEffect } from "react";
import { useVoiceLive } from "@iloveagents/foundry-voice-live-react";

function App() {
  const audioRef = useRef<HTMLAudioElement>(null);
  const { connect, disconnect, connectionState, audioStream, sendEvent } = useVoiceLive({
    connection: { proxyUrl: "ws://localhost:8080/ws" },
    session: {
      instructions: "Help users check order status. Use the get_order_status tool when asked.",
      tools: [
        {
          type: "function",
          name: "get_order_status",
          description: "Look up the status of a customer order",
          parameters: {
            type: "object",
            properties: {
              order_id: { type: "string", description: "The order ID" },
            },
            required: ["order_id"],
          },
        },
      ],
      toolChoice: "auto",
    },
    // Handle tool calls from the AI
    toolExecutor: (name, args, callId) => {
      let result = {};
      if (name === "get_order_status") {
        const { order_id } = JSON.parse(args);
        // Mock order status lookup
        result = { order_id, status: "processing" };
      }
      // Send result back to continue the conversation
      sendEvent({
        type: "conversation.item.create",
        item: {
          type: "function_call_output",
          call_id: callId,
          output: JSON.stringify(result),
        },
      });
      sendEvent({ type: "response.create" });
    },
  });

  // Connect audio stream to audio element
  useEffect(() => {
    if (audioRef.current && audioStream) {
      audioRef.current.srcObject = audioStream;
    }
  }, [audioStream]);

  return (
    <div>
      <p>Status: {connectionState}</p>
      <button onClick={connect} disabled={connectionState === "connected"}>
        Start Conversation
      </button>
      <button onClick={disconnect} disabled={connectionState !== "connected"}>
        End
      </button>
      <audio ref={audioRef} autoPlay />
    </div>
  );
}

export default App;
```

Try asking: “What’s the status of order 1345?” — the agent will call the tool and speak the result.

When the model calls the tool, you run it, send a function_call_output, then trigger response.create so it can speak.
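If you have several tools, it can help to factor that two-event reply into a small pure helper. This is my own naming, not part of the SDK — a sketch of building the event payloads described above:

```typescript
// Hypothetical helper (not part of the SDK): given a tool result, build
// the two events the client sends back -- the function_call_output item,
// then a response.create to resume the model's turn.
type VoiceLiveEvent =
  | {
      type: "conversation.item.create";
      item: { type: "function_call_output"; call_id: string; output: string };
    }
  | { type: "response.create" };

function toolResultEvents(callId: string, result: unknown): VoiceLiveEvent[] {
  return [
    {
      type: "conversation.item.create",
      item: {
        type: "function_call_output",
        call_id: callId,
        output: JSON.stringify(result),
      },
    },
    { type: "response.create" },
  ];
}
```

Inside toolExecutor you would then just do `toolResultEvents(callId, result).forEach(sendEvent)`.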

Run the Examples

Clone the repo to see it in action:

```shell
git clone https://github.com/iloveagents/foundry-voice-live.git
cd foundry-voice-live
just install

# Configure credentials
cp packages/proxy-node/.env.example packages/proxy-node/.env
cp examples/.env.example examples/.env
# Edit both .env files with your Foundry credentials

# Start proxy + examples
just dev
```

Open http://localhost:3001 to explore voice-only, avatar, and function-calling examples.

Foundry Agent Service

Since v0.3.0, the SDK supports Foundry Agent Service—Microsoft’s recommended way to connect voice to agents built in Azure AI Foundry.

The difference from standard Voice Live: instead of sending instructions in the session config, you point to an agent you’ve already configured in the Foundry portal. The agent handles its own system prompt, tools, and grounding.

This is where it gets interesting. Your Foundry agent can use Foundry IQ for knowledge grounding—permission-aware RAG across SharePoint, Azure Blob, OneLake, and the web—plus agent memory for retaining context across conversations. All that richness, and the voice client stays dead simple:

App.tsx
```tsx
import { useRef, useEffect } from "react";
import { useVoiceLive, sessionConfig } from "@iloveagents/foundry-voice-live-react";

function App() {
  const audioRef = useRef<HTMLAudioElement>(null);
  const { connect, disconnect, connectionState, audioStream } = useVoiceLive({
    connection: {
      proxyUrl: "ws://localhost:8080/ws?agentName=MyAgent&projectName=my-project",
    },
    session: sessionConfig()
      .voice("en-US-AvaMultilingualNeural")
      .semanticVAD({ interruptResponse: true })
      .transcription()
      .build(),
  });

  useEffect(() => {
    if (audioRef.current && audioStream) {
      audioRef.current.srcObject = audioStream;
    }
  }, [audioStream]);

  return (
    <div>
      <p>Status: {connectionState}</p>
      <button onClick={connect}>Start</button>
      <button onClick={disconnect}>Stop</button>
      <audio ref={audioRef} autoPlay />
    </div>
  );
}

export default App;
```

This is actually pretty cool: the agent’s knowledge, memory, and tools all live server-side in Foundry. The React app just opens the voice channel. Complexity stays on the platform, the client stays simple.

Two auth paths, both handled by the proxy:

  • Server-side (simplest) — DefaultAzureCredential acquires tokens automatically. Just az login for dev, managed identity in production. No app registration needed.
  • Browser-side (MSAL) — Each user signs in with their own Entra ID identity. Pass the token as a URL param: ?token=${msalToken}

Both work with avatars—add .avatar('lisa', 'casual-sitting', { codec: 'h264' }) to your session config. The examples include all four combinations: voice, voice+MSAL, avatar, avatar+MSAL.
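For the browser-side path, the proxy URL carries the agent, project, and token as query params. A tiny helper for assembling it safely — the function name and example host are mine, not part of the SDK:

```typescript
// Hypothetical helper (not part of the SDK): build the proxy WebSocket URL
// with agentName/projectName query params and an optional MSAL token.
// URLSearchParams handles percent-encoding of the token for us.
function buildProxyUrl(
  base: string,        // e.g. "ws://localhost:8080/ws"; use wss:// in production
  agentName: string,
  projectName: string,
  token?: string,      // MSAL access token for the browser-side auth path
): string {
  const url = new URL(base);
  url.searchParams.set("agentName", agentName);
  url.searchParams.set("projectName", projectName);
  if (token) url.searchParams.set("token", token);
  return url.toString();
}
```

You would then pass the result as `connection.proxyUrl`, e.g. `buildProxyUrl("ws://localhost:8080/ws", "MyAgent", "my-project", msalToken)`.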

Tradeoffs & Current Limitations

The honest bit (short version):

  • Preview-ish surface — expect breaking changes as Voice Live evolves.
  • Cost — voice + avatars can get pricey; watch usage.

Let’s Build Multi-Modal Agents Together

I built this because I think voice is where agents become genuinely useful—not just “wow demos”, but tools people actually want around. It’s MIT licensed and contributions are welcome.

What voice agent scenarios are you exploring? Customer support? Accessibility? Hands-free interfaces? I’d love to hear what you’re building.

Star the repo, try the examples, and let me know what’s missing.

Resources