Foundry Voice Live React SDK: Building Multi-Modal AI Agents

Open-source React SDK for Microsoft Foundry Voice Live API. Build real-time voice agents with avatars, secure WebSocket proxy, and production-ready patterns.

I keep saying this: the next stage of agents isn’t “type better prompts”. It’s multi-modal—voice, video, avatars—where the agent feels less like a chat box and more like a teammate.

So here’s the thing I wanted but couldn’t find: React hooks + components for Microsoft Foundry Voice Live API.

This SDK helps you build real-time voice AI apps with:

  • Azure video avatars
  • Live2D / 3D avatars
  • audio visualizers
  • function calling
  • TypeScript-first ergonomics

This post is the quick start. The GitHub repo has the more verbose examples, wiring, and “okay but how do I ship this?” details.


Why Voice Changes Everything

Text agents are great. They’re also… text. Voice unlocks:

  • Natural conversations — No typing, no waiting. Talk to your agent like a colleague.
  • Accessibility — Voice interfaces reach users who can’t or won’t type.
  • Hands-free workflows — Field workers, drivers, surgeons—anyone whose hands are busy.
  • Emotional context — Tone, pace, and pauses carry meaning that text loses.
  • Avatar presence — Visual feedback builds trust and engagement.

Voice Live handles the hard parts: streaming speech in/out with a single session, plus optional avatars. When it works, it feels surprisingly “present”.

What We’re Building

You get two packages:

| Package | Purpose |
| --- | --- |
| @iloveagents/foundry-voice-live-react | React hooks and avatar components |
| @iloveagents/foundry-voice-live-proxy-node | Secure WebSocket proxy for production |

Here’s the architecture:

```
┌─────────────────────────────────────────┐
│             Your React App              │
└─────────────────────────────────────────┘
┌─────────────────────────────────────────┐
│  @iloveagents/foundry-voice-live-react  │
│   • useVoiceLive hook                   │
│   • VoiceLiveAvatar component           │
└─────────────────────────────────────────┘
            ┌───────────┴───────────┐
            ▼                       ▼
    ┌───────────────┐      ┌───────────────────┐
    │  Direct API   │  OR  │   Proxy Server    │
    │  (Dev only)   │      │   (Production)    │
    └───────────────┘      └───────────────────┘
┌─────────────────────────────────────────┐
│   Microsoft Foundry Voice Live API      │
└─────────────────────────────────────────┘
```

The proxy is critical. Browser-based apps can’t safely hold API keys—anyone can inspect network traffic. The proxy authenticates server-side, so credentials never touch the client.

Quick Start

```shell
npm create vite@latest voice-agent -- --template react-ts
cd voice-agent
npm install @iloveagents/foundry-voice-live-react
```

Replace src/App.tsx with the code below, then npm run dev:

Voice-Only Agent

App.tsx
```tsx
import { useRef, useEffect } from "react";
import { useVoiceLive } from "@iloveagents/foundry-voice-live-react";

function App() {
  const audioRef = useRef<HTMLAudioElement>(null);
  const { connect, disconnect, connectionState, audioStream } = useVoiceLive({
    connection: {
      resourceName: 'your-foundry-resource',
      apiKey: 'your-api-key', // Dev only. Use proxy in production.
    },
    session: {
      instructions: 'You are a helpful assistant for iLoveAgents.',
    },
  });

  useEffect(() => {
    if (audioRef.current && audioStream) {
      audioRef.current.srcObject = audioStream;
    }
  }, [audioStream]);

  return (
    <div>
      <p>Status: {connectionState}</p>
      <button onClick={connect} disabled={connectionState === 'connected'}>
        Start Conversation
      </button>
      <button onClick={disconnect} disabled={connectionState !== 'connected'}>
        End
      </button>
      <audio ref={audioRef} autoPlay />
    </div>
  );
}

export default App;
```

On connect, the hook requests mic access and starts streaming.

With Avatar

Want a face? Swap App.tsx for this:

App.tsx
```tsx
import { useVoiceLive, VoiceLiveAvatar } from "@iloveagents/foundry-voice-live-react";

function App() {
  const { videoStream, audioStream, connect, disconnect } = useVoiceLive({
    connection: {
      resourceName: "your-foundry-resource",
      apiKey: "your-api-key",
    },
    session: {
      instructions: "You are a friendly assistant with a visual presence.",
      voice: { name: "en-US-AvaMultilingualNeural", type: "azure-standard" },
      avatar: { character: "lisa", style: "casual-sitting" },
    },
  });

  return (
    <div>
      <VoiceLiveAvatar videoStream={videoStream} audioStream={audioStream} />
      <button onClick={connect}>Start</button>
      <button onClick={disconnect}>Stop</button>
    </div>
  );
}

export default App;
```

The avatar syncs lip movements with speech.

Production: Secure Proxy

Never ship API keys to the browser. Run the proxy server-side:

```shell
# Docker (recommended)
docker run -p 8080:8080 \
  -e FOUNDRY_RESOURCE_NAME=your-foundry-resource \
  -e FOUNDRY_API_KEY="your-api-key" \
  -e ALLOWED_ORIGINS="*" \
  ghcr.io/iloveagents/foundry-voice-live-proxy:latest
```

Or with npx for quick testing:

```shell
FOUNDRY_RESOURCE_NAME=your-foundry-resource \
FOUNDRY_API_KEY="your-api-key" \
ALLOWED_ORIGINS="*" \
npx @iloveagents/foundry-voice-live-proxy-node
```

Then change your connection config to use the proxy (no API key needed):

```typescript
connection: {
  proxyUrl: 'ws://localhost:8080/ws', // use wss:// in production
}
```

Pro tip: put the proxy behind auth and rate-limit it.
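What that gate might look like is up to you; here's a minimal sketch of the idea — validate a bearer token, then rate-limit connection attempts per client. All the names and limits here are mine, not part of the proxy package, and the token check is a placeholder (use real JWT validation in practice):

```typescript
// Sketch only: hypothetical auth + rate-limit gate in front of the proxy.
// isValidToken is a stand-in -- swap in real JWT/Entra ID validation.
type Clock = () => number;

const WINDOW_MS = 60_000;              // sliding window: 1 minute
const MAX_CONNECTS_PER_WINDOW = 5;     // arbitrary example limit

const attempts = new Map<string, number[]>();

function isValidToken(token: string | null): boolean {
  return token !== null && token.length > 0; // placeholder check
}

function allowConnection(
  clientId: string,
  token: string | null,
  now: Clock = Date.now,
): boolean {
  if (!isValidToken(token)) return false;
  const t = now();
  // Keep only attempts inside the sliding window, then check the cap.
  const recent = (attempts.get(clientId) ?? []).filter((ts) => t - ts < WINDOW_MS);
  if (recent.length >= MAX_CONNECTS_PER_WINDOW) return false;
  recent.push(t);
  attempts.set(clientId, recent);
  return true;
}
```

You would call allowConnection during the WebSocket upgrade and reject the handshake when it returns false.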

Function Calling

Voice agents can call tools. Swap App.tsx for this:

App.tsx
```tsx
import { useRef, useEffect } from "react";
import { useVoiceLive } from "@iloveagents/foundry-voice-live-react";

function App() {
  const audioRef = useRef<HTMLAudioElement>(null);
  const { connect, disconnect, connectionState, audioStream, sendEvent } = useVoiceLive({
    connection: { proxyUrl: "ws://localhost:8080/ws" },
    session: {
      instructions: "Help users check order status. Use the get_order_status tool when asked.",
      tools: [
        {
          type: "function",
          name: "get_order_status",
          description: "Look up the status of a customer order",
          parameters: {
            type: "object",
            properties: {
              order_id: { type: "string", description: "The order ID" },
            },
            required: ["order_id"],
          },
        },
      ],
      toolChoice: "auto",
    },
    // Handle tool calls from the AI
    toolExecutor: (name, args, callId) => {
      let result = {};
      if (name === "get_order_status") {
        const { order_id } = JSON.parse(args);
        // Mock order status lookup
        result = { order_id, status: "processing" };
      }
      // Send result back to continue the conversation
      sendEvent({
        type: "conversation.item.create",
        item: {
          type: "function_call_output",
          call_id: callId,
          output: JSON.stringify(result),
        },
      });
      sendEvent({ type: "response.create" });
    },
  });

  // Connect audio stream to audio element
  useEffect(() => {
    if (audioRef.current && audioStream) {
      audioRef.current.srcObject = audioStream;
    }
  }, [audioStream]);

  return (
    <div>
      <p>Status: {connectionState}</p>
      <button onClick={connect} disabled={connectionState === "connected"}>
        Start Conversation
      </button>
      <button onClick={disconnect} disabled={connectionState !== "connected"}>
        End
      </button>
      <audio ref={audioRef} autoPlay />
    </div>
  );
}

export default App;
```

Try asking: “What’s the status of order 1345?” — the agent will call the tool and speak the result.

When the model calls the tool, you run it, send a function_call_output, then trigger response.create so it can speak.
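If you have several tools, it can help to factor that two-event reply into a small pure helper. This is my own naming, not part of the SDK — a sketch of building the event payloads described above:

```typescript
// Hypothetical helper (not part of the SDK): given a tool result, build
// the two events the client sends back -- the function_call_output item,
// then a response.create to resume the model's turn.
type VoiceLiveEvent =
  | {
      type: "conversation.item.create";
      item: { type: "function_call_output"; call_id: string; output: string };
    }
  | { type: "response.create" };

function toolResultEvents(callId: string, result: unknown): VoiceLiveEvent[] {
  return [
    {
      type: "conversation.item.create",
      item: {
        type: "function_call_output",
        call_id: callId,
        output: JSON.stringify(result),
      },
    },
    { type: "response.create" },
  ];
}
```

Inside toolExecutor you would then just do `toolResultEvents(callId, result).forEach(sendEvent)`.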

Run the Examples

Clone the repo to see it in action:

```shell
git clone https://github.com/iloveagents/foundry-voice-live.git
cd foundry-voice-live
just install

# Configure credentials
cp packages/proxy-node/.env.example packages/proxy-node/.env
cp examples/.env.example examples/.env
# Edit both .env files with your Foundry credentials

# Start proxy + examples
just dev
```

Open http://localhost:3001 to explore voice-only, avatar, and function-calling examples.

Foundry Agent Service

Since v0.3.0, the SDK supports Foundry Agent Service—Microsoft’s recommended way to connect voice to agents built in Azure AI Foundry.

The difference from standard Voice Live: instead of sending instructions in the session config, you point to an agent you’ve already configured in the Foundry portal. The agent handles its own system prompt, tools, and grounding.

This is where it gets interesting. Your Foundry agent can use Foundry IQ for knowledge grounding—permission-aware RAG across SharePoint, Azure Blob, OneLake, and the web—plus agent memory for retaining context across conversations. All that richness, and the voice client stays dead simple:

App.tsx
```tsx
import { useRef, useEffect } from "react";
import { useVoiceLive, sessionConfig } from "@iloveagents/foundry-voice-live-react";

function App() {
  const audioRef = useRef<HTMLAudioElement>(null);
  const { connect, disconnect, connectionState, audioStream } = useVoiceLive({
    connection: {
      proxyUrl: "ws://localhost:8080/ws?agentName=MyAgent&projectName=my-project",
    },
    session: sessionConfig()
      .voice("en-US-AvaMultilingualNeural")
      .semanticVAD({ interruptResponse: true })
      .transcription()
      .build(),
  });

  useEffect(() => {
    if (audioRef.current && audioStream) {
      audioRef.current.srcObject = audioStream;
    }
  }, [audioStream]);

  return (
    <div>
      <p>Status: {connectionState}</p>
      <button onClick={connect}>Start</button>
      <button onClick={disconnect}>Stop</button>
      <audio ref={audioRef} autoPlay />
    </div>
  );
}

export default App;
```

This is actually pretty cool: the agent’s knowledge, memory, and tools all live server-side in Foundry. The React app just opens the voice channel. Complexity stays on the platform, the client stays simple.

Two auth paths, both handled by the proxy:

  • Server-side (simplest) — DefaultAzureCredential acquires tokens automatically. Just az login for dev, managed identity in production. No app registration needed.
  • Browser-side (MSAL) — Each user signs in with their own Entra ID identity. Pass the token as a URL param: ?token=${msalToken}

Both work with avatars—add .avatar('lisa', 'casual-sitting', { codec: 'h264' }) to your session config. The examples include all four combinations: voice, voice+MSAL, avatar, avatar+MSAL.
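For the browser-side path, the proxy URL carries the agent, project, and token as query params. A tiny helper for assembling it safely — the function name and example host are mine, not part of the SDK:

```typescript
// Hypothetical helper (not part of the SDK): build the proxy WebSocket URL
// with agentName/projectName query params and an optional MSAL token.
// URLSearchParams handles percent-encoding of the token for us.
function buildProxyUrl(
  base: string,        // e.g. "ws://localhost:8080/ws"; use wss:// in production
  agentName: string,
  projectName: string,
  token?: string,      // MSAL access token for the browser-side auth path
): string {
  const url = new URL(base);
  url.searchParams.set("agentName", agentName);
  url.searchParams.set("projectName", projectName);
  if (token) url.searchParams.set("token", token);
  return url.toString();
}
```

You would then pass the result as `connection.proxyUrl`, e.g. `buildProxyUrl("ws://localhost:8080/ws", "MyAgent", "my-project", msalToken)`.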

Tradeoffs & Current Limitations

The honest bit (short version):

  • Preview-ish surface — expect breaking changes as Voice Live evolves.
  • Cost — voice + avatars can get pricey; watch usage.

Let’s Build Multi-Modal Agents Together

I built this because I think voice is where agents become genuinely useful—not just “wow demos”, but tools people actually want around. It’s MIT licensed and contributions are welcome.

What voice agent scenarios are you exploring? Customer support? Accessibility? Hands-free interfaces? I’d love to hear what you’re building.

Star the repo, try the examples, and let me know what’s missing.

Resources