Setting Up Voice Message Transcription in OpenClaw
This guide configures OpenClaw to automatically transcribe incoming voice messages using an Azure OpenAI `gpt-4o-transcribe` deployment. After setup, users can send voice messages via Discord, Telegram, etc., and the agent receives plain text; the transcription is fully transparent to the user.
Prerequisites
- OpenClaw installed and running
- An Azure OpenAI resource with a `gpt-4o-transcribe` model deployment
Step 1: Create the Azure OpenAI Transcription Deployment
If you don’t have one yet:
- Go to Azure OpenAI Studio
- Select your Azure OpenAI resource
- Go to Deployments → Create new deployment
- Model: `gpt-4o-transcribe`
- Give it a deployment name (e.g. `gpt-4o-transcribe`)
- Note down:
  - Resource name: the subdomain in your endpoint URL (e.g. `my-resource` from `https://my-resource.openai.azure.com`)
  - Deployment name: what you named it (e.g. `gpt-4o-transcribe`)
  - API key: found in Azure Portal → your resource → Keys and Endpoint
Step 2: Test the Endpoint
Before configuring OpenClaw, verify the endpoint works:
curl -s "https://<your-resource>.openai.azure.com/openai/deployments/<your-deployment>/audio/transcriptions?api-version=2025-03-01-preview" \
-H "api-key: <your-api-key>" \
-F "file=@test-audio.mp3"
You should get a JSON response with the transcribed text. If you get an error, check your resource name, deployment name, and API key.
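If you script this check, the endpoint URL has a fixed shape that can be assembled from the three values noted in Step 1. A minimal sketch (the function and parameter names are illustrative, not part of OpenClaw or the Azure SDK):

```python
# Build the Azure OpenAI transcription endpoint URL from its parts.
# Names here are illustrative helpers, not an official API.

def transcription_url(resource: str, deployment: str,
                      api_version: str = "2025-03-01-preview") -> str:
    """Return the audio/transcriptions URL for an Azure OpenAI deployment."""
    return (
        f"https://{resource}.openai.azure.com"
        f"/openai/deployments/{deployment}"
        f"/audio/transcriptions?api-version={api_version}"
    )

url = transcription_url("my-resource", "gpt-4o-transcribe")
print(url)
```

If the printed URL does not match the one in your curl test, one of the three values is wrong.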
Step 3: Configure OpenClaw
Edit `openclaw.json` and add or update the `tools.media.audio` section:
{
"tools": {
"media": {
"audio": {
"enabled": true,
"models": [
{
"type": "cli",
"command": "curl",
"args": [
"-s",
"https://<your-resource>.openai.azure.com/openai/deployments/<your-deployment>/audio/transcriptions?api-version=2025-03-01-preview",
"-H",
"api-key: <your-api-key>",
"-F",
"file=@{{MediaPath}}"
]
}
]
}
}
}
}
Placeholders to replace
| Placeholder | Example | Where to find |
|---|---|---|
| `<your-resource>` | `my-aoai-eastus` | Azure Portal → your OpenAI resource → Overview → Endpoint URL subdomain |
| `<your-deployment>` | `gpt-4o-transcribe` | Azure OpenAI Studio → Deployments → deployment name |
| `<your-api-key>` | `abc123...` | Azure Portal → your OpenAI resource → Keys and Endpoint → Key 1 or Key 2 |
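If you prefer to fill the placeholders programmatically rather than by hand, a sketch of the substitution (the values are the table's examples, not real credentials; note that `{{MediaPath}}` is deliberately left untouched):

```python
# Substitute the three user placeholders into the curl args.
# {{MediaPath}} is intentionally NOT replaced -- OpenClaw fills it at runtime.

values = {  # example values from the table above; use your own
    "<your-resource>": "my-aoai-eastus",
    "<your-deployment>": "gpt-4o-transcribe",
    "<your-api-key>": "abc123...",
}

args = [
    "-s",
    "https://<your-resource>.openai.azure.com/openai/deployments/"
    "<your-deployment>/audio/transcriptions?api-version=2025-03-01-preview",
    "-H",
    "api-key: <your-api-key>",
    "-F",
    "file=@{{MediaPath}}",
]

filled = args
for placeholder, value in values.items():
    filled = [a.replace(placeholder, value) for a in filled]

print(filled)
```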
Important: Do NOT change `{{MediaPath}}`
`{{MediaPath}}` is an OpenClaw template variable. At runtime, OpenClaw automatically replaces it with the actual path to the received audio file. Leave it exactly as `{{MediaPath}}`.
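To illustrate the behavior (this is a sketch of what the gateway does, not OpenClaw's internal code; the file path is hypothetical): at runtime, each configured argument has `{{MediaPath}}` replaced with the received file's path, producing the final curl argv.

```python
# Sketch of the gateway's runtime template substitution (illustrative only).

def render_args(args, media_path):
    """Replace the {{MediaPath}} template variable in each argument."""
    return [a.replace("{{MediaPath}}", media_path) for a in args]

argv = ["curl"] + render_args(
    ["-s", "-F", "file=@{{MediaPath}}"],
    "/tmp/voice-message-123.ogg",  # hypothetical path chosen by the gateway
)
print(argv)
```

This is why the literal `{{MediaPath}}` string must survive in your config: if you replace it yourself, the gateway has nothing to substitute and curl uploads a nonexistent file.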
Step 4: Restart OpenClaw
openclaw gateway restart
Step 5: Verify
- Send a voice message to your OpenClaw bot (via Discord, Telegram, etc.)
- The agent should respond to the spoken content as text
- Check status — the media summary should show:
📎 Media: audio ok
If the agent doesn’t understand the voice message or responds with something unrelated, check:
- Is `curl` available on the system? (`which curl`)
- Are the Azure credentials correct? (re-run the test from Step 2)
- Is the `tools.media.audio` section properly nested in `openclaw.json`? (validate JSON syntax)
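For the last check, a quick way to validate both the JSON syntax and the nesting using only the Python standard library (the sample below parses an inline string; point `json.load` at your real `openclaw.json` instead):

```python
import json

# Validate openclaw.json syntax and the tools.media.audio nesting.
# Inline sample config for illustration; load your real file in practice.
config_text = """
{
  "tools": {
    "media": {
      "audio": {
        "enabled": true,
        "models": [{"type": "cli", "command": "curl", "args": ["-s"]}]
      }
    }
  }
}
"""

config = json.loads(config_text)  # raises ValueError on a syntax error
audio = config.get("tools", {}).get("media", {}).get("audio", {})

assert audio.get("enabled") is True, "tools.media.audio.enabled must be true"
assert audio.get("models"), "tools.media.audio.models must be non-empty"
print("config OK")
```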
How It Works
The transcription pipeline runs before the message reaches the agent:
User sends voice message
↓
OpenClaw gateway receives audio file
↓
Gateway runs the configured curl command with the audio file
↓
Azure OpenAI returns transcribed text (JSON)
↓
Gateway extracts text and delivers it to the agent as a normal message
↓
Agent sees plain text, responds normally
The agent never sees the audio file — it only receives the transcribed text. This is a gateway-level feature, not a skill.
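The pipeline above can be sketched as follows (a toy model, not OpenClaw's actual implementation; the transcriber is stubbed where the real gateway shells out to curl):

```python
# Toy sketch of the gateway-level transcription pipeline (illustrative).

def handle_incoming(media_path, transcribe, deliver):
    """Transcribe an audio file and hand plain text to the agent."""
    result = transcribe(media_path)  # gateway runs curl; Azure returns JSON
    text = result["text"]            # gateway extracts the transcribed text
    return deliver(text)             # agent receives only plain text

# Stub transcriber standing in for the curl + Azure OpenAI call.
fake_transcribe = lambda path: {"text": "hello from a voice message"}
delivered = handle_incoming("/tmp/audio.ogg", fake_transcribe, lambda t: t)
print(delivered)
```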
Full openclaw.json Context
The `tools.media.audio` config sits inside the top-level `tools` object. Here’s where it fits in the overall structure:
{
"agents": { ... },
"channels": { ... },
"gateway": { ... },
"tools": {
"media": {
"audio": {
"enabled": true,
"models": [
{
"type": "cli",
"command": "curl",
"args": [
"-s",
"https://<your-resource>.openai.azure.com/openai/deployments/<your-deployment>/audio/transcriptions?api-version=2025-03-01-preview",
"-H",
"api-key: <your-api-key>",
"-F",
"file=@{{MediaPath}}"
]
}
]
}
},
"exec": { ... }
}
}