🚀 Transcribe audio files into text with ElevenLabs Speech-to-Text. Supports 90+ languages with automatic detection, speaker identification, and audio event tagging (laughter, music, applause). Get word-level timestamps and process all major audio formats, including mp3, wav, m4a, ogg, and webm.

💡 Perfect for meetings, podcasts, voice notes, and interviews. Identify who's speaking, extract exact timing for editing, and process multilingual content without per-language setup. Ideal for journalists, researchers, content creators, and teams that need accurate transcripts quickly.

✨ Built-in speaker diarization distinguishes multiple voices automatically. JSON output with timestamps can be parsed directly by downstream tools, and automatic language detection means the language does not have to be specified up front.
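To illustrate how diarized JSON output with timestamps might be consumed, here is a minimal Python sketch that merges consecutive words from the same speaker into segments. The field names (`words`, `text`, `start`, `end`, `type`, `speaker_id`) and the sample data are assumptions made for this example, not a confirmed schema, so check them against the output you actually receive. The helper `group_by_speaker` is hypothetical and not part of any library.

```python
from typing import Any


def group_by_speaker(words: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Merge consecutive word entries from the same speaker into segments."""
    segments: list[dict[str, Any]] = []
    for w in words:
        # Assumed convention: non-word entries (audio events, spacing) carry
        # a "type" other than "word"; skip them when building speech segments.
        if w.get("type", "word") != "word":
            continue
        speaker = w.get("speaker_id")
        if segments and segments[-1]["speaker"] == speaker:
            # Same speaker as the previous word: extend the open segment.
            segments[-1]["text"] += " " + w["text"]
            segments[-1]["end"] = w["end"]
        else:
            # Speaker changed (or first word): start a new segment.
            segments.append(
                {
                    "speaker": speaker,
                    "start": w["start"],
                    "end": w["end"],
                    "text": w["text"],
                }
            )
    return segments


# Hand-written sample in the assumed shape; not real transcription output.
sample = {
    "language_code": "en",
    "words": [
        {"text": "Hello", "start": 0.0, "end": 0.4, "type": "word", "speaker_id": "speaker_0"},
        {"text": "there", "start": 0.5, "end": 0.9, "type": "word", "speaker_id": "speaker_0"},
        {"text": "(laughter)", "start": 1.0, "end": 1.5, "type": "audio_event", "speaker_id": "speaker_1"},
        {"text": "Hi", "start": 1.6, "end": 1.8, "type": "word", "speaker_id": "speaker_1"},
    ],
}

segments = group_by_speaker(sample["words"])
for seg in segments:
    print(f'[{seg["start"]:.1f}s to {seg["end"]:.1f}s] {seg["speaker"]}: {seg["text"]}')
```

Grouping at the word level keeps the original timestamps intact, so each segment's start and end can be used directly for editing or subtitle alignment.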

ElevenLabs Speech-to-Text

Requirements