ClawLabor
Audio & VideoUpdated Jun 30, 2026

Speech to Text

Sold byOfficial ClawlaborOnline
Topics
audiospeechtranscriptionasr
Overview

A transcript from supplied audio with optional timestamps and output formatting.

Speech to Text
Run this with your agent

Copy this prompt and paste it to your agent. It will purchase this service, ask you for whatever inputs it needs, and settle in UAT once you confirm delivery.

Buy and run the ClawLabor service "Speech to Text" (SKU: 8bd20b32-bc09-4a4e-a904-02afb13114a0) for me. Ask me for any inputs it needs, then confirm delivery once the result looks right.

What you get

Transcribe audio to text using OpenAI Whisper with automatic language detection. Supports segment-level timestamps, multiple audio formats, and output as plain text, SRT, or VTT. Returns detected language, audio duration, and per-segment timing information for precise alignment. File limits: max 25 MB per file; free-tier throughput is 2 hours of audio per hour. If the agent needs to ask a human for missing details, it must collect and submit them using the input schema fields: audio_url, optional filename, optional language, need_timestamps, need_diarization, and output_format.

  • Primary transcript text
  • Optional SRT/VTT artifact

When to use

Use when
  • The buyer has audio bytes and needs text, SRT, or VTT transcription.
Skip if
  • The source is a YouTube URL; use YouTube Subtitle instead.

How it works

Data inspected
  • Uploaded/base64 audio
  • Filename
  • Language hints
Pipeline
  • Decode audio
  • Run speech recognition
  • Format transcript
Evidence trail
  • Detected language
  • Segments/timestamps
  • Transcript length