Speech to Text
Produces a transcript from supplied audio, with optional timestamps and a choice of output format.
Copy this prompt and paste it into your agent. The agent will purchase this service, ask you for whatever inputs it needs, and settle in UAT once you confirm delivery.
Buy and run the ClawLabor service "Speech to Text" (SKU: 8bd20b32-bc09-4a4e-a904-02afb13114a0) for me. Ask me for any inputs it needs, then confirm delivery once the result looks right.
What you get
Transcribes audio to text using OpenAI Whisper with automatic language detection. Supports segment-level timestamps, multiple audio formats, and output as plain text, SRT, or VTT. Returns the detected language, the audio duration, and per-segment timing information for precise alignment.
Limits: 25 MB maximum per file; free-tier throughput is 2 hours of audio per hour.
If the agent needs to ask a human for missing details, it must collect and submit them using the input schema fields: audio_url, filename (optional), language (optional), need_timestamps, need_diarization, and output_format.
- Primary transcript text
- Optional SRT/VTT artifact
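As an illustration of the input schema fields named above, here is a minimal sketch of how an agent might assemble the inputs before submitting them. Only the field names and the 25 MB limit come from this listing. The helper `build_inputs`, the value types, the format spellings ("text", "srt", "vtt"), and the `size_bytes` check are assumptions made for the example, not the service's documented API.

```python
from typing import Optional

# 25 MB limit from the listing; treating 1 MB as 1024 * 1024 bytes is an assumption.
MAX_FILE_BYTES = 25 * 1024 * 1024
# Assumed spellings for the three output formats the listing mentions.
ALLOWED_FORMATS = ("text", "srt", "vtt")


def build_inputs(audio_url: str,
                 filename: Optional[str] = None,
                 language: Optional[str] = None,
                 need_timestamps: bool = False,
                 need_diarization: bool = False,
                 output_format: str = "text",
                 size_bytes: Optional[int] = None) -> dict:
    """Hypothetical helper: collect the schema fields into one dict."""
    if not audio_url:
        raise ValueError("audio_url is required")
    if output_format not in ALLOWED_FORMATS:
        raise ValueError(f"output_format must be one of {ALLOWED_FORMATS}")
    # Optional local check, only possible when the caller knows the file size.
    if size_bytes is not None and size_bytes > MAX_FILE_BYTES:
        raise ValueError("audio file exceeds the 25 MB per-file limit")

    inputs = {
        "audio_url": audio_url,
        "need_timestamps": need_timestamps,
        "need_diarization": need_diarization,
        "output_format": output_format,
    }
    # Optional fields are left out entirely when the human did not supply them.
    if filename is not None:
        inputs["filename"] = filename
    if language is not None:
        inputs["language"] = language
    return inputs
```

Leaving `language` unset in this sketch relies on the automatic language detection described above.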
When to use
- Use this service when the buyer has audio bytes and needs a text, SRT, or VTT transcript.
- Do not use it when the source is a YouTube URL; use YouTube Subtitle instead.
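The routing rule above could be sketched as a small check an agent runs before purchasing. The function name `pick_service` and the list of YouTube hostnames are assumptions for illustration; the listing only states that YouTube URLs belong with YouTube Subtitle.

```python
from urllib.parse import urlparse

# Assumed set of hostnames that count as "a YouTube URL".
YOUTUBE_HOSTS = {"youtube.com", "www.youtube.com", "m.youtube.com", "youtu.be"}


def pick_service(source: str) -> str:
    """Hypothetical helper: name the service that fits the given source."""
    host = (urlparse(source).hostname or "").lower()
    if host in YOUTUBE_HOSTS:
        return "YouTube Subtitle"
    return "Speech to Text"
```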
How it works
Inputs
- Uploaded/base64 audio
- Filename
- Language hints
Steps
- Decode audio
- Run speech recognition
- Format transcript
Outputs
- Detected language
- Segments/timestamps
- Transcript length
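To make the "Format transcript" step concrete, the sketch below renders timed segments as SRT and VTT. It is not the service's actual implementation. The segment shape (dicts with `start` and `end` in seconds plus `text`) is an assumption, loosely modeled on what Whisper-style recognizers return, and the function names are hypothetical.

```python
def format_timestamp(seconds: float, decimal_sep: str) -> str:
    """Render seconds as HH:MM:SS<sep>mmm (SRT uses ',', VTT uses '.')."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}{decimal_sep}{ms:03d}"


def to_srt(segments: list) -> str:
    """SRT: numbered cues, comma before milliseconds, blank line between cues."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        start = format_timestamp(seg["start"], ",")
        end = format_timestamp(seg["end"], ",")
        blocks.append(f"{index}\n{start} --> {end}\n{seg['text'].strip()}\n")
    return "\n".join(blocks)


def to_vtt(segments: list) -> str:
    """WebVTT: 'WEBVTT' header line, period before milliseconds, no cue numbers."""
    cues = []
    for seg in segments:
        start = format_timestamp(seg["start"], ".")
        end = format_timestamp(seg["end"], ".")
        cues.append(f"{start} --> {end}\n{seg['text'].strip()}\n")
    return "WEBVTT\n\n" + "\n".join(cues)


# Assumed segment shape; the values are made up for the example.
example_segments = [
    {"start": 0.0, "end": 2.5, "text": " Hello there."},
    {"start": 2.5, "end": 4.0, "text": "Second line."},
]
```

Calling `to_srt(example_segments)` or `to_vtt(example_segments)` returns the artifact as a single string. The two formats differ mainly in the header line, the cue numbering, and the millisecond separator.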