What audio formats do you accept?

mp3, wav, m4a, mp4, mov, mpeg, amr, wma, mts, aac, ogg, flac.

How long can my recording be?

Up to 3 hours per file. Longer recordings should be split first; we're happy to help you script that.
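A splitting script along those lines could look like the minimal sketch below. It assumes ffmpeg is installed and that you already know the recording's duration in seconds. The function names and the `part_NN` output naming are illustrative, not part of Fily.

```python
import math

MAX_SECONDS = 3 * 60 * 60  # the 3-hour per-file limit


def plan_chunks(total_seconds, max_seconds=MAX_SECONDS):
    """Return (start, duration) pairs that cover the recording in equal parts."""
    if total_seconds <= 0:
        return []
    n = math.ceil(total_seconds / max_seconds)  # fewest parts that fit the limit
    size = total_seconds / n
    return [(i * size, size) for i in range(n)]


def ffmpeg_commands(src, total_seconds, ext="mp3"):
    """Build one ffmpeg argument list per chunk (stream copy, no re-encode)."""
    cmds = []
    for idx, (start, dur) in enumerate(plan_chunks(total_seconds), start=1):
        out = f"part_{idx:02d}.{ext}"  # hypothetical naming scheme
        cmds.append(["ffmpeg", "-ss", f"{start:.3f}", "-i", src,
                     "-t", f"{dur:.3f}", "-c", "copy", out])
    return cmds
```

Each argument list can be run with `subprocess.run(cmd, check=True)`. With stream copy the cut points may shift slightly, and a cut can land mid-sentence, so it is worth checking the chunk boundaries before upload.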

Do you do speaker diarization?

Yes. Diarized output is the default for multi-speaker recordings.

Can I get verbatim transcripts (with um, uh, etc.)?

Yes, opt-in at upload. Default is light cleanup.

What if the audio quality is poor?

Our ASR handles phone-quality audio well. For very noisy recordings, the QA report includes confidence scores so you know which segments to double-check.
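As a rough illustration of how such scores can be used, the sketch below picks out the segments to review. The segment fields (`start`, `text`, `confidence`) and the 0.80 threshold are assumptions for the example. The actual QA report layout is not described here.

```python
def segments_to_review(segments, threshold=0.80):
    """Return segments scoring below the threshold, ordered by start time."""
    flagged = [s for s in segments if s["confidence"] < threshold]
    return sorted(flagged, key=lambda s: s["start"])


# Hypothetical data, for illustration only.
example = [
    {"start": 12.0, "text": "we agreed on the", "confidence": 0.62},
    {"start": 0.0, "text": "Good morning everyone", "confidence": 0.97},
    {"start": 30.5, "text": "delivery schedule", "confidence": 0.79},
]
review = segments_to_review(example)
```

Here `review` holds the segments starting at 0:12 and 0:30, which are the ones worth re-listening to.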

Can Fily produce subtitles?

Yes — see /translate/srt for the subtitle-specific workflow.

Can I get the transcript in my own branded template?

Yes. Upload your DOCX template (logo, header, custom fields, signature blocks, footer) once — every future transcript for that client comes back inside it. Speaker labels, timestamps, and translations are placed where the template expects them. Sonix, Rev, and Otter don't support this.

Audio · Transcription + Translation

Transcribe and translate audio. From recorded calls to interview tapes.

Drop an .mp3, .wav, .m4a, or .mp4. Fily transcribes the audio, translates the transcript, and gives you back a clean bilingual document — speaker labels, timestamps, and verbatim quotes preserved.



What is Audio & Video?

Audio translation is a two-step pipeline: speech recognition (transcription) followed by text translation. Sources include recorded customer calls, focus groups, depositions, interviews, podcasts, training recordings, and meeting captures. The challenge is doing both steps well: bad transcription poisons the translation, and naive translation loses the speaker's voice.

Bring your own template

Transcripts that land in your client's own document.

Sonix, Rev, Otter — they all hand you their default DOCX layout. Then your PM spends an hour copying it into the client's branded template: their logo at top, their case-number field, their signature blocks, their disclaimer footer.

Upload your template once. Every transcript for that client comes back inside it, with speaker labels, timestamps, and translations placed exactly where the template expects them. Logos, headers, footers, custom fields — preserved. No re-formatting. You re-brand zero deliverables.

Why Audio & Video is tricky for AI translation

Speaker identification: multi-party recordings need diarization to attribute lines to speakers. Without it the transcript is a wall of text.

Acoustic conditions vary wildly: clean studio audio is easy; phone-quality recordings with background noise are hard.

Code-switching: many real recordings have speakers switching between languages mid-sentence. Models must handle this without dropping segments.

Domain vocabulary: medical, legal, and technical recordings contain terms that standard ASR mishandles.

Verbatim vs clean: legal needs verbatim ('um', 'uh', false starts kept); marketing summaries need cleaned-up transcripts.

Timestamp preservation: depositions, podcasts, and subtitle workflows need timecodes synced to the audio.

How Fily handles Audio & Video

State-of-the-art ASR + speaker diarization: production-grade transcription with speaker labels.

Per-speaker translation: each speaker's lines translated as a unit with surrounding context.

Verbatim mode: keep filler words, false starts, and pauses for legal use cases. Opt-in at upload.

Domain prompting: medical, legal, technical glossaries applied during ASR + translation.

Timestamp preservation: word-level timestamps available for subtitle workflows (see /translate/srt).

Bring your own template: upload your client-branded DOCX template (logo, header, footer, signature blocks, custom fields) and the transcript lands inside it. No copy-paste, no re-formatting per delivery.

Output formats: bilingual DOCX (default or your template), TXT, JSON with timestamps, or SRT subtitles.

Pipeline: audio_qa@1.0.0
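To show how timestamped segments map onto subtitles, here is a minimal sketch that renders segments as SRT cues. The input shape (`start` and `end` in seconds, plus `speaker` and `text`) is an assumption for illustration. It is not Fily's documented JSON schema.

```python
def srt_timestamp(seconds):
    """Format seconds as an SRT timecode, HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments):
    """Render segments as numbered SRT cues separated by blank lines."""
    cues = []
    for i, seg in enumerate(segments, start=1):
        cues.append(
            f"{i}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['speaker']}: {seg['text']}\n"
        )
    return "\n".join(cues)
```

For example, `srt_timestamp(3661.5)` gives `01:01:01,500`. Prefixing the speaker label to the cue text is one option among several; some subtitle workflows drop it.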

The Audio & Video workflow with Fily

Upload

Drop your .mp3 / .wav / .m4a / .mp4 (single or batch ZIP). Optional: glossary, TM, style guide.

Process

Fily runs the Audio & Video pipeline + 12 QA steps. Typical job: 10–20 minutes.

Download

Bilingual DOCX (default or your template), TXT, JSON with timestamps, or SRT, ready to deliver. QA report (HTML) attached.

Common upload: a 45-minute focus group recording with 6 speakers in Spanish, needed in English for the client. Fily delivers a bilingual DOCX with speaker labels, time markers every few minutes, original Spanish on the left, English on the right.

Beyond the standard pipeline

What we've built around Audio & Video

Edge cases clients brought us for this format — and the pipelines we shipped to solve them.

bring your own template

Transcripts in your client's branded document

Sonix, Rev and Otter hand back their default DOCX. Then a PM spends an hour re-formatting into the client's template: logo, case number, signature block, footer.

Upload the client template once. Every future transcript for that client lands inside it — speaker labels, timestamps and translations placed exactly where the template expects them. Zero re-branding.

csaa multilingual

Code-switching audio, transcribed per language

Recorded interviews where the speaker switches between English and Spanish mid-sentence broke single-language transcription — and naive chunking split words and duplicated lines.

A multilingual pipeline segments by language and speaker, transcribes each stretch in its own language, and keeps chronological order without over-segmenting — then translates to a single target with speaker labels intact.
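One way to picture the "without over-segmenting" part is the sketch below. It merges consecutive pieces that share a speaker and a language, so a stretch only breaks where the speaker or language actually changes. This is an illustrative simplification, not Fily's implementation. The piece fields and the 0.5-second gap tolerance are assumptions, and the language tags are presumed to come from an upstream language-identification step.

```python
def merge_stretches(pieces, max_gap=0.5):
    """Merge consecutive pieces with the same speaker and language.

    pieces: dicts with start, end, speaker, lang, text (times in seconds).
    Returns new dicts in chronological order; the input is not modified.
    """
    merged = []
    for p in sorted(pieces, key=lambda x: x["start"]):
        last = merged[-1] if merged else None
        if (last is not None
                and last["speaker"] == p["speaker"]
                and last["lang"] == p["lang"]
                and p["start"] - last["end"] <= max_gap):
            # Same speaker, same language, no real pause: extend the stretch.
            last["end"] = max(last["end"], p["end"])
            last["text"] = f"{last['text']} {p['text']}"
        else:
            merged.append(dict(p))
    return merged


# Hypothetical English/Spanish code-switching example.
pieces = [
    {"start": 0.0, "end": 1.2, "speaker": "A", "lang": "en", "text": "So I told him"},
    {"start": 1.2, "end": 2.0, "speaker": "A", "lang": "en", "text": "that"},
    {"start": 2.0, "end": 3.5, "speaker": "A", "lang": "es", "text": "no me importa"},
    {"start": 3.6, "end": 4.0, "speaker": "B", "lang": "en", "text": "Right."},
]
stretches = merge_stretches(pieces)
```

The four pieces collapse to three stretches: speaker A in English, speaker A in Spanish, then speaker B in English. Each stretch can then be transcribed in its own language and translated with its speaker label attached.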