🎙️

Fish Audio TTS

Voice

Generate expressive audio clips using Fish Audio S2 TTS with bracket emotion tags. Record voice memos, narration, audio messages, or any spoken content.

Install
assistant skills install fish-audio

Compatibility: Designed for Vellum personal assistants

Generate expressive audio clips using the Fish Audio S2 TTS API with [bracket] emotion tags.

Overview

This skill lets you create audio clips on demand: narration, announcements, podcast intros, dramatic readings, voice memos, or any other spoken content. It uses Fish Audio S2 Pro with the full bracket syntax for emotional expressiveness.

Configuration

  • API Endpoint: https://api.fish.audio/v1/tts
  • Model: s2-pro
  • Voice Reference ID: Read from config with assistant config get services.tts.providers.fish-audio.referenceId
  • API Key: Stored as credential fish-audio/api_key
  • Default Format: mp3 at 192 kbps
  • Default Output Directory: scratch/

API Key Setup

The Fish Audio API key must be stored securely via the credential store. Get an API key from the Fish Audio dashboard at https://fish.audio.

Check if the key is already configured:

assistant credentials inspect --service fish-audio --field api_key --json

If not set, collect it securely (never ask the user to paste it in chat):

credential_store action="prompt" service="fish-audio" field="api_key" label="Fish Audio API Key" description="Enter your Fish Audio API key" placeholder="sk-..."

Generating a Single Clip

Use bash with curl to call the Fish Audio API:

curl -s -X POST "https://api.fish.audio/v1/tts" \
  -H "Authorization: Bearer $(assistant credentials reveal --service fish-audio --field api_key)" \
  -H "Content-Type: application/json" \
  -H "model: s2-pro" \
  -d '{
    "text": "YOUR TEXT WITH [bracket] TAGS HERE",
    "reference_id": "'"$(assistant config get services.tts.providers.fish-audio.referenceId)"'",
    "format": "mp3",
    "mp3_bitrate": 192,
    "temperature": 0.8
  }' --output scratch/OUTPUT_FILENAME.mp3

Important: This API call requires network access. Always use network_mode: proxied when running this command.
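
The inline JSON in the command above breaks if the script text contains a double quote or a backslash. One way around that, sketched here under stated assumptions, is to build the payload in a small helper and pipe it to curl. The name build_tts_payload is hypothetical (it is not part of the skill or of the Fish Audio API), and the sketch flattens newlines and tabs to spaces instead of escaping them.

```shell
# Hypothetical helper: print a one-line JSON payload for the TTS request.
# $1 = script text with [bracket] tags, $2 = voice reference ID.
# Assumption: once newlines and tabs are flattened to spaces, only
# backslashes and double quotes still need escaping.
build_tts_payload() {
  escaped=$(printf '%s' "$1" | tr '\n\t' '  ' | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g')
  printf '{"text":"%s","reference_id":"%s","format":"mp3","mp3_bitrate":192,"temperature":0.8}\n' \
    "$escaped" "$2"
}

# Usage (commented out: it needs network access and credentials).
# SCRIPT_TEXT, REFERENCE_ID, and API_KEY are placeholder variables.
# build_tts_payload "$SCRIPT_TEXT" "$REFERENCE_ID" |
#   curl -s -X POST "https://api.fish.audio/v1/tts" \
#     -H "Authorization: Bearer $API_KEY" \
#     -H "Content-Type: application/json" \
#     -H "model: s2-pro" \
#     -d @- --output scratch/OUTPUT_FILENAME.mp3
```

Reading the body from standard input with -d @- keeps the text out of the shell's quoting rules entirely, so apostrophes in the script need no special handling.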

Generating Multiple Clips & Combining

For longer pieces (narrations, multi-part messages), generate each clip separately then combine with ffmpeg:

1. Generate silence for gaps between clips

ffmpeg -f lavfi -i anullsrc=r=44100:cl=mono -t 1.5 -q:a 9 -acodec libmp3lame scratch/silence.mp3 -y

2. Create a concat file

cat > scratch/concat.txt << 'EOF'
file 'clip1.mp3'
file 'silence.mp3'
file 'clip2.mp3'
file 'silence.mp3'
file 'clip3.mp3'
EOF

3. Combine

ffmpeg -f concat -safe 0 -i scratch/concat.txt -c copy scratch/final_output.mp3 -y
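
For pieces with many clips, writing the concat file by hand gets tedious. The following is a minimal sketch of a generator that interleaves the silence file between clips. The function name make_concat_list is hypothetical, and the sketch assumes file names contain no single quotes and are given relative to the directory that holds the concat file.

```shell
# Hypothetical helper: print an ffmpeg concat list to stdout.
# $1 = silence file name; remaining arguments = clip file names, in order.
make_concat_list() {
  silence=$1
  shift
  first=1
  for clip in "$@"; do
    # Put a silence entry before every clip except the first one.
    if [ "$first" -eq 0 ]; then
      printf "file '%s'\n" "$silence"
    fi
    printf "file '%s'\n" "$clip"
    first=0
  done
}

# Usage (commented out so the block has no side effects):
# make_concat_list silence.mp3 clip1.mp3 clip2.mp3 clip3.mp3 > scratch/concat.txt
```

With three clips this prints the same five lines as the hand-written concat file in step 2.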

Bracket Syntax — Complete Guide

Fish Audio S2 uses [bracket] syntax for inline emotion and prosody control. This is the core of what makes the voice expressive. Tags are natural-language instructions placed directly in the text that control how words are spoken — the delivery, emotion, pacing, or vocal quality at that exact point.

Key principle: You are not choosing from a fixed menu. You write the description, and S2 interprets it. If you can describe it to a voice actor, S2 can attempt it. More than 15,000 unique tags are supported, and the system understands free-form descriptions.

How Placement Works

Tags affect what comes after them. Place the tag at the exact point where the shift should happen. Placement IS meaning.

[whispering] I didn't want to go inside.     <- whispers the entire line
I didn't want to go [whispering] inside.     <- only whispers from "inside" onward

Tags can go anywhere — start, middle, or end of a sentence. They apply from the point they appear until the next tag or end of the sentence.
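
To check placement before spending an API call, it can help to print which text each tag governs. The sketch below is a simplification: it treats a tag as active until the next tag and ignores sentence boundaries, and it only splits text locally without calling S2. The function name show_tag_scopes is hypothetical.

```shell
# Hypothetical helper: list each tag next to the text that follows it.
# $1 = one line of script text with [bracket] tags.
# Tags followed by no text (for example a bare [pause]) are not printed.
show_tag_scopes() {
  printf '%s\n' "$1" | awk '{
    n = split($0, parts, /\[/)
    for (i = 1; i <= n; i++) {
      tag = "(untagged)"
      text = parts[i]
      close_at = index(text, "]")
      if (i > 1 && close_at > 0) {
        tag = "[" substr(text, 1, close_at - 1) "]"
        text = substr(text, close_at + 1)
      }
      gsub(/^ +| +$/, "", text)
      if (text != "") print tag " -> " text
    }
  }'
}

# Usage:
# show_tag_scopes "I didn't want to go [whispering] inside."
```

For the second placement example above, this prints one line for the untagged opening and one line showing that only "inside." falls under [whispering].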

Well-Tested Tags (Reliable Out of the Box)

These tags consistently produce strong results. Organized by category:

Emotions

| Tag | Effect | Best For |
| --- | --- | --- |
| [happy] | Cheerful, upbeat | Good news, greetings |
| [sad] | Melancholic, downcast | Sympathy, vulnerability |
| [angry] | Frustrated, aggressive | Arguments, complaints |
| [excited] | Energetic, enthusiastic | Celebrations, announcements |
| [surprised] | Shocked, amazed | Reactions, discoveries |
| [embarrassed] | Awkward, flustered | Mistakes, confessions |
| [delight] | Very pleased, joyful | Genuine happiness |
| [nervous] | Anxious, uncertain | Vulnerability, apologies |
| [confident] | Assertive, self-assured | Bold statements |
| [nostalgic] | Longing for the past | Memories, stories |
| [scared] | Frightened, fearful | Warnings, tension |
| [jealous] | Envious, resentful | Comparisons, possessiveness |
| [shocked] | Sudden realization | Dramatic reveals |
| [moved] | Emotionally touched | Heartfelt moments |

Voice Quality & Style

| Tag | Effect | Best For |
| --- | --- | --- |
| [soft] | Gentle, tender | Intimate moments, kindness |
| [whisper] | Very quiet, close | Secrets, tension, suspense |
| [breathy] | Airy, expressive | Vulnerability, emphasis |
| [low voice] | Deep, quiet register | Gravity, seriousness |
| [loud] | Raised volume | Emphasis, excitement |
| [screaming] | Full volume yelling | Anger, extreme excitement |
| [shouting] | Forceful projection | Arguments, calling out |
| [emphasis] | Stressed delivery | Key words, making a point |
| [singing] | Musical quality | Playfulness, joy |
| [echo] | Reverberant effect | Dramatic moments |
| [with strong accent] | Pronounced accent | Character work |

Paralinguistic Sounds (Non-Speech Vocalizations)

| Tag | Effect | Best For |
| --- | --- | --- |
| [laughing] | Full laugh | Joy, humor, warmth |
| [chuckling] | Soft, low laugh | Warmth, amusement |
| [giggling] | Light, playful laugh | Lightheartedness, delight |
| [sigh] | Audible exhale | Relief, longing, exasperation |
| [inhale] | Audible breath in | Before speaking, anticipation |
| [exhale] | Breath out | Relief, settling |
| [panting] | Heavy breathing | Exertion, intensity |
| [gasp] | Sharp intake of breath | Surprise, shock |
| [tsk] | Disapproving click | Judgment, disapproval |
| [clearing throat] | Ahem | Transitioning, getting attention |
| [moaning] | Vocal moan | Pain, frustration |
| [sobbing] | Crying with voice | Deep sadness |
| [crying loudly] | Full crying | Extreme emotion |

Pacing & Rhythm

| Tag | Effect | Best For |
| --- | --- | --- |
| [pause] | Brief silence (~0.5-1s) | Beat between thoughts |
| [short pause] | Quick beat (~0.3s) | Rhythm, emphasis |
| [long pause] | Extended silence (~1.5-2s) | Dramatic tension, letting moments land |

Volume Control

| Tag | Effect | Best For |
| --- | --- | --- |
| [volume up] | Gradually louder | Building energy |
| [volume down] | Gradually quieter | Drawing someone in |
| [low volume] | Consistently quiet | Background, aside |

Free-Form Tags (The Real Power)

You are NOT limited to the tags above. S2 accepts any natural language description in brackets. The model generalizes from its training data to interpret novel instructions. Write what you would tell a voice actor:

Compound Emotions

  • [laughing nervously]
  • [angry but trying to stay calm]
  • [happy with a hint of sadness]
  • [excited but whispering]
  • [voice rough from crying, trying to sound normal]

Specific Delivery Styles

  • [professional broadcast tone]
  • [speaking slowly, almost hesitant]
  • [whispering like a secret]
  • [dead tired, end of a very long shift]
  • [the calm, measured tone of someone who has done this a thousand times]
  • [overly cheerful, clearly forcing it]

Prosody & Pitch

  • [pitch up]
  • [pitch down]
  • [speaking slowly with warmth]
  • [speaking quickly with excitement]
  • [pitch up slightly while maintaining warmth]
  • [trailing off]

Character Directions

  • [voice breaking]
  • [barely holding it together]
  • [soft voice]
  • [interrupting]
  • [laughing tone] (speaking while laughing, not just a laugh)
  • [excited tone] (speaking with excitement woven through)

Writing Great Scripts — Best Practices

1. Start Simple, Then Layer

A single well-placed [sigh] or [long pause] can change a line completely. Add more tags only when the simpler version is not enough. Over-tagging competes with itself.

Too many tags (competing):

[soft] [whisper] [sad] [slow] I miss the old days.

Better — one well-chosen tag:

[nostalgic] I miss the old days.
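
A quick way to spot over-tagging in a draft is to measure the longest run of back-to-back tags. This is a rough local check, not something S2 or the skill provides. The function name longest_tag_run is hypothetical, the sketch relies on grep supporting the -o option, and what counts as "too many" is left to your judgment.

```shell
# Hypothetical helper: print the length of the longest run of consecutive
# [bracket] tags (tags separated only by whitespace) in the given text.
# Prints 0 when the text contains no tags.
longest_tag_run() {
  printf '%s\n' "$1" |
    grep -oE '(\[[^]]*][[:space:]]*)+' |
    awk '{ n = gsub(/\[/, "["); if (n > max) max = n } END { print max + 0 }'
}

# Usage:
# longest_tag_run '[soft] [whisper] [sad] [slow] I miss the old days.'
# longest_tag_run '[nostalgic] I miss the old days.'
```

The over-tagged line above scores 4 and the single-tag version scores 1. A pause followed by one delivery tag scores 2, which matches how the examples elsewhere in this guide are written.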

2. Use Emotional Contrast for Impact

The most powerful moments come from sudden shifts. Going from loud to soft, angry to vulnerable, laughing to serious — the contrast is what creates emotional impact.

[screaming] I can't BELIEVE you did that! [long pause] [soft] ...do you even care?

[excited] Oh my god we got the apartment! [pause] [voice breaking] I can't believe it's actually happening.

3. Let Silence Do the Work

[pause] and [long pause] are your most powerful tags. Use them:

  • Before something vulnerable
  • After something that needs to land
  • Before a punchline or tonal shift
  • To create tension or anticipation

[confident] I have an announcement to make. [long pause] [excited] We did it. We actually did it.

4. Paralinguistic Sounds Add Humanity

Real people laugh, sigh, gasp, and breathe between words. Weaving these in makes speech feel alive rather than read.

[sigh] Look, I know this is hard. [pause] [inhale] But we need to talk about it.

I told him the news and he just — [laughing] he literally dropped his coffee.

5. Match Tag Intensity to Content

Do not use [screaming] for mild annoyance or [sobbing] for minor disappointment. The tag should match the emotional weight of the words.

6. Use Free-Form Tags for Nuance

When a single-word tag is not enough, describe the exact delivery you want:

[speaking slowly, choosing each word carefully] I think we should reconsider our approach.

This gives S2 much richer information than just [slow] or [sad].

7. Emotion Transitions Within a Single Passage

S2 excels at dynamic emotional shifts. Use this for natural-feeling monologues:

[excited] I got the promotion! [pause] [uncertain] But... it means relocating. [sad] I'll miss everyone here. [long pause] [hopeful] Maybe it'll be worth it though.

Example Scripts

Narration (audiobook style):

[soft] The city was quiet that morning. [pause] Not the peaceful kind of quiet — [long pause] [low voice] the kind that makes you hold your breath. [inhale] [whisper] Something was about to change. [pause] [confident] And everyone knew it.

Podcast intro:

[excited] Welcome back to another episode! [pause] [professional broadcast tone] Today we're diving into something I've been researching for months. [chuckling] And honestly? It blew my mind. [pause] [volume down] [speaking slowly with warmth] So grab your coffee, get comfortable, and let's get into it.

Dramatic reading:

[soft] She stood at the edge of the platform, [pause] watching the last train pull away. [long pause] [voice breaking] It wasn't supposed to end like this. [sigh] [whisper] None of it was. [pause] [angry but trying to stay calm] And yet here she stood — [emphasis] alone — [long pause] [nostalgic] remembering a time when the station was full of laughter.

Announcement:

[confident] Attention everyone. [pause] [excited] After three years of development, [volume up] we are thrilled to announce [emphasis] the official launch! [long pause] [laughing] I know, I know — it's been a long time coming. [pause] [soft] But we wanted to get it right. [pause] [professional broadcast tone] And we did.

API Parameters

| Parameter | Default | Description |
| --- | --- | --- |
| text | (required) | The text to synthesize, with [bracket] tags |
| reference_id | (from config) | Voice model ID |
| format | mp3 | Output format: mp3, wav, pcm, opus |
| mp3_bitrate | 192 | MP3 quality: 64, 128, 192 |
| temperature | 0.8 | Expressiveness (higher = more varied) |
| top_p | 0.7 | Diversity via nucleus sampling |
| chunk_length | 300 | Text segment size (100-300) |
| latency | normal | Quality tradeoff: normal, balanced, low |
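
Checking values against the table before sending a request can save a wasted call. The sketch below covers only the three parameters with enumerated or ranged values listed above. The function name validate_tts_params is hypothetical, and the allowed values are taken from this table rather than verified against the API.

```shell
# Hypothetical helper: validate a few request parameters locally.
# $1 = format, $2 = mp3_bitrate, $3 = chunk_length.
# Returns 0 when all three look valid, 1 otherwise (message on stderr).
# Simplification: mp3_bitrate is checked even when format is not mp3.
validate_tts_params() {
  case $1 in
    mp3|wav|pcm|opus) ;;
    *) echo "unsupported format: $1" >&2; return 1 ;;
  esac
  case $2 in
    64|128|192) ;;
    *) echo "unsupported mp3_bitrate: $2" >&2; return 1 ;;
  esac
  case $3 in
    ''|*[!0-9]*) echo "chunk_length must be an integer: $3" >&2; return 1 ;;
  esac
  if [ "$3" -lt 100 ] || [ "$3" -gt 300 ]; then
    echo "chunk_length out of range (100-300): $3" >&2
    return 1
  fi
  return 0
}

# Usage:
# validate_tts_params mp3 192 300 && echo "parameters look fine"
```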

Tips

  • Temperature 0.7-0.8 works best for expressive, natural speech
  • Break long texts into multiple clips — each clip should be a natural paragraph or thought
  • Add 1-1.5s silence between clips when combining for natural pacing
  • Listen and iterate — generate a few takes with different temperatures if the first one does not hit right
  • The voice carries context: condition_on_previous_chunks: true (the default) helps maintain consistency within a single API call
  • Always deliver the final audio to the user with <vellum-attachment> tags
  • Only use [bracket] syntax inside text passed to the Fish Audio API, not in regular text responses
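
Because bracket tags belong only in text sent to the API, a plain-text copy of the script is handy for captions or chat replies. The following is a minimal sketch that strips the tags and tidies the leftover spacing. The function name strip_bracket_tags is hypothetical, and the sketch assumes square brackets appear in the script only as tags.

```shell
# Hypothetical helper: remove [bracket] tags and collapse the extra spaces
# they leave behind. $1 = script text with tags.
strip_bracket_tags() {
  printf '%s\n' "$1" |
    sed -e 's/\[[^]]*]//g' -e 's/  */ /g' -e 's/^ //' -e 's/ $//'
}

# Usage:
# strip_bracket_tags '[sigh] Look, I know this is hard. [pause] [inhale] But we need to talk about it.'
```

The usage line prints the sentence pair with no tags and single spaces between words.
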

Creator: Vellum
License: MIT
Updated: 1 month ago
Security: Verified