Generate expressive audio clips using Fish Audio S2 TTS with bracket emotion tags. Record voice memos, narration, audio messages, or any spoken content.
assistant skills install fish-audio
Generate expressive audio clips using the Fish Audio S2 TTS API with [bracket] emotion tags.
This skill creates audio clips on demand: narration, announcements, podcast intros, dramatic readings, voice memos, or any other spoken content. It uses Fish Audio S2 Pro with the full bracket syntax for emotional expressiveness.
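| Setting | Value |
|---|---|
| API endpoint | https://api.fish.audio/v1/tts |
| Model | s2-pro |
| Voice reference ID | assistant config get services.tts.providers.fish-audio.referenceId |
| Credential | fish-audio/api_key |
| Output format | mp3 at 192kbps |
| Output directory | scratch/ |

The Fish Audio API key must be stored securely via the credential store. Get an API key from the Fish Audio dashboard at https://fish.audio.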
Check if the key is already configured:
assistant credentials inspect --service fish-audio --field api_key --json
If not set, collect it securely (never ask the user to paste it in chat):
credential_store action="prompt" service="fish-audio" field="api_key" label="Fish Audio API Key" description="Enter your Fish Audio API key" placeholder="sk-..."
Use bash with curl to call the Fish Audio API:
curl -s -X POST "https://api.fish.audio/v1/tts" \
-H "Authorization: Bearer $(assistant credentials reveal --service fish-audio --field api_key)" \
-H "Content-Type: application/json" \
-H "model: s2-pro" \
-d '{
"text": "YOUR TEXT WITH [bracket] TAGS HERE",
"reference_id": "'"$(assistant config get services.tts.providers.fish-audio.referenceId)"'",
"format": "mp3",
"mp3_bitrate": 192,
"temperature": 0.8
}' --output scratch/OUTPUT_FILENAME.mp3
Important: This API call requires network access. Always use network_mode: proxied when running this command.
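Text that contains double quotes or backslashes will break the inline JSON in the curl command above. The following is a minimal sketch of one way to build the request body safely in plain shell. `json_escape` and `build_payload` are hypothetical helper names, not part of the skill, and the escaping only covers one-line text.

```shell
#!/usr/bin/env bash
# Hypothetical helper: escape backslashes and double quotes so a one-line
# script can be embedded in a JSON string. Newlines and other control
# characters are NOT handled; use a real JSON tool for those.
json_escape() {
  printf '%s' "$1" | sed 's/\\/\\\\/g; s/"/\\"/g'
}

# Hypothetical helper: print the request body used by the curl call.
# $1 = text with [bracket] tags, $2 = voice reference ID.
build_payload() {
  printf '{"text": "%s", "reference_id": "%s", "format": "mp3", "mp3_bitrate": 192, "temperature": 0.8}' \
    "$(json_escape "$1")" "$(json_escape "$2")"
}
```

The result can then be passed to curl with `-d "$(build_payload "$text" "$reference_id")"` in place of the inline JSON.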
For longer pieces (narrations, multi-part messages), generate each clip separately then combine with ffmpeg:
ffmpeg -f lavfi -i anullsrc=r=44100:cl=mono -t 1.5 -q:a 9 -acodec libmp3lame scratch/silence.mp3 -y
cat > scratch/concat.txt << 'EOF'
file 'clip1.mp3'
file 'silence.mp3'
file 'clip2.mp3'
file 'silence.mp3'
file 'clip3.mp3'
EOF
ffmpeg -f concat -safe 0 -i scratch/concat.txt -c copy scratch/final_output.mp3 -y
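With more than a few clips, the concat list can be generated instead of typed by hand. This is a minimal sketch that assumes the clips already exist and are passed in playback order. `make_concat_list` is a hypothetical helper name, not part of the skill.

```shell
#!/usr/bin/env bash
# Hypothetical helper: write an ffmpeg concat list that places the
# silence file between consecutive clips (not before the first or
# after the last).
# Usage: make_concat_list OUTPUT_LIST SILENCE_FILE CLIP...
make_concat_list() {
  local list="$1" silence="$2"
  shift 2
  : > "$list"                      # create or truncate the list file
  local first=1 clip
  for clip in "$@"; do
    if [ "$first" -eq 0 ]; then
      printf "file '%s'\n" "$silence" >> "$list"
    fi
    printf "file '%s'\n" "$clip" >> "$list"
    first=0
  done
}
```

For example, `make_concat_list scratch/concat.txt silence.mp3 clip1.mp3 clip2.mp3 clip3.mp3` writes the same five lines as the heredoc above. As in the heredoc, file names are resolved relative to the directory that holds the list.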
Fish Audio S2 uses [bracket] syntax for inline emotion and prosody control. This is the core of what makes the voice expressive. Tags are natural-language instructions placed directly in the text that control how words are spoken — the delivery, emotion, pacing, or vocal quality at that exact point.
Key principle: You are not choosing from a fixed menu. You write the description, and S2 interprets it. If you can describe it to a voice actor, S2 can attempt it. More than 15,000 unique tags are supported, and the system understands free-form descriptions.
Tags affect what comes after them. Place the tag at the exact point where the shift should happen. Placement IS meaning.
[whispering] I didn't want to go inside. <- whispers the entire line
I didn't want to go [whispering] inside. <- only whispers from "inside" onward
Tags can go anywhere — start, middle, or end of a sentence. They apply from the point they appear until the next tag or end of the sentence.
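Because placement carries the meaning, it can help to list the tags in a script, in order, before sending it. This is a small sketch using grep. `list_tags` is a hypothetical helper name, and it treats anything between square brackets as a tag.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print each [bracket] tag in the text on its own
# line, in the order it appears. Prints nothing if there are no tags.
list_tags() {
  printf '%s\n' "$1" | grep -o '\[[^]]*]' || true
}
```

Running `list_tags "I didn't want to go [whispering] inside."` prints the single tag `[whispering]`.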
These tags consistently produce strong results. They are organized by category:
| Tag | Effect | Best For |
|---|---|---|
| [happy] | Cheerful, upbeat | Good news, greetings |
| [sad] | Melancholic, downcast | Sympathy, vulnerability |
| [angry] | Frustrated, aggressive | Arguments, complaints |
| [excited] | Energetic, enthusiastic | Celebrations, announcements |
| [surprised] | Shocked, amazed | Reactions, discoveries |
| [embarrassed] | Awkward, flustered | Mistakes, confessions |
| [delight] | Very pleased, joyful | Genuine happiness |
| [nervous] | Anxious, uncertain | Vulnerability, apologies |
| [confident] | Assertive, self-assured | Bold statements |
| [nostalgic] | Longing for the past | Memories, stories |
| [scared] | Frightened, fearful | Warnings, tension |
| [jealous] | Envious, resentful | Comparisons, possessiveness |
| [shocked] | Sudden realization | Dramatic reveals |
| [moved] | Emotionally touched | Heartfelt moments |

| Tag | Effect | Best For |
|---|---|---|
| [soft] | Gentle, tender | Intimate moments, kindness |
| [whisper] | Very quiet, close | Secrets, tension, suspense |
| [breathy] | Airy, expressive | Vulnerability, emphasis |
| [low voice] | Deep, quiet register | Gravity, seriousness |
| [loud] | Raised volume | Emphasis, excitement |
| [screaming] | Full volume yelling | Anger, extreme excitement |
| [shouting] | Forceful projection | Arguments, calling out |
| [emphasis] | Stressed delivery | Key words, making a point |
| [singing] | Musical quality | Playfulness, joy |
| [echo] | Reverberant effect | Dramatic moments |
| [with strong accent] | Pronounced accent | Character work |

| Tag | Effect | Best For |
|---|---|---|
| [laughing] | Full laugh | Joy, humor, warmth |
| [chuckling] | Soft, low laugh | Warmth, amusement |
| [giggling] | Light, playful laugh | Lightheartedness, delight |
| [sigh] | Audible exhale | Relief, longing, exasperation |
| [inhale] | Audible breath in | Before speaking, anticipation |
| [exhale] | Breath out | Relief, settling |
| [panting] | Heavy breathing | Exertion, intensity |
| [gasp] | Sharp intake of breath | Surprise, shock |
| [tsk] | Disapproving click | Judgment, disapproval |
| [clearing throat] | Ahem | Transitioning, getting attention |
| [moaning] | Vocal moan | Pain, frustration |
| [sobbing] | Crying with voice | Deep sadness |
| [crying loudly] | Full crying | Extreme emotion |

| Tag | Effect | Best For |
|---|---|---|
| [pause] | Brief silence (~0.5-1s) | Beat between thoughts |
| [short pause] | Quick beat (~0.3s) | Rhythm, emphasis |
| [long pause] | Extended silence (~1.5-2s) | Dramatic tension, letting moments land |

| Tag | Effect | Best For |
|---|---|---|
| [volume up] | Gradually louder | Building energy |
| [volume down] | Gradually quieter | Drawing someone in |
| [low volume] | Consistently quiet | Background, aside |

You are NOT limited to the tags above. S2 accepts any natural language description in brackets. The model generalizes from its training data to interpret novel instructions. Write what you would tell a voice actor:
[laughing nervously]
[angry but trying to stay calm]
[happy with a hint of sadness]
[excited but whispering]
[voice rough from crying, trying to sound normal]
[professional broadcast tone]
[speaking slowly, almost hesitant]
[whispering like a secret]
[dead tired, end of a very long shift]
[the calm, measured tone of someone who has done this a thousand times]
[overly cheerful, clearly forcing it]
[pitch up]
[pitch down]
[speaking slowly with warmth]
[speaking quickly with excitement]
[pitch up slightly while maintaining warmth]
[trailing off]
[voice breaking]
[barely holding it together]
[soft voice]
[interrupting]
[laughing tone] (speaking while laughing, not just a laugh)
[excited tone] (speaking with excitement woven through)
A single well-placed [sigh] or [long pause] can change a line completely. Add more tags only when the simpler version is not enough. Over-tagging competes with itself.
Too many tags (competing):
[soft] [whisper] [sad] [slow] I miss the old days.
Better — one well-chosen tag:
[nostalgic] I miss the old days.
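A quick way to catch over-tagging before a request is sent is to look for long runs of adjacent tags. This is a rough sketch only. `find_tag_pileups` is a hypothetical helper name, and the threshold of four tags in a row is an arbitrary assumption chosen to match the example above, not a limit imposed by S2.

```shell
#!/usr/bin/env bash
# Hypothetical lint: print every run of four or more [bracket] tags that
# are separated only by spaces. Prints nothing if no such run exists.
# The threshold (4) is an assumption, not an S2 rule.
find_tag_pileups() {
  printf '%s\n' "$1" | grep -Eo '(\[[^]]*] *){4,}' || true
}
```

An empty result does not mean the tagging is good. It only means that no run of four or more adjacent tags was found.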
The most powerful moments come from sudden shifts. Going from loud to soft, angry to vulnerable, laughing to serious — the contrast is what creates emotional impact.
[screaming] I can't BELIEVE you did that! [long pause] [soft] ...do you even care?
[excited] Oh my god we got the apartment! [pause] [voice breaking] I can't believe it's actually happening.
[pause] and [long pause] are your most powerful tags. Use them:
[confident] I have an announcement to make. [long pause] [excited] We did it. We actually did it.
Real people laugh, sigh, gasp, and breathe between words. Weaving these in makes speech feel alive rather than read.
[sigh] Look, I know this is hard. [pause] [inhale] But we need to talk about it.
I told him the news and he just — [laughing] he literally dropped his coffee.
Do not use [screaming] for mild annoyance or [sobbing] for minor disappointment. The tag should match the emotional weight of the words.
When a single-word tag is not enough, describe the exact delivery you want:
[speaking slowly, choosing each word carefully] I think we should reconsider our approach.
This gives S2 much richer information than just [slow] or [sad].
S2 excels at dynamic emotional shifts. Use this for natural-feeling monologues:
[excited] I got the promotion! [pause] [uncertain] But... it means relocating. [sad] I'll miss everyone here. [long pause] [hopeful] Maybe it'll be worth it though.
Narration (audiobook style):
[soft] The city was quiet that morning. [pause] Not the peaceful kind of quiet — [long pause] [low voice] the kind that makes you hold your breath. [inhale] [whisper] Something was about to change. [pause] [confident] And everyone knew it.
Podcast intro:
[excited] Welcome back to another episode! [pause] [professional broadcast tone] Today we're diving into something I've been researching for months. [chuckling] And honestly? It blew my mind. [pause] [volume down] [speaking slowly with warmth] So grab your coffee, get comfortable, and let's get into it.
Dramatic reading:
[soft] She stood at the edge of the platform, [pause] watching the last train pull away. [long pause] [voice breaking] It wasn't supposed to end like this. [sigh] [whisper] None of it was. [pause] [angry but trying to stay calm] And yet here she stood — [emphasis] alone — [long pause] [nostalgic] remembering a time when the station was full of laughter.
Announcement:
[confident] Attention everyone. [pause] [excited] After three years of development, [volume up] we are thrilled to announce [emphasis] the official launch! [long pause] [laughing] I know, I know — it's been a long time coming. [pause] [soft] But we wanted to get it right. [pause] [professional broadcast tone] And we did.
| Parameter | Default | Description |
|---|---|---|
| text | (required) | The text to synthesize, with [bracket] tags |
| reference_id | (from config) | Voice model ID |
| format | mp3 | Output format: mp3, wav, pcm, opus |
| mp3_bitrate | 192 | MP3 quality: 64, 128, 192 |
| temperature | 0.8 | Expressiveness (higher = more varied) |
| top_p | 0.7 | Diversity via nucleus sampling |
| chunk_length | 300 | Text segment size (100-300) |
| latency | normal | Quality tradeoff: normal, balanced, low |

condition_on_previous_chunks: true (default) helps maintain consistency within a single API call.
<vellum-attachment> tags
Use [bracket] syntax only inside text passed to the Fish Audio API, not in regular text responses.