Voice design

Create custom voices from text descriptions for use with Qwen TTS models.

Voice design generates custom voices from text descriptions. After creating a voice, use the returned voice name with Qwen TTS or Realtime streaming TTS.
The target_model in voice design must match the model in synthesis. Mismatched models cause failures.

How it works

  1. Write a voice description (voice_prompt) and preview text (preview_text).
  2. Send a Create voice request with your target_model.
  3. The API returns a voice name and Base64-encoded preview audio. Decode the Base64 string to get the audio file (WAV format).
  4. Listen to the preview. If satisfied, use the voice name for synthesis. Otherwise, create a new voice.
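
Step 3 can be sketched in Python as follows. This page does not name the response field that carries the preview audio, so the function takes the Base64 string directly; the function name and default file name are illustrative only.

```python
import base64
from pathlib import Path


def save_preview_audio(audio_b64: str, path: str = "preview.wav") -> int:
    """Decode a Base64 string and write the raw bytes to `path`.

    Returns the number of bytes written. The caller is responsible for
    pulling the Base64 string out of the API response.
    """
    audio_bytes = base64.b64decode(audio_b64)
    Path(path).write_bytes(audio_bytes)
    return len(audio_bytes)
```

After saving, open the file in any audio player to judge whether the voice matches the description.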

Quick start

Prerequisites

  1. Get an API key and set the DASHSCOPE_API_KEY environment variable.

Endpoint

All voice design operations use a single endpoint:
POST https://dashscope-intl.aliyuncs.com/api/v1/services/audio/tts/customization

Create a voice

curl -X POST https://dashscope-intl.aliyuncs.com/api/v1/services/audio/tts/customization \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H "Content-Type: application/json" \
-d '{
  "model": "qwen-voice-design",
  "input": {
    "action": "create",
    "target_model": "qwen3-tts-vd-2026-01-26",
    "voice_prompt": "A calm young female voice with clear articulation and gentle tone, suitable for audiobook narration.",
    "preview_text": "Hello, welcome to our program. Today we will explore the wonders of nature.",
    "preferred_name": "narrator",
    "language": "en"
  },
  "parameters": {
    "sample_rate": 24000,
    "response_format": "wav"
  }
}'
The response includes the voice name and Base64-encoded preview audio. Decode the Base64 string to get the WAV file and listen to the preview.
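
A rough Python equivalent of the cURL call above, using only the standard library. The payload mirrors the cURL example; the helper names build_create_voice_request and send_request are illustrative and not part of any SDK.

```python
import json
import urllib.request

ENDPOINT = "https://dashscope-intl.aliyuncs.com/api/v1/services/audio/tts/customization"


def build_create_voice_request(api_key, target_model, voice_prompt,
                               preview_text, preferred_name, language):
    """Build (but do not send) the Create voice request shown in the cURL example."""
    payload = {
        "model": "qwen-voice-design",  # fixed value for voice design operations
        "input": {
            "action": "create",
            "target_model": target_model,
            "voice_prompt": voice_prompt,
            "preview_text": preview_text,
            "preferred_name": preferred_name,
            "language": language,
        },
        "parameters": {"sample_rate": 24000, "response_format": "wav"},
    }
    return urllib.request.Request(
        ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send_request(request: urllib.request.Request) -> dict:
    """Send the request and parse the JSON response body."""
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.load(response)
```

To run it, read the key from os.environ["DASHSCOPE_API_KEY"], build the request, and pass it to send_request. Inspect the returned dictionary to locate the voice name and the Base64 preview audio.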

Use the voice for synthesis

Use the returned voice name with the matching synthesis model. The model in synthesis must match the target_model used during voice creation.
curl -X POST 'https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation' \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
  "model": "qwen3-tts-vd-2026-01-26",
  "input": {
    "text": "Welcome to our audiobook. Let me take you on a journey through the wonders of nature.",
    "voice": "VOICE_NAME"
  }
}'
Replace VOICE_NAME with the voice name returned from the create step. The response contains an output.audio.url field with a download link (valid for 24 hours).
For real-time streaming synthesis with custom voices, see Realtime streaming TTS. For complete API parameters and more operations (list, query, delete), see the Voice design API reference.
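
The synthesis call and the output.audio.url lookup can be sketched in Python as below. The request body mirrors the cURL example; the helper names are illustrative, and any response fields other than output.audio.url are not assumed.

```python
import json
import urllib.request

SYNTHESIS_URL = (
    "https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/"
    "multimodal-generation/generation"
)


def build_synthesis_request(api_key: str, model: str, voice: str, text: str):
    """Build (but do not send) the synthesis request from the cURL example.

    `model` must be the same value that was passed as target_model
    when the voice was created.
    """
    payload = {"model": model, "input": {"text": text, "voice": voice}}
    return urllib.request.Request(
        SYNTHESIS_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def extract_audio_url(response: dict) -> str:
    """Return output.audio.url from a parsed synthesis response."""
    try:
        return response["output"]["audio"]["url"]
    except (KeyError, TypeError) as exc:
        raise ValueError("response has no output.audio.url field") from exc
```

Because the link expires after 24 hours, download the audio promptly rather than storing the URL.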

Supported models

Voice design uses two models: a design model and a target synthesis model.
Model | Value | Use with
Voice design model | qwen-voice-design | All voice design operations (fixed value)
Real-time synthesis target | qwen3-tts-vd-realtime-2026-01-15 | Realtime streaming TTS
Real-time synthesis target (earlier version) | qwen3-tts-vd-realtime-2025-12-16 | Realtime streaming TTS
Non-real-time synthesis target | qwen3-tts-vd-2026-01-26 | Qwen TTS
Voice design models (qwen3-tts-vd-*) only support custom-designed voices. They do not support system voices (Chelsie, Serena, Ethan, Cherry).
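
A client-side guard for these rules might look like the sketch below. The model names are copied from the table above and may go out of date, the four system voices are only those named on this page, and check_synthesis_call is a hypothetical helper rather than an API function.

```python
# Target models listed on this page; update when new versions are released.
TARGET_MODELS = {
    "qwen3-tts-vd-realtime-2026-01-15",
    "qwen3-tts-vd-realtime-2025-12-16",
    "qwen3-tts-vd-2026-01-26",
}

# System voices named on this page; the complete list may be longer.
SYSTEM_VOICES = {"Chelsie", "Serena", "Ethan", "Cherry"}


def check_synthesis_call(target_model: str, synthesis_model: str, voice: str) -> None:
    """Raise ValueError if a synthesis call would break the rules above."""
    if target_model not in TARGET_MODELS:
        raise ValueError(f"unknown target model: {target_model}")
    if synthesis_model != target_model:
        raise ValueError(
            f"synthesis model {synthesis_model!r} does not match "
            f"target_model {target_model!r} used at creation"
        )
    if voice in SYSTEM_VOICES:
        raise ValueError(f"{voice!r} is a system voice; use a designed voice")
```

Running this check before each synthesis request catches a mismatch locally instead of waiting for the API to reject the call.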

Supported languages

Code | Language
zh | Chinese
en | English
de | German
it | Italian
pt | Portuguese
es | Spanish
ja | Japanese
ko | Korean
fr | French
ru | Russian
voice_prompt supports Chinese and English only. The language parameter must match the preview_text language.
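
As a small illustration, the language code can be checked before sending a request. The codes come from the table above; validate_language is a hypothetical helper, and it does not verify that preview_text is actually written in that language.

```python
# Language codes listed on this page.
SUPPORTED_LANGUAGES = {
    "zh": "Chinese", "en": "English", "de": "German", "it": "Italian",
    "pt": "Portuguese", "es": "Spanish", "ja": "Japanese", "ko": "Korean",
    "fr": "French", "ru": "Russian",
}


def validate_language(code: str) -> str:
    """Return the normalized language code, or raise ValueError if unsupported."""
    normalized = code.strip().lower()
    if normalized not in SUPPORTED_LANGUAGES:
        supported = ", ".join(sorted(SUPPORTED_LANGUAGES))
        raise ValueError(f"unsupported language {code!r}; expected one of: {supported}")
    return normalized
```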

Write effective voice descriptions

A voice description (voice_prompt) tells the model what voice to generate. Combine gender, age, tone, and use case to define a distinctive voice.

Constraints

  • Max length: 2,048 characters.
  • Languages: Chinese and English only.

Description dimensions

Dimension | Examples
Gender | Male, female, neutral
Age | Child (5-12), teenager (13-18), young adult (19-35), middle-aged (36-55), elderly (55+)
Pitch | High, medium, low, high-pitched, low-pitched
Pace | Fast, medium, slow, fast-paced, slow-paced
Emotion | Cheerful, calm, gentle, serious, lively, composed, soothing
Characteristics | Magnetic, crisp, hoarse, mellow, sweet, rich, powerful
Use case | News broadcast, ad voice-over, audiobook, animation character, voice assistant, documentary narration
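
One way to combine these dimensions programmatically is sketched below. The sentence template and the compose_voice_prompt helper are illustrative choices, not a format the API requires; the only rule enforced is the 2,048-character limit stated under Constraints.

```python
MAX_PROMPT_CHARS = 2048  # limit stated under Constraints


def compose_voice_prompt(gender, age, qualities=(), use_case=None):
    """Assemble an English voice description from individual dimensions.

    `qualities` is a sequence of short phrases covering pitch, pace,
    emotion, or characteristics, for example "a slow pace".
    """
    prompt = f"A {age} {gender} voice"
    qualities = list(qualities)
    if len(qualities) == 1:
        prompt += f" with {qualities[0]}"
    elif qualities:
        prompt += " with " + ", ".join(qualities[:-1]) + f" and {qualities[-1]}"
    if use_case:
        prompt += f", suitable for {use_case}"
    prompt += "."
    if len(prompt) > MAX_PROMPT_CHARS:
        raise ValueError(f"voice_prompt exceeds {MAX_PROMPT_CHARS} characters")
    return prompt
```

For example, compose_voice_prompt("male", "middle-aged", ["a slow pace", "a low pitch"], "documentary narration") yields "A middle-aged male voice with a slow pace and a low pitch, suitable for documentary narration."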

Tips

  1. Be specific. Use concrete qualities like "deep," "crisp," or "fast-paced." Avoid vague terms like "nice" or "normal."
  2. Use multiple dimensions. Combine gender, age, emotion, and use case. "Female voice" alone is too broad.
  3. Be objective. Focus on physical and perceptual features. Write "high-pitched and energetic" instead of "my favorite voice."
  4. Be original. Describe voice qualities directly. Celebrity imitation is not supported and involves copyright risks.
  5. Be concise. Every word should serve a purpose. Avoid synonyms and meaningless intensifiers.

Examples

Good descriptions:
  • "A young, lively female voice with a fast pace and noticeable upward inflection, suitable for fashion product introductions."
  • "A calm, middle-aged male voice with a slow pace and deep, magnetic tone, suitable for news or documentary narration."
  • "A cute child's voice, around 8 years old, with a slightly childish tone, suitable for animation character voice-overs."
Ineffective descriptions:
Description | Issue | Improvement
"A nice voice" | Too vague | "A young female voice with a clear vocal line and gentle tone."
"A voice like a certain celebrity" | Celebrity imitation not supported | "A mature, magnetic male voice with a calm pace."
"A very, very, very nice female voice" | Redundant repetition | "A female voice, 20-24 years old, with a light tone and sweet quality."

Voice quota and cleanup

  • Account limit: 1,000 voices per account. Check the total_count field in the List voices response.
  • Automatic cleanup: Voices that have not been used for synthesis within the past year are deleted automatically.
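
A quota check based on total_count could be as simple as the sketch below. Where total_count sits inside the List voices response is not described on this page, so the helper takes the number directly; remaining_voice_slots is a hypothetical name.

```python
ACCOUNT_VOICE_LIMIT = 1000  # per-account limit stated above


def remaining_voice_slots(total_count: int) -> int:
    """Return how many more voices the account can hold."""
    if total_count < 0:
        raise ValueError("total_count cannot be negative")
    return max(ACCOUNT_VOICE_LIMIT - total_count, 0)
```

When the result approaches zero, delete voices that are no longer needed before creating new ones.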

Error codes

If a call fails, see Error messages. Common voice design errors:
HTTP status | Error code | Cause | Resolution
400 | BadRequest.VoiceNotFound | The specified voice does not exist (in voice design or synthesis operations). | Verify the voice name with List voices or Query a voice. If the voice does not exist, create a new voice with Create a voice.
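
The sketch below shows one way to detect this error in client code. It assumes the error response is a JSON object with the error code in a top-level "code" field; this page does not show the error body, so confirm the shape against Error messages before relying on it.

```python
import json


def is_voice_not_found(http_status: int, body) -> bool:
    """Return True if the response looks like BadRequest.VoiceNotFound.

    Assumes a JSON error body with a top-level "code" field (unverified).
    `body` may be str or bytes.
    """
    if http_status != 400:
        return False
    try:
        payload = json.loads(body)
    except ValueError:  # covers invalid JSON and undecodable bytes
        return False
    return isinstance(payload, dict) and payload.get("code") == "BadRequest.VoiceNotFound"
```

A caller that gets True can fall back to listing voices or creating a new one, as the Resolution column suggests.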

Next steps