Voice design | Qwen Cloud

Voice design generates custom voices from text descriptions. After creating a voice, use the returned voice name with Qwen TTS or Realtime streaming TTS.

The target_model you specify when creating a voice must match the model you use for synthesis. If the two differ, the synthesis request fails.

How it works

Write a voice description (voice_prompt) and preview text (preview_text).
Send a Create voice request with your target_model.
The API returns a voice name and Base64-encoded preview audio. Decode the Base64 string to get the audio file (WAV format).
Listen to the preview. If satisfied, use the voice name for synthesis. Otherwise, create a new voice.
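The decode step above can be sketched in Python as follows. This is a minimal sketch, assuming the response has the shape used in the Quick start example (output.voice and output.preview_audio.data). The helper name save_preview is hypothetical and is not part of the API.

```python
import base64
from pathlib import Path


def save_preview(result: dict, directory: str = ".") -> Path:
    """Decode the Base64 preview audio from a Create voice response and save it.

    Assumes the response shape shown in the Quick start example:
    result["output"]["voice"] and result["output"]["preview_audio"]["data"].
    """
    output = result["output"]
    voice_name = output["voice"]
    # The preview arrives as a Base64 string; decoding yields the raw WAV bytes.
    audio_bytes = base64.b64decode(output["preview_audio"]["data"])
    path = Path(directory) / f"{voice_name}_preview.wav"
    path.write_bytes(audio_bytes)
    return path
```

Call save_preview(response.json()) after a successful Create voice request, then play the saved file to decide whether to keep the voice.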

Quick start

Prerequisites

Get an API key and set the DASHSCOPE_API_KEY environment variable.

Endpoint

All voice design operations use a single endpoint:

POST https://dashscope-intl.aliyuncs.com/api/v1/services/audio/tts/customization

Create a voice

cURL

curl -X POST https://dashscope-intl.aliyuncs.com/api/v1/services/audio/tts/customization \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H "Content-Type: application/json" \
-d '{
  "model": "qwen-voice-design",
  "input": {
    "action": "create",
    "target_model": "qwen3-tts-vd-2026-01-26",
    "voice_prompt": "A calm young female voice with clear articulation and gentle tone, suitable for audiobook narration.",
    "preview_text": "Hello, welcome to our program. Today we will explore the wonders of nature.",
    "preferred_name": "narrator",
    "language": "en"
  },
  "parameters": {
    "sample_rate": 24000,
    "response_format": "wav"
  }
}'

Python

import requests
import base64
import os

response = requests.post(
  "https://dashscope-intl.aliyuncs.com/api/v1/services/audio/tts/customization",
  headers={
    "Authorization": f"Bearer {os.getenv('DASHSCOPE_API_KEY')}",
    "Content-Type": "application/json"
  },
  json={
    "model": "qwen-voice-design",
    "input": {
      "action": "create",
      "target_model": "qwen3-tts-vd-2026-01-26",
      "voice_prompt": "A calm young female voice with clear articulation "
                      "and gentle tone, suitable for audiobook narration.",
      "preview_text": "Hello, welcome to our program. "
                      "Today we will explore the wonders of nature.",
      "preferred_name": "narrator",
      "language": "en"
    },
    "parameters": {
      "sample_rate": 24000,
      "response_format": "wav"
    }
  },
  timeout=60
)

response.raise_for_status()  # stop on an HTTP error instead of failing later on a missing key
result = response.json()
voice_name = result["output"]["voice"]
print(f"Voice created: {voice_name}")

# Decode and save preview audio
audio_bytes = base64.b64decode(result["output"]["preview_audio"]["data"])
with open(f"{voice_name}_preview.wav", "wb") as f:
  f.write(audio_bytes)

Java

import com.google.gson.Gson;
import com.google.gson.JsonObject;

import java.io.*;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Base64;

public class VoiceDesign {
  public static void main(String[] args) {
    String apiKey = System.getenv("DASHSCOPE_API_KEY");
    String apiUrl = "https://dashscope-intl.aliyuncs.com/api/v1/services/audio/tts/customization";

    try {
      String body = "{"
        + "\"model\": \"qwen-voice-design\","
        + "\"input\": {"
        +   "\"action\": \"create\","
        +   "\"target_model\": \"qwen3-tts-vd-2026-01-26\","
        +   "\"voice_prompt\": \"A calm young female voice with clear articulation "
        +     "and gentle tone, suitable for audiobook narration.\","
        +   "\"preview_text\": \"Hello, welcome to our program. "
        +     "Today we will explore the wonders of nature.\","
        +   "\"preferred_name\": \"narrator\","
        +   "\"language\": \"en\""
        + "},"
        + "\"parameters\": {"
        +   "\"sample_rate\": 24000,"
        +   "\"response_format\": \"wav\""
        + "}"
        + "}";

      HttpURLConnection conn = (HttpURLConnection) new URL(apiUrl).openConnection();
      conn.setRequestMethod("POST");
      conn.setRequestProperty("Authorization", "Bearer " + apiKey);
      conn.setRequestProperty("Content-Type", "application/json");
      conn.setDoOutput(true);

      try (OutputStream os = conn.getOutputStream()) {
        os.write(body.getBytes("UTF-8"));
      }

      int status = conn.getResponseCode();
      InputStream is = (status >= 200 && status < 300)
        ? conn.getInputStream()
        : conn.getErrorStream();

      StringBuilder sb = new StringBuilder();
      try (BufferedReader br = new BufferedReader(new InputStreamReader(is, "UTF-8"))) {
        String line;
        while ((line = br.readLine()) != null) {
          sb.append(line);
        }
      }

      if (status == 200) {
        Gson gson = new Gson();
        JsonObject result = gson.fromJson(sb.toString(), JsonObject.class);
        JsonObject output = result.getAsJsonObject("output");
        String voiceName = output.get("voice").getAsString();
        System.out.println("Voice created: " + voiceName);

        // Decode and save preview audio
        String audioData = output.getAsJsonObject("preview_audio").get("data").getAsString();
        byte[] audioBytes = Base64.getDecoder().decode(audioData);
        try (FileOutputStream fos = new FileOutputStream(voiceName + "_preview.wav")) {
          fos.write(audioBytes);
        }
        System.out.println("Preview saved: " + voiceName + "_preview.wav");
      } else {
        System.err.println("Error " + status + ": " + sb.toString());
      }

    } catch (Exception e) {
      e.printStackTrace();
    }
  }
}

The response includes the voice name and Base64-encoded preview audio. Decode the Base64 string to get the WAV file and listen to the preview.

Use the voice for synthesis

Pass the returned voice name to the synthesis API, and set model to the same value you used for target_model when you created the voice.

cURL

curl -X POST 'https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation' \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
  "model": "qwen3-tts-vd-2026-01-26",
  "input": {
    "text": "Welcome to our audiobook. Let me take you on a journey through the wonders of nature.",
    "voice": "VOICE_NAME"
  }
}'

Replace VOICE_NAME with the voice name returned from the create step. The response contains an output.audio.url field with a download link (valid for 24 hours).

Python

import requests
import os

voice_name = "VOICE_NAME"  # <-- from the create step

response = requests.post(
  "https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation",
  headers={
    "Authorization": f"Bearer {os.getenv('DASHSCOPE_API_KEY')}",
    "Content-Type": "application/json"
  },
  json={
    "model": "qwen3-tts-vd-2026-01-26",
    "input": {
      "text": "Welcome to our audiobook. "
              "Let me take you on a journey through the wonders of nature.",
      "voice": voice_name
    }
  },
  timeout=60
)

response.raise_for_status()  # stop on an HTTP error instead of failing later on a missing key
result = response.json()
audio_url = result["output"]["audio"]["url"]
print(f"Audio URL: {audio_url}")

Java

import com.google.gson.Gson;
import com.google.gson.JsonObject;

import java.io.*;
import java.net.HttpURLConnection;
import java.net.URL;

public class VoiceDesignSynthesize {
  public static void main(String[] args) {
    String apiKey = System.getenv("DASHSCOPE_API_KEY");
    String voiceName = "VOICE_NAME"; // <-- from the create step

    try {
      String body = "{"
        + "\"model\": \"qwen3-tts-vd-2026-01-26\","
        + "\"input\": {"
        +   "\"text\": \"Welcome to our audiobook. "
        +     "Let me take you on a journey through the wonders of nature.\","
        +   "\"voice\": \"" + voiceName + "\""
        + "}"
        + "}";

      HttpURLConnection conn = (HttpURLConnection) new URL(
        "https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation"
      ).openConnection();
      conn.setRequestMethod("POST");
      conn.setRequestProperty("Authorization", "Bearer " + apiKey);
      conn.setRequestProperty("Content-Type", "application/json");
      conn.setDoOutput(true);

      try (OutputStream os = conn.getOutputStream()) {
        os.write(body.getBytes("UTF-8"));
      }

      int status = conn.getResponseCode();
      InputStream is = (status >= 200 && status < 300)
        ? conn.getInputStream()
        : conn.getErrorStream();

      StringBuilder sb = new StringBuilder();
      try (BufferedReader br = new BufferedReader(new InputStreamReader(is, "UTF-8"))) {
        String line;
        while ((line = br.readLine()) != null) {
          sb.append(line);
        }
      }

      if (status == 200) {
        Gson gson = new Gson();
        JsonObject result = gson.fromJson(sb.toString(), JsonObject.class);
        String audioUrl = result.getAsJsonObject("output")
          .getAsJsonObject("audio").get("url").getAsString();
        System.out.println("Audio URL: " + audioUrl);

        // Download the audio file
        try (InputStream in = new URL(audioUrl).openStream();
             FileOutputStream out = new FileOutputStream("synthesis_output.wav")) {
          byte[] buffer = new byte[4096];
          int bytesRead;
          while ((bytesRead = in.read(buffer)) != -1) {
            out.write(buffer, 0, bytesRead);
          }
        }
        System.out.println("Audio saved: synthesis_output.wav");
      } else {
        System.err.println("Error " + status + ": " + sb.toString());
      }

    } catch (Exception e) {
      e.printStackTrace();
    }
  }
}

For real-time streaming synthesis with custom voices, see Realtime streaming TTS. For complete API parameters and more operations (list, query, delete), see the Voice design API reference.
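The Python synthesis example prints the audio URL but does not download the file. The following is a minimal sketch for saving it, using only the Python standard library. The function name download_audio and the default output path are hypothetical.

```python
import shutil
import urllib.request


def download_audio(audio_url: str, path: str = "synthesis_output.wav") -> str:
    """Stream the audio at audio_url to a local file and return the file path."""
    with urllib.request.urlopen(audio_url, timeout=60) as response, open(path, "wb") as out:
        # Copy in chunks so the whole file is never held in memory.
        shutil.copyfileobj(response, out)
    return path
```

Because the download link is valid for 24 hours, save the file soon after synthesis rather than storing the URL for later use.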

Supported models

Voice design uses two models: a design model and a target synthesis model.

Model	Value	Use with
Voice design model	`qwen-voice-design`	All voice design operations (fixed value)
Real-time synthesis target	`qwen3-tts-vd-realtime-2026-01-15`	Realtime streaming TTS
Real-time synthesis target (earlier version)	`qwen3-tts-vd-realtime-2025-12-16`	Realtime streaming TTS
Non-real-time synthesis target	`qwen3-tts-vd-2026-01-26`	Qwen TTS

The target synthesis models (qwen3-tts-vd-*) support only custom-designed voices. They do not support system voices (Chelsie, Serena, Ethan, Cherry).
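A client can check these rules before it sends a synthesis request. The following is a minimal sketch, assuming the model values in the table above are the only valid targets and that the four system voices named here are the ones to reject. The function name check_synthesis_request is hypothetical.

```python
# Target synthesis models from the table above, mapped to the service they are used with.
TARGET_MODELS = {
    "qwen3-tts-vd-realtime-2026-01-15": "Realtime streaming TTS",
    "qwen3-tts-vd-realtime-2025-12-16": "Realtime streaming TTS",
    "qwen3-tts-vd-2026-01-26": "Qwen TTS",
}

# System voices named in this section; the list may not be exhaustive.
SYSTEM_VOICES = {"Chelsie", "Serena", "Ethan", "Cherry"}


def check_synthesis_request(target_model: str, synthesis_model: str, voice: str) -> None:
    """Raise ValueError if a synthesis request would break the rules in this section."""
    if target_model not in TARGET_MODELS:
        raise ValueError(f"Unknown target_model: {target_model!r}")
    if synthesis_model != target_model:
        raise ValueError(
            f"Voice was created for {target_model!r} but synthesis uses {synthesis_model!r}"
        )
    if voice in SYSTEM_VOICES:
        raise ValueError(f"{voice!r} is a system voice; qwen3-tts-vd-* models need a custom voice")
```

The check is purely local. It catches a mismatch before the request is sent, but it does not confirm that the voice exists on the server.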

Supported languages

Code	Language
`zh`	Chinese
`en`	English
`de`	German
`it`	Italian
`pt`	Portuguese
`es`	Spanish
`ja`	Japanese
`ko`	Korean
`fr`	French
`ru`	Russian

voice_prompt supports Chinese and English only. The language parameter must match the preview_text language.
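The language codes above can be validated on the client before a request is sent. The following is a minimal sketch that builds the input object used in the Quick start example. The function name build_create_input is hypothetical, and the code cannot verify that preview_text is actually written in the given language, so that remains the caller's responsibility.

```python
# Language codes from the table above.
SUPPORTED_LANGUAGES = {
    "zh": "Chinese", "en": "English", "de": "German", "it": "Italian",
    "pt": "Portuguese", "es": "Spanish", "ja": "Japanese", "ko": "Korean",
    "fr": "French", "ru": "Russian",
}


def build_create_input(
    target_model: str,
    voice_prompt: str,
    preview_text: str,
    preferred_name: str,
    language: str,
) -> dict:
    """Build the "input" object for a Create voice request, rejecting unknown language codes."""
    if language not in SUPPORTED_LANGUAGES:
        supported = ", ".join(sorted(SUPPORTED_LANGUAGES))
        raise ValueError(f"Unsupported language {language!r}; expected one of: {supported}")
    return {
        "action": "create",
        "target_model": target_model,
        "voice_prompt": voice_prompt,  # Chinese or English only
        "preview_text": preview_text,  # must be written in `language`
        "preferred_name": preferred_name,
        "language": language,
    }
```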

Write effective voice descriptions

A voice description (voice_prompt) tells the model what voice to generate. Combine gender, age, tone, and use case to define a distinctive voice.

Constraints

Max length: 2,048 characters.
Languages: Chinese and English only.

Description dimensions

Dimension	Examples
Gender	Male, female, neutral
Age	Child (5--12), teenager (13--18), young adult (19--35), middle-aged (36--55), elderly (55+)
Pitch	High, medium, low, high-pitched, low-pitched
Pace	Fast, medium, slow, fast-paced, slow-paced
Emotion	Cheerful, calm, gentle, serious, lively, composed, soothing
Characteristics	Magnetic, crisp, hoarse, mellow, sweet, rich, powerful
Use case	News broadcast, ad voice-over, audiobook, animation character, voice assistant, documentary narration

Tips

Be specific. Use concrete qualities like "deep," "crisp," or "fast-paced." Avoid vague terms like "nice" or "normal."
Use multiple dimensions. Combine gender, age, emotion, and use case. "Female voice" alone is too broad.
Be objective. Focus on physical and perceptual features. Write "high-pitched and energetic" instead of "my favorite voice."
Be original. Describe voice qualities directly. Celebrity imitation is not supported and involves copyright risks.
Be concise. Every word should serve a purpose. Avoid stacked synonyms and empty intensifiers such as "very, very."
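One way to apply these tips consistently is to assemble the description from separate dimensions. The following is a minimal sketch, assuming an English prompt and a fixed sentence template. The function name build_voice_prompt and the template itself are illustrative choices, not a format the API requires.

```python
MAX_PROMPT_LENGTH = 2048  # character limit from the Constraints section


def build_voice_prompt(age: str, gender: str, qualities: list[str], use_case: str) -> str:
    """Combine several description dimensions into one English voice_prompt."""
    if not qualities:
        raise ValueError("Give at least one concrete quality, such as 'deep' or 'fast-paced'.")
    # Pick the article that matches the first word ("An elderly", "A young").
    article = "An" if age[:1].lower() in "aeiou" else "A"
    prompt = (
        f"{article} {age} {gender} voice with a {', '.join(qualities)} tone, "
        f"suitable for {use_case}."
    )
    if len(prompt) > MAX_PROMPT_LENGTH:
        raise ValueError(f"voice_prompt is {len(prompt)} characters; the limit is {MAX_PROMPT_LENGTH}")
    return prompt
```

For example, build_voice_prompt("middle-aged", "male", ["deep", "magnetic"], "documentary narration") returns "A middle-aged male voice with a deep, magnetic tone, suitable for documentary narration."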

Examples

Good descriptions:

"A young, lively female voice with a fast pace and noticeable upward inflection, suitable for fashion product introductions."
"A calm, middle-aged male voice with a slow pace and deep, magnetic tone, suitable for news or documentary narration."
"A cute child's voice, around 8 years old, with a slightly childish tone, suitable for animation character voice-overs."

Ineffective descriptions:

Description	Issue	Improvement
"A nice voice"	Too vague	"A young female voice with a clear vocal line and gentle tone."
"A voice like a certain celebrity"	Celebrity imitation not supported	"A mature, magnetic male voice with a calm pace."
"A very, very, very nice female voice"	Redundant repetition	"A female voice, 20--24 years old, with a light tone and sweet quality."

Voice quota and cleanup

Account limit: 1,000 voices per account. Check the total_count field in the List voices response.
Automatic cleanup: Voices that have not been used for synthesis in the past year are deleted automatically.
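The remaining quota can be computed from the List voices response. The following is a minimal sketch. It assumes that total_count sits under the output object of the response, which this page does not confirm, so check the Voice design API reference for the actual location. The function name remaining_voice_quota is hypothetical.

```python
VOICE_LIMIT = 1000  # account limit stated above


def remaining_voice_quota(list_response: dict, limit: int = VOICE_LIMIT) -> int:
    """Return how many more voices the account can create.

    Assumption: total_count is located at list_response["output"]["total_count"].
    """
    total_count = list_response["output"]["total_count"]
    return max(limit - total_count, 0)
```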

Error codes

If a call fails, see Error messages. Common voice design errors:

HTTP status	Error code	Cause	Resolution
400	BadRequest.VoiceNotFound	The specified voice does not exist (in voice design or synthesis operations)	Verify the voice name with List voices or Query a voice. If the voice does not exist, create a new voice with Create a voice.
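A client can detect this error and fall back to creating a new voice. The following is a minimal sketch. It assumes that the error body is a JSON object with a top-level code field holding the error code, which this page does not show. The function name is_voice_not_found is hypothetical.

```python
VOICE_NOT_FOUND = "BadRequest.VoiceNotFound"


def is_voice_not_found(status: int, body: dict) -> bool:
    """Return True if an error response reports that the voice does not exist.

    Assumption: the error body carries the error code in a top-level "code" field.
    """
    return status == 400 and body.get("code") == VOICE_NOT_FOUND
```

With the requests library, a call such as is_voice_not_found(response.status_code, response.json()) can decide whether to verify the name with List voices or to create a new voice.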

Next steps

Voice design API reference -- API parameters and response format
Realtime streaming TTS -- Use custom voices for real-time synthesis
Qwen TTS -- Use custom voices for non-streaming synthesis
Get an API key -- Set up authentication