
Real-time speech synthesis

Stream TTS in real time

Qwen Cloud provides two families of real-time speech synthesis models: CosyVoice for streaming synthesis with SSML control, and Qwen-TTS-Realtime for real-time synthesis with instruction-based voice control, voice cloning, and voice design.

Core features

  • Generates high-fidelity speech in real time with natural pronunciation in multiple languages, such as Chinese and English
  • Supports voice customization through Qwen-TTS-Realtime voice cloning and voice design
  • Supports streaming input and output with low-latency responses for real-time interactive scenarios
  • Adjustable speech rate, pitch, volume, and bitrate for fine-grained control over vocal expression
  • Compatible with mainstream audio formats, supporting output up to 48 kHz sample rate
  • Supports instruction control, enabling natural language instructions to control vocal expressiveness

Availability

  • CosyVoice
  • Qwen-TTS-Realtime
Supported models: An API key is required to invoke the following models.
  • CosyVoice: cosyvoice-v3-plus, cosyvoice-v3-flash
For more information, see the Model list.

Model selection

  • CosyVoice
  • Qwen-TTS-Realtime
| Scenario | Recommended | Reason | Notes |
| --- | --- | --- | --- |
| Intelligent customer service / voice assistant | cosyvoice-v3-flash | Lower cost than plus models, with support for streaming interaction and emotional expression, delivering fast responses at an affordable price point. | |
| Educational applications (including formula reading) | cosyvoice-v3-flash, cosyvoice-v3-plus | Supports LaTeX formula-to-speech conversion, ideal for mathematics, physics, and chemistry instruction. | cosyvoice-v3-plus has higher costs ($0.286706 per 10,000 characters). |
| Structured voice broadcasting (news/announcements) | cosyvoice-v3-plus, cosyvoice-v3-flash | Supports SSML for controlling speech rate, pauses, and pronunciation to enhance broadcast professionalism. | Implement the SSML generation logic independently. This model does not support emotion settings. |
| Precise speech-text alignment (caption generation, lesson playback, dictation practice) | cosyvoice-v3-flash, cosyvoice-v3-plus | Supports timestamp output to synchronize the synthesized speech with the original text. | Manually enable the timestamp feature. |
| Multilingual international products | cosyvoice-v3-flash, cosyvoice-v3-plus | Supports multiple languages. | |
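
As the model selection guidance notes, SSML generation logic must be implemented in your own code. A minimal sketch of assembling an SSML string with pauses between sentences (`<speak>` and `<break>` are standard SSML elements; check the SSML usage guide for the exact subset CosyVoice supports, and note that the `build_ssml` helper is illustrative, not part of the SDK):

```python
def build_ssml(sentences, pause_ms=300):
    """Join sentences into a single SSML document with a pause between them."""
    body = f'<break time="{pause_ms}ms"/>'.join(sentences)
    return f"<speak>{body}</speak>"

# The resulting string can be passed to the synthesizer in place of plain text
# for models and voices that support SSML.
ssml = build_ssml(["Top story tonight.", "Markets closed higher."], pause_ms=500)
```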

Getting started

  • CosyVoice
  • Qwen-TTS-Realtime
Get an API key and set it as an environment variable, and install the SDK. For more code examples, see GitHub.
  • Use system voices
Save synthesized audio to a file
For available voices, see the Voice list.
  • Python
  • Java
# coding=utf-8

import os
import dashscope
from dashscope.audio.tts_v2 import SpeechSynthesizer

# If you have not configured environment variables, replace the following line with your API key: dashscope.api_key = "sk-xxx"
dashscope.api_key = os.environ.get('DASHSCOPE_API_KEY')

dashscope.base_websocket_api_url='wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference'

# Model
# cosyvoice-v3-flash/cosyvoice-v3-plus: Use voices such as longanyang.
# Each voice supports different languages. When synthesizing non-Chinese languages such as Japanese or Korean, select a voice that supports the corresponding language. For more information, see the CosyVoice voice list.
model = "cosyvoice-v3-flash"
# Voice
voice = "longanyang"

# Instantiate SpeechSynthesizer and pass request parameters such as model and voice in the constructor.
synthesizer = SpeechSynthesizer(model=model, voice=voice)
# Send the text to be synthesized and get the binary audio.
audio = synthesizer.call("How is the weather today?")
# The first time you send text, a WebSocket connection is established. The first packet delay includes the connection establishment time.
print('[Metric] Request ID: {}, First packet delay: {} ms'.format(
  synthesizer.get_last_request_id(),
  synthesizer.get_first_package_delay()))

# Save the audio locally.
with open('output.mp3', 'wb') as f:
  f.write(audio)
Convert LLM-generated text to speech in real time and play it through speakers
Play text from a Qwen model (qwen3.5-flash) as speech in real time on a local device.
  • Python
  • Java
Before you run the Python example, install a third-party audio playback library using pip.
# coding=utf-8
# Installation instructions for pyaudio:
# APPLE Mac OS X
#   brew install portaudio
#   pip install pyaudio
# Debian/Ubuntu
#   sudo apt-get install python-pyaudio python3-pyaudio
#   or
#   pip install pyaudio
# CentOS
#   sudo yum install -y portaudio portaudio-devel && pip install pyaudio
# Microsoft Windows
#   python -m pip install pyaudio

import os
import pyaudio
import dashscope
from dashscope.audio.tts_v2 import SpeechSynthesizer, ResultCallback, AudioFormat


from http import HTTPStatus
from dashscope import Generation

# If you have not configured environment variables, replace the following line with your API key: dashscope.api_key = "sk-xxx"
dashscope.api_key = os.environ.get('DASHSCOPE_API_KEY')

dashscope.base_websocket_api_url='wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference'

# cosyvoice-v3-flash/cosyvoice-v3-plus: Use voices such as longanyang.
# Each voice supports different languages. When synthesizing non-Chinese languages such as Japanese or Korean, select a voice that supports the corresponding language. For more information, see the CosyVoice voice list.
model = "cosyvoice-v3-flash"
voice = "longanyang"


class Callback(ResultCallback):
  _player = None
  _stream = None

  def on_open(self):
    print("websocket is open.")
    self._player = pyaudio.PyAudio()
    self._stream = self._player.open(
      format=pyaudio.paInt16, channels=1, rate=22050, output=True
    )

  def on_complete(self):
    print("speech synthesis task completed successfully.")

  def on_error(self, message: str):
    print(f"speech synthesis task failed, {message}")

  def on_close(self):
    print("websocket is closed.")
    # stop player
    self._stream.stop_stream()
    self._stream.close()
    self._player.terminate()

  def on_event(self, message):
    print(f"received speech synthesis message {message}")

  def on_data(self, data: bytes) -> None:
    print("audio result length:", len(data))
    self._stream.write(data)


def synthesizer_with_llm():
  callback = Callback()
  synthesizer = SpeechSynthesizer(
    model=model,
    voice=voice,
    format=AudioFormat.PCM_22050HZ_MONO_16BIT,
    callback=callback,
  )

  messages = [{"role": "user", "content": "Please introduce yourself"}]
  responses = Generation.call(
    model="qwen3.5-flash",
    messages=messages,
    result_format="message",  # set result format as 'message'
    stream=True,  # enable stream output
    incremental_output=True,  # enable incremental output
  )
  for response in responses:
    if response.status_code == HTTPStatus.OK:
      print(response.output.choices[0]["message"]["content"], end="")
      synthesizer.streaming_call(response.output.choices[0]["message"]["content"])
    else:
      print(
        "Request id: %s, Status code: %s, error code: %s, error message: %s"
        % (
          response.request_id,
          response.status_code,
          response.code,
          response.message,
        )
      )
  synthesizer.streaming_complete()
  print('requestId: ', synthesizer.get_last_request_id())


if __name__ == "__main__":
  synthesizer_with_llm()
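
In the loop above, every LLM delta is forwarded to `streaming_call` as-is. Buffering deltas and flushing on sentence boundaries can cut down on very short fragments sent to the synthesizer. A minimal sketch (the `SentenceBuffer` helper is illustrative, not part of the DashScope SDK):

```python
class SentenceBuffer:
    """Accumulates streamed text chunks and releases complete sentences."""

    TERMINATORS = ".!?。！？"

    def __init__(self):
        self._buf = ""

    def feed(self, chunk):
        """Add a chunk; return any complete sentences ready to synthesize."""
        self._buf += chunk
        out = []
        start = 0
        for i, ch in enumerate(self._buf):
            if ch in self.TERMINATORS:
                out.append(self._buf[start:i + 1].lstrip())
                start = i + 1
        self._buf = self._buf[start:].lstrip()
        return out

    def flush(self):
        """Return any trailing text when the stream ends."""
        rest, self._buf = self._buf, ""
        return rest
```

In the example above, each delta would go through `feed`, with `streaming_call` invoked once per returned sentence, and `flush` called just before `streaming_complete`.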

Interaction flow

  • CosyVoice
  • Qwen-TTS-Realtime
CosyVoice uses a WebSocket-based streaming protocol. For protocol details, see the CosyVoice WebSocket API reference.

Instruction control

  • CosyVoice
  • Qwen-TTS-Realtime
CosyVoice supports instruction control only for cosyvoice-v3-flash. Use SSML for fine-grained pronunciation and prosody control with other CosyVoice models.

Voice customization

  • CosyVoice
  • Qwen-TTS-Realtime

Voice cloning: Input audio formats

High-quality input audio is the foundation for achieving excellent cloning results.
| Item | Requirements |
| --- | --- |
| Supported formats | WAV (16-bit), MP3, M4A |
| Audio duration | Recommended: 10 to 20 seconds. Maximum: 60 seconds. |
| File size | ≤ 10 MB |
| Sample rate | ≥ 16 kHz |
| Sound channel | Mono or stereo. For stereo audio, only the first channel is processed; make sure the first channel contains a clear human voice. |
| Content | The audio must contain at least 5 seconds of continuous, clear speech without background sound, and any remaining pauses must be short (≤ 2 seconds). The entire segment should be free of background music, noise, and other voices. Use normal spoken audio as input; do not upload songs or singing, which degrade cloning accuracy. |
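
For WAV input, most of these requirements can be pre-checked locally with Python's standard `wave` module before uploading (MP3/M4A would need a third-party decoder; the `check_clone_wav` helper below is an illustrative sketch, not part of the SDK):

```python
import os
import wave

MAX_BYTES = 10 * 1024 * 1024   # file size: <= 10 MB
MIN_SAMPLE_RATE = 16_000       # sample rate: >= 16 kHz
MAX_DURATION_S = 60            # duration: <= 60 seconds

def check_clone_wav(path):
    """Return a list of requirement violations for a WAV reference clip."""
    problems = []
    if os.path.getsize(path) > MAX_BYTES:
        problems.append("file exceeds 10 MB")
    with wave.open(path, "rb") as wf:
        if wf.getsampwidth() != 2:
            problems.append("not 16-bit PCM")
        if wf.getframerate() < MIN_SAMPLE_RATE:
            problems.append("sample rate below 16 kHz")
        if wf.getnframes() / wf.getframerate() > MAX_DURATION_S:
            problems.append("longer than 60 seconds")
    return problems
```

An empty list means the clip passes the checks that can be verified mechanically; content requirements (clear speech, no background music) still need human review.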

Voice design: Write high-quality voice descriptions

Limitations

When writing voice descriptions (voice_prompt), follow these technical constraints:
  • Length limit: The content of voice_prompt must not exceed 500 characters.
  • Supported languages: The description text supports only Chinese and English.
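
The length constraint can be checked client-side before submitting a request; a minimal sketch (the `validate_voice_prompt` helper name is illustrative, and language detection is intentionally left out because it cannot be verified with a simple check):

```python
def validate_voice_prompt(prompt):
    """Return a problem description for an invalid voice_prompt, or None if it passes."""
    if not prompt.strip():
        return "empty description"
    if len(prompt) > 500:
        return "exceeds 500 characters"
    return None
```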

Core principles

The voice_prompt guides the model to generate voices with specific characteristics. Follow these core principles when describing voices:
  • Be specific, not vague: Use words that describe concrete sound qualities, such as "deep," "crisp," or "fast-paced." Avoid subjective, uninformative terms such as "nice-sounding" or "ordinary."
  • Be multidimensional, not single-dimensional: Excellent descriptions typically combine multiple dimensions, such as gender, age, and emotion. Single-dimensional descriptions, such as "female voice," are too broad to generate distinctive voices.
  • Be objective, not subjective: Focus on the physical and perceptual characteristics of the sound itself, not your personal preferences. For example, use "high-pitched with energetic delivery" instead of "my favorite voice."
  • Be original, not imitative: Describe sound characteristics rather than requesting imitation of specific individuals, such as celebrities or actors. Such requests pose copyright risks, and the model does not support direct imitation.
  • Be concise, not redundant: Ensure every word adds meaning. Avoid repeating synonyms or using meaningless intensifiers, such as "very very nice voice."

Dimension example

| Dimension | Example |
| --- | --- |
| Gender | Male, female, neutral |
| Age | Child (5-12 years), teenager (13-18 years), young adult (19-35 years), middle-aged (36-55 years), senior (55+ years) |
| Pitch | High, medium, low, slightly high, slightly low |
| Speech rate | Fast, medium, slow, slightly fast, slightly slow |
| Emotion | Cheerful, calm, gentle, serious, lively, cool, soothing |
| Characteristics | Magnetic, crisp, raspy, mellow, sweet, rich, powerful |
| Purpose | News broadcasting, advertisement voice-over, audiobooks, animated characters, voice assistants, documentary narration |

Example comparison

Good cases:
  • "Young and lively female voice, fast speech rate with noticeable rising intonation, suitable for introducing fashion products."
    • Analysis: This description combines age, personality, speech rate, and intonation, and specifies the use case, creating a clear voice profile.
  • "Calm middle-aged male, slow speech rate, deep and magnetic voice quality, suitable for reading news or documentary narration."
    • Analysis: This description clearly defines gender, age range, speech rate, voice quality, and intended use.
  • "Cute child's voice, approximately 8-year-old girl, slightly childish speech, suitable for animated character dubbing."
    • Analysis: This description pinpoints the specific age and voice quality (childishness) and has a clear purpose.
  • "Gentle and intellectual female, around 30 years old, calm tone, suitable for audiobook narration."
    • Analysis: This description effectively conveys voice emotion and style through terms such as "intellectual" and "calm."
Bad cases and suggestions:
| Bad case | Main issue | Improvement suggestion |
| --- | --- | --- |
| "Nice-sounding voice" | Too vague and subjective; lacks actionable detail. | Add specific dimensions, such as "Clear-toned young female voice with gentle intonation." |
| "Voice like a celebrity" | Poses a copyright risk, and the model does not support direct imitation. | Extract the voice characteristics into the description, such as "Mature, magnetic, steady-paced male voice." |
| "Very very very nice female voice" | Redundant; repeating words does not help define the voice. | Remove the repetition and add effective descriptions, such as "A 20- to 24-year-old female voice with a light, cheerful tone, lively pitch, and sweet quality." |
| "123456" | Invalid input; cannot be parsed as voice characteristics. | Provide a meaningful text description. See the recommended examples above. |

API reference

Model comparison

  • CosyVoice
  • Qwen-TTS-Realtime
| Feature | cosyvoice-v3-plus | cosyvoice-v3-flash |
| --- | --- | --- |
| Supported languages | Varies by system voice: Chinese (Mandarin, Northeastern, Minnan, Shaanxi), English, Japanese, Korean | Varies by system voice: Chinese (Mandarin), English |
| Audio format | pcm, wav, mp3, opus | pcm, wav, mp3, opus |
| Audio sample rate | 8 kHz, 16 kHz, 22.05 kHz, 24 kHz, 44.1 kHz, 48 kHz | 8 kHz, 16 kHz, 22.05 kHz, 24 kHz, 44.1 kHz, 48 kHz |
| Voice cloning | Not supported | Not supported |
| Voice design | Not supported | Not supported |
| SSML | Supported. Applies to cloned voices and system voices marked as supporting SSML in the Voice list. For usage instructions, see SSML. | Supported. Applies to cloned voices and system voices marked as supporting SSML in the Voice list. For usage instructions, see SSML. |
| LaTeX | Supported. For usage instructions, see LaTeX formula-to-speech. | Supported. For usage instructions, see LaTeX formula-to-speech. |
| Volume adjustment | Supported. See request parameter volume. | Supported. See request parameter volume. |
| Speech rate adjustment | Supported. See request parameter speech_rate (Java SDK: speechRate). | Supported. See request parameter speech_rate (Java SDK: speechRate). |
| Pitch adjustment | Supported. See request parameter pitch_rate (Java SDK: pitchRate). | Supported. See request parameter pitch_rate (Java SDK: pitchRate). |
| Bitrate adjustment | Supported for the opus audio format only. See request parameter bit_rate (Java SDK: .parameter("bit_rate", value)). | Supported for the opus audio format only. See request parameter bit_rate (Java SDK: .parameter("bit_rate", value)). |
| Timestamp | Supported; disabled by default. Applies to cloned voices and system voices marked as supporting timestamps in the Voice list. See request parameter word_timestamp_enabled (Java SDK: enableWordTimestamp). | Supported; disabled by default. Applies to cloned voices and system voices marked as supporting timestamps in the Voice list. See request parameter word_timestamp_enabled (Java SDK: enableWordTimestamp). |
| Instruction control (Instruct) | Not supported | Supported. Applies to system voices marked as supporting Instruct in the Voice list. See request parameter instruction. |
| Streaming input | Supported | Supported |
| Streaming output | Supported | Supported |
| Rate limit (RPS) | 3 | 3 |
| Connection type | Java/Python SDK, WebSocket API | Java/Python SDK, WebSocket API |
| Price | $0.26 per 10,000 characters | $0.13 per 10,000 characters |
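
Since billing is per 10,000 characters, a rough client-side cost estimate is a simple proportion. A sketch using the list prices shown above (prices may change; always check the official pricing page before relying on these figures):

```python
# List prices in USD per 10,000 characters, taken from the comparison table above.
PRICE_PER_10K_CHARS = {
    "cosyvoice-v3-plus": 0.26,
    "cosyvoice-v3-flash": 0.13,
}

def estimate_cost(model, num_chars):
    """Estimate synthesis cost in USD for a given character count."""
    return PRICE_PER_10K_CHARS[model] * num_chars / 10_000
```

For example, synthesizing 20,000 characters with cosyvoice-v3-flash costs about $0.26 at the listed price.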

System voices

  • CosyVoice
  • Qwen-TTS-Realtime

FAQ

  • CosyVoice
  • Qwen-TTS-Realtime
  • Replace characters with multiple pronunciations with homophones to quickly resolve pronunciation issues.
  • Use the Speech Synthesis Markup Language (SSML) to control pronunciation.