
Real-time speech synthesis

Stream TTS in real time

Qwen Cloud provides two families of real-time speech synthesis models: CosyVoice for streaming synthesis with SSML control, and Qwen-TTS-Realtime for real-time synthesis with instruction-based voice control, voice cloning, and voice design.

Core features

  • Generates high-fidelity speech in real time with natural pronunciation in multiple languages, such as Chinese and English
  • Supports voice customization through Qwen-TTS-Realtime voice cloning and voice design
  • Supports streaming input and output with low-latency responses for real-time interactive scenarios
  • Adjustable speech rate, pitch, volume, and bitrate for fine-grained control over vocal expression
  • Compatible with mainstream audio formats, supporting output up to 48 kHz sample rate
  • Supports instruction control, enabling natural language instructions to control vocal expressiveness

Availability

  • CosyVoice
  • Qwen-TTS-Realtime
Supported models: An API key is required to invoke the following models.
  • CosyVoice: cosyvoice-v3-plus, cosyvoice-v3-flash
For more information, see the Model list.

Model selection

  • CosyVoice
  • Qwen-TTS-Realtime
| Scenario | Recommended | Reason | Notes |
| --- | --- | --- | --- |
| Intelligent customer service / voice assistant | cosyvoice-v3-flash | Lower cost than plus models, with support for streaming interaction and emotional expression, delivering fast responses at an affordable price point. | |
| Educational applications (including formula reading) | cosyvoice-v3-flash, cosyvoice-v3-plus | Supports LaTeX formula-to-speech conversion, ideal for mathematics, physics, and chemistry instruction. | cosyvoice-v3-plus has higher costs ($0.286706 per 10,000 characters). |
| Structured voice broadcasting (news/announcements) | cosyvoice-v3-plus, cosyvoice-v3-flash | Supports SSML for controlling speech rate, pauses, and pronunciation to enhance broadcast professionalism. | Implement the SSML generation logic independently. This model does not support emotion settings. |
| Precise speech-text alignment (caption generation, lesson playback, dictation practice) | cosyvoice-v3-flash, cosyvoice-v3-plus | Supports timestamp output to synchronize the synthesized speech with the original text. | Manually enable the timestamp feature. |
| Multilingual international products | cosyvoice-v3-flash, cosyvoice-v3-plus | Supports multiple languages. | |
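
As the model selection guidance notes, SSML generation logic must be implemented in your own code. A minimal sketch of assembling an SSML string with pauses between sentences (`<speak>` and `<break>` are standard SSML elements; check the SSML usage guide for the exact subset CosyVoice supports, and note that the `build_ssml` helper is illustrative, not part of the SDK):

```python
def build_ssml(sentences, pause_ms=300):
    """Join sentences into a single SSML document with a pause between them."""
    body = f'<break time="{pause_ms}ms"/>'.join(sentences)
    return f"<speak>{body}</speak>"

# The resulting string can be passed to the synthesizer in place of plain text
# for models and voices that support SSML.
ssml = build_ssml(["Top story tonight.", "Markets closed higher."], pause_ms=500)
```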

Getting started

  • CosyVoice
  • Qwen-TTS-Realtime
Get an API key and set it as an environment variable, and install the SDK. For more code examples, see GitHub.
  • Use system voices
Save synthesized audio to a file
For available voices, see the Voice list.
  • Python
  • Java
# coding=utf-8

import os
import dashscope
from dashscope.audio.tts_v2 import SpeechSynthesizer

# If you have not configured environment variables, replace the following line with your API key: dashscope.api_key = "sk-xxx"
dashscope.api_key = os.environ.get('DASHSCOPE_API_KEY')

dashscope.base_websocket_api_url='wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference'

# Model
# cosyvoice-v3-flash/cosyvoice-v3-plus: Use voices such as longanyang.
# Each voice supports different languages. When synthesizing non-Chinese languages such as Japanese or Korean, select a voice that supports the corresponding language. For more information, see the CosyVoice voice list.
model = "cosyvoice-v3-flash"
# Voice
voice = "longanyang"

# Instantiate SpeechSynthesizer and pass request parameters such as model and voice in the constructor.
synthesizer = SpeechSynthesizer(model=model, voice=voice)
# Send the text to be synthesized and get the binary audio.
audio = synthesizer.call("How is the weather today?")
# The first time you send text, a WebSocket connection is established. The first packet delay includes the connection establishment time.
print('[Metric] Request ID: {}, First packet delay: {} ms'.format(
  synthesizer.get_last_request_id(),
  synthesizer.get_first_package_delay()))

# Save the audio locally.
with open('output.mp3', 'wb') as f:
  f.write(audio)
Convert LLM-generated text to speech in real time and play it through speakers
Play text from a Qwen model (qwen3.5-flash) as speech in real time on a local device.
  • Python
  • Java
Before you run the Python example, install a third-party audio playback library using pip.
# coding=utf-8
# Installation instructions for pyaudio:
# APPLE Mac OS X
#   brew install portaudio
#   pip install pyaudio
# Debian/Ubuntu
#   sudo apt-get install python-pyaudio python3-pyaudio
#   or
#   pip install pyaudio
# CentOS
#   sudo yum install -y portaudio portaudio-devel && pip install pyaudio
# Microsoft Windows
#   python -m pip install pyaudio

import os
import pyaudio
import dashscope
from dashscope.audio.tts_v2 import SpeechSynthesizer, ResultCallback, AudioFormat


from http import HTTPStatus
from dashscope import Generation

# If you have not configured environment variables, replace the following line with your API key: dashscope.api_key = "sk-xxx"
dashscope.api_key = os.environ.get('DASHSCOPE_API_KEY')

dashscope.base_websocket_api_url='wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference'

# cosyvoice-v3-flash/cosyvoice-v3-plus: Use voices such as longanyang.
# Each voice supports different languages. When synthesizing non-Chinese languages such as Japanese or Korean, select a voice that supports the corresponding language. For more information, see the CosyVoice voice list.
model = "cosyvoice-v3-flash"
voice = "longanyang"


class Callback(ResultCallback):
  _player = None
  _stream = None

  def on_open(self):
    print("websocket is open.")
    self._player = pyaudio.PyAudio()
    self._stream = self._player.open(
      format=pyaudio.paInt16, channels=1, rate=22050, output=True
    )

  def on_complete(self):
    print("speech synthesis task completed successfully.")

  def on_error(self, message: str):
    print(f"speech synthesis task failed, {message}")

  def on_close(self):
    print("websocket is closed.")
    # stop player
    self._stream.stop_stream()
    self._stream.close()
    self._player.terminate()

  def on_event(self, message):
    print(f"received speech synthesis message {message}")

  def on_data(self, data: bytes) -> None:
    print("audio result length:", len(data))
    self._stream.write(data)


def synthesizer_with_llm():
  callback = Callback()
  synthesizer = SpeechSynthesizer(
    model=model,
    voice=voice,
    format=AudioFormat.PCM_22050HZ_MONO_16BIT,
    callback=callback,
  )

  messages = [{"role": "user", "content": "Please introduce yourself"}]
  responses = Generation.call(
    model="qwen3.5-flash",
    messages=messages,
    result_format="message",  # set result format as 'message'
    stream=True,  # enable stream output
    incremental_output=True,  # enable incremental output
  )
  for response in responses:
    if response.status_code == HTTPStatus.OK:
      print(response.output.choices[0]["message"]["content"], end="")
      synthesizer.streaming_call(response.output.choices[0]["message"]["content"])
    else:
      print(
        "Request id: %s, Status code: %s, error code: %s, error message: %s"
        % (
          response.request_id,
          response.status_code,
          response.code,
          response.message,
        )
      )
  synthesizer.streaming_complete()
  print('requestId: ', synthesizer.get_last_request_id())


if __name__ == "__main__":
  synthesizer_with_llm()
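
In the loop above, every LLM delta is forwarded to `streaming_call` as-is. Buffering deltas and flushing on sentence boundaries can cut down on very short fragments sent to the synthesizer. A minimal sketch (the `SentenceBuffer` helper is illustrative, not part of the DashScope SDK):

```python
class SentenceBuffer:
    """Accumulates streamed text chunks and releases complete sentences."""

    TERMINATORS = ".!?。！？"

    def __init__(self):
        self._buf = ""

    def feed(self, chunk):
        """Add a chunk; return any complete sentences ready to synthesize."""
        self._buf += chunk
        out = []
        start = 0
        for i, ch in enumerate(self._buf):
            if ch in self.TERMINATORS:
                out.append(self._buf[start:i + 1].lstrip())
                start = i + 1
        self._buf = self._buf[start:].lstrip()
        return out

    def flush(self):
        """Return any trailing text when the stream ends."""
        rest, self._buf = self._buf, ""
        return rest
```

In the example above, each delta would go through `feed`, with `streaming_call` invoked once per returned sentence, and `flush` called just before `streaming_complete`.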

Interaction flow

  • CosyVoice
  • Qwen-TTS-Realtime
CosyVoice uses a WebSocket-based streaming protocol. For protocol details, see the CosyVoice WebSocket API reference.

Instruction control

  • CosyVoice
  • Qwen-TTS-Realtime
CosyVoice supports instruction control only for cosyvoice-v3-flash. Use SSML for fine-grained pronunciation and prosody control with other CosyVoice models.

Voice customization

  • CosyVoice
  • Qwen-TTS-Realtime

Voice cloning: Input audio formats

High-quality input audio is the foundation for achieving excellent cloning results.
| Item | Requirements |
| --- | --- |
| Supported formats | WAV (16-bit), MP3, M4A |
| Audio duration | Recommended: 10 to 20 seconds. Maximum: 60 seconds. |
| File size | ≤ 10 MB |
| Sample rate | ≥ 16 kHz |
| Sound channel | Mono or stereo. For stereo audio, only the first channel is processed; make sure the first channel contains a clear human voice. |
| Content | The audio must contain at least 5 seconds of continuous, clear speech without background sound, and any remaining pauses must be short (≤ 2 seconds). The entire segment should be free of background music, noise, and other voices. Use normal spoken audio as input; do not upload songs or singing, which degrade cloning accuracy. |
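
For WAV input, most of these requirements can be pre-checked locally with Python's standard `wave` module before uploading (MP3/M4A would need a third-party decoder; the `check_clone_wav` helper below is an illustrative sketch, not part of the SDK):

```python
import os
import wave

MAX_BYTES = 10 * 1024 * 1024   # file size: <= 10 MB
MIN_SAMPLE_RATE = 16_000       # sample rate: >= 16 kHz
MAX_DURATION_S = 60            # duration: <= 60 seconds

def check_clone_wav(path):
    """Return a list of requirement violations for a WAV reference clip."""
    problems = []
    if os.path.getsize(path) > MAX_BYTES:
        problems.append("file exceeds 10 MB")
    with wave.open(path, "rb") as wf:
        if wf.getsampwidth() != 2:
            problems.append("not 16-bit PCM")
        if wf.getframerate() < MIN_SAMPLE_RATE:
            problems.append("sample rate below 16 kHz")
        if wf.getnframes() / wf.getframerate() > MAX_DURATION_S:
            problems.append("longer than 60 seconds")
    return problems
```

An empty list means the clip passes the checks that can be verified mechanically; content requirements (clear speech, no background music) still need human review.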

Voice design: Write high-quality voice descriptions

Limitations

When writing voice descriptions (voice_prompt), follow these technical constraints:
  • Length limit: The content of voice_prompt must not exceed 500 characters.
  • Supported languages: The description text supports only Chinese and English.
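
The length constraint can be checked client-side before submitting a request; a minimal sketch (the `validate_voice_prompt` helper name is illustrative, and language detection is intentionally left out because it cannot be verified with a simple check):

```python
def validate_voice_prompt(prompt):
    """Return a problem description for an invalid voice_prompt, or None if it passes."""
    if not prompt.strip():
        return "empty description"
    if len(prompt) > 500:
        return "exceeds 500 characters"
    return None
```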

Core principles

The voice_prompt guides the model to generate voices with specific characteristics. Follow these core principles when describing voices:
  • Be specific, not vague: Use words that describe concrete sound qualities, such as "deep," "crisp," or "fast-paced." Avoid subjective, uninformative terms such as "nice-sounding" or "ordinary."
  • Be multidimensional, not single-dimensional: Excellent descriptions typically combine multiple dimensions, such as gender, age, and emotion. Single-dimensional descriptions, such as "female voice," are too broad to generate distinctive voices.
  • Be objective, not subjective: Focus on the physical and perceptual characteristics of the sound itself, not your personal preferences. For example, use "high-pitched with energetic delivery" instead of "my favorite voice."
  • Be original, not imitative: Describe sound characteristics rather than requesting imitation of specific individuals, such as celebrities or actors. Such requests pose copyright risks, and the model does not support direct imitation.
  • Be concise, not redundant: Ensure every word adds meaning. Avoid repeating synonyms or using meaningless intensifiers, such as "very very nice voice."

Dimension example

| Dimension | Example |
| --- | --- |
| Gender | Male, female, neutral |
| Age | Child (5-12 years), teenager (13-18 years), young adult (19-35 years), middle-aged (36-55 years), senior (55+ years) |
| Pitch | High, medium, low, slightly high, slightly low |
| Speech rate | Fast, medium, slow, slightly fast, slightly slow |
| Emotion | Cheerful, calm, gentle, serious, lively, cool, soothing |
| Characteristics | Magnetic, crisp, raspy, mellow, sweet, rich, powerful |
| Purpose | News broadcasting, advertisement voice-over, audiobooks, animated characters, voice assistants, documentary narration |

Example comparison

Good cases:
  • "Young and lively female voice, fast speech rate with noticeable rising intonation, suitable for introducing fashion products."
    • Analysis: This description combines age, personality, speech rate, and intonation, and specifies the use case, creating a clear voice profile.
  • "Calm middle-aged male, slow speech rate, deep and magnetic voice quality, suitable for reading news or documentary narration."
    • Analysis: This description clearly defines gender, age range, speech rate, voice quality, and intended use.
  • "Cute child's voice, approximately 8-year-old girl, slightly childish speech, suitable for animated character dubbing."
    • Analysis: This description pinpoints the specific age and voice quality (childishness) and has a clear purpose.
  • "Gentle and intellectual female, around 30 years old, calm tone, suitable for audiobook narration."
    • Analysis: This description effectively conveys voice emotion and style through terms such as "intellectual" and "calm."
Bad cases and suggestions:
| Bad case | Main issue | Improvement suggestion |
| --- | --- | --- |
| "Nice-sounding voice" | Too vague and subjective; lacks actionable detail. | Add specific dimensions, such as "Clear-toned young female voice with gentle intonation." |
| "Voice like a celebrity" | Poses a copyright risk, and the model does not support direct imitation. | Extract the voice characteristics into the description, such as "Mature, magnetic, steady-paced male voice." |
| "Very very very nice female voice" | Redundant; repeating words does not help define the voice. | Remove the repetition and add effective descriptions, such as "A 20- to 24-year-old female voice with a light, cheerful tone, lively pitch, and sweet quality." |
| "123456" | Invalid input; cannot be parsed as voice characteristics. | Provide a meaningful text description. See the recommended examples above. |

API reference

Model comparison

  • CosyVoice
  • Qwen-TTS-Realtime
| Feature | cosyvoice-v3-plus | cosyvoice-v3-flash |
| --- | --- | --- |
| Supported languages | Varies by system voice: Chinese (Mandarin, Northeastern, Minnan, Shaanxi), English, Japanese, Korean | Varies by system voice: Chinese (Mandarin), English |
| Audio format | pcm, wav, mp3, opus | pcm, wav, mp3, opus |
| Audio sample rate | 8 kHz, 16 kHz, 22.05 kHz, 24 kHz, 44.1 kHz, 48 kHz | 8 kHz, 16 kHz, 22.05 kHz, 24 kHz, 44.1 kHz, 48 kHz |
| Voice cloning | Not supported | Not supported |
| Voice design | Not supported | Not supported |
| SSML | Supported. Applies to cloned voices and system voices marked as supporting SSML in the Voice list. For usage instructions, see SSML. | Supported. Applies to cloned voices and system voices marked as supporting SSML in the Voice list. For usage instructions, see SSML. |
| LaTeX | Supported. For usage instructions, see LaTeX formula-to-speech. | Supported. For usage instructions, see LaTeX formula-to-speech. |
| Volume adjustment | Supported. See request parameter volume. | Supported. See request parameter volume. |
| Speech rate adjustment | Supported. See request parameter speech_rate (Java SDK: speechRate). | Supported. See request parameter speech_rate (Java SDK: speechRate). |
| Pitch adjustment | Supported. See request parameter pitch_rate (Java SDK: pitchRate). | Supported. See request parameter pitch_rate (Java SDK: pitchRate). |
| Bitrate adjustment | Supported for the opus audio format only. See request parameter bit_rate (Java SDK: .parameter("bit_rate", value)). | Supported for the opus audio format only. See request parameter bit_rate (Java SDK: .parameter("bit_rate", value)). |
| Timestamp | Supported; disabled by default. Applies to cloned voices and system voices marked as supporting timestamps in the Voice list. See request parameter word_timestamp_enabled (Java SDK: enableWordTimestamp). | Supported; disabled by default. Applies to cloned voices and system voices marked as supporting timestamps in the Voice list. See request parameter word_timestamp_enabled (Java SDK: enableWordTimestamp). |
| Instruction control (Instruct) | Not supported | Supported. Applies to system voices marked as supporting Instruct in the Voice list. See request parameter instruction. |
| Streaming input | Supported | Supported |
| Streaming output | Supported | Supported |
| Rate limit (RPS) | 3 | 3 |
| Connection type | Java/Python SDK, WebSocket API | Java/Python SDK, WebSocket API |
| Price | $0.26 per 10,000 characters | $0.13 per 10,000 characters |
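
Since billing is per 10,000 characters, a rough client-side cost estimate is a simple proportion. A sketch using the list prices shown above (prices may change; always check the official pricing page before relying on these figures):

```python
# List prices in USD per 10,000 characters, taken from the comparison table above.
PRICE_PER_10K_CHARS = {
    "cosyvoice-v3-plus": 0.26,
    "cosyvoice-v3-flash": 0.13,
}

def estimate_cost(model, num_chars):
    """Estimate synthesis cost in USD for a given character count."""
    return PRICE_PER_10K_CHARS[model] * num_chars / 10_000
```

For example, synthesizing 20,000 characters with cosyvoice-v3-flash costs about $0.26 at the listed price.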

System voices

  • CosyVoice
  • Qwen-TTS-Realtime

FAQ

  • CosyVoice
  • Qwen-TTS-Realtime
  • Replace characters with multiple pronunciations with homophones to quickly resolve pronunciation issues.
  • Use the Speech Synthesis Markup Language (SSML) to control pronunciation.