CosyVoice Java reference
For more examples, see GitHub.
Prerequisites
- Sign in to Qwen Cloud and create an API key. Export it as an environment variable instead of hard-coding it.
For temporary access to third-party apps or strict control over sensitive operations, use a temporary authentication token. Temporary tokens expire in 60 seconds, reducing leakage risk. Replace the API key in your code with the token.
Models and pricing
See Speech synthesis.
Text and format limits
Text length limits
- Non-streaming, unidirectional streaming, or Flowable unidirectional streaming: Max 20,000 characters per request.
- Bidirectional streaming or Flowable bidirectional streaming: Max 20,000 characters per request, 200,000 characters total.
Character counting rules
- Chinese characters (simplified, traditional, Japanese Kanji, Korean Hanja) count as 2. All other characters (punctuation, letters, numbers, Kana, Hangul) count as 1.
- SSML tags are excluded from the text length.
- Examples:
"你好"→ 2 (Chinese character) + 2 (Chinese character) = 4 characters"中A文123"→ 2 (Chinese character) + 1 (A) + 2 (Chinese character) + 1 (1) + 1 (2) + 1 (3) = 8 characters"中文。"→ 2 (Chinese character) + 2 (Chinese character) + 1 (.) = 5 characters"中 文。"→ 2 (Chinese character) + 1 (space) + 2 (Chinese character) + 1 (.) = 6 characters"<speak>你好</speak>"→ 2 (Chinese character) + 2 (Chinese character) = 4 characters
Encoding format
Use UTF-8 encoding.
Math expressions
Math expression parsing is available for cosyvoice-v3-flash and cosyvoice-v3-plus. It supports common primary and secondary school math, including basic operations, algebra, and geometry.
This feature supports Chinese input only. For details, see Convert LaTeX formulas to speech.
SSML support
SSML is available for custom voices (voice design or cloning) on cosyvoice-v3-flash and cosyvoice-v3-plus, and for system voices marked SSML-supported in the voice list.
Requirements:
- DashScope SDK 2.20.3 or later.
- Only non-streaming and unidirectional streaming calls (the call method of SpeechSynthesizer) are supported. Bidirectional streaming calls (streamingCall) and Flowable calls are not supported.
- Pass text containing SSML to the call method, the same as for normal text.
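As a minimal sketch of an SSML call: the builder-style SpeechSynthesisParam chain below follows the parameter table in this document, but the exact builder method names and the voice ID are assumptions, and the SDK dependency must be on the classpath:

```java
import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesisParam;
import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesizer;
import java.nio.ByteBuffer;

public class SsmlDemo {
    public static void main(String[] args) {
        SpeechSynthesisParam param = SpeechSynthesisParam.builder()
                .apiKey(System.getenv("DASHSCOPE_API_KEY")) // read from env; do not hard-code
                .model("cosyvoice-v3-flash")                // an SSML-capable model per this doc
                .voice("your-voice-id")                     // placeholder: an SSML-supported voice
                .build();
        // null callback: SSML must go through the non-streaming call method.
        SpeechSynthesizer synthesizer = new SpeechSynthesizer(param, null);
        ByteBuffer audio = synthesizer.call("<speak>你好，<break time=\"500ms\"/>世界</speak>");
        System.out.println("audio bytes: " + (audio == null ? 0 : audio.remaining()));
    }
}
```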
Getting started
The SpeechSynthesizer class supports the following call methods:
- Non-streaming: Sends full text and returns complete audio. Blocks until done. Best for short text.
- Unidirectional streaming: Sends full text and returns audio via callback. Non-blocking. Best for short text with low latency needs.
- Bidirectional streaming: Sends text fragments incrementally and returns audio in real time via callback. Non-blocking. Best for long text with low latency needs.
Non-streaming call
Sends text synchronously and returns the complete audio result.
Instantiate SpeechSynthesizer, bind the request parameters, and call the call method to get binary audio data. Max text length: 20,000 characters.
Re-initialize the SpeechSynthesizer instance before each call invocation.
Click to view the full example
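A hedged sketch of the non-streaming flow described above (builder method names and the voice ID are assumptions; the default output format is MP3 per the parameter table):

```java
import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesisParam;
import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesizer;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;

public class SyncTts {
    public static void main(String[] args) throws IOException {
        SpeechSynthesisParam param = SpeechSynthesisParam.builder()
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("cosyvoice-v3-flash")
                .voice("your-voice-id") // placeholder
                .build();
        // No callback: call blocks until the full audio result is returned.
        SpeechSynthesizer synthesizer = new SpeechSynthesizer(param, null);
        ByteBuffer audio = synthesizer.call("Hello, this is a synthesis test.");
        try (FileOutputStream out = new FileOutputStream("output.mp3")) {
            out.getChannel().write(audio);
        }
        synthesizer.getDuplexApi().close(1000, "done"); // release the WebSocket connection
    }
}
```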
Unidirectional streaming call
Submit text asynchronously and receive audio data incrementally through a ResultCallback.
Instantiate SpeechSynthesizer, bind the request parameters and the ResultCallback, and call the call method. Get audio in real time through onEvent. Max text length: 20,000 characters.
Re-initialize the SpeechSynthesizer instance before each call invocation.
Click to view the full example
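A hedged sketch of the unidirectional streaming flow (the package of SpeechSynthesisResult and the builder method names are assumptions; the CountDownLatch simply keeps main alive until onComplete or onError fires):

```java
import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesisParam;
import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesisResult;
import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesizer;
import com.alibaba.dashscope.common.ResultCallback;
import java.util.concurrent.CountDownLatch;

public class StreamingTts {
    public static void main(String[] args) throws InterruptedException {
        CountDownLatch done = new CountDownLatch(1);
        ResultCallback<SpeechSynthesisResult> callback = new ResultCallback<SpeechSynthesisResult>() {
            @Override public void onEvent(SpeechSynthesisResult result) {
                if (result.getAudioFrame() != null) {
                    // Audio arrives here incrementally: append the frame to a
                    // file (append mode) or feed it to a streaming player.
                }
            }
            @Override public void onComplete() { done.countDown(); }
            @Override public void onError(Exception e) { e.printStackTrace(); done.countDown(); }
        };
        SpeechSynthesisParam param = SpeechSynthesisParam.builder()
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("cosyvoice-v3-flash")
                .voice("your-voice-id") // placeholder
                .build();
        SpeechSynthesizer synthesizer = new SpeechSynthesizer(param, callback);
        synthesizer.call("Text to synthesize."); // returns null immediately; audio arrives via onEvent
        done.await();
    }
}
```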
Bidirectional streaming call
Send text in multiple chunks and receive audio incrementally through a ResultCallback.
- Call streamingCall multiple times to submit text fragments. The server auto-segments them into sentences: complete sentences are synthesized immediately, and incomplete ones are buffered until complete. streamingComplete() forces synthesis of all buffered text.
- Send text fragments within 23-second intervals to avoid timeout errors, and call streamingComplete() when done. The 23-second server timeout cannot be changed on the client.

1. Instantiate SpeechSynthesizer: instantiate SpeechSynthesizer, and bind the request parameters and the ResultCallback.
2. Stream text: call streamingCall multiple times to send text in chunks. The server returns audio in real time through onEvent. Each streamingCall text fragment: max 20,000 characters. Total across all fragments: max 200,000 characters.
3. Complete synthesis: call streamingComplete to finish. This blocks until onComplete or onError fires. Always call this method; otherwise, trailing text may not be synthesized.

Click to view the full example
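The three steps above can be sketched as follows (builder method names and the voice ID are assumptions; the callback is the same shape as in the unidirectional example):

```java
import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesisParam;
import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesisResult;
import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesizer;
import com.alibaba.dashscope.common.ResultCallback;

public class DuplexTts {
    public static void main(String[] args) {
        ResultCallback<SpeechSynthesisResult> callback = new ResultCallback<SpeechSynthesisResult>() {
            @Override public void onEvent(SpeechSynthesisResult result) {
                if (result.getAudioFrame() != null) {
                    // Audio arrives here in real time; append to a file or
                    // feed a streaming player.
                }
            }
            @Override public void onComplete() { System.out.println("synthesis finished"); }
            @Override public void onError(Exception e) { e.printStackTrace(); }
        };
        SpeechSynthesisParam param = SpeechSynthesisParam.builder()
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("cosyvoice-v3-flash")
                .voice("your-voice-id") // placeholder
                .build();
        // Step 1: bind parameters and the callback.
        SpeechSynthesizer synthesizer = new SpeechSynthesizer(param, callback);
        // Step 2: send text in chunks, each within the 23-second interval limit.
        for (String chunk : new String[]{"First fragment, ", "second fragment, ", "the end."}) {
            synthesizer.streamingCall(chunk);
        }
        // Step 3: always finish; blocks until onComplete or onError fires.
        synthesizer.streamingComplete();
        synthesizer.getDuplexApi().close(1000, "bye"); // release the WebSocket connection
    }
}
```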
Call using Flowable
Flowable is an open-source reactive programming framework under the Apache 2.0 license. For details, see the Flowable API docs. Integrate the RxJava library and understand reactive programming basics before using Flowable.
- Unidirectional streaming call
- Bidirectional streaming call
Use blockingForEach on a Flowable object to block and get each SpeechSynthesisResult. The complete result is also available through getAudioData after all streaming data returns.
Click to view the full example
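A hedged sketch of both Flowable variants (the io.reactivex.Flowable import assumes the RxJava 2 dependency mentioned above; builder method names and the voice ID are assumptions; Flowable calls take a null callback per the constructor notes below):

```java
import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesisParam;
import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesizer;
import io.reactivex.Flowable;

public class FlowableTts {
    public static void main(String[] args) {
        SpeechSynthesisParam param = SpeechSynthesisParam.builder()
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("cosyvoice-v3-flash")
                .voice("your-voice-id") // placeholder
                .build();
        SpeechSynthesizer synthesizer = new SpeechSynthesizer(param, null);

        // Unidirectional: full text in, streaming audio out.
        synthesizer.callAsFlowable("Text to synthesize.")
                .blockingForEach(result -> {
                    if (result.getAudioFrame() != null) {
                        // append the frame to a file or feed a streaming player
                    }
                });

        // Bidirectional: streaming text in, streaming audio out.
        Flowable<String> textStream = Flowable.just("First fragment, ", "second fragment.");
        synthesizer.streamingCallAsFlowable(textStream)
                .blockingForEach(result -> { /* handle result.getAudioFrame() */ });
    }
}
```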
High-concurrency calls
The DashScope Java SDK uses OkHttp3 connection pooling to reduce connection overhead. See High-concurrency management.
Request parameters
Use the chained methods of SpeechSynthesisParam to configure parameters like model and voice. Pass the configured object to the SpeechSynthesizer constructor.
Click to view an example
| Parameter | Type | Required | Description |
|---|---|---|---|
| model | String | Yes | The text-to-speech model. See Voice list for all options. |
| voice | String | Yes | The voice for synthesis. See Voice list for available system voices. |
| format | enum | No | Audio format and sample rate. Default: MP3 at 22.05 kHz. The default sample rate is optimal for the voice. Downsampling and upsampling are supported. |
| volume | int | No | Volume. Default: 50. Range: [0, 100]. Scales linearly. 0 is silent, 100 is maximum. |
| speechRate | float | No | Speech rate. Default: 1.0. Range: [0.5, 2.0]. Values below 1.0 slow down speech; above 1.0 speed it up. |
| pitchRate | float | No | Pitch multiplier. Default: 1.0. Range: [0.5, 2.0]. The relationship to perceived pitch is non-linear. Values above 1.0 raise pitch; below 1.0 lower it. Test to find suitable values. |
| bit_rate | int | No | Audio bitrate in kbps for Opus format. Default: 32. Range: [6, 510]. Set using the parameter or parameters method of SpeechSynthesisParam. See examples below. |
| enableWordTimestamp | boolean | No | Enable word-level timestamps. Default: false. Available only for system voices marked as supported in the voice list. Timestamp results are only available through the callback interface. |
| seed | int | No | Random seed for generation. Different seeds produce different results. Same seed with identical parameters reproduces the same output. Default: 0. Range: [0, 65535]. |
| languageHints | List | No | Target language for synthesis. Use when pronunciation of numbers, abbreviations, or symbols is inaccurate, or when less common languages need improvement. Valid values: zh (Chinese), en (English), fr (French), de (German), ja (Japanese), ko (Korean), ru (Russian), pt (Portuguese), th (Thai), id (Indonesian), vi (Vietnamese). Note: This is an array, but only the first element is processed. Pass one value only. |
| instruction | String | No | Control dialect, emotion, or speaking style via instructions. Available for system voices marked Instruct-supported in the voice list. Max length: 100 characters. See instruction examples below. |
| enable_aigc_tag | boolean | No | Add an invisible AIGC identifier to the audio. When true, an invisible identifier is embedded in supported formats (WAV, MP3, Opus). Default: false. Supported by cosyvoice-v3-flash and cosyvoice-v3-plus. |
| aigc_propagator | String | No | Sets the ContentPropagator field in the AIGC identifier to identify the content propagator. Takes effect only when enable_aigc_tag is true. Default: UID. Supported by cosyvoice-v3-flash and cosyvoice-v3-plus. |
| aigc_propagate_id | String | No | Sets the PropagateID field in the AIGC identifier to uniquely identify a propagation behavior. Takes effect only when enable_aigc_tag is true. Default: the current request ID. Supported by cosyvoice-v3-flash and cosyvoice-v3-plus. |
| hotFix | ParamHotFix | No | Text hotpatching configuration. Customize pronunciation of specific words or replace text before synthesis. Available only for cosyvoice-v3-flash. See the hotFix example below. |
| enable_markdown_filter | boolean | No | Remove Markdown symbols from input text before synthesis, preventing them from being read aloud. Default: false. Available only for cosyvoice-v3-flash. |
Set bit_rate
- Using parameter()
- Using parameters()
Set enable_aigc_tag
- Using parameter()
- Using parameters()
Set aigc_propagator
- Using parameter()
- Using parameters()
Set aigc_propagate_id
- Using parameter()
- Using parameters()
Set enable_markdown_filter
- Using parameter()
- Using parameters()
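Both styles can be sketched as follows. The parameter and parameters method names come from the parameter table above, but the exact builder chain, accepted value types, and all of the example values are assumptions:

```java
import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesisParam;
import java.util.HashMap;
import java.util.Map;

public class ExtraParams {
    public static void main(String[] args) {
        // One key at a time with parameter():
        SpeechSynthesisParam p1 = SpeechSynthesisParam.builder()
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("cosyvoice-v3-flash")
                .voice("your-voice-id")              // placeholder
                .parameter("bit_rate", 64)           // Opus bitrate in kbps
                .parameter("enable_markdown_filter", true)
                .build();

        // Several keys at once with parameters():
        Map<String, Object> extra = new HashMap<>();
        extra.put("enable_aigc_tag", true);
        extra.put("aigc_propagator", "my-app");            // hypothetical value
        extra.put("aigc_propagate_id", "propagation-001"); // hypothetical value
        SpeechSynthesisParam p2 = SpeechSynthesisParam.builder()
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("cosyvoice-v3-flash")
                .voice("your-voice-id")
                .parameters(extra)
                .build();
    }
}
```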
instruction examples
cosyvoice-v3-flash:
- Cloned voices: Use any natural language instruction to control synthesis. Instruction examples:
- System voices: Instructions must use a fixed format. See the voice list for details.
hotFix example
Key interfaces
SpeechSynthesizer class
Import with import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesizer;.
| Interface/Method | Parameter | Return value | Description |
|---|---|---|---|
| public SpeechSynthesizer(SpeechSynthesisParam param, ResultCallback<SpeechSynthesisResult> callback) | param: Request parameters. callback: ResultCallback for streaming calls, or null for non-streaming/Flowable calls. | SpeechSynthesizer instance | Constructor. Set callback to ResultCallback for unidirectional or bidirectional streaming. Set to null for non-streaming or Flowable calls. |
| public ByteBuffer call(String text) | text: Text to synthesize (UTF-8). | ByteBuffer or null | Converts text (plain or SSML) to speech. Without callback: blocks until complete. With callback: returns null immediately; results arrive via onEvent. |
| public void streamingCall(String text) | text: Text to synthesize (UTF-8). | None | Sends text for streaming synthesis. SSML is not supported. Call multiple times to send text in chunks. Results arrive via onEvent. See Bidirectional streaming call. |
| public void streamingComplete() throws RuntimeException | None | None | Ends streaming synthesis. Blocks until synthesis completes, the session interrupts, or the 10-minute timeout occurs. See Bidirectional streaming call. |
| public Flowable<SpeechSynthesisResult> callAsFlowable(String text) | text: Text to synthesize (UTF-8). | Flowable<SpeechSynthesisResult> | Converts non-streaming text to streaming speech output. SSML is not supported. See Call using Flowable. |
| boolean getDuplexApi().close(int code, String reason) | code: WebSocket close code. reason: Close reason. See The WebSocket Protocol. | true | Close the WebSocket connection after each task to prevent connection leaks. For connection reuse, see High-concurrency management. |
| public Flowable<SpeechSynthesisResult> streamingCallAsFlowable(Flowable<String> textStream) | textStream: Flowable wrapping text to synthesize. | Flowable<SpeechSynthesisResult> | Converts streaming text input to streaming speech output. SSML is not supported. See Call using Flowable. |
| public String getLastRequestId() | None | Request ID of the previous task. | Gets the request ID after starting a new task via call, streamingCall, callAsFlowable, or streamingCallAsFlowable. |
| public long getFirstPackageDelay() | None | First-packet latency in milliseconds. | Gets the time between sending text and receiving the first audio packet. Call after the task completes. |
Important usage requirements:
- Re-initialize the SpeechSynthesizer instance before each call invocation.
- Always call streamingComplete during bidirectional streaming to avoid missing synthesized speech.
Factors affecting first-packet latency:
- WebSocket connection establishment (first call)
- Voice loading time (varies by voice)
- Service load (peak-hour queuing)
- Network latency

Typical first-packet latency:
- Reused connection with loaded voice: ~500 ms
- First connection or voice switch: 1,500-2,000 ms

To reduce latency:
- Use connection pooling to pre-establish connections (high-concurrency scenarios).
- Check network quality.
- Avoid peak hours.
ResultCallback interface
Get synthesis results through ResultCallback during unidirectional or bidirectional streaming calls. Import with import com.alibaba.dashscope.common.ResultCallback;.
Click to view an example
| Interface/Method | Parameter | Return value | Description |
|---|---|---|---|
| public void onEvent(SpeechSynthesisResult result) | result: A SpeechSynthesisResult instance. | None | Called when the server pushes audio data. Use getAudioFrame on SpeechSynthesisResult to get binary audio. Use getUsage to get the billable character count so far. |
| public void onComplete() | None | None | Called after all synthesis data has been returned. |
| public void onError(Exception e) | e: Exception information. | None | Called when an exception occurs. Implement exception logging and resource cleanup in this method. |
Response
The server returns binary audio data:
- Non-streaming: Process the ByteBuffer returned by call.
- Unidirectional or bidirectional streaming: Process the SpeechSynthesisResult parameter in onEvent.
SpeechSynthesisResult:
| Interface/Method | Parameter | Return value | Description |
|---|---|---|---|
| public ByteBuffer getAudioFrame() | None | Binary audio data | Returns binary audio for the current segment. May be empty if no new data has arrived. Combine segments into a complete file, or play them with a streaming player. |
| public String getRequestId() | None | Request ID | Gets the task request ID. Returns null when getAudioFrame returns data. |
| public SpeechSynthesisUsage getUsage() | None | SpeechSynthesisUsage or null | Returns the billable character count so far via getCharacters(). Use the last received value as final. |
| public Sentence getTimestamp() | None | Sentence or null | Returns timestamp data when enableWordTimestamp is true. Sentence methods: getIndex (sentence number, from 0), getWords (returns List<Word>). Word methods: getText, getBeginIndex, getEndIndex, getBeginTime, getEndTime. |
For compressed formats (MP3, Opus) in streaming synthesis, use a streaming player; playing frame by frame causes decoding failures. Streaming players include FFmpeg, PyAudio (Python), AudioFormat (Java), and MediaSource (JavaScript).

When combining audio into a complete file, write in append mode. For WAV and MP3 streaming audio, only the first frame contains header information.
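The append-mode combining described above can be sketched with standard Java NIO; the byte arrays here are stand-ins for the frames that getAudioFrame would deliver in onEvent:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class AppendFrames {
    // Append one audio frame to the target file and return the new file size.
    // APPEND mode concatenates successive frames instead of overwriting them.
    static long appendFrame(Path file, ByteBuffer frame) {
        try (FileChannel ch = FileChannel.open(file,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE, StandardOpenOption.APPEND)) {
            ch.write(frame);
            return ch.size();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path out = Path.of("combined.mp3");
        Files.deleteIfExists(out);
        // Stand-ins for frames delivered via onEvent / getAudioFrame:
        appendFrame(out, ByteBuffer.wrap(new byte[]{1, 2, 3}));
        long size = appendFrame(out, ByteBuffer.wrap(new byte[]{4, 5}));
        System.out.println(size); // 5
    }
}
```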