
CosyVoice Java SDK

CosyVoice Java reference

For model overviews and voice selection, see Speech synthesis.

Prerequisites

For temporary access to third-party apps or strict control over sensitive operations, use a temporary authentication token. Temporary tokens expire in 60 seconds, reducing leakage risk. Replace the API key in your code with the token.

Models and pricing

See Speech synthesis.

Text and format limits

Text length limits

Character counting rules

  • Chinese characters (simplified, traditional, Japanese Kanji, Korean Hanja) count as 2. All other characters (punctuation, letters, numbers, Kana, Hangul) count as 1.
  • SSML tags are excluded from the text length.
  • Examples:
    • "你好" → 2 (Chinese character) + 2 (Chinese character) = 4 characters
    • "中A文123" → 2 (Chinese character) + 1 (A) + 2 (Chinese character) + 1 (1) + 1 (2) + 1 (3) = 8 characters
    • "中文。" → 2 (Chinese character) + 2 (Chinese character) + 1 (。) = 5 characters
    • "中 文。" → 2 (Chinese character) + 1 (space) + 2 (Chinese character) + 1 (。) = 6 characters
    • "<speak>你好</speak>" → 2 (Chinese character) + 2 (Chinese character) = 4 characters
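The counting rules above can be sketched in Java. This is a minimal illustration, not the service's implementation: the CJK range used here (the basic U+4E00–U+9FFF ideograph block) and the tag-stripping regex are simplifying assumptions.

```java
public class CharCounter {
  // Strip SSML/XML tags, then count: CJK ideographs = 2, everything else = 1.
  // The range U+4E00..U+9FFF is an approximation; the service may treat
  // additional ideograph ranges (e.g. extensions, Hanja) the same way.
  public static int count(String text) {
    String plain = text.replaceAll("<[^>]+>", ""); // SSML tags are excluded
    int total = 0;
    for (int i = 0; i < plain.length(); ) {
      int cp = plain.codePointAt(i);
      boolean isCjk = cp >= 0x4E00 && cp <= 0x9FFF;
      total += isCjk ? 2 : 1;
      i += Character.charCount(cp);
    }
    return total;
  }

  public static void main(String[] args) {
    System.out.println(count("你好"));                // 4
    System.out.println(count("中A文123"));            // 8
    System.out.println(count("<speak>你好</speak>")); // 4
  }
}
```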

Encoding format

Use UTF-8 encoding.

Math expressions

Math expression parsing is available for cosyvoice-v3-flash and cosyvoice-v3-plus. It supports common primary and secondary school math, including basic operations, algebra, and geometry.
This feature supports Chinese only. For details, see Convert LaTeX formulas to speech.

SSML support

SSML is available for custom voices (voice design or cloning) on cosyvoice-v3-flash and cosyvoice-v3-plus, and for system voices marked SSML-supported in the voice list.

Getting started

The SpeechSynthesizer class supports the following call methods:
  • Non-streaming: Sends full text and returns complete audio. Blocks until done. Best for short text.
  • Unidirectional streaming: Sends full text and returns audio via callback. Non-blocking. Best for short text with low latency needs.
  • Bidirectional streaming: Sends text fragments incrementally and returns audio in real time via callback. Non-blocking. Best for long text with low latency needs.

Non-streaming call

Sends text synchronously and returns the complete audio result.
Instantiate SpeechSynthesizer, bind the request parameters, and call the call method to get binary audio data. Max text length: 20,000 characters.
Re-initialize the SpeechSynthesizer instance before each call invocation.
import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesisParam;
import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesizer;
import com.alibaba.dashscope.utils.Constants;

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;

public class Main {
  private static String model = "cosyvoice-v3-flash";
  private static String voice = "longanyang";

  public static void streamAudioDataToSpeaker() {
    SpeechSynthesisParam param =
        SpeechSynthesisParam.builder()
            // If you have not configured environment variables, replace the following line with your API key: .apiKey("sk-xxx")
            .apiKey(System.getenv("DASHSCOPE_API_KEY"))
            .model(model)
            .voice(voice)
            .build();

    // Synchronous mode: set the callback (second parameter) to null.
    SpeechSynthesizer synthesizer = new SpeechSynthesizer(param, null);
    ByteBuffer audio = null;
    try {
      // Blocks until the audio is returned.
      audio = synthesizer.call("What's the weather like today?");
    } catch (Exception e) {
      throw new RuntimeException(e);
    } finally {
      // Close the WebSocket connection after the task completes.
      synthesizer.getDuplexApi().close(1000, "bye");
    }
    if (audio != null) {
      File file = new File("output.mp3");
      // The first call includes WebSocket connection time in the first-packet latency.
      System.out.println(
          "[Metric] Request ID: "
              + synthesizer.getLastRequestId()
              + ", First-packet latency (ms): "
              + synthesizer.getFirstPackageDelay());
      try (FileOutputStream fos = new FileOutputStream(file)) {
        fos.write(audio.array());
      } catch (IOException e) {
        throw new RuntimeException(e);
      }
    }
  }

  public static void main(String[] args) {
    Constants.baseWebsocketApiUrl = "wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference";
    streamAudioDataToSpeaker();
    System.exit(0);
  }
}

Unidirectional streaming call

Submit text asynchronously and receive audio data incrementally through a ResultCallback.
Instantiate SpeechSynthesizer, bind the request parameters and the ResultCallback, and call the call method. Get audio in real time through onEvent. Max text length: 20,000 characters.
Re-initialize the SpeechSynthesizer instance before each call invocation.
import com.alibaba.dashscope.audio.tts.SpeechSynthesisResult;
import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesisParam;
import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesizer;
import com.alibaba.dashscope.common.ResultCallback;
import com.alibaba.dashscope.utils.Constants;

import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.concurrent.CountDownLatch;

class TimeUtils {
  private static final DateTimeFormatter formatter =
      DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSS");

  public static String getTimestamp() {
    return LocalDateTime.now().format(formatter);
  }
}

public class Main {
  private static String model = "cosyvoice-v3-flash";
  private static String voice = "longanyang";

  public static void streamAudioDataToSpeaker() {
    CountDownLatch latch = new CountDownLatch(1);

    // Implement the ResultCallback interface.
    ResultCallback<SpeechSynthesisResult> callback = new ResultCallback<SpeechSynthesisResult>() {
      @Override
      public void onEvent(SpeechSynthesisResult result) {
        // System.out.println("Message received: " + result);
        if (result.getAudioFrame() != null) {
          // Add your audio processing logic here.
          System.out.println(TimeUtils.getTimestamp() + " Audio received");
        }
      }

      @Override
      public void onComplete() {
        System.out.println(TimeUtils.getTimestamp() + " Complete received. Speech synthesis finished.");
        latch.countDown();
      }

      @Override
      public void onError(Exception e) {
        System.out.println("An exception occurred: " + e.toString());
        latch.countDown();
      }
    };

    SpeechSynthesisParam param =
        SpeechSynthesisParam.builder()
            // If you have not configured environment variables, replace the following line with your API key: .apiKey("sk-xxx")
            .apiKey(System.getenv("DASHSCOPE_API_KEY"))
            .model(model)
            .voice(voice)
            .build();
    // Pass the callback as the second parameter for asynchronous mode.
    SpeechSynthesizer synthesizer = new SpeechSynthesizer(param, callback);
    // Non-blocking: returns null immediately. Results arrive via onEvent.
    try {
      synthesizer.call("What's the weather like today?");
      latch.await();
    } catch (Exception e) {
      throw new RuntimeException(e);
    } finally {
      // Close the WebSocket connection after the task completes.
      synthesizer.getDuplexApi().close(1000, "bye");
    }
    // The first call includes WebSocket connection time in the first-packet latency.
    System.out.println(
        "[Metric] Request ID: "
            + synthesizer.getLastRequestId()
            + ", First-packet latency (ms): "
            + synthesizer.getFirstPackageDelay());
  }

  public static void main(String[] args) {
    Constants.baseWebsocketApiUrl = "wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference";
    streamAudioDataToSpeaker();
    System.exit(0);
  }
}

Bidirectional streaming call

Send text in multiple chunks and receive audio incrementally through a ResultCallback.
  • Call streamingCall multiple times to submit text fragments. The server auto-segments into sentences: complete sentences synthesize immediately, incomplete ones buffer until complete. streamingComplete() forces synthesis of all buffered text.
  • Send text fragments within 23-second intervals to avoid timeout errors. Call streamingComplete() when done.
    The 23-second server timeout cannot be changed on the client.
1. Instantiate SpeechSynthesizer
Instantiate SpeechSynthesizer, and bind the request parameters and ResultCallback.
2. Stream text
Call streamingCall multiple times to send text in chunks. The server returns audio in real time through onEvent. Each streamingCall text fragment: max 20,000 characters. Total across all fragments: max 200,000 characters.
3. Complete synthesis
Call streamingComplete to finish. This blocks until onComplete or onError fires. Always call this method; otherwise, trailing text may not be synthesized.
import com.alibaba.dashscope.audio.tts.SpeechSynthesisResult;
import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesisAudioFormat;
import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesisParam;
import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesizer;
import com.alibaba.dashscope.common.ResultCallback;
import com.alibaba.dashscope.utils.Constants;

import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

class TimeUtils {
  private static final DateTimeFormatter formatter =
      DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSS");

  public static String getTimestamp() {
    return LocalDateTime.now().format(formatter);
  }
}


public class Main {
  private static String[] textArray = {"The streaming text-to-speech SDK ",
      "can convert input text ", "into binary audio data. ", "Compared to non-streaming synthesis, ",
      "streaming offers better real-time performance. ", "You can hear the audio output almost instantly as you type, ",
      "which greatly improves the user experience ", "and reduces waiting time. ",
      "This is ideal for applications that use large ", "language models (LLMs) ",
      "to synthesize speech from a stream of text."};
  private static String model = "cosyvoice-v3-flash";
  private static String voice = "longanyang";

  public static void streamAudioDataToSpeaker() {
    // Configure the callback.
    ResultCallback<SpeechSynthesisResult> callback = new ResultCallback<SpeechSynthesisResult>() {
      @Override
      public void onEvent(SpeechSynthesisResult result) {
        // System.out.println("Message received: " + result);
        if (result.getAudioFrame() != null) {
          // Add your audio processing logic here.
          System.out.println(TimeUtils.getTimestamp() + " Audio received");
        }
      }

      @Override
      public void onComplete() {
        System.out.println(TimeUtils.getTimestamp() + " Complete received. Speech synthesis finished.");
      }

      @Override
      public void onError(Exception e) {
        System.out.println("An exception occurred: " + e.toString());
      }
    };

    // Request parameters
    SpeechSynthesisParam param =
        SpeechSynthesisParam.builder()
            // If you have not configured environment variables, replace the following line with your API key: .apiKey("sk-xxx")
            .apiKey(System.getenv("DASHSCOPE_API_KEY"))
            .model(model)
            .voice(voice)
            .format(SpeechSynthesisAudioFormat
                .PCM_22050HZ_MONO_16BIT) // Use PCM or MP3 for streaming synthesis.
            .build();
    SpeechSynthesizer synthesizer = new SpeechSynthesizer(param, callback);
    try {
      for (String text : textArray) {
        // Send text fragments. Audio arrives in real time via onEvent.
        synthesizer.streamingCall(text);
      }
      // Wait for streaming synthesis to complete.
      synthesizer.streamingComplete();
    } catch (Exception e) {
      throw new RuntimeException(e);
    } finally {
      // Close the WebSocket connection after the task completes.
      synthesizer.getDuplexApi().close(1000, "bye");
    }

    // The first call includes WebSocket connection time in the first-packet latency.
    System.out.println(
        "[Metric] Request ID: "
            + synthesizer.getLastRequestId()
            + ", First-packet latency (ms): "
            + synthesizer.getFirstPackageDelay());
  }

  public static void main(String[] args) {
    Constants.baseWebsocketApiUrl = "wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference";
    streamAudioDataToSpeaker();
    System.exit(0);
  }
}

Call using Flowable

Flowable is a class from RxJava, an open-source reactive programming framework released under the Apache 2.0 license. For details, see the RxJava Flowable API docs. Integrate the RxJava library and understand reactive programming basics before using these interfaces.
  • Unidirectional streaming call
  • Bidirectional streaming call
Use blockingForEach on a Flowable object to block and get each SpeechSynthesisResult. The complete result is also available through getAudioData after all streaming data returns.
import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesisParam;
import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesizer;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.utils.Constants;

import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

class TimeUtils {
  private static final DateTimeFormatter formatter =
      DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSS");

  public static String getTimestamp() {
    return LocalDateTime.now().format(formatter);
  }
}

public class Main {
  private static String model = "cosyvoice-v3-flash";
  private static String voice = "longanyang";

  public static void streamAudioDataToSpeaker() throws NoApiKeyException {
    SpeechSynthesisParam param =
        SpeechSynthesisParam.builder()
            // If you have not configured environment variables, replace the following line with your API key: .apiKey("sk-xxx")
            .apiKey(System.getenv("DASHSCOPE_API_KEY"))
            .model(model)
            .voice(voice)
            .build();
    SpeechSynthesizer synthesizer = new SpeechSynthesizer(param, null);
    synthesizer.callAsFlowable("What's the weather like today?").blockingForEach(result -> {
      // System.out.println("Message received: " + result);
      if (result.getAudioFrame() != null) {
        // Add your audio processing logic here.
        System.out.println(TimeUtils.getTimestamp() + " Audio received");
      }
    });
    // Close the WebSocket connection after the task completes.
    synthesizer.getDuplexApi().close(1000, "bye");
    // The first call includes WebSocket connection time in the first-packet latency.
    System.out.println(
        "[Metric] Request ID: "
            + synthesizer.getLastRequestId()
            + ", First-packet latency (ms): "
            + synthesizer.getFirstPackageDelay());
  }

  public static void main(String[] args) throws NoApiKeyException {
    Constants.baseWebsocketApiUrl = "wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference";
    streamAudioDataToSpeaker();
    System.exit(0);
  }
}

High-concurrency calls

The DashScope Java SDK uses OkHttp3 connection pooling to reduce connection overhead. See High-concurrency management.

Request parameters

Use the chained methods of SpeechSynthesisParam to configure parameters like model and voice. Pass the configured object to the SpeechSynthesizer constructor.
SpeechSynthesisParam param = SpeechSynthesisParam.builder()
  .model("cosyvoice-v3-flash")
  .voice("longanyang")
  .format(SpeechSynthesisAudioFormat.WAV_8000HZ_MONO_16BIT) // Audio format and sample rate
  .volume(50) // Volume: [0, 100]
  .speechRate(1.0f) // Speech rate: [0.5, 2]
  .pitchRate(1.0f) // Pitch: [0.5, 2]
  .build();
Parameters:
model (String, required): The text-to-speech model. See Voice list for all options.
voice (String, required): The voice for synthesis. See Voice list for available system voices.
format (enum, optional): Audio format and sample rate. Default: MP3 at 22.05 kHz. The default sample rate is optimal for the voice. Downsampling and upsampling are supported.

Supported formats:
  • All models:
    • SpeechSynthesisAudioFormat.WAV_8000HZ_MONO_16BIT: WAV, 8 kHz
    • SpeechSynthesisAudioFormat.WAV_16000HZ_MONO_16BIT: WAV, 16 kHz
    • SpeechSynthesisAudioFormat.WAV_22050HZ_MONO_16BIT: WAV, 22.05 kHz
    • SpeechSynthesisAudioFormat.WAV_24000HZ_MONO_16BIT: WAV, 24 kHz
    • SpeechSynthesisAudioFormat.WAV_44100HZ_MONO_16BIT: WAV, 44.1 kHz
    • SpeechSynthesisAudioFormat.WAV_48000HZ_MONO_16BIT: WAV, 48 kHz
    • SpeechSynthesisAudioFormat.MP3_8000HZ_MONO_128KBPS: MP3, 8 kHz
    • SpeechSynthesisAudioFormat.MP3_16000HZ_MONO_128KBPS: MP3, 16 kHz
    • SpeechSynthesisAudioFormat.MP3_22050HZ_MONO_256KBPS: MP3, 22.05 kHz
    • SpeechSynthesisAudioFormat.MP3_24000HZ_MONO_256KBPS: MP3, 24 kHz
    • SpeechSynthesisAudioFormat.MP3_44100HZ_MONO_256KBPS: MP3, 44.1 kHz
    • SpeechSynthesisAudioFormat.MP3_48000HZ_MONO_256KBPS: MP3, 48 kHz
    • SpeechSynthesisAudioFormat.PCM_8000HZ_MONO_16BIT: PCM, 8 kHz
    • SpeechSynthesisAudioFormat.PCM_16000HZ_MONO_16BIT: PCM, 16 kHz
    • SpeechSynthesisAudioFormat.PCM_22050HZ_MONO_16BIT: PCM, 22.05 kHz
    • SpeechSynthesisAudioFormat.PCM_24000HZ_MONO_16BIT: PCM, 24 kHz
    • SpeechSynthesisAudioFormat.PCM_44100HZ_MONO_16BIT: PCM, 44.1 kHz
    • SpeechSynthesisAudioFormat.PCM_48000HZ_MONO_16BIT: PCM, 48 kHz
  • Opus (DashScope 2.21.0+). Adjust bitrate with bit_rate:
    • SpeechSynthesisAudioFormat.OGG_OPUS_8KHZ_MONO_32KBPS: Opus, 8 kHz, 32 kbps
    • SpeechSynthesisAudioFormat.OGG_OPUS_16KHZ_MONO_16KBPS: Opus, 16 kHz, 16 kbps
    • SpeechSynthesisAudioFormat.OGG_OPUS_16KHZ_MONO_32KBPS: Opus, 16 kHz, 32 kbps
    • SpeechSynthesisAudioFormat.OGG_OPUS_16KHZ_MONO_64KBPS: Opus, 16 kHz, 64 kbps
    • SpeechSynthesisAudioFormat.OGG_OPUS_24KHZ_MONO_16KBPS: Opus, 24 kHz, 16 kbps
    • SpeechSynthesisAudioFormat.OGG_OPUS_24KHZ_MONO_32KBPS: Opus, 24 kHz, 32 kbps
    • SpeechSynthesisAudioFormat.OGG_OPUS_24KHZ_MONO_64KBPS: Opus, 24 kHz, 64 kbps
    • SpeechSynthesisAudioFormat.OGG_OPUS_48KHZ_MONO_16KBPS: Opus, 48 kHz, 16 kbps
    • SpeechSynthesisAudioFormat.OGG_OPUS_48KHZ_MONO_32KBPS: Opus, 48 kHz, 32 kbps
    • SpeechSynthesisAudioFormat.OGG_OPUS_48KHZ_MONO_64KBPS: Opus, 48 kHz, 64 kbps
volume (int, optional): Volume. Default: 50. Range: [0, 100]. Scales linearly; 0 is silent, 100 is maximum.
speechRate (float, optional): Speech rate. Default: 1.0. Range: [0.5, 2.0]. Values below 1.0 slow speech down; values above 1.0 speed it up.
pitchRate (float, optional): Pitch multiplier. Default: 1.0. Range: [0.5, 2.0]. The relationship to perceived pitch is non-linear: values above 1.0 raise pitch and values below 1.0 lower it. Test to find suitable values.
bit_rate (int, optional): Audio bitrate in kbps for Opus formats. Default: 32. Range: [6, 510]. Set using the parameter or parameters method of SpeechSynthesisParam. See the examples below.
enableWordTimestamp (boolean, optional): Enable word-level timestamps. Default: false. Available only for system voices marked as supported in the voice list. Timestamp results are only available through the callback interface.
seed (int, optional): Random seed for generation. Default: 0. Range: [0, 65535]. Different seeds produce different results; the same seed with identical parameters reproduces the same output.
languageHints (List, optional): Target language for synthesis. Use when pronunciation of numbers, abbreviations, or symbols is inaccurate, or when less common languages need improvement. Valid values: zh (Chinese), en (English), fr (French), de (German), ja (Japanese), ko (Korean), ru (Russian), pt (Portuguese), th (Thai), id (Indonesian), vi (Vietnamese). Note: this is an array, but only the first element is processed, so pass one value only.
instruction (String, optional): Control dialect, emotion, or speaking style via instructions. Available for system voices marked Instruct-supported in the voice list. Max length: 100 characters. See the instruction examples below.
enable_aigc_tag (boolean, optional): Add an invisible AIGC identifier to the audio. When true, an invisible identifier is embedded in supported formats (WAV, MP3, Opus). Default: false. Supported by cosyvoice-v3-flash and cosyvoice-v3-plus.
aigc_propagator (String, optional): Sets the ContentPropagator field in the AIGC identifier to identify the content propagator. Takes effect only when enable_aigc_tag is true. Default: UID. Supported by cosyvoice-v3-flash and cosyvoice-v3-plus.
aigc_propagate_id (String, optional): Sets the PropagateID field in the AIGC identifier to uniquely identify a propagation behavior. Takes effect only when enable_aigc_tag is true. Default: the current request ID. Supported by cosyvoice-v3-flash and cosyvoice-v3-plus.
hotFix (ParamHotFix, optional): Text hotpatching configuration. Customize pronunciation of specific words or replace text before synthesis. Available only for cosyvoice-v3-flash. See the hotFix example below.
enable_markdown_filter (boolean, optional): Remove Markdown symbols from input text before synthesis to prevent them from being read aloud. Default: false. Available only for cosyvoice-v3-flash.

Set bit_rate

SpeechSynthesisParam param = SpeechSynthesisParam.builder()
  .model("cosyvoice-v3-flash")
  .voice("longanyang")
  .parameter("bit_rate", 32)
  .build();
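The parameter table above also mentions a parameters method for setting several extension parameters at once. As a hedged sketch, the snippet below only builds a plain map of such parameters; whether the builder's parameters method accepts exactly this Map<String, Object> shape is an assumption, so verify against the SDK's javadoc.

```java
import java.util.HashMap;
import java.util.Map;

public class ExtraParams {
  // Collect extension parameters (keys mirror the parameter table above)
  // into one map, intended for a single parameters(...) call on the builder.
  public static Map<String, Object> build() {
    Map<String, Object> extra = new HashMap<>();
    extra.put("bit_rate", 32);          // Opus bitrate in kbps
    extra.put("enable_aigc_tag", true); // embed the invisible AIGC identifier
    return extra;
  }

  public static void main(String[] args) {
    System.out.println(build());
  }
}
```

Assuming that signature, the map would then be passed as SpeechSynthesisParam.builder()....parameters(extra).build() instead of chaining individual parameter(...) calls.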

Set enable_aigc_tag

SpeechSynthesisParam param = SpeechSynthesisParam.builder()
  .model("cosyvoice-v3-flash")
  .voice("longanyang")
  .parameter("enable_aigc_tag", true)
  .build();

Set aigc_propagator

SpeechSynthesisParam param = SpeechSynthesisParam.builder()
  .model("cosyvoice-v3-flash")
  .voice("longanyang")
  .parameter("enable_aigc_tag", true)
  .parameter("aigc_propagator", "xxxx")
  .build();

Set aigc_propagate_id

SpeechSynthesisParam param = SpeechSynthesisParam.builder()
  .model("cosyvoice-v3-flash")
  .voice("longanyang")
  .parameter("enable_aigc_tag", true)
  .parameter("aigc_propagate_id", "xxxx")
  .build();

Set enable_markdown_filter

SpeechSynthesisParam param = SpeechSynthesisParam.builder()
  .model("cosyvoice-v3-flash")
  .voice("your_voice") // Replace with a cloned voice for cosyvoice-v3-flash
  .parameter("enable_markdown_filter", true)
  .build();

instruction examples

cosyvoice-v3-flash:
  • Cloned voices: Use any natural language instruction to control synthesis. Instruction examples:
Please speak in Cantonese. (Supported dialects: Cantonese, Northeastern, Gansu, Guizhou, Henan, Hubei, Jiangxi, Minnan, Ningxia, Shanxi, Shaanxi, Shandong, Shanghainese, Sichuan, Tianjin, Yunnan.)
Please say a sentence as loudly as possible.
Please say a sentence as slowly as possible.
Please say a sentence as quickly as possible.
Please say a sentence very softly.
Can you speak a little slower?
Can you speak very quickly?
Can you speak very slowly?
Can you speak a little faster?
Please say a sentence very angrily.
Please say a sentence very happily.
Please say a sentence very fearfully.
Please say a sentence very sadly.
Please say a sentence in a very surprised tone.
Please try to sound as firm as possible.
Please try to sound as angry as possible.
Please try an approachable tone.
Please speak in a cold tone.
Please speak in a majestic tone.
I want to experience a natural tone.
I want to see how you express a threat.
I want to see how you express wisdom.
I want to see how you express seduction.
I want to hear you speak in a lively way.
I want to hear you speak with passion.
I want to hear you speak in a steady manner.
I want to hear you speak with confidence.
Can you talk to me with excitement?
Can you show an arrogant emotion?
Can you show an elegant emotion?
Can you answer the question happily?
Can you give a gentle emotional demonstration?
Can you talk to me in a calm tone?
Can you answer me in a deep way?
Can you talk to me with a gruff attitude?
Tell me the answer in a sinister voice.
Tell me the answer in a resilient voice.
Narrate in a natural and friendly chat style.
Speak in the tone of a radio drama podcaster.
  • System voices: Instructions must use a fixed format. See the voice list for details.

hotFix example

List<ParamHotFix.PronunciationItem> pronunciationItems = new ArrayList<>();
pronunciationItems.add(new ParamHotFix.PronunciationItem("weather", "tian1 qi4"));

List<ParamHotFix.ReplaceItem> replaceItems = new ArrayList<>();
replaceItems.add(new ParamHotFix.ReplaceItem("today", "gold day"));

ParamHotFix paramHotFix = new ParamHotFix();
paramHotFix.setPronunciation(pronunciationItems);
paramHotFix.setReplace(replaceItems);

SpeechSynthesisParam param = SpeechSynthesisParam.builder()
      .model("cosyvoice-v3-flash")
      .voice("your_voice") // Replace with a cloned voice for cosyvoice-v3-flash
      .hotFix(paramHotFix)
      .build();

Key interfaces

SpeechSynthesizer class

Import with import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesizer;.
  • public SpeechSynthesizer(SpeechSynthesisParam param, ResultCallback<SpeechSynthesisResult> callback): Constructor; returns a SpeechSynthesizer instance. param: request parameters. callback: set to a ResultCallback for unidirectional or bidirectional streaming calls, or to null for non-streaming and Flowable calls.
  • public ByteBuffer call(String text): Converts text (plain or SSML, UTF-8) to speech. Without a callback, blocks until complete and returns the audio as a ByteBuffer. With a callback, returns null immediately and delivers results via onEvent.
  • public void streamingCall(String text): Sends text (UTF-8) for streaming synthesis. SSML is not supported. Call multiple times to send text in chunks; results arrive via onEvent. See Bidirectional streaming call.
  • public void streamingComplete() throws RuntimeException: Ends streaming synthesis. Blocks until synthesis completes, the session is interrupted, or the 10-minute timeout occurs. See Bidirectional streaming call.
  • public Flowable<SpeechSynthesisResult> callAsFlowable(String text): Converts non-streaming text input (UTF-8) to streaming speech output as a Flowable<SpeechSynthesisResult>. SSML is not supported. See Call using Flowable.
  • boolean getDuplexApi().close(int code, String reason): Closes the WebSocket connection; returns true. code: WebSocket close code. reason: close reason (see The WebSocket Protocol). Close the connection after each task to prevent connection leaks. For connection reuse, see High-concurrency management.
  • public Flowable<SpeechSynthesisResult> streamingCallAsFlowable(Flowable<String> textStream): Converts streaming text input to streaming speech output as a Flowable<SpeechSynthesisResult>. SSML is not supported. See Call using Flowable.
  • public String getLastRequestId(): Returns the request ID of the previous task. Call after starting a new task via call, streamingCall, callAsFlowable, or streamingCallAsFlowable.
  • public long getFirstPackageDelay(): Returns the first-packet latency in milliseconds, that is, the time between sending text and receiving the first audio packet. Call after the task completes.
Important usage requirements:
  • Re-initialize the SpeechSynthesizer instance before each call invocation.
  • Always call streamingComplete during bidirectional streaming to avoid missing synthesized speech.
Factors affecting first-packet latency:
  • WebSocket connection establishment (first call)
  • Voice loading time (varies by voice)
  • Service load (peak-hour queuing)
  • Network latency
Typical latency:
  • Reused connection with loaded voice: ~500 ms
  • First connection or voice switch: 1,500-2,000 ms
If latency consistently exceeds 2,000 ms:
  1. Use connection pooling to pre-establish connections (high-concurrency scenarios).
  2. Check network quality.
  3. Avoid peak hours.

ResultCallback interface

Get synthesis results through ResultCallback during unidirectional or bidirectional streaming calls. Import with import com.alibaba.dashscope.common.ResultCallback;.
ResultCallback<SpeechSynthesisResult> callback = new ResultCallback<SpeechSynthesisResult>() {
  @Override
  public void onEvent(SpeechSynthesisResult result) {
    System.out.println("Request ID: " + result.getRequestId());
    // Process audio chunks in real time (e.g., play or write to a buffer).
  }

  @Override
  public void onComplete() {
    System.out.println("Task complete");
    // Handle synthesis completion logic (e.g., release the player).
  }

  @Override
  public void onError(Exception e) {
    System.out.println("Task failed: " + e.getMessage());
    // Handle exceptions (network errors or server-side error codes).
  }
};
  • public void onEvent(SpeechSynthesisResult result): Called when the server pushes audio data. Use getAudioFrame on the SpeechSynthesisResult to get binary audio, and getUsage to get the billable character count so far.
  • public void onComplete(): Called after all synthesis data has been returned.
  • public void onError(Exception e): Called when an exception occurs. Implement exception logging and resource cleanup in this method.

Response

The server returns binary audio data. Key interfaces of SpeechSynthesisResult:
  • public ByteBuffer getAudioFrame(): Returns binary audio data for the current segment. May be empty if no new data has arrived. Combine segments into a complete file, or play them with a streaming player.
  • public String getRequestId(): Returns the task request ID. Returns null when getAudioFrame returns data.
  • public SpeechSynthesisUsage getUsage(): Returns the billable character count so far via getCharacters(), or null. Use the last received value as the final count.
  • public Sentence getTimestamp(): Returns timestamp data, or null, when enableWordTimestamp is true. Sentence methods: getIndex (sentence number, starting from 0) and getWords (returns List<Word>). Word methods: getText, getBeginIndex, getEndIndex, getBeginTime, getEndTime.
For compressed formats (MP3, Opus) in streaming synthesis, use a streaming player; playing frame by frame causes decoding failures. Streaming players include FFmpeg, PyAudio (Python), AudioFormat (Java), and MediaSource (JavaScript). When combining audio into a complete file, write in append mode. For WAV and MP3 streaming audio, only the first frame contains header information.
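The append-mode advice above can be sketched with a plain FileOutputStream from the JDK. The byte arrays here are placeholders standing in for real audio frames received in onEvent:

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.file.Files;

public class AppendAudio {
  // Append each received audio chunk to a single output file. Opening the
  // stream with append=true preserves earlier chunks; because only the first
  // WAV/MP3 frame carries the header, chunks must be written in arrival order.
  public static void appendChunk(File out, byte[] chunk) throws IOException {
    try (FileOutputStream fos = new FileOutputStream(out, true)) {
      fos.write(chunk);
    }
  }

  public static void main(String[] args) throws IOException {
    File out = File.createTempFile("synthesis", ".mp3");
    appendChunk(out, new byte[] {1, 2, 3}); // placeholder for frame 1
    appendChunk(out, new byte[] {4, 5});    // placeholder for frame 2
    System.out.println(Files.size(out.toPath())); // 5: both chunks kept
    out.deleteOnExit();
  }
}
```

In a real callback you would call appendChunk with result.getAudioFrame() converted to a byte array, rather than reopening the stream per chunk if throughput matters.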

More examples

For more examples, see GitHub.