CosyVoice Java reference
For more examples, see GitHub.
Prerequisites
- Sign in to Qwen Cloud and create an API key. Export it as an environment variable instead of hard-coding it.
For temporary access to third-party apps or strict control over sensitive operations, use a temporary authentication token. Temporary tokens expire in 60 seconds, reducing leakage risk. Replace the API key in your code with the token.
Models and pricing
See Speech synthesis.
Text and format limits
Text length limits
- Non-streaming, unidirectional streaming, or Flowable unidirectional streaming: Max 20,000 characters per request.
- Bidirectional streaming or Flowable bidirectional streaming: Max 20,000 characters per request, 200,000 characters total.
Character counting rules
- Chinese characters (simplified, traditional, Japanese Kanji, Korean Hanja) count as 2. All other characters (punctuation, letters, numbers, Kana, Hangul) count as 1.
- SSML tags are excluded from the text length.
- Examples:
"你好"→ 2 (Chinese character) + 2 (Chinese character) = 4 characters"中A文123"→ 2 (Chinese character) + 1 (A) + 2 (Chinese character) + 1 (1) + 1 (2) + 1 (3) = 8 characters"中文。"→ 2 (Chinese character) + 2 (Chinese character) + 1 (.) = 5 characters"中 文。"→ 2 (Chinese character) + 1 (space) + 2 (Chinese character) + 1 (.) = 6 characters"<speak>你好</speak>"→ 2 (Chinese character) + 2 (Chinese character) = 4 characters
Encoding format
Use UTF-8 encoding.
Math expressions
Math expression parsing is available for cosyvoice-v3-flash and cosyvoice-v3-plus. It supports common primary and secondary school math, including basic operations, algebra, and geometry.
This feature supports Chinese input only. For details, see Convert LaTeX formulas to speech.
SSML support
SSML is available for custom voices (voice design or cloning) on cosyvoice-v3-flash and cosyvoice-v3-plus, and for system voices marked SSML-supported in the voice list.
Requirements:
- DashScope SDK 2.20.3 or later.
- Only non-streaming and unidirectional streaming calls (the call method of SpeechSynthesizer) are supported. Bidirectional streaming calls (streamingCall) and Flowable calls are not supported.
- Pass text containing SSML to the call method, the same as for normal text.
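As a minimal sketch of an SSML call: the builder-style SpeechSynthesisParam chain below follows the parameter table in this document, but the exact builder method names and the voice ID are assumptions, and the SDK dependency must be on the classpath:

```java
import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesisParam;
import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesizer;
import java.nio.ByteBuffer;

public class SsmlDemo {
    public static void main(String[] args) {
        SpeechSynthesisParam param = SpeechSynthesisParam.builder()
                .apiKey(System.getenv("DASHSCOPE_API_KEY")) // read from env; do not hard-code
                .model("cosyvoice-v3-flash")                // an SSML-capable model per this doc
                .voice("your-voice-id")                     // placeholder: an SSML-supported voice
                .build();
        // null callback: SSML must go through the non-streaming call method.
        SpeechSynthesizer synthesizer = new SpeechSynthesizer(param, null);
        ByteBuffer audio = synthesizer.call("<speak>你好，<break time=\"500ms\"/>世界</speak>");
        System.out.println("audio bytes: " + (audio == null ? 0 : audio.remaining()));
    }
}
```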
Getting started
The SpeechSynthesizer class supports the following call methods:
- Non-streaming: Sends full text and returns complete audio. Blocks until done. Best for short text.
- Unidirectional streaming: Sends full text and returns audio via callback. Non-blocking. Best for short text with low latency needs.
- Bidirectional streaming: Sends text fragments incrementally and returns audio in real time via callback. Non-blocking. Best for long text with low latency needs.
Non-streaming call
Sends text synchronously and returns the complete audio result.
Instantiate SpeechSynthesizer, bind the request parameters, and call the call method to get binary audio data. Max text length: 20,000 characters.
Re-initialize the SpeechSynthesizer instance before each call invocation.
Click to view the full example
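A hedged sketch of the non-streaming flow described above (builder method names and the voice ID are assumptions; the default output format is MP3 per the parameter table):

```java
import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesisParam;
import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesizer;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;

public class SyncTts {
    public static void main(String[] args) throws IOException {
        SpeechSynthesisParam param = SpeechSynthesisParam.builder()
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("cosyvoice-v3-flash")
                .voice("your-voice-id") // placeholder
                .build();
        // No callback: call blocks until the full audio result is returned.
        SpeechSynthesizer synthesizer = new SpeechSynthesizer(param, null);
        ByteBuffer audio = synthesizer.call("Hello, this is a synthesis test.");
        try (FileOutputStream out = new FileOutputStream("output.mp3")) {
            out.getChannel().write(audio);
        }
        synthesizer.getDuplexApi().close(1000, "done"); // release the WebSocket connection
    }
}
```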
Unidirectional streaming call
Submit text asynchronously and receive audio data incrementally through a ResultCallback.
Instantiate SpeechSynthesizer, bind the request parameters and the ResultCallback, and call the call method. Get audio in real time through onEvent. Max text length: 20,000 characters.
Re-initialize the SpeechSynthesizer instance before each call invocation.
Click to view the full example
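A hedged sketch of the unidirectional streaming flow (the package of SpeechSynthesisResult and the builder method names are assumptions; the CountDownLatch simply keeps main alive until onComplete or onError fires):

```java
import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesisParam;
import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesisResult;
import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesizer;
import com.alibaba.dashscope.common.ResultCallback;
import java.util.concurrent.CountDownLatch;

public class StreamingTts {
    public static void main(String[] args) throws InterruptedException {
        CountDownLatch done = new CountDownLatch(1);
        ResultCallback<SpeechSynthesisResult> callback = new ResultCallback<SpeechSynthesisResult>() {
            @Override public void onEvent(SpeechSynthesisResult result) {
                if (result.getAudioFrame() != null) {
                    // Audio arrives here incrementally: append the frame to a
                    // file (append mode) or feed it to a streaming player.
                }
            }
            @Override public void onComplete() { done.countDown(); }
            @Override public void onError(Exception e) { e.printStackTrace(); done.countDown(); }
        };
        SpeechSynthesisParam param = SpeechSynthesisParam.builder()
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("cosyvoice-v3-flash")
                .voice("your-voice-id") // placeholder
                .build();
        SpeechSynthesizer synthesizer = new SpeechSynthesizer(param, callback);
        synthesizer.call("Text to synthesize."); // returns null immediately; audio arrives via onEvent
        done.await();
    }
}
```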
Bidirectional streaming call
Send text in multiple chunks and receive audio incrementally through a ResultCallback.
- Call streamingCall multiple times to submit text fragments. The server auto-segments them into sentences: complete sentences are synthesized immediately, and incomplete ones are buffered until complete. streamingComplete() forces synthesis of all buffered text.
- Send text fragments within 23-second intervals to avoid timeout errors, and call streamingComplete() when done. The 23-second server timeout cannot be changed on the client.

1. Instantiate SpeechSynthesizer: instantiate SpeechSynthesizer, and bind the request parameters and the ResultCallback.
2. Stream text: call streamingCall multiple times to send text in chunks. The server returns audio in real time through onEvent. Each streamingCall text fragment: max 20,000 characters. Total across all fragments: max 200,000 characters.
3. Complete synthesis: call streamingComplete to finish. This blocks until onComplete or onError fires. Always call this method; otherwise, trailing text may not be synthesized.

Click to view the full example
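The three steps above can be sketched as follows (builder method names and the voice ID are assumptions; the callback is the same shape as in the unidirectional example):

```java
import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesisParam;
import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesisResult;
import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesizer;
import com.alibaba.dashscope.common.ResultCallback;

public class DuplexTts {
    public static void main(String[] args) {
        ResultCallback<SpeechSynthesisResult> callback = new ResultCallback<SpeechSynthesisResult>() {
            @Override public void onEvent(SpeechSynthesisResult result) {
                if (result.getAudioFrame() != null) {
                    // Audio arrives here in real time; append to a file or
                    // feed a streaming player.
                }
            }
            @Override public void onComplete() { System.out.println("synthesis finished"); }
            @Override public void onError(Exception e) { e.printStackTrace(); }
        };
        SpeechSynthesisParam param = SpeechSynthesisParam.builder()
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("cosyvoice-v3-flash")
                .voice("your-voice-id") // placeholder
                .build();
        // Step 1: bind parameters and the callback.
        SpeechSynthesizer synthesizer = new SpeechSynthesizer(param, callback);
        // Step 2: send text in chunks, each within the 23-second interval limit.
        for (String chunk : new String[]{"First fragment, ", "second fragment, ", "the end."}) {
            synthesizer.streamingCall(chunk);
        }
        // Step 3: always finish; blocks until onComplete or onError fires.
        synthesizer.streamingComplete();
        synthesizer.getDuplexApi().close(1000, "bye"); // release the WebSocket connection
    }
}
```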
Call using Flowable
Flowable is an open-source reactive programming framework under the Apache 2.0 license. For details, see the Flowable API docs. Integrate the RxJava library and understand reactive programming basics before using Flowable.
- Unidirectional streaming call
- Bidirectional streaming call
Use blockingForEach on a Flowable object to block and get each SpeechSynthesisResult. The complete result is also available through getAudioData after all streaming data returns.
Click to view the full example
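A hedged sketch of both Flowable variants (the io.reactivex.Flowable import assumes the RxJava 2 dependency mentioned above; builder method names and the voice ID are assumptions; Flowable calls take a null callback per the constructor notes below):

```java
import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesisParam;
import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesizer;
import io.reactivex.Flowable;

public class FlowableTts {
    public static void main(String[] args) {
        SpeechSynthesisParam param = SpeechSynthesisParam.builder()
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("cosyvoice-v3-flash")
                .voice("your-voice-id") // placeholder
                .build();
        SpeechSynthesizer synthesizer = new SpeechSynthesizer(param, null);

        // Unidirectional: full text in, streaming audio out.
        synthesizer.callAsFlowable("Text to synthesize.")
                .blockingForEach(result -> {
                    if (result.getAudioFrame() != null) {
                        // append the frame to a file or feed a streaming player
                    }
                });

        // Bidirectional: streaming text in, streaming audio out.
        Flowable<String> textStream = Flowable.just("First fragment, ", "second fragment.");
        synthesizer.streamingCallAsFlowable(textStream)
                .blockingForEach(result -> { /* handle result.getAudioFrame() */ });
    }
}
```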
High-concurrency calls
The DashScope Java SDK uses OkHttp3 connection pooling to reduce connection overhead. See High-concurrency management.
Request parameters
Use the chained methods of SpeechSynthesisParam to configure parameters like model and voice. Pass the configured object to the SpeechSynthesizer constructor.
Click to view an example
| Parameter | Type | Required | Description |
|---|---|---|---|
| model | String | Yes | The text-to-speech model. See Voice list for all options. |
| voice | String | Yes | The voice for synthesis. See Voice list for available system voices. |
| format | enum | No | Audio format and sample rate. Default: MP3 at 22.05 kHz. The default sample rate is optimal for the voice. Downsampling and upsampling are supported. |
| volume | int | No | Volume. Default: 50. Range: [0, 100]. Scales linearly. 0 is silent, 100 is maximum. |
| speechRate | float | No | Speech rate. Default: 1.0. Range: [0.5, 2.0]. Values below 1.0 slow down speech; above 1.0 speed it up. |
| pitchRate | float | No | Pitch multiplier. Default: 1.0. Range: [0.5, 2.0]. The relationship to perceived pitch is non-linear. Values above 1.0 raise pitch; below 1.0 lower it. Test to find suitable values. |
| bit_rate | int | No | Audio bitrate in kbps for Opus format. Default: 32. Range: [6, 510]. Set using the parameter or parameters method of SpeechSynthesisParam. See examples below. |
| enableWordTimestamp | boolean | No | Enable word-level timestamps. Default: false. Available only for system voices marked as supported in the voice list. Timestamp results are only available through the callback interface. |
| seed | int | No | Random seed for generation. Different seeds produce different results. Same seed with identical parameters reproduces the same output. Default: 0. Range: [0, 65535]. |
| languageHints | List | No | Target language for synthesis. Use when pronunciation of numbers, abbreviations, or symbols is inaccurate, or when less common languages need improvement. Valid values: zh (Chinese), en (English), fr (French), de (German), ja (Japanese), ko (Korean), ru (Russian), pt (Portuguese), th (Thai), id (Indonesian), vi (Vietnamese). Note: This is an array, but only the first element is processed. Pass one value only. |
| instruction | String | No | Control dialect, emotion, or speaking style via instructions. Available for system voices marked Instruct-supported in the voice list. Max length: 100 characters. See instruction examples below. |
| enable_aigc_tag | boolean | No | Add an invisible AIGC identifier to the audio. When true, an invisible identifier is embedded in supported formats (WAV, MP3, Opus). Default: false. Supported by cosyvoice-v3-flash and cosyvoice-v3-plus. |
| aigc_propagator | String | No | Sets the ContentPropagator field in the AIGC identifier to identify the content propagator. Takes effect only when enable_aigc_tag is true. Default: UID. Supported by cosyvoice-v3-flash and cosyvoice-v3-plus. |
| aigc_propagate_id | String | No | Sets the PropagateID field in the AIGC identifier to uniquely identify a propagation behavior. Takes effect only when enable_aigc_tag is true. Default: the current request ID. Supported by cosyvoice-v3-flash and cosyvoice-v3-plus. |
| hotFix | ParamHotFix | No | Text hotpatching configuration. Customize pronunciation of specific words or replace text before synthesis. Available only for cosyvoice-v3-flash. See the hotFix example below. |
| enable_markdown_filter | boolean | No | Remove Markdown symbols from input text before synthesis, preventing them from being read aloud. Default: false. Available only for cosyvoice-v3-flash. |
Set bit_rate
- Using parameter()
- Using parameters()
Set enable_aigc_tag
- Using parameter()
- Using parameters()
Set aigc_propagator
- Using parameter()
- Using parameters()
Set aigc_propagate_id
- Using parameter()
- Using parameters()
Set enable_markdown_filter
- Using parameter()
- Using parameters()
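Both styles can be sketched as follows. The parameter and parameters method names come from the parameter table above, but the exact builder chain, accepted value types, and all of the example values are assumptions:

```java
import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesisParam;
import java.util.HashMap;
import java.util.Map;

public class ExtraParams {
    public static void main(String[] args) {
        // One key at a time with parameter():
        SpeechSynthesisParam p1 = SpeechSynthesisParam.builder()
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("cosyvoice-v3-flash")
                .voice("your-voice-id")              // placeholder
                .parameter("bit_rate", 64)           // Opus bitrate in kbps
                .parameter("enable_markdown_filter", true)
                .build();

        // Several keys at once with parameters():
        Map<String, Object> extra = new HashMap<>();
        extra.put("enable_aigc_tag", true);
        extra.put("aigc_propagator", "my-app");            // hypothetical value
        extra.put("aigc_propagate_id", "propagation-001"); // hypothetical value
        SpeechSynthesisParam p2 = SpeechSynthesisParam.builder()
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("cosyvoice-v3-flash")
                .voice("your-voice-id")
                .parameters(extra)
                .build();
    }
}
```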
instruction examples
cosyvoice-v3-flash:
- Cloned voices: Use any natural language instruction to control synthesis. Instruction examples:
- System voices: Instructions must use a fixed format. See the voice list for details.
hotFix example
Key interfaces
SpeechSynthesizer class
Import with import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesizer;.
| Interface/Method | Parameter | Return value | Description |
|---|---|---|---|
| public SpeechSynthesizer(SpeechSynthesisParam param, ResultCallback<SpeechSynthesisResult> callback) | param: Request parameters. callback: ResultCallback for streaming calls, or null for non-streaming/Flowable calls. | SpeechSynthesizer instance | Constructor. Set callback to ResultCallback for unidirectional or bidirectional streaming. Set to null for non-streaming or Flowable calls. |
| public ByteBuffer call(String text) | text: Text to synthesize (UTF-8). | ByteBuffer or null | Converts text (plain or SSML) to speech. Without callback: blocks until complete. With callback: returns null immediately; results arrive via onEvent. |
| public void streamingCall(String text) | text: Text to synthesize (UTF-8). | None | Sends text for streaming synthesis. SSML is not supported. Call multiple times to send text in chunks. Results arrive via onEvent. See Bidirectional streaming call. |
| public void streamingComplete() throws RuntimeException | None | None | Ends streaming synthesis. Blocks until synthesis completes, the session interrupts, or the 10-minute timeout occurs. See Bidirectional streaming call. |
| public Flowable<SpeechSynthesisResult> callAsFlowable(String text) | text: Text to synthesize (UTF-8). | Flowable<SpeechSynthesisResult> | Converts non-streaming text to streaming speech output. SSML is not supported. See Call using Flowable. |
| boolean getDuplexApi().close(int code, String reason) | code: WebSocket close code. reason: Close reason. See The WebSocket Protocol. | true | Close the WebSocket connection after each task to prevent connection leaks. For connection reuse, see High-concurrency management. |
| public Flowable<SpeechSynthesisResult> streamingCallAsFlowable(Flowable<String> textStream) | textStream: Flowable wrapping text to synthesize. | Flowable<SpeechSynthesisResult> | Converts streaming text input to streaming speech output. SSML is not supported. See Call using Flowable. |
| public String getLastRequestId() | None | Request ID of the previous task. | Gets the request ID after starting a new task via call, streamingCall, callAsFlowable, or streamingCallAsFlowable. |
| public long getFirstPackageDelay() | None | First-packet latency in milliseconds. | Gets the time between sending text and receiving the first audio packet. Call after the task completes. |
Important usage requirements:
- Re-initialize the SpeechSynthesizer instance before each call invocation.
- Always call streamingComplete during bidirectional streaming to avoid missing synthesized speech.
Factors affecting first-packet latency:
- WebSocket connection establishment (first call)
- Voice loading time (varies by voice)
- Service load (peak-hour queuing)
- Network latency

Typical first-packet latency:
- Reused connection with loaded voice: ~500 ms
- First connection or voice switch: 1,500-2,000 ms

To reduce latency:
- Use connection pooling to pre-establish connections (high-concurrency scenarios).
- Check network quality.
- Avoid peak hours.
ResultCallback interface
Get synthesis results through ResultCallback during unidirectional or bidirectional streaming calls. Import with import com.alibaba.dashscope.common.ResultCallback;.
Click to view an example
| Interface/Method | Parameter | Return value | Description |
|---|---|---|---|
| public void onEvent(SpeechSynthesisResult result) | result: A SpeechSynthesisResult instance. | None | Called when the server pushes audio data. Use getAudioFrame on SpeechSynthesisResult to get binary audio. Use getUsage to get the billable character count so far. |
| public void onComplete() | None | None | Called after all synthesis data has been returned. |
| public void onError(Exception e) | e: Exception information. | None | Called when an exception occurs. Implement exception logging and resource cleanup in this method. |
Response
The server returns binary audio data:
- Non-streaming: Process the ByteBuffer returned by call.
- Unidirectional or bidirectional streaming: Process the SpeechSynthesisResult parameter in onEvent.
SpeechSynthesisResult:
| Interface/Method | Parameter | Return value | Description |
|---|---|---|---|
| public ByteBuffer getAudioFrame() | None | Binary audio data | Returns binary audio for the current segment. May be empty if no new data has arrived. Combine segments into a complete file, or play them with a streaming player. |
| public String getRequestId() | None | Request ID | Gets the task request ID. Returns null when getAudioFrame returns data. |
| public SpeechSynthesisUsage getUsage() | None | SpeechSynthesisUsage or null | Returns the billable character count so far via getCharacters(). Use the last received value as final. |
| public Sentence getTimestamp() | None | Sentence or null | Returns timestamp data when enableWordTimestamp is true. Sentence methods: getIndex (sentence number, from 0), getWords (returns List<Word>). Word methods: getText, getBeginIndex, getEndIndex, getBeginTime, getEndTime. |
For compressed formats (MP3, Opus) in streaming synthesis, use a streaming player; playing frame by frame causes decoding failures. Streaming players include FFmpeg, PyAudio (Python), AudioFormat (Java), and MediaSource (JavaScript).

When combining audio into a complete file, write in append mode. For WAV and MP3 streaming audio, only the first frame contains header information.
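The append-mode combining described above can be sketched with standard Java NIO; the byte arrays here are stand-ins for the frames that getAudioFrame would deliver in onEvent:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class AppendFrames {
    // Append one audio frame to the target file and return the new file size.
    // APPEND mode concatenates successive frames instead of overwriting them.
    static long appendFrame(Path file, ByteBuffer frame) {
        try (FileChannel ch = FileChannel.open(file,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE, StandardOpenOption.APPEND)) {
            ch.write(frame);
            return ch.size();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path out = Path.of("combined.mp3");
        Files.deleteIfExists(out);
        // Stand-ins for frames delivered via onEvent / getAudioFrame:
        appendFrame(out, ByteBuffer.wrap(new byte[]{1, 2, 3}));
        long size = appendFrame(out, ByteBuffer.wrap(new byte[]{4, 5}));
        System.out.println(size); // 5
    }
}
```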