
Speech-to-speech models

Choose a model for voice conversation, speech translation, and more.

S2S vs pipeline

Two ways to build voice-enabled apps:
| | S2S | Pipeline (ASR + LLM + TTS) |
|---|---|---|
| Latency | Low: single model, streaming | Higher: 3 sequential hops |
| Audio understanding | End-to-end; hears tone and emotion, responds in kind | Transcribes to text first; audio nuance lost |
| Voice customization | Preset voices via system prompt | Voice cloning, voice design (CosyVoice) |
  • Use S2S when interactive conversation, low latency, and audio-aware responses matter. Continue reading this page.
  • Use Pipeline when you need custom voices or want to mix-and-match the best ASR, LLM, and TTS for each stage.
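The latency row in the table follows directly from the two architectures: a pipeline's stages run sequentially, so their latencies add, while S2S streams from a single model. A toy illustration with made-up timings (nothing below is a real API call or benchmark):

```python
# Toy illustration of why a cascaded pipeline adds latency:
# three sequential stages vs. one end-to-end model.
# All timings are invented placeholders, not measurements.

PIPELINE_STAGES_MS = {"asr": 300, "llm": 700, "tts": 250}  # hypothetical per-stage latency
S2S_FIRST_AUDIO_MS = 500  # hypothetical time-to-first-audio for a streaming S2S model

def pipeline_latency_ms(stages: dict[str, int]) -> int:
    """Sequential hops: each stage waits for the previous one to finish."""
    return sum(stages.values())

def compare() -> dict[str, int]:
    """Return both figures side by side for comparison."""
    return {
        "pipeline_ms": pipeline_latency_ms(PIPELINE_STAGES_MS),
        "s2s_ms": S2S_FIRST_AUDIO_MS,
    }
```

With these placeholder numbers the pipeline's time-to-first-audio is the sum of its hops, which is the structural reason S2S wins on latency even when individual stages are fast.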

Real-time or file-based?

  • Real-time (WebSocket) — Use for live voice interfaces: voice assistants, call centers, simultaneous interpretation. Audio streams in, speech streams out. Model names contain -realtime.
  • File-based (HTTP) — Use when you can trade latency for better results: video dubbing, podcast translation, offline content processing. Unlocks function calling (Qwen3.5-Omni, Qwen3-Omni-Flash), web search (Qwen3.5-Omni), thinking mode (Qwen3-Omni-Flash), and video context (Livetranslate).
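The naming convention above (`-realtime` in the model name means WebSocket) can drive transport selection directly; a minimal helper:

```python
def transport_for(model: str) -> str:
    """Real-time models (names containing '-realtime') stream over WebSocket;
    all other models are called over plain HTTP."""
    return "websocket" if "-realtime" in model else "http"
```

For example, `transport_for("qwen3.5-omni-plus-realtime")` selects WebSocket, while `transport_for("qwen3-omni-flash")` selects HTTP.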

Function calling

Let the model take actions based on what it hears and sees: check a knowledge base, query a schedule, trigger a workflow. Use qwen3.5-omni-plus (HTTP), qwen3.5-omni-flash (HTTP), or qwen3-omni-flash (HTTP). Not available on realtime or Livetranslate models.

Web search

Let the model retrieve real-time information to answer questions about current events, stock prices, weather, and more. Use qwen3.5-omni-plus (HTTP) or qwen3.5-omni-plus-realtime (WebSocket). The model autonomously decides whether to search. Not available on Qwen3-Omni-Flash or Livetranslate models.
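A hedged sketch of how these capabilities are typically declared in an OpenAI-compatible request body: function calling through a `tools` array, web search through an extra request flag. The tool `lookup_schedule` is a made-up example, and the flag name `enable_search` is an assumption to verify against the API reference:

```python
def build_request(model: str, messages: list[dict], enable_search: bool = False) -> dict:
    """Sketch of a request body in the OpenAI-compatible chat format.
    The tool below (lookup_schedule) is a made-up example; per the notes
    above, only the HTTP (non-realtime) Omni models accept tools."""
    body = {
        "model": model,  # e.g. "qwen3.5-omni-plus"
        "messages": messages,
        "tools": [
            {
                "type": "function",
                "function": {
                    "name": "lookup_schedule",  # hypothetical tool name
                    "description": "Look up a meeting schedule by date.",
                    "parameters": {
                        "type": "object",
                        "properties": {"date": {"type": "string"}},
                        "required": ["date"],
                    },
                },
            }
        ],
    }
    if enable_search:
        # Assumed flag name; with the OpenAI SDK this would be passed
        # via extra_body rather than as a top-level keyword argument.
        body["enable_search"] = True
    return body
```

The model returns a `tool_calls` message when it decides to invoke the function; your application executes it and sends the result back as a `tool` role message.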

Thinking mode

Use qwen3-omni-flash (HTTP) when answer quality matters more than latency. The model reasons step-by-step before producing speech — useful for technical support, complex Q&A, or multi-step instructions. Not available on Qwen3.5-Omni models.
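A minimal request sketch for thinking mode. Qwen3-series models on DashScope expose a thinking toggle commonly named `enable_thinking`; treating that flag as valid for qwen3-omni-flash is an assumption to confirm against the API reference:

```python
def build_thinking_request(messages: list[dict], think: bool = True) -> dict:
    """Request body sketch for qwen3-omni-flash over HTTP with step-by-step
    reasoning enabled. 'enable_thinking' is the toggle Qwen3-series models
    use on DashScope; with the OpenAI SDK it is passed via extra_body."""
    return {
        "model": "qwen3-omni-flash",
        "messages": messages,
        "enable_thinking": think,  # assumed to apply to the omni model
    }
```

Expect higher end-to-end latency when thinking is on; that is the trade the section above describes.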

Translation

All three model families can translate speech:
  • Qwen3-Livetranslate — 18 languages + 5 Chinese dialects, ~3-second latency, out of the box. File-based variant accepts video for context-aware accuracy. 7 languages output text only (no audio).
  • Qwen3.5-Omni — 29 output languages + 7 Chinese dialects. Superior audio-video understanding and web search. Inject terminology and domain context via system prompt. Both realtime and file-based.
  • Qwen3-Omni-Flash — 11 output languages + 8 Chinese dialects. Inject terminology and domain context via system prompt for specialized fields. Both realtime and file-based. Lower cost.
Livetranslate for quick setup; Qwen3.5-Omni for best quality and broadest language coverage; Qwen3-Omni-Flash for cost-sensitive scenarios.
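Terminology injection via the system prompt, as mentioned for Qwen3.5-Omni and Qwen3-Omni-Flash above, can be sketched as a prompt builder. The prompt wording itself is illustrative only:

```python
def translation_system_prompt(target_lang: str, glossary: dict[str, str]) -> str:
    """Builds a system prompt that injects domain terminology, the mechanism
    the Omni models use for specialized-field translation. The exact wording
    of the prompt is a made-up example, not a documented template."""
    terms = "\n".join(f"- {src} -> {dst}" for src, dst in sorted(glossary.items()))
    return (
        f"Translate the user's speech into {target_lang}. "
        f"Always render these domain terms as specified:\n{terms}"
    )
```

For instance, `translation_system_prompt("German", {"stent": "Stent"})` pins a medical term so the model does not paraphrase it.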
Translation output coverage across the three families:
  • Languages: English, Chinese (Mandarin), French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Vietnamese, Arabic, Hindi, Turkish, Finnish, Polish, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian, Greek
  • Chinese dialects: Cantonese, Sichuanese, Shanghainese, Beijing, Tianjin, Nanjing, Shaanxi, Hokkien
  • Text-only output (no audio) for these languages: Thai, Indonesian, Vietnamese, Arabic, Hindi, Turkish, Greek
Qwen3.5-Omni supports 113 input languages/dialects total; see the full list for details. Legacy qwen-omni-turbo supports Chinese and English only.
| Model | API | Input | Function calling | Web search | Thinking |
|---|---|---|---|---|---|
| qwen3.5-omni-plus-realtime | WebSocket | Text, audio, image, video | | ✓ | |
| qwen3.5-omni-plus | HTTP | Text, audio, image, video | ✓ | ✓ | |
| qwen3.5-omni-flash-realtime | WebSocket | Text, audio, image, video | | | |
| qwen3.5-omni-flash | HTTP | Text, audio, image, video | ✓ | | |
| qwen3-omni-flash-realtime | WebSocket | Text, audio, image, video | | | |
| qwen3-omni-flash | HTTP | Text, audio, image, video | ✓ | | ✓ |
| qwen3-livetranslate-flash-realtime | WebSocket | Audio | | | |
| qwen3-livetranslate-flash | HTTP | Audio, video | | | |
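For the WebSocket rows above, a minimal URL builder. The base endpoint is an assumption modeled on DashScope's realtime API shape; confirm it against the API reference before use:

```python
from urllib.parse import urlencode

# Assumed endpoint; verify against the DashScope realtime API reference.
DASHSCOPE_REALTIME_BASE = "wss://dashscope.aliyuncs.com/api-ws/v1/realtime"

def realtime_url(model: str) -> str:
    """WebSocket URL for a -realtime model. Only models whose names contain
    '-realtime' speak the streaming protocol; HTTP models are rejected."""
    if "-realtime" not in model:
        raise ValueError(f"{model} is an HTTP model, not a realtime one")
    return f"{DASHSCOPE_REALTIME_BASE}?{urlencode({'model': model})}"
```

Authentication (an API-key header on the WebSocket handshake) is omitted here; see the API reference for the exact header name.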

All models

| Model | API | Input |
|---|---|---|
| qwen3.5-omni-plus-realtime | WebSocket | Text, audio, image, video |
| qwen3.5-omni-plus-realtime-2026-03-15 | WebSocket | Text, audio, image, video |
| qwen3.5-omni-flash-realtime | WebSocket | Text, audio, image, video |
| qwen3.5-omni-flash-realtime-2026-03-15 | WebSocket | Text, audio, image, video |
| qwen3.5-omni-plus | HTTP | Text, audio, image, video |
| qwen3.5-omni-plus-2026-03-15 | HTTP | Text, audio, image, video |
| qwen3.5-omni-flash | HTTP | Text, audio, image, video |
| qwen3.5-omni-flash-2026-03-15 | HTTP | Text, audio, image, video |
| Model | API | Input |
|---|---|---|
| qwen3-omni-flash-realtime | WebSocket | Text, audio, image, video |
| qwen3-omni-flash-realtime-2025-12-01 | WebSocket | Text, audio, image, video |
| qwen3-omni-flash-realtime-2025-09-15 | WebSocket | Text, audio, image, video |
| qwen3-omni-flash | HTTP | Text, audio, image, video |
| qwen3-omni-flash-2025-12-01 | HTTP | Text, audio, image, video |
| qwen3-omni-flash-2025-09-15 | HTTP | Text, audio, image, video |
| Model | API | Input | Languages |
|---|---|---|---|
| qwen3-livetranslate-flash-realtime | WebSocket | Audio | 18 |
| qwen3-livetranslate-flash-realtime-2025-09-22 | WebSocket | Audio | 18 |
| qwen3-livetranslate-flash | HTTP | Audio, video | 18 |
| qwen3-livetranslate-flash-2025-12-01 | HTTP | Audio, video | 18 |
These models are no longer updated. Use Qwen3.5-Omni or Qwen3-Omni-Flash for new projects.
| Model | Input | API |
|---|---|---|
| qwen2.5-omni-7b | Text, audio, image, video | HTTP |
| qwen-omni-turbo | Text, audio, image, video | HTTP |
| qwen-omni-turbo-latest | Text, audio, image, video | HTTP |
| qwen-omni-turbo-2025-03-26 | Text, audio, image, video | HTTP |
| qwen-omni-turbo-realtime | Text, audio | WebSocket |
| qwen-omni-turbo-realtime-latest | Text, audio | WebSocket |
| qwen-omni-turbo-realtime-2025-05-08 | Text, audio | WebSocket |
