
Speech-to-speech models

Choose a model for voice conversation, speech translation, and more.

S2S vs pipeline

Two ways to build voice-enabled apps:
| | S2S | Pipeline (ASR + LLM + TTS) |
|---|---|---|
| Latency | Low: single model, streaming | Higher: 3 sequential hops |
| Audio understanding | End-to-end; hears tone and emotion, responds in kind | Transcribes to text first; audio nuance lost |
| Voice customization | Preset voices via system prompt | Voice cloning, voice design (CosyVoice) |
  • Use S2S when interactive conversation, low latency, and audio-aware responses matter. Continue reading this page.
  • Use Pipeline when you need custom voices or want to mix-and-match the best ASR, LLM, and TTS for each stage.
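The latency row in the table follows directly from the two architectures: a pipeline's stages run sequentially, so their latencies add, while S2S streams from a single model. A toy illustration with made-up timings (nothing below is a real API call or benchmark):

```python
# Toy illustration of why a cascaded pipeline adds latency:
# three sequential stages vs. one end-to-end model.
# All timings are invented placeholders, not measurements.

PIPELINE_STAGES_MS = {"asr": 300, "llm": 700, "tts": 250}  # hypothetical per-stage latency
S2S_FIRST_AUDIO_MS = 500  # hypothetical time-to-first-audio for a streaming S2S model

def pipeline_latency_ms(stages: dict[str, int]) -> int:
    """Sequential hops: each stage waits for the previous one to finish."""
    return sum(stages.values())

def compare() -> dict[str, int]:
    """Return both figures side by side for comparison."""
    return {
        "pipeline_ms": pipeline_latency_ms(PIPELINE_STAGES_MS),
        "s2s_ms": S2S_FIRST_AUDIO_MS,
    }
```

With these placeholder numbers the pipeline's time-to-first-audio is the sum of its hops, which is the structural reason S2S wins on latency even when individual stages are fast.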

Real-time or file-based?

  • Real-time (WebSocket) — Use for live voice interfaces: voice assistants, call centers, simultaneous interpretation. Audio streams in, speech streams out. Model names contain -realtime.
  • File-based (HTTP) — Use when you can trade latency for better results: video dubbing, podcast translation, offline content processing. Unlocks function calling (Qwen3.5-Omni, Qwen3-Omni-Flash), web search (Qwen3.5-Omni), thinking mode (Qwen3-Omni-Flash), and video context (Livetranslate).
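The naming convention above (`-realtime` in the model name means WebSocket) can drive transport selection directly; a minimal helper:

```python
def transport_for(model: str) -> str:
    """Real-time models (names containing '-realtime') stream over WebSocket;
    all other models are called over plain HTTP."""
    return "websocket" if "-realtime" in model else "http"
```

For example, `transport_for("qwen3.5-omni-plus-realtime")` selects WebSocket, while `transport_for("qwen3-omni-flash")` selects HTTP.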

Function calling

Let the model take actions based on what it hears and sees: check a knowledge base, query a schedule, trigger a workflow. Use qwen3.5-omni-plus (HTTP), qwen3.5-omni-flash (HTTP), or qwen3-omni-flash (HTTP). Not available on realtime or Livetranslate models.

Web search

Let the model retrieve real-time information to answer questions about current events, stock prices, weather, and more. Use qwen3.5-omni-plus (HTTP) or qwen3.5-omni-plus-realtime (WebSocket). The model autonomously decides whether to search. Not available on Qwen3-Omni-Flash or Livetranslate models.
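A hedged sketch of how these capabilities are typically declared in an OpenAI-compatible request body: function calling through a `tools` array, web search through an extra request flag. The tool `lookup_schedule` is a made-up example, and the flag name `enable_search` is an assumption to verify against the API reference:

```python
def build_request(model: str, messages: list[dict], enable_search: bool = False) -> dict:
    """Sketch of a request body in the OpenAI-compatible chat format.
    The tool below (lookup_schedule) is a made-up example; per the notes
    above, only the HTTP (non-realtime) Omni models accept tools."""
    body = {
        "model": model,  # e.g. "qwen3.5-omni-plus"
        "messages": messages,
        "tools": [
            {
                "type": "function",
                "function": {
                    "name": "lookup_schedule",  # hypothetical tool name
                    "description": "Look up a meeting schedule by date.",
                    "parameters": {
                        "type": "object",
                        "properties": {"date": {"type": "string"}},
                        "required": ["date"],
                    },
                },
            }
        ],
    }
    if enable_search:
        # Assumed flag name; with the OpenAI SDK this would be passed
        # via extra_body rather than as a top-level keyword argument.
        body["enable_search"] = True
    return body
```

The model returns a `tool_calls` message when it decides to invoke the function; your application executes it and sends the result back as a `tool` role message.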

Thinking mode

Use qwen3-omni-flash (HTTP) when answer quality matters more than latency. The model reasons step-by-step before producing speech — useful for technical support, complex Q&A, or multi-step instructions. Not available on Qwen3.5-Omni models.
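A minimal request sketch for thinking mode. Qwen3-series models on DashScope expose a thinking toggle commonly named `enable_thinking`; treating that flag as valid for qwen3-omni-flash is an assumption to confirm against the API reference:

```python
def build_thinking_request(messages: list[dict], think: bool = True) -> dict:
    """Request body sketch for qwen3-omni-flash over HTTP with step-by-step
    reasoning enabled. 'enable_thinking' is the toggle Qwen3-series models
    use on DashScope; with the OpenAI SDK it is passed via extra_body."""
    return {
        "model": "qwen3-omni-flash",
        "messages": messages,
        "enable_thinking": think,  # assumed to apply to the omni model
    }
```

Expect higher end-to-end latency when thinking is on; that is the trade the section above describes.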

Translation

All three model families can translate speech:
  • Qwen3-Livetranslate — 18 languages + 5 Chinese dialects, ~3-second latency, out of the box. File-based variant accepts video for context-aware accuracy. 7 languages output text only (no audio).
  • Qwen3.5-Omni — 29 output languages + 7 Chinese dialects. Superior audio-video understanding and web search. Inject terminology and domain context via system prompt. Both realtime and file-based.
  • Qwen3-Omni-Flash — 11 output languages + 8 Chinese dialects. Inject terminology and domain context via system prompt for specialized fields. Both realtime and file-based. Lower cost.
Livetranslate for quick setup; Qwen3.5-Omni for best quality and broadest language coverage; Qwen3-Omni-Flash for cost-sensitive scenarios.
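Terminology injection via the system prompt, as mentioned for Qwen3.5-Omni and Qwen3-Omni-Flash above, can be sketched as a prompt builder. The prompt wording itself is illustrative only:

```python
def translation_system_prompt(target_lang: str, glossary: dict[str, str]) -> str:
    """Builds a system prompt that injects domain terminology, the mechanism
    the Omni models use for specialized-field translation. The exact wording
    of the prompt is a made-up example, not a documented template."""
    terms = "\n".join(f"- {src} -> {dst}" for src, dst in sorted(glossary.items()))
    return (
        f"Translate the user's speech into {target_lang}. "
        f"Always render these domain terms as specified:\n{terms}"
    )
```

For instance, `translation_system_prompt("German", {"stent": "Stent"})` pins a medical term so the model does not paraphrase it.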
Translation output coverage across the three families:
  • Languages: English, Chinese (Mandarin), French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Vietnamese, Arabic, Hindi, Turkish, Finnish, Polish, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian, Greek
  • Chinese dialects: Cantonese, Sichuanese, Shanghainese, Beijing, Tianjin, Nanjing, Shaanxi, Hokkien
  • Text-only output (no audio) for these languages: Thai, Indonesian, Vietnamese, Arabic, Hindi, Turkish, Greek
Qwen3.5-Omni supports 113 input languages/dialects total; see the full list for details. Legacy qwen-omni-turbo supports Chinese and English only.
| Model | API | Input | Function calling | Web search | Thinking |
|---|---|---|---|---|---|
| qwen3.5-omni-plus-realtime | WebSocket | Text, audio, image, video | | ✓ | |
| qwen3.5-omni-plus | HTTP | Text, audio, image, video | ✓ | ✓ | |
| qwen3.5-omni-flash-realtime | WebSocket | Text, audio, image, video | | | |
| qwen3.5-omni-flash | HTTP | Text, audio, image, video | ✓ | | |
| qwen3-omni-flash-realtime | WebSocket | Text, audio, image, video | | | |
| qwen3-omni-flash | HTTP | Text, audio, image, video | ✓ | | ✓ |
| qwen3-livetranslate-flash-realtime | WebSocket | Audio | | | |
| qwen3-livetranslate-flash | HTTP | Audio, video | | | |
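For the WebSocket rows above, a minimal URL builder. The base endpoint is an assumption modeled on DashScope's realtime API shape; confirm it against the API reference before use:

```python
from urllib.parse import urlencode

# Assumed endpoint; verify against the DashScope realtime API reference.
DASHSCOPE_REALTIME_BASE = "wss://dashscope.aliyuncs.com/api-ws/v1/realtime"

def realtime_url(model: str) -> str:
    """WebSocket URL for a -realtime model. Only models whose names contain
    '-realtime' speak the streaming protocol; HTTP models are rejected."""
    if "-realtime" not in model:
        raise ValueError(f"{model} is an HTTP model, not a realtime one")
    return f"{DASHSCOPE_REALTIME_BASE}?{urlencode({'model': model})}"
```

Authentication (an API-key header on the WebSocket handshake) is omitted here; see the API reference for the exact header name.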

All models

| Model | API | Input |
|---|---|---|
| qwen3.5-omni-plus-realtime | WebSocket | Text, audio, image, video |
| qwen3.5-omni-plus-realtime-2026-03-15 | WebSocket | Text, audio, image, video |
| qwen3.5-omni-flash-realtime | WebSocket | Text, audio, image, video |
| qwen3.5-omni-flash-realtime-2026-03-15 | WebSocket | Text, audio, image, video |
| qwen3.5-omni-plus | HTTP | Text, audio, image, video |
| qwen3.5-omni-plus-2026-03-15 | HTTP | Text, audio, image, video |
| qwen3.5-omni-flash | HTTP | Text, audio, image, video |
| qwen3.5-omni-flash-2026-03-15 | HTTP | Text, audio, image, video |
| Model | API | Input |
|---|---|---|
| qwen3-omni-flash-realtime | WebSocket | Text, audio, image, video |
| qwen3-omni-flash-realtime-2025-12-01 | WebSocket | Text, audio, image, video |
| qwen3-omni-flash-realtime-2025-09-15 | WebSocket | Text, audio, image, video |
| qwen3-omni-flash | HTTP | Text, audio, image, video |
| qwen3-omni-flash-2025-12-01 | HTTP | Text, audio, image, video |
| qwen3-omni-flash-2025-09-15 | HTTP | Text, audio, image, video |
| Model | API | Input | Languages |
|---|---|---|---|
| qwen3-livetranslate-flash-realtime | WebSocket | Audio | 18 |
| qwen3-livetranslate-flash-realtime-2025-09-22 | WebSocket | Audio | 18 |
| qwen3-livetranslate-flash | HTTP | Audio, video | 18 |
| qwen3-livetranslate-flash-2025-12-01 | HTTP | Audio, video | 18 |
These models are no longer updated. Use Qwen3.5-Omni or Qwen3-Omni-Flash for new projects.
| Model | Input | API |
|---|---|---|
| qwen2.5-omni-7b | Text, audio, image, video | HTTP |
| qwen-omni-turbo | Text, audio, image, video | HTTP |
| qwen-omni-turbo-latest | Text, audio, image, video | HTTP |
| qwen-omni-turbo-2025-03-26 | Text, audio, image, video | HTTP |
| qwen-omni-turbo-realtime | Text, audio | WebSocket |
| qwen-omni-turbo-realtime-latest | Text, audio | WebSocket |
| qwen-omni-turbo-realtime-2025-05-08 | Text, audio | WebSocket |
