Replicate motion and look
- Character portrayal: Replicate a character's appearance from a reference image or video. If the reference is a video, the model can also replicate voice timbre.
- Multi-character interaction: Compose scenes with up to five characters for natural dialogue and interaction.
- Multi-shot narrative: Maintain character consistency across shots with intelligent multi-shot scheduling.
- Voice cloning (wan2.7): Provide a reference_voice audio file to set the voice timbre independently from reference videos.
Supported models
| Model | Features | Input | Output |
|---|---|---|---|
| wan2.7-r2v (Recommended) | Audio sync, multi-character, voice cloning, first-frame control, media array input | Text, image, video, audio | 720P or 1080P, 2-15s, 30 fps, MP4 (H.264) |
| wan2.6-r2v-flash | Video with or without audio, single or multi-character, multi-shot narrative, audio-video sync. Fast and cost-effective. | Text, image, video | 720P or 1080P, 2-10s, 30 fps, MP4 (H.264) |
| wan2.6-r2v | Video with audio, multi-role reference-to-video, multi-shot narrative, audio-video sync | Text, image, video | 720P or 1080P, 2-10s, 30 fps, MP4 (H.264) |
Prerequisites
Get a DashScope API key from Qwen Cloud. Set it as an environment variable:
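The original shell snippet is not reproduced here. As a minimal sketch, the samples below read the key from the DASHSCOPE_API_KEY environment variable (the name used by DashScope SDKs):

```python
import os

def get_api_key() -> str:
    """Read the DashScope API key from the environment."""
    key = os.environ.get("DASHSCOPE_API_KEY")
    if not key:
        raise RuntimeError(
            "DASHSCOPE_API_KEY is not set; export it before running the samples."
        )
    return key
```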
How it works
Reference-to-video tasks run asynchronously:
1. Submit a task: POST your model, prompt, reference URLs, and parameters. The API returns a task_id.
2. Poll for results: GET the task status using the task_id. When the status is SUCCEEDED, the response includes a URL to the generated video.

| Task status | Description |
|---|---|
| PENDING | Task is queued |
| RUNNING | Video is being generated |
| SUCCEEDED | Generation complete; video URL available in output.video_url |
| FAILED | Generation failed; check message for details |
Getting started
Submit a reference-to-video task and poll for the result.
- curl (wan2.7)
- curl (wan2.6)
Wan 2.7 uses the media array to provide references. Use Video N or Image N in the prompt to reference characters (numbered separately by type).

Step 1: Submit the task

Step 2: Get the result using the task ID

Replace {task_id} with the task_id value returned by the previous API call.

Key differences from wan2.6: wan2.7 uses a media array instead of reference_urls, characters are referenced as Video 1/Image 1 (not character1), resolution is set with resolution + ratio instead of size, reference_voice is supported for voice cloning, and watermark defaults to false.
Parameters
wan2.7-r2v
| Parameter | Type | Required | Description |
|---|---|---|---|
| model | string | Yes | "wan2.7-r2v" |
| input.prompt | string | Yes | Up to 5,000 characters. Use Video 1/Image 1 to reference characters by their order in the media array (images and videos are counted separately). |
| input.media | array | Yes | 1-5 items. Each has type (reference_image, reference_video, or first_frame) and url. |
| input.negative_prompt | string | No | Up to 500 characters. Content to exclude. |
| input.reference_voice | string | No | URL to an audio file (WAV/MP3, 1-10s, max 15 MB) for voice cloning. Overrides audio from reference videos. |
| parameters.resolution | string | No | "720P" or "1080P" (default). |
| parameters.ratio | string | No | "16:9" (default), "9:16", "1:1", "4:3", "3:4". Ignored when first_frame is provided. |
| parameters.duration | integer | No | 2-15 (images only) or 2-10 (with video). Default: 5. |
| parameters.prompt_extend | boolean | No | Rewrite the prompt via an LLM. Default: true. |
| parameters.watermark | boolean | No | Default: false. |
| parameters.seed | integer | No | 0 to 2,147,483,647. |
Character referencing (wan2.7)
Each media item maps to a character identifier based on its position. Images and videos are counted separately.
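The numbering rule can be sketched as follows. This is a minimal illustration, not an official helper; in particular, how a first_frame item interacts with the numbering is an assumption here:

```python
def media_identifiers(media: list[dict]) -> list[str]:
    """Map each media item to the identifier used in the prompt.

    Images and videos keep separate counters, in array order.
    Treating first_frame items as non-characters is an assumption.
    """
    ids, image_n, video_n = [], 0, 0
    for item in media:
        if item["type"] == "reference_image":
            image_n += 1
            ids.append(f"Image {image_n}")
        elif item["type"] == "reference_video":
            video_n += 1
            ids.append(f"Video {video_n}")
        else:  # first_frame
            ids.append("(first frame, not a character)")
    return ids

media = [
    {"type": "reference_video", "url": "https://example.com/a.mp4"},
    {"type": "reference_image", "url": "https://example.com/b.png"},
    {"type": "reference_video", "url": "https://example.com/c.mp4"},
]
# media_identifiers(media) -> ["Video 1", "Image 1", "Video 2"]
```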
wan2.6 models
| Parameter | Type | Required | Description |
|---|---|---|---|
| model | string | Yes | "wan2.6-r2v-flash" or "wan2.6-r2v" |
| input.prompt | string | Yes | Text prompt describing the scene. Reference characters as character1, character2, etc., matching the order of reference_urls. |
| input.reference_urls | array | Yes | Up to 5 URLs pointing to reference images or videos. Each reference must contain a single character. |
| parameters.size | string | No | Output resolution. Example: "1280*720" (16:9). See the API reference for all options. |
| parameters.duration | integer | No | Output video length in seconds. |
| parameters.audio | boolean | No | true for audio, false for silent video. Default: true. Silent video is supported only by wan2.6-r2v-flash. |
| parameters.shot_type | string | No | "multi" for multi-shot switching or "single" for a fixed perspective. |
| parameters.watermark | boolean | No | true to add a watermark. |
Reference input limits
| Type | Maximum count | Notes |
|---|---|---|
| Images | 5 | People, objects, or backgrounds |
| Videos | 3 | Best for character or object references. Avoid videos of backgrounds or empty scenes. |
| Total (images + videos) | 5 | Combined limit across all references |
Character referencing (wan2.6)
Each reference URL maps to a character identifier based on its position in the reference_urls array.
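The positional mapping is simple enough to sketch in a few lines (an illustration only, not part of the API):

```python
def character_identifiers(reference_urls: list[str]) -> dict[str, str]:
    """Map each reference URL to its prompt identifier: character1, character2, ..."""
    return {f"character{i}": url for i, url in enumerate(reference_urls, start=1)}

refs = ["https://example.com/alice.mp4", "https://example.com/bob.png"]
# character_identifiers(refs)
# -> {"character1": ".../alice.mp4", "character2": ".../bob.png"}
```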
Multi-character interaction
Compose scenes with up to five characters for natural dialogue and interactions -- interviews, conversations, tutorials, and more.
Supported models: All models.
Set shot_type to multi for dynamic multi-shot switching, or single for a fixed perspective.
Example: Four-reference scene
Prompt: "character2 sits on a chair by the window, holding character3, and plays a soothing American country folk song next to character4. character1 says to character2: 'that sounds great'"
| Input | Type | Reference |
|---|---|---|
| wan-r2v-role1.mp4 | Video | character1 (person) |
| wan-r2v-role2.mp4 | Video | character2 (person) |
| (image) | Image | character3 (object) |
| (image) | Image | character4 (background) |
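As a concrete sketch, the request body for this four-reference scene might look like the following. The image URLs are placeholders (the original image assets are not reproduced here), and the field layout follows the wan2.6 parameter table above:

```python
# Hypothetical request body for the four-reference scene; URLs are placeholders.
payload = {
    "model": "wan2.6-r2v",
    "input": {
        "prompt": (
            "character2 sits on a chair by the window, holding character3, "
            "and plays a soothing American country folk song next to character4. "
            "character1 says to character2: 'that sounds great'"
        ),
        "reference_urls": [
            "https://example.com/wan-r2v-role1.mp4",  # character1 (person)
            "https://example.com/wan-r2v-role2.mp4",  # character2 (person)
            "https://example.com/object.png",         # character3 (object, placeholder)
            "https://example.com/background.png",     # character4 (background, placeholder)
        ],
    },
    "parameters": {"size": "1280*720", "shot_type": "multi"},
}
```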
Example: Two-character dialogue
Prompt: "character1 says to character2: 'I'll rely on you tomorrow morning!' character2 replies: 'You can count on me!'"
| Input | Type | Reference |
|---|---|---|
| compressed-video (1).mp4 | Video | character1 |
| compressed-video.mp4 | Video | character2 |
Sample code (curl)
All examples use the X-DashScope-Async: enable header to submit asynchronous tasks.
Step 1: Submit the task

Step 2: Get the result using the task ID

Replace <task-id> with the task_id from the Step 1 response.
Sample code (Python)
This example submits a multi-character task and polls until the result is ready.
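The original sample is not reproduced here; below is a minimal standard-library sketch of the submit-then-poll flow. The endpoint paths are assumptions based on DashScope's async task API, so verify them against the API reference before use:

```python
import json
import os
import time
import urllib.request

# Assumed DashScope endpoints; confirm against the API reference.
SUBMIT_URL = "https://dashscope.aliyuncs.com/api/v1/services/aigc/video-generation/video-synthesis"
TASK_URL = "https://dashscope.aliyuncs.com/api/v1/tasks/{task_id}"

def build_payload(prompt: str, reference_urls: list[str]) -> dict:
    """Request body for a wan2.6 multi-character task (fields from the parameter table)."""
    return {
        "model": "wan2.6-r2v",
        "input": {"prompt": prompt, "reference_urls": reference_urls},
        "parameters": {"size": "1280*720", "shot_type": "multi"},
    }

def _call(url: str, body: dict = None) -> dict:
    headers = {
        "Authorization": f"Bearer {os.environ['DASHSCOPE_API_KEY']}",
        "Content-Type": "application/json",
    }
    if body is not None:
        headers["X-DashScope-Async"] = "enable"  # submit as an async task
    data = json.dumps(body).encode() if body is not None else None
    req = urllib.request.Request(url, data=data, headers=headers)
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def generate(prompt: str, reference_urls: list[str]) -> str:
    """Submit a task, poll until it finishes, and return the video URL."""
    task_id = _call(SUBMIT_URL, build_payload(prompt, reference_urls))["output"]["task_id"]
    while True:
        out = _call(TASK_URL.format(task_id=task_id))["output"]
        if out["task_status"] == "SUCCEEDED":
            return out["video_url"]
        if out["task_status"] == "FAILED":
            raise RuntimeError(out.get("message", "generation failed"))
        time.sleep(5)  # PENDING / RUNNING: wait, then poll again

if __name__ == "__main__":
    video_url = generate(
        "character1 says to character2: 'I'll rely on you tomorrow morning!' "
        "character2 replies: 'You can count on me!'",
        ["https://example.com/a.mp4", "https://example.com/b.mp4"],  # placeholders
    )
    print(video_url)
```

Remember that the returned video URL expires after 24 hours, so download the file promptly.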
Single-character performance
Create a complete character performance across different scenes from a single reference video or image -- personal branding, product endorsements, educational training, and more.
Supported models: All models.
Pass a single URL in reference_urls and use character1 in the prompt. Setting shot_type to multi is recommended.
Example: Holiday unboxing video
Prompt: "Create a festive holiday unboxing experience. Shot 1 [0-2s]: Character1 sits by a beautifully decorated Christmas tree with twinkling lights, holding a wrapped gift box with elegant red and gold wrapping. Shot 2 [2-4s]: Close-up as Character1 carefully unwraps the gift, revealing premium skincare products inside. Shot 3 [4-6s]: Character1 applies the product with delight, saying: 'This holiday glow is exactly what I wanted!' Shot 4 [6-10s]: Character1 admires their radiant skin in a handheld mirror, surrounded by festive decorations, ending with a warm smile to camera."
| Input | Output |
|---|---|
| wan-r2v-role-4.mp4 (reference video) | Multi-shot video with audio |
Sample code (curl)
Step 1: Submit the task

Step 2: Get the result using the task ID

Replace <task-id> with the task_id from the Step 1 response.
Silent video generation
Create visual-only videos without audio -- animated posters, silent short videos, and similar use cases.
Supported model: wan2.6-r2v-flash only.
Set audio to false.
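A minimal request body for a silent video might look like this (the reference URL is a placeholder; field names follow the wan2.6 parameter table above):

```python
# Silent video: disable audio. Only wan2.6-r2v-flash supports audio=false.
payload = {
    "model": "wan2.6-r2v-flash",
    "input": {
        "prompt": "character1 drinks bubble tea while dancing spontaneously to the music.",
        "reference_urls": ["https://example.com/wan-r2v-role-1.mp4"],  # placeholder URL
    },
    "parameters": {"audio": False, "size": "1280*720"},
}
```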
Example: Silent dance video
Prompt: "character1 drinks bubble tea while dancing spontaneously to the music."
| Input | Output |
|---|---|
| wan-r2v-role-1.mp4 (reference video) | Silent video |
Sample code (curl)
Step 1: Submit the task

Step 2: Get the result using the task ID

Replace <task-id> with the task_id from the Step 1 response.
Output specifications
| Property | Details |
|---|---|
| Number of videos | 1 per task |
| Format | MP4 (H.264, 30 fps) |
| Resolution | wan2.7: Set by resolution + ratio. wan2.6: Set by size parameter (e.g. 1280*720). |
| URL expiration | 24 hours |
Response fields
| Field | Description |
|---|---|
| output.task_id | Task identifier for polling status. |
| output.task_status | PENDING, RUNNING, SUCCEEDED, or FAILED. |
| output.video_url | URL to the generated video. Available when task_status is SUCCEEDED. |
| output.orig_prompt | The original input prompt. |
| usage.duration | Output video duration in seconds. |
| usage.input_video_duration | Total duration of input videos in seconds. |
| usage.output_video_duration | Output video duration in seconds. |
| usage.video_count | Number of videos generated (always 1). |
| usage.audio | Whether the output video has audio. |
| usage.SR | Resolution tier (720 or 1080). |
Billing and rate limits
- For free quota and pricing, see Model invocation pricing.
- For rate limits, see Rate limits.
Billing rules
| Item | Billed? | Unit |
|---|---|---|
| Input images | No | -- |
| Input videos | Yes | Per second |
| Output videos | Yes | Per second |
| Failed tasks | No | Failed tasks do not consume the new-user free quota |
wan2.6-r2v-flash has separate rates for input and output video.
Total billed duration = Input video duration (capped at 5s) + Output video duration
How input video duration is billed
The 5-second cap is distributed evenly across all references (images and videos combined). Each video is billed for min(actual duration, truncation limit). Images are free.
| Number of references | Truncation limit per video |
|---|---|
| 1 | 5s |
| 2 | 2.5s |
| 3 | 1.65s |
| 4 | 1.25s |
| 5 | 1s |
Example: with two videos and one image (three references, so a 1.65s limit per video), the billed input duration is min(video 1 duration, 1.65s) + min(video 2 duration, 1.65s). The image is not billed.
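The billing rule above can be sketched as a small calculation (an illustration of the truncation table, not an official pricing tool):

```python
# Per the truncation table above: the 5s input cap is split across all
# references (images and videos combined); images themselves are free.
TRUNCATION_LIMIT = {1: 5.0, 2: 2.5, 3: 1.65, 4: 1.25, 5: 1.0}

def billed_seconds(video_durations: list[float], image_count: int,
                   output_duration: float) -> float:
    """Total billed seconds = truncated input video time + output video time."""
    n_refs = len(video_durations) + image_count
    limit = TRUNCATION_LIMIT[n_refs]
    return sum(min(d, limit) for d in video_durations) + output_duration

# Two videos (8s and 1s) plus one image, with a 5s output video:
# min(8, 1.65) + min(1, 1.65) + 5 = 7.65 billed seconds
```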
API reference
FAQ
How do I set the video aspect ratio?
wan2.7: Use the ratio parameter (16:9, 9:16, 1:1, 4:3, 3:4). When a first_frame image is provided, the ratio is inferred from the image.
wan2.6: Use the size parameter. Each resolution maps to a fixed aspect ratio -- for example, size=1280*720 produces a 16:9 video.
How do I reference characters in the prompt?
wan2.7: Use Video N or Image N identifiers; images and videos are counted separately.
wan2.6: Use character1, character2, and so on -- the identifiers map directly to the order of URLs in reference_urls.
What happens when a task fails?
Poll the task status endpoint. If task_status is FAILED, the message field in the response describes the error. Common causes include invalid reference URLs, unsupported file formats, or exceeding rate limits. Failed tasks are not billed.
