At the moment, the situation can be viewed on two levels: whether the model itself supports video input, and whether the Dify frontend/API accepts video file types.

First, regarding whether file upload is supported: it depends on whether the app's `features` configuration has the video file type enabled under `files`.
Second, for GLM, according to the official manual, the sample code is as follows:

```python
# Video understanding example: pass a video URL
from zhipuai import ZhipuAI

client = ZhipuAI(api_key="YOUR API KEY")  # Fill in your own API key
response = client.chat.completions.create(
    model="glm-4v-plus-0111",  # Fill in the model name to call
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "video_url",
                    "video_url": {
                        "url": "https://sfile.chatglm.cn/testpath/video/xxxxx.mp4"
                    }
                },
                {
                    "type": "text",
                    "text": "Please describe this video in detail."
                }
            ]
        }
    ]
)
print(response.choices[0].message)
```
This shows that for GLM models, users only need to pass a valid video URL for the model to analyze the video.
If using the Qwen model series, according to the official Qwen API documentation:

Qwen-VL analyzes content by extracting a sequence of frames from the video. The frame-extraction frequency determines the granularity of the analysis. Different SDKs have different default frame-extraction frequencies, and the model supports controlling the frequency via the `fps` parameter (extract one frame every 1/`fps` seconds; range [0.1, 10], default 2.0). It is recommended to set a higher `fps` for fast-motion scenes and a lower `fps` for static or long videos.
```python
import os

import dashscope

# The following is the Singapore region base_url. If using the Virginia region model,
# change base_url to https://dashscope-us.aliyuncs.com/api/v1
# If using the Beijing region model, change base_url to https://dashscope.aliyuncs.com/api/v1
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'

messages = [
    {
        "role": "user",
        "content": [
            # fps controls the frame-extraction frequency: one frame is extracted every 1/fps seconds.
            # Full usage: https://www.alibabacloud.com/help/en/model-studio/use-qwen-by-calling-api?#2ed5ee7377fum
            {"video": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241115/cqqkru/1.mp4", "fps": 2},
            {"text": "What is this video about?"}
        ]
    }
]
response = dashscope.MultiModalConversation.call(
    # API keys differ by region. Get an API key:
    # https://www.alibabacloud.com/help/en/model-studio/get-api-key
    # If you don't have environment variables configured, replace the next line with: api_key="sk-xxx"
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    model='qwen3-vl-plus',
    messages=messages
)
print(response.output.choices[0].message.content[0]["text"])
```
Another form, based on Alibaba's official manual:

```python
import os

# dashscope version must be >= 1.20.10
import dashscope

dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'

messages = [
    {
        "role": "user",
        "content": [
            # If the model belongs to the Qwen2.5-VL series and an image list is provided,
            # fps indicates that the image list was extracted from the original video every 1/fps seconds.
            {
                "video": [
                    "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/xzsgiz/football1.jpg",
                    "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/tdescd/football2.jpg",
                    "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/zefdja/football3.jpg",
                    "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/aedbqh/football4.jpg"
                ],
                "fps": 2
            },
            {"text": "Describe the detailed process shown in this video."}
        ]
    }
]
response = dashscope.MultiModalConversation.call(
    # If you don't have environment variables configured, replace the next line with: api_key="sk-xxx"
    # API keys differ for the Singapore/Virginia and Beijing regions. Get an API key:
    # https://www.alibabacloud.com/help/en/model-studio/get-api-key
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    model='qwen2.5-vl-72b-instruct',  # Example model; replace as needed. Model list:
                                      # https://www.alibabacloud.com/help/en/model-studio/models
    messages=messages
)
print(response["output"]["choices"][0]["message"].content[0]["text"])
```
From the screenshot, the key point is not whether the user is using GLM or Qwen, but that the user is using the SiliconFlow plugin. According to SiliconFlow's official documentation, its vision handling works like this:

2. Usage

For VLM models, you can call the /chat/completions endpoint and construct message content that includes an image URL or a base64-encoded image. Use the `detail` parameter to control the image preprocessing mode.

2.1 Detail parameter

SiliconFlow provides three `detail` options: `low`, `high`, and `auto`. For currently supported models, if `detail` is omitted or set to `high`, high-resolution mode is used; if set to `low` or `auto`, low-resolution mode is used.
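As a sketch of what such a request looks like, the snippet below builds an OpenAI-compatible /chat/completions message body with an image URL and the `detail` parameter. The model id and image URL here are placeholders, not verified values from SiliconFlow's model list:

```python
import json

# Build a /chat/completions request body containing an image URL plus the
# `detail` parameter, in the OpenAI-compatible message format that the
# SiliconFlow docs describe. Model id and image URL are placeholders.
payload = {
    "model": "Qwen/Qwen2-VL-72B-Instruct",  # placeholder VLM model id
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://example.com/sample.jpg",
                        "detail": "low",  # "low", "high", or "auto"
                    },
                },
                {"type": "text", "text": "Describe this image."},
            ],
        }
    ],
}
print(json.dumps(payload, indent=2))
# Send it with, e.g.:
# requests.post("https://api.siliconflow.cn/v1/chat/completions",
#               headers={"Authorization": f"Bearer {API_KEY}"},
#               json=payload)
```

Note that the content list carries only `image_url` and `text` parts; there is no `video_url` part in this format, which is consistent with the billing rules below covering images only.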
4. Billing for visual inputs

Visual inputs such as images are converted into tokens and billed together with text as part of the context. Different models convert visual content differently; below is the current conversion rule.

4.1 Qwen series

Rules:

Qwen supports a maximum resolution of 3584 × 3584 = 12,845,056 pixels and a minimum resolution of 56 × 56 = 3,136 pixels. Each image is first resized so that both sides are multiples of 28, i.e. (h × 28) × (w × 28). If the result falls outside the min/max pixel range, it is further scaled proportionally into that range.

When `detail=low`, all images are resized to 448 × 448, which maps to 256 tokens.

When `detail=high`, the image is scaled proportionally: first round width/height up to the nearest multiple of 28, then scale proportionally into the pixel range (3136, 12845056) while keeping both sides as multiples of 28.

Examples:

For images sized 224 × 448, 1024 × 1024, and 3172 × 4096, choosing `detail=low` always costs 256 tokens.

For 224 × 448 with `detail=high`: it is within the pixel range and both sides are multiples of 28, so the cost is (224/28) × (448/28) = 8 × 16 = 128 tokens.

For 1024 × 1024 with `detail=high`: round up to 1036 × 1036 (the nearest multiples of 28), which is within range, so the cost is (1036/28) × (1036/28) = 1369 tokens.

For 3172 × 4096 with `detail=high`: round up to 3192 × 4116, which exceeds the maximum pixel count, then scale proportionally down to 3136 × 4060, so the cost is (3136/28) × (4060/28) = 16240 tokens.

The official API docs do not mention video handling, which indicates that the core issue is not Dify itself, but that SiliconFlow does not support video processing. This also explains why the user mentioned doing manual frame extraction.
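As a side note, the token-billing rules quoted above can be sketched in code. This is an illustrative reimplementation of the stated rules, not an official SDK function; it reproduces the worked examples from the docs:

```python
import math

MAX_PIXELS = 3584 * 3584  # 12,845,056
MIN_PIXELS = 56 * 56      # 3,136

def qwen_image_tokens(width: int, height: int, detail: str = "high") -> int:
    """Estimate the token cost of one image under the Qwen billing rules
    quoted above. Illustrative only, not an official SDK function."""
    if detail == "low":
        return 256  # all images are resized to 448 x 448 -> 256 tokens
    # Round both sides up to the nearest multiple of 28.
    w = math.ceil(width / 28) * 28
    h = math.ceil(height / 28) * 28
    # If outside the pixel range, scale proportionally back into it,
    # keeping both sides as multiples of 28.
    if w * h > MAX_PIXELS:
        scale = math.sqrt(MAX_PIXELS / (w * h))
        w = math.floor(w * scale / 28) * 28
        h = math.floor(h * scale / 28) * 28
    elif w * h < MIN_PIXELS:
        scale = math.sqrt(MIN_PIXELS / (w * h))
        w = math.ceil(w * scale / 28) * 28
        h = math.ceil(h * scale / 28) * 28
    # One token per 28 x 28 patch.
    return (w // 28) * (h // 28)

print(qwen_image_tokens(224, 448))    # -> 128
print(qwen_image_tokens(1024, 1024))  # -> 1369
print(qwen_image_tokens(3172, 4096))  # -> 16240
```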
Solution

- Wait for SiliconFlow to officially support video/stream inputs, and notify the Dify maintainers (and the SiliconFlow plugin author) to update the plugin accordingly.
- Alternatively, switch to the Dify Tongyi plugin and use an Alibaba Cloud Bailian (Model Studio) API key.
