Whether in a chat assistant or a workflow, multimodal models are unable to receive video files.
What kind of error message did you get?
Currently, very few large models support video uploads, so it’s best to test for support during the setup phase. A known model that supports video is Claude 4.5. @Dify_Forum_Helper
At the moment, the situation can be viewed on two levels: whether the model supports video, and whether the Dify frontend/API accepts video file types.
First, regarding whether file upload is supported: it depends on whether the video file type has been enabled under Features → File Upload.
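For reference, a workflow-run request that carries a video file might be shaped like the sketch below. The endpoint path, field names, and the `video` type value are assumptions based on Dify’s workflow-run API and should be checked against your Dify version; the host and API key are placeholders.

```python
import json

# Minimal sketch of a Dify workflow-run payload that attaches a video by URL.
# The "video" type is only accepted if the Features panel enables video uploads.
payload = {
    "inputs": {},
    "response_mode": "blocking",
    "user": "demo-user",
    "files": [
        {
            "type": "video",
            "transfer_method": "remote_url",
            "url": "https://example.com/sample.mp4",
        }
    ],
}

# To actually send it (requires a running Dify instance and an app API key):
# import requests
# requests.post("https://your-dify-host/v1/workflows/run",
#               headers={"Authorization": "Bearer app-xxx"},
#               json=payload)
print(json.dumps(payload, indent=2))
```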
Second, for GLM, according to the official manual, the sample code is as follows:
```python
# Video understanding example: pass a video URL
from zhipuai import ZhipuAI

client = ZhipuAI(api_key="YOUR API KEY")  # Fill in your own API key
response = client.chat.completions.create(
    model="glm-4v-plus-0111",  # Fill in the model name to call
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "video_url",
                    "video_url": {
                        "url": "https://sfile.chatglm.cn/testpath/video/xxxxx.mp4"
                    }
                },
                {
                    "type": "text",
                    "text": "Please describe this video in detail."
                }
            ]
        }
    ]
)
print(response.choices[0].message)
```
This shows that for GLM models, users only need to pass a valid video URL for the model to analyze the video.
If using the Qwen model series, according to the official Qwen API documentation:
Qwen-VL analyzes content by extracting a sequence of frames from the video. The frame-extraction frequency determines the granularity of the analysis. Different SDKs have different default frame-extraction frequencies, and the model supports controlling the frequency via the `fps` parameter (extract one frame every 1/`fps` seconds; range `[0.1, 10]`, default `2.0`). It is recommended to set a higher `fps` for fast-motion scenes and a lower `fps` for static or long videos.
```python
import os

import dashscope

# The following is the Singapore region base_url. If using the Virginia region model,
# change base_url to https://dashscope-us.aliyuncs.com/api/v1
# If using the Beijing region model, change base_url to https://dashscope.aliyuncs.com/api/v1
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'

messages = [
    {
        "role": "user",
        "content": [
            # fps controls the video frame-extraction frequency:
            # one frame is extracted every 1/fps seconds. Full usage:
            # https://www.alibabacloud.com/help/en/model-studio/use-qwen-by-calling-api?#2ed5ee7377fum
            {"video": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241115/cqqkru/1.mp4",
             "fps": 2},
            {"text": "What is this video about?"}
        ]
    }
]
response = dashscope.MultiModalConversation.call(
    # API keys differ by region. Get an API key:
    # https://www.alibabacloud.com/help/en/model-studio/get-api-key
    # If you don't have environment variables configured, replace the next line with: api_key="sk-xxx"
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    model='qwen3-vl-plus',
    messages=messages
)
print(response.output.choices[0].message.content[0]["text"])
```
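The `fps` rule quoted above (one frame every 1/`fps` seconds) can be sanity-checked with a quick estimate of how many frames a clip yields:

```python
def estimated_frames(duration_s: float, fps: float) -> int:
    """Rough count of frames Qwen-VL samples: one every 1/fps seconds."""
    if not 0.1 <= fps <= 10:
        raise ValueError("fps must be in [0.1, 10]")
    return int(duration_s * fps)

# A 30-second clip at the default fps=2.0 yields about 60 sampled frames;
# dropping to fps=0.5 for a long, mostly static video yields about 15.
print(estimated_frames(30, 2.0))  # 60
print(estimated_frames(30, 0.5))  # 15
```

This is why a higher `fps` suits fast-motion scenes (finer granularity) while a lower `fps` keeps long or static videos cheap.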
Another form, based on Alibaba’s official manual:
```python
import os

# dashscope version must be >= 1.20.10
import dashscope

dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'

messages = [
    {
        "role": "user",
        "content": [
            # If the model belongs to the Qwen2.5-VL series and an image list is provided,
            # fps indicates that the image list was extracted from the original video
            # at one frame every 1/fps seconds.
            {"video": ["https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/xzsgiz/football1.jpg",
                       "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/tdescd/football2.jpg",
                       "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/zefdja/football3.jpg",
                       "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/aedbqh/football4.jpg"],
             "fps": 2},
            {"text": "Describe the detailed process shown in this video."}
        ]
    }
]
response = dashscope.MultiModalConversation.call(
    # If you don't have environment variables configured, replace the next line with: api_key="sk-xxx"
    # API keys differ for the Singapore/Virginia and Beijing regions. Get an API key:
    # https://www.alibabacloud.com/help/en/model-studio/get-api-key
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    model='qwen2.5-vl-72b-instruct',  # Example model; replace as needed. Model list:
                                      # https://www.alibabacloud.com/help/en/model-studio/models
    messages=messages
)
print(response["output"]["choices"][0]["message"].content[0]["text"])
```
From the screenshot, the key point is not whether the user is using GLM or Qwen, but that the user is using the SiliconFlow plugin. According to SiliconFlow’s official documentation, its vision handling works like this:
2. Usage

For VLM models, you can call the `/chat/completions` endpoint and construct message content that includes an image URL or a base64-encoded image, using the `detail` parameter to control the image preprocessing mode.

2.1 Detail parameter

SiliconFlow provides three `detail` options: `low`, `high`, and `auto`. For currently supported models, if `detail` is omitted or set to `high`, high-resolution mode is used; if set to `low` or `auto`, low-resolution mode is used.
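To illustrate, a `/chat/completions` message using `detail` might be shaped as follows. This is a sketch following the OpenAI-compatible content format that such endpoints use; the image URL is a placeholder:

```python
# One user message combining an image (with detail control) and a text prompt.
message = {
    "role": "user",
    "content": [
        {
            "type": "image_url",
            "image_url": {
                "url": "https://example.com/frame.jpg",  # or a base64 data: URL
                "detail": "low",  # "low"/"auto" -> low-resolution preprocessing
            },
        },
        {"type": "text", "text": "Describe this image."},
    ],
}
print(message["content"][0]["image_url"]["detail"])
```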
4. Billing for visual inputs

Visual inputs such as images are converted into tokens and billed together with text as part of the context. Different models convert visual content differently; below is the current conversion rule.

4.1 Qwen series

Rules:

- Qwen supports a maximum resolution of 3584 × 3584 = 12,845,056 pixels and a minimum resolution of 56 × 56 = 3,136 pixels. Each image is first resized so that both sides are multiples of 28, i.e. (h × 28) × (w × 28). If the result falls outside the min/max pixel range, it is further scaled proportionally into that range.
- When `detail=low`, all images are resized to 448 × 448, which maps to 256 tokens.
- When `detail=high`, the image is scaled proportionally: first round width/height up to the nearest multiple of 28, then scale proportionally into the pixel range (3,136 to 12,845,056) while keeping both sides as multiples of 28.

Examples:

- For images sized 224 × 448, 1024 × 1024, and 3172 × 4096, choosing `detail=low` always costs 256 tokens.
- For 224 × 448 with `detail=high`: it is within the pixel range and both sides are multiples of 28, so the cost is (224/28) × (448/28) = 8 × 16 = 128 tokens.
- For 1024 × 1024 with `detail=high`: round up to 1036 × 1036 (nearest multiples of 28), which is within range, so the cost is (1036/28) × (1036/28) = 37 × 37 = 1369 tokens.
- For 3172 × 4096 with `detail=high`: round up to 3192 × 4116, which exceeds the maximum pixel count; scale proportionally down to 3136 × 4060, so the cost is (3136/28) × (4060/28) = 112 × 145 = 16,240 tokens.

The official API docs do not mention video handling, which indicates that the core issue is not Dify itself, but that SiliconFlow does not support video processing. This also explains why the user mentioned doing manual frame extraction.
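The Qwen billing rules quoted above can be collected into a small calculator. This is my own reading of the rules, not official code; in particular, flooring to multiples of 28 when scaling down is an assumption, though it reproduces all four documented examples:

```python
import math

MIN_PX = 56 * 56           # 3,136 pixels
MAX_PX = 3584 * 3584       # 12,845,056 pixels

def qwen_image_tokens(width: int, height: int, detail: str = "high") -> int:
    """Estimate Qwen token cost for an image, per the quoted billing rules."""
    if detail in ("low", "auto"):
        return 256  # everything is resized to 448 x 448 -> 256 tokens
    # Round each side up to the nearest multiple of 28.
    w = math.ceil(width / 28) * 28
    h = math.ceil(height / 28) * 28
    pixels = w * h
    if pixels > MAX_PX:
        # Scale down proportionally, keeping both sides multiples of 28 (assumed: floor).
        scale = math.sqrt(MAX_PX / pixels)
        w = math.floor(w * scale / 28) * 28
        h = math.floor(h * scale / 28) * 28
    elif pixels < MIN_PX:
        # Scale up proportionally into range (assumed: ceil).
        scale = math.sqrt(MIN_PX / pixels)
        w = math.ceil(w * scale / 28) * 28
        h = math.ceil(h * scale / 28) * 28
    return (w // 28) * (h // 28)

print(qwen_image_tokens(224, 448))     # 128
print(qwen_image_tokens(1024, 1024))   # 1369
print(qwen_image_tokens(3172, 4096))   # 16240
```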
Solution

- Wait for SiliconFlow to officially support video/stream inputs, and notify the Dify maintainers (and the SiliconFlow plugin author) to update the plugin accordingly.
- Alternatively, switch to the Dify Tongyi plugin and use an Alibaba Cloud Bailian (Model Studio) API key.
Zhipu’s GLM-4.6V and Qwen2.5-VL are also supported.
Yes, we currently handle this by extracting frames from the video, but it only supports 6 images. We suggest that the platform relax its restrictions on video files in the future: since each vendor’s model capabilities are provided through plugins, the platform does not need to worry about vendor support and only needs to pass the file as a parameter to the model API. We hope for future compatibility.
I have tested the video-capable model glm-4.1v-thinking-flashx, and it can indeed receive uploaded videos. Would it be convenient for you to share the model you are using and your workflow for further investigation?
Here’s the thing: workflows do support uploading videos, but the uploaded video files cannot actually be passed to the large model, so the model never reads them.
@Abyssedge I have updated the explanation: Does the latest version of Dify still not support video upload? - #6 by Dify_Forum_Helper
Thanks, understood. So SiliconFlow lacks the processing logic to convert video files uploaded to Dify, which means user-uploaded files cannot be passed to the model. However, if a URL is given as input, it works.
Yes, this means that currently, if you use SiliconFlow, you’ll need to handle the video-to-image conversion yourself first.
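Until then, the manual conversion can be as simple as sampling evenly spaced timestamps and cutting one frame per timestamp with ffmpeg. The sketch below only builds the commands; the video path and ffmpeg availability are assumptions, and the cap of 6 matches the image limit mentioned above:

```python
def sample_timestamps(duration_s: float, max_frames: int = 6) -> list[float]:
    """Evenly spaced timestamps, capped at max_frames (6 images in this setup)."""
    n = min(max_frames, max(1, int(duration_s)))
    step = duration_s / n
    # Sample at the midpoint of each of the n equal segments.
    return [round(step * i + step / 2, 2) for i in range(n)]

def ffmpeg_commands(video_path: str, duration_s: float) -> list[list[str]]:
    """Build one single-frame ffmpeg extraction command per sampled timestamp."""
    cmds = []
    for i, t in enumerate(sample_timestamps(duration_s)):
        cmds.append(["ffmpeg", "-ss", str(t), "-i", video_path,
                     "-frames:v", "1", f"frame_{i}.jpg"])
    return cmds

cmds = ffmpeg_commands("input.mp4", 30.0)
for c in cmds:
    print(" ".join(c))
# Run each with subprocess.run(c, check=True) if ffmpeg is installed,
# then upload the resulting frame_*.jpg files as images.
```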
@Dify_Forum_Helper Summarize the Q&A in no more than five sentences.
- Dify’s frontend and workflow currently support uploading video files, but whether they can actually be read by the model depends on the model used and its plugin chain.
- The core issue in this post is that SiliconFlow currently only performs visual processing on images and does not convert user-uploaded videos in Dify to pass them to the model.
- Therefore, the same models (such as GLM and Qwen series), if called directly via official APIs using a video URL, can understand video. However, when used in Dify via the SiliconFlow plugin, they do not receive video content.
- At this stage, when using SiliconFlow, you need to manually perform “video to image conversion/frame extraction” or switch to models and plugins that natively support video, such as Tongyi Qianwen.
- We will need to wait for official SiliconFlow support for video streams, followed by a plugin update, to enable Dify to automatically pass uploaded videos to the model.





