Whether in a chat assistant or a workflow, multimodal models are unable to receive video files.
What kind of error message did you get?
Currently, very few large models support video uploads, so it’s best to test for support during the setup phase. A known model that supports video is Claude 4.5. @Dify_Forum_Helper
At the moment, the situation can be viewed on two levels: whether the model supports video, and whether the Dify frontend/API accepts video file types.
First, regarding whether file upload is supported: it depends on whether the video file type has been enabled under Features → File Upload.
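For reference, a workflow-run request that carries a video file might be shaped like the sketch below. The endpoint path, field names, and the `video` type value are assumptions based on Dify’s workflow-run API and should be checked against your Dify version; the host and API key are placeholders.

```python
import json

# Minimal sketch of a Dify workflow-run payload that attaches a video by URL.
# The "video" type is only accepted if the Features panel enables video uploads.
payload = {
    "inputs": {},
    "response_mode": "blocking",
    "user": "demo-user",
    "files": [
        {
            "type": "video",
            "transfer_method": "remote_url",
            "url": "https://example.com/sample.mp4",
        }
    ],
}

# To actually send it (requires a running Dify instance and an app API key):
# import requests
# requests.post("https://your-dify-host/v1/workflows/run",
#               headers={"Authorization": "Bearer app-xxx"},
#               json=payload)
print(json.dumps(payload, indent=2))
```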
Second, for GLM, according to the official manual, the sample code is as follows:
```python
# Video understanding example: pass a video URL
from zhipuai import ZhipuAI

client = ZhipuAI(api_key="YOUR API KEY")  # Fill in your own API key
response = client.chat.completions.create(
    model="glm-4v-plus-0111",  # Fill in the model name to call
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "video_url",
                    "video_url": {
                        "url": "https://sfile.chatglm.cn/testpath/video/xxxxx.mp4"
                    }
                },
                {
                    "type": "text",
                    "text": "Please describe this video in detail."
                }
            ]
        }
    ]
)
print(response.choices[0].message)
```
This shows that for GLM models, users only need to pass a valid video URL for the model to analyze the video.
If using the Qwen model series, according to the official Qwen API documentation:
Qwen-VL analyzes content by extracting a sequence of frames from the video. The frame-extraction frequency determines the granularity of the analysis. Different SDKs have different default frame-extraction frequencies, and the model supports controlling the frequency via the `fps` parameter (extract one frame every 1/`fps` seconds; range `[0.1, 10]`, default `2.0`). It is recommended to set a higher `fps` for fast-motion scenes and a lower `fps` for static or long videos.
```python
import os

import dashscope

# The following is the Singapore region base_url. If using the Virginia region model,
# change base_url to https://dashscope-us.aliyuncs.com/api/v1
# If using the Beijing region model, change base_url to https://dashscope.aliyuncs.com/api/v1
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'

messages = [
    {
        "role": "user",
        "content": [
            # fps controls the video frame-extraction frequency:
            # one frame is extracted every 1/fps seconds. Full usage:
            # https://www.alibabacloud.com/help/en/model-studio/use-qwen-by-calling-api?#2ed5ee7377fum
            {"video": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241115/cqqkru/1.mp4",
             "fps": 2},
            {"text": "What is this video about?"}
        ]
    }
]
response = dashscope.MultiModalConversation.call(
    # API keys differ by region. Get an API key:
    # https://www.alibabacloud.com/help/en/model-studio/get-api-key
    # If you don't have environment variables configured, replace the next line with: api_key="sk-xxx"
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    model='qwen3-vl-plus',
    messages=messages
)
print(response.output.choices[0].message.content[0]["text"])
```
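The `fps` rule quoted above (one frame every 1/`fps` seconds) can be sanity-checked with a quick estimate of how many frames a clip yields:

```python
def estimated_frames(duration_s: float, fps: float) -> int:
    """Rough count of frames Qwen-VL samples: one every 1/fps seconds."""
    if not 0.1 <= fps <= 10:
        raise ValueError("fps must be in [0.1, 10]")
    return int(duration_s * fps)

# A 30-second clip at the default fps=2.0 yields about 60 sampled frames;
# dropping to fps=0.5 for a long, mostly static video yields about 15.
print(estimated_frames(30, 2.0))  # 60
print(estimated_frames(30, 0.5))  # 15
```

This is why a higher `fps` suits fast-motion scenes (finer granularity) while a lower `fps` keeps long or static videos cheap.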
Another form, based on Alibaba’s official manual:
```python
import os

# dashscope version must be >= 1.20.10
import dashscope

dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'

messages = [
    {
        "role": "user",
        "content": [
            # If the model belongs to the Qwen2.5-VL series and an image list is provided,
            # fps indicates that the image list was extracted from the original video
            # at one frame every 1/fps seconds.
            {"video": ["https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/xzsgiz/football1.jpg",
                       "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/tdescd/football2.jpg",
                       "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/zefdja/football3.jpg",
                       "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/aedbqh/football4.jpg"],
             "fps": 2},
            {"text": "Describe the detailed process shown in this video."}
        ]
    }
]
response = dashscope.MultiModalConversation.call(
    # If you don't have environment variables configured, replace the next line with: api_key="sk-xxx"
    # API keys differ for the Singapore/Virginia and Beijing regions. Get an API key:
    # https://www.alibabacloud.com/help/en/model-studio/get-api-key
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    model='qwen2.5-vl-72b-instruct',  # Example model; replace as needed. Model list:
                                      # https://www.alibabacloud.com/help/en/model-studio/models
    messages=messages
)
print(response["output"]["choices"][0]["message"].content[0]["text"])
```
From the screenshot, the key point is not whether the user is using GLM or Qwen, but that the user is using the SiliconFlow plugin. According to SiliconFlow’s official documentation, its vision handling works like this:
2. Usage

For VLM models, you can call the `/chat/completions` endpoint and construct message content that includes an image URL or a base64-encoded image, using the `detail` parameter to control the image preprocessing mode.

2.1 Detail parameter

SiliconFlow provides three `detail` options: `low`, `high`, and `auto`. For currently supported models, if `detail` is omitted or set to `high`, high-resolution mode is used; if set to `low` or `auto`, low-resolution mode is used.
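To illustrate, a `/chat/completions` message using `detail` might be shaped as follows. This is a sketch following the OpenAI-compatible content format that such endpoints use; the image URL is a placeholder:

```python
# One user message combining an image (with detail control) and a text prompt.
message = {
    "role": "user",
    "content": [
        {
            "type": "image_url",
            "image_url": {
                "url": "https://example.com/frame.jpg",  # or a base64 data: URL
                "detail": "low",  # "low"/"auto" -> low-resolution preprocessing
            },
        },
        {"type": "text", "text": "Describe this image."},
    ],
}
print(message["content"][0]["image_url"]["detail"])
```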
4. Billing for visual inputs

Visual inputs such as images are converted into tokens and billed together with text as part of the context. Different models convert visual content differently; below is the current conversion rule.

4.1 Qwen series

Rules:

- Qwen supports a maximum resolution of 3584 × 3584 = 12,845,056 pixels and a minimum resolution of 56 × 56 = 3,136 pixels. Each image is first resized so that both sides are multiples of 28, i.e. (h × 28) × (w × 28). If the result falls outside the min/max pixel range, it is further scaled proportionally into that range.
- When `detail=low`, all images are resized to 448 × 448, which maps to 256 tokens.
- When `detail=high`, the image is scaled proportionally: first round width/height up to the nearest multiple of 28, then scale proportionally into the pixel range (3,136 to 12,845,056) while keeping both sides as multiples of 28.

Examples:

- For images sized 224 × 448, 1024 × 1024, and 3172 × 4096, choosing `detail=low` always costs 256 tokens.
- For 224 × 448 with `detail=high`: it is within the pixel range and both sides are multiples of 28, so the cost is (224/28) × (448/28) = 8 × 16 = 128 tokens.
- For 1024 × 1024 with `detail=high`: round up to 1036 × 1036 (nearest multiples of 28), which is within range, so the cost is (1036/28) × (1036/28) = 37 × 37 = 1369 tokens.
- For 3172 × 4096 with `detail=high`: round up to 3192 × 4116, which exceeds the maximum pixel count; scale proportionally down to 3136 × 4060, so the cost is (3136/28) × (4060/28) = 112 × 145 = 16,240 tokens.

The official API docs do not mention video handling, which indicates that the core issue is not Dify itself, but that SiliconFlow does not support video processing. This also explains why the user mentioned doing manual frame extraction.
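The Qwen billing rules quoted above can be collected into a small calculator. This is my own reading of the rules, not official code; in particular, flooring to multiples of 28 when scaling down is an assumption, though it reproduces all four documented examples:

```python
import math

MIN_PX = 56 * 56           # 3,136 pixels
MAX_PX = 3584 * 3584       # 12,845,056 pixels

def qwen_image_tokens(width: int, height: int, detail: str = "high") -> int:
    """Estimate Qwen token cost for an image, per the quoted billing rules."""
    if detail in ("low", "auto"):
        return 256  # everything is resized to 448 x 448 -> 256 tokens
    # Round each side up to the nearest multiple of 28.
    w = math.ceil(width / 28) * 28
    h = math.ceil(height / 28) * 28
    pixels = w * h
    if pixels > MAX_PX:
        # Scale down proportionally, keeping both sides multiples of 28 (assumed: floor).
        scale = math.sqrt(MAX_PX / pixels)
        w = math.floor(w * scale / 28) * 28
        h = math.floor(h * scale / 28) * 28
    elif pixels < MIN_PX:
        # Scale up proportionally into range (assumed: ceil).
        scale = math.sqrt(MIN_PX / pixels)
        w = math.ceil(w * scale / 28) * 28
        h = math.ceil(h * scale / 28) * 28
    return (w // 28) * (h // 28)

print(qwen_image_tokens(224, 448))     # 128
print(qwen_image_tokens(1024, 1024))   # 1369
print(qwen_image_tokens(3172, 4096))   # 16240
```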
Solution

- Wait for SiliconFlow to officially support video/stream inputs, and notify the Dify maintainers (and the SiliconFlow plugin author) to update the plugin accordingly.
- Alternatively, switch to the Dify Tongyi plugin and use an Alibaba Cloud Bailian (Model Studio) API key.
Zhipu’s GLM-4.6V and Qwen2.5-VL are also supported.
Yes, we currently handle this by extracting frames from the video, but it only supports 6 images. We suggest that the platform relax its restrictions on video files in the future: since each vendor’s model capabilities are provided through plugins, the platform does not need to worry about vendor support and only needs to pass the file as a parameter to the model API. We hope for future compatibility.
I have tested the video-capable model glm-4.1v-thinking-flashx, and it can indeed receive uploaded videos. Would it be convenient for you to share the model you are using and your workflow for further investigation?
Here’s the thing: workflows do support uploading videos, but the uploaded video files cannot actually be passed to the large model, so the model never reads them.
@Abyssedge I have updated the explanation: Does the latest version of Dify still not support video upload? - #6 by Dify_Forum_Helper
Thanks, understood. So SiliconFlow lacks the processing logic to convert video files uploaded to Dify, which means user-uploaded files cannot be passed to the model. However, if a URL is given as input, it works.
Yes, this means that currently, if you use SiliconFlow, you’ll need to handle the video-to-image conversion yourself first.
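Until then, the manual conversion can be as simple as sampling evenly spaced timestamps and cutting one frame per timestamp with ffmpeg. The sketch below only builds the commands; the video path and ffmpeg availability are assumptions, and the cap of 6 matches the image limit mentioned above:

```python
def sample_timestamps(duration_s: float, max_frames: int = 6) -> list[float]:
    """Evenly spaced timestamps, capped at max_frames (6 images in this setup)."""
    n = min(max_frames, max(1, int(duration_s)))
    step = duration_s / n
    # Sample at the midpoint of each of the n equal segments.
    return [round(step * i + step / 2, 2) for i in range(n)]

def ffmpeg_commands(video_path: str, duration_s: float) -> list[list[str]]:
    """Build one single-frame ffmpeg extraction command per sampled timestamp."""
    cmds = []
    for i, t in enumerate(sample_timestamps(duration_s)):
        cmds.append(["ffmpeg", "-ss", str(t), "-i", video_path,
                     "-frames:v", "1", f"frame_{i}.jpg"])
    return cmds

cmds = ffmpeg_commands("input.mp4", 30.0)
for c in cmds:
    print(" ".join(c))
# Run each with subprocess.run(c, check=True) if ffmpeg is installed,
# then upload the resulting frame_*.jpg files as images.
```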
@Dify_Forum_Helper Summarize the Q&A in no more than five sentences.
- Dify’s frontend and workflow currently support uploading video files, but whether they can actually be read by the model depends on the model used and its plugin chain.
- The core issue in this post is that SiliconFlow currently only performs visual processing on images and does not convert user-uploaded videos in Dify to pass them to the model.
- Therefore, the same models (such as GLM and Qwen series), if called directly via official APIs using a video URL, can understand video. However, when used in Dify via the SiliconFlow plugin, they do not receive video content.
- At this stage, when using SiliconFlow, you need to manually perform “video to image conversion/frame extraction” or switch to models and plugins that natively support video, such as Tongyi Qianwen.
- We will need to wait for official SiliconFlow support for video streams, followed by a plugin update, to enable Dify to automatically pass uploaded videos to the model.





