Does Dify's latest version still not support video upload?

Whether it’s a chat assistant or a workflow configuration, multimodal models are unable to receive video files.

What kind of error message did you get?

I am using a model that supports video understanding, and the visual section is enabled. Not only does the interface not support video files, but files uploaded via API calls also cannot be received by the large model.

(screenshot attached)

Currently, very few large models support video uploads, so it's best to test for support during the setup phase. One known model that supports video is Claude 4.5. @Dify_Forum_Helper

At the moment, the situation can be viewed on two levels: whether the model supports video, and whether the Dify frontend/API accepts video file types.

First, regarding whether file upload is supported at all: it depends on whether the video file type is enabled under files in the app's features configuration.
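As an illustration only: in an exported Dify app DSL, the relevant section looks roughly like the fragment below. The exact key names vary between Dify versions, so check a freshly exported YAML from your own instance rather than copying this:

```yaml
# Illustrative fragment of a Dify app DSL export; key names may differ by version.
features:
  file_upload:
    enabled: true
    allowed_file_types:
      - video        # video must be listed for video uploads to be accepted
    number_limits: 1
```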

Second, for GLM, according to the official manual, the sample code is as follows:

# Video understanding example, upload video URL
from zhipuai import ZhipuAI

client = ZhipuAI(api_key="YOUR API KEY")  # Fill in your own API key
response = client.chat.completions.create(
    model="glm-4v-plus-0111",  # Fill in the model name to call
    messages=[
      {
        "role": "user",
        "content": [
          {
            "type": "video_url",
            "video_url": {
                "url" : "https://sfile.chatglm.cn/testpath/video/xxxxx.mp4"
            }
          },
          {
            "type": "text",
            "text": "Please describe this video in detail."
          }
        ]
      }
    ]
)
print(response.choices[0].message)

This shows that for GLM models, users only need to pass a valid video URL for the model to analyze the video.

If using the Qwen model series, according to the official Qwen API documentation:

Qwen-VL analyzes content by extracting a sequence of frames from the video. The frame extraction frequency determines the granularity of analysis. Different SDKs have different default frame extraction frequencies, and the model supports controlling the frequency via the fps parameter (extract one frame every 1/fps seconds, range [0.1, 10], default 2.0). It is recommended to set a higher fps for fast-motion scenes and a lower fps for static or long videos.

import dashscope
import os

# The following is the Singapore region base_url. If using the Virginia region model,
# change base_url to https://dashscope-us.aliyuncs.com/api/v1
# If using the Beijing region model, change base_url to: https://dashscope.aliyuncs.com/api/v1
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'
messages = [
    {"role": "user",
        "content": [
            # fps can control the video frame extraction frequency, meaning one frame is extracted every 1/fps seconds.
            # Full usage: https://www.alibabacloud.com/help/en/model-studio/use-qwen-by-calling-api?#2ed5ee7377fum
            {"video": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241115/cqqkru/1.mp4","fps":2},
            {"text": "What is this video about?"}
        ]
    }
]

response = dashscope.MultiModalConversation.call(
    # API keys differ by region. Get an API key:
    # https://www.alibabacloud.com/help/en/model-studio/get-api-key
    # If you don't have environment variables configured, replace the next line with: api_key="sk-xxx"
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    model='qwen3-vl-plus',
    messages=messages
)

print(response.output.choices[0].message.content[0]["text"])
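As a quick sanity check of the fps rule above, the number of frames Qwen-VL samples from a clip is roughly duration × fps. The helper below is illustrative arithmetic, not part of the DashScope SDK:

```python
def frames_extracted(duration_s: float, fps: float = 2.0) -> int:
    """Estimate how many frames Qwen-VL samples: one frame every 1/fps seconds."""
    if not 0.1 <= fps <= 10:
        raise ValueError("fps must be in [0.1, 10]")
    return max(1, int(duration_s * fps))

# A 30-second clip at the default fps=2 yields about 60 sampled frames;
# lowering fps to 0.5 for a long static video yields about 15.
print(frames_extracted(30))       # 60
print(frames_extracted(30, 0.5))  # 15
```

This is why a higher fps suits fast-motion scenes (finer temporal granularity) while a lower fps keeps token cost down for long or static videos.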

Another form, based on Alibaba’s official manual:

import os
# dashscope version must be >= 1.20.10
import dashscope

dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'
messages = [{"role": "user",
             "content": [
                 # If the model belongs to the Qwen2.5-VL series and an image list is provided,
                 # you can set fps to indicate that the image list was extracted from the original video every 1/fps seconds.
                 {"video": ["https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/xzsgiz/football1.jpg",
                           "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/tdescd/football2.jpg",
                           "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/zefdja/football3.jpg",
                           "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/aedbqh/football4.jpg"],
                  "fps": 2},
                 {"text": "Describe the detailed process shown in this video."}]}]
response = dashscope.MultiModalConversation.call(
    # If you don't have environment variables configured, replace the next line with: api_key="sk-xxx"
    # API keys differ for Singapore/Virginia and Beijing regions. Get an API key:
    # https://www.alibabacloud.com/help/en/model-studio/get-api-key
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    model='qwen2.5-vl-72b-instruct',  # Example model; replace as needed. Model list:
    # https://www.alibabacloud.com/help/en/model-studio/models
    messages=messages
)
print(response["output"]["choices"][0]["message"].content[0]["text"])

From the screenshot, the key point is not whether the user is using GLM or Qwen, but that the user is using the SiliconFlow plugin. According to SiliconFlow’s official documentation, its vision handling works like this:

2. Usage

For VLM models, you can call the /chat/completions endpoint and construct message content that includes an image URL or a base64-encoded image. Use the detail parameter to control the image preprocessing mode.

2.1 Detail parameter

SiliconFlow provides three detail options: low, high, and auto. For currently supported models, if detail is omitted or set to high, high (“high resolution”) mode is used; if set to low or auto, low (“low resolution”) mode is used.
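For reference, a request body for an OpenAI-compatible `/chat/completions` vision call with the `detail` parameter might look like the sketch below; the model name and image URL are placeholders, not values taken from SiliconFlow's docs:

```python
# Illustrative OpenAI-compatible vision request body; model name and URL are
# placeholders. Note the content list only carries image_url/text parts --
# there is no video part, which matches the limitation discussed here.
payload = {
    "model": "Qwen/Qwen2.5-VL-72B-Instruct",  # example VLM name
    "messages": [{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": "https://example.com/photo.jpg",
                           "detail": "low"}},  # low | high | auto
            {"type": "text", "text": "Describe this image."},
        ],
    }],
}
print(payload["messages"][0]["content"][0]["image_url"]["detail"])  # low
```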

4. Billing for visual inputs

Visual inputs such as images are converted into tokens and billed together with text as part of the context. Different models convert visual content differently; below is the current conversion rule.

4.1 Qwen series

Rules:

Qwen supports a maximum resolution of 3584 × 3584 = 12,845,056 pixels and a minimum resolution of 56 × 56 = 3,136 pixels. Each image is first resized so that both sides are multiples of 28, i.e. (h * 28) × (w * 28). If the result falls outside the min/max pixel range, it is further scaled proportionally into that range.

  • When detail=low, all images are resized to 448 × 448, which maps to 256 tokens.

  • When detail=high, the image is scaled proportionally: first round width/height up to the nearest multiple of 28, then scale proportionally into the pixel range (3136, 12845056) while keeping both sides as multiples of 28.

Examples:

  • For images sized 224 × 448, 1024 × 1024, and 3172 × 4096, choosing detail=low always costs 256 tokens.

  • For 224 × 448 with detail=high: it is within the pixel range and both sides are multiples of 28, so cost is (224/28) × (448/28) = 8 × 16 = 128 tokens.

  • For 1024 × 1024 with detail=high: round up to 1036 × 1036 (nearest multiples of 28), within range, so cost is (1036/28) × (1036/28) = 1369 tokens.

  • For 3172 × 4096 with detail=high: round up to 3192 × 4116, exceeding max pixels, then scale proportionally down to 3136 × 4060, so cost is (3136/28) × (4060/28) = 16240 tokens.
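The billing rules above can be encoded in a few lines. This is my own sketch of the arithmetic, not SiliconFlow's implementation, but it reproduces all four worked examples:

```python
import math

MIN_PIX, MAX_PIX = 3136, 12845056  # 56*56 and 3584*3584

def qwen_image_tokens(h: int, w: int, detail: str = "high") -> int:
    """Estimate Qwen-series visual token cost per the rules quoted above."""
    if detail == "low":
        return 256  # everything is resized to 448x448 -> (448/28)^2 = 256
    # Round each side up to the nearest multiple of 28.
    h28 = math.ceil(h / 28) * 28
    w28 = math.ceil(w / 28) * 28
    pixels = h28 * w28
    if pixels > MAX_PIX:
        # Scale down proportionally, keeping both sides multiples of 28.
        scale = math.sqrt(MAX_PIX / pixels)
        h28 = math.floor(h28 * scale / 28) * 28
        w28 = math.floor(w28 * scale / 28) * 28
    elif pixels < MIN_PIX:
        scale = math.sqrt(MIN_PIX / pixels)
        h28 = math.ceil(h28 * scale / 28) * 28
        w28 = math.ceil(w28 * scale / 28) * 28
    return (h28 // 28) * (w28 // 28)

print(qwen_image_tokens(224, 448, "low"))   # 256
print(qwen_image_tokens(224, 448))          # 128
print(qwen_image_tokens(1024, 1024))        # 1369
print(qwen_image_tokens(3172, 4096))        # 16240
```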

The official API docs do not mention video handling, which indicates that the core issue is not Dify itself, but that SiliconFlow does not support video processing. This also explains why the user mentioned doing manual frame extraction.

Solution

  • Wait for SiliconFlow to officially support video/stream inputs, and notify Dify maintainers (and the SiliconFlow plugin author) to update the plugin accordingly.

  • Alternatively, switch to the Dify Tongyi plugin and use an Alibaba Cloud Bailian (Model Studio) API key.

Zhipu's GLM-4.6V and Qwen2.5-VL are also supported.

Yes, currently we work around this by extracting frames from videos, but only 6 images are supported. We suggest the platform relax its restrictions on video file types in the future: since vendor model capabilities are delivered through plugins, the platform does not need to worry about vendor support; it only needs to pass the file as a parameter to the model API. We hope for future compatibility.

I have tested the video-capable model glm-4.1v-thinking-flashx, and it can indeed upload videos. Therefore, would it be convenient for you to provide the model you are using and your workflow for further investigation?

Here's the thing: uploading videos is supported through workflows, but the uploaded video files cannot actually be passed to the large model, so the model never reads them.

This is the input and configuration for the model-processing stage.

@Abyssedge I have updated the explanation: Does Dify's latest version still not support video upload? - #6 by Dify_Forum_Helper

Thanks, understood. So SiliconFlow lacks the processing logic to convert video files uploaded to Dify, which means user-uploaded files cannot be passed to the model; a video URL, however, does work.

Yes, this means that currently, if you use SiliconFlow, you’ll need to handle the video-to-image conversion yourself first.
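If you do the frame extraction yourself (e.g. to stay within the 6-image limit mentioned above), the usual approach is to pick evenly spaced frames. The index helper below is a sketch; the actual frame decoding would be done with a tool such as OpenCV or ffmpeg:

```python
def sample_frame_indices(total_frames: int, max_frames: int = 6) -> list[int]:
    """Pick up to max_frames evenly spaced frame indices (midpoint of each segment)."""
    if total_frames <= max_frames:
        return list(range(total_frames))
    step = total_frames / max_frames
    return [int(i * step + step / 2) for i in range(max_frames)]

# A 20-second clip at 30 fps has 600 frames; the 6 sampled indices are
# the midpoints of six equal segments.
print(sample_frame_indices(600))  # [50, 150, 250, 350, 450, 550]
```

The extracted frames can then be sent as an image list, as in the second Qwen example above.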

@Dify_Forum_Helper Summarize the Q&A in no more than five sentences.

  1. Dify’s frontend and workflow currently support uploading video files, but whether they can actually be read by the model depends on the model used and its plugin chain.
  2. The core issue in this post is that SiliconFlow currently only performs visual processing on images and does not convert user-uploaded videos in Dify to pass them to the model.
  3. Therefore, the same models (such as GLM and Qwen series), if called directly via official APIs using a video URL, can understand video. However, when used in Dify via the SiliconFlow plugin, they do not receive video content.
  4. At this stage, when using SiliconFlow, you need to manually perform “video to image conversion/frame extraction” or switch to models and plugins that natively support video, such as Tongyi Qianwen.
  5. We will need to wait for official SiliconFlow support for video streams, followed by a plugin update, to enable Dify to automatically pass uploaded videos to the model.