Can Dify only utilize the text capabilities of the model?

In Dify, is calling tools the only way to achieve image generation? This differs from directly invoking the API of a model with image-generation capability: a direct API call returns an image based on your textual description, but in Dify this doesn't seem possible. Is it simply that Dify does not support this functionality, and only supports using a model's text capabilities to invoke related tools?
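For comparison, here is a minimal sketch of the "direct API call" the question describes, using Stability AI's v1 text-to-image REST endpoint as an example. The engine ID, payload fields, and endpoint path follow Stability's public docs as I understand them — treat them as assumptions and check your provider's reference before use.

```python
import os

API_HOST = "https://api.stability.ai"
# Example engine ID; availability varies by account (assumption).
ENGINE_ID = "stable-diffusion-xl-1024-v1-0"

def build_request(prompt: str, width: int = 1024, height: int = 1024):
    """Assemble the URL, headers, and JSON body for a direct
    text-to-image call, without sending anything over the network."""
    url = f"{API_HOST}/v1/generation/{ENGINE_ID}/text-to-image"
    headers = {
        "Authorization": f"Bearer {os.environ.get('STABILITY_API_KEY', '')}",
        "Accept": "application/json",
        "Content-Type": "application/json",
    }
    body = {
        "text_prompts": [{"text": prompt}],
        "width": width,
        "height": height,
        "samples": 1,
    }
    return url, headers, body

# To actually send the request (needs the `requests` package and an API key):
# import requests
# url, headers, body = build_request("a watercolor fox")
# resp = requests.post(url, headers=headers, json=body)
# resp.raise_for_status()  # the response carries base64-encoded image artifacts
```

The point of the sketch: the image comes straight back from the model's API, with no intermediate tool node — which is exactly the behavior the question says is missing when going through Dify.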


I have the same impression. The image-generation tutorial in the official documentation works by adding the Stability AI drawing tool to the toolset and calling the Stability API.

I later asked Gemini, and it told me these interfaces are not yet available; the Rerank model in Ollama likewise cannot be recognized or used in Dify.

Starting from Dify 1.4.0, LLM nodes support multimodal output of both text and images.
For details, see the release notes: Release v1.4.0 · langgenius/dify · GitHub

However, to actually use this feature, both the model and the plugin must support this output format.
Some models in the Gemini plugin appear to support it, but I'm not sure about other providers.