Hi everyone,
I’ve been building chatbots with Dify and running into quality issues that only show up in multi-turn conversations — not single-turn Q&A.
## The problem
Single-turn testing is straightforward: ask a question, check the answer. But once users have multi-turn conversations, different kinds of failures appear:
- **RAG retrieval drift:** As the conversation grows longer, the retrieval query becomes a mix of multiple topics. The knowledge base starts returning less relevant chunks, and the bot confidently answers with information from the wrong document.
- **Instruction dilution:** Over 8-10+ turns, the bot gradually drifts from its system prompt constraints: the tone shifts, it starts answering out-of-scope questions it should have declined, or it stops following formatting rules.
- **Silent regressions:** After updating a workflow (system prompt change, RAG parameter adjustment, model swap), conversation patterns that worked before break, with no errors in the logs.
These are hard to catch because there’s nothing in the logs that says “wrong answer.” The LLM call succeeded, the response looks fluent, but the information is incorrect or the behavior has drifted from what you intended.
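To even reproduce these failures consistently, I've been sketching a small replay harness: script a fixed sequence of turns and run them through Dify's `POST /v1/chat-messages` endpoint, threading the `conversation_id` so the bot sees one continuous conversation. This is a minimal sketch, not a polished tool; the URL, payload shape, and response fields follow Dify's API docs as I understand them, and the `eval-harness` user id is just a placeholder:

```python
import json
import urllib.request

DIFY_URL = "https://api.dify.ai/v1/chat-messages"  # or your self-hosted instance

def dify_send(api_key, query, conversation_id=""):
    """Send one turn to a Dify chat app; return (answer, conversation_id)."""
    payload = json.dumps({
        "inputs": {},
        "query": query,
        "response_mode": "blocking",
        "conversation_id": conversation_id,
        "user": "eval-harness",  # placeholder end-user id
    }).encode()
    req = urllib.request.Request(
        DIFY_URL,
        data=payload,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        data = json.load(resp)
    return data["answer"], data["conversation_id"]

def replay(send, turns):
    """Replay a scripted conversation, threading the conversation_id so
    every turn lands in the same Dify conversation; return the transcript."""
    conversation_id, transcript = "", []
    for query in turns:
        answer, conversation_id = send(query, conversation_id)
        transcript.append((query, answer))
    return transcript
```

The `send` function is injected (e.g. `replay(lambda q, c: dify_send(API_KEY, q, c), turns)`) so the same runner works against a stub in unit tests or a live app.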
## What I’ve looked into
Dify integrates with several observability/eval tools:
| Tool | What it offers |
|---|---|
| LangSmith | Datasets + Evaluators, LLM-as-Judge, human feedback |
| Langfuse | Datasets, LLM-as-Judge, human feedback, custom scores |
| Opik | LLM-as-Judge, 8 conversation-specific metrics, dataset evaluation |
| Arize AX | LLM-as-Judge, Session Evals, human annotation |
| Phoenix | LLM-as-Judge, Evaluator Hub |
These are great for tracing and single-turn evaluation. But as far as I can tell, none of them let you design a multi-turn conversation scenario (e.g., “ask X, then based on the response ask Y or Z”) and run it against a Dify chatbot end-to-end.
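To make the "ask X, then based on the response ask Y or Z" idea concrete, here's the shape of what I'd want: a scenario tree where each node holds a question and a rule that inspects the answer to pick the next node. Everything below (the `Step` type, the refund questions, the branching predicate) is my own invented sketch, not any tool's API:

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Step:
    """One scenario node: a question, plus a rule that inspects the
    answer and picks the next node (None ends the conversation)."""
    query: str
    next_step: Optional[Callable[[str], Optional["Step"]]] = None

def run_scenario(send, step, conversation_id=""):
    """Walk a branching scenario against a chatbot; return the transcript."""
    transcript = []
    while step is not None:
        answer, conversation_id = send(step.query, conversation_id)
        transcript.append((step.query, answer))
        step = step.next_step(answer) if step.next_step else None
    return transcript

# "Ask X, then based on the response ask Y or Z":
scenario = Step(
    "What is your refund policy?",                      # X
    next_step=lambda ans: Step("How do I request one?")  # Y
    if "refund" in ans.lower()
    else Step("So purchases are final?"),               # Z
)
```

The branching predicate here is a naive substring check; in practice you'd probably want an LLM-as-judge or classifier deciding which branch a real answer falls into.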
I searched this forum and GitHub Discussions and was surprised to find almost no discussion about systematic chatbot quality evaluation — despite 211 GitHub issues mentioning regressions after updates and 524 about observability.
## My questions to the community
- How do you test your Dify chatbot quality before releasing changes? One of the tools above? Manual testing in the preview? Custom scripts? Something else?
- Have you experienced silent regressions after updating a workflow or RAG configuration? How did you catch them?
- Is anyone doing multi-turn evaluation — testing entire conversation flows rather than individual Q&A pairs?
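For the silent-regression question, the closest I've gotten is pinning per-turn expectations against a transcript and diffing runs before and after a workflow change. The crude version below uses required/banned phrases per turn; the same hook could call an LLM-as-judge instead. This is a hypothetical sketch (the expectation format is my own invention):

```python
def check_transcript(transcript, expectations):
    """Flag turns whose answer misses a required phrase or contains a
    banned one. transcript: list of (query, answer) pairs;
    expectations: list of (must_contain, must_not_contain) per turn."""
    failures = []
    pairs = zip(transcript, expectations)
    for i, ((query, answer), (must, must_not)) in enumerate(pairs):
        low = answer.lower()
        if any(phrase.lower() not in low for phrase in must):
            failures.append((i, query, "missing required phrase"))
        if any(phrase.lower() in low for phrase in must_not):
            failures.append((i, query, "contains banned phrase"))
    return failures
```

Run it on the baseline transcript and on the post-update transcript; new failures in the second run are exactly the silent regressions that never show up in logs.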
I’d love to hear what’s working (or not working) for you.