How do you measure the quality of multi-turn conversations in your Dify chatbot?

Hi everyone,

I’ve been building chatbots with Dify and running into quality issues that only show up in multi-turn conversations — not single-turn Q&A.

The problem

Single-turn testing is straightforward: ask a question, check the answer. But once users have multi-turn conversations, different kinds of failures appear:

  • RAG retrieval drift: As the conversation grows longer, the retrieval query becomes a mix of multiple topics. The knowledge base starts returning less relevant chunks, and the bot confidently answers with information from the wrong document.

  • Instruction dilution: Over 8–10+ turns, the bot gradually drifts from its system prompt constraints — the tone shifts, it starts answering out-of-scope questions it should have declined, or it stops following formatting rules.

  • Silent regressions: After updating a workflow (system prompt change, RAG parameter adjustment, model swap), conversation patterns that worked before break — with no errors in the logs.

These are hard to catch because there’s nothing in the logs that says “wrong answer.” The LLM call succeeded, the response looks fluent, but the information is incorrect or the behavior has drifted from what you intended.
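
One thing I've been experimenting with (not a Dify feature — just a home-grown idea) is a replay-style regression check: save a known-good conversation, re-run it turn by turn after every workflow change, and assert each reply still satisfies a simple predicate. The sketch below stubs out the bot so it's self-contained; in a real run `fake_bot` would be replaced by a call to the chatbot API:

```python
# Replay a saved "golden" conversation and flag turns whose replies no
# longer satisfy their check. The predicates here are simple keyword
# checks; an LLM-as-judge call could be swapped in instead.

GOLDEN = [
    # (user turn, predicate the reply must satisfy)
    ("What is your refund window?", lambda r: "30 days" in r),
    ("Can I get cash back?",        lambda r: "store credit" in r.lower()),
]

def fake_bot(question):
    # Stub standing in for the real chatbot endpoint.
    replies = {
        "What is your refund window?": "Refunds are accepted within 30 days.",
        "Can I get cash back?": "No, we only offer store credit.",
    }
    return replies[question]

def replay(send, golden):
    """Re-run each turn; collect (turn number, question, reply) for failures."""
    failures = []
    for turn, (question, check) in enumerate(golden, start=1):
        reply = send(question)
        if not check(reply):
            failures.append((turn, question, reply))
    return failures

failures = replay(fake_bot, GOLDEN)
print("regressions:", failures)  # an empty list means the behavior held
```

It's crude — keyword predicates miss a lot of the drift described above — but it at least turns "silent" regressions into a red diff in CI.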

What I’ve looked into

Dify integrates with several observability/eval tools:

  Tool        What it offers
  LangSmith   Datasets + Evaluators, LLM-as-Judge, human feedback
  Langfuse    Datasets, LLM-as-Judge, human feedback, custom scores
  Opik        LLM-as-Judge, 8 conversation-specific metrics, dataset evaluation
  Arize AX    LLM-as-Judge, Session Evals, human annotation
  Phoenix     LLM-as-Judge, Evaluator Hub

These are great for tracing and single-turn evaluation. But as far as I can tell, none of them let you design a multi-turn conversation scenario (e.g., “ask X, then based on the response ask Y or Z”) and run it against a Dify chatbot end-to-end.
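
To make the "ask X, then based on the response ask Y or Z" idea concrete, here's a minimal sketch of what I imagine such a scenario runner could look like. It's hypothetical wiring, not any tool's actual API: `send` is whatever calls your chatbot (for Dify, presumably a POST to the chat endpoint, threading the conversation ID); the stub below just returns canned replies so the sketch runs on its own:

```python
# Walk a branching scenario tree: ask, inspect the reply, pick the next
# branch by keyword match, fall back to a default branch if nothing matches.

def run_scenario(send, scenario):
    transcript = []
    node = scenario
    while node:
        reply = send(node["ask"])
        transcript.append((node["ask"], reply))
        branches = node.get("branches", {})
        # First branch whose trigger keyword appears in the reply wins.
        node = next(
            (child for key, child in branches.items() if key in reply.lower()),
            node.get("default"),
        )
    return transcript

def fake_bot(question):
    # Stub standing in for the real chatbot endpoint.
    return {
        "Do you ship abroad?": "Yes, we ship internationally.",
        "Which countries?": "EU and UK only.",
    }.get(question, "Sorry?")

scenario = {
    "ask": "Do you ship abroad?",
    "branches": {"yes": {"ask": "Which countries?"}},
    "default": {"ask": "Why not?"},
}

transcript = run_scenario(fake_bot, scenario)
for q, a in transcript:
    print(f"> {q}\n< {a}")
```

Keyword branching is obviously fragile; an LLM classifier could pick the branch instead. The point is that none of the tools above seem to offer even this much scenario structure against a live Dify chatbot.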

I searched this forum and GitHub Discussions and was surprised to find almost no discussion about systematic chatbot quality evaluation — despite 211 GitHub issues mentioning regressions after updates and 524 about observability.

My questions to the community

  1. How do you test your Dify chatbot quality before releasing changes? One of the tools above? Manual testing in the preview? Custom scripts? Something else?

  2. Have you experienced silent regressions after updating a workflow or RAG configuration? How did you catch them?

  3. Is anyone doing multi-turn evaluation — testing entire conversation flows rather than individual Q&A pairs?

I’d love to hear what’s working (or not working) for you.
