How do you measure the quality of multi-turn conversations in your Dify chatbot?

Hi everyone,

I’ve been building chatbots with Dify and running into quality issues that only show up in multi-turn conversations — not single-turn Q&A.

The problem

Single-turn testing is straightforward: ask a question, check the answer. But once users have multi-turn conversations, different kinds of failures appear:

  • RAG retrieval drift: As the conversation grows longer, the retrieval query becomes a mix of multiple topics. The knowledge base starts returning less relevant chunks, and the bot confidently answers with information from the wrong document.

  • Instruction dilution: Over 8–10+ turns, the bot gradually drifts from its system prompt constraints — the tone shifts, it starts answering out-of-scope questions it should have declined, or it stops following formatting rules.

  • Silent regressions: After updating a workflow (system prompt change, RAG parameter adjustment, model swap), conversation patterns that worked before break — with no errors in the logs.

These are hard to catch because there’s nothing in the logs that says “wrong answer.” The LLM call succeeded, the response looks fluent, but the information is incorrect or the behavior has drifted from what you intended.
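
One thing I've been experimenting with (not a Dify feature — just a home-grown idea) is a replay-style regression check: save a known-good conversation, re-run it turn by turn after every workflow change, and assert each reply still satisfies a simple predicate. The sketch below stubs out the bot so it's self-contained; in a real run `fake_bot` would be replaced by a call to the chatbot API:

```python
# Replay a saved "golden" conversation and flag turns whose replies no
# longer satisfy their check. The predicates here are simple keyword
# checks; an LLM-as-judge call could be swapped in instead.

GOLDEN = [
    # (user turn, predicate the reply must satisfy)
    ("What is your refund window?", lambda r: "30 days" in r),
    ("Can I get cash back?",        lambda r: "store credit" in r.lower()),
]

def fake_bot(question):
    # Stub standing in for the real chatbot endpoint.
    replies = {
        "What is your refund window?": "Refunds are accepted within 30 days.",
        "Can I get cash back?": "No, we only offer store credit.",
    }
    return replies[question]

def replay(send, golden):
    """Re-run each turn; collect (turn number, question, reply) for failures."""
    failures = []
    for turn, (question, check) in enumerate(golden, start=1):
        reply = send(question)
        if not check(reply):
            failures.append((turn, question, reply))
    return failures

failures = replay(fake_bot, GOLDEN)
print("regressions:", failures)  # an empty list means the behavior held
```

It's crude — keyword predicates miss a lot of the drift described above — but it at least turns "silent" regressions into a red diff in CI.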

What I’ve looked into

Dify integrates with several observability/eval tools:

  Tool        What it offers
  LangSmith   Datasets + Evaluators, LLM-as-Judge, human feedback
  Langfuse    Datasets, LLM-as-Judge, human feedback, custom scores
  Opik        LLM-as-Judge, 8 conversation-specific metrics, dataset evaluation
  Arize AX    LLM-as-Judge, Session Evals, human annotation
  Phoenix     LLM-as-Judge, Evaluator Hub

These are great for tracing and single-turn evaluation. But as far as I can tell, none of them let you design a multi-turn conversation scenario (e.g., “ask X, then based on the response ask Y or Z”) and run it against a Dify chatbot end-to-end.
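
To make the "ask X, then based on the response ask Y or Z" idea concrete, here's a minimal sketch of what I imagine such a scenario runner could look like. It's hypothetical wiring, not any tool's actual API: `send` is whatever calls your chatbot (for Dify, presumably a POST to the chat endpoint, threading the conversation ID); the stub below just returns canned replies so the sketch runs on its own:

```python
# Walk a branching scenario tree: ask, inspect the reply, pick the next
# branch by keyword match, fall back to a default branch if nothing matches.

def run_scenario(send, scenario):
    transcript = []
    node = scenario
    while node:
        reply = send(node["ask"])
        transcript.append((node["ask"], reply))
        branches = node.get("branches", {})
        # First branch whose trigger keyword appears in the reply wins.
        node = next(
            (child for key, child in branches.items() if key in reply.lower()),
            node.get("default"),
        )
    return transcript

def fake_bot(question):
    # Stub standing in for the real chatbot endpoint.
    return {
        "Do you ship abroad?": "Yes, we ship internationally.",
        "Which countries?": "EU and UK only.",
    }.get(question, "Sorry?")

scenario = {
    "ask": "Do you ship abroad?",
    "branches": {"yes": {"ask": "Which countries?"}},
    "default": {"ask": "Why not?"},
}

transcript = run_scenario(fake_bot, scenario)
for q, a in transcript:
    print(f"> {q}\n< {a}")
```

Keyword branching is obviously fragile; an LLM classifier could pick the branch instead. The point is that none of the tools above seem to offer even this much scenario structure against a live Dify chatbot.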

I searched this forum and GitHub Discussions and was surprised to find almost no discussion about systematic chatbot quality evaluation — despite 211 GitHub issues mentioning regressions after updates and 524 about observability.

My questions to the community

  1. How do you test your Dify chatbot quality before releasing changes? One of the tools above? Manual testing in the preview? Custom scripts? Something else?

  2. Have you experienced silent regressions after updating a workflow or RAG configuration? How did you catch them?

  3. Is anyone doing multi-turn evaluation — testing entire conversation flows rather than individual Q&A pairs?

I’d love to hear what’s working (or not working) for you.
