Yeah, this is a real problem, especially the retrieval drift and instruction dilution you mentioned. We ran into similar issues where everything looked fine turn-by-turn, but the overall conversation would go off the rails after a few steps.
Most traditional eval tooling is still focused on single-turn Q&A, though newer agent-focused frameworks are starting to shift toward multi-turn evaluation.
One thing that’s been tricky in practice is making simulated conversations realistic and systematic enough to catch regressions reliably across different scenarios.
Tools like Arksim are starting to explore this: you can run conversations with different user goals/personas and see how the system behaves over time. It’s not perfect, but it gets closer to how these failures actually show up in practice.
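For what it’s worth, here’s a rough sketch of the kind of harness I mean. This is not Arksim’s actual API; `Persona`, `run_scenario`, and the callables are all made-up names just to show the shape of a persona-driven multi-turn loop where you judge the whole transcript each turn instead of the last reply.

```python
# Rough sketch of a persona-driven multi-turn eval loop (illustrative only;
# all names here are hypothetical, not any real tool's API).
from dataclasses import dataclass
from typing import Callable, Dict, List

Message = Dict[str, str]  # {"role": "user" | "assistant", "content": "..."}

@dataclass
class Persona:
    name: str
    goal: str          # what the simulated user is trying to accomplish
    opening: str       # first user message
    max_turns: int = 8

def run_scenario(
    persona: Persona,
    agent: Callable[[List[Message]], str],               # system under test
    user_sim: Callable[[Persona, List[Message]], str],   # simulated user (e.g. an LLM prompted with the goal)
    check: Callable[[Persona, List[Message]], bool],     # is the conversation still on track?
) -> bool:
    history: List[Message] = [{"role": "user", "content": persona.opening}]
    for _ in range(persona.max_turns):
        history.append({"role": "assistant", "content": agent(history)})
        # Judge the whole transcript every turn, not just the latest reply --
        # drift and instruction dilution only show up at this level.
        if not check(persona, history):
            return False
        history.append({"role": "user", "content": user_sim(persona, history)})
    return True
```

The useful part is running the same fixed set of personas against every build and diffing the pass/fail results, so a regression that only appears five turns in still gets flagged.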