Yes, the two points you added complete the picture of why the statistics can differ. Here is the logic laid out in full, so that future readers have the whole explanation in one place.
1. Dify’s Own Statistics: Leaning Towards “In-Product Perspective”
- By configuring tokenizer + pricing in “Model Providers / Custom Model,” Dify can calculate:
- Estimated tokens for each call;
- Approximate cost at the application / workflow / tenant level.
- This statistic is more like an in-application operational perspective:
- Which application is the most expensive, which nodes are the heaviest, which user/API key makes the most calls.
- Used for rate limiting, quotas, cost estimation, and A/B comparison decisions.
But it is inherently an estimate:
- It depends on the tokenizer you select;
- It depends on the price list, whether you enter it manually or use the built-in one;
- It depends on counting policy, such as whether retries, failures, and intermediate nodes are included.
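To make the "estimate" point concrete, here is a minimal Python sketch of how an in-product estimate could be assembled from a tokenizer, a price list, and a counting policy. It is not Dify's actual implementation: `Pricing`, `count_tokens`, `estimate_call`, `estimate_total`, and the `ok` flag are all hypothetical names, and the whitespace tokenizer is only a stand-in for whatever tokenizer you configure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Hypothetical price list entry: currency units per 1,000 tokens."""
    input_per_1k: float
    output_per_1k: float


def count_tokens(text: str) -> int:
    """Stand-in tokenizer (whitespace split). A real tokenizer gives different counts."""
    return len(text.split())


def estimate_call(prompt: str, completion: str, pricing: Pricing) -> dict:
    """Estimate tokens and cost for one call from its text alone."""
    prompt_tokens = count_tokens(prompt)
    completion_tokens = count_tokens(completion)
    cost = (prompt_tokens * pricing.input_per_1k
            + completion_tokens * pricing.output_per_1k) / 1000
    return {
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
        "cost": cost,
    }


def estimate_total(calls: list, pricing: Pricing, include_failed: bool = False) -> float:
    """Sum estimated cost over calls; the policy flag decides whether failed calls count."""
    total = 0.0
    for call in calls:
        if not call["ok"] and not include_failed:
            continue  # policy choice: skip failed calls and retries
        total += estimate_call(call["prompt"], call["completion"], pricing)["cost"]
    return total
```

Swapping the tokenizer, changing the price list, or flipping `include_failed` each changes the total, which is exactly why the in-product number can drift from the provider's bill.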
2. External LLM Observability Tools like Langfuse / LangSmith: Leaning Towards “Engineering + Operations Perspective”
The Monitoring integrations in your diagram (Langfuse, LangSmith, Opik, MLflow, Databricks, W&B, Arize, Alibaba Cloud Monitoring, Tencent Cloud APM, etc.) address a different layer of the problem:
Not only do you need to know “how many tokens / how much money was spent,” but also “what exactly happened” for a specific call, a specific Prompt, or a specific path.
Typical capabilities include:
- Complete Trace: Request chain, input and output of each node, model selection, time consumption.
- More Detailed Token & Cost Analysis:
- Some tools directly read the usage field returned by the model;
- They also have their own set of tokenizer / cost models, which can be cross-referenced with Dify’s estimates.
- Quality Evaluation: Automatic/manual scoring, playback, regression testing, RAG quality evaluation, etc.
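As a rough illustration of reading the usage field and cross-referencing it with an estimate, here is a small Python sketch. It assumes an OpenAI-style response shape, where `usage` carries `prompt_tokens`, `completion_tokens`, and `total_tokens`; other providers may name these fields differently. `extract_usage` and `relative_gap` are hypothetical helpers, not functions from Langfuse, LangSmith, or Dify.

```python
from typing import Optional


def extract_usage(response: dict) -> dict:
    """Pull provider-reported token counts out of a response payload.

    Assumes OpenAI-style field names; missing fields fall back to 0.
    """
    usage = response.get("usage") or {}
    prompt_tokens = int(usage.get("prompt_tokens", 0))
    completion_tokens = int(usage.get("completion_tokens", 0))
    total = usage.get("total_tokens")
    if total is None:
        total = prompt_tokens + completion_tokens
    return {
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
        "total_tokens": int(total),
    }


def relative_gap(reported_total: int, estimated_total: int) -> Optional[float]:
    """Relative difference of an estimate against the provider-reported total."""
    if reported_total == 0:
        return None  # nothing to compare against
    return (estimated_total - reported_total) / reported_total
```

A positive gap means the estimate overcounts relative to what the model reported; tracking that gap per model is one simple way to see whether the configured tokenizer is a good match.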
They are not a replacement for Dify’s backend statistics but rather a parallel “second perspective,” more suitable for R&D, SRE, and data teams for in-depth analysis.
3. How to Understand the Three Sets of Numbers: Dify vs. Cloud Provider vs. Observability Platform
If you think of the chain like this:
User → Dify Application (with its own internal token & cost estimation)
→ Observability Platform (Langfuse / LangSmith etc. for tracing + evaluation)
→ Model Cloud Provider (OpenAI / Alibaba Cloud / Tencent Cloud etc. for final billing)
Then the usage / token / cost seen in these three places:
- Cloud Provider: The final real bill, which finance must use as the standard.
- Dify Backend:
- An “application/workflow-centric” perspective, emphasizing visualization and operability;
- With a custom tokenizer and pricing, its numbers can be brought to the same order of magnitude as the cloud provider’s, though not to an exact match.
- Langfuse / LangSmith etc.:
- A “call chain / trace / experiment-centric” perspective;
- Helps you optimize prompts, paths, model selection, and even provides more granular token & cost statistics.
In practice, these three sets of numbers will never match exactly. The recommended division of use is:
- Bill Reconciliation / Cost Settlement: Use the cloud provider as the standard;
- Product Operations / User Quotas / Application Ranking: Dify’s built-in statistics are sufficient;
- Debugging / Diagnosis / Improving Quality & Performance: Look at traces, metrics, and evaluations in Langfuse / LangSmith etc.
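If you want to track how far the three sources drift apart, a reconciliation check could look roughly like the Python sketch below. It treats the cloud provider's figure as the baseline, as recommended above, and reports how far the other two deviate. The function name `reconcile`, the source labels, and the 10% default tolerance are illustrative assumptions, not values from any of these tools.

```python
def reconcile(provider_cost: float,
              dify_cost: float,
              observability_cost: float,
              tolerance: float = 0.10) -> dict:
    """Compare Dify and observability-platform costs against the provider bill.

    Returns, per source, the relative deviation from the provider figure and
    whether it falls within the (assumed) tolerance.
    """
    if provider_cost <= 0:
        raise ValueError("provider_cost must be positive to serve as the baseline")

    report = {}
    for name, value in (("dify", dify_cost), ("observability", observability_cost)):
        deviation = (value - provider_cost) / provider_cost
        report[name] = {
            "deviation": deviation,
            "within_tolerance": abs(deviation) <= tolerance,
        }
    return report


# Example with made-up monthly totals in the same currency.
report = reconcile(provider_cost=100.0, dify_cost=92.0, observability_cost=103.0)
for source, result in report.items():
    print(source, f"{result['deviation']:+.1%}", result["within_tolerance"])
```

A source that repeatedly falls outside the tolerance is a signal to revisit its tokenizer, price list, or counting policy, not a reason to distrust the provider's bill.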
4. A One-Sentence Summary of Your Original Question
Why do Dify’s backend token counts differ from the cloud provider’s?
Expanding on that:
- Dify gives you a “configurable, in-product estimation perspective,” which can approach the cloud provider’s numbers through tokenizer + pricing;
- The cloud provider gives you the “final settlement perspective”;
- Langfuse / LangSmith etc. provide an “engineering and observability perspective,” making it easier for you to understand where these differences come from and to iterate on your application.
The blog post you cited also points this out: Dify comes with basic statistics, while LangSmith / Langfuse are responsible for more granular cost & token analysis and LLMOps capabilities; the two are complementary.
If you plan to unify the usage of these three sets of statistics in a production environment later (e.g., providing a unified report to business stakeholders), I can help you design a set of “standard definitions + reconciliation procedures” so that product, engineering, and finance teams each know what to look at and how to interpret the differences.