Why Dify Token statistics differ from cloud provider

Why are the token statistics in the Dify backend different from those of the cloud vendor?

The discrepancy between “Dify backend token statistics” and “cloud vendor statistics” (e.g., OpenAI, Alibaba, Tencent) usually comes down to a few causes. You can check them one by one:


1. Inconsistent Statistical Methodology

Common differences:

  1. Whether prompt / system / tool call are all included

    • Cloud vendors generally include:
      • Input: system + user + tool call and all other prompt text
      • Output: All tokens generated by the model
    • In Dify:
      • Some places only show “Total tokens”
      • Some views separate “Model call tokens” + “Knowledge base recall/embedding tokens”
    • If you only looked at one dimension in Dify, it’s normal for it not to match the cloud vendor’s “Total usage”.
  2. Whether knowledge base embedding / retrieval overhead is included

    • Dify knowledge base involves:
      • Document import vectorization
      • Query embedding / rerank during conversation retrieval
    • Tokens consumed by these embedding models are sometimes counted separately by cloud vendors under “Usage / Embeddings”;
    • Dify backend also counts them into usage, but you might have only looked at the usage of “Chat/Completion” type models in the cloud vendor console.
  3. Whether to compare by “number of requests” or “number of tokens”

    • Some cloud vendor consoles default to displaying “number of requests” or “amount”, not tokens.
    • You need to switch to “by token dimension” in the cloud vendor console for comparison.
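
For example, when matching a Dify “total tokens” figure against a vendor console, it helps to sum the vendor-side usage across every model category first. A minimal sketch with made-up category names and numbers (not a real vendor API):

```python
# Hypothetical usage figures as exported from a vendor console.
# Category names and numbers are illustrative only.
vendor_usage = {
    "chat_completions": 120_000,  # main LLM calls
    "embeddings": 35_000,         # document import + query embedding
    "rerank": 8_000,              # knowledge-base rerank model
}

# Dify's "total tokens" view may already include all of these,
# so compare it against the vendor-side sum, not a single category.
vendor_total = sum(vendor_usage.values())
print(vendor_total)  # 163000
```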

2. Different Billing Scopes

  1. Inconsistent time range

    • The Dify backend may be showing the last 24 hours / a specific date range;
    • The cloud vendor may be showing UTC-based days / month-to-date totals.
    • Especially across days and months, the difference can be very large.
  2. Whether the “key / project” being counted is consistent

    • Under the same cloud account, multiple projects/applications might share one API Key.
    • Dify is just one source among these calls;
    • What you see in the cloud vendor console is the “total amount for all applications under that key”.
  3. Multiple models, multiple regions

    • You might have only looked at a specific app / specific model in Dify;
    • The cloud vendor console counts all models, all regions.

3. Differences in Technical Implementation

  1. Different Tokenizer implementations

    • Dify might use a locally implemented tokenizer to estimate tokens;
    • Cloud vendors use their official tokenizer for billing;
    • Different implementations may have slight differences in rules for segmenting Chinese punctuation, emojis, and special characters → a small deviation is normal (generally within a few percentage points).
  2. Retries / Timeouts / Streaming interruptions

    • If Dify is configured for “automatic retry on failure”, a single user query might trigger multiple model calls in the backend:
      • Dify usage: will count all calls;
      • These calls are also billed by the cloud vendor, so if you were thinking only in terms of “successful responses”, Dify’s statistics will look inflated.
    • If streaming output is interrupted midway, Dify might record the tokens already received, while the cloud vendor has already fully billed for that call.
  3. Cache mechanism

    • If you enable certain caching mechanisms (e.g., upstream has its own cache / proxy), it might appear that:
      • Dify believes it sent a request and estimates tokens based on prompt length;
      • In reality, the upstream cloud vendor did not actually bill because of a cache hit, or billed 0.
    • This can lead to a situation where “Dify > Cloud Vendor”.
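
To see why two tokenizer implementations disagree, here are two deliberately crude token counters (neither is a real vendor tokenizer; real implementations agree far more closely, but the principle is the same):

```python
def count_by_words(text: str) -> int:
    # Crude stand-in for one tokenizer: roughly one token per word.
    return len(text.split())

def count_by_chars(text: str, chars_per_token: int = 4) -> int:
    # Crude stand-in for another: one token per ~4 characters,
    # a common rule of thumb for English text (ceiling division).
    return max(1, -(-len(text) // chars_per_token))

text = "Token counts depend on the tokenizer implementation."
a = count_by_words(text)   # 7
b = count_by_chars(text)   # 13
deviation = abs(a - b) / max(a, b)
print(a, b, f"{deviation:.0%}")
```

Real vendor tokenizers typically deviate from a local approximation by only a few percent, but they are still two different implementations counting the same text.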

4. Common Troubleshooting Ideas

You can simply verify with the following steps:

  1. Choose a very small time window for comparison
    For example, within the last 10 minutes, operate only one application, one model, and conduct 2–3 simple conversations, then:

    • Check the total number of tokens for these requests in the Dify backend;
    • Check the token usage for the same API Key in the cloud vendor console for the same time period;
    • See what the difference is (a single-digit percentage, or several times over).
  2. Confirm whether Embedding / Knowledge Base / Other Models are included

    • If you use a knowledge base, Rerank model, or tool calls, it is recommended to sum up the usage statistics of the corresponding models in the cloud vendor console before comparing with Dify.
  3. Confirm whether the time zone, time period, and project/Key accurately correspond

    • Unify the time zone to UTC or Beijing time;
    • Check if any other systems are also using the same Key.
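
The small-window comparison from step 1 can be scripted as a quick triage helper. This is a sketch only; the two totals would come from the Dify report and the vendor console for the same key and time window:

```python
def classify_gap(dify_tokens: int, vendor_tokens: int) -> str:
    """Roughly triage the discrepancy between two token totals."""
    if vendor_tokens == 0:
        return "no vendor usage recorded: check the key and time range"
    gap = abs(dify_tokens - vendor_tokens) / vendor_tokens
    if gap <= 0.10:
        return "within ~10%: likely normal tokenizer variance"
    if gap >= 1.0:
        return "several times off: check shared keys, retries, caching, scope"
    return "moderate gap: check embedding/rerank usage and time windows"

print(classify_gap(10_500, 10_000))  # within ~10%: likely normal tokenizer variance
```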

5. If the difference is very large (e.g., more than 2–3 times)

You can further check:

  • Dify version number (several statistics bugs have been fixed in later versions):
    • Check the image tag in “Settings / About” or docker-compose.
  • Whether a proxy gateway is configured (e.g., APISIX, Kong, self-built reverse proxy):
    • Proxies sometimes retry or modify request bodies.
  • Look at the detailed logs of a few typical calls in Dify:
    • Specific character count of the request prompt;
    • The usage field returned by the model (some cloud vendors directly return used_tokens);
    • Whether it is consistent with the usage recorded by Dify.

If convenient, you can provide a few pieces of information, and I can help you analyze more specifically:

  1. Which cloud vendor and which model are you using? (e.g., OpenAI gpt-4.1 / Alibaba Tongyi / Tencent Hunyuan, etc.)
  2. Is Dify a cloud version or self-deployed? What is the approximate version number?
  3. Is the difference “a little bit more” (e.g., 10–20%) or “doubled”?
  4. Are you using features like knowledge base / Embedding / Rerank?

Based on this information, it can be roughly determined whether it’s a normal statistical caliber difference or a potential statistical bug or configuration issue.

Yes, this forum itself is an example: when setting up a custom Model in the backend, Dify also provides custom tokenizer and pricing options to make your statistics easier.


@Dify_Forum_Helper

Your two screenshots make the point very well; they already touch on the key issue:

  • Dify’s backend token / cost statistics are estimates that Dify computes itself, based on the tokenizer and pricing you configure under “Custom Model”;
  • the cloud vendor’s statistics are computed by the vendor according to its own real billing rules.

As long as the configuration or algorithms on the two sides are not exactly identical, the numbers will not match.

Below I’ll break down the relationship between the two, and how to align them as closely as possible.


1. What does the tokenizer in a custom Model do?

The dropdown in your screenshot (AnthropicTokenizer / GeminiTokenizer / Llama3Tokenizer / MistralTokenizer / OpenAiTokenizer / QwenTokenizer…) essentially

tells Dify: “count the tokens for this request using this vendor’s tokenization rules.”

Different vendors / models tokenize differently. For the same piece of Chinese text:

  • OpenAI’s tiktoken might produce 50 tokens;
  • Qwen’s tokenizer might produce 60;
  • some vendors even bill by character count or by request count, which doesn’t map 1:1 to the concept of a “token” at all.

Dify can only approximate each vendor’s token counting, which is why it offers a set of common tokenizers to choose from.
If you pick the wrong tokenizer here, the token counts Dify computes will inevitably differ from the vendor console.


2. What does the pricing configuration do?

In a custom model, the fields:

  • Input cost
  • Cached input cost / Cache write cost
  • Output cost

are all configured as **“US dollars per million tokens.”** Dify computes:

(input / output tokens as counted by Dify) × (the unit prices you entered)

to estimate the approximate cost of each model call, shown in the backend “cost / usage” reports.

So, for Dify’s reports to come reasonably close to the vendor’s bill:

  1. the tokenizer should match the one the vendor actually bills with as closely as possible;
  2. the unit prices should be copied item by item from the vendor’s official documentation;
  3. watch out for special rates some vendors have, such as “cache hit” and “context compression” pricing (cached input / cache write).
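
The estimation rule above (tokens counted by Dify times the per-million unit price) can be sketched in a few lines. The prices here are illustrative, not any vendor’s real rates:

```python
def estimate_cost(input_tokens: int, output_tokens: int,
                  input_usd_per_m: float, output_usd_per_m: float) -> float:
    """Estimated cost in USD, mirroring the tokens x per-million-price rule."""
    return (input_tokens / 1_000_000 * input_usd_per_m
            + output_tokens / 1_000_000 * output_usd_per_m)

# Illustrative prices only: $2 per 1M input tokens, $8 per 1M output tokens.
cost = estimate_cost(input_tokens=120_000, output_tokens=30_000,
                     input_usd_per_m=2.0, output_usd_per_m=8.0)
print(f"${cost:.2f}")  # $0.48
```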

3. Even with the right configuration, why are there still differences?

Even if:

  • you picked the matching tokenizer (e.g., OpenAiTokenizer for an OpenAI model);
  • and configured the prices exactly per the vendor’s documentation;

deviations can still occur, for reasons including:

  1. The vendor’s usage is the “actually billed tokens,” while Dify “recounts locally”
    • Some models’ real billing logic involves internal compression / special-token handling;
    • Dify’s tokenizer is an approximation of the publicly documented rules; it is usually very close, but never guaranteed 100% identical.
  2. Whether all “hidden” content is included
    • The vendor bills for system prompts, tool calls, and the formatted conversation history;
    • some Dify versions and some node types may count only the main model call, making Dify’s figure slightly lower.
  3. Retries / failed calls
    • Dify may retry once on network jitter or a 429;
    • the vendor counts tokens for every request;
    • Dify’s application statistics may record only the final attempt, so the vendor’s number looks larger.

So “roughly matching within a small window” is the realistic goal; expecting 100% agreement is usually unrealistic.


4. Practical advice: how to bring the two numbers closer

Your forum example shows you’re already configuring custom Models in a self-hosted / extended setup, so you can run a “calibration” along these lines:

  1. Match each model with its tokenizer
    • OpenAI model → OpenAiTokenizer
    • Anthropic Claude → AnthropicTokenizer
    • Tongyi Qianwen (Qwen) → QwenTokenizer
    • Other vendors (with no dedicated option):
      • if they declare OpenAI protocol / tiktoken compatibility, OpenAiTokenizer is a reasonable first approximation;
      • if their billing is entirely custom (per character / per call), treat Dify’s token statistics as a “reference value” only.
  2. Copy the prices from the official documentation
    • Map them item by item:
      • Input cost → prompt / input token unit price
      • Output cost → completion token unit price
      • Cached / cache write → fill in per the docs if the model has cache-style billing (typical of some newer models, or if you’ve enabled caching); otherwise leave 0 or blank.
  3. Do a “small-sample reconciliation”
    • Build a simple test app / workflow that calls only this one model with a fixed prompt, about 10 times;
    • Compare:
      • the vendor console’s usage / cost for this model over that window;
      • Dify’s backend token & cost report;
    • A deviation within 5–10% is generally acceptable;
    • If it’s off by 2x or more, investigate:
      • whether other systems are using the same vendor key;
      • whether many failed / retried calls were billed by the vendor but never surfaced in the Dify frontend.
  4. Agree on a clear convention for the team
    • For example:
      • Dify backend statistics are mainly for observing overall application usage and for control policies (e.g., rate limiting);
      • cost settlement always follows the vendor’s bill.
    • Meanwhile, use the custom tokenizer + pricing to align the two to roughly the same order of magnitude, which both ops and finance can accept.

5. Back to your original question

“Why are the token statistics in the Dify backend different from the cloud vendor’s?”

With the screenshots you’ve now added, it boils down to one sentence:

The usage in the Dify backend is an estimate, computed with the tokenizer you selected and the prices you entered under “Custom Model”; the vendor console shows the actual billed values.
Any inconsistency in the tokenizer or the billing rules produces a difference. That is normal; choosing the right tokenizer and prices can shrink the gap substantially, but exact agreement is very hard to achieve.

If you’d like, tell me which vendor and model you’re connecting, which tokenizer you selected in the custom Model, and which prices you entered, and I can help you work out a more precise configuration against the official price list.

Yes, in fact, Dify can also integrate directly with other powerful LLM observability tools; see Enhance LLM Application observability on Dify with LangSmith and Langfuse - Dify Blog @Dify_Forum_Helper

Yes, these two points you added complete the full picture of “why the statistics can differ.” Let me lay out the logic so future readers get the complete view.


1. Dify’s Own Statistics: Leaning Towards “In-Product Perspective”

  • By configuring tokenizer + pricing in “Model Providers / Custom Model,” Dify can calculate:
    • Estimated tokens for each call;
    • Approximate cost at the application / workflow / tenant level.
  • This statistic is more like an in-application operational perspective:
    • Which application is the most expensive, which nodes are the heaviest, which user/API key makes the most calls.
    • Used for rate limiting, quotas, cost estimation, and A/B comparison decisions.

But it is inherently an “estimation”:

  • It depends on the tokenizer you select;
  • It depends on the price list you manually enter or the built-in one;
  • Plus, strategies like “whether to include retries/failures/intermediate nodes.”

2. External LLM Observability Tools like Langfuse / LangSmith: Leaning Towards “Engineering + Operations Perspective”

The Monitoring integrations in your diagram (Langfuse, LangSmith, Opik, mlflow, Databricks, W&B, Arize, Alibaba Cloud Monitoring, Tencent Cloud APM, etc.) solve another layer of problems:

Not only do you need to know “how many tokens / how much money was spent,” but also “what exactly happened” for a specific call, a specific Prompt, or a specific path.

Typical capabilities include:

  • Complete Trace: Request chain, input and output of each node, model selection, time consumption.
  • More Detailed Token & Cost Analysis:
    • Some tools directly read the usage field returned by the model;
    • They also have their own set of tokenizer / cost models, which can be cross-referenced with Dify’s estimates.
  • Quality Evaluation: Automatic/manual scoring, playback, regression testing, RAG quality evaluation, etc.

They are not a replacement for Dify’s backend statistics but rather a parallel “second perspective,” more suitable for R&D, SRE, and data teams for in-depth analysis.


3. How to Understand the Three Sets of Numbers: Dify vs. Cloud Provider vs. Observability Platform

If you think of the chain like this:

User → Dify Application (with its own internal token & cost estimation)
→ Observability Platform (Langfuse / LangSmith etc. for tracing + evaluation)
→ Model Cloud Provider (OpenAI / Alibaba Cloud / Tencent Cloud etc. for final billing)

Then the usage / token / cost seen in these three places:

  1. Cloud Provider: The final real bill, which finance must use as the standard.
  2. Dify Backend:
    • An “application/workflow-centric” perspective, emphasizing visualization and operability;
    • With custom tokenizer & pricing, it can achieve “the same order of magnitude alignment” with the cloud provider.
  3. Langfuse / LangSmith etc.:
    • A “call chain / trace / experiment-centric” perspective;
    • Helps you optimize prompts, paths, model selection, and even provides more granular token & cost statistics.

In reality, these three sets of numbers will not be exactly the same. The more recommended usage is:

  • Bill Reconciliation / Cost Settlement: Use the cloud provider as the standard;
  • Product Operations / User Quotas / Application Ranking: Dify’s built-in statistics are sufficient;
  • Debugging / Diagnosis / Improving Quality & Performance: Look at traces, metrics, and evaluations in Langfuse / LangSmith etc.

4. A One-Sentence Summary of Your Original Question

Why are Dify backend tokens different from the cloud provider?

Further extended, it can be said:

  • Dify gives you a “configurable, in-product estimation perspective,” which can approach the cloud provider’s numbers through tokenizer + pricing;
  • The cloud provider gives you the “final settlement perspective”;
  • Langfuse / LangSmith etc. provide an “engineering and observability perspective,” making it easier for you to understand where these differences come from and to iterate on your application.

The blog post you cited also points this out: Dify comes with basic statistics, while LangSmith / Langfuse are responsible for more granular cost & token analysis and LLMOps capabilities; the two are complementary.

If you plan to unify the usage of these three sets of statistics in a production environment later (e.g., providing a unified report to business stakeholders), I can help you design a set of “standard definitions + reconciliation procedures” so that product, engineering, and finance teams each know what to look at and how to interpret the differences.

@Dify_Forum_Helper Summarize this

Okay, I’ll provide an “easy-to-understand” summary for this post, making it convenient for newcomers to quickly grasp the situation.


Overview in One Sentence

Dify backend’s Token / Cost statistics are estimates based on your configured tokenizer + price list;
Cloud provider consoles calculate based on actual billing rules.
Plus differences in “statistical scope,” “time window,” and “whether retries/caching are included”—
It’s normal for the numbers on both sides to not be exactly the same; as long as the magnitude is close, it’s considered normal.


I. Why are Dify and Cloud Provider Numbers Inconsistent?

This can be understood in three layers.

1. Differences in “How Statistics Are Calculated” (Methodology Differences)

  1. Dify: Counts tokens itself + calculates costs itself
  • In “Model Providers / Custom Model,” you select:

    • Which tokenizer (OpenAiTokenizer / QwenTokenizer / AnthropicTokenizer…)
    • The unit price per million tokens (input / output / cached input / cache write…)
  • Dify will then use:

    Tokens counted by the local tokenizer × the unit price you entered

    To derive an “estimated usage & estimated cost.”

  2. Cloud Providers: Bill according to their own internal logic
  • They use their own tokenizer / compression strategies / cache billing, etc.:
    • For the same piece of text, the token count might be slightly different from Dify’s implementation;
    • Some providers even bill by character count or request count, which is not a 1:1 relationship with tokens.

→ Even if you select the “correct” tokenizer in Dify and copy the prices directly from the official website, it’s essentially still an “as close as possible” estimate.


2. Differences in “What Is Counted” (Scope Differences)

  • Cloud providers bill for all of:

    • system / user / tool / historical conversation content,
    • model output,
    • any retried requests.
  • Different Dify views/versions might:

    • Only show tokens for main model calls;
    • Certain intermediate nodes (tool nodes, sub-workflows, RAG retrieval, embedding, rerank) might not be in the report you’re currently viewing;
    • Have different counting strategies for failed requests, retries, or streaming interruptions.

Common phenomena therefore include:

  • Cloud provider usage > Dify: The cloud provider includes retries / failures / full context;
  • Occasionally Dify > Cloud provider: For example, if you estimate tokens based on the prompt, but the upstream hit a cache, the actual billing might be 0.

3. Differences in “When and What to Look At” (Time Window & Dimension Differences)

  • Different Time Ranges
    • Dify allows selecting “Last 24 hours / Custom Date”;
    • Cloud providers often aggregate “by UTC day / month.”
  • Different Scope Dimensions
    • Dify: a specific app / workflow / tenant;
    • Cloud provider: the entire account / project / API Key, and other services might be using the same key.

→ If the time period, model, and key scope are not perfectly aligned, it’s inherently difficult to match them one-to-one.
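
One concrete pitfall: a “local day” in a Dify report rarely lines up with a vendor’s UTC day. A sketch with Python’s standard datetime (the dates are arbitrary examples):

```python
from datetime import datetime, timedelta, timezone

cst = timezone(timedelta(hours=8))  # China Standard Time, UTC+8

# The reporting day "2024-05-20" as seen in a UTC+8 dashboard...
local_start = datetime(2024, 5, 20, 0, 0, tzinfo=cst)
local_end = local_start + timedelta(days=1)

# ...maps to a UTC window that straddles two UTC calendar days,
# so per-UTC-day vendor totals cannot match a local-day report exactly.
utc_start = local_start.astimezone(timezone.utc)
utc_end = local_end.astimezone(timezone.utc)
print(utc_start.isoformat(), "->", utc_end.isoformat())
# 2024-05-19T16:00:00+00:00 -> 2024-05-20T16:00:00+00:00
```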


II. What Do the Two Settings in Custom Model Actually Do?

The two core configurations in your screenshot:

  1. Tokenizer Dropdown
    Determines: Which set of tokenization rules Dify uses to “simulate” the provider’s billing methodology.

    • OpenAI series → Select OpenAiTokenizer
    • Claude → Select AnthropicTokenizer
    • Tongyi Qianwen (Qwen) → Select QwenTokenizer
    • For others compatible with the OpenAI protocol, you can approximate with OpenAiTokenizer first.
  2. Pricing (input / output / cached / cache write cost)
    Determines: How much these tokens are converted into money.

    • Corresponds to the cloud provider’s documentation for: prompt unit price, completion unit price, cache-related unit prices.

As long as one of these two parts doesn’t perfectly match the cloud provider’s rules:

  • Dify’s token / cost statistics will show varying degrees of deviation;
  • This is a “normal phenomenon,” not an error in calculation, but rather “different rules and methodologies.”

III. What Are Langfuse / LangSmith and Other Monitoring Integrations For?

Reference: Enhance LLM Application observability on Dify with LangSmith and Langfuse - Dify Blog

This topic also mentioned Dify’s integration with LLM observability platforms like Langfuse / LangSmith, which address the third dimension of the problem:

It’s not just about “how many tokens were used / how much money was spent,”
but also about figuring out “which call, which prompt segment, which workflow path led to these expenses.”

These platforms generally provide:

  • Trace: The entire call chain, input/output of each node, and time taken;
  • More granular token / cost: Can be viewed by single call, specific prompt version, or specific branch;
  • Quality evaluation: Playback, scoring, A/B testing, RAG evaluation, etc.

So, the three perspectives can be divided as follows:

  1. Cloud Provider Console:

    • The true bill & cost settlement.
  2. Dify Backend (tokenizer + pricing):

    • For “in-application operations”:
      • Which app / workflow consumes the most;
      • Setting quotas and rate limits for users / keys;
      • Estimating costs for product-level decisions.
  3. Langfuse / LangSmith and other monitoring:

    • For “engineering & operations & optimization”:
      • Precisely identifying which step is wasting tokens;
      • Debugging prompts, routing, RAG strategies.

IV. Practical Advice for Newcomers

If you want Dify’s statistics to be “as close as possible” to the cloud provider’s:

  1. In Custom Model:

    • For the tokenizer, try to select the one that matches the actual model;
    • Copy the prices item by item from the cloud provider’s official documentation (distinguishing input / output / cached, etc.).
  2. Perform a “small sample reconciliation”:

    • Fix a simple application and call it dozens of times within a day;
    • Compare Dify’s usage with the cloud provider’s during this period;
    • See if the deviation is within an acceptable range (e.g., 5–10%).
  3. Agree on a unified methodology:

    • “Finance relies on the cloud provider’s bill;
      Product/operations look at Dify backend usage;
      Debugging & optimization look at Langfuse / LangSmith and other monitoring.”

This way, everyone will understand “why the numbers are different” and know which layer of statistics to trust for their respective needs.


If you later have specific cloud provider + model configurations (e.g., a certain Qwen / Claude / GPT model), you can also paste how you’ve configured it in Dify now, and I can help you adjust the tokenizer & pricing to a more reasonable set based on the official price list.