Different similarity scores for identical text chunks in Knowledge Base

I have a question about Dify’s Knowledge Base vector search behavior.

Issue

  • In the search test, I use the query “mac”

  • The search results include multiple chunks with exactly the same text

  • However, the similarity scores (SCORE) for these chunks are different

Example:

  • Identical text:

    “Macの場合はマウスの支給をしません。各自で調達してください。”
    (“For Mac, a mouse is not provided. Please prepare one yourself.”)

  • But the scores differ, for example:

    • SCORE: 0.26

    • SCORE: 0.19

(See attached screenshot)

Questions

I would like to understand why this happens.

  • Is it expected that identical text chunks can have different similarity scores due to:

    • Being stored in different documents

    • Different chunk IDs or ingestion order

    • Differences in metadata (document title, folder, description, etc.)

    • Different surrounding context when the text was chunked

  • Or could this be related to:

    • Embedding timing / re-embedding behavior

    • Vector database implementation details

Assumptions / Environment

  • The chunk text itself is exactly the same string

  • Search type is vector search (not keyword search)

  • No similarity threshold is explicitly configured

Purpose

I want to clarify whether:

  • “Identical text should generally result in identical similarity scores”, or

  • “Some level of score variance for identical text is expected behavior in Dify”

If anyone has experienced similar behavior or knows the underlying design/specification, I would really appreciate your insights.


Hi, if the text inside the chunks is the same, it’s reasonable to expect that the two will have the same vector, and therefore the same score for a given query.
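
To illustrate that reasoning, here is a minimal sketch. It assumes the SCORE is a cosine similarity between the query vector and each stored chunk vector, which I have not confirmed for your setup. The vectors are made-up toy values, not real embeddings, and `cosine_similarity` is just a helper written for this example, not a Dify or Weaviate function.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy values for illustration only (assumption: not real embeddings).
query = [0.2, 0.7, 0.1]
chunk_a = [0.5, 0.1, 0.8]
chunk_b = list(chunk_a)      # same stored vector as chunk_a
chunk_c = [0.5, 0.3, 0.8]    # a different stored vector

score_a = cosine_similarity(query, chunk_a)
score_b = cosine_similarity(query, chunk_b)
score_c = cosine_similarity(query, chunk_c)

# Identical stored vectors always give identical scores for the same query.
print("a vs b equal:", score_a == score_b)
# A different score implies the stored vectors themselves differ.
print("a vs c equal:", score_a == score_c)
```

So if two chunks with the same text get 0.26 and 0.19, the vectors stored for them are most likely not the same, whatever the reason turns out to be.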

I’d like to know a bit about your environment. Are you using Dify Cloud, or are you self-hosting Dify? Which embedding model are you using? If you’re self-hosting, which Dify version and which vector database are you using?

It also seems like the other post is behaving in a counter-intuitive way.

Since I can’t reproduce the issue in my environment, there might be some kind of inconsistency in the data within your vector database.
Could you try creating a new knowledge base from scratch, uploading the same document, and checking whether the issue still occurs?
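
If you are able to export the stored chunks together with their vectors, a check along these lines could confirm whether identical texts really have different vectors. This is only a sketch: the `(chunk_id, text, vector)` record layout, the function names, and the tolerance are my own assumptions, and how you export the records from Weaviate is not shown here.

```python
import math
from collections import defaultdict


def cosine_distance(a, b):
    """1 - cosine similarity; close to 0.0 for vectors pointing the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def find_inconsistent_duplicates(records, tolerance=1e-6):
    """Group records by exact text and report texts whose vectors differ.

    records: iterable of (chunk_id, text, vector) tuples (hypothetical layout).
    Returns {text: largest pairwise cosine distance} for texts over tolerance.
    """
    by_text = defaultdict(list)
    for chunk_id, text, vector in records:
        by_text[text].append(vector)

    report = {}
    for text, vectors in by_text.items():
        worst = 0.0
        for i in range(len(vectors)):
            for j in range(i + 1, len(vectors)):
                worst = max(worst, cosine_distance(vectors[i], vectors[j]))
        if worst > tolerance:
            report[text] = worst
    return report


# Made-up sample data standing in for an export from the vector database.
records = [
    ("c1", "same text", [0.1, 0.2, 0.3]),
    ("c2", "same text", [0.1, 0.2, 0.3]),
    ("c3", "other text", [0.9, 0.1, 0.0]),
    ("c4", "other text", [0.1, 0.9, 0.0]),
]
report = find_inconsistent_duplicates(records)
for text, distance in report.items():
    print(f"vectors differ for {text!r}: max cosine distance {distance:.4f}")
```

Any text that shows up in the report has duplicates whose stored vectors disagree, which would point to the data rather than the search itself.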


Hi, thanks for your detailed response.

I agree that if the text inside the chunks is exactly the same, the resulting vectors should also be the same.

Here is a bit more context about my environment:

  • I am self-hosting Dify.

  • Dify version: 1.11.4

  • Embedding model: amazon.titan-embed-text-v2:0

  • Vector database: Weaviate (default configuration)

As you suggested, I will try the following to verify the behavior:

  • Create a new knowledge base from scratch

  • Upload the same document

  • Check whether the issue still occurs

If the issue does not occur in the new knowledge base, it may indicate some inconsistency in the original environment.

Thanks again for the suggestion — I’ll report back once I’ve tested this.