I have a question about Dify’s Knowledge Base vector search behavior.
Issue
- In the search test, I use the query “mac”
- The search results include multiple chunks with exactly the same text
- However, the similarity scores (SCORE) for these chunks are different
Example:
- Identical text:
  “Macの場合はマウスの支給をしません。各自で調達してください。”
  (“For Mac, a mouse is not provided. Please prepare one yourself.”)
- But the scores differ, for example:
  - SCORE: 0.26
  - SCORE: 0.19
- (See attached screenshot)
Questions
I would like to understand why this happens.
- Is it expected that identical text chunks can have different similarity scores due to:
  - Being stored in different documents
  - Different chunk IDs or ingestion order
  - Differences in metadata (document title, folder, description, etc.)
  - Different surrounding context when the text was chunked
- Or could this be related to:
  - Embedding timing / re-embedding behavior
  - Vector database implementation details
Assumptions / Environment
- The chunk text itself is exactly the same string
- Search type is vector search (not keyword search)
- No similarity threshold is explicitly configured
Purpose
I want to clarify whether:
- “Identical text should generally result in identical similarity scores”, or
- “Some level of score variance for identical text is expected behavior in Dify”
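To make the first expectation concrete, here is a minimal sketch of my reasoning. It assumes SCORE is a cosine similarity between the query embedding and each chunk's stored embedding, which I have not confirmed for Dify. The vectors and names (`query_vec`, `chunk_a_vec`, etc.) are made up for illustration and are not real embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


# Hypothetical query embedding (toy 3-dimensional values).
query_vec = [0.2, 0.1, 0.7]

# Two chunks whose stored vectors are identical, as I would expect
# if the same string were embedded deterministically by the same model.
chunk_a_vec = [0.5, 0.4, 0.1]
chunk_b_vec = list(chunk_a_vec)

# A chunk whose stored vector differs slightly, e.g. if something other
# than the visible text went into the embedding (assumption, not verified).
chunk_c_vec = [0.5, 0.3, 0.2]

score_a = cosine_similarity(query_vec, chunk_a_vec)
score_b = cosine_similarity(query_vec, chunk_b_vec)
score_c = cosine_similarity(query_vec, chunk_c_vec)

print("identical vectors give identical scores:", score_a == score_b)
print("different vectors give different scores:", score_a != score_c)
```

Under these assumptions, two different scores for the same visible text would mean the stored vectors themselves differ, which is what I am trying to explain.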
If anyone has experienced similar behavior or knows the underlying design/specification, I would really appreciate your insights.
