Signed-off-by: Daniel Baumann <daniel@debian.org>

2025-05-19 00:20:02 +02:00

5.6 KiB

Raw Permalink Blame History

Scoring models for document hits

Search is performed on a collection fields using compound queries such as conjunction/disjunction/boolean etc. However, the scoring itself is done independently for each field and then aggregated to get the final score for a document hit.
Default scoring scheme for document hits involving text hits: tf-idf.
Nearest-neighbor/vector hits scoring depends on chosen knn distance metric, highlighted here.
Hybrid search scoring will combine tf-idf scores with knn distance numbers.
v2.5.0 (and after) will come with support for bm25 scoring for exact searches.

BM25

When it comes to scoring a document hit for a specific field, BM25 scoring mechanism requires the following stats:

fieldLen - The number of analyzed terms in the current document's field.
avgFieldLen - The average number of analyzed terms in the field across all the documents.
docTotal - The total number of documents in the index.
docTerm - The total number of documents containing the query term within the index.

The scoring formula followed in BM25 is

\sum_{i}^n IDF(q_i) {{f(q_i,D) * (k1 + 1)}\over{f(q_i,D) + k1 * (1-b+b*{{fieldLen}\over{avgFieldLen}})}}

IDF(q_i) here refers to Inverse Document Frequency talks about how rare (and hence rich in information) is a particular query term `q_i` across all the documents in the index, which is calculated as

    \ln(1 + {{docTotal - docTerm + 0.5}\over{docTerm + 0.5}})

Coming back to the BM25 scoring, f(q_i,D) refers to the frequency of the query term in document D. The entire equation has certain multipliers

k1 - helps in controlling the saturation of the score with respect to query term in a document. Basicaly if the query term's frequency is too high, the score value gets saturated and doesn't increase beyond a certain point.
b - controls the extent to which the fieldLen normalizes the term's frequency.

How to enable and use BM25

Bleve v2.5.0 updated the indexMapping construct with the concept of scoringModel. This is a global (meaning applicable to all the fields) setting which drives which scoring algorithm to apply while scoring the document hits. Supported scoring models are defined here

For instance, while defining the index mapping for the data model that's been decided by the user, following snippet can be referred to enable BM25

indexMapping := bleve.NewIndexMapping()
indexMapping.TypeField = "type"
indexMapping.DefaultAnalyzer = "en"
indexMapping.ScoringModel = index.BM25Scoring

During search time there's explicit change involved, unless the user wants to perform a global scoring.

Global Scoring

Let's say that the user has a dataset which is quite large (let's say 3 million) and to have good throughput, they create 3 shards (with the same index mapping) for the "index". Each of these shards can be bleve.Index type and while performing a search over the entire dataset, a bleve.IndexAlias can be created over which a search can be performed. This parallelizes things pretty good, both on the indexing path and the search path.

The concept of global scoring is applicable when the index is "sharded" (similar to above situation). This is because each index has data which is disjoint, and thereby while performing the scoring on document hits on each of them, the value of stats is not complete at a global level, since we're doing a search over the entire dataset using the bleve.IndexAlias. For eg: docTotal value while scoring the document hits would be 1 million which is incorrect at a global level.

So in order to keep the scoring roughly same across varying count of the number of shards involved, we provide a mechanism to enable "global scoring". In this type of search, an initial roundtrip is performed to gather and aggregate the stats necessary for the scoring mechanism and in the second phase, the actual search is performed. So naturally this comes at a cost of latency. As a reference here's how the user can go about with it

multiPartIndex := bleve.NewIndexAlias(shard1, shard2)
// set the alias with the same index mapping which both the shards use.
err = multiPartIndex.SetIndexMapping(indexMapping)
if err != nil {
    return err
}

ctx := context.Background()
ctx = context.WithValue(ctx, search.SearchTypeKey, search.GlobalScoring)

res, err := multiPartIndex.SearchInContext(ctx, searchRequest)

A note here is that, this would only matter if the relative order of the document hits vary quite a bit (vs single shard case). This would be possible when the shard count increases quite a bit, in low doc count situations or if there is a heavy skew in the data distribution amongst the shards for some reason. Ideally the shards are created when the data is quite large and each of them index same amount of data - in which case the scores won't fluctuate much to affect the relative hit order and the user can choose to avoid the global scoring mechanism altogether.

TF-IDF

TF-IDF is the default scoring mechanism involved (for backward compatibility reasons) and requires no change from the user at index or search time to avail it.

The scoring formula involved is

    \sum_{i}^n f(q_i, D) * {{1}\over{\sqrt{fieldLen}}} * IDF(q_i)

where IDF(q_i) is

    1 + {{docTotal}\over{1 + docTerm}}

Note: TF-IDF formula doesn't accomodate logic for score saturation due to term frequency or fieldLen. So, it's recommended to use BM25 scoring by explicity setting it in the index mapping.

5.6 KiB Raw Permalink Blame History