I’m currently working with ColBERT for document re-ranking and facing challenges in applying a generalized threshold to ColBERT scores across different datasets. Due to the variability in score ranges, it’s difficult to set a fixed threshold for relevance filtering. Unlike typical embedding similarity scores, ColBERT’s late interaction mechanism produces scores that can vary significantly based on query length, token distributions, and dataset characteristics.
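To make the score-range issue concrete, here is a minimal sketch of ColBERT-style late interaction (MaxSim) scoring using random embeddings. The shapes and function name are illustrative, not ColBERT's actual API; the point is that the score is a sum over query tokens, so its upper bound grows with query length rather than staying in a fixed range:

```python
import torch

def maxsim_score(q_embs: torch.Tensor, d_embs: torch.Tensor) -> torch.Tensor:
    """q_embs: (num_query_tokens, dim), d_embs: (num_doc_tokens, dim).

    Late interaction sums, over query tokens, the maximum cosine
    similarity against any document token -- so the score's upper bound
    is the number of query tokens, not a fixed [0, 1] interval.
    """
    q = torch.nn.functional.normalize(q_embs, dim=-1)
    d = torch.nn.functional.normalize(d_embs, dim=-1)
    sim = q @ d.T                        # (num_q, num_d) token-level similarities
    return sim.max(dim=-1).values.sum()  # MaxSim per query token, then sum

torch.manual_seed(0)
doc = torch.randn(180, 128)
short_q, long_q = torch.randn(8, 128), torch.randn(32, 128)
print(maxsim_score(short_q, doc))  # bounded above by 8
print(maxsim_score(long_q, doc))   # bounded above by 32 -- same doc, larger score range
```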
I tried min-max normalization on the scores returned for a particular query, but it turns out that even when the query is irrelevant to the corpus, some results still pass the threshold, because I was taking min_score and max_score from that query's own responses: the top hit is always rescaled to the maximum.
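A short sketch of that pitfall, with hypothetical score values: because the min and max come from the same result list, the best candidate is always mapped to 1.0, even when every candidate is irrelevant and the raw scores are bunched together at the low end:

```python
def minmax_per_query(scores):
    lo, hi = min(scores), max(scores)
    return [(s - lo) / (hi - lo) for s in scores]

relevant_hits   = [31.2, 27.8, 12.4]  # one clearly relevant document in the list
irrelevant_hits = [6.3, 6.1, 5.9]     # nothing relevant; raw scores bunched low

print(minmax_per_query(relevant_hits))    # [1.0, ~0.82, 0.0]
print(minmax_per_query(irrelevant_hits))  # [1.0, 0.5, 0.0] -- still passes a 0.9 cutoff
```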
Here are some of the approaches I’ve considered, but each has limitations when applied generally (see the sketch after this list):
Normalizing scores by query length or token count
Rescaling scores based on observed min-max values in different datasets
Z-score normalization based on empirical mean and variance across datasets
Using adaptive thresholds or lightweight classifiers to predict relevance
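For reference, a hedged sketch of the first and third ideas above: dividing the raw MaxSim sum by the number of query tokens (which roughly bounds scores independently of query length), then z-scoring against statistics estimated offline from a background sample. The background scores and the notion of "likely relevant" here are hypothetical placeholders for my own corpus, not anything shipped with ColBERT:

```python
import statistics

def length_normalize(raw_score: float, num_query_tokens: int) -> float:
    # Each query token contributes at most ~1.0 via cosine MaxSim,
    # so this bounds the normalized score regardless of query length.
    return raw_score / num_query_tokens

def zscore(score: float, background_scores: list[float]) -> float:
    # background_scores: length-normalized scores collected offline,
    # e.g. from (query, random document) pairs on the target corpus.
    mu = statistics.mean(background_scores)
    sigma = statistics.stdev(background_scores)
    return (score - mu) / sigma

background = [0.31, 0.28, 0.35, 0.30, 0.27, 0.33]           # hypothetical sample
candidate = length_normalize(24.6, num_query_tokens=32)      # -> ~0.77
print(zscore(candidate, background))  # large positive z -> plausibly relevant
```

Even with this, the background statistics have to be re-estimated per dataset, which is exactly the dataset-specificity I’m trying to avoid.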
However, each approach tends to be dataset-specific, and I would like a solution that can generalize effectively across datasets. Do you have any recommended strategies for achieving a more standardized scoring range or threshold? Alternatively, is there any built-in functionality planned (or that I might have missed) for scaling or calibrating ColBERT scores in a more generalizable way?
Any guidance or suggestions would be greatly appreciated! I’ve attached a code snippet below showing how I am using it.
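(The original attachment was not captured here; the following is a minimal sketch of the usage pattern from the stanford-futuredata/ColBERT README, with a hypothetical fixed threshold added to illustrate the filtering problem. The experiment name, index name, and cutoff value are placeholders.)

```python
from colbert import Searcher
from colbert.infra import Run, RunConfig

THRESHOLD = 20.0  # hypothetical fixed cutoff -- does not transfer across datasets

with Run().context(RunConfig(nranks=1, experiment="example")):
    searcher = Searcher(index="my_index.nbits=2")  # placeholder index name
    pids, ranks, scores = searcher.search("example query", k=10)
    # Raw MaxSim scores: the same THRESHOLD over- or under-filters
    # depending on query length and corpus statistics.
    kept = [(pid, s) for pid, s in zip(pids, scores) if s >= THRESHOLD]
```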
Thank you for the fantastic work on ColBERT.