Evaluation Measures

Evaluation measures for an information retrieval system are used to assess how well the search results satisfied the user's query intent.
Such metrics are often split into two kinds: online metrics look at users' interactions with the search system, while offline metrics measure relevance, in other words how likely each result, or the search engine results page (SERP) as a whole, is to meet the information needs of the user.

(Wikipedia)

Available Measures

The following list includes the leaf-level RRE built-in metrics which can be used out of the box. They are called "leaf" metrics because they are computed at the leaf level of the domain model, which means they are computed at query level (a worked sketch follows the list):

  • Precision: the fraction of retrieved documents that are relevant
  • Recall: the fraction of relevant documents that are retrieved
  • Reciprocal Rank: the multiplicative inverse of the rank of the first relevant result: 1 for first place, 1/2 for second place, 1/3 for third, and so on.
  • Expected Reciprocal Rank (ERR): an extension of Reciprocal Rank to graded relevance; it measures the expected reciprocal length of time that the user will take to find a relevant document.
  • Average Precision: the area under the precision-recall curve; equivalently, the mean of the precision values obtained at each rank that holds a relevant document.
  • NDCG (Normalised Discounted Cumulative Gain): a graded-relevance measure of ranking quality; the gain of each result is discounted logarithmically by its position, and the total is normalised against the score of the ideal ranking.
  • F-Measure: the weighted harmonic mean of precision and recall; it measures the effectiveness of retrieval with respect to a user who attaches β times as much importance to recall as to precision. RRE provides the three most popular F-Measure instances: F0.5, F1 and F2; additionally, you may specify your own β value if required (see below).
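
To make the binary-judgement metrics above concrete, here is a minimal, self-contained Java sketch for a single query. It is an illustration only, not RRE's internal implementation, and the document ids and judgements are made up:

```java
import java.util.List;
import java.util.Set;

public class BinaryMetrics {

    /** Precision: the fraction of retrieved documents that are relevant. */
    static double precision(List<String> retrieved, Set<String> relevant) {
        if (retrieved.isEmpty()) return 0.0;
        long hits = retrieved.stream().filter(relevant::contains).count();
        return (double) hits / retrieved.size();
    }

    /** Recall: the fraction of relevant documents that are retrieved. */
    static double recall(List<String> retrieved, Set<String> relevant) {
        if (relevant.isEmpty()) return 0.0;
        long hits = retrieved.stream().filter(relevant::contains).count();
        return (double) hits / relevant.size();
    }

    /** Reciprocal Rank: 1 / rank of the first relevant result, 0 if there is none. */
    static double reciprocalRank(List<String> retrieved, Set<String> relevant) {
        for (int i = 0; i < retrieved.size(); i++) {
            if (relevant.contains(retrieved.get(i))) return 1.0 / (i + 1);
        }
        return 0.0;
    }

    /** Average Precision: precision at each relevant rank, averaged over all relevant documents. */
    static double averagePrecision(List<String> retrieved, Set<String> relevant) {
        if (relevant.isEmpty()) return 0.0;
        double sum = 0.0;
        int hits = 0;
        for (int i = 0; i < retrieved.size(); i++) {
            if (relevant.contains(retrieved.get(i))) {
                hits++;
                sum += (double) hits / (i + 1); // precision at this cut-off
            }
        }
        return sum / relevant.size();
    }

    public static void main(String[] args) {
        List<String> retrieved = List.of("d1", "d5", "d3", "d7"); // ranked results
        Set<String> relevant = Set.of("d1", "d3", "d4");          // judged relevant
        System.out.printf("P=%.2f R=%.2f RR=%.2f AP=%.2f%n",
                precision(retrieved, relevant), recall(retrieved, relevant),
                reciprocalRank(retrieved, relevant), averagePrecision(retrieved, relevant));
    }
}
```

For the example data this prints P=0.50 R=0.67 RR=1.00 AP=0.56: two of the four retrieved documents are relevant, the top result is relevant (so RR is 1), and AP averages the precision values at ranks 1 and 3 over the three relevant documents.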

On top of those "leaf" metrics computed at query level, RRE computes them at the upper levels of the domain model (e.g. query group, topic, corpus) using an aggregation function, as sketched after the list below. The result is a new set of metrics with several levels of granularity:

  • Mean Average Precision: the mean of the average precisions computed at query level.
  • Mean Reciprocal Rank: the average of the reciprocal ranks computed at query level.
  • All other metrics listed above, aggregated by their arithmetic mean.
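
As an illustration of the aggregation step, the following sketch computes Mean Average Precision from hypothetical query-level values (the numbers are invented, not RRE output):

```java
import java.util.Map;

public class Aggregation {

    public static void main(String[] args) {
        // Hypothetical Average Precision values computed at query level.
        Map<String, Double> apByQuery = Map.of("q1", 0.56, "q2", 1.00, "q3", 0.25);

        // MAP is simply the arithmetic mean of the query-level values;
        // MRR and the other aggregates follow the same pattern.
        double map = apByQuery.values().stream()
                .mapToDouble(Double::doubleValue)
                .average()
                .orElse(0.0);

        System.out.printf("MAP = %.4f%n", map); // (0.56 + 1.00 + 0.25) / 3 = 0.6033
    }
}
```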

Controlling Evaluation

Several of the metrics accept parameters that control how they evaluate the incoming results - see Parameterized Metrics for details of how to set these through your Maven pom.xml.

The metrics with these parameters available are listed below; a sketch showing how the graded-judgement parameters interact follows the list:

  • F-Measure
    • k - the top k reference elements used to build the measurement.
    • beta - the balance factor between precision and recall.
  • NDCG@K
    • k - the top k reference elements used to build the measurement.
    • maximumGrade - the maximum grade available when judging documents (optional, default: 3.0).
    • missingGrade - the grade that should be assigned to documents where no judgement has been given. This is optional - the default value is either maximumGrade / 2 (if maximumGrade has been supplied), or 2.0.
    • name - the name used to record this metric in the output (optional, defaults to NDCG@k, where k is set as above). This allows the metric to be run multiple times with different missing grade values, for example.
  • ERR@K - Expected Reciprocal Rank
    • k - the top k reference elements used to build the measurement.
    • maximumGrade - the maximum grade available when judging documents (optional, default: 3.0).
    • missingGrade - the grade that should be assigned to documents where no judgement has been given. This is optional - the default value is either maximumGrade / 2 (if maximumGrade has been supplied), or 2.0.
    • name - the name used to record this metric in the output (optional, defaults to ERR@k, where k is set as above). This allows the metric to be run multiple times with different missing grade values, for example.
  • RR@K - Reciprocal Rank
    • k - the top k reference elements used to build the measurement (default: 10).
    • maximumGrade - the maximum grade available when judging documents (optional, default: 3.0).
    • missingGrade - the grade that should be assigned to documents where no judgement has been given. This is optional - the default value is either maximumGrade / 2 (if maximumGrade has been supplied), or 2.0.
    • name - the name used to record this metric in the output (optional, defaults to RR@k, where k is set as above). This allows the metric to be run multiple times with different missing grade values, for example.

Both maximumGrade and missingGrade may be floating point values.
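
The sketch below illustrates how k, maximumGrade and missingGrade play together for the graded metrics. It uses the standard log2 discount for NDCG and the usual (2^grade - 1) / 2^maximumGrade grade-to-probability mapping for ERR (Chapelle et al.); the exact formulas RRE uses may differ in detail, so treat this as an approximation of the behaviour rather than a copy of the implementation:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class GradedMetrics {

    /** DCG@k with a log2 discount: sum of grade_i / log2(i + 1) for 1-based ranks i. */
    static double dcgAtK(List<Double> grades, int k) {
        double dcg = 0.0;
        for (int i = 0; i < Math.min(k, grades.size()); i++) {
            dcg += grades.get(i) / (Math.log(i + 2) / Math.log(2));
        }
        return dcg;
    }

    /**
     * NDCG@k: the DCG of the actual ranking divided by the DCG of the ideal one.
     * Simplification: the ideal ranking is taken by sorting the retrieved grades;
     * a full implementation normalises against the ideal ranking of all judged documents.
     */
    static double ndcgAtK(List<Double> grades, int k) {
        List<Double> ideal = new ArrayList<>(grades);
        ideal.sort(Comparator.reverseOrder());
        double idcg = dcgAtK(ideal, k);
        return idcg == 0.0 ? 0.0 : dcgAtK(grades, k) / idcg;
    }

    /** ERR@k: at each rank, the chance the user is satisfied there, weighted by 1/rank. */
    static double errAtK(List<Double> grades, int k, double maximumGrade) {
        double err = 0.0;
        double notSatisfiedYet = 1.0; // probability the user reached this rank unsatisfied
        for (int i = 0; i < Math.min(k, grades.size()); i++) {
            double satisfied = (Math.pow(2, grades.get(i)) - 1) / Math.pow(2, maximumGrade);
            err += notSatisfiedYet * satisfied / (i + 1);
            notSatisfiedYet *= (1 - satisfied);
        }
        return err;
    }

    public static void main(String[] args) {
        double maximumGrade = 3.0;              // documented default
        double missingGrade = maximumGrade / 2; // documented default when maximumGrade is supplied
        // Grades for the top 5 results of one query; the second document has no
        // judgement, so it receives missingGrade.
        List<Double> grades = List.of(3.0, missingGrade, 0.0, 2.0, 1.0);
        System.out.printf("NDCG@5 = %.4f  ERR@5 = %.4f%n",
                ndcgAtK(grades, 5), errAtK(grades, 5, maximumGrade));
    }
}
```

Raising missingGrade makes unjudged documents count as more relevant, which generally inflates both scores; the name parameter lets you run the same metric several times with different missing grade values and compare the resulting columns in the output.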