Advanced Functionality

This section documents some additional functionality you may find useful after you're comfortable with the primary capabilities of Elasticsearch LTR.

Reusable Features

In :doc:`building-features` we demonstrated creating feature sets by uploading a list of features. Instead of repeating common features in every feature set, you may want to keep a library of features around.

For example, if a query on the title field is important to many of your feature sets, you can use the feature API to create a reusable title query:

POST _ltr/_feature/titleSearch
{
    "feature":
    {
        "params": [
        "keywords"
        ],
        "template": {
        "match": {
            "title": "{{keywords}}"
        }
        }
    }
}
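To illustrate what the {{keywords}} placeholder does, the Python sketch below substitutes query-time parameters into a template. It is a simplified stand-in for Mustache rendering, not the plugin's implementation, and render_template is a hypothetical helper name:

```python
import json


def render_template(template: dict, params: dict) -> dict:
    """Substitute {{name}} placeholders in a JSON template (simplified Mustache)."""
    text = json.dumps(template)
    for name, value in params.items():
        # json.dumps(...)[1:-1] escapes quotes and backslashes so the
        # substituted text remains valid inside a JSON string.
        escaped = json.dumps(str(value))[1:-1]
        text = text.replace("{{" + name + "}}", escaped)
    return json.loads(text)


title_search = {"match": {"title": "{{keywords}}"}}
rendered = render_template(title_search, {"keywords": "rambo"})
print(rendered)
```

Only plain variable substitution is handled here; Mustache sections and other tags are out of scope for this sketch.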

As you'd expect, normal CRUD operations apply. You can DELETE a feature:

DELETE _ltr/_feature/titleSearch

And fetch an individual feature:

GET _ltr/_feature/titleSearch

Or look at all your features, optionally filtered by name prefix:

GET /_ltr/_feature?prefix=t

When you create or update a feature set, you can refer to the titleSearch feature:

POST /_ltr/_featureset/my_featureset/_addfeatures/titleSearch

This will place titleSearch at the next ordinal position under "my_featureset".

Derived Features

Features that build on top of other features are called derived features. They can be expressed as Lucene expressions and are recognized by "template_language": "derived_expression". They can also take query-time variables of type Number, as explained in :ref:`create-feature-set`.
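As a rough illustration, the Python sketch below builds a feature set body containing one derived feature. It assumes the template of a derived feature is a Lucene expression string that refers to another feature by name; the feature name title_query_boost and the expression itself are made up for this example:

```python
import json

# A base feature plus a derived feature that multiplies it by a
# query-time Number parameter (names are illustrative assumptions).
featureset = {
    "featureset": {
        "features": [
            {
                "name": "title_query",
                "params": ["keywords"],
                "template_language": "mustache",
                "template": {"match": {"title": "{{keywords}}"}},
            },
            {
                "name": "title_query_boost",
                "params": ["some_multiplier"],
                "template_language": "derived_expression",
                # Assumed form: an expression over feature names and params.
                "template": "title_query * some_multiplier",
            },
        ]
    }
}

body = json.dumps(featureset, indent=2)
print(body)
```

The resulting JSON could be sent as the body of a feature set creation request like the ones shown elsewhere in this guide.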

Script Features

These are essentially :ref:`derived-features` with access to the feature_vector, but they can be native or Painless Elasticsearch scripts rather than Lucene expressions. "template_language": "script_feature" allows LTR to identify the templated script as a regular Elasticsearch script (e.g. native, Painless, etc.).

The custom script has access to the feature_vector via the Java Map interface as explained in :ref:`create-feature-set`.

(WARNING: script features can degrade the performance of your Elasticsearch cluster. If your queries need to be highly performant, avoid using them for feature generation where possible.)

Script Features Parameters

Script features are essentially native/Painless scripts and can accept parameters as per the Elasticsearch script documentation. Parameter values and names can be overridden within LTR scripts. The priority for parameterization, in increasing order, is as follows:

  • A parameter name and value passed directly to the source script, but not listed in the params of the LTR script. These cannot be configured at query time.

  • A parameter name passed both to the sltr query and to the source script, so that the script parameter value can be overridden at query time.

  • An indirection from an LTR script parameter name to a native script parameter name. This allows the LTR parameter name to differ from the underlying script parameter name, so the same native script can be reused as different features within LTR by specifying different parameter names at query time:

    POST _ltr/_featureset/more_movie_features
    {
       "featureset": {
            "features": [
                {
                    "name": "title_query",
                    "params": [
                        "keywords"
                    ],
                    "template_language": "mustache",
                    "template": {
                        "match": {
                            "title": "{{keywords}}"
                        }
                    }
                },
                {
                    "name": "custom_title_query_boost",
                    "params": [
                        "some_multiplier",
                        "ltr_param_foo"
                    ],
                    "template_language": "script_feature",
                    "template": {
                        "lang": "painless",
                        "source": "(long)params.default_param * params.feature_vector.get('title_query') * (long)params.some_multiplier * (long) params.param_foo",
                        "params": {
                            "default_param" : 10.0,
                            "some_multiplier": "some_multiplier",
                            "extra_script_params": {"ltr_param_foo": "param_foo"}
                        }
                    }
                }
            ]
       }
    }
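The three cases above can be modeled with a short Python sketch. This is only an illustration of the described precedence, not the plugin's code; resolve_script_params is a hypothetical helper, and the query-time values are made up:

```python
def resolve_script_params(declared, template_params, query_params):
    """Model how script parameters end up with their final values.

    declared        -- the feature's "params" list (LTR parameter names)
    template_params -- the "params" object inside the script template
    query_params    -- the "params" passed to the sltr query
    """
    resolved = {}
    renames = template_params.get("extra_script_params", {})
    for name, value in template_params.items():
        if name == "extra_script_params":
            continue
        if name in declared and name in query_params:
            # Case 2: declared in the LTR params, so overridable at query time.
            resolved[name] = query_params[name]
        else:
            # Case 1: fixed in the template, not configurable at query time.
            resolved[name] = value
    for ltr_name, script_name in renames.items():
        if ltr_name in declared and ltr_name in query_params:
            # Case 3: LTR parameter name mapped to a script parameter name.
            resolved[script_name] = query_params[ltr_name]
    return resolved


declared = ["some_multiplier", "ltr_param_foo"]
template_params = {
    "default_param": 10.0,
    "some_multiplier": "some_multiplier",
    "extra_script_params": {"ltr_param_foo": "param_foo"},
}
query_params = {"some_multiplier": 2, "ltr_param_foo": 3, "default_param": 99}
final_params = resolve_script_params(declared, template_params, query_params)
print(final_params)
```

In this model default_param keeps its template value of 10.0 even though the query tries to set it, because it is not listed in the feature's params.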
    

Multiple Feature Stores

We defined a feature store in :doc:`building-features`. A feature store corresponds to an independent LTR system: features, feature sets, models backed by a single index and cache. A feature store corresponds roughly to a single search problem, often tied to a single application. For example wikipedia might be backed by one feature store, but wiktionary would be backed by another. There's nothing that would be shared between the two.

Should your Elasticsearch cluster back multiple properties, you can use all the capabilities of this guide on named feature stores, simply by initializing a store with a name:

PUT _ltr/wikipedia

Then the same API in this guide applies to this feature store, for example to create a feature set:

POST _ltr/wikipedia/_featureset/attempt_1
{
   "featureset": {
        "features": [
            {
                "name": "title_query",
                "params": [
                    "keywords"
                ],
                "template_language": "mustache",
                "template": {
                    "match": {
                        "title": "{{keywords}}"
                    }
                }
            }
        ]
   }
}

And of course you can delete a featureset:

DELETE _ltr/wikipedia/_featureset/attempt_1

You can use featuresets of specific feature stores by using the store parameter in the sltr part of your query when logging features:

"sltr": {
    "_name": "logged_featureset",
    "featureset": "attempt_1",
    "store": "wikipedia",
    "params": {
        "keywords": "star"
    }
}

If no store is specified, the default store is used to look up the featureset.
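The naming pattern for default and named stores can be summarized in a small Python sketch. The helper names featureset_path and sltr_query are hypothetical; the paths and the store key follow the examples above:

```python
from typing import Optional


def featureset_path(name: str, store: Optional[str] = None) -> str:
    """Return the REST path of a feature set in the default or a named store."""
    prefix = f"_ltr/{store}" if store else "_ltr"
    return f"{prefix}/_featureset/{name}"


def sltr_query(featureset: str, params: dict, store: Optional[str] = None) -> dict:
    """Build the sltr clause, adding "store" only for a named feature store."""
    sltr = {"_name": "logged_featureset", "featureset": featureset, "params": params}
    if store:
        sltr["store"] = store
    return {"sltr": sltr}


print(featureset_path("attempt_1", "wikipedia"))
print(sltr_query("attempt_1", {"keywords": "star"}, store="wikipedia"))
```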

Model Caching

The plugin uses an internal cache for compiled models.

Clear the cache for a feature store to force models to be recompiled:

POST /_ltr/_clearcache

Get cluster wide cache statistics for this store:

GET /_ltr/_cachestats

Characteristics of the internal cache can be controlled with these node settings:

# limit cache usage to 12 megabytes (defaults to 10mb or max_heap/10 if lower)
ltr.caches.max_mem: 12mb
# Evict cache entries 10 minutes after insertion (defaults to 1hour, set to 0 to disable)
ltr.caches.expire_after_write: 10m
# Evict cache entries 10 minutes after access (defaults to 1hour, set to 0 to disable)
ltr.caches.expire_after_read: 10m

Extra Logging

As described in :doc:`logging-features`, it is possible to use the logging extension to return the feature values with each document. For native scripts, it is also possible to return extra arbitrary information with the logged features.

For native scripts, the parameter extra_logging is injected into the script parameters. The parameter value is a Supplier<Map>, which provides a non-null Map<String,Object> only during the logging fetch phase. Any values added to this Map will be returned with the logged features:

@Override
public double runAsDouble() {
...
    Map<String,Object> extraLoggingMap = ((Supplier<Map<String,Object>>) getParams().get("extra_logging")).get();
    if (extraLoggingMap != null) {
        extraLoggingMap.put("extra_float", 10.0f);
        extraLoggingMap.put("extra_string", "additional_info");
    }
...
}

If (and only if) the extra logging Map is accessed, it will be returned as an additional entry with the logged features:

{
    "log_entry1": [
        {
            "name": "title_query",
            "value": 9.510193
        },
        {
            "name": "body_query",
            "value": 10.7808075
        },
        {
            "name": "user_rating",
            "value": 7.8
        },
        {
            "name": "extra_logging",
            "value": {
                "extra_float": 10.0,
                "extra_string": "additional_info"
            }
        }
    ]
}
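On the client side, the extra entry can be separated from the ordinary feature values. The Python sketch below assumes the list under log_entry1 has already been extracted from the search response; split_log_entry is a hypothetical helper:

```python
def split_log_entry(entries):
    """Split logged entries into feature values and the extra logging map."""
    features = {}
    extra = None
    for entry in entries:
        if entry["name"] == "extra_logging":
            extra = entry["value"]
        else:
            # Use .get so an entry without a "value" key yields None.
            features[entry["name"]] = entry.get("value")
    return features, extra


log_entry1 = [
    {"name": "title_query", "value": 9.510193},
    {"name": "body_query", "value": 10.7808075},
    {"name": "user_rating", "value": 7.8},
    {"name": "extra_logging", "value": {"extra_float": 10.0, "extra_string": "additional_info"}},
]
features, extra = split_log_entry(log_entry1)
print(features)
print(extra)
```

This sketch treats any entry named extra_logging as the extra map, so it assumes no regular feature uses that name.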

Feature Score Caching

By default, this plugin calculates feature scores for model inference and for feature score logging separately. For example, if we write a query as below to rescore top-100 documents then return top-10 among them with feature scores, this plugin calculates the feature scores on the 100 documents for model inference then calculates again and logs 10 of them:

POST tmdb/_search
{
    "size": 10,
    "query": {
        "match": {
            "_all": "rambo"
        }
    },
    "rescore": {
        "window_size" : 100,
        "query": {
            "rescore_query": {
                "sltr": {
                    "params": {
                        "keywords": "rambo"
                    },
                    "model": "my_model"
                }
            }
        }
    },
    "ext": {
        "ltr_log": {
            "log_specs": {
                "name": "log_entry1",
                "rescore_index": 0
            }
        }
    }
}

In some environments, it may be faster to cache the feature scores computed for model inference and reuse them for logging. This plugin supports this behavior. To enable feature score caching, add the "cache": true flag to the LTR query that is the target of feature score logging:

"sltr": {
    "cache": true,
    "params": {
        "keywords": "rambo"
    },
    "model": "my_model"
}
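For clients that assemble the request programmatically, a minimal Python sketch of the rescore-and-log request with an optional cache flag might look as follows. The function build_rescore_search is a hypothetical helper; the field names mirror the examples above:

```python
import json


def build_rescore_search(keywords, model, window_size=100, size=10, cache=False):
    """Build a search body that rescores with an LTR model and logs features."""
    sltr = {"params": {"keywords": keywords}, "model": model}
    if cache:
        # Reuse the feature scores computed during rescoring when logging.
        sltr["cache"] = True
    return {
        "size": size,
        "query": {"match": {"_all": keywords}},
        "rescore": {
            "window_size": window_size,
            "query": {"rescore_query": {"sltr": sltr}},
        },
        "ext": {"ltr_log": {"log_specs": {"name": "log_entry1", "rescore_index": 0}}},
    }


request_body = build_rescore_search("rambo", "my_model", cache=True)
print(json.dumps(request_body, indent=2))
```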

Stats

The stats API gives the overall plugin status and statistics:

GET /_ltr/_stats

{
    "_nodes": {
        "total": 1,
        "successful": 1,
        "failed": 0
    },
    "cluster_name": "es-cluster",
    "stores": {
        "_default_": {
            "model_count": 10,
            "featureset_count": 1,
            "feature_count": 0,
            "status": "green"
        }
    },
    "status": "green",
    "nodes": {
        "2QtMvxMvRoOTymAsoQbxhw": {
            "cache": {
                "feature": {
                    "eviction_count": 0,
                    "miss_count": 0,
                    "hit_count": 0,
                    "entry_count": 0,
                    "memory_usage_in_bytes": 0
                },
                "featureset": {
                    "eviction_count": 0,
                    "miss_count": 0,
                    "hit_count": 0,
                    "entry_count": 0,
                    "memory_usage_in_bytes": 0
                },
                "model": {
                    "eviction_count": 0,
                    "miss_count": 0,
                    "hit_count": 0,
                    "entry_count": 0,
                    "memory_usage_in_bytes": 0
                }
            }
        }
    }
}

You can also use filters to retrieve a single stat:

GET /_ltr/_stats/{stat}

Also you can limit the information to a single node in the cluster:

GET /_ltr/_stats/nodes/{nodeId}

GET /_ltr/_stats/{stat}/nodes/{nodeId}
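As one possible use of this response, the Python sketch below derives a hit ratio per node and cache from the hit_count and miss_count fields. It assumes the response has been parsed into a dict shaped like the example above; cache_hit_ratios is a hypothetical helper and the sample numbers are invented:

```python
def cache_hit_ratios(stats):
    """Return {(node_id, cache_name): hit ratio or None if no lookups yet}."""
    ratios = {}
    for node_id, node in stats.get("nodes", {}).items():
        for cache_name, counters in node.get("cache", {}).items():
            lookups = counters["hit_count"] + counters["miss_count"]
            ratios[(node_id, cache_name)] = (
                counters["hit_count"] / lookups if lookups else None
            )
    return ratios


sample_stats = {
    "nodes": {
        "2QtMvxMvRoOTymAsoQbxhw": {
            "cache": {
                "feature": {"hit_count": 0, "miss_count": 0},
                "model": {"hit_count": 30, "miss_count": 10},
            }
        }
    }
}
ratios = cache_hit_ratios(sample_stats)
print(ratios)
```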

TermStat Query

Experimental - This query is currently in an experimental stage and the DSL may change as the code advances. For stable term statistic access please see the ExplorerQuery.

The TermStatQuery is a re-imagination of the legacy ExplorerQuery which offers clearer specification of terms and more freedom to experiment. This query surfaces the same data as the ExplorerQuery but it allows the user to specify a custom Lucene expression for the type of data they would like to retrieve. For example:

POST tmdb/_search
{
    "query": {
        "term_stat": {
            "expr": "df",
            "aggr": "max",
            "terms": ["rambo",  "rocky"],
            "fields": ["title"]
        }
    }
}

The expr parameter is the Lucene expression you want to run on a per term basis. This can simply be a stat type, or a custom formula containing multiple stat types, for example: (tf * idf) / 2. The following stat types are injected into the Lucene expression context for your usage:

  • df -- the direct document frequency for a term. So if rambo occurs in 3 movie titles across multiple documents, this is 3.
  • idf -- the IDF calculation of the classic similarity log((NUM_DOCS+1)/(raw_df+1)) + 1.
  • tf -- the term frequency for a document. So if rambo occurs 3 times in the movie synopsis of the same document, this is 3.
  • tp -- the term positions for a document. Because multiple positions can come back for a single term, review the behavior of pos_aggr.
  • ttf -- the total term frequency for the term across the index. So if rambo is mentioned a total of 100 times in the overview field across all documents, this would be 100.
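To make the idf formula and the sample expression (tf * idf) / 2 concrete, here is a small Python sketch. It assumes the logarithm is the natural log, as in Lucene's classic similarity; the function names and the input numbers are illustrative only:

```python
import math


def classic_idf(num_docs: int, doc_freq: int) -> float:
    """idf = log((NUM_DOCS + 1) / (raw_df + 1)) + 1, using the natural log."""
    return math.log((num_docs + 1) / (doc_freq + 1)) + 1


def half_tf_idf(tf: float, num_docs: int, doc_freq: int) -> float:
    """Evaluate the sample expression (tf * idf) / 2 for one term."""
    return (tf * classic_idf(num_docs, doc_freq)) / 2


# Hypothetical numbers: 1000 documents, the term appears in 3 of them,
# and 3 times within the current document.
print(classic_idf(1000, 3))
print(half_tf_idf(3, 1000, 3))
```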

The aggr parameter tells the query what type of aggregation you want over the collected statistics from the expr. For the example terms of rambo rocky we will get stats for both terms. Since we can only return one value you need to decide what statistical calculation you would like.

Supported aggregation types are:

  • min -- the minimum
  • max -- the maximum
  • avg -- the mean
  • sum -- the sum
  • stddev -- the standard deviation

Additionally the following counts are available:

  • matches -- the number of terms that matched in the current document
  • unique -- the unique number of terms that were passed in
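The effect of the aggr parameter can be sketched in Python as a reduction over the per-term values produced by expr. This is an illustration rather than the plugin's code: the aggregate helper is hypothetical, and the use of the population standard deviation for stddev is an assumption:

```python
import statistics

# Map each aggregation name to a reduction over the per-term values.
AGGREGATIONS = {
    "min": min,
    "max": max,
    "avg": statistics.fmean,
    "sum": sum,
    # Assumption: population standard deviation; the plugin may differ.
    "stddev": statistics.pstdev,
}


def aggregate(values, aggr):
    """Reduce the per-term statistics to the single value a query returns."""
    return AGGREGATIONS[aggr](values)


# Hypothetical df values for the terms "rambo" and "rocky".
per_term_df = [3.0, 5.0]
for name in AGGREGATIONS:
    print(name, aggregate(per_term_df, name))
```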

The terms parameter is an array of terms to gather statistics for. Currently only single terms are supported; there is no support for phrases or span queries. Note: If your field is tokenized you can pass multiple terms in one string in the array.

The fields parameter specifies which fields to check for the specified terms. Note that if no analyzer is specified, the analyzer configured for the field is used.

Optional Parameters

  • analyzer -- if specified this analyzer will be used instead of the configured search_analyzer for each field
  • pos_aggr -- Since each term by itself can have multiple positions, you need to decide which aggregation to apply. This supports the same values as aggr and defaults to AVG

Script Injection

Finally, one last addition that this functionality provides is the ability to inject term statistics into a scripting context. When working with ScriptFeatures if you pass a term_stat object in with the terms, fields and analyzer parameters you can access the raw values directly in a custom script via an injected variable named termStats. This provides for advanced feature engineering when you need to look at all the data to make decisions.

Scripts access matching and unique counts slightly differently than inside the TermStatQuery:

  • To access the count of matched tokens: params.matchCount.get()
  • To access the count of unique tokens: params.uniqueTerms

You have the following options for sending in parameters to scripts. If you always want to find stats about the same terms (e.g. stopwords or other common terms in your index) you can hardcode the parameters along with your script:

POST _ltr/_featureset/test
{
   "featureset": {
     "features": [
       {
         "name": "injection",
         "template_language": "script_feature",
         "template": {
           "lang": "painless",
           "source": "params.termStats['df'].size()",
           "params": {
             "term_stat": {
                "analyzer": "!standard",
                "terms": ["rambo rocky"],
                "fields": ["overview"]
             }
           }
         }
       }
     ]
   }
}

Note: Analyzer names must be prefixed with a bang (!) when specified locally; otherwise the value is treated as a parameter lookup.

To set parameter lookups simply pass the name of the parameter to pull the value from like so:

POST _ltr/_featureset/test
{
   "featureset": {
     "features": [
       {
         "name": "injection",
         "template_language": "script_feature",
         "template": {
           "lang": "painless",
           "source": "params.termStats['df'].size()",
           "params": {
             "term_stat": {
                "analyzer": "analyzerParam",
                "terms": "termsParam",
                "fields": "fieldsParam"
             }
           }
         }
       }
     ]
   }
}

The following example shows how to set the parameters at query time:

POST tmdb/_search
{
    "query": {
        "bool": {
            "filter": [
                {
                    "terms": {
                        "_id": ["7555", "1370", "1369"]
                    }
                },
                {
                    "sltr": {
                        "_name": "logged_featureset",
                        "featureset": "test",
                        "params": {
                          "analyzerParam": "standard",
                          "termsParam": ["troutman"],
                          "fieldsParam": ["overview"]
                        }
                    }
                }
            ]
        }
    },
    "ext": {
        "ltr_log": {
            "log_specs": {
                "name": "log_entry1",
                "named_query": "logged_featureset"
            }
        }
    }
}
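The difference between hardcoded values and parameter lookups in the term_stat object can be modeled with a short Python sketch. It assumes that list values are literals, that string values for terms and fields are lookup names, and that a leading bang marks a literal analyzer name; resolve_term_stat is a hypothetical helper, not part of the plugin:

```python
def resolve_term_stat(term_stat, query_params):
    """Resolve a term_stat definition against query-time sltr params."""
    resolved = {}
    analyzer = term_stat.get("analyzer")
    if analyzer is not None:
        if analyzer.startswith("!"):
            # "!standard" means the literal analyzer named "standard".
            resolved["analyzer"] = analyzer[1:]
        else:
            # Otherwise the value names a query-time parameter.
            resolved["analyzer"] = query_params[analyzer]
    for key in ("terms", "fields"):
        value = term_stat[key]
        # A string is a parameter lookup; a list is a hardcoded value.
        resolved[key] = query_params[value] if isinstance(value, str) else list(value)
    return resolved


hardcoded = {"analyzer": "!standard", "terms": ["rambo rocky"], "fields": ["overview"]}
lookup = {"analyzer": "analyzerParam", "terms": "termsParam", "fields": "fieldsParam"}
query_params = {
    "analyzerParam": "standard",
    "termsParam": ["troutman"],
    "fieldsParam": ["overview"],
}
print(resolve_term_stat(hardcoded, {}))
print(resolve_term_stat(lookup, query_params))
```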