Persisting evaluation outputs

Matt Pearce edited this page Mar 27, 2020 · 4 revisions

Writing the evaluation results to JSON can produce a very large file when many queries are evaluated, especially if you have a number of configuration options. For this reason, results can instead be persisted to other destinations using the Persistence Framework.

Currently, it is possible to persist to two destinations:

  • a JSON file - the default, requiring no additional configuration
  • an Elasticsearch index.

You can write your own persistence handler by implementing the PersistenceHandler interface.

If no persistence handlers are set, or all of them fail to initialise, the evaluation will stop without running any queries.

Configuration (Maven)

If using Maven, configuration for persistence is passed in the configuration section for the evaluation plugin in your pom.xml. If no persistence configuration is supplied, the evaluation will be written to target/rre/evaluation.json.

Each persistence handler in the configuration should be given a name, which can then be used to pass additional parameters:

<build>
  <plugins>
    <plugin>
      <!-- Elasticsearch evaluation plugin -->
      <groupId>io.sease</groupId>
      <artifactId>rre-maven-elasticsearch-plugin</artifactId>
      <version>${elasticsearch.version}</version>
      <dependencies>
        <!-- Each additional persistence plugin needs to be defined here -->
        <!-- Dependency for Elasticsearch persistence -->
        <dependency>
          <groupId>io.sease</groupId>
          <artifactId>rre-persistence-plugin-elasticsearch</artifactId>
          <version>${elasticsearch.version}</version>
        </dependency>
      </dependencies>
      <configuration>
        <persistence>
          <useTimestampAsVersion>false</useTimestampAsVersion>
          <handlers>
            <!-- Define each handler implementation with a name -->
            <json>io.sease.rre.persistence.impl.JsonPersistenceHandler</json>
            <es_local>io.sease.rre.persistence.impl.ElasticsearchPersistenceHandler</es_local>
            <es_shared>io.sease.rre.persistence.impl.ElasticsearchPersistenceHandler</es_shared>
          </handlers>
          <handlerConfiguration>
            <!-- Add the configuration for each handler, using its name -->
            <json><!-- Any non-default JSON config --></json>
            <es_local><!-- es_local configuration --></es_local>
            <es_shared><!-- es_shared configuration --></es_shared>
          </handlerConfiguration>
        </persistence>
        <!-- Other configuration -->
      </configuration>
    </plugin>
  </plugins>
</build>

You may have as many persistence handlers defined as you like, provided they have unique names.
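
As a sketch of how the placeholders in the example above might be filled in, the following fragment configures the es_local and es_shared handlers using the Elasticsearch options described later on this page. The shared server URL and both index names are illustrative assumptions, not values supplied by RRE:

```xml
<handlerConfiguration>
  <es_local>
    <!-- Local development instance; the index name is an assumed example -->
    <index>rre_local</index>
    <baseUrl>http://localhost:9200</baseUrl>
  </es_local>
  <es_shared>
    <!-- Hypothetical shared server; replace with your own host -->
    <index>rre_shared</index>
    <baseUrl>http://es.example.com:9200</baseUrl>
  </es_shared>
</handlerConfiguration>
```

Both handlers use the same implementation class, so only the name and its configuration block distinguish them.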

When the useTimestampAsVersion option is set to true, the version value in the output documents is set to the current timestamp, represented as the number of seconds since the Unix epoch. In that case you should have only a single configuration version; having more than one will stop the evaluation process.
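
For example, a timestamp-versioned run that writes only to the default JSON file might be configured along these lines (a minimal sketch that reuses the element names from the full example above):

```xml
<persistence>
  <!-- Use the run's timestamp as the version; requires a single configuration version -->
  <useTimestampAsVersion>true</useTimestampAsVersion>
  <handlers>
    <json>io.sease.rre.persistence.impl.JsonPersistenceHandler</json>
  </handlers>
</persistence>
```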

JSON configuration

The destination file may be changed, if required:

<persistence>
  <handlers>
    <json>io.sease.rre.persistence.impl.JsonPersistenceHandler</json>
  </handlers>
  <handlerConfiguration>
    <json>
      <destinationFile>/path/to/my/rre_output.json</destinationFile>
    </json>
  </handlerConfiguration>
</persistence>

The file content is held in memory until all evaluations are complete, allowing the full domain structure to be built. This can be memory-hungry for large evaluation sets.

Elasticsearch configuration

The Elasticsearch persistence plugin will create a new index on the destination server, if one is not already present with the given name. Results are written out in batches at a set interval. The Maven configuration options, with default values shown, are:

<persistence>
  <handlers>
    <elasticsearch>io.sease.rre.persistence.impl.ElasticsearchPersistenceHandler</elasticsearch>
  </handlers>
  <handlerConfiguration>
    <elasticsearch>
      <index>rre_output</index> <!-- REQUIRED -->
      <baseUrl>http://localhost:9200</baseUrl>
      <threadpoolSize>2</threadpoolSize>
      <runIntervalMs>500</runIntervalMs>
      <batchSize>500</batchSize>
    </elasticsearch>
  </handlerConfiguration>
</persistence>

The default threadpool size, run interval and batch size should suit most situations. The index name has no default and must be supplied; otherwise the persistence handler will fail to initialise.

The plugin creates one Elasticsearch document per query version. These can then be grouped or aggregated to get the overall picture for a topic or query group.