
Design large scale processing

Stefaan Lippens edited this page Sep 28, 2022 · 3 revisions

Current design

The current design is in fact an early prototype:

  • ZooKeeper is used for persistence, which will probably not scale to 50k+ tasks.
  • ZooKeeper does not have a very convenient UI for internal operations.
  • ZooKeeper is a basic key/value store, with no search/query functionality.
  • The aggregator itself is responsible for tracking all these jobs and launching new ones. It is not entirely clear whether this would work well in an active/active HA setup. It also increases the workload on the aggregator, which should rather focus on handling requests.
  • Because the aggregator implementation currently has no background process feature, (sub)job status tracking is only triggered when the user polls for the parent job status. If the end user stops polling for some reason, status updates will never reach the aggregator.
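The last limitation can be illustrated with a minimal sketch (the class and method names here are hypothetical, not the actual aggregator API): sub-job statuses are only refreshed as a side effect of a user-facing status poll, so without polling there is no tracking at all.

```python
from dataclasses import dataclass, field

@dataclass
class SubJob:
    backend: str
    status: str = "queued"

@dataclass
class ParentJob:
    subjobs: list = field(default_factory=list)

    def _sync_subjobs(self):
        # Hypothetical stand-in for querying each back-end
        # for the current status of its sub-job.
        for sub in self.subjobs:
            sub.status = "finished"

    def get_status(self) -> str:
        # Sub-job tracking only happens inside this poll handler:
        # if the user stops calling get_status(), _sync_subjobs()
        # never runs again and updates never reach the aggregator.
        self._sync_subjobs()
        if all(s.status == "finished" for s in self.subjobs):
            return "finished"
        return "running"
```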

NiFi based design

VITO has considerable experience with a NiFi-based design:

  • The aggregator receives a large task and splits it into subtasks. These subtasks are JSON-encoded and persisted in a database like Elasticsearch.
  • A Kibana dashboard allows detailed follow-up of subtasks.
  • A NiFi flow picks up subtasks from Elasticsearch and submits them to compatible back-ends.
  • Complex logic, for instance to determine ordering, priorities, or compatible back-ends, should preferably be computed in the aggregator already and stored in the subtask metadata. This avoids very complex logic in the NiFi system, which is somewhat annoying to manage.
  • NiFi gives us a number of nice features:
    • Support for job ordering and priorities.
    • Sending emails, for instance upon failure of a subjob.
    • Very reliable with respect to failure: NiFi will not easily lose track of jobs, and can pick up where it left off.
    • Full traceability/auditing.
  • NiFi persists state changes into Elasticsearch as well, so we have coarse-grained state in Elasticsearch plus fine-grained state in NiFi itself.
  • NiFi can also trigger other flows; for instance, when a job finishes, we may want to index the STAC result metadata into a catalog.
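The splitting step above could look roughly like the sketch below (function names, field names, and the index name are illustrative assumptions, not the actual implementation): the aggregator pre-computes routing and priority metadata per subtask, so the NiFi flow only has to execute what is stored.

```python
import json

def split_large_job(job_id, tile_ids, backend_for_tile):
    """Split a large job into per-tile subtasks (illustrative sketch;
    real splitting logic would depend on the process graph)."""
    subtasks = []
    for i, tile in enumerate(tile_ids):
        subtasks.append({
            "parent_job_id": job_id,
            "subtask_id": f"{job_id}-{i:04d}",
            "tile": tile,
            # Pre-compute routing and priority in the aggregator,
            # keeping complex logic out of the NiFi flow:
            "backend": backend_for_tile(tile),
            "priority": 10,
            "status": "pending",
        })
    return subtasks

def encode_subtasks(subtasks):
    # Each subtask is JSON-encoded before being persisted to
    # Elasticsearch, e.g. with the official Python client
    # (hypothetical index name):
    #   es.index(index="aggregator-subtasks",
    #            id=doc["subtask_id"], document=doc)
    return [json.dumps(doc) for doc in subtasks]
```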

One downside is of course that an important part of the logic is not written in Python, but as a NiFi flow. However, NiFi flows can be stored in git and shared as open source as well.
