Long Running Processes


Overview

This is an in-progress document describing how and why the Flocker operational model might be changed from a collection of short-lived processes executed over SSH to a group of long-lived processes interacting over a real protocol.

Current State

Flocker 0.1 uses short-lived processes for all of its functionality. These processes can be divided into two rough categories. One category includes the user-facing interface - mostly just flocker-deploy. These are commands we encourage end-users to interact with directly. The other category includes the internal interface that we use to allow different parts of our software (primarily parts running on different hosts) to communicate. These parts are typically invoked remotely via SSH. Users are not encouraged to interact with them directly. Examples include flocker-reportstate and flocker-changestate.

Advantages

  • It makes the interface between different components of the system extremely explicit. This encourages loose coupling.
  • In the case of flocker-volume, it exposes certain functionality which may be useful or interesting outside of the core goals of Flocker. This allows a certain level of user-driven experimentation which may suggest valuable directions for Flocker to take.
  • It avoids the need to manage the lifetime of any long-running processes. This saves labor by removing the need to integrate with the system init daemon.
  • Leveraging SSH provides a baseline level of security without requiring us to build much ourselves. The SSH security model is well understood and the OpenSSH package provides a widely deployed implementation.

The latter two of these (or perhaps three, since the value of experimentation will decline as Flocker matures) are advantages of time to market rather than inherent advantages that will persist in the long run.

Disadvantages

  • The complexity of multi-process interactions is greater than the complexity of ordinary in-process method calls. A variety of error conditions are possible only with the multi-process approach, and a reliable piece of software must handle them. Since the only handling possible for many of them is to report an error and then fail, they represent failure modes the software would not have without multi-processing (so the software is overall less reliable when it involves multi-processing). In the long run, the additional complexity also requires more engineering effort to maintain.
  • Use of a shell-escaped command string (SSH does not actually support passing argv as a sequence) introduces certain limitations: there is a maximum command length, and data to be exchanged must be serialized to a shell-compatible form (non-ASCII is difficult, NUL is disallowed, special bytes require quoting or escaping, etc). The sketch after this list illustrates the quoting burden.
  • Without any long-lived processes there is nothing to receive filesystem change notification. This precludes a certain part of the desired feature set (automatic replication of changes).
  • The OpenSSH ssh command line program is only marginally friendly towards programmatic use. The wide variety of configurations which make it appealing as an easy way to let users authenticate against their servers (in whatever way they have previously configured and prefer, eg smartcards or kerberos, etc) also make it difficult to reliably drive automatically. "Control master" sockets can leave processes running for too long, unconfigured GSSAPI can result in long delays in establishing an authenticated connection, etc.
  • The combination of standard out and standard error provides a limited and somewhat error-prone channel for communication back to the client side of the connection. Any protocol operating over these file descriptors can be disrupted by library code that writes data intended for human consumption.
  • Python process startup overhead is non-negligible, and it is paid again on every invocation.
  • Windows is an additional burden to support due to the lack of the OpenSSH CLI there (excepting Cygwin et al., which reach only a tiny fraction of the Windows audience).
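To make the quoting limitation concrete, here is a minimal sketch of what every remote invocation must do today. The remote command line shown is illustrative, but the flattening step is unavoidable:

```python
# A sketch of the quoting burden: SSH accepts one command string, not an
# argv sequence, so every argument must be escaped and joined by hand.
from pipes import quote  # shlex.quote on Python 3
import subprocess

def run_remotely(host, argv):
    command = " ".join(quote(arg) for arg in argv)
    return subprocess.check_output(["ssh", host, command])

# Arbitrary bytes cannot survive this trip: NUL is disallowed, and
# non-ASCII data must first be serialized to something shell-safe.
run_remotely("node1.example.com", ["flocker-changestate", '{"version": 1}'])
```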

The Solution

Scope

The user-facing parts of the interface which are short-lived command line tools will remain as-is. Creation of different or better user interfaces is out of scope for this document. flocker-deploy will continue to interact with cluster nodes using SSH. Replacing the authentication mechanism between the user and the cluster is out of scope for this document.

The roles served by flocker-reportstate, flocker-changestate, and flocker-volume and their interactions are in scope for this document.

Outline

Flocker will supply a long-running process (hereafter referred to as flockerd) which will run on each cluster node (each host which is to be used to run Docker containers or store Flocker volumes). The process will expose an HTTP API which can be used to interrogate the state of the node, initiate a state change on the node, and facilitate volume exchange between nodes.
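For illustration only - none of these paths are settled - the three responsibilities named above might map onto an API surface along these lines:

```python
# A hypothetical endpoint layout for flockerd's HTTP API.  The paths and
# descriptions are assumptions for illustration, not a committed design.
HYPOTHETICAL_API = {
    # Interrogate the state of the node.
    "GET /v1/state": "report the containers and volumes on this node",
    # Initiate a state change on the node.
    "POST /v1/changes": "apply a new desired configuration to this node",
    # Facilitate volume exchange between nodes.
    "POST /v1/volumes/<name>/receive": "accept a volume data stream from a peer",
}
```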

flocker-deploy interaction

The node-side change is orthogonal to flocker-deploy's interaction with clusters. The minimum possible impact this change could have on flocker-deploy is that the implementation of the commands flocker-deploy invokes via SSH will change to communicate with flockerd rather than taking action directly themselves. A slightly more invasive (but still invisible to end-users) approach could involve changing the commands flocker-deploy invokes via SSH - for example, to commands that invoke an HTTP client against the local flockerd interface. Most drastically, flocker-deploy could be changed to interact with flockerd directly with no intervening agents. Any of these approaches should be easily compatible with the long-running flockerd process.
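As a sketch of the middle option, the command invoked over SSH could shrink to a thin HTTP client. The port, path, and certificate location here are assumptions, not decided values:

```python
# A thin SSH-invoked command that delegates to the local flockerd over
# HTTPS.  The URL, port, and CA bundle path are all hypothetical.
import sys
import json
import requests

def main():
    desired_configuration = json.load(sys.stdin)
    response = requests.post(
        "https://localhost:4523/v1/changes",   # hypothetical port and path
        json=desired_configuration,
        verify="/etc/flocker/cluster-ca.pem",  # hypothetical CA bundle
    )
    response.raise_for_status()

if __name__ == "__main__":
    main()
```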

Implementation

Startup

flockerd needs to be started. This most likely necessitates integration with the (likely multiple) platform-specific system init daemons. For example, on Red Hat family Linux distributions, flockerd should probably be set up as a systemd unit. Similarly, on the Ubuntu family of distributions, a configuration for upstart will be required. These can be handled individually and prioritized based on which Linux distributions we think our audience will find most appealing (we may also be able to solicit implementations from the community for their favorite platforms; if Flocker is adopted by distributions then the distributions may also take over maintenance of these integrations).
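For the systemd case, the integration could be as small as a unit file along these lines (the binary path and restart policy are assumptions, not decisions):

```ini
# /etc/systemd/system/flockerd.service - an illustrative sketch, not a
# shipped unit file.
[Unit]
Description=Flocker node agent
After=network.target

[Service]
ExecStart=/usr/sbin/flockerd
Restart=on-failure

[Install]
WantedBy=multi-user.target
```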

This area of the implementation is largely independent of the rest and decisions made elsewhere should have minimal impact here.

Logging

flockerd needs to publish detailed information about its activities. The codebase has already begun to adopt eliot as its structured logging library. Until there is convergence (or something that might at least be mistaken for convergence) in the Python community over structured logging libraries, eliot is probably our best choice (because we are already familiar with it, because we can continue to develop it to meet our needs, because we're already using it, etc).
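For a flavor of what this looks like in practice, here is a minimal eliot sketch; the action and field names are illustrative, not part of any agreed schema:

```python
# Structured logging with eliot: each message is a JSON object, and
# related messages are grouped into actions with success/failure status.
from eliot import start_action, to_file

# Write the structured log stream to a file, one JSON message per line.
to_file(open("flockerd.log", "ab"))

def push_volume(volume, destination):
    # Everything logged inside this block is attached to one action, so
    # the whole operation can be traced (and timed) as a unit.  An
    # exception escaping the block records the action as failed.
    with start_action(action_type="flocker:volume:push",
                      volume=volume, destination=destination):
        pass  # do the actual work here
```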

Public API

flockerd needs a public, stable interface to allow users to control its behavior. HTTP is the lingua franca of our industry. flockerd will expose an HTTP API over TLSv1.2 (or newer, as security concerns dictate) only - no cleartext access. flocker-deploy may not initially use this but it probably will eventually. We should expect and encourage users to build tools against this interface.

The infrastructure for building JSON-based REST HTTP APIs ClusterHQ previously built for HybridCluster will be used to build this API. It will support versioning, authentication, authorization, and input and output schema-based validation of JSON content.
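To illustrate the schema-validation idea in generic terms (the actual ClusterHQ infrastructure has its own machinery; this sketch uses the independent jsonschema library and a made-up schema):

```python
# Schema-based validation of JSON input, shown with the generic
# jsonschema library.  The schema itself is hypothetical.
from jsonschema import validate, ValidationError

CHANGE_REQUEST_SCHEMA = {
    "type": "object",
    "properties": {
        "version": {"type": "integer"},
        "deployment": {"type": "object"},
    },
    "required": ["version", "deployment"],
}

try:
    validate({"version": 1, "deployment": {}}, CHANGE_REQUEST_SCHEMA)
except ValidationError as e:
    # A real API would turn this into a 4xx response with details.
    print("rejected:", e.message)
```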

Internal API

User-triggered deployment updates require that Flocker nodes communicate with each other:

  • to exchange small pieces of information about what snapshots exist on what nodes.
  • to transfer large snapshot data streams to support moving volumes along with their containers.

Far-future goals of Flocker will have more complex internal communication requirements (eg automatic failover requires both a liveness protocol and a consensus protocol). These will not be addressed here.

The internal API can be structured the same way as the public API: as a JSON-based REST HTTP API. Of note, JSON-encoding a ZFS data stream is impractical, so the API endpoint involved will accept the raw bytes as the request message body.
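On the sending side, this can be as simple as streaming the output of zfs send straight into the request body. The endpoint and dataset names below are hypothetical; the point is that the stream is never JSON-encoded or held in memory:

```python
# Push a raw ZFS data stream as an HTTP request body.  requests accepts
# a file-like body and streams it with chunked transfer encoding.
import subprocess
import requests

def push_snapshot(dataset, snapshot, peer):
    sender = subprocess.Popen(
        ["zfs", "send", "%s@%s" % (dataset, snapshot)],
        stdout=subprocess.PIPE,
    )
    response = requests.post(
        "https://%s:4523/v1/volumes/receive" % (peer,),  # hypothetical
        data=sender.stdout,
        headers={"Content-Type": "application/octet-stream"},
    )
    sender.wait()
    response.raise_for_status()
```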

This approach requires efficient handling of large HTTP request bodies. Twisted Web does not preclude this but neither does it provide good tools to make it easy. This may be an area where we can contribute to Twisted Web.
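To make the concern concrete, here is a deliberately naive receiving resource. It leans on Twisted Web's default behavior of buffering the whole request body before render is called, and it blocks the reactor while feeding zfs receive - exactly the kind of limitation a real implementation would need to engineer around:

```python
# A naive Twisted Web resource for the receiving side.  The dataset name
# is hypothetical, and a real implementation must not block the reactor.
import subprocess
from twisted.web.resource import Resource

class ReceiveVolume(Resource):
    isLeaf = True

    def render_POST(self, request):
        # request.content is a file-like object holding the request body
        # (Twisted Web has already buffered it in full by this point).
        receiver = subprocess.Popen(
            ["zfs", "receive", "flocker/example"],
            stdin=subprocess.PIPE,
        )
        for chunk in iter(lambda: request.content.read(65536), b""):
            receiver.stdin.write(chunk)
        receiver.stdin.close()
        receiver.wait()
        request.setResponseCode(204)
        return b""
```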

Open Questions

  1. Does flockerd run on the host or in a Docker container?
  2. Does flockerd log to a file itself or does it try to delegate responsibility for log files to another tool (eg the systemd journal)? If flockerd is deployed in a container, do we try to make log files available on the host or do we write them inside the container? Do we put them onto a log volume so they can be managed with Flocker?
  3. What is the authentication and authorization model for the public API? Do we stick to password-based authentication? Do we use system credentials? Do we support OAuth2 for granting fine-grained access to third-parties? Do we use Kerberos or TLS client certificates?