DESIGN.rst

This document outlines the architecture of cc-oci-runtime along with implementation details for the curious or those wishing to get involved in the project.

Henceforth, the tool will simply be referred to as "the runtime".

The runtime is written in ANSI C.

The design has been dictated largely by the Open Containers Initiative (OCI) "runtime" specification, which continues to evolve. As a result, the code is not as simple as it was when it targeted earlier iterations of the specification.

At the time of writing, the reference implementation of the OCI specification (runc) deviates from the specification itself. Since runc is also the default runtime used by Docker, this runtime strives to be compatible with both the specification and runc, which further complicates the code.

At the most basic level, the runtime performs the following steps:

  • Reads command-line arguments and options.
  • Reads the OCI configuration file.
  • Starts a virtual machine.
  • Runs the required workload command inside the virtual machine.
  • Stops the virtual machine.

The runtime is written in ANSI C and makes heavy use of the GLib library. GLib was chosen for its prevalence, flexibility, comprehensive documentation and test suite. JSON is also used heavily, so the accompanying JSON-GLib library was adopted because it shares the same attributes as GLib.

The first places to start to become familiar with the code are:

  • The main() function.
  • The oci.h header file.

The code is fully documented using special comments that are parseable by the excellent Doxygen tool. See README.rst for details of generating and viewing the extensive code documentation.

  • Code lives below the src/ directory.
  • Tests:
    • Unit test code lives in the directory tests/.
    • Functional tests live below the directory tests/functional/.
    • Integration tests live below the directory tests/integration/.
  • The style of the code is similar to that used by the Linux kernel.
  • The code is written to be as clean and readable as possible.
  • Use of "goto" is recommended for simplifying error handling and avoiding duplicated code.
  • All functions must be documented with a Doxygen header.
  • All function parameters must be checked and an error returned when an unexpected value is found.
  • Functions relating to a particular sub-system are separated into their own sub-system-specific file and optional header file.
  • A sub-system should expose the smallest possible interface (all other functions and data should be "static").
  • All sub-system interfaces must be accompanied with unit tests. For example, subsystem "src/${subsystem}.c" must have an accompanying "tests/${subsystem}_test.c". This is a minimum - ideally all functions should have a unit test (to test a private function, replace "static" with "private").
  • Most unit test functions accept a cc_oci_config object. This is the main object which encapsulates the contents of the OCI configuration file along with runtime-specific data.
  • Where possible, all command-line commands and options should be accompanied by a functional test. See How command-line commands are implemented.
  • The BATS test framework is used for functional and integration tests.
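
The conventions above (parameter checking, "goto" for error handling, a single cleanup path) can be sketched as follows. This is an illustration only: example_read_first_line() and EXAMPLE_LINE_MAX are hypothetical names that do not exist in the runtime, and the sketch uses plain libc rather than GLib so that it stands alone.

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define EXAMPLE_LINE_MAX 256 /* hypothetical limit, for illustration only */

/*
 * Hypothetical helper: read the first line of a file into a newly
 * allocated buffer.
 *
 * Returns 0 on success or a negative errno value on failure. On success
 * the caller owns *line and must free() it.
 */
int example_read_first_line(const char *path, char **line)
{
	FILE *fp = NULL;
	char *buf = NULL;
	int ret = -EINVAL;

	/* Check every parameter before doing any work. */
	if (!path || !*path || !line)
		goto out;

	fp = fopen(path, "r");
	if (!fp) {
		ret = -errno;
		goto out;
	}

	buf = malloc(EXAMPLE_LINE_MAX);
	if (!buf) {
		ret = -ENOMEM;
		goto out;
	}

	if (!fgets(buf, EXAMPLE_LINE_MAX, fp)) {
		ret = -EIO;
		goto out;
	}

	/* Strip the trailing newline, if any. */
	buf[strcspn(buf, "\n")] = '\0';

	*line = buf;
	buf = NULL; /* ownership has passed to the caller */
	ret = 0;

out:
	/* Single cleanup path shared by success and every error case. */
	free(buf);
	if (fp)
		fclose(fp);
	return ret;
}
```

Every failure jumps to the same label, so resources are released in exactly one place and no cleanup code is duplicated.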

The OCI JSON configuration file, config.json (represented in the code by CC_OCI_CONFIG_FILE), is passed to the create command and parsed by cc_oci_config_file_parse(), which loads the file into a tree of GNodes. This function then calls cc_oci_process_config(), which iterates over the tree and calls special "handler" functions for each node. This logic is encapsulated by spec_handler objects, which define the name of the node they operate on and a function to call to handle the node.

The spec handlers used to parse the configuration file for container creation are encapsulated in the start_state_handlers array, whilst those used to stop a container are encapsulated in the stop_state_handlers array.

Each spec_handler is defined in a separate file below src/spec_handlers/.

For example, the spec_handler to parse the OCI config root object is src/spec_handlers/root.c.
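
The handler-table idea described above can be sketched in a few lines. All names here (example_node, example_config, example_spec_handler and so on) are hypothetical stand-ins: the real runtime walks a GNode tree built from JSON, whereas this sketch uses a flat array of name/value pairs to stay self-contained.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for one parsed configuration node. */
struct example_node {
	const char *name;  /* e.g. "root" */
	const char *value;
};

/* Hypothetical stand-in for the object the handlers fill in. */
struct example_config {
	const char *root_path;
	const char *hostname;
};

/* A handler names the node it operates on and the function to call. */
struct example_spec_handler {
	const char *name;
	bool (*handle)(const struct example_node *node,
		       struct example_config *config);
};

static bool example_handle_root(const struct example_node *node,
				struct example_config *config)
{
	if (!node->value)
		return false;
	config->root_path = node->value;
	return true;
}

static bool example_handle_hostname(const struct example_node *node,
				    struct example_config *config)
{
	if (!node->value)
		return false;
	config->hostname = node->value;
	return true;
}

static const struct example_spec_handler example_handlers[] = {
	{ "root", example_handle_root },
	{ "hostname", example_handle_hostname },
	{ NULL, NULL } /* terminator */
};

/*
 * Call the matching handler for each node. Nodes without a handler are
 * ignored. Returns the number of nodes handled, or -1 on invalid
 * parameters or handler failure.
 */
int example_process_config(const struct example_node *nodes, size_t count,
			   struct example_config *config)
{
	const struct example_spec_handler *h;
	int handled = 0;
	size_t i;

	if (!nodes || !config)
		return -1;

	for (i = 0; i < count; i++) {
		if (!nodes[i].name)
			return -1;
		for (h = example_handlers; h->name; h++) {
			if (strcmp(h->name, nodes[i].name) != 0)
				continue;
			if (!h->handle(&nodes[i], config))
				return -1;
			handled++;
			break;
		}
	}

	return handled;
}
```

Adding support for a new configuration object then only requires a new handler function and one extra table entry.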

Not all runtime commands are provided with the OCI configuration file, so when the runtime's create command is called, it creates a persistent file containing state information that can be read by subsequent invocations of the runtime when passed different commands.

The state file is represented by CC_OCI_STATE_FILE and created by the cc_oci_state_file_create() function.

Other commands read the state file into an oci_state object using the cc_oci_state_file_read() function.

Like the OCI configuration file, the state file is loaded into a GNode tree and uses an array of spec_handler objects to deal with individual JSON objects. The state file spec handlers are encapsulated in the state_handlers array.

Note that the cc_oci_config object includes a similar object in the form of cc_oci_container_state. But whereas the create command has access to the complete cc_oci_config object, other commands rely on the partial information provided in the oci_state object.

However, some parts of the code require a cc_oci_config object, so a function called cc_oci_config_update() can be called to create a partial (but valid) cc_oci_config object from an oci_state object.

The CC_OCI_HYPERVISOR_CMDLINE_FILE file is used to specify the arguments to use to launch the hypervisor. This file is read by the cc_oci_vm_args_get() function which also expands the special tags (variables) which can be included in the file. The expansions are handled by the cc_oci_expand_cmdline() function.
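
As a rough illustration of tag expansion, the following sketch replaces every occurrence of one tag in an argument string. The function name example_expand_tag() and the "@NAME@" tag syntax used in the usage note are assumptions made for this example; they are not taken from cc_oci_expand_cmdline(), whose actual tags and behaviour are documented in the code.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: return a newly allocated copy of arg with every
 * occurrence of tag replaced by value. Returns NULL on invalid
 * parameters or allocation failure. The caller must free() the result.
 */
char *example_expand_tag(const char *arg, const char *tag, const char *value)
{
	size_t tag_len, value_len, count = 0, len;
	const char *p, *hit;
	char *result, *out;

	if (!arg || !tag || !*tag || !value)
		return NULL;

	tag_len = strlen(tag);
	value_len = strlen(value);

	/* First pass: count the (non-overlapping) occurrences. */
	for (p = arg; (hit = strstr(p, tag)) != NULL; p = hit + tag_len)
		count++;

	len = strlen(arg) - count * tag_len + count * value_len;
	result = malloc(len + 1);
	if (!result)
		return NULL;

	/* Second pass: copy the text between tags, substituting value. */
	out = result;
	for (p = arg; (hit = strstr(p, tag)) != NULL; p = hit + tag_len) {
		memcpy(out, p, (size_t)(hit - p));
		out += hit - p;
		memcpy(out, value, value_len);
		out += value_len;
	}

	/* Copy whatever follows the last tag, including the terminator. */
	strcpy(out, p);

	return result;
}
```

For instance, expanding the hypothetical tag "@KERNEL@" to "/boot/vmlinux" in the argument "-kernel @KERNEL@" yields "-kernel /boot/vmlinux". Any other tags in the string are left untouched.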

The CC_OCI_VM_CONFIG file is a valid JSON fragment that is used to supplement the data provided by the OCI configuration file. If the OCI configuration file does not contain the required virtual machine configuration, the runtime will attempt to read it from CC_OCI_VM_CONFIG using the get_spec_vm_from_cfg_file() function.

See Logging.

The runtime supports the OCI runtime commands along with additional commands supported by runc.

  • Every command-line command (or "sub-command") is implemented in its own separate file below the src/commands/ directory.
  • Each command must define a subcommand object which specifies:
    • The name of the command as specified on the command-line.
    • A description that will be displayed in usage output.
    • An optional array of command-line options the command accepts.
    • A handler function called when the user specifies the command on the command-line.
  • Most OCI runtime commands have a corresponding function (prefixed with "cc_oci") in src/oci.c.
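
A minimal sketch of how such a command table might be dispatched is shown below. The struct layout, the command descriptions and the function names are all hypothetical, and the option array is omitted to keep the example short; see src/commands/ for the real definitions.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the runtime's subcommand object. */
struct example_subcommand {
	const char *name;        /* as typed on the command-line */
	const char *description; /* shown in usage output */
	int (*handler)(int argc, char **argv);
};

static int example_handle_version(int argc, char **argv)
{
	(void)argc;
	(void)argv;
	return 0;
}

/* In this sketch, "state" requires a container ID as argv[1]. */
static int example_handle_state(int argc, char **argv)
{
	if (argc < 2 || !argv[1] || !*argv[1])
		return 1;
	return 0;
}

static const struct example_subcommand example_commands[] = {
	{ "version", "display version details", example_handle_version },
	{ "state", "show container state", example_handle_state },
	{ NULL, NULL, NULL } /* terminator */
};

/*
 * Look up argv[0] in the table and run its handler. Returns the
 * handler's exit code, or -1 for invalid parameters or an unknown
 * command.
 */
int example_dispatch(int argc, char **argv)
{
	const struct example_subcommand *cmd;

	if (argc < 1 || !argv || !argv[0])
		return -1;

	for (cmd = example_commands; cmd->name; cmd++) {
		if (strcmp(cmd->name, argv[0]) == 0)
			return cmd->handler(argc, argv);
	}

	return -1;
}
```

Keeping each command's name, description and handler together in one object means the usage output and the dispatch logic are driven from the same table.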

For a simple example, see src/commands/version.c which is the implementation for:

$ cc-oci-runtime version

All command-line commands should have a corresponding functional test. For example, the version command has a BATS functional test at tests/functional/version.bats.

The hypervisor is launched by the cc_oci_vm_launch() function.

The logic employed by this function is unfortunately quite elaborate. This is mostly because version 1.0.0-rc1 of the OCI runtime specification split the previous start command into two separate commands (create and start), but also because of containerd's expectations of how a runtime should operate.

Under Docker, the pre-start hooks are responsible for setting up the container's network, and the runtime process is expected to be running at create time. The create command therefore runs the pre-start hooks _first_, then arranges for the hypervisor process to be started, passing it the network configuration derived from the execution of the pre-start hooks. Further, since the hypervisor process must exist but is not allowed to execute at this phase, it is created in a stopped state by launching it under the control of ptrace(2). This control is immediately relinquished, but the process is sent a SIGSTOP signal so that it remains "paused".

The start command then "releases" the stopped hypervisor process by sending it the SIGCONT signal, allowing it to start executing.
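
The create/start split can be illustrated with a much-simplified sketch. Note the assumptions: this example does not use ptrace(2) at all; the child simply stops itself with raise(SIGSTOP) before doing any work, which is enough to show the "exists but paused" state. The function names are hypothetical, the child exits with a given code instead of executing a hypervisor, and example_start() waits for the child only so that the result can be observed.

```c
#define _POSIX_C_SOURCE 200809L

#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * "create" phase (sketch): fork a child that pauses itself before doing
 * any work. Returns the child's PID once it has been observed in the
 * stopped state, or -1 on error.
 */
pid_t example_create_stopped(int exit_code)
{
	pid_t pid;
	int status;

	pid = fork();
	if (pid < 0)
		return -1;

	if (pid == 0) {
		/* Child: stop until the parent sends SIGCONT. */
		raise(SIGSTOP);

		/* A real runtime would exec the hypervisor here. */
		_exit(exit_code);
	}

	/* Parent: wait until the child has actually stopped. */
	if (waitpid(pid, &status, WUNTRACED) != pid || !WIFSTOPPED(status)) {
		kill(pid, SIGKILL);
		waitpid(pid, NULL, 0);
		return -1;
	}

	return pid;
}

/*
 * "start" phase (sketch): release the stopped child and wait for it.
 * Returns the child's exit code, or -1 on error.
 */
int example_start(pid_t pid)
{
	int status;

	if (pid <= 0)
		return -1;

	if (kill(pid, SIGCONT) < 0)
		return -1;

	if (waitpid(pid, &status, 0) != pid)
		return -1;

	return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

In the runtime itself the two phases run in separate invocations, so the PID has to be carried between them (for example via the state file) rather than held in a variable as it is here.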

Message logging is handled by calling the cc_oci_log_init() function. The code makes heavy use of the GLib logging calls such as g_critical(), g_warning() and g_debug().

The logging code actually writes to up to two files. If a command specifies the --log option, all logging calls will write data to this file. However, since Docker passes this option and sets the path to the log to a container-specific directory, it is also possible to specify the --global-log option to any command regardless of whether --log has been specified. The global log is always written in ASCII format and allows for a single log to be maintained which all containers can write to if desired.

By default, only a few messages will be written to either log under normal operation. However, if --debug is specified, the number of messages logged rises significantly so care should be taken to ensure that sufficient disk space is available for the logs and that log files are rotated and compressed for long-running and/or busy systems.

All writes to either log file are atomic. If no log command-line option is specified, no logging will occur. If logging fails, the runtime will attempt to log using syslog(3).
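
The behaviour described above can be approximated with the sketch below. It is not the runtime's logging code: the function names are hypothetical, and it assumes one common technique for keeping concurrent writers from interleaving, namely opening each file with O_APPEND and emitting every message with a single write(2) call.

```c
#define _XOPEN_SOURCE 700

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <syslog.h>
#include <unistd.h>

#define EXAMPLE_LOG_LINE_MAX 1024 /* hypothetical limit for this sketch */

/* Append one complete line to path using a single write() call. */
static int example_append_line(const char *path, const char *line, size_t len)
{
	ssize_t n;
	int fd;

	fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0640);
	if (fd < 0)
		return -1;

	n = write(fd, line, len);
	close(fd);

	return (n == (ssize_t)len) ? 0 : -1;
}

/*
 * Log msg to up to two files. Either path may be NULL; if both are
 * NULL nothing is logged. If any write fails, the message is also sent
 * to syslog(3). Returns the number of files written, or -1 on invalid
 * parameters.
 */
int example_log(const char *log_path, const char *global_log_path,
		const char *msg)
{
	char line[EXAMPLE_LOG_LINE_MAX];
	int failed = 0;
	int written = 0;
	int len;

	if (!msg || strlen(msg) + 2 > sizeof(line))
		return -1;

	if (!log_path && !global_log_path)
		return 0; /* no log option specified: no logging */

	len = snprintf(line, sizeof(line), "%s\n", msg);
	if (len < 0)
		return -1;

	if (log_path) {
		if (example_append_line(log_path, line, (size_t)len) == 0)
			written++;
		else
			failed = 1;
	}

	if (global_log_path) {
		if (example_append_line(global_log_path, line, (size_t)len) == 0)
			written++;
		else
			failed = 1;
	}

	if (failed)
		syslog(LOG_ERR, "%s", msg);

	return written;
}
```

Formatting the whole line into a buffer first is what allows it to be handed to the kernel in one call, so lines from different containers writing to the same global log are not split mid-message.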

This section gives a broad overview of how Docker 1.12 interacts with the runtime.

The simplest example to consider is what happens when the user runs:

$ docker run -ti busybox

The following is a simplified UML sequence diagram showing how the individual elements interact:

  +------+  +-------+  +----------+
  |docker|  |dockerd|  |containerd|
  +------+  +-------+  +----------+
      |         |           |
"run" +-------->|           |
      |         +---------->|         +---------------+
      |         |           +-------->|containerd-shim|
      |         |           |         +-------+-------+
      |         |           |                 |          +--------------+
      |         |           |                 |--------->|cc-oci-runtime| "create"
      |         |           |                 |          +------+-------+
      |         |           |                 |                 |
      |         |           |                 |                 | fork()      +---------+
      |         |           |                 |                 +------------>|qemu-lite|
      |         |           |                 |                 |             +------+--+
      |         |           |                 |                 |                    |
      |         |           |                 |                 | write state        |     +-----+
      |         |           |                 |                 +--------------------|---->|state|
      |         |           |                 |                 |                    |     +-----+
      |         |           |                 |                 | exit()             |        ^
      |         |           |                 |<----------------+                    |        |
      |         |           |                 |           +--------------+           |        |
      |         |           +-----------------+---------->|cc-oci-runtime| "start"   |        |
      |         |           |                 |           +-----+--------+           |        |
      |         |           |                 |                 |                    |        |
      |         |           |                 |                 | read state         |        |
      |         |           |                 |                 +--------------------|--------+
      |         |           |                 |                 |                    |        |
      |         |           |                 |                 | enable hypervisor  |        |
      |         |           |                 |                 +------------------->|        |
      |         |           |                 |                 |                    |        |
      |         |           |                 |                 | exit()             |        |
      |         |           |<----------------|-----------------+                    |        |
      |         |           |                 |                                      |        |
      |         |           |                 |                                      | exit() |
      |         |           |<----------------+--------------------------------------+        |
      |         |           |                                                                 |
      |         |           |                             +--------------+                    |
      |         |           |-----------------+---------->|cc-oci-runtime| "delete"           |
      |         |           |                             +-----+--------+                    |
      |         |           |                                   |                             |
      |         |           |                                   | delete state                |
      |         |           |                                   +-----------------------------+
      |         |           |                                   |
      |         |           |                                   | exit()
      |         |           |<----------------+-----------------+
      |         |           |
      |         |           | notify exit()
      |<--------+-----------+
      |         |           |
      |exit()   |           |
     ---        |           |
                :           :
                .           .

Notes:

  • As the diagram shows, the runtime is called multiple times, each time being passed a different argument (create, start, delete). This reflects the way the OCI specification mandates the runtime be invoked.
  • containerd-shim is able to detect when the qemu-lite process exits since it registers itself as a "sub-reaper" (or "sub-init") process.
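
To illustrate the "sub-reaper" mechanism mentioned in the last note, the following Linux-specific sketch marks the calling process as a child sub-reaper using prctl(2) with PR_SET_CHILD_SUBREAPER, then reaps a grandchild whose parent has already exited. It is not taken from containerd-shim; the function name and the short delay are assumptions chosen for the example.

```c
#define _POSIX_C_SOURCE 200809L

#include <sys/prctl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/*
 * Become a sub-reaper, create a child which itself creates a grandchild
 * and then exits immediately, and reap both. Returns the grandchild's
 * exit code, or -1 on error.
 */
int example_reap_grandchild(int grandchild_exit_code)
{
	struct timespec delay = { 0, 100 * 1000 * 1000 }; /* 100ms */
	pid_t child, pid;
	int status;
	int result = -1;

	/* Orphaned descendants are now re-parented to this process. */
	if (prctl(PR_SET_CHILD_SUBREAPER, 1, 0, 0, 0) < 0)
		return -1;

	child = fork();
	if (child < 0)
		return -1;

	if (child == 0) {
		pid_t grandchild = fork();

		if (grandchild == 0) {
			/* Outlive the intermediate parent, then exit. */
			nanosleep(&delay, NULL);
			_exit(grandchild_exit_code);
		}

		/* Intermediate parent exits, orphaning the grandchild. */
		_exit(0);
	}

	/* Reap everything; the grandchild shows up here as well. */
	while ((pid = wait(&status)) > 0) {
		if (pid != child && WIFEXITED(status))
			result = WEXITSTATUS(status);
	}

	return result;
}
```

Without the prctl() call the orphaned grandchild would be re-parented to init (PID 1) instead, and the original process would never learn that it had exited.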