Merge pull request #446 from cmoussa1/issue#445
repo: create a `doc` folder, add flux-accounting guide
mergify[bot] authored May 10, 2024
2 parents f44be60 + 98e92c2 commit 4a49f54
378 changes: 378 additions & 0 deletions doc/guide/accounting-guide.rst
@@ -0,0 +1,378 @@
.. _flux-accounting-guide:

#####################
Flux Accounting Guide
#####################

*key terms: association, bank*

.. note::
    flux-accounting is still beta software and many of the interfaces
    documented in this guide may change with regularity.

    This document is in DRAFT form.

********
Overview
********

By default, a Flux system instance treats users equally and schedules work
based on demand, without considering a user's history of resource
consumption or the share of available resources that their organization
considers them entitled to use relative to other competing users.

Flux-accounting adds a database which stores site policy, *banks* with
user/project associations, and metrics representing historical usage.
It also adds a Flux jobtap plugin that sets the priority on each job that
enters the system based on multiple factors including *fair share* values.
The priority determines the order in which jobs are considered by the scheduler
for resource allocation. In addition, the jobtap plugin holds or rejects job
requests that exceed user/project specific limits or have exhausted their
bank allocations.

The database is populated and queried with command line tools prefixed with
``flux account``. Accounting scripts are run regularly by
:core:man1:`flux-cron` to pull historical job information from the Flux
``job-list`` and ``job-info`` interfaces into the accounting database,
and to push bank and limit data to the jobtap plugin.

At this time, the database is expected to be installed on a cluster management
node, co-located with the rank 0 Flux broker, managing accounts for that
cluster only. Sites would typically populate the database and keep it up to
date automatically using information regularly pulled or pushed from an
external source like an identity management system.

******************************
Installation and Configuration
******************************

System Prerequisites
====================

The `Flux Administrator's Guide <https://flux-framework.readthedocs.io/projects/flux-core/en/latest/guide/admin.html>`_ documents relevant information for
the administration and management of a Flux system instance.

The following instructions assume that Flux is configured and working, that
the Flux *statedir* (``/var/lib/flux``) is writable by the ``flux`` user,
and that the ``flux`` user is the system instance owner.

Installing Software Packages
============================

The ``flux-accounting`` package should be installed on the management node
from your Linux distribution package manager. Once installed, the service
that accepts ``flux account`` commands and interacts with the flux-accounting
database can be started.

You can enable the service with ``systemctl``; if not configured with a custom
path, the flux-accounting systemd unit file will be installed to the same
location as flux-core's systemd unit file:

.. code-block:: console

    $ sudo systemctl enable flux-accounting

The service can then be controlled with ``systemd``. To utilize the service,
the following prerequisites must be met:

1. A flux-accounting database has been created with ``flux account create-db``.
   The service establishes a connection with the database in order to read from
   and write to it.

2. An active Flux system instance is running. The flux-accounting service will
   only run after the system instance is started.

Accounting Database Creation
============================

The accounting database is created with the command below. Default
parameters are assumed, including the accounting database path of
``/var/lib/flux/FluxAccounting.db``.

.. code-block:: console

    $ sudo -u flux flux account create-db

.. note::
    The flux accounting commands should always be run as the flux user. If they
    are run as root, some commands that rewrite the database could change the
    owner to root, causing flux-accounting scripts run from flux cron to fail.

Banks must be added to the system, for example:

.. code-block:: console

    $ sudo -u flux flux account add-bank root 1
    $ sudo -u flux flux account add-bank --parent-bank=root sub_bank_A 1

Users that are permitted to run on the system must be assigned banks,
for example:

.. code-block:: console

    $ sudo -u flux flux account add-user --username=user1234 --bank=sub_bank_A

Enabling Multi-factor Priority
==============================

When flux-accounting is installed, the job manager uses a multi-factor
priority plugin to calculate job priorities. The Flux system instance must
configure the ``job-manager`` to load this plugin.

.. code-block:: toml

    [job-manager]
    plugins = [
      { load = "mf_priority.so" },
    ]

See also: :core:man5:`flux-config-job-manager`.

Automatic Accounting Database Updates
=====================================

When upgrading flux-accounting to a newer version on a system where a
flux-accounting DB is already set up, update the database schema as well,
since tables and columns may have been added or removed in the newer version.
The flux-accounting database schema can be updated with the following command:

.. code-block:: console

    $ sudo -u flux flux account-update-db

A series of actions should run periodically to keep the accounting
system in sync with Flux:

- A script fetches inactive jobs and inserts them into a ``jobs`` table in the
  flux-accounting DB.
- The job-archive module scans inactive jobs and dumps them to a SQLite
  database.
- A script reads the archive database and updates the job usage data in the
  accounting database.
- A script updates the per-user fair share factors in the accounting database.
- A script pushes updated factors to the multi-factor priority plugin.

The Flux system instance must configure the ``job-archive`` module to run
periodically:

.. code-block:: toml

    [archive]
    period = "1m"

See also: :core:man5:`flux-config-archive`.

The scripts should be run by :core:man1:`flux-cron`:

.. code-block:: console

    # /etc/flux/system/cron.d/accounting
    30 * * * * bash -c "flux account-fetch-job-records; flux account update-usage; flux account-update-fshare; flux account-priority-update"

***********************
Database Administration
***********************

The flux-accounting database is a SQLite database which stores user account
information and bank information. Administrators can add, disable, edit, and
view user and bank information by interfacing with the database through
front-end commands provided by flux-accounting. The information in this
database works with flux-core to calculate priorities for jobs submitted by
users, enforce basic job accounting limits, and calculate fair-share values
for users based on previous job usage.

Each user belongs to at least one bank. This user/bank combination is known
as an *association*, the term used throughout the rest of this document.

.. note::
    In order to interact with the flux-accounting database, you must have read
    and write permissions to the directory that the database resides in. The
    SQLite documentation_ states that since "SQLite reads and writes an ordinary
    disk file, the only access permissions that can be applied are the normal
    file access permissions of the underlying operating system."

The front-end commands provided by flux-accounting allow an administrator to
interact with association or bank information. ``flux account -h`` will list
all possible commands that interface with the information stored in their
respective tables in the flux-accounting database. The current database
consists of the following tables:

+--------------------------+--------------------------------------------------+
| table name               | description                                      |
+==========================+==================================================+
| association_table        | stores associations                              |
+--------------------------+--------------------------------------------------+
| bank_table               | stores banks                                     |
+--------------------------+--------------------------------------------------+
| job_usage_factor_table   | stores past job usage factors for associations   |
+--------------------------+--------------------------------------------------+
| t_half_life_period_table | keeps track of the current half-life period for  |
|                          | calculating job usage factors                    |
+--------------------------+--------------------------------------------------+
| queue_table              | stores queues, their limits properties, as well  |
|                          | as their associated priorities                   |
+--------------------------+--------------------------------------------------+
| project_table            | stores projects for associations to charge their |
|                          | jobs against                                     |
+--------------------------+--------------------------------------------------+
| jobs                     | stores inactive jobs for job usage and fair      |
|                          | share calculation                                |
+--------------------------+--------------------------------------------------+
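
Because the database is an ordinary SQLite file, its tables can also be
inspected directly. The following is a minimal sketch, not part of
flux-accounting, that uses Python's standard ``sqlite3`` module to list the
table names in a database. The helper name ``list_tables`` is hypothetical,
and the path in the commented usage lines is the default database path
mentioned earlier in this guide.

.. code-block:: python

    import sqlite3


    def list_tables(conn):
        """Return the names of all tables in an open SQLite connection."""
        cursor = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
        return [row[0] for row in cursor]


    # Example usage (read-only, assuming the default database path):
    # conn = sqlite3.connect("file:/var/lib/flux/FluxAccounting.db?mode=ro", uri=True)
    # print(list_tables(conn))

Opening the file with ``mode=ro`` avoids accidental writes, which matters
because the database file should remain owned by the ``flux`` user.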

To view all associations in a flux-accounting database, run ``flux
account-shares``, which prints this information in a hierarchical format. An
example is shown below:

.. code-block:: console

    $ flux account-shares
    Account         Username  RawShares  RawUsage  Fairshare
    root                              1         0
     bank_A                           1         0
      bank_A        user_1            1         0        0.5
     bank_B                           1         0
      bank_B        user_2            1         0        0.5
      bank_B        user_3            1         0        0.5
     bank_C                           1         0
      bank_C_a                        1         0
       bank_C_a     user_4            1         0        0.5
      bank_C_b                        1         0
       bank_C_b     user_5            1         0        0.5
       bank_C_b     user_6            1         0        0.5

****************************
Job Usage Factor Calculation
****************************

An association's job usage represents their usage on a cluster in relation to
the size of their jobs and how long they ran. The raw job usage value is
defined as the sum of products of the number of nodes used (``nnodes``) and
time elapsed (``t_elapsed``):

.. code-block:: console

    RawUsage = sum(nnodes * t_elapsed)

This job usage factor per association has a half-life decay applied to it as
time passes. By default, this half-life decay is applied to jobs every week
for four weeks; jobs older than four weeks no longer play a role in determining
an association's job usage factor. The configuration parameters that determine
how to represent a half-life for jobs and how long to consider jobs as part of
an association's overall job usage are represented by **PriorityDecayHalfLife**
and **PriorityUsageResetPeriod**, respectively. These parameters are
configured when the flux-accounting database is first created.
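
As a rough illustration of the decay described above, the sketch below halves
the contribution of each older period and ignores periods beyond a cutoff. It
assumes one-week periods and a four-period window (the defaults described
above). The function name ``decayed_usage`` and its parameters are
hypothetical, and this is not the flux-accounting implementation.

.. code-block:: python

    def decayed_usage(period_usages, max_periods=4, decay=0.5):
        """Combine per-period raw usage into one decayed usage value.

        period_usages[0] is the most recent half-life period,
        period_usages[1] the one before it, and so on. Periods at index
        max_periods or later are ignored (assumed reset window).
        """
        recent = period_usages[:max_periods]
        # A period that is `age` periods old contributes usage * decay**age.
        return sum(usage * decay**age for age, usage in enumerate(recent))

Under these assumptions, ``decayed_usage([16000, 8000])`` counts the current
period in full and the previous period at half weight.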

Example Job Usage Calculation
=============================

Below is an example of how flux-accounting calculates an association's current
job usage. Let's say a user has the following job records from the most
recent half-life period (by default, jobs that have completed in the
last week):

.. code-block:: console

       UserID  Username  JobID  T_Submit          T_Run             T_Inactive        Nodes  R
    0  1002    1002      102    1605633403.22141  1605635403.22141  1605637403.22141  2      {"version":1,"execution": {"R_lite":[{"rank":"0","children": {"core": "0"}}]}}
    1  1002    1002      103    1605633403.22206  1605635403.22206  1605637403.22206  2      {"version":1,"execution": {"R_lite":[{"rank":"0","children": {"core": "0"}}]}}
    2  1002    1002      104    1605633403.22285  1605635403.22286  1605637403.22286  2      {"version":1,"execution": {"R_lite":[{"rank":"0","children": {"core": "0"}}]}}
    3  1002    1002      105    1605633403.22347  1605635403.22348  1605637403.22348  1      {"version":1,"execution": {"R_lite":[{"rank":"0","children": {"core": "0"}}]}}
    4  1002    1002      106    1605633403.22416  1605635403.22416  1605637403.22416  1      {"version":1,"execution": {"R_lite":[{"rank":"0","children": {"core": "0"}}]}}

From these job records, we can gather the following information:

* total nodes used (``nnodes``): 8
* total time elapsed (``t_elapsed``): 10000.0 (2000.0 per job, the difference
  between ``T_Inactive`` and ``T_Run``)

So, the usage of the association for the current half-life period is:

.. code-block:: console

    sum(nnodes * t_elapsed) = (2 * 2000) + (2 * 2000) + (2 * 2000) + (1 * 2000) + (1 * 2000)
                            = 4000 + 4000 + 4000 + 2000 + 2000
                            = 16000

This current job usage is then added to the association's previous job usage
stored in the flux-accounting database. This sum then represents the
association's overall job usage.
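
The per-period calculation in this example can be sketched in a few lines of
Python. This is an illustration rather than flux-accounting code: the
``JobRecord`` class and ``raw_usage`` function are hypothetical names, and
elapsed time is taken to be ``T_Inactive - T_Run``, which matches the
2000-second figure used in the example.

.. code-block:: python

    from dataclasses import dataclass


    @dataclass
    class JobRecord:
        jobid: int
        t_run: float
        t_inactive: float
        nnodes: int


    def raw_usage(jobs):
        """Sum nnodes * t_elapsed over all jobs in one half-life period."""
        return sum(job.nnodes * (job.t_inactive - job.t_run) for job in jobs)


    # The five job records from the example above.
    JOBS = [
        JobRecord(102, 1605635403.22141, 1605637403.22141, 2),
        JobRecord(103, 1605635403.22206, 1605637403.22206, 2),
        JobRecord(104, 1605635403.22286, 1605637403.22286, 2),
        JobRecord(105, 1605635403.22348, 1605637403.22348, 1),
        JobRecord(106, 1605635403.22416, 1605637403.22416, 1),
    ]

    print(raw_usage(JOBS))

The printed value should agree with the 16000 node-seconds computed by hand
above.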

****************************
Multi-Factor Priority Plugin
****************************

The multi-factor priority plugin is a jobtap_ plugin that generates
an integer job priority for incoming jobs in a Flux system instance. It uses
a number of factors to calculate a priority and, in the future, can add more
factors. Each factor has an associated integer weight that determines its
importance in the overall priority calculation. The current factors present in
the multi-factor priority plugin are:

* **fair-share**: the ratio between the amount of resources allocated and the
  amount of resources consumed. See the
  :ref:`Glossary definition <glossary-section>` for a more detailed
  explanation of how fair-share is utilized within flux-accounting.

* **urgency**: a user-controlled factor to prioritize their own jobs.
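
To illustrate how weighted factors can combine into a single integer
priority, here is a minimal sketch. It does not reproduce the plugin's actual
formula: the function name, the factor values, and the weights are all
hypothetical, chosen only to show the idea of a weighted sum.

.. code-block:: python

    def weighted_priority(factors, weights):
        """Return an integer priority as a weighted sum of factor values.

        factors and weights are dicts keyed by factor name; every factor
        must have a corresponding weight.
        """
        total = sum(weights[name] * value for name, value in factors.items())
        return int(round(total))


    # Hypothetical values, for illustration only.
    example = weighted_priority(
        {"fair-share": 0.5, "urgency": 16},
        {"fair-share": 100000, "urgency": 1000},
    )

A larger weight makes the corresponding factor dominate the result, which is
how a site could favor fair-share over user-set urgency or the reverse.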

In addition to generating an integer priority for submitted jobs in a Flux
system instance, the multi-factor priority plugin also enforces per-association
job limits to regulate use of the system. The two per-association limits
enforced by this plugin are:

* **max_active_jobs**: a limit on how many *active* jobs an association can
  have at any given time. Jobs submitted after this limit has been hit will be
  rejected with a message saying that the association has hit their active
  jobs limit.

* **max_running_jobs**: a limit on how many *running* jobs an association can
  have at any given time. Jobs submitted after this limit has been hit will be
  held by adding a ``max-running-jobs-user-limit`` dependency until one of the
  association's currently running jobs finishes running.

Both "types" of jobs, *running* and *active*, are based on Flux's definitions
of job states_. *Active* jobs can be in any state but INACTIVE. *Running* jobs
are jobs in either RUN or CLEANUP states.
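
The effect of the two limits on a newly submitted job can be summarized with
a small decision function. This is a sketch of the behavior described above,
not the plugin's code: the function name and return values are hypothetical,
and the use of ``>=`` for "limit has been hit" is an assumption.

.. code-block:: python

    def check_limits(active_jobs, running_jobs, max_active_jobs, max_running_jobs):
        """Decide what happens to a new job submitted by one association.

        Returns "reject" when the active-jobs limit is already reached,
        "hold" when only the running-jobs limit is reached, and "accept"
        otherwise.
        """
        if active_jobs >= max_active_jobs:
            return "reject"
        if running_jobs >= max_running_jobs:
            # Corresponds to the job being held with a dependency until a
            # running job finishes.
            return "hold"
        return "accept"

For example, an association with 3 running jobs, 5 active jobs, and limits of
``max_running_jobs=3`` and ``max_active_jobs=10`` would have its next job
held rather than rejected.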

.. _glossary-section:

********
Glossary
********

association
  A 2-tuple combination of a username and bank name.

bank
  An account that contains associations.

fair-share
  A metric used to ensure equitable resource allocation among associations
  within a shared system. It represents the ratio between the amount of
  resources an association is allocated versus the amount actually consumed.
  The fair-share value influences an association's priority when submitting
  jobs to the system, adjusting dynamically to reflect current usage compared
  to allocated quotas. High consumption relative to allocation can decrease an
  association's fair-share value, reducing their priority for future resource
  allocation, thereby promoting balanced usage across all associations to
  maintain system fairness and efficiency.

.. note::

    The design of flux-accounting was driven by LLNL site requirements. Years
    ago, the design of `Slurm accounting`_ and its `multi-factor priority
    plugin`_ were driven by similar LLNL site requirements. We chose to
    reuse terminology and concepts from Slurm to facilitate a smooth transition
    to Flux. The flux-accounting code base is all completely new, however.

.. _documentation: https://sqlite.org/omitted.html

.. _Slurm accounting: https://slurm.schedmd.com/accounting.html

.. _multi-factor priority plugin: https://slurm.schedmd.com/priority_multifactor.html

.. _jobtap: https://flux-framework.readthedocs.io/projects/flux-core/en/latest/man7/flux-jobtap-plugins.html#flux-jobtap-plugins-7

.. _states: https://flux-framework.readthedocs.io/projects/flux-rfc/en/latest/spec_21.html
