Merge pull request #446 from cmoussa1/issue#445

repo: create a `doc` folder, add flux-accounting guide
flux-framework · May 10, 2024 · 4a49f54 · 4a49f54
2 parents f44be60 + 98e92c2
commit 4a49f54
Showing 1 changed file with 378 additions and 0 deletions.
diff --git a/doc/guide/accounting-guide.rst b/doc/guide/accounting-guide.rst
@@ -0,0 +1,378 @@
+.. _flux-accounting-guide:
+
+#####################
+Flux Accounting Guide
+#####################
+
+*key terms: association, bank*
+
+.. note::
+    flux-accounting is still beta software and many of the interfaces
+    documented in this guide may change with regularity.
+
+    This document is in DRAFT form.
+
+********
+Overview
+********
+
+By default, a Flux system instance treats users equally and schedules work
+based on demand, without consideration of a user's history of resource
+consumption, or what share of available resources their organization considers
+they should be entitled to use relative to other competing users.
+
+Flux-accounting adds a database which stores site policy, *banks* with
+with user/project associations, and metrics representing historical usage.
+It also adds a Flux jobtap plugin that sets the priority on each job that
+enters the system based on multiple factors including *fair share* values.
+The priority determines the order in which jobs are considered by the scheduler
+for resource allocation.  In addition, the jobtap plugin holds or rejects job
+requests that exceed user/project specific limits or have exhausted their
+bank allocations.
+
+The database is populated and queried with command line tools prefixed with
+``flux account``.  Accounting scripts are run regularly by
+:core:man1:`flux-cron` to pull historical job information from the Flux
+``job-list`` and ``job-info`` interfaces into the accounting database,
+and to push bank and limit data to the jobtap plugin.
+
+At this time, the database is expected to be installed on a cluster management
+node, co-located with the rank 0 Flux broker, managing accounts for that
+cluster only.  Sites would typically populate the database and keep it up to
+date automatically using information regularly pulled or pushed from an
+external source like an identity management system.
+
+******************************
+Installation and Configuration
+******************************
+
+System Prerequisites
+====================
+
+The `Flux Administrator's Guide <https://flux-framework.readthedocs.io/projects/flux-core/en/latest/guide/admin.html>`_ documents relevant information for
+the administration and management of a Flux system instance.
+
+The following instructions assume that Flux is configured and working, that
+the Flux *statedir* (``/var/lib/flux``) is writable by the ``flux`` user,
+and that the ``flux`` user is the system instance owner.
+
+Installing Software Packages
+============================
+
+The ``flux-accounting`` package should be installed on the management node
+from your Linux distribution package manager. Once installed, the service
+that accepts ``flux account`` commands and interacts with the flux-accounting
+database can be started.
+
+You can enable the service with ``systemctl``; if not configured with a custom
+path, the flux-accounting systemd unit file will be installed to the same
+location as flux-core's systemd unit file:
+
+.. code-block:: console
+
+  $ sudo systemctl enable flux-accounting
+
+The service can then be controlled with ``systemd``. To utilize the service,
+the following prerequisites must be met:
+
+1. A flux-accounting database has been created with ``flux account create-db``.
+The service establishes a connection with the database in order to read from
+and write to it.
+
+2. An active Flux system instance is running. The flux-accounting service will
+only run after the system instance is started.
+
+Accounting Database Creation
+============================
+
+The accounting database is created with the command below.  Default
+parameters are assumed, including the accounting database path of
+``/var/lib/flux/FluxAccounting.db``.
+
+.. code-block:: console
+
+ $ sudo -u flux flux account create-db
+
+.. note::
+    The flux accounting commands should always be run as the flux user. If they
+    are run as root, some commands that rewrite the database could change the
+    owner to root, causing flux-accounting scripts run from flux cron to fail.
+
+Banks must be added to the system, for example:
+
+.. code-block:: console
+
+ $ sudo -u flux flux account add-bank root 1
+ $ sudo -u flux flux account add-bank --parent-bank=root sub_bank_A 1
+
+Users that are permitted to run on the system must be assigned banks,
+for example:
+
+.. code-block:: console
+
+ $ sudo -u flux flux account add-user --username=user1234 --bank=sub_bank_A
+
+Enabling Multi-factor Priority
+==============================
+
+When flux-accounting is installed, the job manager uses a multi-factor
+priority plugin to calculate job priorities.  The Flux system instance must
+configure the ``job-manager`` to load this plugin.
+
+.. code-block:: toml
+
+ [job-manager]
+ plugins = [
+   { load = "mf_priority.so" },
+ ]
+
+See also: :core:man5:`flux-config-job-manager`.
+
+Automatic Accounting Database Updates
+=====================================
+
+If updating flux-accounting to a newer version on a system where a
+flux-accounting DB is already configured and set up, it is important to update
+the database schema, as tables and columns may have been added or removed in
+the newer version. The flux-accounting database schema can be updated with the
+following command:
+
+.. code-block:: console
+
+ $ sudo -u flux flux account-update-db
+
+A series of actions should run periodically to keep the accounting
+system in sync with Flux:
+
+- A script fetches inactive jobs and inserts them into a ``jobs`` table in the
+  flux-accounting DB.
+- The job-archive module scans inactive jobs and dumps them to a sqlite
+  database.
+- A script reads the archive database and updates the job usage data in the
+  accounting database.
+- A script updates the per-user fair share factors in the accounting database.
+- A script pushes updated factors to the multi-factor priority plugin.
+
+The Flux system instance must configure the ``job-archive`` module to run
+periodically:
+
+.. code-block:: toml
+
+ [archive]
+ period = "1m"
+
+See also: :core:man5:`flux-config-archive`.
+
+The scripts should be run by :core:man1:`flux-cron`:
+
+.. code-block:: console
+
+ # /etc/flux/system/cron.d/accounting
+
+ 30 * * * * bash -c "flux account-fetch-job-records; flux account update-usage; flux account-update-fshare; flux account-priority-update"
+
+***********************
+Database Administration
+***********************
+
+The flux-accounting database is a SQLite database which stores user account
+information and bank information. Administrators can add, disable, edit, and
+view user and bank information by interfacing with the database through
+front-end commands provided by flux-accounting. The information in this
+database works with flux-core to calculate job priorities submitted by users,
+enforce basic job accounting limits, and calculate fair-share values for
+users based on previous job usage.
+
+Each user belongs to at least one bank. This user/bank combination is known
+as an *association*, and henceforth will be referred to as an *association*
+throughout the rest of this document.
+
+.. note::
+    In order to interact with the flux-accounting database, you must have read
+    and write permissions to the directory that the database resides in. The
+    SQLite documentation_ states that since "SQLite reads and writes an ordinary
+    disk file, the only access permissions that can be applied are the normal
+    file access permissions of the underlying operating system."
+
+The front-end commands provided by flux-accounting allow an administrator to
+interact with association or bank information.  ``flux account -h`` will list
+all possible commands that interface with the information stored in their
+respective tables in the flux-accounting database. The current database
+consists of the following tables:
+
++--------------------------+--------------------------------------------------+
+| table name               | description                                      |
++==========================+==================================================+
+| association_table        | stores associations                              |
++--------------------------+--------------------------------------------------+
+| bank_table               | stores banks                                     |
++--------------------------+--------------------------------------------------+
+| job_usage_factor_table   | stores past job usage factors for associations   |
++--------------------------+--------------------------------------------------+
+| t_half_life_period_table | keeps track of the current half-life period for  |
+|                          | calculating job usage factors                    |
++--------------------------+--------------------------------------------------+
+| queue_table              | stores queues, their limits properties, as well  |
+|                          | as their associated priorities                   |
++--------------------------+--------------------------------------------------+
+| project_table            | stores projects for associations to charge their |
+|                          | jobs against                                     |
++--------------------------+--------------------------------------------------+
+| jobs                     | stores inactive jobs for job usage and fair      |
+|                          | share calculation                                |
++--------------------------+--------------------------------------------------+
+
+To view all associations in a flux-accounting database, the ``flux
+account-shares`` command will print this DB information in a hierarchical
+format. An example is shown below:
+
+.. code-block:: console
+
+ $ flux account-shares
+
+ Account                         Username           RawShares            RawUsage           Fairshare
+ root                                                       1                   0
+  bank_A                                                    1                   0
+   bank_A                          user_1                   1                   0                 0.5
+  bank_B                                                    1                   0
+   bank_B                          user_2                   1                   0                 0.5
+   bank_B                          user_3                   1                   0                 0.5
+  bank_C                                                    1                   0
+   bank_C_a                                                 1                   0
+    bank_C_a                       user_4                   1                   0                 0.5
+   bank_C_b                                                 1                   0
+    bank_C_b                       user_5                   1                   0                 0.5
+    bank_C_b                       user_6                   1                   0                 0.5
+
+
+****************************
+Job Usage Factor Calculation
+****************************
+
+An association's job usage represents their usage on a cluster in relation to
+the size of their jobs and how long they ran. The raw job usage value is
+defined as the sum of products of the number of nodes used (``nnodes``) and
+time elapsed (``t_elapsed``):
+
+.. code-block:: console
+
+  RawUsage = sum(nnodes * t_elapsed)
+
+This job usage factor per association has a half-life decay applied to it as
+time passes. By default, this half-life decay is applied to jobs every week
+for four weeks; jobs older than four weeks no longer play a role in determining
+an association's job usage factor. The configuration parameters that determine
+how to represent a half-life for jobs and how long to consider jobs as part of
+an association's overall job usage are represented by **PriorityDecayHalfLife**
+and  **PriorityUsageResetPeriod**, respectively. These parameters are
+configured when the flux-accounting database is first created.
+
+Example Job Usage Calculation
+=============================
+
+Below is an example of how flux-accounting calculates an association's current
+job usage. Let's say a user has the following job records from the most
+recent half-life period (by default, jobs that have completed in the
+last week):
+
+.. code-block:: console
+
+     UserID Username  JobID         T_Submit            T_Run       T_Inactive  Nodes                                                                               R
+  0    1002     1002    102 1605633403.22141 1605635403.22141 1605637403.22141      2  {"version":1,"execution": {"R_lite":[{"rank":"0","children": {"core": "0"}}]}}
+  1    1002     1002    103 1605633403.22206 1605635403.22206 1605637403.22206      2  {"version":1,"execution": {"R_lite":[{"rank":"0","children": {"core": "0"}}]}}
+  2    1002     1002    104 1605633403.22285 1605635403.22286 1605637403.22286      2  {"version":1,"execution": {"R_lite":[{"rank":"0","children": {"core": "0"}}]}}
+  3    1002     1002    105 1605633403.22347 1605635403.22348 1605637403.22348      1  {"version":1,"execution": {"R_lite":[{"rank":"0","children": {"core": "0"}}]}}
+  4    1002     1002    106 1605633403.22416 1605635403.22416 1605637403.22416      1  {"version":1,"execution": {"R_lite":[{"rank":"0","children": {"core": "0"}}]}}
+
+From these job records, we can gather the following information:
+
+* total nodes used (``nnodes``): 8
+* total time elapsed (``t_elapsed``): 10000.0
+
+So, the usage of the association from this current half life is:
+
+.. code-block:: console
+
+  sum(nnodes * t_elapsed) = (2 * 2000) + (2 * 2000) + (2 * 2000) + (1 * 2000) + (1 * 2000)
+                          = 4000 + 4000 + 4000 + 2000 + 2000
+                          = 16000
+
+This current job usage is then added to the association's previous job usage
+stored in the flux-accounting database. This sum then represents the
+association's overall job usage.
+
+****************************
+Multi-Factor Priority Plugin
+****************************
+
+The multi-factor priority plugin is a jobtap_ plugin that generates
+an integer job priority for incoming jobs in a Flux system instance. It uses
+a number of factors to calculate a priority and, in the future, can add more
+factors. Each factor has an associated integer weight that determines its
+importance in the overall priority calculation. The current factors present in
+the multi-factor priority plugin are:
+
+* **fair-share**: the ratio between the amount of resources allocated vs. resources
+  consumed. See the :ref:`Glossary definition <glossary-section>` for a more
+  detailed explanation of how fair-share is utilized within flux-accounting.
+
+* **urgency**: a user-controlled factor to prioritize their own jobs.
+
+In addition to generating an integer priority for submitted jobs in a Flux
+system instance, the multi-factor priority plugin also enforces per-association
+job limits to regulate use of the system. The two per-association limits
+enforced by this plugin are:
+
+* **max_active_jobs**: a limit on how many *active* jobs an association can have at
+  any given time. Jobs submitted after this limit has been hit will be rejected
+  with a message saying that the association has hit their active jobs limit.
+
+* **max_running_jobs**: a limit on how many *running* jobs an association can have
+  at any given time. Jobs submitted after this limit has been hit will be held
+  by adding a ``max-running-jobs-user-limit`` dependency until one of the
+  association's currently running jobs finishes running.
+
+Both "types" of jobs, *running* and *active*, are based on Flux's definitions
+of job states_. *Active* jobs can be in any state but INACTIVE. *Running* jobs
+are jobs in either RUN or CLEANUP states.
+
+.. _glossary-section:
+
+********
+Glossary
+********
+
+association
+  A 2-tuple combination of a username and bank name.
+
+bank
+  An account that contains associations.
+
+fair-share
+  A metric used to ensure equitable resource allocation among associations
+  within a shared system. It represents the ratio between the amount of
+  resources an association is allocated versus the amount actually consumed.
+  The fair-share value influences an association's priority when submitting
+  jobs to the system, adjusting dynamically to reflect current usage compared
+  to allocated quotas. High consumption relative to allocation can decrease an
+  association's fair-share value, reducing their priority for future resource
+  allocation, thereby promoting balanced usage across all associations to
+  maintain system fairness and efficiency.
+
+.. note::
+
+ The design of flux-accounting was driven by LLNL site requirements. Years ago,
+ the design of `Slurm accounting`_ and its `multi-factor priority
+ plugin`_ were driven by similar LLNL site requirements. We chose to
+ reuse terminology and concepts from Slurm to facilitate a smooth transition to
+ Flux. The flux-accounting code base is all completely new, however.
+
+.. _documentation: https://sqlite.org/omitted.html
+
+.. _Slurm accounting: https://slurm.schedmd.com/accounting.html
+
+.. _multi-factor priority plugin: https://slurm.schedmd.com/priority_multifactor.html
+
+.. _jobtap: https://flux-framework.readthedocs.io/projects/flux-core/en/latest/man7/flux-jobtap-plugins.html#flux-jobtap-plugins-7
+
+.. _states: https://flux-framework.readthedocs.io/projects/flux-rfc/en/latest/spec_21.html