flux-accounting: add a local job-archive #357

cmoussa1 · 2023-05-31T16:17:07Z

Background

As discussed in flux-framework/flux-core#3136 and in #353, it may be advantageous for flux accounting to do its own job record archival. With archival moved to flux accounting, the job-archive module in flux-core could be retired and the flux accounting project can evolve the schema as needed to fit its requirements.

This PR looks to begin to add a job archive local to flux-accounting. It creates two new tables in the flux-accounting DB:

jobs: a table that stores job records used for calculating job usage (and subsequently fair share)
last_seen_job_table: a table that stores a timestamp of the last inactive job seen

It also adds a new Python script that is designed to be run as a cron job (and in conjunction with the other Python scripts and commands in this repository that run periodically). This new Python script, flux-account-fetch-job-records.py, will fetch inactive jobs using Flux's job-info and job-list interfaces. It will create job-record objects and insert them into the newly created jobs table. The benefit here is when it comes time for flux-accounting to re-calculate job-usage and fair share values for user/banks during a periodic update, it can just look locally in its own DB instead of having to connect to flux-core's job-archive.

Some minor cleanup of the update-usage and view-job-records commands (as well as some of the unit tests for the portion of flux-accounting that previously dealt with the flux-core job-archive) follow the first couple commits that add the new tables and Python script. No major changes were required to the tests that previously dealt with the flux-core job-archive (specifically t1011-job-archive-interface.t). The big change there was to add a call to the new Python script and point the view-job-records and update-usage commands to the flux-accounting DB.

Note that this new Python script does not yet have a "purge" function to remove old jobs that are no longer considered by job usage or fair share calculations. To keep this PR as narrow in scope as possible, I plan to open a follow-up PR that adds purge capabilities.
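As a rough illustration of what such a purge step might look like, here is a minimal sketch using only Python's sqlite3 module. The function name `purge_old_jobs`, the `max_age_seconds` parameter, and the two-column table are assumptions made for this example; they are not part of this PR or of the follow-up work.

```python
import sqlite3
import time


def purge_old_jobs(conn, max_age_seconds, now=None):
    """Delete job records that went inactive before the cutoff.

    Hypothetical helper: returns the number of rows removed.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_seconds
    cur = conn.execute("DELETE FROM jobs WHERE t_inactive < ?", (cutoff,))
    conn.commit()
    return cur.rowcount


# Simplified stand-in for the jobs table (only the columns needed here).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, t_inactive REAL NOT NULL)")
conn.executemany("INSERT INTO jobs VALUES (?, ?)", [(1, 100.0), (2, 900.0)])

# With now=1000 and a 500 second window, only jobs inactive before t=500 go.
removed = purge_old_jobs(conn, max_age_seconds=500, now=1000.0)
remaining = [row[0] for row in conn.execute("SELECT id FROM jobs ORDER BY id")]
print(removed, remaining)
```

The cutoff would presumably be tied to however far back the usage calculation looks, which this sketch leaves to the caller.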

You'll also notice that right now there is a commit in this PR that removes one of the sharness tests (t1010-update-usage.t) from t/Makefile.am, essentially ignoring it when running make check. The reason for this is that t1010-job-usage.t creates a fake job-archive DB to test the update-usage command for flux-accounting. This test was created before additional tests were added to flux-accounting that actually loaded flux-core's job-archive. But now that flux-accounting has implemented its own job-archive, I don't think this sharness test file is necessarily needed, especially because the next sharness test file, t1011-job-archive-interface.t, actually tests fetching job records and updating usage using (now) flux-accounting's job-archive. So I'm not totally sure what the best approach here is. Should I remove the test file completely from the repository and re-number the following sharness tests? Is leaving it out of t/Makefile.am fine? Let me know.

cmoussa1 · 2023-06-01T15:25:42Z

OK, I think I'll mark this as ready for some initial review whenever convenient. 😄

garlick · 2023-06-01T15:59:44Z

Awesome @cmoussa1 !

Should I remove the test file completely from the repository and re-number the following sharness tests? Is leaving it out of t/Makefile.am fine? Let me know.

IMHO if the test is no longer needed, just remove it. No need to renumber the tests. Just leave a hole :-)

I'll plan to make a review pass later today and test this on my small cluster FWIW. I think it would be good for @chu11 and @ryanday36 to also have a look, keeping in mind this would be intended to replace job-archive.

Maybe it would make sense to create a PR for the flux accounting guide that remains pending until this is included in a new tag? The stuff in there about configuring job-archive could be dropped and a suggested crontab line to run the new script periodically could be added to the example there (which I would cut and paste for my testing).

cmoussa1 · 2023-06-01T16:39:37Z

Thanks @garlick!

IMHO if the test is no longer needed, just remove it. No need to renumber the tests. Just leave a hole :-)

That sounds good. I'll edit the commit to just remove it. Thanks!

I'll plan to make a review pass later today and test this on my small cluster FWIW. I think it would be good for @chu11 and @ryanday36 to also have a look, keeping in mind this would be intended to replace job-archive.

Yes, this sounds great. Note that I leave for vacation in a couple hours so I'll probably be delayed in a response to your review comments. 😅

Maybe it would make sense to create a PR for the flux accounting guide that remains pending until this is included in a new tag? The stuff in there about configuring job-archive could be dropped and a suggested crontab line to run the new script periodically could be added to the example there (which I would cut and paste for my testing).

I agree! I'll create an issue for it so I don't forget and try to open one soon. 👍

cmoussa1 · 2023-06-26T16:22:01Z

Note: I've updated this PR to use the new flux-core job-info convenience function that was added in flux-framework/flux-core#5265

Problem: The schema for the flux-accounting DB has been changed as a result of flux-framework#357. Update the schema version for the flux-accounting DB in __init__.py.in and in the front-end flux-account-service.py script.

cmoussa1 · 2023-07-18T20:13:00Z

just rebased to catch up after #367

garlick · 2023-08-09T23:27:01Z

Poking at this a bit (finally!)

To get it going I rebased on current master in a private repo and built a deb, then installed that on my test cluster rank 0. Then I ran

$ sudo -u flux flux account-update-db
checking for new tables...
new table found: jobs
new table found: last_seen_job_table
checking for new columns to add in tables...

So far so good. For fun I poked at the db and dumped the definition of the jobs table:

$ sudo -u flux sqlite3 FluxAccounting.db
SQLite version 3.34.1 2021-01-20 14:10:07
Enter ".help" for usage hints.
sqlite> .schema jobs
CREATE TABLE IF NOT EXISTS "jobs" (id integer  NOT NULL, userid integer  NOT NULL, t_submit real  NOT NULL, t_run real  NOT NULL, t_inactive real  NOT NULL, ranks tinytext  NOT NULL, R tinytext  NOT NULL, jobspec tinytext  NOT NULL, PRIMARY KEY (id));

And noted a few differences from the job-archive jobs table (no problem - just noting!):

$ sudo -u flux sqlite3 job-archive.sqlite
SQLite version 3.34.1 2021-01-20 14:10:07
Enter ".help" for usage hints.
sqlite> .schema jobs
CREATE TABLE jobs(  id CHAR(16) PRIMARY KEY,  userid INT,  ranks TEXT,  t_submit REAL,  t_run REAL,  t_cleanup REAL,  t_inactive REAL,  eventlog TEXT,  jobspec TEXT,  R TEXT);

Then I tried manually populating the new table with the fetch-job-records script but that didn't go so well:

$ sudo -u flux flux account-fetch-job-records
Traceback (most recent call last):
  File "/usr/lib/aarch64-linux-gnu/flux/cmd/flux-account-fetch-job-records.py", line 171, in <module>
    main()
  File "/usr/lib/aarch64-linux-gnu/flux/cmd/flux-account-fetch-job-records.py", line 156, in main
    timestamp = get_last_job_ts(conn)
  File "/usr/lib/aarch64-linux-gnu/flux/cmd/flux-account-fetch-job-records.py", line 65, in get_last_job_ts
    return float(row[0])
TypeError: 'NoneType' object is not subscriptable
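For readers unfamiliar with this failure mode, the sketch below shows one plausible way the TypeError above can arise: `fetchone()` returns `None` when a query yields no rows, so indexing the result fails. The column name `last_seen_timestamp` and the guarded variant are assumptions for illustration only; the actual table layout and script code are not shown in this thread.

```python
import sqlite3

# Assumption: a one-column layout; the real last_seen_job_table is not shown here.
QUERY = "SELECT last_seen_timestamp FROM last_seen_job_table"


def get_last_job_ts_unguarded(conn):
    row = conn.execute(QUERY).fetchone()
    return float(row[0])  # row is None when the table has no rows


def get_last_job_ts_guarded(conn):
    row = conn.execute(QUERY).fetchone()
    if row is None or row[0] is None:
        return 0.0  # nothing seen yet: treat every inactive job as new
    return float(row[0])


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE last_seen_job_table (last_seen_timestamp REAL)")

# The table exists but is empty, as it would be right after creation.
try:
    get_last_job_ts_unguarded(conn)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(error_message)
print(get_last_job_ts_guarded(conn))
```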

Probably unrelated but I also noticed that for me I got a couple of errors when running make check - were these supposed to have been removed?

ERROR: test/test_job_archive_interface.py - missing test plan
ERROR: test/test_job_archive_interface.py - exited with status 1

I thought maybe I'd pause here and ask you to rebase this PR on current master and make sure everything still works for you.

garlick · 2023-08-10T13:26:34Z

Two notes on the schema:

I don't think TINYTEXT is a sqlite type - it is accepted but converted to TEXT
A sqlite INTEGER is a 64 bit signed value, but job IDs (in integer form) use the full 64 bit unsigned range. Maybe that is why @chu11 used CHAR(16) as the type for job IDs? Although maybe INTEGER would work and be more efficient as long as the sign conversion is carefully handled (a test would be good).

Also CHAR(16) is just another alias for TEXT in sqlite (and subscripts are ignored)
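The type-affinity behavior described in these notes can be checked with a few lines of Python. This is a small sketch against a throwaway in-memory table named `demo` (a name chosen for this example), not against either of the real databases.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Same declared types as in the two schemas quoted earlier in the thread.
conn.execute("CREATE TABLE demo (a TINYTEXT, b CHAR(16))")

# Store an integer in the TINYTEXT column and a 40-character string in CHAR(16).
conn.execute("INSERT INTO demo VALUES (?, ?)", (123, "x" * 40))

stored_type, stored_len = conn.execute(
    "SELECT typeof(a), length(b) FROM demo"
).fetchone()

# PRAGMA table_info reports the declared type exactly as it was written.
declared = [row[2] for row in conn.execute("PRAGMA table_info(demo)")]

print(stored_type)  # the integer was stored as text (TEXT affinity)
print(stored_len)   # the (16) length limit was not enforced
print(declared)
```

Both columns end up with TEXT affinity because their declared types contain "TEXT" and "CHAR" respectively, while the declared type string itself is kept as written.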

https://www.sqlite.org/draft/datatype3.html
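If INTEGER were used for job IDs, the sign conversion mentioned above could be handled along these lines. This is a sketch under the assumption that job IDs are arbitrary unsigned 64-bit values; the helper names `jobid_to_db` and `jobid_from_db` are hypothetical and not part of flux-accounting.

```python
import sqlite3

TWO_64 = 2**64
TWO_63 = 2**63


def jobid_to_db(jobid):
    """Map an unsigned 64-bit job ID onto SQLite's signed 64-bit INTEGER range."""
    if not 0 <= jobid < TWO_64:
        raise ValueError("job ID must fit in an unsigned 64-bit integer")
    return jobid - TWO_64 if jobid >= TWO_63 else jobid


def jobid_from_db(value):
    """Inverse of jobid_to_db: recover the unsigned job ID."""
    return value + TWO_64 if value < 0 else value


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY)")

big_id = TWO_64 - 1  # largest unsigned value, too large to bind directly
conn.execute("INSERT INTO jobs VALUES (?)", (jobid_to_db(big_id),))

(raw,) = conn.execute("SELECT id FROM jobs").fetchone()
print(raw, jobid_from_db(raw))
```

One consequence of this mapping is that IDs at or above 2**63 are stored as negative numbers, so `ORDER BY id` in SQL would no longer match unsigned ordering.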

Edit: I guess one other question is any reason not to use the same technique as @chu11 for getting the last job ID based on the inactive time stamp? E.g.

SELECT MAX(t_inactive) FROM jobs

It might be a bit simpler than managing the last_seen_job_table (I'm assuming forgetting to add it to the schema update script is what caused the error above)

cmoussa1 · 2023-08-15T19:34:25Z

Thanks for beginning to take a look at this @garlick! I spent some time this morning looking at your feedback; sorry that it didn't work out so gracefully on your test cluster this first time around!

And noted a few differences from the job-archive jobs table (no problem - just noting!):

Thanks for providing this comparison. Yeah, I noticed that the types for some of the data that flux-accounting is interested in were different between flux-accounting's job-archive and flux-core's, so I went ahead and changed them to match the data types of flux-core's job-archive. They should match now.

Probably unrelated but I also noticed that for me I got a couple of errors when running make check - were these supposed to have been removed?
ERROR: test/test_job_archive_interface.py - missing test plan
ERROR: test/test_job_archive_interface.py - exited with status 1

That is odd - that set of tests is supposed to be there and run. I have never run make deb before (I am unsure how), however, so I wonder if there is something weird going on with this unit test during that build process? FWIW, this unit test runs in CI.

Edit: I guess one other question is any reason not to use the same technique as @chu11 for getting the last job ID based on the inactive time stamp? E.g.
SELECT MAX(t_inactive) FROM jobs
It might be a bit simpler than managing the last_seen_job_table (I'm assuming forgetting to add it to the schema update script is what caused the error above)

Ah, that would be because I forgot you could query for MIN and MAX values, :-) Thanks for pointing this out. I went ahead and force-pushed a change to this script to run SELECT MAX(t_inactive) FROM jobs when looking for the most recent inactive job in the jobs table. If the jobs table is empty, the variable that holds the timestamp of the most recent inactive job should just be 0.0 (indicating it should look for all inactive jobs).
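The behavior described here can be sketched as follows. Note that `SELECT MAX(...)` over an empty table still returns one row whose single value is NULL, so the fallback has to test the value rather than the row. The function body is an illustration written for this thread, with a reduced two-column table, and is not the code that was force-pushed.

```python
import sqlite3


def get_last_job_ts(conn):
    """Return the newest t_inactive in the jobs table, or 0.0 if it is empty."""
    (ts,) = conn.execute("SELECT MAX(t_inactive) FROM jobs").fetchone()
    return float(ts) if ts is not None else 0.0


conn = sqlite3.connect(":memory:")
# Reduced stand-in for the jobs table.
conn.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, t_inactive REAL NOT NULL)")

empty_ts = get_last_job_ts(conn)  # no rows yet

conn.executemany("INSERT INTO jobs VALUES (?, ?)", [(1, 10.5), (2, 42.0), (3, 7.25)])
latest_ts = get_last_job_ts(conn)

print(empty_ts, latest_ts)
```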

I've also rebased on current master, so it should be caught up after #369 now.

garlick · 2023-08-15T20:11:52Z

Thanks @Cmoussa - I've updated to your branch, and I'm still getting those two test failures. The logs say

Traceback (most recent call last):
  File "/nfshome/garlick/proj/flux-accounting/src/bindings/python/fluxacct/accounting/./test/test_job_archive_interface.py", line 20, in <module>
    from fluxacct.accounting import job_archive_interface as jobs
  File "/nfshome/garlick/proj/flux-accounting/src/bindings/python/fluxacct/accounting/job_archive_interface.py", line 18, in <module>
    from flux.resource import ResourceSet
ModuleNotFoundError: No module named 'flux'
ERROR: test/test_job_archive_interface.py - missing test plan
ERROR: test/test_job_archive_interface.py - exited with status 1

I'll see if I can find out why that's failing on my system.

Note: the make deb just builds a deb package for the project (like an RPM). It shouldn't affect the tests in any way.

garlick · 2023-08-15T20:17:39Z

Those tests are failing for me on master also so I guess it's not related to this PR.

cmoussa1 · 2023-08-15T20:24:09Z

Thanks for letting me know @garlick!

Traceback (most recent call last):
  File "/nfshome/garlick/proj/flux-accounting/src/bindings/python/fluxacct/accounting/./test/test_job_archive_interface.py", line 20, in <module>
    from fluxacct.accounting import job_archive_interface as jobs
  File "/nfshome/garlick/proj/flux-accounting/src/bindings/python/fluxacct/accounting/job_archive_interface.py", line 18, in <module>
    from flux.resource import ResourceSet
ModuleNotFoundError: No module named 'flux'
ERROR: test/test_job_archive_interface.py - missing test plan
ERROR: test/test_job_archive_interface.py - exited with status 1

From the error message, it looks like flux-accounting is having trouble importing a Python module from flux-core: flux.resource is the Python module that flux-accounting uses to fetch nnodes from a job's R.

chu11 · 2024-04-05T17:34:51Z

slowly starting to review this, and just wondering what is the plan to transition from the old job-archive? perhaps there needs to be a script to do a "one time copy" of what you need from job-archive? (apologies if this is discussed above and I missed it).

Edit: n/m it appears my answer is in commit 2 of the PR series :-)
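For context on what a "one time copy" could involve, here is a generic sketch that copies the columns the two schemas quoted earlier have in common. It is not the approach taken in this PR's commits; the function name `copy_archive_jobs` is hypothetical, and both demo tables use integer IDs for simplicity even though the job-archive schema above declares `id` as CHAR(16).

```python
import sqlite3

# Columns present in both of the jobs schemas shown earlier in the thread.
SHARED_COLUMNS = (
    "id", "userid", "t_submit", "t_run", "t_inactive", "ranks", "R", "jobspec",
)


def count_jobs(conn):
    return conn.execute("SELECT COUNT(*) FROM jobs").fetchone()[0]


def copy_archive_jobs(archive_conn, acct_conn):
    """Copy shared columns from the old archive; return how many rows were added."""
    cols = ", ".join(SHARED_COLUMNS)
    marks = ", ".join("?" for _ in SHARED_COLUMNS)
    rows = archive_conn.execute(f"SELECT {cols} FROM jobs").fetchall()
    before = count_jobs(acct_conn)
    # INSERT OR IGNORE makes the copy safe to repeat: existing IDs are skipped.
    acct_conn.executemany(
        f"INSERT OR IGNORE INTO jobs ({cols}) VALUES ({marks})", rows
    )
    acct_conn.commit()
    return count_jobs(acct_conn) - before


def make_demo_db():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE jobs (id INTEGER PRIMARY KEY, userid INT, t_submit REAL, "
        "t_run REAL, t_inactive REAL, ranks TEXT, R TEXT, jobspec TEXT)"
    )
    return conn


archive, acct = make_demo_db(), make_demo_db()
archive.executemany(
    "INSERT INTO jobs VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
    [(1, 5001, 1.0, 2.0, 3.0, "0", "{}", "{}"),
     (2, 5002, 4.0, 5.0, 6.0, "1", "{}", "{}")],
)

first_pass = copy_archive_jobs(archive, acct)
second_pass = copy_archive_jobs(archive, acct)
print(first_pass, second_pass)
```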

cmoussa1 · 2024-04-05T17:51:04Z

was just typing up a response! Sorry that I didn't make that clearer @chu11 - I should update the PR description

chu11 · 2024-04-05T17:41:18Z

src/cmd/flux-account-fetch-job-records.py

+# try to open database file; will exit with -1 if database file not found
+def est_sqlite_conn(path):
+    if not os.path.isfile(path):
+        print(f"Database file does not exist: {path}", file=sys.stderr)
+        sys.exit(1)


i think you mean "exit 1"

thanks - just pushed up a change to fix the comment

chu11 · 2024-04-05T17:45:31Z

src/cmd/flux-account-fetch-job-records.py

+# insert newly seen jobs into the "jobs" table in the flux-accounting DB
+def insert_jobs_in_db(conn, job_records):
+    cur = conn.cursor()
+
+    for single_job in job_records:
+        cur.execute(
+            """
+            INSERT OR IGNORE INTO jobs
+            VALUES (?, ?, ?, ?, ?, ?, ?, ?)
+            """,
+            (
+                single_job["id"],
+                single_job["userid"],
+                single_job["t_submit"],
+                single_job["t_run"],
+                single_job["t_inactive"],
+                single_job["ranks"],
+                single_job["R"],
+                single_job["jobspec"],
+            ),
+        )
+
+    conn.commit()


if a job didn't run (i.e. cancelled) do you care? perhaps if t_run not set or == 0, no need for record?

haven't gotten far enough in your PR series to see tests, but this would be a good test to add too

That's a good suggestion, thanks - you're right, I should probably check to see if a job actually ran. I'll go ahead and add a check as well as a test to submit a job that never runs and ensures that the job record doesn't show up under the user and that their job usage value doesn't change.

just force-pushed up a couple changes per your suggestion - added a check to just skip adding the job record if it never ran and added a couple tests in t1011-job-archive-interface.t for the case of a canceled job
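A check of the kind discussed here might look like the sketch below. It assumes, following the review comment, that a job which never ran has `t_run` unset or equal to 0; the function name `insert_ran_jobs` and the sample records are made up for this example and are not the code that was pushed.

```python
import sqlite3


def insert_ran_jobs(conn, job_records):
    """Insert job records into the jobs table, skipping jobs that never ran."""
    cur = conn.cursor()
    for job in job_records:
        # Assumption: a missing or zero t_run means the job never started.
        if not job.get("t_run"):
            continue
        cur.execute(
            "INSERT OR IGNORE INTO jobs VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
            (
                job["id"], job["userid"], job["t_submit"], job["t_run"],
                job["t_inactive"], job["ranks"], job["R"], job["jobspec"],
            ),
        )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE jobs (id INTEGER PRIMARY KEY, userid INT, t_submit REAL, "
    "t_run REAL, t_inactive REAL, ranks TEXT, R TEXT, jobspec TEXT)"
)

records = [
    {"id": 1, "userid": 5001, "t_submit": 1.0, "t_run": 2.0,
     "t_inactive": 3.0, "ranks": "0", "R": "{}", "jobspec": "{}"},
    # Canceled before it started: t_run is 0, so it should be skipped.
    {"id": 2, "userid": 5001, "t_submit": 4.0, "t_run": 0,
     "t_inactive": 5.0, "ranks": "", "R": "", "jobspec": "{}"},
]
insert_ran_jobs(conn, records)

stored_ids = [row[0] for row in conn.execute("SELECT id FROM jobs ORDER BY id")]
print(stored_ids)
```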

chu11

per discussion at meeting yesterday, this is good to go! We'll have to test on fluke before it rolls out to cluster to make sure transition instructions are good.

cmoussa1 · 2024-05-09T17:58:09Z

Thanks @chu11! I have an open PR #446 that looks to move the accounting guide to this repository; I should add some additional instructions to that guide regarding the change to fetching job records. I'll update that PR.

@ryanday36 - I'll try to keep you posted when a new version of flux-accounting is created so that we can coordinate in case there are hiccups.

I'll set MWP here

Add a new table to the flux-accounting DB: jobs, which will store jobs fetched by flux-accounting periodically to be used for job-usage and fair-share calculation as well as for viewing by users/admins.

Add a new Python script to the flux-accounting command suite that fetches jobs from Flux using the job-list and job-info interfaces and inserts them into a table in the flux-accounting DB.

Problem: The update-usage command takes in a positional argument for the path to flux-core's job-archive DB. Now that flux-accounting has implemented its own job-archive within its own database, it no longer needs to look for jobs in the job-archive DB. Remove the positional argument that specifies a path to flux-core's job-archive DB in the update-usage command.

Problem: The view-job-records and update-usage commands both take arguments that specify where flux-core's job-archive DB is located in order to look for job-records. Now that flux-accounting has implemented its own job-archive, these commands can just look in the flux-accounting DB for these records. Remove the job-archive DB path argument from both of these commands in flux-account-service.py.

Problem: The calc_usage_factor() and update_job_usage() functions both take an argument for a SQLite connection to flux-core's job-archive DB. Now that flux-accounting has implemented its own job-archive locally in its own DB, these two functions no longer need to connect to the job-archive DB. Remove the SQLite connection parameter from both the calc_usage_factor() and update_job_usage() functions that connect to flux-core's job-archive DB, and instead just use one SQLite connection to the flux-accounting DB.

Add "jobs" as an expected table to the list of expected tables in the unit tests for create_db.py.

Problem: The unit tests for test_job_archive_interface.py create and use a test job-archive DB. Now that flux-accounting has implemented its own job-archive, the test should just use one connection to the flux-accounting DB's "jobs" table. Remove the creation of the test job-archive DB. Remove the SQLite connection to the test job-archive DB. Restructure the INSERT SQLite statement that inserts fake job records into the jobs table to match the schema of the new flux-accounting "jobs" table.

Problem: t1010-job-usage.t creates a fake job-archive DB to test the update-usage command for flux-accounting. Now that flux-accounting has implemented its own job-archive, this sharness test file is not necessarily needed, especially because the next sharness test file, t1011-job-archive-interface.t, actually tests fetching job records and updating usage using flux-accounting's job-archive. Remove t1010-job-usage.t from the test suite.

Remove the argument to flux-core's job-archive DB when testing the update-usage command in t1026-flux-account-perms.t.

Problem: t1011-job-archive-interface.t doesn't make use of the new flux-accounting Python script that fetches new job records and adds them to the jobs table in the flux-accounting DB. Add tests that call this Python script.

Problem: t1011-job-archive-interface does not have any tests that ensure a user's job-usage value does not get affected by a job that never ran. Add a basic set of tests in t1011-job-archive-interface.t that submits a job which gets canceled and ensures that the record does not get added to flux-accounting's jobs table and thus does not affect a user's job-usage value.

Problem: There is a test that looks for an expected list of tables in a flux-accounting DB, but there is no "jobs" table. Add the new table as an expected table in t1017-update-db.t.

codecov · 2024-05-09T18:06:48Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 83.47%. Comparing base (2f811cf) to head (f095423).
Report is 1 commits behind head on master.

❗ Current head f095423 differs from pull request most recent head 035a0b5. Consider uploading reports for the commit 035a0b5 to get more accurate results

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #357      +/-   ##
==========================================
+ Coverage   83.30%   83.47%   +0.17%     
==========================================
  Files          23       23              
  Lines        1545     1525      -20     
==========================================
- Hits         1287     1273      -14     
+ Misses        258      252       -6

see 3 files with indirect coverage changes

Problem: The schema version of the flux-accounting database was never updated with the addition of the jobs table in flux-framework#357. Update the schema version number of the flux-accounting database.

cmoussa1 added the new feature label May 31, 2023

cmoussa1 changed the title ~~[WIP] flux-accounting: add a local job-archive~~ flux-accounting: add a local job-archive Jun 1, 2023

cmoussa1 marked this pull request as ready for review June 1, 2023 15:21

cmoussa1 mentioned this pull request Jun 1, 2023

flux-accounting guide: change instructions for job-archive flux-framework/flux-docs#243

Closed

cmoussa1 force-pushed the add.local.job-archive branch from bf3d221 to ba6c8bc Compare June 1, 2023 17:06

cmoussa1 mentioned this pull request Jun 5, 2023

[WIP] flux-accounting guide: add instructions for setting up local job-archive flux-framework/flux-docs#244

Closed

cmoussa1 mentioned this pull request Jun 16, 2023

python: support convenience API for job-info.lookup RPC / "flux job info" flux-framework/flux-core#5265

Merged

chu11 mentioned this pull request Jun 20, 2023

job-list: default 'since' to something != 0 flux-framework/flux-core#5284

Open

cmoussa1 force-pushed the add.local.job-archive branch 3 times, most recently from 4e29b43 to 0a321cd Compare June 23, 2023 19:02

cmoussa1 force-pushed the add.local.job-archive branch 2 times, most recently from 45e5605 to 164b5e3 Compare July 18, 2023 20:08

cmoussa1 requested a review from garlick July 18, 2023 20:13

cmoussa1 force-pushed the add.local.job-archive branch from 164b5e3 to c37294a Compare July 24, 2023 19:17

cmoussa1 force-pushed the add.local.job-archive branch 2 times, most recently from 063cea5 to f131b4c Compare August 15, 2023 19:17

garlick mentioned this pull request Aug 15, 2023

ERROR: test/test_job_archive_interface.py - missing test plan #370

Open

chu11 reviewed Apr 5, 2024

cmoussa1 force-pushed the add.local.job-archive branch 2 times, most recently from 426bff7 to 45fd833 Compare April 8, 2024 19:01

chu11 approved these changes May 9, 2024

cmoussa1 added merge-when-passing and removed merge-when-passing labels May 9, 2024

cmoussa1 added 12 commits May 9, 2024 11:04

create_db: add "jobs" table

2ab4a29

Add a new table to the flux-accounting DB: jobs, which will store jobs fetched by flux-accounting periodically to be used for job-usage and fair-share calculation as well as for viewing by users/admins.

flux-account: add new fetch-job-records.py script

f073dd2

Add a new Python script to the flux-accounting command suite that fetches jobs from Flux using the job-list and job-info interfaces and inserts them into a table in the flux-accounting DB.

test: add new table to expected table list

c128a41

Add "jobs" as an expected table to the list of expected tables in the unit tests for create_db.py.

t: remove job-archive path in update-usage test

7f6945c

Remove the argument to flux-core's job-archive DB when testing the update-usage command in t1026-flux-account-perms.t.

t: use fetch-job-records script in test

96ac0a5

Problem: t1011-job-archive-interface.t doesn't make use of the new flux-accounting Python script that fetches new job records and adds them to the jobs table in the flux-accounting DB. Add tests that call this Python script.

t: add new tables to expected table list

035a0b5

Problem: There is a test that looks for an expected list of tables in a flux-accounting DB, but there is no "jobs" table. Add the new table as an expected table in t1017-update-db.t.

cmoussa1 force-pushed the add.local.job-archive branch from 45fd833 to 035a0b5 Compare May 9, 2024 18:04

cmoussa1 added the merge-when-passing label May 9, 2024

mergify bot merged commit f44be60 into flux-framework:master May 9, 2024
11 checks passed

This was referenced May 9, 2024

job usage: archive long term job records in the accounting db #353

Closed

repo: create a doc folder, add flux-accounting guide #446

Merged

This was referenced May 15, 2024

database: update schema version #453

Merged

job-archive-interface: clean up which data is fetched for job-usage calculation #354

Closed
