flux-accounting: add a local job-archive #357

Merged
merged 12 commits into flux-framework:master
May 9, 2024

Conversation

cmoussa1
Member

@cmoussa1 cmoussa1 commented May 31, 2023

Background

As discussed in flux-framework/flux-core#3136 and in #353, it may be advantageous for flux accounting to do its own job record archival. With archival moved to flux accounting, the job-archive module in flux-core could be retired and the flux accounting project can evolve the schema as needed to fit its requirements.


This PR looks to begin to add a job archive local to flux-accounting. It creates two new tables in the flux-accounting DB:

  • jobs: a table that stores job records used for calculating job usage (and subsequently fair share)
  • last_seen_job_table: a table that stores a timestamp of the last inactive job seen

It also adds a new Python script that is designed to be run as a cron job (and in conjunction with the other Python scripts and commands in this repository that run periodically). This new Python script, flux-account-fetch-job-records.py, will fetch inactive jobs using Flux's job-info and job-list interfaces. It will create job-record objects and insert them into the newly created jobs table. The benefit here is when it comes time for flux-accounting to re-calculate job-usage and fair share values for user/banks during a periodic update, it can just look locally in its own DB instead of having to connect to flux-core's job-archive.

Some minor cleanup of the update-usage and view-job-records commands (as well as some of the unit tests for the portion of flux-accounting that previously dealt with the flux-core job-archive) follow the first couple commits that add the new tables and Python script. No major changes were required to the tests that previously dealt with the flux-core job-archive (specifically t1011-job-archive-interface.t). The big change there was to add a call to the new Python script and point the view-job-records and update-usage commands to the flux-accounting DB.

Note that this new Python script does not yet have a "purge" function to remove old jobs that are no longer considered by job usage or fair share calculations. To keep this PR as narrow in scope as possible, I figured I would open a follow-up PR that adds purge capabilities.


You'll also notice that right now there is a commit in this PR that removes one of the sharness tests (t1010-update-usage.t) from t/Makefile.am, essentially ignoring it when running make check. The reason for this is t1010-job-usage.t creates a fake job-archive DB to test the update-usage command for flux-accounting. This test was created before additional tests were added to flux-accounting that actually loaded flux-core's job-archive. But now that flux-accounting has implemented its own job-archive, I don't think this sharness test file is necessarily needed, especially because the next sharness test file, t1011-job-archive-interface.t, actually tests fetching job records and updating usage using (now) flux-accounting's job-archive. So I'm not totally sure what the best approach here is. Should I remove the test file completely from the repository and re-number the following sharness tests? Is leaving it out of t/Makefile.am fine? Let me know.

@cmoussa1 cmoussa1 added the "new feature" label May 31, 2023
@cmoussa1 cmoussa1 changed the title from "[WIP] flux-accounting: add a local job-archive" to "flux-accounting: add a local job-archive" Jun 1, 2023
@cmoussa1 cmoussa1 marked this pull request as ready for review June 1, 2023 15:21
@cmoussa1
Member Author

cmoussa1 commented Jun 1, 2023

OK, I think I'll mark this as ready for some initial review whenever convenient. 😄

@garlick
Member

garlick commented Jun 1, 2023

Awesome @cmoussa1 !

Should I remove the test file completely from the repository and re-number the following sharness tests? Is leaving it out of t/Makefile.am fine? Let me know.

IMHO if the test is no longer needed, just remove it. No need to renumber the tests. Just leave a hole :-)

I'll plan to make a review pass later today and test this on my small cluster FWIW. I think it would be good for @chu11 and @ryanday36 to also have a look, keeping in mind this would be intended to replace job-archive.

Maybe it would make sense to create a PR for the flux accounting guide that remains pending until this is included in a new tag? The stuff in there about configuring job-archive could be dropped and a suggested crontab line to run the new script periodically could be added to the example there (which I would cut and paste for my testing).

@cmoussa1
Member Author

cmoussa1 commented Jun 1, 2023

Thanks @garlick!

IMHO if the test is no longer needed, just remove it. No need to renumber the tests. Just leave a hole :-)

That sounds good. I'll edit the commit to just remove it. Thanks!

I'll plan to make a review pass later today and test this on my small cluster FWIW. I think it would be good for @chu11 and @ryanday36 to also have a look, keeping in mind this would be intended to replace job-archive.

Yes, this sounds great. Note that I leave for vacation in a couple hours so I'll probably be delayed in a response to your review comments. 😅

Maybe it would make sense to create a PR for the flux accounting guide that remains pending until this is included in a new tag? The stuff in there about configuring job-archive could be dropped and a suggested crontab line to run the new script periodically could be added to the example there (which I would cut and paste for my testing).

I agree! I'll create an issue for it so I don't forget and try to open one soon. 👍

@cmoussa1
Member Author

Note: I've updated this PR to use the new flux-core job-info convenience function that was added in flux-framework/flux-core#5265

cmoussa1 added a commit to cmoussa1/flux-accounting that referenced this pull request Jun 30, 2023
Problem: The schema for the flux-accounting DB has been changed as a
result of flux-framework#357.

Update the schema version for the flux-accounting DB in __init__.py.in
and in the front-end flux-account-service.py script.
@cmoussa1 cmoussa1 force-pushed the add.local.job-archive branch 2 times, most recently from 45e5605 to 164b5e3 Compare July 18, 2023 20:08
@cmoussa1
Member Author

just rebased to catch up after #367

@cmoussa1 cmoussa1 requested a review from garlick July 18, 2023 20:13
@cmoussa1 cmoussa1 force-pushed the add.local.job-archive branch from 164b5e3 to c37294a Compare July 24, 2023 19:17
@garlick
Member

garlick commented Aug 9, 2023

Poking at this a bit (finally!)

To get it going I rebased on current master in a private repo and built a deb, then installed that on my test cluster rank 0. Then I ran

$ sudo -u flux flux account-update-db
checking for new tables...
new table found: jobs
new table found: last_seen_job_table
checking for new columns to add in tables...

So far so good. For fun I poked at the db and dumped the definition of the jobs table:

$ sudo -u flux sqlite3 FluxAccounting.db
SQLite version 3.34.1 2021-01-20 14:10:07
Enter ".help" for usage hints.
sqlite> .schema jobs
CREATE TABLE IF NOT EXISTS "jobs" (id integer  NOT NULL, userid integer  NOT NULL, t_submit real  NOT NULL, t_run real  NOT NULL, t_inactive real  NOT NULL, ranks tinytext  NOT NULL, R tinytext  NOT NULL, jobspec tinytext  NOT NULL, PRIMARY KEY (id));

And noted a few differences from the job-archive jobs table (no problem - just noting!):

$ sudo -u flux sqlite3 job-archive.sqlite
SQLite version 3.34.1 2021-01-20 14:10:07
Enter ".help" for usage hints.
sqlite> .schema jobs
CREATE TABLE jobs(  id CHAR(16) PRIMARY KEY,  userid INT,  ranks TEXT,  t_submit REAL,  t_run REAL,  t_cleanup REAL,  t_inactive REAL,  eventlog TEXT,  jobspec TEXT,  R TEXT);
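The column differences between the two definitions can be listed mechanically. The sketch below is illustrative only: it recreates both jobs tables from the .schema output above in separate in-memory databases and compares column names with PRAGMA table_info. The helper name column_names is made up for this example.

```python
import sqlite3

# Definitions copied from the two .schema dumps above.
ACCOUNTING_JOBS = (
    'CREATE TABLE IF NOT EXISTS "jobs" (id integer NOT NULL, '
    "userid integer NOT NULL, t_submit real NOT NULL, t_run real NOT NULL, "
    "t_inactive real NOT NULL, ranks tinytext NOT NULL, R tinytext NOT NULL, "
    "jobspec tinytext NOT NULL, PRIMARY KEY (id))"
)
ARCHIVE_JOBS = (
    "CREATE TABLE jobs(id CHAR(16) PRIMARY KEY, userid INT, ranks TEXT, "
    "t_submit REAL, t_run REAL, t_cleanup REAL, t_inactive REAL, "
    "eventlog TEXT, jobspec TEXT, R TEXT)"
)


def column_names(create_stmt):
    """Return the set of column names of the jobs table built by create_stmt."""
    conn = sqlite3.connect(":memory:")
    conn.execute(create_stmt)
    # Each table_info row is (cid, name, type, notnull, dflt_value, pk).
    names = {row[1] for row in conn.execute("PRAGMA table_info(jobs)")}
    conn.close()
    return names


only_in_archive = column_names(ARCHIVE_JOBS) - column_names(ACCOUNTING_JOBS)
only_in_accounting = column_names(ACCOUNTING_JOBS) - column_names(ARCHIVE_JOBS)
print(sorted(only_in_archive), sorted(only_in_accounting))
```

With these two definitions, the job-archive table carries t_cleanup and eventlog, which the flux-accounting table omits; every flux-accounting column also exists in job-archive.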

Then I tried manually populating the new table with the fetch-job-records script but that didn't go so well:

$ sudo -u flux flux account-fetch-job-records
Traceback (most recent call last):
  File "/usr/lib/aarch64-linux-gnu/flux/cmd/flux-account-fetch-job-records.py", line 171, in <module>
    main()
  File "/usr/lib/aarch64-linux-gnu/flux/cmd/flux-account-fetch-job-records.py", line 156, in main
    timestamp = get_last_job_ts(conn)
  File "/usr/lib/aarch64-linux-gnu/flux/cmd/flux-account-fetch-job-records.py", line 65, in get_last_job_ts
    return float(row[0])
TypeError: 'NoneType' object is not subscriptable
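One plausible reading of this traceback, assuming row comes from a cursor's fetchone() call: the query matched no rows, so fetchone() returned None and row[0] failed. The sketch below reproduces that failure mode with a hypothetical single-column last_seen_job_table; the real table layout and query are not shown in this thread.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical layout, for illustration only.
conn.execute("CREATE TABLE last_seen_job_table (timestamp REAL)")

# No rows were ever inserted, so fetchone() has nothing to return.
row = conn.execute("SELECT timestamp FROM last_seen_job_table").fetchone()

try:
    float(row[0])
    error = None
except TypeError as exc:
    error = str(exc)

print(row, error)
```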

Probably unrelated but I also noticed that for me I got a couple of errors when running make check - were these supposed to have been removed?

ERROR: test/test_job_archive_interface.py - missing test plan
ERROR: test/test_job_archive_interface.py - exited with status 1

I thought maybe I'd pause here and ask you to rebase this PR on current master and make sure everything still works for you.

@garlick
Member

garlick commented Aug 10, 2023

Two notes on the schema:

  • I don't think TINYTEXT is a sqlite type - it is accepted but converted to TEXT
  • A sqlite INTEGER is a 64 bit signed value, but job IDs (in integer form) use the full 64 bit unsigned range. Maybe that is why @chu11 used CHAR(16) as the type for job IDs? Although maybe INTEGER would work and be more efficient as long as the sign conversion is carefully handled (a test would be good).

Also CHAR(16) is just another alias for TEXT in sqlite (and subscripts are ignored)

https://www.sqlite.org/draft/datatype3.html
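If INTEGER were used for job IDs, the sign conversion mentioned above could look roughly like this. It is a sketch under assumptions, not what the PR does: the helper names to_signed64 and to_unsigned64 are invented here, and the two-column table is a cut-down stand-in for the real schema.

```python
import sqlite3

U64_SPAN = 1 << 64          # number of distinct unsigned 64-bit values
I64_MAX = (1 << 63) - 1     # largest value a SQLite INTEGER can hold


def to_signed64(jobid):
    """Map an unsigned 64-bit job ID onto the signed 64-bit range."""
    if not 0 <= jobid < U64_SPAN:
        raise ValueError("job ID outside the unsigned 64-bit range")
    return jobid - U64_SPAN if jobid > I64_MAX else jobid


def to_unsigned64(stored):
    """Undo to_signed64() on a value read back from the database."""
    return stored + U64_SPAN if stored < 0 else stored


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, userid INT)")

big_id = (1 << 64) - 1  # largest possible unsigned 64-bit ID
conn.execute("INSERT INTO jobs VALUES (?, ?)", (to_signed64(big_id), 5001))

stored = conn.execute("SELECT id FROM jobs").fetchone()[0]
round_trip = to_unsigned64(stored)
print(stored, round_trip == big_id)
```

One trade-off of this mapping: IDs above the signed maximum are stored as negative numbers, so ORDER BY id no longer follows the unsigned ordering. Python's sqlite3 module also refuses to bind integers outside the signed 64-bit range, so some conversion (or a text column) is needed either way.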

Edit: I guess one other question is any reason not to use the same technique as @chu11 for getting the last job ID based on the inactive time stamp? E.g.

SELECT MAX(t_inactive) FROM jobs

It might be a bit simpler than managing the last_seen_job_table (I'm assuming forgetting to add it to the schema update script is what caused the error above)

@cmoussa1 cmoussa1 force-pushed the add.local.job-archive branch 2 times, most recently from 063cea5 to f131b4c Compare August 15, 2023 19:17
@cmoussa1
Member Author

Thanks for beginning to take a look at this @garlick! I spent some time this morning looking at your feedback; sorry that it didn't work out so gracefully on your test cluster this first time around!

And noted a few differences from the job-archive jobs table (no problem - just noting!):

Thanks for providing this comparison. Yeah, I noticed that the types for some of the data that flux-accounting is interested in were different between flux-accounting's job-archive and flux-core's, so I went ahead and changed them to match the data types of flux-core's job-archive. They should match now.

Probably unrelated but I also noticed that for me I got a couple of errors when running make check - were these supposed to have been removed?

ERROR: test/test_job_archive_interface.py - missing test plan
ERROR: test/test_job_archive_interface.py - exited with status 1

That is odd - that set of tests is supposed to be there and run. I have never run make deb before (I am unsure how), however, so I wonder if there is something weird going on with this unit test during that build process? FWIW, this unit test runs in CI.

Edit: I guess one other question is any reason not to use the same technique as @chu11 for getting the last job ID based on the inactive time stamp? E.g.

SELECT MAX(t_inactive) FROM jobs

It might be a bit simpler than managing the last_seen_job_table (I'm assuming forgetting to add it to the schema update script is what caused the error above)

Ah, that would be because I forgot you could query for MIN and MAX values, :-) Thanks for pointing this out. I went ahead and force-pushed a change to this script to run SELECT MAX(t_inactive) FROM jobs when looking for the most recent inactive job in the jobs table. If the jobs table is empty, the variable that holds the timestamp of the most recent inactive job should just be 0.0 (indicating it should look for all inactive jobs).
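As a rough sketch of that behavior (not the exact code in the script): an aggregate such as MAX() over an empty table still returns one row, but that row contains NULL, so the value has to be checked before converting it. The function name get_last_job_ts matches the traceback earlier in this thread; its body here is an assumption, and the table is reduced to two columns.

```python
import sqlite3


def get_last_job_ts(conn):
    """Return the newest t_inactive in the jobs table, or 0.0 if it is empty."""
    row = conn.execute("SELECT MAX(t_inactive) FROM jobs").fetchone()
    # MAX() on an empty table yields a single row holding NULL (None in Python).
    if row is None or row[0] is None:
        return 0.0
    return float(row[0])


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (id CHAR(16) PRIMARY KEY, t_inactive REAL)")

empty_ts = get_last_job_ts(conn)

conn.executemany(
    "INSERT INTO jobs VALUES (?, ?)",
    [("job-a", 100.5), ("job-b", 250.0)],
)
latest_ts = get_last_job_ts(conn)
print(empty_ts, latest_ts)
```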

I've also rebased on current master, so it should be caught up after #369 now.

@garlick
Member

garlick commented Aug 15, 2023

Thanks @cmoussa1 - I've updated to your branch, and I'm still getting those two test failures. The logs say

Traceback (most recent call last):
  File "/nfshome/garlick/proj/flux-accounting/src/bindings/python/fluxacct/accounting/./test/test_job_archive_interface.py", line 20, in <module>
    from fluxacct.accounting import job_archive_interface as jobs
  File "/nfshome/garlick/proj/flux-accounting/src/bindings/python/fluxacct/accounting/job_archive_interface.py", line 18, in <module>
    from flux.resource import ResourceSet
ModuleNotFoundError: No module named 'flux'
ERROR: test/test_job_archive_interface.py - missing test plan
ERROR: test/test_job_archive_interface.py - exited with status 1

I'll see if I can find out why that's failing on my system.

Note: the make deb just builds a deb package for the project (like an RPM). It shouldn't affect the tests in any way.

@garlick
Member

garlick commented Aug 15, 2023

Those tests are failing for me on master also so I guess it's not related to this PR.

@cmoussa1
Member Author

Thanks for letting me know @garlick!

Traceback (most recent call last):
  File "/nfshome/garlick/proj/flux-accounting/src/bindings/python/fluxacct/accounting/./test/test_job_archive_interface.py", line 20, in <module>
    from fluxacct.accounting import job_archive_interface as jobs
  File "/nfshome/garlick/proj/flux-accounting/src/bindings/python/fluxacct/accounting/job_archive_interface.py", line 18, in <module>
    from flux.resource import ResourceSet
ModuleNotFoundError: No module named 'flux'
ERROR: test/test_job_archive_interface.py - missing test plan
ERROR: test/test_job_archive_interface.py - exited with status 1

It looks like from the description of this error that flux-accounting is having trouble importing a Python module from flux-core: flux.resource is a Python module that allows flux-accounting to fetch nnodes from a job's R.
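A quick way to check this on a given machine is to ask the interpreter that runs the tests whether it can locate the flux package at all. This is a minimal sketch using only the standard library; module_available is a name invented for this example.

```python
import importlib.util
import sys


def module_available(name):
    """Return True if a top-level module or package can be found on sys.path."""
    return importlib.util.find_spec(name) is not None


found_sqlite3 = module_available("sqlite3")
found_bogus = module_available("surely_not_a_real_module_xyz")

# Result depends on the local installation.
print("flux importable:", module_available("flux"))
print("interpreter:", sys.executable)
```

If this reports False for flux, the flux-core Python bindings are probably not on the module search path of the interpreter used by make check.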

@chu11
Member

chu11 commented Apr 5, 2024

slowly starting to review this, and just wondering what is the plan to transition from the old job-archive? perhaps there needs to be a script to do a "one time copy" of what you need from job-archive? (apologies if this is discussed above and I missed it).

Edit: n/m it appears my answer is in commit 2 of the PR series :-)

@cmoussa1
Member Author

cmoussa1 commented Apr 5, 2024

was just typing up a response! Sorry that I didn't make that clearer @chu11 - I should update the PR description

Comment on lines 29 to 33
# try to open database file; will exit with -1 if database file not found
def est_sqlite_conn(path):
    if not os.path.isfile(path):
        print(f"Database file does not exist: {path}", file=sys.stderr)
        sys.exit(1)
Member

i think you mean "exit 1"

Member Author

thanks - just pushed up a change to fix the comment

Comment on lines +94 to +121
# insert newly seen jobs into the "jobs" table in the flux-accounting DB
def insert_jobs_in_db(conn, job_records):
    cur = conn.cursor()

    for single_job in job_records:
        cur.execute(
            """
            INSERT OR IGNORE INTO jobs
            VALUES (?, ?, ?, ?, ?, ?, ?, ?)
            """,
            (
                single_job["id"],
                single_job["userid"],
                single_job["t_submit"],
                single_job["t_run"],
                single_job["t_inactive"],
                single_job["ranks"],
                single_job["R"],
                single_job["jobspec"],
            ),
        )

    conn.commit()
Member

if a job didn't run (i.e. cancelled) do you care? perhaps if t_run not set or == 0, no need for record?

Member

haven't gotten far enough in your PR series to see tests, but this would be a good test to add too

Member Author

That's a good suggestion, thanks - you're right, I should probably check to see if a job actually ran. I'll go ahead and add a check as well as a test to submit a job that never runs and ensures that the job record doesn't show up under the user and that their job usage value doesn't change.

Member Author

just force-pushed up a couple changes per your suggestion - added a check to just skip adding the job record if it never ran and added a couple tests in t1011-job-archive-interface.t for the case of a canceled job
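The check described here might look something like the following. The actual change is not shown in this thread, so treat this as a sketch based on the suggestion above: a record whose t_run is missing or 0 counts as a job that never ran. The helper job_ran and the sample records are invented for illustration.

```python
import sqlite3


def job_ran(record):
    """Treat a missing, None, or zero t_run as a job that never ran."""
    return bool(record.get("t_run"))


def insert_jobs_in_db(conn, job_records):
    """Insert only the records of jobs that actually ran."""
    cur = conn.cursor()
    for job in filter(job_ran, job_records):
        cur.execute(
            "INSERT OR IGNORE INTO jobs VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
            (
                job["id"], job["userid"], job["t_submit"], job["t_run"],
                job["t_inactive"], job["ranks"], job["R"], job["jobspec"],
            ),
        )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE jobs (id CHAR(16) PRIMARY KEY, userid INT, t_submit REAL, "
    "t_run REAL, t_inactive REAL, ranks TEXT, R TEXT, jobspec TEXT)"
)

records = [
    {"id": "1", "userid": 5001, "t_submit": 10.0, "t_run": 12.0,
     "t_inactive": 20.0, "ranks": "0", "R": "{}", "jobspec": "{}"},
    # Canceled before it started, so it has no run timestamp.
    {"id": "2", "userid": 5001, "t_submit": 11.0, "t_run": 0.0,
     "t_inactive": 11.5, "ranks": "", "R": "", "jobspec": "{}"},
]
insert_jobs_in_db(conn, records)

stored_ids = [row[0] for row in conn.execute("SELECT id FROM jobs")]
print(stored_ids)
```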

@cmoussa1 cmoussa1 force-pushed the add.local.job-archive branch 2 times, most recently from 426bff7 to 45fd833 Compare April 8, 2024 19:01
Member

@chu11 chu11 left a comment

per discussion at meeting yesterday, this is good to go! We'll have to test on fluke before it rolls out to cluster to make sure transition instructions are good.

@cmoussa1
Member Author

cmoussa1 commented May 9, 2024

Thanks @chu11! I have an open PR #446 that looks to move the accounting guide to this repository; I should add some additional instructions to that guide regarding the change to fetching job records. I'll update that PR.

@ryanday36 - I'll try to keep you posted when a new version of flux-accounting is created so that we can coordinate in case there are hiccups.

I'll set MWP here

cmoussa1 added 12 commits May 9, 2024 11:04
Add a new table to the flux-accounting DB: jobs, which will store jobs
fetched by flux-accounting periodically to be used for job-usage and
fair-share calculation as well as for viewing by users/admins.
Add a new Python script to the flux-accounting command suite that
fetches jobs from Flux using the job-list and job-info interfaces and
inserts them into a table in the flux-accounting DB.
Problem: The update-usage command takes in a positional argument for
the path to flux-core's job-archive DB. Now that flux-accounting has
implemented its own job-archive within its own database, it no longer
needs to look for jobs in the job-archive DB.

Remove the positional argument that specifies a path to flux-core's
job-archive DB in the update-usage command.
Problem: The view-job-records and update-usage commands both take
arguments that specify where flux-core's job-archive DB is located in
order to look for job-records. Now that flux-accounting has implemented
its own job-archive, these commands can just look in the
flux-accounting DB for these records.

Remove the job-archive DB path argument from both of these commands in
flux-account-service.py.
Problem: The calc_usage_factor() and update_job_usage() functions both
take an argument for a SQLite connection to flux-core's job-archive DB.
Now that flux-accounting has implemented its own job-archive locally
in its own DB, these two functions no longer need to connect to the
job-archive DB.

Remove the SQLite connection parameter from both the
calc_usage_factor() and update_job_usage() functions that connect to
flux-core's job-archive DB, and instead just use one SQLite connection
to the flux-accounting DB.
Add "jobs" as an expected table to the list of expected tables in the
unit tests for create_db.py.
Problem: The unit tests for test_job_archive_interface.py create and
use a test job-archive DB. Now that flux-accounting has implemented its
own job-archive, the test should just use one connection to the
flux-accounting DB's "jobs" table.

Remove the creation of the test job-archive DB.

Remove the SQLite connection to the test job-archive DB.

Restructure the INSERT SQLite statement that inserts fake job records
into the jobs table to match the schema of the new flux-accounting
"jobs" table.
Problem: t1010-job-usage.t creates a fake job-archive DB to test the
update-usage command for flux-accounting. Now that flux-accounting has
implemented its own job-archive, this sharness test file is not
necessarily needed, especially because the next sharness test file,
t1011-job-archive-interface.t, actually tests fetching job records and updating
usage using flux-accounting's job-archive.

Remove t1010-job-usage.t from the test suite.
Remove the argument to flux-core's job-archive DB when testing the
update-usage command in t1026-flux-account-perms.t.
Problem: t1011-job-archive-interface.t doesn't make use of the new
flux-accounting Python script that fetches new job records and adds them
to the jobs table in the flux-accounting DB.

Add tests that call this Python script.
Problem: t1011-job-archive-interface does not have any tests that ensure
a user's job-usage value does not get affected by a job that never ran.

Add a basic set of tests in t1011-job-archive-interface.t that submits a
job which gets canceled and ensures that the record does not get added
to flux-accounting's jobs table and thus does not affect a user's
job-usage value.
Problem: There is a test that looks for an expected list of tables in a
flux-accounting DB, but there is no "jobs" table.

Add the new table as an expected table in t1017-update-db.t.

codecov bot commented May 9, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 83.47%. Comparing base (2f811cf) to head (f095423).
Report is 1 commits behind head on master.

❗ Current head f095423 differs from pull request most recent head 035a0b5. Consider uploading reports for the commit 035a0b5 to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #357      +/-   ##
==========================================
+ Coverage   83.30%   83.47%   +0.17%     
==========================================
  Files          23       23              
  Lines        1545     1525      -20     
==========================================
- Hits         1287     1273      -14     
+ Misses        258      252       -6     

see 3 files with indirect coverage changes

@mergify mergify bot merged commit f44be60 into flux-framework:master May 9, 2024
11 checks passed
cmoussa1 added a commit to cmoussa1/flux-accounting that referenced this pull request May 15, 2024
Problem: The schema version of the flux-accounting database was never
updated with the addition of the jobs table in flux-framework#357.

Update the schema version number of the flux-accounting database.