-
Notifications
You must be signed in to change notification settings - Fork 10
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
flux-accounting: add a local job-archive #357
flux-accounting: add a local job-archive #357
Conversation
OK, I think I'll mark this is as ready for some initial review whenever convenient. 😄 |
Awesome @cmoussa1 !
IMHO if the test is no longer needed, just remove it. No need to renumber the tests. Just leave a hole :-) I'll plan to make a review pass later today and test this on my small cluster FWIW. I think it would be good for @chu11 and @ryanday36 to also have a look, keeping in mind this would be intended to replace job-archive. Maybe it would make sense to create a PR for the flux accounting guide that remains pending until this is included in a new tag? The stuff in there about configuring job-archive could be dropped and a suggested crontab line to run the new script periodically could be added to the example there (which I would cut and paste for my testing). |
Thanks @garlick!
That sounds good. I'll edit the commit to just remove it. Thanks!
Yes, this sounds great. Note that I leave for vacation in a couple hours so I'll probably be delayed in a response to your review comments. 😅
I agree! I'll create an issue for it so I don't forget and try to open one soon. 👍 |
bf3d221
to
ba6c8bc
Compare
4e29b43
to
0a321cd
Compare
Note: I've updated this PR to use the new flux-core job-info convenience function that was added in flux-framework/flux-core#5265 |
Problem: The schema for the flux-accounting DB has been changed as a result of flux-framework#357. Update the schema version for the flux-accounting DB in __init__.py.in and in the front-end flux-account-service.py script.
45e5605
to
164b5e3
Compare
just rebased to catch up after #367 |
164b5e3
to
c37294a
Compare
Poking at this a bit (finally!) To get it going I rebased on current master in a private repo and build a deb, then installed that on my test cluster rank 0. Then I ran
So far so good. For fun I poked at the db and dumped the definition of the
And noted a few differences from the job-archive
Then I tried manually populating the new table with the
Probably unrelated but I also noticed that for me I got a couple of errors when running
I thought maybe I'd pause here and ask you to rebase this PR on current master and make sure everything still works for you. |
Two notes on the schema:
Also CHAR(16) is just another alias for TEXT in sqlite (and subscripts are ignored) https://www.sqlite.org/draft/datatype3.html Edit: I guess one other question is any reason not to use the same technique as @chu11 for getting the last job ID based on the inactive time stamp? E.g. SELECT MAX(t_inactive) FROM jobs It might be a bit simpler than managing the |
063cea5
to
f131b4c
Compare
Thanks for beginning to take a look at this @garlick! I spent some time this morning looking at your feedback; sorry that it didn't work out so gracefully on your test cluster this first time around!
Thanks for providing this comparison. Yeah, I noticed that the types for some of the data that flux-accounting is interested in was different in flux-accounting's job-archive and flux-core's, so I went ahead and changed them to match the data types of flux-core's job-archive. They should match now.
That is odd - that set of tests is supposed to be there and run. I have never run
Ah, that would be because I forgot you could query for MIN and MAX values, :-) Thanks for pointing this out. I went ahead and force-pushed a change to this script to run I've also rebased on current master, so it should be caught up after #369 now. |
Thanks @Cmoussa - i've updated to your branch, and I'm still getting those two test failures. The logs say
I'll see if I can find out why that's failing on my system. Note: the |
Those tests are failing for me on master also so I guess it's not related to this PR. |
Thanks for letting me know @garlick!
It looks like from the description of this error that flux-accounting is having trouble |
slowly starting to review this, and just wondering what is the plan to transition from the old Edit: n/m it appears my answer is in commit 2 of the PR series :-) |
was just typing up a response! Sorry that I didn't make that clearer @chu11 - I should update the PR description |
# try to open database file; will exit with -1 if database file not found | ||
def est_sqlite_conn(path): | ||
if not os.path.isfile(path): | ||
print(f"Database file does not exist: {path}", file=sys.stderr) | ||
sys.exit(1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think you mean "exit 1"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks - just pushed up a change to fix the comment
# insert newly seen jobs into the "jobs" table in the flux-accounting DB | ||
def insert_jobs_in_db(conn, job_records): | ||
cur = conn.cursor() | ||
|
||
for single_job in job_records: | ||
cur.execute( | ||
""" | ||
INSERT OR IGNORE INTO jobs | ||
VALUES (?, ?, ?, ?, ?, ?, ?, ?) | ||
""", | ||
( | ||
single_job["id"], | ||
single_job["userid"], | ||
single_job["t_submit"], | ||
single_job["t_run"], | ||
single_job["t_inactive"], | ||
single_job["ranks"], | ||
single_job["R"], | ||
single_job["jobspec"], | ||
), | ||
) | ||
|
||
conn.commit() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if a job didn't run (i.e. cancelled) do you care? perhaps if t_run
not set or == 0, no need for record?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
haven't gotten far enough if your PR series to see tests, but this would be a good test to add too
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good suggestion, thanks - you're right, I should probably check to see if a job actually ran. I'll go ahead and add a check as well as a test to submit a job that never runs and ensures that the job record doesn't show up under the user and that their job usage value doesn't change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just force-pushed up a couple changes per your suggestion - added a check to just skip adding the job record if it never ran and added a couple tests in t1011-job-archive-interface.t
for the case of a canceled job
426bff7
to
45fd833
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
per discussion at meeting yesterday, this is good to go! We'll have to test on fluke before it rolls out to cluster to make sure transition instructions are good.
Thanks @chu11! I have an open PR #446 that looks to move the accounting guide to this repository; I should add some additional instructions to that guide regarding the change to fetching job records. I'll update that PR. @ryanday36 - I'll try to keep you posted when a new version of flux-accounting is created so that we can coordinate in case there are hiccups. I'll set MWP here |
Add a new table to the flux-accounting DB: jobs, which will store jobs fetched by flux-accounting periodically to be used for job-usage and fair-share calculation as well as for viewing by users/admins.
Add a new Python script to the flux-accounting command suite that fetches jobs from Flux using the job-list and job-info interfaces and inserts them into a table in the flux-accounting DB.
Problem: The update-usage command takes in a positional argument for the path to flux-core's job-archive DB. Now that flux-accounting has implemented its own job-archive within its own database, it no longer needs to look for jobs in the job-archive DB. Remove the positional argument that specifies a path to flux-core's job-archive DB in the update-usage command.
Problem: The view-job-records and update-usage commands both take arguments that specify where flux-core's job-archive DB is located in order to look for job-records. Now that flux-accounting has implemented its own job-archive, these commands can just look in the flux-accounting DB for these records. Remove the job-archive DB path argument from both of these commands in flux-account-service.py.
Problem: The calc_usage_factor() and update_job_usage() functions both take an argument for a SQLite connection to flux-core's job-archive DB. Now that flux-accounting has implemented its own job-archive locally in its own DB, these two functions no longer need to connect to the job-archive DB. Remove the SQLite connection parameter from both the calc_usage_factor() and update_job_usage() functions that connect to flux-core's job-archive DB, and instead just use one SQLite connection to the flux-accounting DB.
Add "jobs" as an expected table to the list of expected tables in the unit tests for create_db.py.
Problem: The unit tests for test_job_archive_interface.py create and use a test job-archive DB. Now that flux-accounting has implemented its own job-archive, the test should just use one connection to the flux-accounting DB's "jobs" table. Remove the creation of the test job-archive DB. Remove the SQLite connection to the test job-archive DB. Restructure the INSERT SQLite statement that inserts fake job records into the jobs table to match the schema of the new flux-accounting "jobs" table.
Problem: t1010-job-usage.t creates a fake job-archive DB to test the update-usage command for flux-accounting. Now that flux-accounting has implemented its own job-archive, this sharness test file is not necessarily needed, especially because the next sharness test file, t1011-job-archive.db, actually tests fetching job records and updating usage using flux-accounting's job-archive. Remove t1010-job-usage.t from the test suite.
Remove the argument to flux-core's job-archive DB when testing the update-usage command in t1026-flux-account-perms.t.
Problem: t1011-job-archive-interface.py doesn't make use of the new flux-accounting Python script that fetches new job records and adds it to the jobs table in the flux-accounting DB. Add tests that call this Python script.
Problem: t1011-job-archive-interface does not have any tests that ensure a user's job-usage value does not get affected by a job that never ran. Add a basic set of tests in t1011-job-archive-interface.t that submits a job which gets canceled and ensures that the record does not get added to flux-accounting's jobs table and thus does not affect a user's job-usage value.
Problem: There is a test that looks for an expected list of tables in a flux-accounting DB, but there is no "jobs" table. Add the new table as an expected table in t1017-update-db.t.
45fd833
to
035a0b5
Compare
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## master #357 +/- ##
==========================================
+ Coverage 83.30% 83.47% +0.17%
==========================================
Files 23 23
Lines 1545 1525 -20
==========================================
- Hits 1287 1273 -14
+ Misses 258 252 -6 |
Problem: The schema version of the flux-accounting database was never updated with the addition of the jobs table in flux-framework#357. Update the schema version number of the flux-accounting database.
Background
As discussed in flux-framework/flux-core#3136 and in #353, it may be advantageous for flux accounting to do its own job record archival. With archival moved to flux accounting, the job-archive module in flux-core could be retired and the flux accounting project can evolve the schema as needed to fit its requirements.
This PR looks to begin to add a job archive local to flux-accounting. It creates two new tables in the flux-accounting DB:
jobs
: a table that stores job records used for calculating job usage (and subsequently fair share)last_seen_job_table
: a table that stores a timestamp of the last inactive job seenIt also adds a new Python script that is designed to be run as a cron job (and in conjunction with the other Python scripts and commands in this repository that run periodically). This new Python script,
flux-account-fetch-job-records.py
, will fetch inactive jobs using Flux'sjob-info
andjob-list
interfaces. It will create job-record objects and insert them into the newly createdjobs
table. The benefit here is when it comes time for flux-accounting to re-calculate job-usage and fair share values for user/banks during a periodic update, it can just look locally in its own DB instead of having to connect to flux-core's job-archive.Some minor cleanup of the
update-usage
andview-job-records
commands (as well as some of the unit tests for the portion of flux-accounting that previously dealt with the flux-core job-archive) follow the first couple commits that add the new tables and Python script. No major changes were required to the tests that previously dealt with the flux-core job-archive (specificallyt1011-job-archive-interface.t
). The big change there was to add a call to the new Python script and point theview-job-records
andupdate-usage
commands to the flux-accounting DB.Note that this new Python script does not yet have a "purge" function just yet to remove old jobs that are no longer considered by job usage or fair share calculations. For the sake of keeping this PR as narrow in scope as possible, I figured I would open a PR that follows this one which looks to add purge capabilities.
You'll also notice that right now there is a commit in this PR that removes one of the sharness tests (
t1010-update-usage.t
) fromt/Makefile.am
, essentially ignoring it when runningmake check
. The reason for this ist1010-job-usage.t
creates a fake job-archive DB to test theupdate-usage
command for flux-accounting. This test was created before additional tests were added to flux-accounting that actually loaded flux-core's job-archive. But now that flux-accounting has implemented its own job-archive, I don't think this sharness test file is necessarily needed, especially because the next sharness test file,t1011-job-archive-interface.db
, actually tests fetching job records and updating usage using (now) flux-accounting's job-archive. So I'm not totally sure what the best approach here is. Should I remove the test file completely from the repository and re-number the following sharness tests? Is leaving it out oft/Makefile.am
fine? Let me know.