Data Release Naming Conventions and Handling of Processing Data Transfers #413
Comments
The general principle sounds good to me. The proposed name Now that I think about it, putting |
I think we should stick with what we had before for simplicity. dr2 is more general than wfd, so I think that's why it comes first, though I think the order really doesn't matter much. So for historic reasons, I would choose what we had before. |
My take on this: I am not sure I follow the current train of thought on the processing naming convention and how it translates into snapshots, released areas, and their naming convention
Maybe I misunderstand some of what is written above, in which case sorry for the noise. Maybe there is a sense that we need to keep a one-to-one relationship between processing area and released area. I think that very often this will be naturally guaranteed, as the processing area will be different for any new processing effort. But in the case of 4852, such a one-to-one requirement seems overkill to me. Caveat: I am not sure how gen3 processing is going to modify my seasoned view of how all this works out. |
I basically agree with Johann but would like some clarification of details.
|
Most important is indeed that there is no reason to think of it as something understandable by people outside of processing, especially the public of the releases. As for conventions relevant for processing, with gen2 there really is nothing built in, so I just built rerun names out of the blue; it made sense to me, not necessarily to others. With gen3 I would not be surprised if this needs revisiting and more forethought.
I am not sure I understand your point. I hope that when we speak of releases it is clear that we are not speaking about internal processing directories. But we can make sure that the names are different, in any case.
Indeed, tricked by a double negation on the first day back to work. Bummer :) |
Going in order of comments, starting with Yao & Katrin. Ok - we'll go with this: snapshots created for releases are copies of a subset of the processing area, and here we need to be very careful to define that subset and when it is appropriate to bump to a new version. For the specific case of DR6 tract 4852 patch 1,5, that should result in a separate released version (v2) of the object catalogs. Concerning the butler rerun area, it would be incorrect to simply update the 4852 1,5 files (even if this was initially a processing failure) without marking this as an updated versioned release. Due to disk space concerns, maybe we store v1 to tape (or even just store the original version of the updated files to tape, so v1 could be recreated, if that is ever needed). |
Ok, so we disagree on several points here. For 4852, imho there is no difference between reprocessing it and rolling back a stream due to a computing failure. And I do not think that you advocate bumping the version for each and every random rollback that occurs during processing... At least for the gen2 system, it would have been hell. I do not want to argue forever though, so whatever is ok with the majority is ok with me. |
My thoughts on versioning are strictly in regards to releases. If we release something and name it |
I think maybe we need a clearer distinction between pre-releases and releases, given that some of the validation tests require that the data propagate all the way down to GCRCatalogs. If we distinguish them, then we can say it's ok to update the content in place for pre-releases, but a snapshot copy must be made for releases. |
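The pre-release versus release distinction could be sketched as follows. This is only an illustration under stated assumptions: the function name `publish_snapshot`, its arguments, and the idea of copying the whole processing directory (rather than a carefully defined subset) are hypothetical and not part of any existing tooling in this project.

```python
import shutil
from pathlib import Path


def publish_snapshot(processing_dir, target_dir, is_release):
    """Copy a processing area into a pre-release or release area.

    Hypothetical helper: pre-releases may be overwritten in place,
    releases are never overwritten (a new version name is required).
    """
    src, dst = Path(processing_dir), Path(target_dir)
    if dst.exists():
        if is_release:
            # Releases are immutable: refuse and ask for a version bump.
            raise FileExistsError(
                f"{dst} is a release and must not be overwritten; "
                "bump the version instead"
            )
        # Pre-release: drop the old copy so stale files do not linger.
        shutil.rmtree(dst)
    shutil.copytree(src, dst)
    return dst
```

Calling it twice with `is_release=False` refreshes the pre-release copy, while a second call with `is_release=True` on the same target raises instead of modifying released data.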
For the object catalogs, we have pre-release areas, but have not done that for the butler/rerun areas. We could, by creating a pre-release snapshot for validation and then renaming it when a release is ready. I think that's fine. I don't think we can necessarily use the processing area for our pre-release validation. An immediate thing I would like to reach a consensus on: renaming the DR6 v1 butler area at NERSC which is
|
For the DR6 object catalogs, previously we have been treating pre-releases like releases, i.e., a new copy is made when a new pre-release is made. I think that's a bit overkill and results in way too many deprecated catalogs in GCRCatalogs in the end. I am ok with keeping the pre-releases in the release area. The main point is just that if the processing area is updated and we want to propagate the update to the pre-release, we can just overwrite the existing pre-release instead of making a new one. For releases we will never overwrite, of course. I don't really have opinions regarding renaming dr6. What you proposed sounds good to me. But at the very beginning you said that we are using |
If I look at |
To answer Yao's question, yes, I was making a new proposal... in the interest of reusing the existing naming convention at CC. Concerning Johann's comment:
|
Chatted briefly with Johann offline, and we have an updated proposal. The "releases", which include butler accessible data, would reside under
Asking the Data Access team (@JoanneBogart & @yymao) if they feel it is ok to include butler accessible data in the |
I think that's fine, and a good proposal in fact. The Did you have specific concerns regarding including butler/Postgres accessible data in One note is that all the symlinks in |
Great - that all makes sense. Not today, but in the next couple of days, I will start setting this up at NERSC. |
Putting butler-accessible data under our shared directory sounds good to me. |
I would imagine it is possible to dump the db and set it up elsewhere. Definitely something we should try to see how it goes. I could imagine other sites may want to mirror and we probably do want to support that. |
We need to write up concrete steps to handle the naming and versioning of our data releases, which also includes some recommendations for dealing with the data transfers between CC and NERSC. Looking for comments.
An example of the issue using Run2.2i DR6
The CC processing area is named
run2.2i-coadd-wfd-dr6-v1
and at NERSC we reused this name for our DR6 v1 release.
Processing continues at CC and updated data must be copied to NERSC, but we do not want to update the dataset that has already been released.
Recommendations
run2.2i-coadd-wfd-dr6-v1 area we already have. The CC area with that name will continue to be used for processing. At NERSC, there is a new run2.2i-coadd-dr6-processing directory and corresponding run2.2i-coadd-dr6-processing-grizy and run2.2i-coadd-dr6-processing-u directories. Future data transfers from CC to NERSC should be directed into these areas.
run2.2i-coadd-wfd-dr2-v1 at both CC and NERSC will continue to be the area that may be updated due to reprocessing. Snapshots will be taken at NERSC for releases and named according to our naming conventions. See below.
Review of Release Naming Conventions
As discussed on Slack, our release naming conventions should communicate the run (2.2i or 3.1i), the depth (DR1, DR2, etc.), and the cadence (WFD or DDF).
Examples
For the upcoming Run2.2i DR2 v1 release, a snapshot will be taken at NERSC, resulting in butler data directories:
run2.2i-wfd-dr2-v1
run2.2i-wfd-dr2-v1-grizy
run2.2i-wfd-dr2-v1-u
Additional DR6 releases will use a similar naming convention; however, as long as this processing includes both WFD and DDF visits, no cadence will be indicated in the name:
run2.2i-dr6-v2
run2.2i-dr6-v2-grizy
run2.2i-dr6-v2-u
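The naming pattern in the examples above can be captured in a small helper. This is a minimal sketch, assuming the pattern is exactly run, optional cadence, depth, version, plus the -grizy and -u companions; the function name `release_dir_names` and its parameters are hypothetical and do not refer to any existing script.

```python
from typing import List, Optional


def release_dir_names(run: str, depth: int, version: int,
                      cadence: Optional[str] = None) -> List[str]:
    """Build the base, grizy and u directory names for a release snapshot.

    Hypothetical helper. `cadence` is "wfd" or "ddf"; leave it as None
    when the processing includes both WFD and DDF visits.
    """
    if cadence is not None and cadence.lower() not in ("wfd", "ddf"):
        raise ValueError(f"unknown cadence: {cadence!r}")
    parts = [f"run{run}"]
    if cadence is not None:
        parts.append(cadence.lower())
    parts += [f"dr{depth}", f"v{version}"]
    base = "-".join(parts)
    return [base, f"{base}-grizy", f"{base}-u"]


print(release_dir_names("2.2i", 2, 1, cadence="wfd"))
print(release_dir_names("2.2i", 6, 2))
```

Generating the names from one function, rather than typing them by hand at each site, is one way to keep CC and NERSC consistent.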