Data Release Naming Conventions and Handling of Processing Data Transfers #413

heather999 · 2021-01-04T17:02:35Z

We need to write up concrete steps to handle the naming and versioning of our data releases, which also includes some recommendations for dealing with the data transfers between CC and NERSC. Looking for comments.

An example of the issue using Run2.2i DR6

The CC processing area is named run2.2i-coadd-wfd-dr6-v1 and at NERSC we reused this name for our DR6 v1 release.
Processing continues at CC and updated data must be copied to NERSC, but we do not want to update the dataset that has already been released.

Recommendations

CC's data processing directories likely should not indicate any particular version, as reprocessing and re-using those same directories will result in new versions for release. Something more generic that indicates this is a dataset that is still undergoing updates would be appropriate.
When updated data processing results in a new release, we will take a snapshot of the processing directories at NERSC. Those snapshots will be named using our agreed-upon release naming convention. See below.
To deal with the run2.2i-coadd-wfd-dr6-v1 area we already have: the CC area with that name will continue to be used for processing. At NERSC, there is a new run2.2i-coadd-dr6-processing directory and corresponding run2.2i-coadd-dr6-processing-grizy and run2.2i-coadd-dr6-processing-u directories. Future data transfers from CC to NERSC should be directed into these areas.
run2.2i-coadd-wfd-dr2-v1 at both CC and NERSC will continue to be the area that may be updated due to reprocessing. Snapshots will be taken at NERSC for releases and named according to our naming conventions. See below.

Review of Release Naming Conventions

As discussed on Slack, our release naming conventions should communicate the run (2.2i or 3.1i), the depth (DR1, DR2, etc.), and the cadence (WFD or DDF).

Examples

For the upcoming Run2.2i DR2 v1 release, a snapshot will be taken at NERSC, resulting in butler data directories:

run2.2i-wfd-dr2-v1
run2.2i-wfd-dr2-v1-grizy
run2.2i-wfd-dr2-v1-u

Additional DR6 releases will use a similar naming convention. However, as long as this processing includes both WFD and DDF visits, no cadence will be indicated in the name:

run2.2i-dr6-v2
run2.2i-dr6-v2-grizy
run2.2i-dr6-v2-u
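
A minimal sketch of how the names above could be generated, assuming the component order proposed in this post (run, optional cadence, DR, version). The function name release_names and its parameters are hypothetical, not part of any existing DESC tool:

```python
from typing import List, Optional


def release_names(run: str, dr: int, version: int,
                  cadence: Optional[str] = None) -> List[str]:
    """Return the three butler directory names for one release snapshot."""
    parts = [f"run{run}"]
    if cadence is not None:
        # The cadence (wfd or ddf) is left out when the processing
        # includes both WFD and DDF visits.
        parts.append(cadence.lower())
    parts += [f"dr{dr}", f"v{version}"]
    base = "-".join(parts)
    # The base directory plus its -grizy and -u companions.
    return [base, f"{base}-grizy", f"{base}-u"]


if __name__ == "__main__":
    print(release_names("2.2i", 2, 1, cadence="wfd"))
    print(release_names("2.2i", 6, 2))
```

Keeping the order of components in one place like this would make a later change of order a one-line edit.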

yymao · 2021-01-04T17:24:18Z

The general principle sounds good to me. The proposed name run2.2i-wfd-dr2-v1 has a different order for wfd and dr2 from what we did in GCRCatalogs (see https://github.com/LSSTDESC/gcr-catalogs/releases/tag/v1.1.0).

Now that I think about it, putting wfd first seems to make sense. But when we made the decision for GCRCatalogs we somehow went with run2.2i_dr2_wfd...

katrinheitmann · 2021-01-04T17:39:51Z

I think we should stick with what we had before for simplicity. dr2 is more general than wfd, so I think that's why it's first, though I think the order really doesn't matter much. So for historic reasons, I would choose what we had before.

johannct · 2021-01-04T18:25:36Z

My take on this :
I agree with Katrin, let's keep it simple, I do not think that the order brings any added value, so this is historical.

I am not sure I follow the current train of thought on the processing naming convention and how it translates into snapshots, released areas, and their naming convention.

Processing areas are internal; they should not be released. We may have been lazy in the past, but if we are to set up long term conventions and standards, this needs to be clear. As a consequence, naming convention for processing is arbitrary.
Processing areas should be exactly mirrored, without any modification.
Snapshots are needed and welcome to define releases.... but processing areas are never released per se, only a subset of their content, which thus needs to be copied over.
The way we deal with updates to the processing areas needs to remain totally independent of the way we deal with the release areas. Example 1 : I fixed 4852 because it was failed, I did not create a new rerun name because it would have been overkill, but that does not mean that at NERSC the copy should move to a different directory. Again mirroring should be exact. What needs to be updated as an independent standalone area is the area of the released products. Example 2 : we may have to rerun metacal for DR2 at CC; in this case I will likely use a new rerun, because the underlying metacal codebase will have changed; but I do not see any reason to change the current released area, as we just add new products that do not replace previously released ones; so this is another example where processing area and release area behave totally differently.

Maybe I misunderstand some of what is written above, in which case sorry for the noise. Maybe there is a sense that we need to keep a one to one relationship between processing area and released area. I think that very often this will be naturally guaranteed, as the processing area will be different for any new processing effort. But in the case of 4852 such a one to one requirement seems overkill to me. Caveat : I am not sure how gen3 processing is going to modify my seasoned view of how all this is coming out.

JoanneBogart · 2021-01-04T18:48:14Z

I basically agree with Johann but would like some clarification of details.

Is it fair to say "naming convention for processing is arbitrary" means "has no particular connection to naming of releases"? There still will be conventions for processing which are suited to the task at hand.
If processing areas are exactly mirrored at NERSC, we at least want to avoid conflicts or confusion with releases, which may put restrictions on naming of top-level processing directories.
In the first example concerning 4852, should it read "I fixed 4852 because it was failed, I did not create a new rerun name because it would have been overkill, but that does not mean that at NERSC the copy should not move to a different directory. " ?

johannct · 2021-01-04T18:58:01Z

@JoanneBogart

Is it fair to say "naming convention for processing is arbitrary" means "has no particular connection to naming of releases"? There still will be conventions for processing which are suited to the task at hand.

Most important is indeed that there is no reason to think of it as something understandable by people outside of processing, especially the public of the releases. As for conventions relevant for processing, with gen2 there really is nothing built in, so I just built rerun names out of the blue; it made sense to me, not necessarily to others. With gen3 I would not be surprised if this needs revisiting and more forethought.

If processing areas are exactly mirrored at NERSC, we at least want to avoid conflicts or confusion with releases, which may put restrictions on naming of top-level processing directories

I am not sure I understand your point. I hope that when we speak of releases it is clear that we are not speaking about internal processing directories. But we can make sure that the names are different, in any case.

In the first example concerning 4852, should it read "I fixed 4852 because it was failed, I did not create a new rerun name because it would have been overkill, but that does not mean that at NERSC the copy should not move to a different directory. " ?

Indeed, tricked by a double negation on the first day back to work. Bummer :)

heather999 · 2021-01-04T19:44:38Z

Going in order of comments, starting with Yao & Katrin. Ok - we'll go with run2.2i-dr2-wfd-v1 and use something along those lines for other releases.
Moving on to Johann & Joanne:
Agreed, processing areas should not be released. For Run2.2i, we have the unfortunate situation that the rerun area at NERSC contains both the releases and the processing areas. In the future, that should be avoided, and perhaps I can manage to reorganize the directories at NERSC to create separate processing and release areas.
This mixing of directories is part of the reason I would rather the processing directory names not appear to be some release version, but I feel it is also generally confusing. The v1 on run2.2i-coadd-wfd-dr6-v1, which at CC is a processing area, is misleading.
For the specific case of DR6, given that we have made a release, I have been reluctant to rename the released NERSC directory from run2.2i-coadd-wfd-dr6-v1. Upon further consideration, maybe it is time to rename the directories to the form run2.2i-dr6-v1 and update desc-dc2-dm-data; then run2.2i-coadd-wfd-dr6-v1 at NERSC would again mirror the processing area at CC.

Snapshots created for releases are copies of a subset of the processing area, and here we need to be very careful to define that subset and when it is appropriate to bump to a new version. For the specific case of DR6 tract 4852 patch 1,5, that should result in a separate released version (v2) of the object catalogs. Concerning the butler rerun area, it would be incorrect to simply update the 4852 1,5 files (even if this was initially a processing failure) without marking this as an updated versioned release. Due to disk space concerns, maybe we store v1 to tape (or even just store the original version of the updated files to tape, so v1 could be recreated, if that is ever needed).
We plan to release DR2 v1 now without metacal. When metacal processing becomes available, I still think that will result in a DR2 v2 release. Whatever the reason for a change in the data released, whether it is updating existing files or adding new ones, I think that deserves a bump in version.
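
The versioning rule in this comment can be sketched as a comparison of two file manifests, assuming each manifest maps a relative file path to a checksum. The manifest format, the function names, and the example file names are all hypothetical; this is one possible way to encode "any change to released data bumps the version", not an existing tool:

```python
from typing import Dict, List, Tuple


def diff_manifests(released: Dict[str, str],
                   candidate: Dict[str, str]) -> Tuple[List[str], List[str], List[str]]:
    """Return (added, removed, changed) paths between two manifests."""
    added = sorted(set(candidate) - set(released))
    removed = sorted(set(released) - set(candidate))
    changed = sorted(path for path in set(released) & set(candidate)
                     if released[path] != candidate[path])
    return added, removed, changed


def next_version(current: int, released: Dict[str, str],
                 candidate: Dict[str, str]) -> int:
    """Bump the version if anything was added, removed, or changed."""
    added, removed, changed = diff_manifests(released, candidate)
    return current + 1 if (added or removed or changed) else current


if __name__ == "__main__":
    # Hypothetical file names and checksums, for illustration only.
    v1 = {"coadd/4852/1,5.fits": "aaa", "coadd/4852/1,6.fits": "bbb"}
    fixed_patch = {"coadd/4852/1,5.fits": "ccc", "coadd/4852/1,6.fits": "bbb"}
    with_metacal = dict(v1, **{"metacal/4852.parquet": "ddd"})
    print(next_version(1, v1, fixed_patch))
    print(next_version(1, v1, with_metacal))
    print(next_version(1, v1, dict(v1)))
```

Under this rule both the reprocessed 4852 patch and the later addition of metacal products would count as a new version, while an unchanged snapshot would not.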

johannct · 2021-01-04T19:54:14Z

ok, so we disagree on several points here. For 4852, imho there is no difference between reprocessing it and rolling back a stream due to computing failure. And I do not think that you advocate bumping the version for each and every random rollback that occurs during processing... At least for the gen2 system, it would have been hell. I do not want to argue forever though, so whatever is ok with the majority is ok with me.

heather999 · 2021-01-04T19:58:34Z

My thoughts on versioning are strictly in regards to releases. If we release something and name it v1 and we then update the data in some way later, and then release that.. it is v2.
We took a long time to release DR6 initially and during that time there were of course rollbacks in the processing, but none of that mattered from a versioning standpoint, because we had not released anything yet.

yymao · 2021-01-04T20:02:29Z

I think maybe we need a clearer distinction between pre-releases and releases, given that some of the validation tests require that the data propagate all the way down to GCRCatalogs.

If we distinguish them, then we can say it's ok to update the content in place for pre-releases, but a snapshot copy must be made for releases.
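
A minimal sketch of that policy, assuming snapshots are plain directory copies. The function name publish and its arguments are hypothetical, and for simplicity the sketch copies the whole source directory rather than a selected subset of products:

```python
import shutil
from pathlib import Path


def publish(processing_dir: Path, target_dir: Path, is_release: bool) -> None:
    """Copy processing_dir to target_dir under the pre-release/release rule."""
    if target_dir.exists():
        if is_release:
            # A release is immutable: refuse to touch an existing one.
            raise FileExistsError(
                f"release {target_dir} already exists; make a new version instead")
        # A pre-release may be overwritten in place.
        shutil.rmtree(target_dir)
    shutil.copytree(processing_dir, target_dir)
```

Calling publish(src, dest, is_release=False) repeatedly refreshes the same pre-release, whereas a second call with is_release=True for an existing directory raises instead of overwriting.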

heather999 · 2021-01-04T20:27:06Z

For the object catalogs, we have pre-release areas, but have not done that for the butler/rerun areas... we could by creating a pre-release snapshot for validation and then renaming when a release is ready. I think that's fine. I don't think we can use the processing area for our pre-release validation, necessarily.
For the DR2 release and beyond.. I could create a release area that is separate from the processing area at NERSC. That area could contain pre-releases, which will be updated as needed based on validation testing.

An immediate thing I would like to reach a consensus on.. renaming the DR6 v1 butler area at NERSC which is run2.2i-coadd-wfd-dr6-v1 and moving to run2.2i-dr6-v1.. this could live under the new release area mentioned above..

There would need to be some announcement that this is happening, noting that there would not be any backward compatibility provided. But that is probably fine, especially for those that are using desc-dc2-dm-data: the alias 2.2i_dr6_wfd would remain available (and deprecated), but the path it points to would be updated. There would also be a new 2.2i_dr6 alias that points to the same new location.
desc-dc2-dm-data should be updated and released
Rename the directories under rerun
This would allow us to mirror the CC DR6 processing area without any future confusion.
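
The alias behaviour described here could look roughly like the sketch below. It does not reproduce the actual desc-dc2-dm-data code: the table layout, the resolve_repo function, and the RELEASE_ROOT placeholder path are all assumptions made for illustration.

```python
import warnings

# Placeholder only: the real location of the release area is not settled here.
RELEASE_ROOT = "/path/to/release/area/rerun"

# Current names and the directories they point to.
REPOS = {"2.2i_dr6": f"{RELEASE_ROOT}/run2.2i-dr6-v1"}

# Old names that keep working but point at the renamed directory.
DEPRECATED_ALIASES = {"2.2i_dr6_wfd": "2.2i_dr6"}


def resolve_repo(name: str) -> str:
    """Return the path for a repo name, warning on deprecated aliases."""
    if name in DEPRECATED_ALIASES:
        new_name = DEPRECATED_ALIASES[name]
        warnings.warn(f"{name!r} is deprecated; use {new_name!r}",
                      DeprecationWarning, stacklevel=2)
        name = new_name
    return REPOS[name]
```

With this arrangement, users who refer to 2.2i_dr6_wfd by alias keep working after the rename, while anyone with a hard-coded path to the old directory would need to update it.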

yymao · 2021-01-04T20:56:51Z

For the DR6 object catalogs, previously we have been treating pre-releases like releases, i.e., a new copy is made when a new pre-release is made. I think that's a bit overkill and results in way too many deprecated catalogs in GCRCatalogs in the end.

I am ok with keeping the pre-releases in the release area. The main point is just that if the processing area is updated and we want to propagate the update to the pre-release, we can just overwrite the existing pre-release instead of making a new one. For releases we will never overwrite, of course.

I don't really have opinions regarding renaming dr6. What you proposed sounds good to me. But at the very beginning you said that we are using run2.2i-coadd-dr6-processing for syncing with CC. Are you making a new proposal that we make a copy with the name run2.2i-dr6-v1 in the release area, and use run2.2i-coadd-wfd-dr6-v1 for syncing with CC processing? But either is fine with me actually.

johannct · 2021-01-04T20:59:09Z

If I look at https://github.com/LSSTDESC/desc-dc2-dm-data/blob/master/desc_dc2_dm_data/repos.py the path at NERSC for DR6 is /global/cfs/cdirs/lsst/production/DC2_ImSim/Run2.2i/desc_dm_drp/v19.0.0-v1/rerun/run2.2i-coadd-wfd-dr6-v1
In my mind mirroring means that everything below a root path is strictly identical. Here the rootpath is defined as /global/cfs/cdirs/lsst/production/DC2_ImSim/Run2.2i/desc_dm_drp. Indeed for CC the path is /sps/lssttest/dataproducts/desc/DC2/Run2.2i/v19.0.0-v1/rerun/run2.2i-coadd-wfd-dr6-v1 and its rootpath is /sps/lssttest/dataproducts/desc/DC2/Run2.2i. Everything below each rootpath is strictly identical between the two sites, be it directory hierarchy or content (we relax the strict identity for some intermediary products). This seems a sound situation to me.....
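
This definition of mirroring can be checked with a short sketch like the one below, which compares the directory hierarchy below two root paths. The function names are hypothetical, and it assumes both trees are visible from one machine; since CC and NERSC are separate sites, in practice one would more likely compare listings produced at each site. It checks the hierarchy only, not file contents.

```python
import os
from typing import List, Set, Tuple


def relative_tree(root: str) -> Set[str]:
    """Collect every file and directory below root, as paths relative to root."""
    entries = set()
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            entries.add(os.path.relpath(os.path.join(dirpath, name), root))
    return entries


def mirror_differences(root_a: str, root_b: str) -> Tuple[List[str], List[str]]:
    """Return (only under root_a, only under root_b); both empty for a mirror."""
    tree_a, tree_b = relative_tree(root_a), relative_tree(root_b)
    return sorted(tree_a - tree_b), sorted(tree_b - tree_a)
```

Because only paths relative to each root are compared, the two roots themselves can differ, as they do between the NERSC and CC paths quoted above.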

heather999 · 2021-01-04T21:50:08Z

To answer Yao's question, yes, I was making a new proposal... in the interest of reusing the existing naming convention at CC.

Concerning Johann's comment:
As far as I recall, desc-dc2-dm-data doesn't have a rootpath like GCRCatalogs, but does utilize SITE, so we have a set of repos for both NERSC and CC.
desc-dc2-dm-data is meant to point to release data rather than processing. We could continue to maintain processing and release data side by side in the same rerun area, but then I think we have to be much more clear about how we manage the naming of processing versus releases.
The processing areas at CC and NERSC should continue to live under desc_dm_drp and be mirrored and really should have nothing to do with desc-dc2-dm-data.
Releases should similarly be mirrors at NERSC and CC but as of this moment, we are just defining this.. and we would update desc-dc2-dm-data accordingly to point to released data. For now, some of that may live under desc_dm_drp, but I think we want to move away from that. We could have something like:

Run2.2i
 |__ releases 
       |__ 19.0.0
              |__ rerun

heather999 · 2021-01-04T22:38:41Z

Chatted briefly with Johann offline, and we have an updated proposal. The "releases", which include butler accessible data, would reside under shared, utilizing names that are identical to the naming convention used for the object & dpdd catalogs (which we should review given all the discussions about WFD, DDF, etc).
So.. we were thinking about introducing a new area at NERSC (and ultimately CC): /global/cfs/cdirs/lsst/shared/DC2-prod/Run2.2i/butler. That would look something like:

19.0.0
|_ CALIB, _mapper, ref_cats, raw, etc (everything the butler needs to make sense of the data - these could be symlinks)
|_ rerun
       |_ run2.2i-dr2-wfd-v1

Asking the Data Access team (@JoanneBogart & @yymao) if they feel it is ok to include butler accessible data in the shared area? Thinking ahead to Gen3.. this might mean including files accessible from Postgres... is that appropriate?

yymao · 2021-01-04T22:47:02Z

I think that's fine, and a good proposal in fact. The shared area is designed to be mirrored across DESC sites (currently only CC and NERSC, of course), so moving the release area into shared makes sense to me.

Did you have specific concerns regarding including butler/Postgres accessible data in shared? I couldn't think of any immediately.

One note is that all the symlinks in **/butler should be internal (i.e., not linked outside of shared, and preferably not linked outside of **/butler).
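
One way to verify this constraint is sketched below, assuming a POSIX filesystem. The function name external_symlinks is hypothetical; root can be the butler directory itself or the enclosing shared area, depending on which of the two conditions is being checked.

```python
import os
from typing import List


def external_symlinks(root: str) -> List[str]:
    """List symlinks below root whose resolved target lies outside root."""
    root_real = os.path.realpath(root)
    offenders = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                target = os.path.realpath(path)
                # The target is internal only if root is a prefix of its path.
                if os.path.commonpath([root_real, target]) != root_real:
                    offenders.append(path)
    return sorted(offenders)
```

An empty result means every symlink resolves inside root, so the tree can be mirrored to another site without dangling links.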

heather999 · 2021-01-05T16:33:28Z

Great - that all makes sense. Not today, but in the next couple of days, I will start setting this up at NERSC.

JoanneBogart · 2021-01-05T17:47:13Z

Putting butler-accessible data under our shared directory sounds good to me.
I don't have a problem with including files accessible via Postgres. People just have to understand that to access them in the recommended fashion they have to port the Postgres database as well. (Is that something we should be thinking about? Will they have to re-ingest or is it possible to dump the db and reconstitute it elsewhere?).

heather999 · 2021-01-05T18:03:07Z

I would imagine it is possible to dump the db and set it up elsewhere. Definitely something we should try to see how it goes. I could imagine other sites may want to mirror and we probably do want to support that.
Individual users are another question, where I would assume they may only want access to specific subsets of data... but are they constrained to work at NERSC or CC to access the butler data? Not sure... I guess right now - even without Postgres, only advanced users would do otherwise and extract files to their own machines.

heather999 assigned heather999, johannct, yymao, JoanneBogart, katrinheitmann, jchiang87, wmwv and airnandez Jan 4, 2021

jchiang87 mentioned this issue Oct 27, 2021

Write up details of Run2.2i and Run3.1i #412

Closed
