Data Release Naming Conventions and Handling of Processing Data Transfers #413

Open
heather999 opened this issue Jan 4, 2021 · 18 comments


@heather999
Collaborator

We need to write up concrete steps for handling the naming and versioning of our data releases, along with some recommendations for dealing with the data transfers between CC and NERSC. Comments welcome.

An example of the issue using Run2.2i DR6

The CC processing area is named run2.2i-coadd-wfd-dr6-v1 and at NERSC we reused this name for our DR6 v1 release.
Processing continues at CC and updated data must be copied to NERSC, but we do not want to update the dataset that has already been released.

Recommendations

  • CC's data processing directories likely should not indicate any particular version, as reprocessing and re-using those same directories will result in new versions for release. Something more generic that indicates this is a dataset that is still undergoing updates would be appropriate.
  • When updated data processing results in a new release, we will take a snapshot of the processing directories at NERSC. Those snapshots will be named using our agreed-upon release naming convention (see below); a minimal snapshot sketch follows this list.
  • To deal with the run2.2i-coadd-wfd-dr6-v1 area we already have: the CC area with that name will continue to be used for processing. At NERSC, there is a new run2.2i-coadd-dr6-processing directory and corresponding run2.2i-coadd-dr6-processing-grizy and run2.2i-coadd-dr6-processing-u directories. Future data transfers from CC to NERSC should be directed into these areas.
  • run2.2i-coadd-wfd-dr2-v1 at both CC and NERSC will continue to be the area that may be updated due to reprocessing. Snapshots will be taken at NERSC for releases and named according to our naming conventions. See below.
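For concreteness, a minimal Python sketch (standard library only) of the snapshot step described in this list; the paths are hypothetical, and in practice a bulk transfer tool and making the snapshot read-only would likely be preferred:

import shutil
from pathlib import Path

# Hypothetical paths: a processing rerun at NERSC and the release-named
# snapshot to be created from it.
processing = Path("/path/to/rerun/run2.2i-coadd-dr6-processing")
release = Path("/path/to/rerun/run2.2i-dr6-v2")

# Refuse to overwrite an existing release; a change to released data
# should get a new version instead.
if release.exists():
    raise FileExistsError(f"{release} already exists; bump the version instead")

# Copy the processing area into the release-named directory,
# preserving symlinks rather than following them.
shutil.copytree(processing, release, symlinks=True)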

Review of Release Naming Conventions

As discussed on Slack our release naming conventions should communicate the run (2.2i or 3.1i), the depth (DR1, DR2, etc.), and the cadence (WFD or DDF).

Examples

For the upcoming Run2.2i DR2 v1 release, a snapshot will be taken at NERSC, resulting in butler data directories:

  • run2.2i-wfd-dr2-v1
  • run2.2i-wfd-dr2-v1-grizy
  • run2.2i-wfd-dr2-v1-u

Additional DR6 releases will use a similar naming convention; however, as long as the processing includes both WFD and DDF visits, no cadence will be indicated in the name (a small sketch composing such names follows these examples):

  • run2.2i-dr6-v2
  • run2.2i-dr6-v2-grizy
  • run2.2i-dr6-v2-u
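To make the convention concrete, here is a small illustrative Python sketch (not part of any existing package) that composes a release directory name from its components; the field order simply follows the examples above:

def release_name(run, depth, version, cadence=None, band_group=None):
    """Compose a release directory name, e.g. run2.2i-wfd-dr2-v1-grizy.

    cadence ('wfd' or 'ddf') is omitted when the release mixes WFD and
    DDF visits, as in the DR6 examples above.
    """
    parts = [f"run{run}"]
    if cadence:
        parts.append(cadence)
    parts.extend([f"dr{depth}", f"v{version}"])
    if band_group:
        parts.append(band_group)
    return "-".join(parts)

# Reproduces the examples above:
assert release_name("2.2i", 2, 1, cadence="wfd") == "run2.2i-wfd-dr2-v1"
assert release_name("2.2i", 6, 2, band_group="grizy") == "run2.2i-dr6-v2-grizy"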
@yymao
Member

yymao commented Jan 4, 2021

The general principle sounds good to me. The proposed name run2.2i-wfd-dr2-v1 has a different order for wfd and dr2 from what we did in GCRCatalogs (see https://github.com/LSSTDESC/gcr-catalogs/releases/tag/v1.1.0).

Now that I think about it, putting wfd first seems to make sense. But when we made the decision for GCRCatalogs we somehow went with run2.2i_dr2_wfd...

@katrinheitmann
Contributor

I think we should stick with what we had before, for simplicity. dr2 is more general than wfd, so I think that's why it's first, though I think the order really doesn't matter much. So for historic reasons, I would choose what we had before.

@johannct
Contributor

johannct commented Jan 4, 2021

My take on this: I agree with Katrin; let's keep it simple. I do not think the order brings any added value, so this is historical.

I am not sure I follow the current train of thought on the processing naming convention and how it translates into snapshots, release areas, and release naming:

  • Processing areas are internal; they should not be released. We may have been lazy in the past, but if we are to set up long-term conventions and standards, this needs to be clear. As a consequence, the naming convention for processing is arbitrary.
  • Processing areas should be exactly mirrored, without any modification.
  • Snapshots are needed and welcome to define releases... but processing areas are never released per se, only a subset of their content, which thus needs to be copied over.
  • The way we deal with updates to the processing areas needs to remain totally independent of the way we deal with the release areas. Example 1: I fixed 4852 because it was failed, I did not create a new rerun name because it would have been overkill, but that does not mean that at NERSC the copy should move to a different directory. Again, mirroring should be exact. What needs to be updated as an independent, standalone area is the area of the released products. Example 2: we may have to rerun metacal for DR2 at CC; in this case I will likely use a new rerun, because the underlying metacal codebase will have changed, but I do not see any reason to change the current released area, as we just add new products that do not replace previously released ones. So this is another example where the processing area and the release area behave totally differently.

Maybe I misunderstand some of what is written above, in which case sorry for the noise. Maybe there is a sense that we need to keep a one-to-one relationship between the processing area and the released area. I think that very often this will be naturally guaranteed, as the processing area will be different for any new processing effort. But in the case of 4852 such a one-to-one requirement seems overkill to me. Caveat: I am not sure how gen3 processing is going to modify my seasoned view of how all this comes out.

@JoanneBogart
Contributor

I basically agree with Johann but would like some clarification of details.

  • Is it fair to say "naming convention for processing is arbitrary" means "has no particular connection to naming of releases"? There will still be conventions for processing which are suited to the task at hand.
  • If processing areas are exactly mirrored at NERSC, we at least want to avoid conflicts or confusion with releases, which may put restrictions on the naming of top-level processing directories.
  • In the first example concerning 4852, should it read "I fixed 4852 because it was failed, I did not create a new rerun name because it would have been overkill, but that does not mean that at NERSC the copy should not move to a different directory. " ?

@johannct
Contributor

johannct commented Jan 4, 2021

@JoanneBogart

Is it fair to say "naming convention for processing is arbitrary" means "has no particular connection to naming of releases"? There will still be conventions for processing which are suited to the task at hand.

Most important is indeed that there is no reason to think of it as something understandable by people outside of processing, especially the audience of the releases. As for conventions relevant for processing, with gen2 there really is nothing built in, so I just made up rerun names out of the blue; they made sense to me, not necessarily to others. With gen3 I would not be surprised if this needs revisiting and more forethought.

If processing areas are exactly mirrored at NERSC, we at least want to avoid conflicts or confusion with releases, which may put restrictions on naming of top-level processing directories

I am not sure I understand your point. I hope that when we speak of releases it is clear that we are not speaking about internal processing directories. But we can make sure that the names are different, in any case.

In the first example concerning 4852, should it read "I fixed 4852 because it was failed, I did not create a new rerun name because it would have been overkill, but that does not mean that at NERSC the copy should not move to a different directory. " ?

Indeed, tricked by a double negation on the first day back to work. Bummer :)

@heather999
Collaborator Author

Going in order of comments, starting with Yao and Katrin: OK, we'll go with run2.2i-dr2-wfd-v1 and use something along those lines for other releases.
Moving on to Johann & Joanne:
Agreed, processing areas should not be released. For Run2.2i, we have the unfortunate situation that the rerun area at NERSC contains both the releases and the processing outputs. In the future, that should be avoided, and perhaps I can manage to reorganize the directories at NERSC to create separate processing and release areas.
This mixing of directories is part of the reason I would rather the processing directory names not look like release versions, but I feel it is also generally confusing: the v1 in run2.2i-coadd-wfd-dr6-v1, which at CC is a processing area, is misleading.
For the specific case of DR6, given that we have made a release, I have been reluctant to rename the released NERSC directory from run2.2i-coadd-wfd-dr6-v1. Upon further consideration, maybe it is time to rename the directories to the form run2.2i-dr6-v1, update desc-dc2-dm-data, and then let run2.2i-coadd-wfd-dr6-v1 at NERSC again mirror the processing area at CC.

Snapshots created for releases are copies of a subset of the processing area, and here we need to be very careful to define that subset and when it is appropriate to bump to a new version. For the specific case of DR6 tract 4852 patch 1,5, that should result in a separate released version (v2) of the object catalogs. Concerning the butler rerun area, it would be incorrect to simply update the 4852 1,5 files (even if this was initially a processing failure) without marking this as an updated, versioned release. Due to disk space concerns, maybe we store v1 to tape (or even just store the original version of the updated files to tape, so v1 could be recreated if that is ever needed).
The plan is to release DR2 v1 now without metacal; when a metacal processing becomes available, I still think that will result in a DR2 v2 release. Whatever the reason for a change in the released data, whether it is updating existing files or adding new ones, I think it deserves a bump in version.

@johannct
Contributor

johannct commented Jan 4, 2021

OK, so we disagree on several points here. For 4852, IMHO there is no difference between reprocessing it and rolling back a stream due to a computing failure, and I do not think you advocate bumping the version for each and every random rollback that occurs during processing... At least for the gen2 system, that would have been hell. I do not want to argue forever, though, so whatever is OK with the majority is OK with me.

@heather999
Collaborator Author

My thoughts on versioning are strictly in regard to releases. If we release something and name it v1, and we then update the data in some way later and release that, it is v2.
We took a long time to release DR6 initially and during that time there were of course rollbacks in the processing, but none of that mattered from a versioning standpoint, because we had not released anything yet.

@yymao
Member

yymao commented Jan 4, 2021

I think maybe we need a clearer distinction between pre-releases and releases, given that some of the validation tests require that the data propagate all the way down to GCRCatalogs.

If we distinguish them, then we can say it's ok to update the content in place for pre-releases, but a snapshot copy must be made for releases.

@heather999
Collaborator Author

heather999 commented Jan 4, 2021

For the object catalogs, we have pre-release areas, but we have not done that for the butler/rerun areas. We could, by creating a pre-release snapshot for validation and then renaming it when a release is ready; I think that's fine. I don't think we can necessarily use the processing area for our pre-release validation.
For the DR2 release and beyond, I could create a release area that is separate from the processing area at NERSC. That area could contain pre-releases, which will be updated as needed based on validation testing.

An immediate thing I would like to reach consensus on: renaming the DR6 v1 butler area at NERSC, which is run2.2i-coadd-wfd-dr6-v1, to run2.2i-dr6-v1; this could live under the new release area mentioned above.

  • There would need to be some announcement that this is happening, noting that no backward compatibility would be provided. That is probably fine: for those using desc-dc2-dm-data, the alias 2.2i_dr6_wfd would remain available (but deprecated) and the path it points to would be updated, and there would also be a new 2.2i_dr6 alias pointing to the same new location (a rough sketch of what such an alias mapping might look like follows this list).
  • desc-dc2-dm-data should be updated and released
  • Rename the directories under rerun
    This would allow us to mirror the CC DR6 processing area without any future confusion.
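A rough, hypothetical sketch of what such an alias mapping could look like; this does not reproduce the actual layout of desc_dc2_dm_data/repos.py, and the release path shown is a placeholder for the new release area:

# Hypothetical per-site alias mapping; the real desc-dc2-dm-data package
# organizes its repo paths differently.
_NEW_RELEASE_AREA = "/path/to/new/release/area/rerun/run2.2i-dr6-v1"  # placeholder

NERSC_REPOS = {
    # Existing alias, kept but deprecated; its path is updated to the
    # renamed release directory.
    "2.2i_dr6_wfd": _NEW_RELEASE_AREA,
    # New alias pointing to the same location.
    "2.2i_dr6": _NEW_RELEASE_AREA,
}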

@yymao
Member

yymao commented Jan 4, 2021

For the DR6 object catalogs, we have previously been treating pre-releases like releases, i.e., a new copy is made when a new pre-release is made. I think that's a bit of overkill and results in way too many deprecated catalogs in GCRCatalogs in the end.

I am OK with keeping the pre-releases in the release area. The main point is just that if the processing area is updated and we want to propagate the update to the pre-release, we can just overwrite the existing pre-release instead of making a new one. For releases we will never overwrite, of course.

I don't really have opinions regarding renaming dr6. What you proposed sounds good to me. But at the very beginning you said that we are using run2.2i-coadd-dr6-processing for syncing with CC. Are you making a new proposal that we make a copy with the name run2.2i-dr6-v1 in the release area, and use run2.2i-coadd-wfd-dr6-v1 for syncing with CC processing? But either is fine with me actually.

@johannct
Contributor

johannct commented Jan 4, 2021

If I look at https://github.com/LSSTDESC/desc-dc2-dm-data/blob/master/desc_dc2_dm_data/repos.py the path at NERSC for DR6 is /global/cfs/cdirs/lsst/production/DC2_ImSim/Run2.2i/desc_dm_drp/v19.0.0-v1/rerun/run2.2i-coadd-wfd-dr6-v1
In my mind, mirroring means that everything below a root path is strictly identical. Here the root path is /global/cfs/cdirs/lsst/production/DC2_ImSim/Run2.2i/desc_dm_drp. Indeed, for CC the path is /sps/lssttest/dataproducts/desc/DC2/Run2.2i/v19.0.0-v1/rerun/run2.2i-coadd-wfd-dr6-v1 and its root path is /sps/lssttest/dataproducts/desc/DC2/Run2.2i. Everything below each root path is strictly identical between the two sites, be it directory hierarchy or content (we relax the strict identity for some intermediary products). This seems a sound situation to me.
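For illustration, a minimal Python sketch of that strict-mirror check, comparing the sets of relative paths under each root (content comparison and the relaxation for intermediary products are left out; in practice each listing would be produced at its own site and then compared, since neither site sees the other's filesystem):

from pathlib import Path

def relative_tree(root):
    # All paths under `root`, expressed relative to it.
    root = Path(root)
    return {p.relative_to(root) for p in root.rglob("*")}

# Root paths as quoted above.
nersc_root = "/global/cfs/cdirs/lsst/production/DC2_ImSim/Run2.2i/desc_dm_drp"
cc_root = "/sps/lssttest/dataproducts/desc/DC2/Run2.2i"

# A strict mirror means the two sets of relative paths agree.
only_nersc = relative_tree(nersc_root) - relative_tree(cc_root)
only_cc = relative_tree(cc_root) - relative_tree(nersc_root)
print(f"{len(only_nersc)} paths only at NERSC, {len(only_cc)} only at CC")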

@heather999
Collaborator Author

heather999 commented Jan 4, 2021

To answer Yao's question, yes, I was making a new proposal... in the interest of reusing the existing naming convention at CC.

Concerning Johann's comment:
As far as I recall, desc-dc2-dm-data doesn't have a rootpath like GCRCatalogs, but does utilize SITE, so we have a set of repos for both NERSC and CC.
desc-dc2-dm-data is meant to point to release data rather than processing. We could continue to maintain processing and release data side by side in the same rerun area, but then I think we have to be much clearer about how we manage the naming of processing versus releases.
The processing areas at CC and NERSC should continue to live under desc_dm_drp and be mirrored and really should have nothing to do with desc-dc2-dm-data.
Releases should similarly be mirrored at NERSC and CC, but as of this moment we are just defining this, and we would update desc-dc2-dm-data accordingly to point to released data. For now, some of that may live under desc_dm_drp, but I think we want to move away from that. We could have something like:

Run2.2i
 |__ releases 
       |__ 19.0.0
              |__ rerun

@heather999
Collaborator Author

Chatted briefly with Johann offline, and we have an updated proposal. The "releases", which include butler-accessible data, would reside under shared, using names identical to the naming convention used for the object & dpdd catalogs (which we should review, given all the discussions about WFD, DDF, etc.).
So we were thinking about introducing a new area at NERSC (and ultimately CC): /global/cfs/cdirs/lsst/shared/DC2-prod/Run2.2i/butler. That would look something like:

19.0.0
|_ CALIB, _mapper, ref_cats, raw, etc (everything the butler needs to make sense of the data - these could be symlinks)
|_ rerun
       |_ run2.2i-dr2-wfd-v1

Asking the Data Access team (@JoanneBogart & @yymao): do you feel it is OK to include butler-accessible data in the shared area? Thinking ahead to Gen3, this might mean including files accessible from Postgres... is that appropriate?
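As a concreteness check on "butler accessible", a minimal Gen2-style sketch of reading from the proposed location (assuming the LSST Science Pipelines are set up and the rerun has been populated; the dataset type and data id are just examples):

# Gen2 butler access sketch; requires the LSST Science Pipelines.
from lsst.daf.persistence import Butler

repo = "/global/cfs/cdirs/lsst/shared/DC2-prod/Run2.2i/butler/19.0.0/rerun/run2.2i-dr2-wfd-v1"
butler = Butler(repo)

# Example dataset type and data id, purely illustrative.
coadd = butler.get("deepCoadd", tract=4852, patch="1,5", filter="i")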

@yymao
Member

yymao commented Jan 4, 2021

I think that's fine, and a good proposal in fact. The shared area is designed to be mirrored across DESC sites (currently only CC and NERSC, of course), so moving the release area into shared makes sense to me.

Did you have specific concerns regarding including butler/Postgres accessible data in shared? I couldn't think of any immediately.

One note is that all the symlinks in **/butler should be internal (i.e., not linked outside of shared, and preferably not linked outside of **/butler).
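A small Python sketch of the kind of check that note implies, flagging symlinks under the butler area that resolve outside of shared (paths as proposed above; not an existing tool):

import os
from pathlib import Path

shared_root = Path("/global/cfs/cdirs/lsst/shared")
butler_root = shared_root / "DC2-prod/Run2.2i/butler"

# Walk the butler area and report any symlink whose target resolves
# outside of the shared area.
for path in butler_root.rglob("*"):
    if path.is_symlink():
        target = Path(os.path.realpath(path))
        if not str(target).startswith(str(shared_root) + os.sep):
            print(f"external symlink: {path} -> {target}")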

@heather999
Collaborator Author

Great - that all makes sense. Not today, but in the next couple of days, I will start setting this up at NERSC.

@JoanneBogart
Contributor

JoanneBogart commented Jan 5, 2021

Putting butler-accessible data under our shared directory sounds good to me.
I don't have a problem with including files accessible via Postgres. People just have to understand that to access them in the recommended fashion they have to port the Postgres database as well. (Is that something we should be thinking about? Will they have to re-ingest or is it possible to dump the db and reconstitute it elsewhere?).

@heather999
Collaborator Author

I would imagine it is possible to dump the db and set it up elsewhere; definitely something we should try, to see how it goes. I could imagine other sites may want to mirror, and we probably do want to support that.
Individual users are another question: I would assume they may only want access to specific subsets of data... but are they constrained to work at NERSC or CC to access the butler data? Not sure. I guess right now, even without Postgres, only advanced users would do otherwise and extract files to their own machines.
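For reference, a minimal sketch of the dump-and-reconstitute round trip discussed above, using PostgreSQL's standard pg_dump/pg_restore tools from Python (the database name, hosts, and dump file are hypothetical; credentials and the actual DESC database setup are not addressed):

import subprocess

# Dump the (hypothetical) database in PostgreSQL's custom format...
subprocess.run(
    ["pg_dump", "-Fc", "-h", "source-db-host", "-f", "dc2_registry.dump", "dc2_registry"],
    check=True,
)

# ...and reconstitute it at another site into a pre-created empty database.
subprocess.run(
    ["pg_restore", "-h", "destination-db-host", "-d", "dc2_registry", "dc2_registry.dump"],
    check=True,
)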
