Rationalize the storage of all DC2 data products #309

Open
heather999 opened this issue Jan 6, 2019 · 22 comments

@heather999
Collaborator

NERSC CSCRATCH space is temporary. To avoid losing any Run2.xi data, we have started copying files over to /global/projecta/projectdirs/lsst/production/DC2_ImSim/. This issue is meant to document the transfer and sort out some remaining questions.

Here is the copy plan for Run2.0i:
instance catalogs: /global/cscratch1/sd/desc/Run2.0i/instCat => /global/projecta/projectdirs/lsst/production/DC2_ImSim/Run2.0i/instCat

data: /global/cscratch1/sd/desc/Run2.0i/outputs => /global/projecta/projectdirs/lsst/production/DC2_ImSim/Run2.0i/outputs

Run2.1i:
instance catalogs: /global/cscratch1/sd/desc/DC2/Run2.0i/cosmoDC2_v1.1.4 => /global/projecta/projectdirs/lsst/production/DC2_ImSim/Run2.1i/cosmoDC2_v1.1.4

There are some files which lack proper permissions to allow copying, such as:
-rw-r----- 1 asv13 asv13 3411 Sep 22 10:18 /global/cscratch1/sd/desc/DC2/Run2.0i/instCat/edison_packed_submissions.py
To allow a clean copy, it is requested that these files have their permissions adjusted to allow reading by the lsst group.
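
A quick audit before the copy could look like the following Python sketch. It walks a directory tree and reports regular files that a given group cannot read, either because the file belongs to a different group (as in the listing above, where the group is asv13 rather than lsst) or because the group-read bit is missing. The function name `not_readable_by_gid` is made up for illustration, and the check is simplified: it ignores ACLs and the permissions of parent directories.

```python
import os
import stat
from pathlib import Path


def not_readable_by_gid(root, gid):
    """Return regular files under root that members of group `gid` cannot read.

    Simplified assumption: a file counts as readable if it is world-readable,
    or if it belongs to `gid` and has the group-read bit set. ACLs and
    directory permissions are not considered.
    """
    flagged = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            try:
                info = path.lstat()
            except OSError:
                # The file vanished or cannot be inspected; skip it.
                continue
            if not stat.S_ISREG(info.st_mode):
                continue
            world_ok = bool(info.st_mode & stat.S_IROTH)
            group_ok = info.st_gid == gid and bool(info.st_mode & stat.S_IRGRP)
            if not (world_ok or group_ok):
                flagged.append(path)
    return sorted(flagged)
```

On a Unix system the numeric id for the lsst group could be looked up with `grp.getgrnam("lsst").gr_gid` and passed in as `gid`. The owners of the flagged files would then need to change the group and/or mode themselves.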

Concerning the Run2.1i instance catalogs, it is my understanding that the area used for production is /global/cscratch1/sd/desc/DC2/Run2.0i/cosmoDC2_v1.1.4 and not /global/cscratch1/sd/desc/DC2/Run2.0i/Run2.1i/instCat,
meaning this last area does NOT need to be copied over to projecta. Correct?

@jchiang87
Contributor

More generally, for Run1.1p, Run1.2p, Run1.2i, Run2.0i (omit?), and Run2.1i, we need the following data products to be saved in permanent storage and in standard locations (so they can be readily found):

For on-sky simulations:

  • instance catalogs
  • eimage files (if they were created)
  • raw files

Calibration products:

  • calibration raw files: bias frames, dark frames, dome flats, BF flats
  • calibration products, i.e., the CALIB folder filled by ingestCalibs.py and the BF kernel data that is meant to be put into the calibrations folder in the Stack input repository.

Image processing related:

  • Input files to the IngestReferenceCatalog.py, e.g. files like dc2_reference_catalog_dc2v3_fov4_thinned.txt
  • The ref_cats folder, i.e., the output of IngestReferenceCatalog.py.
  • Stack input/output repositories

@danielsf
Contributor

danielsf commented Jan 8, 2019

"""
Concerning the Run2.1i instance catalogs, it is my understanding that this is the area used for production: /global/cscratch1/sd/desc/DC2/Run2.0i/cosmoDC2_v1.1.4 and not: /global/cscratch1/sd/desc/DC2/Run2.0i/Run2.1i/instCat
meaning this last area does NOT need to be copied over to projecta. Correct?
"""

@villarrealas should have the final say but, as far as I understand, you are correct, @heather999

@heather999 changed the title from "Copying Run2.0i and Run2.1i data from NERSC SCRATCH to projecta" to "Rationalize the storage of all DC2 data products" on Jan 8, 2019
@johannct
Contributor

johannct commented Jan 8, 2019

* Stack input/output repositories

This is not exactly the nomenclature when dealing with --rerun syntax, and we should come to an agreement about rerun attributes. The CC pipeline currently uses only two rerun values, one for calexp production and the other for coadd and multiband. Something like calexp-vX:coadd-vY has been proposed recently, with X=Y nominally.

@heather999
Collaborator Author

I'm all in favor of using --rerun. Including some versioning as you mention above with calexp-vX:coadd-vY sounds good. Do we want to consider separate coadd and multiband reruns so that we can support different coadd configurations for u-band (for example)?
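
To make the proposed convention concrete, here is a minimal Python sketch that builds and validates labels of the form calexp-vX:coadd-vY. It assumes integer versions and exactly two stages named calexp and coadd; the helper names `make_rerun` and `parse_rerun` are hypothetical and not part of the Stack or of any existing pipeline script.

```python
import re

# Assumed label shape: "<stage>-v<integer>", e.g. "calexp-v1".
_LABEL = re.compile(r"^(calexp|coadd)-v(\d+)$")


def make_rerun(calexp_version, coadd_version=None):
    """Build 'calexp-vX:coadd-vY'; Y defaults to X (the nominal case)."""
    if coadd_version is None:
        coadd_version = calexp_version
    return f"calexp-v{calexp_version}:coadd-v{coadd_version}"


def parse_rerun(value):
    """Return {'calexp': X, 'coadd': Y}, or raise ValueError if malformed."""
    parts = value.split(":")
    if len(parts) != 2:
        raise ValueError(f"expected 'calexp-vX:coadd-vY', got {value!r}")
    versions = {}
    for part, stage in zip(parts, ("calexp", "coadd")):
        match = _LABEL.match(part)
        if match is None or match.group(1) != stage:
            raise ValueError(f"expected '{stage}-v<N>', got {part!r}")
        versions[stage] = int(match.group(2))
    return versions
```

A separate multiband rerun, as asked about above, would mean extending the tuple of stages; that is left out here because the thread has not settled on it.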

@johannct
Contributor

johannct commented Jan 8, 2019

I'd rather avoid it if I can, as it implies modifying many scripts. At least for the time being...

@heather999
Collaborator Author

heather999 commented Jan 11, 2019

Some files under the Run2.0i instance catalogs (/global/cscratch1/sd/desc/Run2.0i/instCat) at NERSC on CSCRATCH have started to be purged as of Jan 7th (see /global/cscratch1/sd/desc/.purged.20190107). We do not have a copy, as I was hoping we could sort out the permissions on this particular directory first. @katrinheitmann @villarrealas @jchiang87 Is there a copy at ANL? Do we care about Run2.0i instance catalogs? Fortunately, the Run2.0i outputs have already been copied to projecta:
data: /global/cscratch1/sd/desc/Run2.0i/outputs => /global/projecta/projectdirs/lsst/production/DC2_ImSim/Run2.0i/outputs
Anyway, I'll see about getting all the data organized... hopefully something we can discuss further at an upcoming CI meeting.

@katrinheitmann
Contributor

katrinheitmann commented Jan 11, 2019 via email

@JulienPeloton
Contributor

Hi! Is there a plan to archive data to HPSS at NERSC for old runs? That would free a lot of space in /global/projecta/projectdirs.

@heather999
Collaborator Author

@JulienPeloton We can certainly do that (and ultimately we will!). We need to review our HPSS quota and current tape use. Migrating to HPSS really should be done when we feel we are effectively "done" actively working with some of the DC2 data. Transfer out of HPSS is very slow, so I'm hesitant to start moving data to free up space until we know that certain parts of the data are not needed for immediate access, though we can certainly start copying portions of the data into HPSS just to get that done. There is also some question of which tape resource we are going to use: is IN2P3 planning to use their tape archive? I thought there was also mention of ANL doing the same. It would be helpful to get that straight too. And if we use multiple tape archives, how do we keep track of it?

@JulienPeloton
Contributor

Thanks @heather999 for your detailed answer!

@heather999
Collaborator Author

@danielsf Can I trouble you to point me to the instance catalogs for Run1.1p, Run1.2p, and Run1.2i?

@heather999
Collaborator Author

Keeping track of all the data here:
https://confluence.slac.stanford.edu/display/LSSTDESC/NERSC+DC2+Data
As discussed at this week's DM-DC2 mtg, we should be able to remove some of the DM output from Run1.2i.
w_2018_39/rerun/281118/postISRCCD will be completely removed.
It also sounds like the "warps" under w_2018_39/rerun/coadd-v1/deepCoadd could be removed as well. I will need some guidance to determine which files we can eliminate, @wmwv @rearmstr @jchiang87. If I dig into deepCoadd I see:

ls coadd-v1/deepCoadd/g/4429/2,6
psfMatchedWarp-g-4429-2,6-399393.fits  warp-g-4429-2,6-185734.fits

Is it those warp*.fits that can be removed or something else?

69T     w_2018_39/rerun
69T     w_2018_39
39T     w_2018_39/rerun/coadd-v1
38T     w_2018_39/rerun/coadd-v1/deepCoadd
31T     w_2018_39/rerun/281118
16T     w_2018_39/rerun/281118/calexp
14T     w_2018_39/rerun/281118/postISRCCD
825G    w_2018_39/rerun/coadd-v1/deepCoadd-results
345G    w_2018_39/rerun/281118/src
118G    w_2018_39/rerun/281118/icSrc
51G     w_2018_39/rerun/coadd-v1/deepCoadd_results
22G     w_2018_39/rerun/281118/srcMatch
22G     w_2018_39/rerun/281118/singleFrameDriver_metadata
20M     w_2018_39/rerun/coadd-v1/scripts
1.4M    w_2018_39/rerun/coadd-v1/schema
898K    w_2018_39/rerun/281118/config
770K    w_2018_39/rerun/coadd-v1/config
385K    w_2018_39/rerun/281118/schema
1.5K    w_2018_39/rerun/281118/deepCoadd

@jchiang87
Contributor

@wmwv and @rearmstr should confirm, but I think you can delete everything in

w_2018_39/rerun/coadd-v1/deepCoadd/u
w_2018_39/rerun/coadd-v1/deepCoadd/g
w_2018_39/rerun/coadd-v1/deepCoadd/r
w_2018_39/rerun/coadd-v1/deepCoadd/i
w_2018_39/rerun/coadd-v1/deepCoadd/z
w_2018_39/rerun/coadd-v1/deepCoadd/y

i.e., everything in w_2018_39/rerun/coadd-v1/deepCoadd except w_2018_39/rerun/coadd-v1/deepCoadd/skyMap.pickle.

@boutigny

boutigny commented Mar 6, 2019

Be careful: if you delete everything under deepCoadd, you will also delete the coadded FITS images. I think that we need to keep all the x,y.fits files under coadd-v1/deepCoadd/filter/patch#.
The x,y directories can be safely deleted. I don't know what the x,y_nImage.fits files are.

@rearmstr

rearmstr commented Mar 6, 2019

The x,y_nImage.fits files record the number of images that went into the coadd. The coadds in these directories do not have the background removed, so there may be people who would like to look at these. The directories that contain the warp* and psfMatchedWarp* files can be removed.

@wmwv
Contributor

wmwv commented Mar 6, 2019

I realize I put this in the meeting notes, but not this thread. Copying here:

We don't need to keep the postISRCCD files, e.g.

/global/projecta/projectdirs/lsst/production/DC2_PhoSim/Run1.2p/desc_dm_drp/w_2018_39/rerun/calexp-v4/postISRCCD

We don't need to keep the deepCoadd warps, e.g.:

/global/projecta/projectdirs/lsst/production/DC2_PhoSim/Run1.2p/desc_dm_drp/w_2018_39/rerun/coadd-v4/deepCoadd/g/4638/1,4

Contains

psfMatchedWarp-g-4638-1,4-159494.fits psfMatchedWarp-g-4638-1,4-193784.fits psfMatchedWarp-g-4638-1,4-254373.fits warp-g-4638-1,4-159494.fits warp-g-4638-1,4-193784.fits warp-g-4638-1,4-254373.fits
psfMatchedWarp-g-4638-1,4-183811.fits psfMatchedWarp-g-4638-1,4-193824.fits psfMatchedWarp-g-4638-1,4-399393.fits warp-g-4638-1,4-183811.fits warp-g-4638-1,4-193824.fits warp-g-4638-1,4-399393.fits
psfMatchedWarp-g-4638-1,4-185734.fits psfMatchedWarp-g-4638-1,4-254317.fits psfMatchedWarp-g-4638-1,4-400357.fits warp-g-4638-1,4-185734.fits warp-g-4638-1,4-254317.fits warp-g-4638-1,4-400357.fits
psfMatchedWarp-g-4638-1,4-185783.fits psfMatchedWarp-g-4638-1,4-254360.fits psfMatchedWarp-g-4638-1,4-400389.fits warp-g-4638-1,4-185783.fits warp-g-4638-1,4-254360.fits warp-g-4638-1,4-400389.fits
psfMatchedWarp-g-4638-1,4-193783.fits psfMatchedWarp-g-4638-1,4-254364.fits psfMatchedWarp-g-4638-1,4-449955.fits warp-g-4638-1,4-193783.fits warp-g-4638-1,4-254364.fits warp-g-4638-1,4-449955.fits

We can remove all of the psfMatchedWarp- and warp- files.

We don't need any of the deepCoadd/?/????/?,? level directories once the deepCoadd is made. The actual coadd data are stored one level higher in deepCoadd/?/????.
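
Before deleting anything, a dry run that only lists the candidates may be useful. The following Python sketch collects the warp-*.fits and psfMatchedWarp-*.fits files found in the patch-level directories (deepCoadd/filter/tract/x,y) and totals their size. It deletes nothing, and the function name `list_removable_warps` is made up for illustration. Files one level higher, such as the x,y.fits coadds, the x,y_nImage.fits files, and skyMap.pickle, are not matched by the pattern.

```python
from pathlib import Path


def list_removable_warps(deepcoadd_dir):
    """Return (files, total_bytes) for warp files in patch-level directories.

    Only files named warp-*.fits or psfMatchedWarp-*.fits that sit inside
    <deepcoadd_dir>/<filter>/<tract>/<x,y>/ are reported.
    """
    root = Path(deepcoadd_dir)
    files = []
    for pattern in ("warp-*.fits", "psfMatchedWarp-*.fits"):
        # "*/*/*,*" corresponds to filter / tract / patch directory "x,y".
        files.extend(p for p in root.glob(f"*/*/*,*/{pattern}") if p.is_file())
    files.sort()
    total_bytes = sum(p.stat().st_size for p in files)
    return files, total_bytes
```

Reviewing the returned list (and the total) against a known patch, such as the g/4638/1,4 example above, would be a reasonable check before any removal is scripted.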

@heather999
Collaborator Author

Removing the deepCoadd warps and postISRCCD slimmed down the Run1.2i DM DRP outputs to 18 TB. I now have a complete set of Run1.2i data in /global/projecta/projectdirs/lsst/production/DC2_ImSim/Run1.2i
Moving on to Run1.2p, I need to handle the object catalog, but I am unsure which version is "current": the object_catalog symlink points to v3 rather than the more recent v4. My inclination is to store only the most recent object_catalog to HPSS - is that correct?

ls -l /global/projecta/projectdirs/lsst/global/in2p3/Run1.2p
total 42244
drwxrwx---  3 desc   lsst   131072 Mar 25 18:53 forced_source_catalog
drwxrwsr-x+ 2 fabioh lsst     4096 Nov 15 05:26 logs
lrwxrwxrwx  1 desc   lsst       17 Jan 30 13:27 object_catalog -> object_catalog_v3
drwxrws---+ 2 desc   lsst   131072 Jan 16 13:01 object_catalog_v3
drwxrws---+ 3 desc   lsst   131072 Feb 20 14:57 object_catalog_v4

And should the forced_source_catalog be considered for long term storage or is that too hot off the presses for now @wmwv ?

@wmwv
Contributor

wmwv commented Mar 27, 2019

  • I'm fine with only saving object_catalog_v4 to HPSS. I would advocate for keeping object_catalog_v3 on disk at NERSC for the next 6 months(?), but we shouldn't need it long term. If some bizarre thing comes up, we can re-generate object_catalog_v3.
  • I have updated the symlink to now be object_catalog->object_catalog_v4. I was supposed to do that when gcr-catalogs v0.10 was put in production but had not yet done so. Thanks for the reminder.
  • The forced_source_catalog is still in development. Correct: don't save it yet.
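
For future version bumps, the symlink update mentioned above can be done so that readers never see a missing link. This is a minimal Python sketch, assuming a POSIX filesystem where renaming over an existing symlink is atomic; the function name `repoint_symlink` and the temporary-name scheme are made up for illustration.

```python
import os


def repoint_symlink(link_path, new_target):
    """Point an existing symlink at new_target by renaming a new link over it."""
    if os.path.lexists(link_path) and not os.path.islink(link_path):
        raise ValueError(f"{link_path} exists and is not a symlink")
    tmp_path = f"{link_path}.tmp.{os.getpid()}"
    # Create the new link under a temporary name, then rename it into place.
    os.symlink(new_target, tmp_path)
    try:
        os.replace(tmp_path, link_path)
    except OSError:
        os.unlink(tmp_path)
        raise
```

With a relative target, for example `repoint_symlink("object_catalog", "object_catalog_v4")` run from inside the Run1.2p directory, the link stays valid if the parent directory is later moved.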

@heather999
Collaborator Author

At today's DM-DC2 meeting, we planned to review the directories that are stored under calexp now that skyCorrection has been added to the processing. We could:

  • identify directories to be completely removed and discarded from future transfers from IN2P3
  • indicate some directories only need to be retained on HPSS at NERSC and can be removed from projecta/CSCRATCH.

Here is the breakdown for the y1-y2-wfd calexp rerun, which includes 2880 visits:

du -hs /sps/lsst/dataproducts/desc/DC2/Run2.1i/w_2019_19-v1/rerun/calexp-v1
/pbs/home/h/hkelly(130)>du -hc -d 1 /sps/lsst/dataproducts/desc/DC2/Run2.1i/w_2019_19-v1/rerun/calexp-v1
599G    /sps/lsst/dataproducts/desc/DC2/Run2.1i/w_2019_19-v1/rerun/calexp-v1/icSrc
256K    /sps/lsst/dataproducts/desc/DC2/Run2.1i/w_2019_19-v1/rerun/calexp-v1/schema
85T     /sps/lsst/dataproducts/desc/DC2/Run2.1i/w_2019_19-v1/rerun/calexp-v1/calexp
30G     /sps/lsst/dataproducts/desc/DC2/Run2.1i/w_2019_19-v1/rerun/calexp-v1/srcMatch
370G    /sps/lsst/dataproducts/desc/DC2/Run2.1i/w_2019_19-v1/rerun/calexp-v1/skyCorr
32K     /sps/lsst/dataproducts/desc/DC2/Run2.1i/w_2019_19-v1/rerun/calexp-v1/deepCoadd
82G     /sps/lsst/dataproducts/desc/DC2/Run2.1i/w_2019_19-v1/rerun/calexp-v1/singleFrameDriver_metadata
576K    /sps/lsst/dataproducts/desc/DC2/Run2.1i/w_2019_19-v1/rerun/calexp-v1/config
2.0T    /sps/lsst/dataproducts/desc/DC2/Run2.1i/w_2019_19-v1/rerun/calexp-v1/calexp_camera
5.0T    /sps/lsst/dataproducts/desc/DC2/Run2.1i/w_2019_19-v1/rerun/calexp-v1/src
93T     /sps/lsst/dataproducts/desc/DC2/Run2.1i/w_2019_19-v1/rerun/calexp-v1
93T     total

Pinging @jchiang87 @wmwv @rearmstr

@jchiang87
Contributor

jchiang87 commented Jun 10, 2019

The calexp_camera folder is created by the constructSky.py task. It just contains FITS files of the full focal plane but with only a single CCD rendered for each sensor visit. It's not used by downstream processing, afaik. Only the data in the skyCorr folder is used in making the coadds, so I think calexp_camera can be deleted from NERSC.

@wmwv
Contributor

wmwv commented Jun 10, 2019

The icSrc directory doesn't need to be available to users at NERSC. It can be saved to HPSS.
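
To put the two suggestions above in perspective, the sizes from the du listing can be added up. The Python sketch below converts `du -h` style strings to bytes, assuming the binary (1024-based) units that `du -h` uses by default, and computes what fraction of the rerun would leave disk if calexp_camera were deleted and icSrc moved to HPSS. The helper name `du_to_bytes` is made up for illustration, and the inputs are the rounded values from the listing, so the result is approximate.

```python
# Binary multipliers as used by `du -h` (K = 1024 bytes, and so on).
_UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def du_to_bytes(text):
    """Convert a `du -h` size such as '599G' or '2.0T' to a number of bytes."""
    text = text.strip()
    if text and text[-1] in _UNITS:
        return int(float(text[:-1]) * _UNITS[text[-1]])
    # No suffix: treat the value as plain bytes.
    return int(float(text))


# Sizes copied from the calexp-v1 listing above (rounded by du).
off_disk = {"calexp_camera": "2.0T", "icSrc": "599G"}
total = du_to_bytes("93T")

freed = sum(du_to_bytes(size) for size in off_disk.values())
fraction = freed / total
print(f"{freed / 1024**4:.2f}T of 93T ({fraction:.1%}) would leave disk")
```

This comes to roughly 2.6T, or a little under 3% of the rerun; the calexp directory itself (85T) dominates the total.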

@wmwv
Contributor

wmwv commented Jan 22, 2021

@heather999 Given your great work to organize the DC2 products, I believe we can close this. Do you agree?
