
Deep workflow #48

Open · wants to merge 4 commits into base: main
Conversation

@DinoBektesevic (Member) commented Nov 26, 2024

Implements the DEEP workflow used to analyze the single-CCD, single-night search, but it can be used to run on any set of image collections in a directory.

Additional log parsing and visualization functionality is provided.

Solution Description

Workflow

Running the Workflow is a multi-step process which includes additional preparation and cleanup work that executes at the
submit location:

  • Run prep
    • Load runtime config
    • Find all files in staging_directory that match ic_filename_pattern
    • Filter out unwanted files based on the name
  • Run KBMOD Search
  • Summarize the workflow execution
    • Parse logs, create a list of failed and successful logs
    • Create a workflow Gantt chart

The config section that gives the directory, pattern and exclusion can look like:

[workflow]
staging_directory = "collections"
ic_filename_pattern = "*collection"
skip_ccds = ["002", "031", "061"]
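The prep step (discover files in staging_directory, filter by name) can be sketched roughly as follows. The function name and the "skip if the CCD id appears in the filename" rule are illustrative assumptions, not the actual workflow code:

```python
from pathlib import Path


def find_collections(staging_directory, ic_filename_pattern, skip_ccds):
    """Find ImageCollection files, dropping CCDs we don't want to search.

    Hypothetical sketch: matches files by glob pattern, then excludes any
    file whose name contains a skipped CCD id such as "002".
    """
    candidates = sorted(Path(staging_directory).glob(ic_filename_pattern))
    return [p for p in candidates if not any(ccd in p.name for ccd in skip_ccds)]
```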

The workflow writes to the logs, resampled_wus, results and plots directories. There are two additional smaller workflows: one that runs only step2 and one that runs only the postscript. These are mostly useful for testing and for short-cutting the full workflow while debugging.

Running a KBMOD search is a 3 step process:

  • step 1 (executed on CPUs)
    • load ImageCollection
    • filter unwanted rows of data from it
      • for Rubin processed DEEP this should exclude all wcs_err > 1e-04
      • for Rubin processed DEEP probably should exclude zeropoint > 27 or similar
      • this should probably reflect back onto the original ImageCollection, but it doesn't
      • so they're not implemented atm, because it causes confusion down the line
    • load SearchConfiguration
    • update search config values based on the IC metadata
      • currently the n_obs is the biggest discriminator for the data that is kept
      • we set the n_obs to len(ic)//2
      • once the len(ic) gets below 50-60, this seems to be too strict of a requirement
      • then the n_obs is pinned to 15
    • materialize a WorkUnit
      • requires the Rubin Data Butler
    • resample a WorkUnit
      • targets the largest common footprint WCS
    • writes the WorkUnit to file
  • step 2 (executed on GPUs)
    • loads the WorkUnit
    • runs KBMOD search
    • adds relevant metadata to the Results Table
      • this is not a generalized solution
      • wcs, mjd_mid and uuid are guaranteed
      • visit and detector don't exist for every ImageCollection
    • writes Results to file
  • step 3 (executed on CPUs)
    • loads Results file
    • makes an analysis plot
    • this requires external resources, catalogs of fake and known objects
    • Known objects are SkyBot query results - so the plotting code is structured to work with that
    • this is not general code, though, because user-made catalogs have no agreed-upon standard
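The n_obs heuristic from step 1 above can be written out as a small function. This is a sketch under stated assumptions: the description says n_obs is pinned to 15 once len(ic) drops below 50-60, so a cutoff of 55 is assumed here as the midpoint of that range; the function name is illustrative:

```python
def choose_n_obs(ic_len, pin_threshold=55, pinned_value=15):
    """Pick n_obs for the search config from the ImageCollection length.

    Uses half the collection length, but pins n_obs to 15 once the
    collection becomes small (below the assumed 55-image threshold,
    standing in for the 50-60 range described above).
    """
    if ic_len < pin_threshold:
        return pinned_value
    return ic_len // 2
```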

The config section for the step1 and postscript external resources can look like:

[apps.step1]
search_config_filepath = "search_config.yaml"
butler_config_filepath = "/repo/root/butler.yaml"
n_workers = 8

[apps.postscript]
fake_object_catalog = "resources/fakes_detections_joined.minified.fits.bz2"
known_object_catalog = "resources/skybot_results_joined.minified.fits.bz2"

Logging

Logs can be quite complicated.

A single log file can contain several starting and ending messages because jobs can be resubmitted. Jobs are often preempted, which means their logs will have several initial messages and perhaps only one ending message.
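Because of resubmission, a parser first has to split one log file into individual attempts. A minimal sketch, assuming a hypothetical start_marker string (real logs use whatever initial message the workflow actually emits):

```python
def split_attempts(log_text, start_marker="Workflow started"):
    """Split a resubmitted job's log into per-attempt chunks of lines.

    Every occurrence of start_marker (an assumed sentinel, not the real
    log format) begins a new attempt; trailing lines belong to the last
    attempt, which may lack an ending message if it was preempted.
    """
    attempts, current = [], []
    for line in log_text.splitlines():
        if start_marker in line and current:
            attempts.append(current)
            current = []
        current.append(line)
    if current:
        attempts.append(current)
    return attempts
```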

A single task will output several log files - one per step.

Each Log can consist of multiple log files. For example the complete log of a Task 20210101_A can consist of multiple
steps called 20210101_A_step1, 20210101_A_step2 etc. thus making the full Log of the event a union of every step.
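Taking the union of per-step files into one Log per task can be sketched like this. The filename pattern is an assumption extrapolated from the 20210101_A_step1 example above; the real naming scheme may differ:

```python
import re
from collections import defaultdict


def group_logs(filenames):
    """Group per-step log files into one entry per task.

    Assumes names like '20210101_A_step1.log' (a hypothetical pattern
    based on the example task names); files that don't match are ignored.
    """
    pattern = re.compile(r"^(?P<task>.+)_step\d+\.log$")
    tasks = defaultdict(list)
    for name in filenames:
        m = pattern.match(name)
        if m:
            tasks[m.group("task")].append(name)
    return dict(tasks)
```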

Individual steps are parsed, extracting individual log entries and context; assigning to each log step a success, error and completion status:

  • a log is successful if all individual steps were successful.
  • a step is not completed if the log does not contain the expected final log entry.
  • a step is marked with an error if the log contains a traceback or an error report. A not-completed log step does not by itself indicate an error.
  • a step that is marked not-completed, but not-errored, may be safely re-run
  • a log is a failure if all steps were not completed or any of the steps has an error. A log may be safely re-run if no steps have an error, but not all steps have completed or not all steps have run (happens if the workflow times out and there were multiple retries).
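The status rules above can be expressed as a small classifier. The dict-based step records and the function name are illustrative; the real Log class presumably carries this state differently:

```python
def classify_log(steps):
    """Classify a parsed log from its per-step statuses.

    steps: list of dicts with boolean 'completed' and 'errored' flags
    (an assumed representation). Implements the rules above: success if
    all steps completed without error; failure if any step errored or no
    step completed at all; safe to re-run if nothing errored but some
    steps did not complete.
    """
    any_error = any(s["errored"] for s in steps)
    all_completed = bool(steps) and all(s["completed"] for s in steps)
    return {
        "success": all_completed and not any_error,
        "failure": any_error or not any(s["completed"] for s in steps),
        "safe_to_rerun": not any_error and not all_completed,
    }
```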

Some functionality is pretty general - parse_logfile, the Log class and its fmt string - but contextualization depends heavily on the log content. This means much of the plotting (colors etc.) is affected too. To change the default contextualization of the parsed logs, override the _parse_single method.

Code Quality

The code is still rough, but at least it has documentation now and isn't duplicated across tasks.

There will be some wrestling with the default values in the repo right now, but that is at least approachable compared to the draft PR.

@wilsonbb (Collaborator) left a comment:

Left several comments about cleaning up code that was commented out or adding some clarifications, but mainly approving.

import time


#"ckpt_2gb_2cpus", "ckpt_2gb_2cpus", "astro_2gb_2cpus"]),

Nit: we can remove this

import time


# "esci_48_8cpus" "astro_48_8cpus"

With comments like these, I would either remove them or just write out a quick one line comment about the executors

ic = ImageCollection.read(ic_file)

### Mask out images that we don't want or can not search through
# mask out poor weather images

Would probably fully remove these. Can add a comment explaining why we no longer do this if needed.

#mask_zp = np.logical_and(ic["zeroPoint"] > 27 , ic["zeroPoint"] < 32)
#ic = ic[np.logical_and(mask_zp, mask_wcs_err)]

# mask out images with WCS error more than 0.1 arcseconds because we

Since we no longer apply all of the masking, do we want to remove this whole section?

newx = result["x"]+dt*result["vx"]
newy = result["y"]+dt*result["vy"]
coord = wcs.pixel_to_world(newx, newy)
#pos.append(list(zip(coord.ra.deg, coord.dec.deg)))

Did this ever get fixed? If not, and we still need it, maybe replace it with a longer comment about what was wrong.

#pos.append(list(zip(coord.ra.deg, coord.dec.deg)))
#pos_valid.append(obs_valid) # NOTE TO SELF: FIX LATER

return coord, obs_valid #SkyCoord(pos), pos_valid

update return value.

@drewoldag (Collaborator) left a comment:

This all looks ok to me. I'm not seeing anything overly frightening in here.
