Dataset Group #56

Open
gordonwatts opened this issue Jan 4, 2023 · 6 comments

@gordonwatts
Member

During the last AGC challenge, the idea of a dataset group came up:

  • Perhaps a “ServiceXDataSetGroup”-like structure that wraps multiple datasets

We should explore this, taking into account prior work done by @kyungeonchoi on his project.

@gordonwatts added the enhancement label (New feature or request) on Jan 4, 2023
@alexander-held
Member

@ekauffma found out that we can use a single ServiceXDataset and feed it multiple types of files (e.g. from multiple datasets), then sort the results afterwards using the parent information. That allows for running multiple transforms at once, provided they all use the same query. The downside of this approach is that everything appears as a single transform on the dashboard.
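
To make the idea concrete, here is a minimal sketch of the bookkeeping involved, in plain Python with no ServiceX calls. The function names (flatten_datasets, regroup), the dataset names, and the example.org URLs are all hypothetical. It also assumes that each result is reported under the input path with "/" replaced by ":", which is what the example later in this thread shows for one file; that may not hold in general.

```python
from collections import defaultdict


def flatten_datasets(datasets):
    """Flatten {dataset_name: [file_url, ...]} into one file list.

    Also returns a lookup from the expected result name to the parent
    dataset name, so results can be sorted again afterwards.
    """
    all_files = []
    parent_of = {}
    for name, files in datasets.items():
        for url in files:
            all_files.append(url)
            # Assumption: results are named like the input with "/" -> ":".
            parent_of[url.replace("/", ":")] = name
    return all_files, parent_of


def regroup(result_names, parent_of):
    """Sort a flat list of result names back into their parent datasets."""
    grouped = defaultdict(list)
    for result in result_names:
        grouped[parent_of[result]].append(result)
    return dict(grouped)


# Hypothetical inputs: two datasets that share the same query.
datasets = {
    "ttbar": [
        "http://example.org//store/ttbar/a.root",
        "http://example.org//store/ttbar/b.root",
    ],
    "wjets": ["http://example.org//store/wjets/c.root"],
}

all_files, parent_of = flatten_datasets(datasets)

# Stand-in for the names that one transform over all_files would report,
# deliberately in a different order than the inputs.
result_names = [f.replace("/", ":") for f in reversed(all_files)]

grouped = regroup(result_names, parent_of)
print(grouped)
```

The single flat list is what would be handed to one ServiceXDataset; the lookup is what recovers the per-dataset split that the dashboard no longer shows.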

@alexander-held
Member

A related prototype (which does not fully generalize) is in iris-hep/analysis-grand-challenge#107.

@alexander-held
Member

Getting back to this: the workaround in iris-hep/analysis-grand-challenge#107 relies on a : -> / substitution to be able to use the .file property of the objects returned by get_data_rootfiles_uri. Would it perhaps make sense to turn the : version into an internal implementation detail and expose a public .file that matches the input file path exactly? That would simplify this workaround and probably be easier to use more generally. At the moment I cannot think of a reason why a user would need the : version.

@alexander-held
Member

Hi, I was just trying to find this issue again and noticed that it might be in the wrong repository. ServiceXDataset comes from servicex, so this should live in https://github.com/ssl-hep/ServiceX_frontend presumably?

@alexander-held
Member

alexander-held commented Mar 6, 2023

Example of the .file string replacement:

from servicex import ServiceXDataset
from func_adl_servicex import ServiceXSourceUpROOT

# a single input file, addressed by its full URL
dataset_opendata = "http://xrootd-local.unl.edu:1094//store/user/AGC/datasets/RunIIFall15MiniAODv2/"\
    "TT_TuneCUETP8M1_13TeV-powheg-pythia8/MINIAODSIM//PU25nsData2015v1_76X_mcRun2_asymptotic_v12_"\
    "ext3-v1/00000/00DF0A73-17C2-E511-B086-E41D2D08DE30.root"
sx_dataset = ServiceXDataset(dataset_opendata, backend_name='uproot', ignore_cache=False)
ds = ServiceXSourceUpROOT(sx_dataset, "events")

# dummy source, used only to build the query as a qastle string
dummy_ds = ServiceXSourceUpROOT("cernopendata://dummy", "events", backend_name="uproot")
dummy_ds.return_qastle = True
jet_pt_query = dummy_ds.Select(lambda event: event.jet_pt).value()

# run the query against the real dataset
res = sx_dataset.get_data_rootfiles_uri(jet_pt_query, as_signed_url=True)

# compare the reported .file with the input path after replacing "/" by ":"
print(f"output .file      {res[0].file}")
print(f"input with / -> : {dataset_opendata.replace('/', ':')}")

output:

output .file      http:::xrootd-local.unl.edu:1094::store:user:AGC:datasets:RunIIFall15MiniAODv2:TT_TuneCUETP8M1_13TeV-powheg-pythia8:MINIAODSIM::PU25nsData2015v1_76X_mcRun2_asymptotic_v12_ext3-v1:00000:00DF0A73-17C2-E511-B086-E41D2D08DE30.root
input with / -> : http:::xrootd-local.unl.edu:1094::store:user:AGC:datasets:RunIIFall15MiniAODv2:TT_TuneCUETP8M1_13TeV-powheg-pythia8:MINIAODSIM::PU25nsData2015v1_76X_mcRun2_asymptotic_v12_ext3-v1:00000:00DF0A73-17C2-E511-B086-E41D2D08DE30.root
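
One consequence of the output above is worth spelling out: the "/" -> ":" mapping cannot be undone by a plain string replacement, because the input URL already contains ":" characters (after the scheme and before the port). The sketch below illustrates this with a shortened, hypothetical URL and shows a lookup table built in the forward direction as one possible way to recover the exact input path. The helper names to_result_name and build_lookup are made up for illustration and are not part of servicex.

```python
def to_result_name(input_path):
    # Assumption taken from the output above: "/" becomes ":" in .file.
    return input_path.replace("/", ":")


def build_lookup(input_paths):
    """Map each expected result name back to its exact input path."""
    lookup = {}
    for path in input_paths:
        key = to_result_name(path)
        if key in lookup and lookup[key] != path:
            # Two different inputs would be indistinguishable afterwards.
            raise ValueError(f"ambiguous result name: {key}")
        lookup[key] = path
    return lookup


# Hypothetical, shortened stand-in for the real file URL.
input_path = "http://example.org:1094//store/file.root"
result_name = to_result_name(input_path)

# Naive inverse: also rewrites the ":" after "http" and before the port.
naive_inverse = result_name.replace(":", "/")

lookup = build_lookup([input_path])

print(result_name)
print(naive_inverse)
print(lookup[result_name])
```

A public .file that matches the input path exactly, as proposed earlier in this thread, would make this kind of lookup table unnecessary.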

Related to the above: is there a way to work around the need for a dummy_ds? This feels inconvenient.

@alexander-held
Member

Small update here: we are going ahead with our workaround for AGC purposes, but I think it should be upstreamed. The loss of more detailed information in the dashboard (since now everything shows up as one big transform) is certainly inconvenient, and an ideal solution should probably be able to preserve the splitting there. I think it would be good to discuss this UX aspect at the AGC workshop.
