
Feature request: find files on disk #40

Open
kpedro88 opened this issue Jan 21, 2025 · 15 comments

Comments

@kpedro88

PPD recently deleted large fractions of existing premixed pileup input samples in order to make room for the new (2024) premixed pileup input sample. This means that DAS queries for the lists of files in those samples return many files that are now only on tape, and jobs that try to read those files will obviously fail.

For users, I recently implemented a simple script to get a list of files on disk using Rucio: https://github.com/FNALLPC/lpc-scripts/blob/master/get_files_on_disk.py. This is probably not the optimal set of Rucio queries, but I had a snippet from years ago and just wrapped it up in a nicer, more generalized script.
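
For reference, a minimal sketch (my addition, not the linked script itself) of the kind of Rucio query involved, assuming a configured Rucio client and that the rse_type=DISK RSE-expression filter applies in CMS Rucio:

from rucio.client import Client

def files_on_disk(dataset, scope="cms"):
    client = Client()
    dids = [{"scope": scope, "name": dataset}]
    # restrict the replica listing to disk RSEs; files without a disk
    # replica come back with an empty 'rses' mapping (if at all)
    for rep in client.list_replicas(dids, rse_expression="rse_type=DISK"):
        if rep.get("rses"):
            yield rep["name"]

for lfn in files_on_disk("/a/b/c"):
    print(lfn)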

Is it possible to add a custom dasgoclient query that only finds files on disk? The information should all be there (since Rucio is one of the sources); it's just a matter of extracting it in a suitable way. This would help both production operations and users.

@belforte
Member

Could PPD rather create new Rucio containers with the list of kept files (hopefully they did the deletion by blocks, btw)? Then users can do rucio list-files cms:/a/b/c --csv.
Sort of... let's use the DM tool for data bookkeeping.

@kpedro88
Author

That would be a nice solution using Rucio. I am not sure that it fully addresses all the problems, because cmsDriver commands still use DAS queries to find files. Maybe we need to upgrade that functionality to use Rucio directly... @smuzaffar ?

@belforte
Member

We could also make dasgoclient understand scope:container, but honestly I am not optimistic about CMS's capacity to modify the dasgoclient code.

@DickyChant

Hi, deciding which files are kept on disk is not within PPD's scope; we specify the fractions and DM implements the exact details.

I am in favor of adding whatever functionality allows querying what is on disk.

@todor-ivanov

Hi @belforte @kpedro88

I am not sure this is about the capability to change the dasgoclient code, but rather about the separation of the data-moving pieces of the system from the data-bookkeeping ones. As far as I am aware, the storage type on which the data currently resides is not information extractable from the DBS database, and (@vkuznet correct me if I am wrong) dasgoclient is not making any queries to Rucio right now. I am not sure we'd want to go down that road. If we do... many things may indeed be possible.

@kpedro88
Author

I made this request for dasgoclient because that is currently what CMSSW uses to get file lists.

You can search this codebase, or look at the help/examples from the executable, and see that rucio is included as a source.

@belforte
Member

@todor-ivanov just try dasgoclient --query 'site dataset=<yourfavoritedataset>' to verify that it takes location information from Rucio.

@vkuznet
Collaborator

vkuznet commented Jan 21, 2025

DAS supports queries to Rucio, as it did for PhEDEx, and it aggregates results from both DBS and Rucio. Therefore, this request is valid and can be implemented in DAS. But as I replied to Kevin in a private email, my time is no longer allocated to DAS development.

To make appropriate changes the following steps should be done:

  • locate the Rucio APIs in the das2go codebase
  • add an appropriate option to DAS QL to support a scope or ondisk flag
  • adjust the DAS parser to parse these new options and put them into the DAS spec used by DAS
  • adjust the calls to the Rucio APIs to use the new options
  • adjust the DAS maps to support the new options
  • test the code on the web interface; if it works, dasgoclient will naturally support it

The DAS query syntax would be modified as follows:

# look-up all files for a given dataset (query goes to DBS and Rucio)
file dataset=/a/b/c

# look-up ondisk files for a given dataset (query will go to Rucio)
file dataset=/a/b/c ondisk

# look-up files from a given rucio scope for a given dataset (query will go to Rucio)
file dataset=/a/b/c scope=production

So the amount of work is not negligible, as it requires changes to DAS QL, DAS APIs, and DAS maps, but it is doable.

@belforte
Member

Thanks @DickyChant.
Could DM then explain how they do this? Who is the contact for this? Hopefully they can create those Rucio containers quite easily.

IMHO the simplest solution is:

  • DM creates Rucio containers with DBS-blocks/Rucio-datasets on disk
  • add rucio:<dataset> to the possible cmsdriver inputs, in parallel to dbs:<dataset> and file:... (IIUC)
  • inside cmsdriver, use the Rucio CLI, not dasgoclient, to resolve that. Forking a subprocess which starts with scram unsetenv ; source /cvmfs/../Rucio/... will work for any CMSSW release and may be good for future use cases as well, e.g. allowing users to provide their own container by extending the syntax to accept rucio:scope:name, while rucio:name would be handled as the cms:name DID (see the sketch after this list)
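
A hedged sketch of that third bullet in Python (my addition, not Stefano's recipe; the /cvmfs setup path and the CSV column layout are assumptions):

import subprocess

def rucio_list_files(name, scope="cms"):
    # drop the CMSSW environment, load a Rucio client, then call the CLI;
    # the /cvmfs setup path below is an assumption, adjust for your site
    cmd = (
        "eval `scram unsetenv -sh`; "
        "source /cvmfs/cms.cern.ch/rucio/setup-py3.sh; "
        "rucio list-files --csv {}:{}".format(scope, name)
    )
    out = subprocess.check_output(cmd, shell=True, text=True)
    # assumption: the first CSV column is the scope:name DID of each file
    return [line.split(",")[0].split(":", 1)[1]
            for line in out.splitlines() if line.strip()]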

That said, none of this is in my scope.

@todor-ivanov

Hi @belforte @kpedro88

Just to clarify my comment here: #40 (comment).

Since this issue was created in the dasgoclient repository, I was referring to the fact that the client itself does not issue its own queries to Rucio. It is DAS that knows about Rucio's existence; dasgoclient is just an interface to it. Thanks @vkuznet for elaborating and giving the full details.

So I think this issue would be better moved to the DAS repository: https://github.com/dmwm/das2go, because I suspect the information could indeed already be accessible through the DAS link to Rucio. We just need to cross-check whether we make the proper queries to Rucio and whether we convert them to the proper fields in the structure returned by DAS: https://github.com/dmwm/das2go/blob/master/services/rucio.go

@todor-ivanov

BTW, @kpedro88

A clever approach (pointed out to me by @mapellidario in a private chat) could actually help identify the location of every file.
What needs to be done on the dasgoclient side is to fetch the data in JSON format, for the queries pointed out by you and @belforte: [1]. Then just increase the granularity of the query from dataset to file level, and set the information source to Rucio: [2].

As seen in the query output, the storage type is already provided. Of course this would require external output aggregation (see the sketch after example [2] below), but at least the information is already retrievable. Wouldn't that suffice?

BTW, thanks @mapellidario for the clever suggestion!

[1]

$ dasgoclient -query="site dataset=/EGamma-Error/Run2018B-v1/RAW" -json  |json_pp
[
   {
      "das" : {
         "expire" : 1737544307,
         "instance" : "prod/global",
         "primary_key" : "site.name",
         "record" : 1,
         "services" : [
            "combined:site4dataset_pct"
         ]
      },
      "qhash" : "54a22ce52df7c02b8a107ce4f956b54b",
      "site" : [
         {
            "block_completion" : "100.00%",
            "block_fraction" : "100.00%",
            "dataset_fraction" : " 0.00%",
            "kind" : "TAPE",
            "name" : "T0_CH_CERN_Tape",
            "nblocks" : 1,
            "nfiles" : 105,
            "replica_fraction" : "100.00%",
            "se" : "T0_CH_CERN_Tape",
            "total_blocks" : 1,
            "total_files" : 105
         }
      ]
   }
]

[2]

$ dasgoclient -query="file dataset=/EGamma-Error/Run2018B-v1/RAW system=rucio" -json  | json_pp

[
   {
      "das" : {
         "expire" : 1737544468,
         "instance" : "prod/global",
         "primary_key" : "file.name",
         "record" : 1,
         "services" : [
            "rucio:file4dataset"
         ]
      },
      "file" : [
         {
            "adler32" : "2c38c381",
            "bytes" : 27856682045,
            "md5" : null,
            "name" : "/store/data/Run2018B/EGamma/RAW/v1/000/317/434/00000/007970C7-4C68-E811-B1BA-FA163EC648D9.root",
            "pfns" : {
               "davs://eosctacms.cern.ch:8444//eos/ctacms/archive/cms/store/data/Run2018B/EGamma/RAW/v1/000/317/434/00000/007970C7-4C68-E811-B1BA-FA163EC648D9.root" : {
                  "client_extract" : false,
                  "domain" : "wan",
                  "priority" : 1,
                  "rse" : "T0_CH_CERN_Tape",
                  "rse_id" : "f44c866a264d4da9972969e9f3b5bb52",
                  "type" : "TAPE",
                  "volatile" : false
               }
            },
            "replicas" : [
               {
                  "name" : "T0_CH_CERN_Tape",
                  "state" : "AVAILABLE"
               }
            ],
            "rses" : {
               "T0_CH_CERN_Tape" : [
                  "davs://eosctacms.cern.ch:8444//eos/ctacms/archive/cms/store/data/Run2018B/EGamma/RAW/v1/000/317/434/00000/007970C7-4C68-E811-B1BA-FA163EC648D9.root"
               ]
            },
            "scope" : "cms",
            "size" : 27856682045,
            "states" : {
               "T0_CH_CERN_Tape" : "AVAILABLE"
            }
         }
      ],
      "qhash" : "f9a452a103dfd5701483f78e4f2d8bd2"
   },
   {
      "das" : {
         "expire" : 1737544468,
         "instance" : "prod/global",
         "primary_key" : "file.name",
         "record" : 1,
         "services" : [
            "rucio:file4dataset"
         ]
      },
      "file" : [
         {
            "adler32" : "0e6a88f2",
            "bytes" : 27964441528,
            "md5" : null,
            "name" : "/store/data/Run2018B/EGamma/RAW/v1/000/317/434/00000/027AE8A2-5068-E811-90AA-FA163E160041.root",
            "pfns" : {
               "davs://eosctacms.cern.ch:8444//eos/ctacms/archive/cms/store/data/Run2018B/EGamma/RAW/v1/000/317/434/00000/027AE8A2-5068-E811-90AA-FA163E160041.root" : {
                  "client_extract" : false,
                  "domain" : "wan",
                  "priority" : 1,
                  "rse" : "T0_CH_CERN_Tape",
                  "rse_id" : "f44c866a264d4da9972969e9f3b5bb52",
                  "type" : "TAPE",
                  "volatile" : false
               }
            },
            "replicas" : [
               {
                  "name" : "T0_CH_CERN_Tape",
                  "state" : "AVAILABLE"
               }
            ],
            "rses" : {
               "T0_CH_CERN_Tape" : [
                  "davs://eosctacms.cern.ch:8444//eos/ctacms/archive/cms/store/data/Run2018B/EGamma/RAW/v1/000/317/434/00000/027AE8A2-5068-E811-90AA-FA163E160041.root"
               ]
            },
            "scope" : "cms",
            "size" : 27964441528,
            "states" : {
               "T0_CH_CERN_Tape" : "AVAILABLE"
            }
         }
      ]
...
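
A minimal aggregation sketch over the output of query [2] (my addition, not part of the original discussion): keep only files with at least one DISK replica. The field names follow the records shown above.

#!/usr/bin/env python3
# list the files of a dataset that have at least one DISK replica,
# by aggregating 'dasgoclient ... -json' output as in example [2]
import json
import subprocess
import sys

dataset = sys.argv[1]  # e.g. /EGamma-Error/Run2018B-v1/RAW
query = "file dataset={} system=rucio".format(dataset)
records = json.loads(
    subprocess.check_output(["dasgoclient", "-query={}".format(query), "-json"])
)
for record in records:
    for f in record.get("file", []):
        # every PFN entry carries a "type" field ("DISK" or "TAPE")
        if any(p.get("type") == "DISK" for p in f.get("pfns", {}).values()):
            print(f["name"])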

@belforte
Member

Thanks @todor-ivanov and @mapellidario. Hopefully it will be enough to check block locations, not file-by-file. In any case, changes to cmsdriver will be needed, so the maintainer of that tool can decide whether to use dasgoclient and parse the JSON, or push for DM to create new container names and use the Rucio CLI.

@todor-ivanov

todor-ivanov commented Jan 22, 2025

Hi @belforte

Hopefully it will be enough to check block locations, not file-by-file.

Unfortunately, it won't work at block-level granularity [1]. Probably because Rucio does not know about our concept of blocks, or maybe because the Go code does not make the full translation between the CMS and Rucio data abstractions (blocks vs. datasets) for this query; I have not investigated for the moment. In case I get a request for more complex filtering and aggregation on the client side, I'll have to dive deeper into the code anyway.

[1]

$ dasgoclient -query="block dataset=/EGamma-Error/Run2018B-v1/RAW system=rucio" -json  | json_pp 
[
   {
      "block" : [
         {
            "adler32" : null,
            "bytes" : null,
            "md5" : null,
            "name" : "/EGamma-Error/Run2018B-v1/RAW#ffa687a0-9f5e-4f3a-b4e6-6b168e371ecd",
            "scope" : "cms",
            "size" : null,
            "type" : "DATASET"
         }
      ],
      "das" : {
         "expire" : 1737548948,
         "instance" : "prod/global",
         "primary_key" : "block.name",
         "record" : 1,
         "services" : [
            "rucio:block4dataset"
         ]
      },
      "qhash" : "a45ae2fd17a484fc1609c46e24d2704f"
   }
]

@mapellidario
Member

Uh, thanks Stefano for the suggestion! After playing a bit with dasgoclient, I realized that removing the -json argument gives only what we care about [1] and matches the result from the DAS web GUI. The details for the site4block query are here.

Just for completeness, the list of blocks for a given dataset can be obtained with [2].


[1]

> dasgoclient -query="site block=/ParkingBPH5/Run2018D-UL2018_MiniAODv2-v1/MINIAOD#ff8cf2f9-8fd2-413b-bfa6-09f0628d3c23 system=rucio"
T0_CH_CERN_Tape
T2_CH_CERN
T2_US_Florida

[2]

dasgoclient -query="block dataset=/ParkingBPH5/Run2018D-UL2018_MiniAODv2-v1/MINIAOD"
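
A hedged sketch combining queries [2] and [1] to keep only blocks with a non-tape replica (my addition; treating a "_Tape" suffix as marking tape RSEs is an assumption based on the site names shown in [1]):

import subprocess

def das(query):
    # run dasgoclient and return its whitespace-separated output tokens
    out = subprocess.check_output(
        ["dasgoclient", "-query={}".format(query)], text=True)
    return out.split()

dataset = "/ParkingBPH5/Run2018D-UL2018_MiniAODv2-v1/MINIAOD"
for block in das("block dataset={}".format(dataset)):          # query [2]
    sites = das("site block={} system=rucio".format(block))    # query [1]
    # assumption: tape RSEs end in "_Tape", as in the output above
    if any(not s.endswith("_Tape") for s in sites):
        print(block)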

@belforte
Member

No matter what, the core issue is that cmsdriver assumes that the pileup dataset is fully on disk, so it needs to be modified. As long as it accepts only DBS datasets, they cannot overlap, so new dataset names (as containers in Rucio) will not help, although they look like a good thing to me.

My 2 cents: changing cmsdriver to work with current tools "the way they are now" is faster than changing it to use yet-to-be-developed dasgoclient functionality of the sort file dataset=... rsetype=disk (a possibly expensive query, IMHO).
