
Feature request: find files on disk #40

Open
kpedro88 opened this issue Jan 21, 2025 · 15 comments

Comments

@kpedro88

PPD recently deleted large fractions of existing premixed pileup input samples in order to make room for the new (2024) premixed pileup input sample. This means that DAS queries for the lists of files in those samples return many files that are now only on tape, and jobs that try to read those files will obviously fail.

For users, I recently implemented a simple script to get a list of files on disk using Rucio: https://github.com/FNALLPC/lpc-scripts/blob/master/get_files_on_disk.py. This is probably not the optimal set of Rucio queries, but I had a snippet from years ago and just wrapped it up in a nicer, more generalized script.
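
For reference, a minimal sketch (my addition, not the linked script itself) of the kind of Rucio query involved, assuming a configured Rucio client and that the rse_type=DISK RSE-expression filter applies in CMS Rucio:

from rucio.client import Client

def files_on_disk(dataset, scope="cms"):
    client = Client()
    dids = [{"scope": scope, "name": dataset}]
    # restrict the replica listing to disk RSEs; files without a disk
    # replica come back with an empty 'rses' mapping (if at all)
    for rep in client.list_replicas(dids, rse_expression="rse_type=DISK"):
        if rep.get("rses"):
            yield rep["name"]

for lfn in files_on_disk("/a/b/c"):
    print(lfn)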

Is it possible to add a custom dasgoclient query that only finds files on disk? The information should all be there (since Rucio is one of the sources); it's just a matter of extracting it in a suitable way. This would help both production operations and users.

@belforte
Member

Could PPD rather create new Rucio containers with the list of kept files (hopefully they did the deletion by blocks, btw)? Then users can do rucio list-files cms:/a/b/c --csv.
Sort of... let's use the DM tool for data bookkeeping.

@kpedro88
Author

That would be a nice solution using Rucio. I am not sure that it fully addresses all the problems, because cmsDriver commands still use DAS queries to find files. Maybe we need to upgrade that functionality to use Rucio directly... @smuzaffar ?

@belforte
Member

We could also make dasgoclient understand scope:container, but honestly I am not optimistic about CMS's capacity to modify the dasgoclient code.

@DickyChant

Hi, deciding which files are kept on disk is not within PPD's scope; we specify the fractions and DM implements the exact details.

I am in favor of adding whatever functionality allows querying what is on disk.

@todor-ivanov

Hi @belforte @kpedro88

I am not sure this is about the capability to change the dasgoclient code, but rather about the separation of the data-moving pieces of the system from the data-bookkeeping ones. As far as I am aware, the storage type on which the data currently resides is not information extractable from the DBS database, and (@vkuznet correct me if I am wrong) dasgoclient is not making any queries to Rucio right now. I am not sure we'd want to go down that road. If we do... many things may indeed be possible.

@kpedro88
Author

I made this request for dasgoclient because that is currently what CMSSW uses to get file lists.

You can search this codebase, or look at the help/examples from the executable, and see that rucio is included as a source.

@belforte
Member

@todor-ivanov just try dasgoclient --query 'site dataset=<yourfavoritedataset>' to verify that it takes location information from Rucio.

@vkuznet
Collaborator

vkuznet commented Jan 21, 2025

DAS supports queries to Rucio, as it did for PhEDEx, and it aggregates results from both DBS and Rucio. Therefore, this request is valid and can be implemented in DAS. But as I replied to Kevin in a private email, my time is no longer allocated to DAS development.

To make appropriate changes the following steps should be done:

  • locate the Rucio APIs in the das2go codebase
  • add an appropriate option to DAS QL to support a scope or ondisk flag
  • adjust the DAS parser to parse these new options and put them into the DAS spec used by DAS
  • adjust the calls to the Rucio APIs to use the new options
  • adjust the DAS maps to support the new options
  • test the code on the web interface; if it works, dasgoclient will naturally support it

The DAS query syntax would be modified as follows:

# look-up all files for a given dataset (query goes to DBS and Rucio)
file dataset=/a/b/c

# look-up ondisk files for a given dataset (query will go to Rucio)
file dataset=/a/b/c ondisk

# look-up files from a given rucio scope for a given dataset (query will go to Rucio)
file dataset=/a/b/c scope=production

So the amount of work is not negligible, as it requires changes to DAS QL, DAS APIs, and DAS maps, but it is doable.

@belforte
Member

Thanks @DickyChant.
Could DM then explain how they do this? Who is the contact for this? Hopefully they can create those Rucio containers quite easily.

IMHO the simplest solution is:

  • DM creates Rucio containers with DBS-blocks/Rucio-datasets on disk
  • add rucio:<dataset> to the possible cmsdriver inputs, in parallel to dbs:<dataset> and file:... (IIUC)
  • inside cmsdriver, use the Rucio CLI, not dasgoclient, to resolve that. Forking a subprocess which starts with scram unsetenv ; source /cvmfs/../Rucio/... will work for any CMSSW release and may be good for future use cases as well, e.g. allowing users to provide their own container by extending the syntax to accept rucio:scope:name, while rucio:name would be handled as the cms:name DID (see the sketch after this list)
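
A hedged sketch of that third bullet in Python (my addition, not Stefano's recipe; the /cvmfs setup path and the CSV column layout are assumptions):

import subprocess

def rucio_list_files(name, scope="cms"):
    # drop the CMSSW environment, load a Rucio client, then call the CLI;
    # the /cvmfs setup path below is an assumption, adjust for your site
    cmd = (
        "eval `scram unsetenv -sh`; "
        "source /cvmfs/cms.cern.ch/rucio/setup-py3.sh; "
        "rucio list-files --csv {}:{}".format(scope, name)
    )
    out = subprocess.check_output(cmd, shell=True, text=True)
    # assumption: the first CSV column is the scope:name DID of each file
    return [line.split(",")[0].split(":", 1)[1]
            for line in out.splitlines() if line.strip()]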

That said, none of this is in my scope.

@todor-ivanov

Hi @belforte @kpedro88

Just to clarify my comment here: #40 (comment).

Since this issue was created in the dasgoclient repository, I was referring to the fact that the client itself does not issue its own queries to Rucio. It is DAS that knows about Rucio's existence; dasgoclient is just an interface to it. Thanks @vkuznet for elaborating and giving the full details.

So I think this issue would be better moved to the DAS repository: https://github.com/dmwm/das2go, because I suspect the information could indeed already be accessible through the DAS link to Rucio. We just need to cross-check whether we make the proper queries to Rucio and whether we convert them to the proper fields in the structure returned by DAS: https://github.com/dmwm/das2go/blob/master/services/rucio.go

@todor-ivanov

BTW, @kpedro88

A clever approach (pointed out to me by @mapellidario in a private chat) could actually help identify the location of every file.
What needs to be done on the dasgoclient side is to fetch the data in JSON format, for the queries pointed out by you and @belforte: [1]. Then just increase the granularity of the query from dataset to file level, and set the information source to Rucio: [2].

As seen in the query output, the storage type is already provided. Of course this would require external output aggregation (see the sketch after example [2] below), but at least the information is already retrievable. Wouldn't that suffice?

BTW, thanks @mapellidario for the clever suggestion!

[1]

$ dasgoclient -query="site dataset=/EGamma-Error/Run2018B-v1/RAW" -json  |json_pp
[
   {
      "das" : {
         "expire" : 1737544307,
         "instance" : "prod/global",
         "primary_key" : "site.name",
         "record" : 1,
         "services" : [
            "combined:site4dataset_pct"
         ]
      },
      "qhash" : "54a22ce52df7c02b8a107ce4f956b54b",
      "site" : [
         {
            "block_completion" : "100.00%",
            "block_fraction" : "100.00%",
            "dataset_fraction" : " 0.00%",
            "kind" : "TAPE",
            "name" : "T0_CH_CERN_Tape",
            "nblocks" : 1,
            "nfiles" : 105,
            "replica_fraction" : "100.00%",
            "se" : "T0_CH_CERN_Tape",
            "total_blocks" : 1,
            "total_files" : 105
         }
      ]
   }
]

[2]

$ dasgoclient -query="file dataset=/EGamma-Error/Run2018B-v1/RAW system=rucio" -json  | json_pp

[
   {
      "das" : {
         "expire" : 1737544468,
         "instance" : "prod/global",
         "primary_key" : "file.name",
         "record" : 1,
         "services" : [
            "rucio:file4dataset"
         ]
      },
      "file" : [
         {
            "adler32" : "2c38c381",
            "bytes" : 27856682045,
            "md5" : null,
            "name" : "/store/data/Run2018B/EGamma/RAW/v1/000/317/434/00000/007970C7-4C68-E811-B1BA-FA163EC648D9.root",
            "pfns" : {
               "davs://eosctacms.cern.ch:8444//eos/ctacms/archive/cms/store/data/Run2018B/EGamma/RAW/v1/000/317/434/00000/007970C7-4C68-E811-B1BA-FA163EC648D9.root" : {
                  "client_extract" : false,
                  "domain" : "wan",
                  "priority" : 1,
                  "rse" : "T0_CH_CERN_Tape",
                  "rse_id" : "f44c866a264d4da9972969e9f3b5bb52",
                  "type" : "TAPE",
                  "volatile" : false
               }
            },
            "replicas" : [
               {
                  "name" : "T0_CH_CERN_Tape",
                  "state" : "AVAILABLE"
               }
            ],
            "rses" : {
               "T0_CH_CERN_Tape" : [
                  "davs://eosctacms.cern.ch:8444//eos/ctacms/archive/cms/store/data/Run2018B/EGamma/RAW/v1/000/317/434/00000/007970C7-4C68-E811-B1BA-FA163EC648D9.root"
               ]
            },
            "scope" : "cms",
            "size" : 27856682045,
            "states" : {
               "T0_CH_CERN_Tape" : "AVAILABLE"
            }
         }
      ],
      "qhash" : "f9a452a103dfd5701483f78e4f2d8bd2"
   },
   {
      "das" : {
         "expire" : 1737544468,
         "instance" : "prod/global",
         "primary_key" : "file.name",
         "record" : 1,
         "services" : [
            "rucio:file4dataset"
         ]
      },
      "file" : [
         {
            "adler32" : "0e6a88f2",
            "bytes" : 27964441528,
            "md5" : null,
            "name" : "/store/data/Run2018B/EGamma/RAW/v1/000/317/434/00000/027AE8A2-5068-E811-90AA-FA163E160041.root",
            "pfns" : {
               "davs://eosctacms.cern.ch:8444//eos/ctacms/archive/cms/store/data/Run2018B/EGamma/RAW/v1/000/317/434/00000/027AE8A2-5068-E811-90AA-FA163E160041.root" : {
                  "client_extract" : false,
                  "domain" : "wan",
                  "priority" : 1,
                  "rse" : "T0_CH_CERN_Tape",
                  "rse_id" : "f44c866a264d4da9972969e9f3b5bb52",
                  "type" : "TAPE",
                  "volatile" : false
               }
            },
            "replicas" : [
               {
                  "name" : "T0_CH_CERN_Tape",
                  "state" : "AVAILABLE"
               }
            ],
            "rses" : {
               "T0_CH_CERN_Tape" : [
                  "davs://eosctacms.cern.ch:8444//eos/ctacms/archive/cms/store/data/Run2018B/EGamma/RAW/v1/000/317/434/00000/027AE8A2-5068-E811-90AA-FA163E160041.root"
               ]
            },
            "scope" : "cms",
            "size" : 27964441528,
            "states" : {
               "T0_CH_CERN_Tape" : "AVAILABLE"
            }
         }
      ]
...
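
A minimal aggregation sketch over the output of query [2] (my addition, not part of the original discussion): keep only files with at least one DISK replica. The field names follow the records shown above.

#!/usr/bin/env python3
# list the files of a dataset that have at least one DISK replica,
# by aggregating 'dasgoclient ... -json' output as in example [2]
import json
import subprocess
import sys

dataset = sys.argv[1]  # e.g. /EGamma-Error/Run2018B-v1/RAW
query = "file dataset={} system=rucio".format(dataset)
records = json.loads(
    subprocess.check_output(["dasgoclient", "-query={}".format(query), "-json"])
)
for record in records:
    for f in record.get("file", []):
        # every PFN entry carries a "type" field ("DISK" or "TAPE")
        if any(p.get("type") == "DISK" for p in f.get("pfns", {}).values()):
            print(f["name"])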

@belforte
Member

Thanks @todor-ivanov and @mapellidario. Hopefully it will be enough to check block locations, not file-by-file. In any case, changes to cmsdriver will be needed, so the maintainer of that tool can decide whether to use dasgoclient and parse the JSON, or push for DM to create new container names and use the Rucio CLI.

@todor-ivanov

todor-ivanov commented Jan 22, 2025

Hi @belforte

Hopefully it will be enough to check block locations, not file-by-file.

Unfortunately, it won't work at block-level granularity [1]. Probably because Rucio does not know about our concept of blocks, or maybe because the Go code does not make the full translation between the CMS and Rucio data abstractions (blocks vs. datasets) for this query; I have not investigated for the moment. In case I get a request for more complex filtering and aggregation on the client side, I'll have to dive deeper into the code anyway.

[1]

$ dasgoclient -query="block dataset=/EGamma-Error/Run2018B-v1/RAW system=rucio" -json  | json_pp 
[
   {
      "block" : [
         {
            "adler32" : null,
            "bytes" : null,
            "md5" : null,
            "name" : "/EGamma-Error/Run2018B-v1/RAW#ffa687a0-9f5e-4f3a-b4e6-6b168e371ecd",
            "scope" : "cms",
            "size" : null,
            "type" : "DATASET"
         }
      ],
      "das" : {
         "expire" : 1737548948,
         "instance" : "prod/global",
         "primary_key" : "block.name",
         "record" : 1,
         "services" : [
            "rucio:block4dataset"
         ]
      },
      "qhash" : "a45ae2fd17a484fc1609c46e24d2704f"
   }
]

@mapellidario
Member

Uh, thanks Stefano for the suggestion! After playing a bit with dasgoclient, I realized that removing the -json argument gives only what we care about [1] and matches the result from the DAS web GUI. The details for the site4block query are here.

Just for completeness, the list of blocks for a given dataset can be obtained with [2].


[1]

> dasgoclient -query="site block=/ParkingBPH5/Run2018D-UL2018_MiniAODv2-v1/MINIAOD#ff8cf2f9-8fd2-413b-bfa6-09f0628d3c23 system=rucio"
T0_CH_CERN_Tape
T2_CH_CERN
T2_US_Florida

[2]

dasgoclient -query="block dataset=/ParkingBPH5/Run2018D-UL2018_MiniAODv2-v1/MINIAOD"
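
A hedged sketch combining queries [2] and [1] to keep only blocks with a non-tape replica (my addition; treating a "_Tape" suffix as marking tape RSEs is an assumption based on the site names shown in [1]):

import subprocess

def das(query):
    # run dasgoclient and return its whitespace-separated output tokens
    out = subprocess.check_output(
        ["dasgoclient", "-query={}".format(query)], text=True)
    return out.split()

dataset = "/ParkingBPH5/Run2018D-UL2018_MiniAODv2-v1/MINIAOD"
for block in das("block dataset={}".format(dataset)):          # query [2]
    sites = das("site block={} system=rucio".format(block))    # query [1]
    # assumption: tape RSEs end in "_Tape", as in the output above
    if any(not s.endswith("_Tape") for s in sites):
        print(block)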

@belforte
Member

No matter what, the core issue is that cmsdriver assumes that the pileup dataset is fully on disk, so it needs to be modified. As long as it accepts only DBS datasets, they cannot overlap, so new dataset names (as containers in Rucio) will not help, although they look like a good thing to me.

My 2 cents: changing cmsdriver to work with current tools "the way they are now" is faster than changing it to use yet-to-be-developed dasgoclient functionality of the sort file dataset=... rsetype=disk (a possibly expensive query, IMHO).
