Multiple RelVal failures due to file being registered in DAS but not present #40889

Open
iarspider opened this issue Feb 27, 2023 · 18 comments

@iarspider

RelVals 20834.x and 21034.x have been broken since 2023-02-26-0000 due to an inaccessible file:

Failed to open file at URL root://eoscms.cern.ch:1094//eos/cms/store/user/cmsbuild/store/relval/CMSSW_12_3_0_pre5/RelValTTbar_14TeV/GEN-SIM/123X_mcRun4_realistic_v4_2026D88noPU-v1/10000/49e54274-4298-4576-b47b-866e2247eab5.root

This file is part of the /RelValTTbar_14TeV/CMSSW_12_3_0_pre5-123X_mcRun4_realistic_v4_2026D88noPU-v1/GEN-SIM dataset and is registered in DAS as accessible on T2_CH_CERN:

$ dasgoclient --limit 0 --query 'file dataset=/RelValTTbar_14TeV/CMSSW_12_3_0_pre5-123X_mcRun4_realistic_v4_2026D88noPU-v1/GEN-SIM site=T2_CH_CERN'
/store/relval/CMSSW_12_3_0_pre5/RelValTTbar_14TeV/GEN-SIM/123X_mcRun4_realistic_v4_2026D88noPU-v1/10000/49e54274-4298-4576-b47b-866e2247eab5.root

but it is not actually present on EOS:

$ ls /eos/cms/store/user/cmsbuild/store/relval/CMSSW_12_3_0_pre5/RelValTTbar_14TeV/GEN-SIM/123X_mcRun4_realistic_v4_2026D88noPU-v1/10000/49e54274-4298-4576-b47b-866e2247eab5.root
ls: cannot access /eos/cms/store/user/cmsbuild/store/relval/CMSSW_12_3_0_pre5/RelValTTbar_14TeV/GEN-SIM/123X_mcRun4_realistic_v4_2026D88noPU-v1/10000/49e54274-4298-4576-b47b-866e2247eab5.root: No such file or directory

Previously, no files from that dataset were registered as present on T2_CH_CERN, and DAS was returning a full list of files, so RelVal was using a different file (2c4c1ca9-73fe-4648-982f-e773c9ec91e9.root), which is cached on EOS.
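
The mismatch described above (listed in DAS, absent on disk) can be checked mechanically. The following is a minimal sketch, assuming the IB cache mirrors each logical file name under /eos/cms/store/user/cmsbuild, as the failing path suggests. The helper names `cache_path` and `missing_from_cache` are hypothetical and are not part of cms-bot.

```python
import os

# Assumed cache prefix, taken from the failing path in the report above.
CACHE_PREFIX = "/eos/cms/store/user/cmsbuild"


def cache_path(lfn, prefix=CACHE_PREFIX):
    """Map a logical file name (/store/...) to its expected cache location."""
    return prefix + lfn


def missing_from_cache(lfns, exists=os.path.exists, prefix=CACHE_PREFIX):
    """Return the LFNs that a DAS query listed but that are not on disk.

    `exists` is injectable so the check can be exercised without EOS access.
    """
    return [lfn for lfn in lfns if not exists(cache_path(lfn, prefix))]
```

Feeding the output of the dasgoclient query into `missing_from_cache` would flag 49e54274-4298-4576-b47b-866e2247eab5.root before any RelVal tries to open it.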

@cmsbuild

A new Issue was created by @iarspider .

@Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@smuzaffar

It looks like some cleanup was done at T2_CH_CERN. Previously the DAS query [a] was returning

/store/relval/CMSSW_12_3_0_pre5/RelValTTbar_14TeV/GEN-SIM/123X_mcRun4_realistic_v4_2026D88noPU-v1/10000/2c4c1ca9-73fe-4648-982f-e773c9ec91e9.root
/store/relval/CMSSW_12_3_0_pre5/RelValTTbar_14TeV/GEN-SIM/123X_mcRun4_realistic_v4_2026D88noPU-v1/10000/49e54274-4298-4576-b47b-866e2247eab5.root
/store/relval/CMSSW_12_3_0_pre5/RelValTTbar_14TeV/GEN-SIM/123X_mcRun4_realistic_v4_2026D88noPU-v1/10000/58ceb765-6c7b-4e24-b899-0333161f2db6.root
/store/relval/CMSSW_12_3_0_pre5/RelValTTbar_14TeV/GEN-SIM/123X_mcRun4_realistic_v4_2026D88noPU-v1/10000/6f34fee2-a183-4d33-8934-819cfb574f61.root
/store/relval/CMSSW_12_3_0_pre5/RelValTTbar_14TeV/GEN-SIM/123X_mcRun4_realistic_v4_2026D88noPU-v1/10000/ca5a619f-a1c7-4627-a1b6-d860b79fa23c.root
/store/relval/CMSSW_12_3_0_pre5/RelValTTbar_14TeV/GEN-SIM/123X_mcRun4_realistic_v4_2026D88noPU-v1/10000/cc6266b9-ec99-4883-a208-cfb52c105ed7.root
/store/relval/CMSSW_12_3_0_pre5/RelValTTbar_14TeV/GEN-SIM/123X_mcRun4_realistic_v4_2026D88noPU-v1/10000/ebeddbb5-e9b0-4e10-a274-c4815eb4e501.root

but now the same query returns just one inaccessible file

/store/relval/CMSSW_12_3_0_pre5/RelValTTbar_14TeV/GEN-SIM/123X_mcRun4_realistic_v4_2026D88noPU-v1/10000/49e54274-4298-4576-b47b-866e2247eab5.root

Does anyone know about this cleanup?

[a]

 dasgoclient --limit 0 --query 'file dataset=/RelValTTbar_14TeV/CMSSW_12_3_0_pre5-123X_mcRun4_realistic_v4_2026D88noPU-v1/GEN-SIM site=T2_CH_CERN'

@makortel

assign core

@cmsbuild

New categories assigned: core

@Dr15Jones,@smuzaffar,@makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks

@makortel

Let's add @cms-sw/pdmv-l2 in case they would know if there is or has been any general RelVal cleanup at CERN.

@makortel

I've followed this CMS Talk thread https://cms-talk.web.cern.ch/t/cannot-find-any-valid-file-inside-the-dataset/20761 where Rucio's automatic file-level cleanup has caused some unexpected behavior.

@kskovpen

If these data are not cached for IBs, the relval outputs are kept for 6 months to 1 year.

@smuzaffar

smuzaffar commented Feb 27, 2023

@kskovpen, we only cache files that are actually opened during the IB/PR tests. In this case only one file, 2c4c1ca9-73fe-4648-982f-e773c9ec91e9.root, was cached and is available in the cache. The problem is that dasgoclient now tells the system that there is only one file, 49e54274-4298-4576-b47b-866e2247eab5.root, for this dataset. The system can still cache this new file, but it needs to be physically available; neither gfal-copy nor xrootd can copy it.

@makortel

makortel commented Feb 27, 2023

I see now in the DAS web GUI that the

  • site dataset=/RelValTTbar_14TeV/CMSSW_12_3_0_pre5-123X_mcRun4_realistic_v4_2026D88noPU-v1/GEN-SIM shows T2_IN_TIFR
  • file dataset=/RelValTTbar_14TeV/CMSSW_12_3_0_pre5-123X_mcRun4_realistic_v4_2026D88noPU-v1/GEN-SIM site=T2_CH_CERN indeed shows /store/relval/CMSSW_12_3_0_pre5/RelValTTbar_14TeV/GEN-SIM/123X_mcRun4_realistic_v4_2026D88noPU-v1/10000/49e54274-4298-4576-b47b-866e2247eab5.root
  • site file=/store/relval/CMSSW_12_3_0_pre5/RelValTTbar_14TeV/GEN-SIM/123X_mcRun4_realistic_v4_2026D88noPU-v1/10000/49e54274-4298-4576-b47b-866e2247eab5.root shows T2_IN_TIFR

Clearly there is some inconsistency between file dataset=... site=T2_CH_CERN and site file=... on the same file. As far as I can tell, DAS is picking all this site information from Rucio. Let me add @ericvaandering in case he'd have an idea where to look further (or who could help further)

@ericvaandering

The two commands I run on the Rucio side

rucio list-file-replicas cms:/store/relval/CMSSW_12_3_0_pre5/RelValTTbar_14TeV/GEN-SIM/123X_mcRun4_realistic_v4_2026D88noPU-v1/10000/49e54274-4298-4576-b47b-866e2247eab5.root
rucio list-dataset-replicas --deep cms:/RelValTTbar_14TeV/CMSSW_12_3_0_pre5-123X_mcRun4_realistic_v4_2026D88noPU-v1/GEN-SIM

are consistent. The file is only at T2_IN_TIFR. Maybe some cached info in DAS from CERN?

In fact there was a rule which could have had it there until Saturday:

[ewv@lxplus8s08 ~]$ rucio list-rules-history cms:/RelValTTbar_14TeV/CMSSW_12_3_0_pre5-123X_mcRun4_realistic_v4_2026D88noPU-v1/GEN-SIM
----------------------------------------
Rule insertion
Account : wma_prod
RSE expression : (tier=2|tier=1)&cms_type=real&rse_type=DISK
Time : 2022-02-20 03:55:12
----------------------------------------
Rule deletion
Account : wma_prod
RSE expression : (tier=2|tier=1)&cms_type=real&rse_type=DISK
Time : 2022-02-21 09:25:39
----------------------------------------
Rule insertion
Account : wmcore_output
RSE expression : T1_US_FNAL_Disk|T2_CH_CERN
Time : 2022-02-20 08:31:44
----------------------------------------
Rule deletion
Account : wmcore_output
RSE expression : T1_US_FNAL_Disk|T2_CH_CERN
Time : 2022-02-25 12:24:46

@smuzaffar

The dasgoclient JSON output shows that rucio:file4dataset_site reports this file as available at CERN too. @vkuznet, is there any DAS cache which could be returning the wrong information?

[a]

> dasgoclient -json -query 'file dataset=/RelValTTbar_14TeV/CMSSW_12_3_0_pre5-123X_mcRun4_realistic_v4_2026D88noPU-v1/GEN-SIM site=T2_CH_CERN' | jq .
[muzaffar@cmsdev25 ~]$ dasgoclient -json -query 'file dataset=/RelValTTbar_14TeV/CMSSW_12_3_0_pre5-123X_mcRun4_realistic_v4_2026D88noPU-v1/GEN-SIM site=T2_CH_CERN' | jq .
[
  {
    "das": {
      "expire": 1677513623,
      "instance": "prod/global",
      "primary_key": "file.name",
      "record": 1,
      "services": [
        "rucio:file4dataset_site"
      ]
    },
    "file": [
      {
        "adler32": "183320a0",
        "bytes": 222326185,
        "md5": null,
        "name": "/store/relval/CMSSW_12_3_0_pre5/RelValTTbar_14TeV/GEN-SIM/123X_mcRun4_realistic_v4_2026D88noPU-v1/10000/49e54274-4298-4576-b47b-866e2247eab5.root",
        "pfns": {
          "davs://eoscms.cern.ch:443/eos/cms/store/relval/CMSSW_12_3_0_pre5/RelValTTbar_14TeV/GEN-SIM/123X_mcRun4_realistic_v4_2026D88noPU-v1/10000/49e54274-4298-4576-b47b-866e2247eab5.root": {
            "client_extract": false,
            "domain": "wan",
            "priority": 1,
            "rse": "T2_CH_CERN",
            "rse_id": "542ab69d82bf401e9218bbe375bb1fce",
            "type": "DISK",
            "volatile": false
          },
          "davs://se01.indiacms.res.in:443/dpm/indiacms.res.in/home/cms/store/relval/CMSSW_12_3_0_pre5/RelValTTbar_14TeV/GEN-SIM/123X_mcRun4_realistic_v4_2026D88noPU-v1/10000/49e54274-4298-4576-b47b-866e2247eab5.root": {
            "client_extract": false,
            "domain": "wan",
            "priority": 2,
            "rse": "T2_IN_TIFR",
            "rse_id": "fb19bfcb3f504ebc8d75949ecf97b51a",
            "type": "DISK",
            "volatile": false
          }
        },
        "rses": {
          "T2_CH_CERN": [
            "davs://eoscms.cern.ch:443/eos/cms/store/relval/CMSSW_12_3_0_pre5/RelValTTbar_14TeV/GEN-SIM/123X_mcRun4_realistic_v4_2026D88noPU-v1/10000/49e54274-4298-4576-b47b-866e2247eab5.root"
          ],
          "T2_IN_TIFR": [
            "davs://se01.indiacms.res.in:443/dpm/indiacms.res.in/home/cms/store/relval/CMSSW_12_3_0_pre5/RelValTTbar_14TeV/GEN-SIM/123X_mcRun4_realistic_v4_2026D88noPU-v1/10000/49e54274-4298-4576-b47b-866e2247eab5.root"
          ]
        },
        "scope": "cms",
        "size": 222326185,
        "states": {
          "T2_CH_CERN": "BEING_DELETED",
          "T2_IN_TIFR": "AVAILABLE"
        }
      }
    ],
    "qhash": "0334f1487656a3cb85fdb817f56f0cab"
  }
]

@makortel

I see the JSON document has inside states element "T2_CH_CERN": "BEING_DELETED" and "T2_IN_TIFR": "AVAILABLE".

@smuzaffar Could you remind me, if DAS would have returned empty list of files for the file dataset=... site=T2_CH_CERN query, would the IB machinery have updated its DAS query output cache, or kept the earlier list of files?
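
Since the record carries a per-site `states` map, a consumer could filter on it instead of trusting the site match alone. This is a minimal sketch, assuming the record layout shown in the JSON output above. `available_files` is a hypothetical helper, not an existing dasgoclient or cms-bot function.

```python
# Trimmed-down copy of the record shown above; only the fields used here.
SAMPLE = [
    {
        "file": [
            {
                "name": "/store/relval/CMSSW_12_3_0_pre5/RelValTTbar_14TeV/GEN-SIM/123X_mcRun4_realistic_v4_2026D88noPU-v1/10000/49e54274-4298-4576-b47b-866e2247eab5.root",
                "states": {"T2_CH_CERN": "BEING_DELETED", "T2_IN_TIFR": "AVAILABLE"},
            }
        ]
    }
]


def available_files(records, site):
    """Return names of files whose replica at `site` is in state AVAILABLE."""
    names = []
    for record in records:
        for entry in record.get("file", []):
            states = entry.get("states") or {}
            if states.get(site) == "AVAILABLE":
                names.append(entry["name"])
    return names
```

With the sample record, `available_files(SAMPLE, "T2_CH_CERN")` returns an empty list, while the same call for T2_IN_TIFR returns the one file name.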

@vkuznet

vkuznet commented Feb 27, 2023

The DAS Go client does not have a cache. That said, the issue is that DAS uses different Rucio APIs for one query than for the other. If you add the -verbose 2 option to the DAS client, you can see how DAS queries the services. In this particular case:

  • for the query file dataset=/RelValTTbar_14TeV/CMSSW_12_3_0_pre5-123X_mcRun4_realistic_v4_2026D88noPU-v1/GEN-SIM site=T2_CH_CERN, DAS fetches the info from this URL: http://cms-rucio.cern.ch/replicas/cms/RelValTTbar_14TeV/CMSSW_12_3_0_pre5-123X_mcRun4_realistic_v4_2026D88noPU-v1/GEN-SIM
  • while for the query site file=/store/relval/CMSSW_12_3_0_pre5/RelValTTbar_14TeV/GEN-SIM/123X_mcRun4_realistic_v4_2026D88noPU-v1/10000/49e54274-4298-4576-b47b-866e2247eab5.root, it uses this URL: http://cms-rucio.cern.ch/replicas/cms/store/relval/CMSSW_12_3_0_pre5/RelValTTbar_14TeV/GEN-SIM/123X_mcRun4_realistic_v4_2026D88noPU-v1/10000/49e54274-4298-4576-b47b-866e2247eab5.root

As far as I can tell they produce different results, and therefore it is an issue with the output of different Rucio APIs rather than with DAS per se.

@ericvaandering

I suspect the first one is the equivalent of

[ewv@lxplus8s08 ~]$ rucio list-dataset-replicas cms:/RelValTTbar_14TeV/CMSSW_12_3_0_pre5-123X_mcRun4_realistic_v4_2026D88noPU-v1/GEN-SIM

DATASET: cms:/RelValTTbar_14TeV/CMSSW_12_3_0_pre5-123X_mcRun4_realistic_v4_2026D88noPU-v1/GEN-SIM#1db525db-9aa5-44f6-a4ca-79b3a2ecfa49
+------------+---------+---------+
| RSE        |   FOUND |   TOTAL |
|------------+---------+---------|
| T2_IN_TIFR |       5 |       7 |
+------------+---------+---------+

(Note the absence of the --deep flag compared to what I posted before.) That output is not reliable. There is a table in Rucio which tries to keep track of complete block replicas, and it is usually pretty accurate, but it is not 100% accurate because the block is not an atomic thing in Rucio; the files are.
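
To illustrate why a per-file count cannot drift the way a separately maintained block-level table can, here is a minimal sketch that derives FOUND/TOTAL per RSE directly from file-level replica states. This is not how Rucio implements --deep. The function name, the input layout, and the file and RSE names in the example are all hypothetical.

```python
from collections import Counter


def replica_summary(file_replicas):
    """Compute {rse: (found, total)} from a {file_name: {rse: state}} mapping.

    Only replicas in state AVAILABLE count as found; total is the number of
    files in the block, whether or not any replica of them exists.
    """
    total = len(file_replicas)
    found = Counter()
    for states in file_replicas.values():
        for rse, state in states.items():
            if state == "AVAILABLE":
                found[rse] += 1
    return {rse: (count, total) for rse, count in found.items()}


# Hypothetical three-file block: one replica is being deleted, one file has none.
example = {
    "file1.root": {"RSE_A": "AVAILABLE", "RSE_B": "BEING_DELETED"},
    "file2.root": {"RSE_A": "AVAILABLE"},
    "file3.root": {},
}
summary = replica_summary(example)
```

An RSE whose replicas are all in a state such as BEING_DELETED does not appear in the summary at all, which is the behaviour one would want for the T2_CH_CERN replica discussed in this thread.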

@smuzaffar

> I see the JSON document has inside states element "T2_CH_CERN": "BEING_DELETED" and "T2_IN_TIFR": "AVAILABLE".
>
> @smuzaffar Could you remind me, if DAS would have returned empty list of files for the file dataset=... site=T2_CH_CERN query, would the IB machinery have updated its DAS query output cache, or kept the earlier list of files?

@makortel, the bot keeps the old results if DAS returns an empty list or an error for a query.
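
The policy just described (keep the old results on an empty list or an error) can be sketched as follows. This is a minimal illustration, not the actual cms-bot code. `update_cached_result` and its arguments are hypothetical names.

```python
def update_cached_result(cached, query_fn):
    """Return the file list to store for a DAS query.

    `query_fn` is any callable that runs the query and returns a list of
    file names. The previous result is kept if the query raises or
    returns nothing.
    """
    try:
        fresh = query_fn()
    except Exception:
        return cached  # query failed: keep the old result
    return fresh if fresh else cached
```

Note that this policy does not protect against the case in this issue, where the query succeeds and returns a non-empty list containing a file that is not readable.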

@vkuznet

vkuznet commented Feb 27, 2023

To @ericvaandering: I have seen this --deep option discussed in many threads and I opened a DAS ticket for it, see dmwm/das2go#53. Once you clarify how to use this flag in the REST API, it can be incorporated into the DAS codebase.

@ericvaandering

@smuzaffar

For now I have updated the bot to ignore /store/relval/CMSSW_12_3_0_pre5/RelValTTbar_14TeV/GEN-SIM/123X_mcRun4_realistic_v4_2026D88noPU-v1/10000/49e54274-4298-4576-b47b-866e2247eab5.root for T2_CH_CERN. This basically forces the bot to use the results of file dataset=/RelValTTbar_14TeV/CMSSW_12_3_0_pre5-123X_mcRun4_realistic_v4_2026D88noPU-v1/GEN-SIM. This is just a workaround to fix the RelVal failures in the CMSSW IBs.
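
The workaround amounts to an ignore list plus a fallback to the query without the site restriction. This is a minimal sketch of that idea under stated assumptions. The `IGNORED` set, `select_files`, and the `query` callable are hypothetical and do not reflect the bot's real interface.

```python
# Hypothetical ignore list: (site, LFN) pairs that DAS reports but that cannot be read.
IGNORED = {
    (
        "T2_CH_CERN",
        "/store/relval/CMSSW_12_3_0_pre5/RelValTTbar_14TeV/GEN-SIM/123X_mcRun4_realistic_v4_2026D88noPU-v1/10000/49e54274-4298-4576-b47b-866e2247eab5.root",
    )
}


def select_files(dataset, site, query, ignored=IGNORED):
    """Pick input files for a dataset, preferring replicas at `site`.

    `query(dataset, site)` returns a list of LFNs; site=None means the
    query is made without a site restriction. If every file reported for
    the site is on the ignore list, fall back to the unrestricted query.
    """
    usable = [lfn for lfn in query(dataset, site) if (site, lfn) not in ignored]
    if usable:
        return usable
    return query(dataset, None)
```

In this issue the site-restricted query returns only the ignored file, so the fallback branch is taken and the full dataset file list is used, as it was before the cleanup.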
