uproot can not open files on dCache when 'https' protocol is used. #1255

Open
bockjoo opened this issue Jul 24, 2024 · 2 comments
Labels
bug (unverified) The problem described would be a bug, but needs to be triaged

Comments

@bockjoo

bockjoo commented Jul 24, 2024

When I try to open a file on dCache over 'https' with an X509 proxy, uproot fails to open it. I am using:

Python 3.12.4 | packaged by Anaconda, Inc. | (main, Jun 18 2024, 15:12:24) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import uproot
>>> uproot.__version__
'5.3.10'

This can be reproduced with:

import sys
import os
import ssl
import uproot

filenames = [{"T1_US_FNAL root":"root://cmsxrootd-site2.fnal.gov//store/mc/RunIISummer20UL18NanoAODv9/TTTo2L2Nu_TuneCP5_13TeV-powheg-pythia8/NANOAODSIM/106X_upgrade2018_realistic_v16_L1v1-v1/130000/5D2E4672-C7D3-AF49-B699-E0F7E83A699C.root"}]
filenames.append({"T2_US_Wisconsin https":"https://cmsxrootd.hep.wisc.edu:1094/store/mc/RunIISummer20UL18NanoAODv9/TTTo2L2Nu_TuneCP5_13TeV-powheg-pythia8/NANOAODSIM/106X_upgrade2018_realistic_v16_L1v1-v1/130000/44187D37-0301-3942-A6F7-C723E9F4813D.root"})
filenames.append({"T1_US_FNAL https":"https://cmsdcadisk.fnal.gov:2880/dcache/uscmsdisk/store/mc/RunIISummer20UL18NanoAODv9/TTTo2L2Nu_TuneCP5_13TeV-powheg-pythia8/NANOAODSIM/106X_upgrade2018_realistic_v16_L1v1-v1/130000/5D2E4672-C7D3-AF49-B699-E0F7E83A699C.root"})
filenames.append({"T2_DE_DESY https":"https://dcache-cms-webdav-wan.desy.de:2880//pnfs/desy.de/cms/tier2/store/mc/RunIISummer20UL18NanoAODv9/TTTo2L2Nu_TuneCP5_13TeV-powheg-pythia8/NANOAODSIM/106X_upgrade2018_realistic_v16_L1v1-v1/130000/5D2E4672-C7D3-AF49-B699-E0F7E83A699C.root"})

for i, thefile in enumerate(filenames, start=1):
    # Each entry is a one-item dict: {"<site> <protocol>": "<url>"}
    site_protocol, url = next(iter(thefile.items()))
    try:
        # Present the X509 proxy as the client certificate (it holds both cert and key)
        sslctx = ssl.create_default_context()
        sslctx.load_cert_chain(os.environ["X509_USER_PROXY"], os.environ["X509_USER_PROXY"])

        uproot_options = {'ssl': sslctx}
        the_file = uproot.open({url: None}, **uproot_options)
        print("[", i, "] OPEN OK ", site_protocol, url, " size of CA certs ", len(sslctx.get_ca_certs()))
        the_file.close()
    except Exception as e:
        print("[", i, "] OPEN Exception ", site_protocol, url, " Exception was ", e)

The output of the above script looks like:

[ 1 ] OPEN OK  T1_US_FNAL root root://cmsxrootd-site2.fnal.gov//store/mc/RunIISummer20UL18NanoAODv9/TTTo2L2Nu_TuneCP5_13TeV-powheg-pythia8/NANOAODSIM/106X_upgrade2018_realistic_v16_L1v1-v1/130000/5D2E4672-C7D3-AF49-B699-E0F7E83A699C.root  size of CA certs  147
[ 2 ] OPEN OK  T2_US_Wisconsin https https://cmsxrootd.hep.wisc.edu:1094/store/mc/RunIISummer20UL18NanoAODv9/TTTo2L2Nu_TuneCP5_13TeV-powheg-pythia8/NANOAODSIM/106X_upgrade2018_realistic_v16_L1v1-v1/130000/44187D37-0301-3942-A6F7-C723E9F4813D.root  size of CA certs  147
[ 3 ] OPEN Exception  T1_US_FNAL https https://cmsdcadisk.fnal.gov:2880/dcache/uscmsdisk/store/mc/RunIISummer20UL18NanoAODv9/TTTo2L2Nu_TuneCP5_13TeV-powheg-pythia8/NANOAODSIM/106X_upgrade2018_realistic_v16_L1v1-v1/130000/5D2E4672-C7D3-AF49-B699-E0F7E83A699C.root  Exception was  https://cmsdcadisk.fnal.gov:2880/dcache/uscmsdisk/store/mc/RunIISummer20UL18NanoAODv9/TTTo2L2Nu_TuneCP5_13TeV-powheg-pythia8/NANOAODSIM/106X_upgrade2018_realistic_v16_L1v1-v1/130000/5D2E4672-C7D3-AF49-B699-E0F7E83A699C.root
[ 4 ] OPEN Exception  T2_DE_DESY https https://dcache-cms-webdav-wan.desy.de:2880//pnfs/desy.de/cms/tier2/store/mc/RunIISummer20UL18NanoAODv9/TTTo2L2Nu_TuneCP5_13TeV-powheg-pythia8/NANOAODSIM/106X_upgrade2018_realistic_v16_L1v1-v1/130000/5D2E4672-C7D3-AF49-B699-E0F7E83A699C.root  Exception was  https://dcache-cms-webdav-wan.desy.de:2880//pnfs/desy.de/cms/tier2/store/mc/RunIISummer20UL18NanoAODv9/TTTo2L2Nu_TuneCP5_13TeV-powheg-pythia8/NANOAODSIM/106X_upgrade2018_realistic_v16_L1v1-v1/130000/5D2E4672-C7D3-AF49-B699-E0F7E83A699C.root
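
In the output above the exception text is only the URL, which hides the underlying cause. A minimal sketch of a helper that prints the exception type and everything chained to it may help narrow this down; `describe_exception` is a hypothetical name, not part of uproot, and what the chain actually contains for these dCache URLs is not known from this thread.

```python
def describe_exception(exc):
    """Return 'Type: message' for exc and every exception chained to it."""
    parts = []
    seen = set()
    # Follow explicit causes (raise ... from ...) first, then implicit context.
    while exc is not None and id(exc) not in seen:
        seen.add(id(exc))
        parts.append(f"{type(exc).__name__}: {exc}")
        exc = exc.__cause__ or exc.__context__
    return " <- ".join(parts)
```

In the reproducer, the `except` branch could print `describe_exception(e)` instead of `e`.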

To access the files, one needs to pass the X509 proxy via an SSLContext as above, and fsspec might need to be patched to add the SSLContext itself.

In lib/python3.12/site-packages/fsspec/implementations/http.py, create the SSLContext between lines 224 and 225 and between lines 825 and 826, like so:
        import os
        import ssl
        import socket
        import copyreg
        def save_sslcontext(obj):
           return obj.__class__, (obj.protocol,)
    
        copyreg.pickle(ssl.SSLContext, save_sslcontext)
        sslctx = ssl.create_default_context()
        sslctx.load_cert_chain(os.environ['X509_USER_PROXY'], os.environ['X509_USER_PROXY'])
        sslctxdic={'ssl': sslctx}
        # Last - 0 necessary
        kw.update(sslctxdic)
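
For reference, the `copyreg` reducer in the patch records only the protocol of the context. A minimal standard-library sketch of what that pickle round trip produces is shown here; it is an illustration of the reducer, not of fsspec or uproot internals.

```python
import copyreg
import pickle
import ssl


def save_sslcontext(obj):
    # Same reducer as in the patch: only the protocol is recorded.
    return obj.__class__, (obj.protocol,)


copyreg.pickle(ssl.SSLContext, save_sslcontext)

ctx = ssl.create_default_context()
restored = pickle.loads(pickle.dumps(ctx))

# The copy is a fresh context for the same protocol. Certificates loaded
# into `ctx` with load_cert_chain() are not carried over by this reducer.
print(type(restored).__name__, restored.protocol == ctx.protocol)
```

So a context restored this way would need `load_cert_chain` called on it again before it can present the proxy.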

bockjoo added the bug (unverified) label Jul 24, 2024
@bockjoo
Author

bockjoo commented Jul 24, 2024

On the other hand, this traditional script properly downloads files from dCache:

import json,os,time
import urllib.request, urllib.error
import ssl
import os.path


url = 'https://cms-cric.cern.ch/api/accounts/user/query/?json&preset=people'
url = "https://cmsio9.rc.ufl.edu:1094/store/user/bockjoo/nano_dy.root"
url = "https://cmsxrootd.hep.wisc.edu:1094//store/mc/RunIISummer20UL18NanoAODv9/TTTo2L2Nu_TuneCP5_13TeV-powheg-pythia8/NANOAODSIM/106X_upgrade2018_realistic_v16_L1v1-v1/130000/44187D37-0301-3942-A6F7-C723E9F4813D.root"
url = "https://dcache-cms-webdav-wan.desy.de:2880//pnfs/desy.de/cms/tier2/store/mc/RunIISummer20UL18NanoAODv9/TTTo2L2Nu_TuneCP5_13TeV-powheg-pythia8/NANOAODSIM/106X_upgrade2018_realistic_v16_L1v1-v1/130000/5D2E4672-C7D3-AF49-B699-E0F7E83A699C.root"

CERTIFICATE_CRT = '/home/bockjoo/.cmsuser.proxy'
CERTIFICATE_KEY = '/home/bockjoo/.cmsuser.proxy'
try:
    # A bare SSLContext() does not verify the server certificate,
    # unlike ssl.create_default_context() in the uproot reproducer.
    myContext = ssl.SSLContext()
    myContext.load_cert_chain(CERTIFICATE_CRT, CERTIFICATE_KEY)
    with urllib.request.urlopen(url, context=myContext) as urlHandle:
        urlCharset = urlHandle.headers.get_content_charset()
        if urlCharset is None:
            urlCharset = "utf-8"
        # Read the body once; a second read() on the same response returns b"".
        rawData = urlHandle.read()
        try:
            myData = rawData.decode(urlCharset)
        except (UnicodeDecodeError, LookupError):
            # Binary content (such as a ROOT file) stays as bytes
            myData = rawData
except Exception:
    print("Failed to download ", url)
    raise

print(type(myData))

with the output:

<class 'bytes'>

@nikoladze
Contributor

I'm not sure if this is related, but I also noticed some issues when reading through dCache with the current default https method in uproot/fsspec. My main observation is that it hangs once I request columns (which leads to the uproot source calling .chunks). I believe this boils down to two things:

  • aiohttp uses a connection pool of up to 100 TCP connections. dCache does not like this: typically a single client is expected to open only very few connections, and one will see queuing when too many are opened.
  • dCache redirects a GET request to a URL with a unique identifier in the parameters (I believe these connections are treated statefully on the server, whereas HTTP requests are in principle stateless). This location is then supposed to be used for subsequent requests to the same file, but that is not what aiohttp does. Instead, the next GET request asks for the original URL again, gets another redirect (with a new state), and uses that. It also doesn't help that aiohttp keeps the TCP connections open, since it does not remember the redirect URLs (with the unique identifiers in the parameters).

Here is the second point illustrated in code (to reproduce, replace the URL with something you have access to; the following probably needs a Belle II VO X509 certificate):

import ssl
import os
from urllib.parse import urlparse
from http.client import HTTPSConnection

ctx = ssl.create_default_context(capath=os.environ["X509_CERT_DIR"])
ctx.load_cert_chain(os.environ["X509_USER_PROXY"])

path = "https://lcg-lrz-http.grid.lrz.de:443/pnfs/lrz-muenchen.de/data/belle/localgroupdisk/belle/user/nhart/test_202408141225/sub00/RootOutput_00000_job428876195_00.root"
parsed = urlparse(path)
conn = HTTPSConnection(parsed.hostname, port=parsed.port, context=ctx)

Now, this request:

conn.request("GET", f"{parsed.path}?{parsed.query}", headers={"Range": "bytes=0-10"})
resp = conn.getresponse()
print(resp.headers.as_string())
print(resp.status)
print(resp.read())

gives something like:

Date: Fri, 23 Aug 2024 14:42:33 GMT
Server: dCache/9.2.17
Location: https://lcg-lrz-dc46.grid.lrz.de:62240/pnfs/lrz-muenchen.de/data/belle/localgroupdisk/belle/user/nhart/test_202408141225/sub00/RootOutput_00000_job428876195_00.root?dcache-http-uuid=d20f7960-32dc-42e3-802c-efb44f66e184&dcache-http-ref=https%3A%2F%2Flcg-lrz-http.grid.lrz.de%3A443
Content-Length: 0


302
b''

So, a redirect URL with a UUID in it. I can now open a connection to this location and make multiple requests over it (but I can't use the URL parameters with the UUID in multiple connections):

location = resp.headers["Location"]
parsed = urlparse(location)
conn = HTTPSConnection(parsed.hostname, port=parsed.port, context=ctx) # this is the new connection to the redirect location

Now I can repeat the following many times, also with different ranges:

conn.request("GET", f"{parsed.path}?{parsed.query}", headers={"Range": "bytes=0-10"}) # the parsed.query now contains the url parameters needed
resp = conn.getresponse()
print(resp.headers.as_string())
print(resp.status)
print(resp.read())
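
The manual steps above (resolve the redirect once, then keep issuing range requests on the connection to the redirect location) could be wrapped roughly as follows. This is only a sketch with the standard library: `open_redirected` and `read_range` are hypothetical helper names, and it assumes the server answers the first GET with a redirect as shown above and replies 206 to range requests.

```python
from http.client import HTTPConnection, HTTPSConnection
from urllib.parse import urlparse


def _connect(url, context=None):
    """Create a (lazy) connection for `url`; return (connection, request target)."""
    parsed = urlparse(url)
    if parsed.scheme == "https":
        conn = HTTPSConnection(parsed.hostname, port=parsed.port, context=context)
    else:
        conn = HTTPConnection(parsed.hostname, port=parsed.port)
    target = parsed.path + (f"?{parsed.query}" if parsed.query else "")
    return conn, target


def open_redirected(url, context=None):
    """Resolve the redirect once; return a connection to the redirect
    location together with the target (path plus uuid query) to request."""
    conn, target = _connect(url, context)
    conn.request("GET", target, headers={"Range": "bytes=0-0"})
    resp = conn.getresponse()
    resp.read()  # drain the body so the connection stays usable
    location = resp.headers.get("Location")
    if resp.status in (301, 302, 303, 307, 308) and location:
        conn.close()
        return _connect(location, context)
    return conn, target  # no redirect: keep using the original connection


def read_range(conn, target, start, stop):
    """Read bytes start..stop (inclusive) over the already-open connection."""
    conn.request("GET", target, headers={"Range": f"bytes={start}-{stop}"})
    resp = conn.getresponse()
    body = resp.read()
    if resp.status != 206:
        raise OSError(f"expected 206 Partial Content, got {resp.status}")
    return body
```

Usage would be `conn, target = open_redirected(path, context=ctx)` followed by any number of `read_range(conn, target, start, stop)` calls on that single connection.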

I'm not sure what the solution is. One would need to introduce a corresponding behavior in the fsspec https source and/or make it use multi-range requests again (which is what the old uproot HTTPSource did). Concerning multi-range requests, what I gather from @jpivarski's comments on older issues like #3 is that it was quite a struggle to even find out whether an HTTP server supports them.

Or do we stick to XRootD for storage systems like dCache? ...
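
For context, a multi-range request packs several byte ranges into one `Range` header, and a server that supports it answers with 206 and a `multipart/byteranges` body. A minimal sketch of building such a header follows; `build_range_header` is a hypothetical helper, and whether a given dCache endpoint honors multiple ranges is not established in this thread.

```python
def build_range_header(ranges):
    """Build an HTTP Range header value from inclusive (start, stop) byte pairs."""
    if not ranges:
        raise ValueError("at least one byte range is required")
    specs = []
    for start, stop in ranges:
        if start < 0 or stop < start:
            raise ValueError(f"invalid byte range: {start}-{stop}")
        specs.append(f"{start}-{stop}")
    return "bytes=" + ", ".join(specs)


# Two ranges requested in a single GET
headers = {"Range": build_range_header([(0, 10), (100, 149)])}
print(headers)
```

A client still has to check the response, since a server may ignore the header and return the whole file with status 200.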
