Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: refactor url - object split (motivated by fsspec integration) #976

Merged
merged 23 commits into from
Oct 19, 2023

Conversation

lobis
Copy link
Collaborator

@lobis lobis commented Oct 5, 2023

Currently the uproot._util.py file contains some helper methods for things such as url parsing, split root object from file path etc.

Besides the splitting of the root object from a URI which is non-standard, fsspec should be able to handle everything else and we could remove or simplify some helper methods.

Since this uses fsspec but fsspec is not currently a dependency my solution was to do the new url split using fsspec only if its installed, otherwise do the old one, which may cause some problems.

@lobis
Copy link
Collaborator Author

lobis commented Oct 5, 2023

Perhaps it would make sense to add fsspec as a dependency for this PR after the next major release? This would avoid having potentially different behaviours for users having fsspec installed or not, and I guess it would also make the code easier to write and read we wouldn't have to handle both cases.

@lobis lobis marked this pull request as ready for review October 5, 2023 16:15
@lobis lobis requested a review from jpivarski October 5, 2023 16:15
Copy link
Member

@jpivarski jpivarski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When this is done, fsspec will become a strict dependency.

This would avoid having potentially different behaviours for users having fsspec installed or not

If that's possible, then you would be introducing a behavior change now for those users who happen to have fsspec installed for other reasons. We don't want the path-handling to change at all. Could you get more certainty about that?

Meanwhile, this is not a change that has to happen now (or ever, technically). The original splitting was using a Python standard library function. Even when we start relying on fsspec for all file backends, there's no reason we couldn't still use the standard urlparse for URL parsing. Does fsspec.utils.urlsplit do something special?

I'm on the fence about this one.

src/uproot/_util.py Outdated Show resolved Hide resolved
@lobis
Copy link
Collaborator Author

lobis commented Oct 5, 2023

When this is done, fsspec will become a strict dependency.

This would avoid having potentially different behaviours for users having fsspec installed or not

If that's possible, then you would be introducing a behavior change now for those users who happen to have fsspec installed for other reasons. We don't want the path-handling to change at all. Could you get more certainty about that?

Meanwhile, this is not a change that has to happen now (or ever, technically). The original splitting was using a Python standard library function. Even when we start relying on fsspec for all file backends, there's no reason we couldn't still use the standard urlparse for URL parsing. Does fsspec.utils.urlsplit do something special?

I'm on the fence about this one.

Yes I think you are right we should not use the fsspec url parsing when we can just use urllib. At the end I ended up refactoring the helper method to use urllib and added an explicit test for it. The old method got confused with things such as double : in the url, etc.

I feel that this method is more concise but I am worried it does not fully reproduce the old one (everything looks good so far, all the tests pass).

@lobis lobis requested a review from nsmith- October 5, 2023 21:31
@nsmith-
Copy link
Member

nsmith- commented Oct 5, 2023

Might be worth reviewing all the supported syntaxes of files argument in https://uproot.readthedocs.io/en/latest/uproot._dask.dask.html as it is a bit more than uproot.open and thinking about what would make the most sense to normalize paths to something uniform.

@jpivarski
Copy link
Member

The colon-splitting between filename and object (not related to #974—that's a colon in an object path) has been a lot of trouble. There's a list of problems it's caused on #920 (comment), including platform-dependent issues with C:\ on Windows.

On page 19 of this talk I analyzed a large dataset of user code to see how many people are using it, to see if we could ever get rid of it. The result was 10% of uproot.open calls explicitly use the colon and 64% are unknown because they pass in a variable as the filename.

So the path interpretation is unfortunately complex and we should leave it untouched if possible. It seems to me that it should be possible, since replacing the backend doesn't change the path-or-URL that we send to the backend.

@nsmith-
Copy link
Member

nsmith- commented Oct 5, 2023

Personally I'd be in favor of dropping url:filepath in favor of {"url": "filepath"} but I believe there was some discussion in the past on this.

@jpivarski
Copy link
Member

Might be worth reviewing all the supported syntaxes of files argument in https://uproot.readthedocs.io/en/latest/uproot._dask.dask.html as it is a bit more than uproot.open and thinking about what would make the most sense to normalize paths to something uniform.

I think we have consistency across the file-opening functions under control. File syntax is never interpreted differently by different file-opening functions, but some functions have more options than others.

  • uproot.open can only take one file, so it has
    • str/bytes (filepath-colon-objectpath)
    • pathlib.Path (filepath only)
    • object with read and seek methods (file-like object)
    • length-1 dict of filepath, objectpath (control where the split happens)
  • uproot.iterate can take multiple files, so it has
    • str/bytes (filepath-colon-objectpath)
    • pathlib.Path (filepath only)
    • glob syntax in str/bytes or pathlib.Path, including bash extensions (multiple filepaths, but only one objectpath)
    • any-length dict of filepath, objectpath (control where the split happens and get multiple filepaths, multiple objectpaths)
    • already-open TTree objects (to chain them)
    • iterables of the above (to chain them)
  • uproot.concatenate can also take multiple files, so it has
    • all the same options as uproot.iterate
  • uproot.dask can take multiple files and needs to partition them somehow, so it has
    • all the same options as uproot.iterate
    • any-length dict of filepath, dict of {"object_path": OBJECTPATH, "steps": STEPS} where OBJECTPATH is the objectpath and STEPS is either a list of offsets or a list of start-stop pairs (entry numbers for the partitions).

(uproot.dask has three ways to partition, but each one is mutually exclusive of the other two. If you try to use more than one, it will raise an error.)

The above is complex, but I believe that it is under control. Each one of these methods was motivated by a request (a long history...).

@lobis lobis changed the title feat: use fsspec to split url feat: refactor url - object split (motivated by fsspec integration Oct 6, 2023
@lobis lobis changed the title feat: refactor url - object split (motivated by fsspec integration feat: refactor url - object split (motivated by fsspec integration) Oct 6, 2023
@lobis lobis mentioned this pull request Oct 6, 2023
@lobis
Copy link
Collaborator Author

lobis commented Oct 6, 2023

To clarify (after an in-person discussion with @jpivarski I think there was some confusion):

  • This was motivated by the fsspec integration (Integration of fsspec #972).
  • I tried using the newly added fsspec source with some exotic urls such as github://scikit-hep:[email protected]/src/skhep_testdata/data/uproot-issue121.root but it didn't work due to the helper function not working correctly, this is the reason for this PR: make a more robust url / object split helper function.
  • I thought it made sense to refactor this function to use urllib as it's being used elsewhere in the code.
  • The new implementation is more concise and it looks like it handles all previous cases (atleast all the tests pass, if this is not the case we should add them).
  • I added a new test just to explicitly test this (as the github api url test had to be skipped due to api rate limits).
  • This is not just a refactoring, it should make this method correclty handle more cases and hopefully 100% cover previous ones.

Copy link
Member

@jpivarski jpivarski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We've been staring at this for a while, it should be okay. Let's try it out in production and find out!

(We'll know where to look if anyone runs into any new problems with colon-parsing.)

lobis added 2 commits October 19, 2023 11:54
…split

* origin/fsspec-urlsplit:
  feat: add support for current RNTuple files (#962)
  chore: update pre-commit hooks (#993)
  chore: add types to most of the `uproot.source` module (#996)
  fix: make hist import optional in test_0965 (#994)
  docs: add GaetanLepage as a contributor for test (#995)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants