feat: refactor url - object split (motivated by `fsspec` integration) #976

lobis · 2023-10-05T14:55:48Z

Currently, the uproot/_util.py file contains some helper functions for things such as URL parsing and splitting the ROOT object path from the file path.

Apart from splitting the ROOT object from a URI, which is non-standard, fsspec should be able to handle everything else, so we could remove or simplify some helper functions.

Since this uses fsspec, which is not currently a dependency, my solution was to do the new URL split with fsspec only if it's installed and otherwise fall back to the old one, which may cause some problems.

lobis · 2023-10-05T15:47:18Z

Perhaps it would make sense to add fsspec as a dependency for this PR after the next major release? This would avoid potentially different behaviours for users with and without fsspec installed, and I guess it would also make the code easier to write and read, since we wouldn't have to handle both cases.

jpivarski

When this is done, fsspec will become a strict dependency.

> This would avoid having potentially different behaviours for users having fsspec installed or not

If that's possible, then you would be introducing a behavior change now for those users who happen to have fsspec installed for other reasons. We don't want the path-handling to change at all. Could you get more certainty about that?

Meanwhile, this is not a change that has to happen now (or ever, technically). The original splitting was using a Python standard library function. Even when we start relying on fsspec for all file backends, there's no reason we couldn't still use the standard urlparse for URL parsing. Does fsspec.utils.urlsplit do something special?

I'm on the fence about this one.

lobis · 2023-10-05T21:24:40Z

Yes, I think you are right: we should not use the fsspec URL parsing when we can just use urllib. In the end I refactored the helper method to use urllib and added an explicit test for it. The old method got confused by things such as a double : in the URL.

I feel that the new method is more concise, but I am worried that it does not fully reproduce the old one (everything looks good so far: all the tests pass).
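For readers following along, here is a minimal sketch of what a urllib-based file/object split could look like. It is not the code merged in this PR: the function name split_file_and_object, the example URLs, and the rule "the separator is the last colon after the scheme and host" are all assumptions made for illustration.

```python
from urllib.parse import urlparse


def split_file_and_object(path):
    """Hypothetical sketch: split 'file-or-URL:object' into (file, object or None)."""
    path = path.strip()
    parsed = urlparse(path)

    if len(parsed.scheme) > 1 and path[len(parsed.scheme):].startswith("://"):
        # Real URL: skip "scheme://netloc" so that colons in user:password
        # or host:port are never mistaken for the object separator.
        start = len(parsed.scheme) + 3 + len(parsed.netloc)
    elif len(path) > 1 and path[0].isalpha() and path[1] == ":":
        # Windows drive letter such as "C:": skip it.
        start = 2
    else:
        start = 0

    # Assumption: the object separator is the last colon in what remains.
    index = path.rfind(":", start)
    if index == -1:
        return path, None
    return path[:index].rstrip(), path[index + 1:].strip()


print(split_file_and_object("file.root:tree/branch"))
print(split_file_and_object("http://example.org:8080/file.root"))
print(split_file_and_object("github://some-org:some-repo@v1.0/data/file.root:tree"))
print(split_file_and_object(r"C:\data\file.root:tree"))
```

This sketch ignores query strings and fragments that contain colons, so it should be read as an illustration of the idea rather than as a complete replacement for the helper.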

nsmith- · 2023-10-05T21:41:50Z

It might be worth reviewing all the supported syntaxes of the files argument in https://uproot.readthedocs.io/en/latest/uproot._dask.dask.html, as it accepts a bit more than uproot.open, and thinking about what would make the most sense for normalizing paths to something uniform.

jpivarski · 2023-10-05T21:42:32Z

The colon-splitting between filename and object (not related to #974—that's a colon in an object path) has been a lot of trouble. There's a list of problems it's caused on #920 (comment), including platform-dependent issues with C:\ on Windows.

On page 19 of this talk I analyzed a large dataset of user code to see how many people are using it, to see if we could ever get rid of it. The result was 10% of uproot.open calls explicitly use the colon and 64% are unknown because they pass in a variable as the filename.

So the path interpretation is unfortunately complex and we should leave it untouched if possible. It seems to me that it should be possible, since replacing the backend doesn't change the path-or-URL that we send to the backend.
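To make the ambiguity concrete, the following sketch shows how splitting on a colon goes wrong for Windows paths and for URLs with a port. The helpers split_at_first_colon and split_at_last_colon and the example paths are hypothetical and exist only for this illustration; urlparse is the standard-library function mentioned earlier in this thread.

```python
from urllib.parse import urlparse


def split_at_first_colon(path):
    """Deliberately naive: treat the first colon as the separator."""
    head, sep, tail = path.partition(":")
    return (head, tail) if sep else (path, None)


def split_at_last_colon(path):
    """Deliberately naive: treat the last colon as the separator."""
    head, sep, tail = path.rpartition(":")
    return (head, tail) if sep else (path, None)


windows_path = r"C:\data\file.root:Events"
url_with_port = "https://example.org:8080/file.root"  # no object path at all

# The drive letter is mistaken for the file name.
print(split_at_first_colon(windows_path))

# The port and file path are mistaken for an object path.
print(split_at_last_colon(url_with_port))

# The standard library also reads the drive letter as a URL scheme.
print(urlparse(r"C:\data\file.root").scheme)
```

Neither naive rule works for both inputs, which is why the separator has to be interpreted with knowledge of URL schemes and drive letters.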

nsmith- · 2023-10-05T21:42:57Z

Personally I'd be in favor of dropping url:filepath in favor of {"url": "filepath"} but I believe there was some discussion in the past on this.

jpivarski · 2023-10-05T21:58:14Z

> Might be worth reviewing all the supported syntaxes of files argument in https://uproot.readthedocs.io/en/latest/uproot._dask.dask.html as it is a bit more than uproot.open and thinking about what would make the most sense to normalize paths to something uniform.

I think we have consistency across the file-opening functions under control. File syntax is never interpreted differently by different file-opening functions, but some functions have more options than others.

uproot.open can only take one file, so it has:
- str/bytes (filepath-colon-objectpath)
- pathlib.Path (filepath only)
- object with read and seek methods (file-like object)
- length-1 dict of filepath, objectpath (control where the split happens)

uproot.iterate can take multiple files, so it has:
- str/bytes (filepath-colon-objectpath)
- pathlib.Path (filepath only)
- glob syntax in str/bytes or pathlib.Path, including bash extensions (multiple filepaths, but only one objectpath)
- any-length dict of filepath, objectpath (control where the split happens and get multiple filepaths, multiple objectpaths)
- already-open TTree objects (to chain them)
- iterables of the above (to chain them)

uproot.concatenate can also take multiple files, so it has:
- all the same options as uproot.iterate

uproot.dask can take multiple files and needs to partition them somehow, so it has:
- all the same options as uproot.iterate
- any-length dict of filepath, dict of {"object_path": OBJECTPATH, "steps": STEPS} where OBJECTPATH is the objectpath and STEPS is either a list of offsets or a list of start-stop pairs (entry numbers for the partitions).

(uproot.dask has three ways to partition, but each one is mutually exclusive of the other two. If you try to use more than one, it will raise an error.)

The above is complex, but I believe that it is under control. Each one of these methods was motivated by a request (a long history...).
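As an illustration of the shapes listed above, the sketch below writes out a few of the accepted values as plain Python data, without calling uproot. The file names, the object name Events, and the helper steps_to_ranges are hypothetical. In particular, reading a list of offsets as consecutive partition boundaries is an assumption of this sketch, not something confirmed in this thread.

```python
from pathlib import Path

# Forms that uproot.open accepts for a single file (per the list above).
open_forms = [
    "file.root:Events",        # filepath-colon-objectpath
    Path("file.root"),         # filepath only
    {"file.root": "Events"},   # length-1 dict: explicit split
]

# uproot.dask form with explicit partitioning.
dask_files = {
    "file1.root": {"object_path": "Events", "steps": [0, 1000, 2000]},
    "file2.root": {"object_path": "Events", "steps": [[0, 500], [500, 750]]},
}


def steps_to_ranges(steps):
    """Hypothetical helper: normalize offsets or start-stop pairs to (start, stop) tuples."""
    steps = list(steps)
    if steps and all(isinstance(step, int) for step in steps):
        # Assumed meaning of offsets: consecutive partition boundaries.
        return list(zip(steps[:-1], steps[1:]))
    return [(int(start), int(stop)) for start, stop in steps]


for filepath, spec in dask_files.items():
    print(filepath, spec["object_path"], steps_to_ranges(spec["steps"]))
```

The dict forms are the ones that avoid colon parsing entirely, since the file path and the object path are passed as separate values.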

lobis · 2023-10-06T19:18:31Z

To clarify (after an in-person discussion with @jpivarski, I think there was some confusion):

- This was motivated by the fsspec integration (Integration of fsspec #972).
- I tried using the newly added fsspec source with some exotic URLs such as github://scikit-hep:[email protected]/src/skhep_testdata/data/uproot-issue121.root, but it didn't work because the helper function did not handle them correctly. That is the reason for this PR: a more robust URL / object split helper function.
- I thought it made sense to refactor this function to use urllib, as it's already used elsewhere in the code.
- The new implementation is more concise and appears to handle all previous cases (at least, all the tests pass; if any case is not covered, we should add a test for it).
- I added a new test to check this explicitly (the GitHub API URL test had to be skipped due to API rate limits).
- This is not just a refactoring: it should make this method correctly handle more cases and, hopefully, cover 100% of the previous ones.

jpivarski

We've been staring at this for a while; it should be okay. Let's try it out in production and find out!

(We'll know where to look if anyone runs into any new problems with colon-parsing.)

lobis added 3 commits October 5, 2023 09:52

use fsspec to split url

14a7286

add comment

72cdce7

fix bad strip

5a65189

lobis marked this pull request as ready for review October 5, 2023 16:15

lobis requested a review from jpivarski October 5, 2023 16:15

jpivarski reviewed Oct 5, 2023

lobis and others added 6 commits October 5, 2023 16:03

add test for path split object

cb7aaf8

Merge remote-tracking branch 'origin/main' into fsspec-urlsplit

b47dfad

* origin/main: test: use file in skhep-testdata for issue #121 (#973)

Update src/uproot/_util.py

c8c135a

Co-authored-by: Jim Pivarski <[email protected]>

use urllib instead of fsspec for url parsing

64a739b

move tests to dedicated file

0236fee

do not shadow object

933f5e3

lobis requested a review from nsmith- October 5, 2023 21:31

lobis added 3 commits October 5, 2023 19:42

add more test cases

d4adff9

fix test

2c72704

correctly strip obj

557033a

lobis changed the title ~~feat: use fsspec to split url~~ feat: refactor url - object split (motivated by fsspec integration Oct 6, 2023

lobis changed the title ~~feat: refactor url - object split (motivated by fsspec integration~~ feat: refactor url - object split (motivated by fsspec integration) Oct 6, 2023

lobis mentioned this pull request Oct 6, 2023

Integration of fsspec #972

Closed

lobis added 4 commits October 6, 2023 14:37

add type hinting

7adfced

add path regularization

1c2e9a3

update hint

ceb7139

remove type hinting

ed87fec

lobis and others added 5 commits October 12, 2023 18:11

Merge branch 'main' into fsspec-urlsplit

ecd2e22

Merge branch 'main' into fsspec-urlsplit

0ef9451

attempt to fix windows issue

aca2021

attempt to fix windows issue

94c56fd

Merge branch 'main' into fsspec-urlsplit

118713e

jpivarski approved these changes Oct 19, 2023

lobis added 2 commits October 19, 2023 11:54

add a few more tests

cc30432

lobis merged commit fdcbb8e into main Oct 19, 2023

lobis deleted the fsspec-urlsplit branch October 19, 2023 17:05

jpivarski mentioned this pull request Oct 24, 2023

File-object separator colon parsing PR #976 must be fixed before the next release #1006

Closed
