Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add join method to Url class #1378

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

Meetesh-Saini
Copy link

  • added support for URL path joining with optional trailing slashes and multiple arguments.

Change Summary

This PR implements a feature based on pydantic/pydantic#9794 to join URL path into the base URL. It uses the join method from the url crate.

Related issue number

fix pydantic/pydantic#9794

Checklist

  • Unit tests for the changes exist
  • Documentation reflects the changes where applicable
  • Pydantic tests pass with this pydantic-core (except for expected changes)
  • My PR is ready to review, please add a comment including the phrase "please review" to assign reviewers

- added support for URL path joining with optional trailing slashes and multiple arguments.
Copy link

codecov bot commented Jul 29, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 89.10%. Comparing base (ab503cb) to head (6a4fa06).
Report is 207 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1378      +/-   ##
==========================================
- Coverage   90.21%   89.10%   -1.11%     
==========================================
  Files         106      112       +6     
  Lines       16339    17892    +1553     
  Branches       36       40       +4     
==========================================
+ Hits        14740    15943    +1203     
- Misses       1592     1929     +337     
- Partials        7       20      +13     
Files with missing lines Coverage Δ
src/url.rs 98.43% <100.00%> (+0.11%) ⬆️

... and 52 files with indirect coverage changes


Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9b21b0f...6a4fa06. Read the comment docs.

Copy link

codspeed-hq bot commented Jul 29, 2024

CodSpeed Performance Report

Merging #1378 will not alter performance

Comparing Meetesh-Saini:dev-url-join (6a4fa06) with main (9b21b0f)

Summary

✅ 155 untouched benchmarks

Copy link
Contributor

@davidhewitt davidhewitt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR! The signature looks fine, though I think the implementation can be simpler.

src/url.rs Show resolved Hide resolved
src/url.rs Outdated Show resolved Hide resolved
Copy link
Contributor

@davidhewitt davidhewitt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this just needs test cases and then I would be happy to see this merged. Sorry for the long delay 😬

Meetesh-Saini and others added 2 commits September 28, 2024 01:25
- Refactor URL join function for better handling of relative paths
- Add tests for joining URLs with and without trailing slashes
- Cover various edge cases in line with URL specification that the previous function would fail to handle
@Meetesh-Saini
Copy link
Author

I've added tests and changed the implementation. Now, it only takes one argument instead of multiple.
I noticed that using multiple arguments can be confusing, especially when trailing_slash=False. Keeping it to one argument makes things clearer, similar to how servo-url and urllib.urljoin work.
For example, the user may expect either of the following. I wasn't sure which one to implement.

            a.join("e", "?q=1", "g", trailing_slash=False) 
                    │     │      │                         
                    └──┬──┘      │                         
                       ▼         │                         
              ┌───► "e?q=1"      │                         
              │        │         │                         
    No trailing slash  └────┬────┘                         
                            ▼                              
                           "g"                             
                                                       
                                                                                       
         a.join("e", "?q=1", "g", trailing_slash=False)
                 │     │      │                        
                 └──┬──┘      │                        
                    ▼         │                        
                 "e/?q=1"     │                        
                    │         │                        
                    └────┬────┘                        
                         ▼                             
                       "e/g"                           

@davidhewitt davidhewitt marked this pull request as ready for review October 21, 2024 11:53
Copy link
Contributor

@davidhewitt davidhewitt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for yet another long review cycle. I would like to just agree what the right default for trailing_slash is, and have a suggestion to avoid the cannot_be_a_base() restriction.

python/pydantic_core/_pydantic_core.pyi Outdated Show resolved Hide resolved
src/url.rs Outdated Show resolved Hide resolved
Copy link
Contributor

@davidhewitt davidhewitt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the many iterations here and slow reviews by me. I am happy with the design here now. I would like to see the __floordiv__ operator removed (see comment below), and then let's merge 👍

Comment on lines +158 to +164
fn __truediv__(&self, other: &str) -> PyResult<Self> {
self.join(other, true)
}

fn __floordiv__(&self, other: &str) -> PyResult<Self> {
self.join(other, false)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, sorry I missed these in the last round of review. I think the difference between the / and // operators here is subtle and hard to document.

I think better we just have /, and make it so that it matches the default of append_trailing_slash=False. This will also simplify testing, I think.

Suggested change
fn __truediv__(&self, other: &str) -> PyResult<Self> {
self.join(other, true)
}
fn __floordiv__(&self, other: &str) -> PyResult<Self> {
self.join(other, false)
}
fn __truediv__(&self, other: &str) -> PyResult<Self> {
self.join(other, false)
}

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay __floordiv__ can be removed but I feel the __truediv__ should have append_trailing_slash=True because this overloaded operator would likely be used to join multiple paths in shorter code. This behaviour would feel familiar to Python users, as it resembles pathlib's path joining.
For example,

a = Url("http://a")
print(a / "b" / "c" / "d")
# http://a/b/c/d/
a = Url("file:///home/user/")
print(a / "music" / "pop")
# file:///home/user/music/pop/

With append_trailing_slash=False it would instead result in http://a/d and file:///home/user/pop which I think is not what the user would expect.
I chose to add __floordiv__ too because it would simplify adding files at the end.

print(a / "dir" / "dir" / "dir" // "file.txt") # file:///home/user/dir/dir/dir/file.txt

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I see. Yikes, there are so many subleties here!

It seems to me that our .join() method really works like urllib.parse.urljoin when it comes to semantics, e.g.

>>> urllib.parse.urljoin("https://foo.com/a", "b")
'https://foo.com/b'

versus pathlib's

>>> pathlib.Path("/foo/a").joinpath("b")
PosixPath('/foo/a/b')

Given these are inconsistent, I think we should perhaps back away from trying to have pathlib-like semantics at all.

Would you be open to the idea of dropping the operators from the PR completely, so we can get .join() merged? We could then open a pydantic issue to discuss the design of the operators and move forward with an implementation when there's consensus?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternatively we could also have joinpath() which works like Pathlib and doesn't accept query string or fragments as the whole input?

And then could have / operator work like joinpath? 🤔

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

joinpath() would certainly make things cleaner. Should I implement joinpath() in this PR, or should we drop the operators for now and discuss it in the issues instead?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great question. I think I'd prefer we just had .join() here and worried about .joinpath() and the operators later. That said, there's potentially a desire to agree a sketch of the follow ups here. @pydantic/oss - any ideas?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think without comment from anyone else, let's just do .join() here and then follow-up with an issue in the main pydantic repo where we can discuss .joinpath() and operators.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be great to have more time to discuss the semantics (does it need to match urllib? What about other libraries like furl? Should be double check with the current RFCs? We should also check what was said in this discussion).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be great to have more time to discuss the semantics (does it need to match urllib? What about other libraries like furl? Should be double check with the current RFCs? We should also check what was said in this discussion).

Really sorry for the late response. The main URL joining part is handled by rust-url's join method which implements the WHATWG URL spec. So, the / operator semantics are handled by the libraries in our case, python libraries like urllib and furl. While furl provides / operator, urllib does not. urllib.parse.urljoin and furl.furl.join follows the RFC 3986 to resolve the new URL.

furl is using / operator for only adding path to URL like pathlib.Path.

I think the furl approach is good. I should not have written my function signature as signature=(path, append_trailing_slash=false) but rather as signature=(url, append_trailing_slash=false) because the argument can be a relative or absolute url, not just path.

IMO, we can have Url.join() for URL joining, similar to furl.furl.join, and Url.__truediv__ for the sole purpose of adding a path without a trailing slash, just like furl.furl.__truediv__.

@sydney-runkle
Copy link
Member

I think it would be great to have more time to discuss the semantics (does it need to match urllib? What about other libraries like furl? Should be double check with the current RFCs? We should also check what was said in this discussion).

Let's do this before moving forward

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Request to add URL.join() to correctly join/construct new URLs
4 participants