-
Notifications
You must be signed in to change notification settings - Fork 598
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
refactor(api): restrict arbitrary input nesting #8917
Conversation
@@ -293,6 +281,19 @@ def _bind_reduction_filter(self, where): | |||
|
|||
return where.resolve(self) | |||
|
|||
def bind(self, *args, **kwargs): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Needs docstrings and must be covered with tests, but first we should have a consensus on the API behaviour.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tests are there but the docstring is missing.
There are some ambiguities around the first arguments which can be a list of expressions: table.select(["a", ["b"]])
# constructs
r0 := UnboundTable: table
a int8
b int16
c int32
d int64
e float32
f float64
g string
h boolean
i timestamp
j date
k time
Project[r0]
a: r0.a
('b',): ['b']
|
Apparently we also have a backend test case using a mapping as a single positional argument. |
Why can't we keep t.select([t.foo], t.bar) ? It doesn't seem meaningfully different from allowing t.select([t.foo], bar=t.bar) |
If we allow: t.select([t.foo], t.bar) then should we allow t.select([t.foo], [t.bar]) as well? |
I think so. It's really things like |
My preference would be to either provide args as variadics or as a single list-like argument. Even then we have confusion when to treat list-like inputs as literals vs value list. |
We talked today during triage and decided that at minimum we want to remove the support for arbitrarily nested expressions (as Phillip noted above). I actually agree with the initial proposal at the top (only accept lists of column refs if select is a 1-arg function, otherwise treat it as variadic). Coincidentally this matches how polars With this PR, what happens here? Does this error, or does it silently coerce these to array literals? t.select(["a", "b"], ["c", "d"]) # is this two array literals or an error?
t.select("a", "b", ["c", "d"]) # is ["c", "d"] an array literal, or does this error? Either way, I think we definitely want to forbid the case that phillip brought up before 9.0. I'm not opposed to the full changes here either, I think they make things more consistent, but we can always do that later IMO in 10.0. |
t.select(["a", "b"], ["c", "d"]) # is this two array literals or an error? yes, two array literals, not erroring t.select("a", "b", ["c", "d"]) # is ["c", "d"] an array literal, or does this error? yes, an array literal, not erroring |
I would prefer deciding what should be the ideal long term behaviour then extend that to be backward compatible and either raise a deprecation warning or provide a config option to keep the previous behaviour. |
Let's keep the full changeset here, and make sure to have both a |
Another thing I'm not sure about is whether we should do the dereferencing here or not? Like should |
Seems like it might keep |
Actually it could help code consolidation. I can try it after the dereferencer improvement gets merged: |
# TODO(kszucs): should use (table, *args, **kwargs) instead to avoid interpreting | ||
# nested inputs | ||
def bind(table: Table, value: Any, int_as_column=False) -> Iterator[ir.Value]: | ||
def bind(table: Table, value) -> Iterator[ir.Value]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- privatize to _bind()?
- make the API distinction that this takes a single arg, and call this _bind_one()?
- t.bind(othertable) feels not obvious what is supposed to happen (t.bind(t) feels acceptable and we could keep it). Should we error if the two tables are different? They can always do t.bind(othertable.columns) to get the old behavior
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These don't seem like hard blockers (we try to reserve requesting changes for hard blockers).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that workflow makese sense, thanks! Indeed these requests aren't hard blockers. I think my request for test cases is a blocker for me though, see my comment down there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
t.bind()
also dereference the value to t
now. bind()
simply coerces the input to a value expression and remains a private function.
We should probably kill |
I'm going to rip out |
Updated the test case to expect an input error rather than an integrity error. We should update |
cda3364
to
4b9d544
Compare
6a92a74
to
472f5c2
Compare
@NickCrews Can you review this again? This is a remaining blocker for the 9.0 release which we'd like to get out early next week. |
Previously we allowed arbitrary nesting of input expressions for various API methods like `table.join`, `table.group_by`, etc. While this was backward compatible with `8.0.0` it also exposes additional ambiguity which can be confusing for users and difficult to reason about. This change restricts the nesting of input expressions to a "single level" of nesting with the exception of the first positional argument which can be a list of expressions, but only if there are no more positional arguments. Examples: ```python t.select([t.foo, t.bar]) # OK t.select(t.foo, t.bar) # OK t.select([t.foo], t.bar) # Error t.select(t.foo, name=t.bar) # OK t.select([t.foo], name=t.bar) # OK ```
I just took a long look. Most things look great, a few nits, but the one substantial thing is the |
@NickCrews Thanks for doing a deep dive on tests. There were indeed a few cases that weren't tested and also the |
Cool this looks great then! |
@NickCrews Can you approve the PR? |
Sorry I was on the mobile app where the approval status isn't even shown so i didn't know there was anything to do |
Previously we allowed arbitrary nesting of input expressions for various
API methods like
table.join
,table.group_by
, etc. While this wasbackward compatible with
8.0.0
it also exposes additional ambiguitywhich can be confusing for users and difficult to reason about.
This change restricts the nesting of input expressions to a
"single level" of nesting with the exception of the first positional
argument which can be a list of expressions, but only if there are no
more positional arguments. Examples:
Aims to resolve #7819 and #8304, depends on #8884