Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

De-functionalize query internals #989

Merged
merged 30 commits into from
Sep 30, 2024
Merged

Conversation

billylanchantin
Copy link
Contributor

@billylanchantin billylanchantin commented Sep 21, 2024

Description

This is step 1 of implementing the ideas from this PR:

From @josevalim:

We can make query not return a function and that should be a relatively small change. We could even make it so accessing a lazy dataframe returns lazy series, so we can build queries without writing S.col. The semantics are well defined internally, we just chose to not expose them.

Before if you wanted to use filter_with/2 and friends you had to write a callback. You can still do that. But now you can also do:

alias Explorer.{DataFrame, Query, Series}

df = DataFrame.new(a: [1, 2, 3])
qf = Query.new(df)

gt_1 = Series.greater(qf["a"], 1)
lt_3 = Series.less(qf["a"], 3)

df
|> DataFrame.filter_with(gt_1)
|> DataFrame.to_columns(atom_keys: true)
#=> %{a: [2, 3]}

df
|> DataFrame.filter_with(lt_3)
|> DataFrame.to_columns(atom_keys: true)
#=> %{a: [1, 2]}

df
|> DataFrame.filter_with(Series.and(gt_1, lt_3))
|> DataFrame.to_columns(atom_keys: true)
#=> %{a: [2]}

Changes

  • The _with functions now accept the outputs of their callbacks too
  • Added Explorer.Query.new/1
  • Renamed Backend.LazyFrame as Backend.QueryFrame
  • Added some clarification to the "Implementation details" section

This just makes the intended use clearer.
This makes it clearer it's a boolean.
This is in line with what we do elsewhere.
This is the primary change. Now `query/2` takes
the dataframe as an explicit argument intead of
an implicit, unhygienized variable.
lib/explorer/query.ex Outdated Show resolved Hide resolved
@josevalim
Copy link
Member

josevalim commented Sep 22, 2024

There is another concern here related to exposing Explorer.Backends.LazyFrame or whatever we are going to call it. We already to_lazy and that returns something different. Maybe we should instead have Explorer.Query.new(df) and we rename Explorer.Backends.LazyFrame to QueryFrame?

Then we change filter_with and friends to accept either an anonymous function or the result of a Explorer.Query.new (which will be a Explorer.Backends.QueryFrame). This way we keep everything related to the Explorer.Query API?

If so, I'd do the following changes:

  • PR 1: Rename Explorer.Backends.LazyFrame to QueryFrame
  • PR 2: Introduce Explorer.Query.new and allow filter_with to accept it as an argument beyond functions
  • PR 3: Add access behaviour
  • PR 4: Any pending style refactoring

@billylanchantin
Copy link
Contributor Author

@josevalim I think you're on the right track. Let me clarify my goal.

I want to make functionality like Polars expressions more 1st class. I want users to be able to directly create/manipulate an expression-like data structure much like you can do with an Ecto.Queryable. And I want this data structure to replace the callbacks that are the current go-between for our verb/verb_with pairs.

We kind of have expressions already in the form of Backend.LazySeries. But our API is designed such that they're only an implementation detail. I want to either expose them or replace them.

I like the idea of a Explorer.Query.new or similar that returns this new data structure. But I'm not sure renaming Backend.LazyFrame as Backend.QueryFrame is quite right. I think they serve different purposes.

Here's my understanding of the concepts at play:

Polars Explorer
DataFrame DataFrame
LazyFrame LazyFrame
Series Series
Expression LazySeries

All that said, I'm pretty open to other suggestions on how to make this work! As I found out in the other PR, it's a bit of a tricky needle to thread.

@josevalim
Copy link
Member

I think the table is missing one entry, which is that we need a QueryFrame that, when accessed, returns LazySeries. That’s different from a lazy frame (the result of DF.to_lazy)

@billylanchantin
Copy link
Contributor Author

Ok thanks this is good food for thought.

I think I need to play with a few options. I want to make sure QueryFrames feel ergonomic. And I'm not quite sure how the interact with the other 4 concepts yet.

Once I have something I'll write it up. I'll also maybe move this to an issue. I'm realizing I'm still too much in the designing phase.

@josevalim
Copy link
Member

I think this PR is almost there. To get started, you could:

  1. add Explorer.Query.new as an alias to LazyFrame.new
  2. Add new clauses to the _with functions
  3. Add the access behavior

And if you change nothing else, it should be what you want. Everything else I mentioned is cleanup/refactoring. It is just that I tend to think bottom-up. :)

@billylanchantin
Copy link
Contributor Author

billylanchantin commented Sep 24, 2024

I did the naive thing but then I realized that this is now possible:

alias Explorer.{DataFrame, Query, Series}

df1 = DataFrame.new(a: [1, 2, 3])
df2 = DataFrame.new(b: [4, 5, 6])

qf1 = Query.new(df1)
qf2 = Query.new(df2)

c_lazy = Series.add(qf1["a"], qf2["b"])

DataFrame.mutate_with(df1, c: c_lazy)
** (RuntimeError) Polars Error: not found: b: 'with_columns' failed
    (explorer 0.10.0-dev) lib/explorer/polars_backend/shared.ex:53: Explorer.PolarsBackend.Shared.apply_dataframe/4
    (explorer 0.10.0-dev) lib/explorer/polars_backend/data_frame.ex:659: Explorer.PolarsBackend.DataFrame.mutate_with/3
    (explorer 0.10.0-dev) lib/explorer/data_frame.ex:2983: Explorer.DataFrame.mutate_with/3
    iex:13: (file)

I'm not sure this is what we want. We'll probably at least want better error handling.

Copy link
Member

@josevalim josevalim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Beautiful! We only need docs and this is good to go IMO!

Copy link
Member

@philss philss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a nice improvement! 💯 :shipit:

@billylanchantin
Copy link
Contributor Author

Sorry for the delay!

Notes:

  • I renamed Explorer.Backend.LazyFrame as Explorer.Backend.QueryFrame as discussed.
    • My first draft of the docs were super confusing otherwise.
  • The "Implementation details" section is now kinda long. I'm happy to trim it if preferred :)

* Remove an extra "the"
* Reference the rewritten section in
  the `new/1` docs
lib/explorer/query.ex Outdated Show resolved Hide resolved
lib/explorer/query.ex Outdated Show resolved Hide resolved
lib/explorer/query.ex Outdated Show resolved Hide resolved
lib/explorer/query.ex Outdated Show resolved Hide resolved
@josevalim
Copy link
Member

Great @billylanchantin! I have added some feedback to the docs. In a nutshell, the implementation detail docs are going for too long and we have no documentation on how and why to use the Explorer.Query.new feature. I would focus on the user focusing API and minimize the changes to the implementation details. :)

@billylanchantin
Copy link
Contributor Author

Ok I think that's better. Thanks for the review :)

Do we still like the LazyFrame -> QueryFrame change? Or shall I revert that?

Copy link
Member

@josevalim josevalim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great. The name QueryFrame is also much better to me. My only remaining question is if we rename to rename LazySeries to QuerySeries but I think they are not equivalent to QueryFrame (as in you can actually use a LazySeries in most series operations, but you can't do that with a QueryFrame).

@billylanchantin
Copy link
Contributor Author

My only remaining question is if we rename to rename LazySeries to QuerySeries but I think they are not equivalent to QueryFrame (as in you can actually use a LazySeries in most series operations, but you can't do that with a QueryFrame).

Yes that's my take too.

"Query" is good in QueryFrame because all you can do is access, aka query, it. LazySeries represent computations, so "query" feels wrong to me.

@billylanchantin billylanchantin merged commit fe52352 into main Sep 30, 2024
3 checks passed
@billylanchantin billylanchantin deleted the bl-defunctionalize-query branch September 30, 2024 20:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants