fall back to to_pandas if interchange protocol fails #3534
Codecov Report

    @@           Coverage Diff           @@
    ##           master    #3534   +/-   ##
    =======================================
      Coverage   98.34%   98.34%
    =======================================
      Files          75       75
      Lines       24605    24617   +12
    =======================================
    + Hits        24197    24209   +12
      Misses        408      408
I've got no issues with this but it would be helpful to understand better when it's relevant. Is the idea that a library might have bugs in the hooks that |
Great, thanks!
A library can control its own In particular, if there's a datatype not supported by the consortium, then interchanging to pandas will raise regardless of whether that column will be used or not. It doesn't look like there's any real prospect of this changing (see data-apis/dataframe-api#288). Maybe I can get an extra argument into pandas (like, |
Thanks for explaining. TBH as an observer (but interested party) these nuances make it fairly confusing to understand what the "right" way to generically consume a dataframe object is. Not sure that that's really actionable feedback (and based on the linked thread it looks like you agree). So this feels slightly hacky but I think reasonable given the circumstances.
I think the ideal way would be something like this:

    columns = list(necessary_columns)
    df = pd.api.interchange.from_dataframe(
        df.__dataframe__().select_columns_by_name(columns)
    )

but I understand that knowing in advance exactly which columns you need might not be possible without a major refactor.
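A runnable version of that idea, as a minimal sketch. `interchange_subset` is a hypothetical helper name (not part of pandas or seaborn), and a pandas frame stands in for the foreign dataframe so that the snippet needs only pandas. The list-valued column `b` is never handed to the converter, because only `a` is selected on the interchange object first.

```python
import pandas as pd


def interchange_subset(data, columns):
    """Convert only `columns` of an interchange-capable object to pandas.

    Hypothetical helper for illustration; not part of pandas or seaborn.
    """
    # Select on the interchange object first, so that columns the caller
    # does not need are dropped before any conversion is attempted.
    subset = data.__dataframe__().select_columns_by_name(list(columns))
    return pd.api.interchange.from_dataframe(subset)


# Stand-in for a foreign dataframe: "b" holds lists, "a" holds plain integers.
source = pd.DataFrame({"a": [1, 2, 3, 4], "b": [[1, 2], [3], [4, 5, 6], []]})
result = interchange_subset(source, ["a"])
print(result)
```

This only helps when the caller already knows which columns it needs, which is the limitation noted above.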
I gather that a relevant example looks like this:

    import pandas as pd
    import polars as pl

    pol_df = pl.DataFrame({
        "a": pl.Series([1, 2, 3, 4]),
        "b": pl.Series([[1, 2], [3], [4, 5, 6], []]),
    })

    # Works, "b" column is object-typed
    pol_df.to_pandas()

    # Raises with "large_list<item: int64> not supported by interchange protocol"
    pd.api.interchange.from_dataframe(pol_df)

(Actually, first-class list-type support would be pretty handy for seaborn 🤔...) This makes sense to me as a current source of disagreement that we should work around.
Also removed the check for the to_pandas method since we don't need it given that we're passing on any errors.
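For readers following along, the fallback discussed in this thread can be sketched roughly as below. This is an illustration under assumptions, not the code in this PR: `as_pandas` is a hypothetical name, and catching a broad `Exception` from the interchange step is a guess at the intended behavior. There is no `hasattr(data, "to_pandas")` check, so an object without that method simply surfaces its own `AttributeError`.

```python
import pandas as pd


def as_pandas(data):
    """Return `data` as a pandas DataFrame (hypothetical helper, for illustration).

    Tries the dataframe interchange protocol first and falls back to the
    object's own `to_pandas()` method if the interchange step raises.
    """
    if isinstance(data, pd.DataFrame):
        return data
    if hasattr(data, "__dataframe__"):
        try:
            return pd.api.interchange.from_dataframe(data)
        except Exception:
            # For example, a column dtype the protocol does not support.
            pass
    # No explicit check for `to_pandas`: any error raised here is passed on.
    return data.to_pandas()
```

With a polars frame like the one in the example above, the interchange step would raise on the list column and `to_pandas()` would be used instead.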
I'm really hoping for a fix, because this is an annoying problem I sometimes encounter. For example, this works without

    import seaborn.objects as so
    import polars as pl

    mpg = pl.read_csv("https://raw.githubusercontent.com/tidyverse/ggplot2/main/data-raw/mpg.csv")
    (
        so.Plot(mpg, x="displ", y="hwy")
        .add(so.Dot())
    )

However, this does not work:

    economics = pl.read_csv("https://raw.githubusercontent.com/tidyverse/ggplot2/main/data-raw/economics.csv", try_parse_dates=True, dtypes={"pop": pl.Float32})
    (
        so.Plot(economics, x="date", y="uempmed")
        .add(so.Path())
    )
    (
        so.Plot(economics.to_pandas(), x="date", y="uempmed")
        .add(so.Path())
    )
hi @mwaskom - just for my understanding, is there anything holding this back? or will it go in with the 0.14 release?
Closes #3533