User experience feedback using Ibis as a general dataframe library #4542

jcmkk3 · 2022-09-20T20:23:55Z

jcmkk3
Sep 20, 2022

I'm kind of a dataframe geek and I feel like I take Ibis for a spin about once every 6 months or so to see how it feels. I always see a lot to like, but I come away feeling like I'm not quite the target user. That is okay, but I think that some of my feedback could be helpful to you all anyways.

I've hoped that Ibis would arrive somewhere like dplyr/dbplyr, but with a different approach. For dplyr/dbplyr, the initial focus was on in-memory data analysis, then later expanding out that API to work as seamlessly as possible by querying remote datasets in databases. Ibis has started with a focus on the databases, but I hope that it will become as easy to use as dplyr/pandas for local in-memory data analysis.

It took a lot of trial and error to figure out how to load a csv dataset to explore locally. I'm not even sure if the way I'm doing it is optimal. I've seen in the commits that you all are trying to make connecting to data easier. I saw that a new top-level connect option has been added and that the duckdb engine is being promoted as default when the engine isn't specified and duckdb is able to load it. I was hoping that this would smooth out my experience with loading data in the past.

I tried to ibis.connect("data/cars.csv"), but duckdb seemed to be expecting a database instead of a csv file. My method ended up being to load it into pandas first, then ibis.pandas.connect({"cars": cars}).

As a local first user, I don't really care what execution backend that I use, but I'd like it to be as fast as possible and scale as much as possible for my machine. One of the biggest draws for me to Ibis is that it takes an API first approach and is flexible in the computation backend. That feels like a great asset in the world of constantly evolving data systems. In my ideal world, I would be able to write analysis code in a consistent way and feel confident that the speed and scale are always best-in-class.

I'd like to think in "tables" instead of "databases". Having to create an intermediary dictionary mapping tables to names, then selecting the only table, seems clunky.

I'd also like to be able to read almost any sort of file type. I believe that this is one of the big reasons for pandas success. The lines of code, simplicity, and speed of getting data loaded is very satisfying.

Data manipulation and visualization go hand-in-hand as I'm trying to explore a new dataset. I was trying to follow along with https://r4ds.had.co.nz/data-visualisation.html by using Ibis and seaborn's new objects API. It took me quite a while to figure out that I needed to .execute() the table in order to get it to work with seaborn. Maybe the dataframe interchange protocol will improve this in the future and it will work to just pass the table object.

Some of these friction points might just be user errors and possibly could be improved with documentation. I do wonder if it is expecting too much of Ibis to be able to perform both in-memory and remote database analytics well. I think that Ibis has really good bones and it doesn't seem like a big leap to believe that it could work well for in-memory analytics using engines like duckdb, arrow-fusion, or polars. Maybe it would be easier to have a sibling package (like dplyr and dbplyr) that would come with data loaders and make it as easy as pandas to get data loaded into memory and ready to analyze?

cpcloud · 2022-09-21T11:03:32Z

cpcloud
Sep 21, 2022
Maintainer

@jcmkk3 First off, thanks for your candid, extensive and thoughtful feedback. All of us really appreciate that you took the time out of your day to write this up and share it with us.

After reading your feedback, I think you are definitely in our target user group.

I hope you don't mind, but I'll cherry-pick some of your points and discuss how we are thinking about addressing them.

It took a lot of trial and error to figure out how to load a csv dataset to explore locally.

This has definitely been a pain point for many users, and I don't think it's a stretch to say that it's painful for the maintainers as well. After all the world is filled with CSVs (for better or worse!) and anyone doing anything with data can only avoid them for so long.

One of the broad efforts we're hammering away at is making the kinds of tasks that feel extremely simple in pandas be as simple as possible with ibis.

To address your specific example, and how you'd go about doing that now:

con = ibis.connect("duckdb://")
t = con.register(path_to_csv)  # creates a view in duckdb
# t is an ibis expression

This can almost certainly be improved, and I know at least one other person @saulpw wants ibis.connect to grow the ability to handle "leaf" objects like tables in addition to retaining its current functionality. It's probably time we explored enabling that.

As a local first user, I don't really care what execution backend that I use

This is a very useful and interesting piece of feedback. One of the newest features, available in the just-released 3.2.0 version of ibis, is ibis.memtable which you can use like this:

import ibis

t = ibis.memtable(some_dataframe)
# do stuff with t, *including .execute*

Out of the box, this will use duckdb to execute SQL against the DataFrame underlying the expression. So, barring an ibis.connect('foo.csv') existing right now, you could at the very least use ibis.memtable(pd.read_csv(...)) to start getting more performance and scale out of CSVs.

In my ideal world, I would be able to write analysis code in a consistent way and feel confident that the speed and scale are always best-in-class.

Well, you've come to right place then. I think that we all share that vision :)

I think the framing you've used here is really interesting.

In the past we've thought about various ways of describing ibis, such as "write once, run anywhere" (ish), but the framing of "Best in class speed and scale, with minimal code changes" allows for the fact that there are small differences between backends while keeping the "kernel" of value of having the same API no matter what you're executing against.

I'd like to think in "tables" instead of "databases".

Got it. This is a useful piece of feedback and I think there's a bit of a split here. I think for the local experience "tables" is really where it's at. When things go remote there's usually a catalog of things that you want to explore, and of course eventually get to a table :)

Having to create an intermediary dictionary mapping tables to names, then selecting the only table, seems clunky.

That is definitely clunky, and we'll work (and are working!) to make that better.

Data manipulation and visualization go hand-in-hand as I'm trying to explore a new dataset.

Very much agreed!

I was trying to follow along with r4ds.had.co.nz/data-visualisation.html by using Ibis and seaborn's new objects API. It took me quite a while to figure out that I needed to .execute() the table in order to get it to work with seaborn.

This can be improved fairly easily in th short term. We can explore adding support for ibis things in seaborn, so that it'll call .execute() for you.

In the long term, we'll need something like the DataFrame protocol (as you suggest!) to make this work across the ecosystem more generally.

I do wonder if it is expecting too much of Ibis to be able to perform both in-memory and remote database analytics well.

I think we are all pretty bullish on closing that gap :)

0 replies

jcmkk3 · 2022-09-22T03:12:02Z

jcmkk3
Sep 22, 2022
Author

Thanks for your reply. The memtable tip was really helpful! That at least made it easy to get started with a pandas data set to kick the tires a little bit. Nothing groundbreaking here, but you can see my notebook where I experimented with Ibis (and seaborn objects a bit) to try to get a feel for it.

I'm going to start looking for places where I could try it out for some real work in the future.

4 replies

cpcloud Sep 22, 2022
Maintainer

Sweet, thanks!

Based on your feedback, two PRs have now landed in master and will be in the next release (4.0.0):

feat(api): implement __array__ #4547, which allows passing columns straight to matplotlib for maximum convenience. I will try out your notebook to see whether this can now be done without .execute()
feat(duckdb): make it more convenient to access files #4553, which makes ibis.connect('foo.csv') return a table for use with duckdb.

Let us know if there's anything else you'd like to see!

cpcloud Sep 22, 2022
Maintainer

I tried out the notebook with the latest ibis and it looks like the seaborn objects API is extremely specific to DataFrames, so just implementing __array__ doesn't appear to be enough.

Indeed there's an isinstance check on the first argument to Plot that checks that it's a mapping or pandas DataFrame 😞.

@mwaskom Where is the appropriate place to discuss opening up that API a bit so that Plot can handle anything implementing the __array__ interface?

mwaskom Sep 22, 2022

Missing some context here but I don’t immediately see how seaborn could accept __array__ instead of DataFrame. Does __array__ have the concept of named columns?

cpcloud Apr 6, 2023
Maintainer

@mwaskom Following up here, __array__ likely won't work, but if seaborn could call ibis expressions to_pandas() methods that would work. We already have plotnine integration this way so it might be worth looking into how they do it.

cpcloud · 2023-04-06T13:30:35Z

cpcloud
Apr 6, 2023
Maintainer

To recap and close this out: we have taken this feedback to heart and have since implemented many many features (which have stabilized a bit) and bug fixes based on this feedback.

Here's a short list based on the feedback:

I've hoped that Ibis would arrive somewhere like dplyr/dbplyr, but with a different approach. For dplyr/dbplyr, the initial focus was on in-memory data analysis, then later expanding out that API to work as seamlessly as possible by querying remote datasets in databases. Ibis has started with a focus on the databases, but I hope that it will become as easy to use as dplyr/pandas for local in-memory data analysis.

This is now about as easy as it could possibly be using ibis.memtable with the default backend:

In [3]: from ibis.interactive import *

In [4]: import pandas as pd

In [5]: df = pd.read_csv("/data/h-of-data-2022/noahs-customers.csv")

In [6]: df.head()
Out[6]:
   customerid                  name             address          citystatezip   birthdate         phone          timezone       lat      long
0        1001       Debbie Ferguson  2615 SE Dogwood Dr  Tuscaloosa, AL 35476  1959-08-20  251-435-4461   America/Chicago  33.23480 -87.52680
1        1002  Savannah Susan Chung       3204 Oak Rd E     Atlanta, GA 30314  1950-06-14  912-611-6026  America/New_York  33.76280 -84.42200
2        1003          Lauren Stout       70-28 71st St   Ridgewood, NY 11385  1982-09-13  315-597-4287  America/New_York  40.70072 -73.88048
3        1004       Ricky Mcfarland      1128 Park Dr S     Houston, TX 77032  1981-04-15  915-437-1627   America/Chicago  29.78600 -95.38850
4        1005       Gilbert Daniels      1103 Ogden Ave       Bronx, NY 10452  1997-07-16  585-425-7101  America/New_York  40.83322 -73.92665

In [7]: t = ibis.memtable(df)

In [8]: t.group_by(first_digit=_.phone[0]).count()
Out[8]:
┏━━━━━━━━━━━━━┳━━━━━━━┓
┃ first_digit ┃ count ┃
┡━━━━━━━━━━━━━╇━━━━━━━┩
│ string      │ int64 │
├─────────────┼───────┤
│ 2           │  1240 │
│ 9           │  3959 │
│ 3           │  5502 │
│ 5           │  7835 │
│ 8           │  5312 │
│ 7           │  2446 │
│ 6           │  4253 │
│ 4           │  1005 │
└─────────────┴───────┘

It took a lot of trial and error to figure out how to load a csv dataset to explore locally. I'm not even sure if the way I'm doing it is optimal. I've seen in the commits that you all are trying to make connecting to data easier. I saw that a new top-level connect option has been added and that the duckdb engine is being promoted as default when the engine isn't specified and duckdb is able to load it. I was hoping that this would smooth out my experience with loading data in the past.

Using the previous example as input, you can now use ibis.read_csv:

In [9]: from ibis.interactive import *

In [10]: t = ibis.read_csv("/data/h-of-data-2022/noahs-customers.csv")

In [11]: t.group_by(first_digit=_.phone[0]).count()
Out[11]:
┏━━━━━━━━━━━━━┳━━━━━━━┓
┃ first_digit ┃ count ┃
┡━━━━━━━━━━━━━╇━━━━━━━┩
│ string      │ int64 │
├─────────────┼───────┤
│ 3           │  5502 │
│ 9           │  3959 │
│ 4           │  1005 │
│ 5           │  7835 │
│ 8           │  5312 │
│ 6           │  4253 │
│ 7           │  2446 │
│ 2           │  1240 │
└─────────────┴───────┘

I tried to ibis.connect("data/cars.csv"), but duckdb seemed to be expecting a database instead of a csv file. My method ended up being to load it into pandas first, then ibis.pandas.connect({"cars": cars}).

Definitely not ideal. ibis.connect is more useful for databases. If you don't care about your backend, you can use ibis.read_csv and ibis will use the duckdb backend by default.

As a local first user, I don't really care what execution backend that I use, but I'd like it to be as fast as possible and scale as much as possible for my machine. One of the biggest draws for me to Ibis is that it takes an API first approach and is flexible in the computation backend. That feels like a great asset in the world of constantly evolving data systems. In my ideal world, I would be able to write analysis code in a consistent way and feel confident that the speed and scale are always best-in-class.

See the above answers!

We also have ibis.set_backend(...) if you want memtable et al to use a different backend to do the computation by default.

I'd like to think in "tables" instead of "databases". Having to create an intermediary dictionary mapping tables to names, then selecting the only table, seems clunky.

This isn't necessary anymore. Also, I'd avoid the pandas backend. It's slower and more memory hungry than raw pandas.

I'd also like to be able to read almost any sort of file type. I believe that this is one of the big reasons for pandas success. The lines of code, simplicity, and speed of getting data loaded is very satisfying.

We have the ability to read CSVs, TSVs, and Parquet, locally or remotely, across a few different engines. Definitely open to adding more kinds of files.

Data manipulation and visualization go hand-in-hand as I'm trying to explore a new dataset. I was trying to follow along with r4ds.had.co.nz/data-visualisation.html by using Ibis and seaborn's new objects API. It took me quite a while to figure out that I needed to .execute() the table in order to get it to work with seaborn. Maybe the dataframe interchange protocol will improve this in the future and it will work to just pass the table object.

This is a bit trickier and requires support from the plotting library. We've worked with the folks working on the plotnine library to make it seamless, but other libraries like Seaborn do not have their DataFrame-related code decoupled from everything else as well as plotnine does.

Efforts on this front will take longer to get to a better state.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

User experience feedback using Ibis as a general dataframe library #4542

{{title}}

Replies: 3 comments 4 replies

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

Select a reply

User experience feedback using Ibis as a general dataframe library #4542

jcmkk3 Sep 20, 2022

Replies: 3 comments · 4 replies

cpcloud Sep 21, 2022 Maintainer

jcmkk3 Sep 22, 2022 Author

cpcloud Sep 22, 2022 Maintainer

cpcloud Sep 22, 2022 Maintainer

mwaskom Sep 22, 2022

cpcloud Apr 6, 2023 Maintainer

cpcloud Apr 6, 2023 Maintainer

jcmkk3
Sep 20, 2022

Replies: 3 comments 4 replies

cpcloud
Sep 21, 2022
Maintainer

jcmkk3
Sep 22, 2022
Author

cpcloud Sep 22, 2022
Maintainer

cpcloud Sep 22, 2022
Maintainer

cpcloud Apr 6, 2023
Maintainer

cpcloud
Apr 6, 2023
Maintainer