-
@jcmkk3 First off, thanks for your candid, extensive and thoughtful feedback. All of us really appreciate that you took the time out of your day to write this up and share it with us. After reading your feedback, I think you are definitely in our target user group. I hope you don't mind, but I'll cherry-pick some of your points and discuss how we are thinking about addressing them.
This has definitely been a pain point for many users, and I don't think it's a stretch to say that it's painful for the maintainers as well. After all, the world is filled with CSVs (for better or worse!) and anyone doing anything with data can only avoid them for so long. One of the broad efforts we're hammering away at is making the kinds of tasks that feel extremely simple in pandas be as simple as possible with ibis. To address your specific example, here's how you'd go about doing that now:

```python
import ibis

con = ibis.connect("duckdb://")
t = con.register(path_to_csv)  # creates a view in duckdb
# t is an ibis expression
```

This can almost certainly be improved, and I know at least one other person (@saulpw) wants to see that happen as well.
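For instance, once the view exists you can build up a query and run it; the column names below are hypothetical, so substitute whatever is actually in your CSV:

```python
# filter, group, and count, then execute against duckdb
filtered = t.filter(t.year > 2000)  # `year` and `make` are made-up columns
expr = filtered.group_by("make").aggregate(n=filtered.count())
print(expr.execute())  # returns a pandas DataFrame
```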
This is a very useful and interesting piece of feedback. One of the newest features, available in the just-released 3.2.0 version of ibis, is `ibis.memtable`:

```python
import ibis

t = ibis.memtable(some_dataframe)
# do stuff with t, *including* .execute()
```

Out of the box, this will use duckdb to execute SQL against the DataFrame underlying the expression. So, barring an explicit choice of a different backend, you get fast local execution by default.
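As a minimal end-to-end sketch of that flow (the DataFrame contents here are arbitrary):

```python
import pandas as pd
import ibis

df = pd.DataFrame({"species": ["a", "a", "b"], "mass": [1.2, 3.4, 5.6]})
t = ibis.memtable(df)  # wrap the DataFrame in a table expression
expr = t.group_by("species").aggregate(mean_mass=t.mass.mean())
print(expr.execute())  # executed by duckdb behind the scenes
```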
Well, you've come to the right place then. I think we all share that vision :) The framing you've used here is really interesting. In the past we've thought about various ways of describing ibis, such as "write once, run anywhere" (ish), but "best-in-class speed and scale, with minimal code changes" allows for the fact that there are small differences between backends while keeping the kernel of the value proposition: the same API no matter what you're executing against.
Got it. This is a useful piece of feedback, and I think there's a bit of a split here. For the local experience, "tables" is really where it's at. When things go remote there's usually a catalog of things that you want to explore before you eventually get to a table :)
That is definitely clunky, and we'll work (and are working!) to make that better.
Very much agreed!
This can be improved fairly easily in the short term: we can explore adding support for ibis objects in seaborn, so that it'll call `.execute()` under the hood. In the long term, we'll need something like the DataFrame interchange protocol (as you suggest!) to make this work across the ecosystem more generally.
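In the meantime, the workaround is a single `.execute()` at the plotting boundary. A sketch with seaborn's objects API; the column names are hypothetical (borrowed from the r4ds examples):

```python
import seaborn.objects as so

df = t.execute()  # materialize the ibis expression as a pandas DataFrame
so.Plot(df, x="displ", y="hwy").add(so.Dot()).show()
```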
I think we are all pretty bullish on closing that gap :)
-
Thanks for your reply. I'm going to start looking for places where I could try it out for some real work in the future.
-
To recap and close this out: we have taken this feedback to heart and have since implemented many, many features (which have stabilized a bit) and bug fixes based on it. Here's a short list, keyed to the original points:
This is now about as easy as it could possibly be using `ibis.read_csv`.
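For example, using the path from the original post:

```python
import ibis

t = ibis.read_csv("data/cars.csv")  # a table expression, backed by duckdb by default
t.head().execute()
```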
Using the previous example as input, you can now use `ibis.connect("data/cars.csv")` directly; it spins up duckdb behind the scenes.
Definitely not ideal.
See the above answers! We also have a `tables` accessor on connections, so you can go straight to a table (e.g. `con.tables.cars`) without any dictionary mapping.
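A sketch of what that looks like, assuming a duckdb connection with the CSV registered under the name `cars`:

```python
import ibis

con = ibis.connect("duckdb://")
con.register("data/cars.csv", table_name="cars")
t = con.tables.cars  # straight to the table, no intermediary dict
```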
This isn't necessary anymore. Also, I'd avoid the pandas backend: it's slower and more memory-hungry than raw pandas.
We have the ability to read CSVs, TSVs, and Parquet files, locally or remotely, across a few different engines. Definitely open to adding more kinds of files.
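As a sketch of that surface area (the paths are illustrative, and remote reads depend on the engine's capabilities):

```python
import ibis

cars = ibis.read_csv("data/cars.csv")
trips = ibis.read_parquet("data/trips.parquet")
# remote paths can work too, depending on the engine, e.g.:
# taxi = ibis.read_parquet("s3://some-bucket/taxi.parquet")
```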
This is a bit trickier and requires support from the plotting library. We've worked with the folks behind the plotnine library to make it seamless, but other libraries like seaborn don't have their DataFrame-handling code decoupled from everything else as well as plotnine does. Efforts on this front will take longer to get to a better state.
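A sketch of the plotnine route; the columns are hypothetical, and materializing with `.execute()` first is the portable path if your plotnine/ibis versions don't yet accept expressions directly:

```python
from plotnine import aes, geom_point, ggplot

p = ggplot(t.execute(), aes(x="displ", y="hwy")) + geom_point()
p.save("cars.png")  # or just display `p` in a notebook
```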
-
I'm kind of a dataframe geek and I feel like I take Ibis for a spin about once every 6 months or so to see how it feels. I always see a lot to like, but I come away feeling like I'm not quite the target user. That is okay, but I think that some of my feedback could be helpful to you all anyways.
I've hoped that Ibis would arrive somewhere like dplyr/dbplyr, but with a different approach. For dplyr/dbplyr, the initial focus was on in-memory data analysis, then later expanding out that API to work as seamlessly as possible by querying remote datasets in databases. Ibis has started with a focus on the databases, but I hope that it will become as easy to use as dplyr/pandas for local in-memory data analysis.
It took a lot of trial and error to figure out how to load a CSV dataset to explore locally. I'm not even sure if the way I'm doing it is optimal. I've seen in the commits that you all are trying to make connecting to data easier. I saw that a new top-level `connect` option has been added and that the duckdb engine is being promoted as the default when the engine isn't specified and duckdb is able to load it. I was hoping that this would smooth out the loading experience I'd struggled with in the past. I tried `ibis.connect("data/cars.csv")`, but duckdb seemed to be expecting a database instead of a CSV file. My method ended up being to load it into pandas first, then `ibis.pandas.connect({"cars": cars})`.

As a local-first user, I don't really care what execution backend I use, but I'd like it to be as fast as possible and scale as much as possible for my machine. One of the biggest draws of Ibis for me is that it takes an API-first approach and is flexible in the computation backend. That feels like a great asset in the world of constantly evolving data systems. In my ideal world, I would be able to write analysis code in a consistent way and feel confident that the speed and scale are always best-in-class.
I'd like to think in "tables" instead of "databases". Having to create an intermediary dictionary mapping names to tables, then selecting the only table, seems clunky.
I'd also like to be able to read almost any sort of file type. I believe that this is one of the big reasons for pandas' success. The few lines of code, the simplicity, and the speed of getting data loaded are very satisfying.
Data manipulation and visualization go hand-in-hand as I'm trying to explore a new dataset. I was trying to follow along with https://r4ds.had.co.nz/data-visualisation.html using Ibis and seaborn's new objects API. It took me quite a while to figure out that I needed to `.execute()` the table in order to get it to work with seaborn. Maybe the dataframe interchange protocol will improve this in the future and it will work to just pass the table object.

Some of these friction points might just be user errors and possibly could be improved with documentation. I do wonder if it is expecting too much of Ibis to be able to perform both in-memory and remote database analytics well. I think that Ibis has really good bones, and it doesn't seem like a big leap to believe that it could work well for in-memory analytics using engines like duckdb, datafusion, or polars. Maybe it would be easier to have a sibling package (like dplyr and dbplyr) that would come with data loaders and make it as easy as pandas to get data loaded into memory and ready to analyze?