Implement show_dataframe #177

Open
wants to merge 13 commits into
base: master

henrifroese
Collaborator

@henrifroese henrifroese commented Sep 6, 2020

This new function in the visualization module allows users to scroll through a DataFrame and search it. We believe this is much nicer than the built-in pandas printing and could be a heavily used function 🥈 . See this notebook for an example.

Internally, this works by creating an HTML DataTable. The relevant files are in the new texthero subfolder visualization-server, which implements an extremely lightweight way to create our visualizations. It's adapted from pyLDAvis and refactored/simplified by us. This folder can also be used for further texthero visualization functions in the future.
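
For readers unfamiliar with the approach, here is a minimal sketch of how a table can be turned into a standalone, searchable DataTable page. It is not the code from this PR: `build_datatable_html`, its parameters, and the CDN URLs are assumptions made for illustration, and the sketch takes plain column names and rows to stay dependency-free (with pandas, one could pass `df.columns` and `df.values.tolist()`).

```python
import html
import json

# Assumed CDN locations for jQuery and the DataTables plugin;
# the PR may bundle or pin different files.
JQUERY_JS = "https://code.jquery.com/jquery-3.5.1.min.js"
DATATABLES_JS = "https://cdn.datatables.net/1.10.21/js/jquery.dataTables.min.js"
DATATABLES_CSS = "https://cdn.datatables.net/1.10.21/css/jquery.dataTables.min.css"


def build_datatable_html(columns, rows, table_id="texthero-table", page_length=10):
    """Return a standalone HTML page showing `rows` as a searchable DataTable.

    Hypothetical helper: every cell is HTML-escaped and the whole dataset is
    embedded in the page, so paging and searching happen in the browser.
    """
    head = "".join(f"<th>{html.escape(str(c))}</th>" for c in columns)
    body = "".join(
        "<tr>" + "".join(f"<td>{html.escape(str(v))}</td>" for v in row) + "</tr>"
        for row in rows
    )
    options = json.dumps({"pageLength": page_length})
    return f"""<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<link rel="stylesheet" href="{DATATABLES_CSS}">
<script src="{JQUERY_JS}"></script>
<script src="{DATATABLES_JS}"></script>
</head>
<body>
<table id="{table_id}" class="display">
<thead><tr>{head}</tr></thead>
<tbody>{body}</tbody>
</table>
<script>
$(document).ready(function () {{
    $("#{table_id}").DataTable({options});
}});
</script>
</body>
</html>"""
```

Because all rows are written into the page, the size of the HTML grows linearly with the dataset.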

Note: travis/setup changed because of #171 . This is branched straight from master.

EDIT: still working on support for online Jupyter Notebooks (e.g. Colab) ✔️

@henrifroese
Collaborator Author

Finally fixed colab issues. Here is an example notebook. Below you can see a screenshot from the notebook.

Screenshot from 2020-09-06 20-34-32

@henrifroese henrifroese added the enhancement New feature or request label Sep 6, 2020
@henrifroese henrifroese marked this pull request as ready for review September 6, 2020 18:53
@henrifroese henrifroese requested a review from jbesomi September 6, 2020 18:53
@jbesomi
Owner

jbesomi commented Sep 7, 2020

Beautiful!!

  • How fast is the search with large datasets? Also, is the whole dataset loaded into the browser, or only the page we are looking at? (i.e. is that fast?)
  • It would be great if the column widths could adapt to the content. For instance, we don't need topic to be that wide (search Google for "html table adapt column width to content", for instance)
  • Are the left and right padding (margin) necessary?
  • We can also just call this function hero.show (we will accept both Series and DF) or even add it as custom accessor so that we can also call it with df.hero.show() (not sure about this approach, as it might confuse the users?)
  • Can we add a parameter used to define the maximum number of lines displayed for each text?
  • Is the whole web browser part of the code stable and safe?

@henrifroese
Collaborator Author

How fast is the search with large datasets? Also, is the whole dataset loaded into the browser, or only the page we are looking at? (i.e. is that fast?)

Yes, the whole dataset is loaded into the browser. Locally, this works for me up to around 50k rows; beyond that it gets really slow to load. Once it is loaded, the search is really quick. I think that changing this (i.e. only loading parts and dynamically loading more etc.) would be a huge amount of work that's probably out of scope at this point, but it might be interesting to revisit later. I think that for many users it's already very useful (as long as either the dataset is not very big or the machine is powerful, everything works great).
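
To make the "only loading parts" alternative concrete, here is a rough sketch of what the server would have to compute for each page request if DataTables ran in server-side mode. This is not part of the PR; `page_of_rows` is a hypothetical helper, and the response keys (`draw`, `recordsTotal`, `recordsFiltered`, `data`) follow my understanding of the DataTables server-side protocol.

```python
def page_of_rows(rows, draw, start, length, search=""):
    """Build a DataTables-style server-side response for one page.

    Hypothetical helper: `rows` is a list of row lists, `start`/`length`
    select the page, and `search` is a case-insensitive substring filter
    applied to every cell.
    """
    needle = search.lower()
    if needle:
        matches = [r for r in rows if any(needle in str(v).lower() for v in r)]
    else:
        matches = rows
    return {
        "draw": draw,                      # echoed back so the client can order replies
        "recordsTotal": len(rows),         # size of the unfiltered dataset
        "recordsFiltered": len(matches),   # size after applying the search
        "data": matches[start:start + length],  # only the requested page
    }
```

The trade-off is that every search and page change becomes a round trip to the Python process, which is part of why this is more work than embedding the whole table once.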

It would be great if the column widths could adapt to the content. For instance, we don't need topic to be that wide (search Google for "html table adapt column width to content", for instance)

That is by default enabled for datatables (see here), and it also works locally. It does not work in colab, not sure why. It's difficult to see why/how some HTML rendering fails in colab/jupyter 🦡

Are the left and right padding (margin) necessary?

Again, that's also only a colab problem and we're not sure how to solve this without a huge amount of work 😞

We can also just call this function hero.show (we will accept both Series and DF) or even add it as custom accessor so that we can also call it with df.hero.show() (not sure about this approach, as it might confuse the users?)

I agree that a custom accessor might be confusing; I think we could call it hero.show, but to me that sounds a little more general and hero.show_dataframe describes the function a little better 🐖

Can we add a parameter used to define the maximum number of lines displayed for each text?

Again, that's also quite difficult as far as I can see.
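
One rough approximation, which is not what this PR does, would be to clip each text on the Python side before the table is rendered. The sketch below uses the standard library's `textwrap.wrap`; `clip_text` and its defaults are hypothetical, and the number of lines actually shown in the browser would still depend on the column width.

```python
import textwrap


def clip_text(text, width=80, max_lines=3):
    """Wrap `text` to `width` characters and keep at most `max_lines` lines.

    Hypothetical helper: when the text is cut, the last kept line ends
    with " ..." so readers can see it was truncated.
    """
    lines = textwrap.wrap(
        str(text), width=width, max_lines=max_lines, placeholder=" ..."
    )
    return "\n".join(lines)
```

This caps the amount of text per cell but also means the search only sees the clipped text, so it is a trade-off rather than a full answer to the request.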

Is the whole web browser part of the code stable and safe?

Yes, very. Everything runs locally and we're basically only serving one HTML file with the datatable.
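
As an illustration of what "serving one HTML file locally" can look like, here is a minimal sketch using only Python's standard library. It is an assumption about the general pattern, not the PR's actual server code; `make_local_server` and `serve_in_background` are hypothetical names.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


def make_local_server(page_html):
    """Create an HTTP server on 127.0.0.1 that answers every GET with `page_html`."""
    payload = page_html.encode("utf-8")

    class OnePageHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            # Always return the same page, whatever path was requested.
            self.send_response(200)
            self.send_header("Content-Type", "text/html; charset=utf-8")
            self.send_header("Content-Length", str(len(payload)))
            self.end_headers()
            self.wfile.write(payload)

        def log_message(self, format, *args):
            pass  # suppress per-request logging to keep notebook output quiet

    # Port 0 lets the OS pick a free port; binding to 127.0.0.1 means the
    # server only accepts connections from the same machine.
    return HTTPServer(("127.0.0.1", 0), OnePageHandler)


def serve_in_background(server):
    """Run `server` in a daemon thread and return the thread."""
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    return thread
```

A caller would read the chosen port from `server.server_address[1]`, open `http://127.0.0.1:<port>/` in the browser, and call `server.shutdown()` when done.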

Summary:

We could probably spend a lot of time making this better for big data / ..., but that's probably out-of-scope for texthero; we think that the simple, relatively lightweight implementation here is good and other, more dynamic stuff would take lots and lots of effort spent on this one function. That's why we're a little hesitant to do that - we think it's already quite useful for most users, and still quite simple. Making it perfect would be interesting and fun, but also hard and time-consuming.

@jbesomi jbesomi marked this pull request as draft September 14, 2020 15:58
@mk2510 mk2510 marked this pull request as ready for review September 22, 2020 19:53
@mk2510
Collaborator

mk2510 commented Sep 22, 2020

We just went through this PR again and, from our side, it is ready for review 🍾 🍺 🍻

@henrifroese
Collaborator Author

TODO

  • make usable for bigger datasets
  • test in different environments
  • make better looking
