ggplot histograms vertical breaks fix #711
With the current change in this PR, I was able to make the following fix.
Summary
Used
Cause of the error
The last two bins are farther apart than the other bins. (For other bins, the distance is 180. For the last two, it is 360.) Thus, using
As you can see below, a similar error occurs in the test plot images.
Solution
@bbeat2782 did you validate (at least some) of the plots with R's ggplot? |
@edublancas Yes, I was able to replicate all the histograms' shapes through R ggplot2 using the bin create method we use in the |
As far as I know, there shouldn't be overlapping areas when plotting a histogram with one variable. Thus, for both cases ( And from my understanding, the narrower bar widths are correct because the larger bar widths are due to considering If any of the things I comment on is wrong, please correct me.
Could you please elaborate on what kind of example you are asking for? I'm a little confused on this point because it seems like the link already has examples of plotting different histograms. Thanks. |
I don't see one similar to the examples in the test cases (with overlay). Maybe it's because of the dataset. |
I think the change makes sense and it looks good. However, there are three cases where we calculate the width, and we might calculate it differently for some of them.
For example:
%sqlplot histogram --table penguins.csv --column body_mass_g --bins 300
Here, for example, we have one color without fill. In this case, when the bars overlap each other, the chart is actually easier to understand.
Based on @yafimvo's comment, I believe the one on the left is the current implementation and the one on the right is @bbeat2782's new implementation. I think we need to clarify the three possible ways to represent the bars:
I think the bar width calculation might differ depending on the user's settings, but ultimately we want all output histograms to be in scenario 3 (fitted). Going through how histograms are calculated and plotted can help: https://en.wikipedia.org/wiki/Histogram @bbeat2782 does this clarify things? |
If using |
By default, a histogram plotted by ggplot in R contains some whitespace between bins if the number of bins is large and some bins do not contain a data point. We can change the width of bins and remove the whitespace by passing in the @edublancas So I will work from there. Thanks for the clarification. |
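The whitespace effect mentioned here can be reproduced outside of ggplot. The following is a minimal sketch that uses NumPy's `numpy.histogram` on made-up values (not the penguins dataset, and not this project's binning code): with 300 bins and only a handful of observations, most bins are empty, so bars drawn at the computed bin width leave gaps between them.

```python
import numpy as np

# Made-up values for illustration only; this is not the penguins dataset.
values = np.array([2700, 2750, 2900, 3500, 3550, 4200, 4250, 4300, 5000, 6300])

# Split the data range into 300 equal-width bins.
counts, edges = np.histogram(values, bins=300)

# Bins with no observations show up as whitespace in the plot.
empty_bins = int(np.sum(counts == 0))
bin_width = float(edges[1] - edges[0])

print(f"bin width: {bin_width}")
print(f"empty bins: {empty_bins} of {len(counts)}")
```

With only ten observations, at most ten of the 300 bins can be non-empty, which is why gaps appear no matter how the bars are styled.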
ok, to ensure I understand what you'll do: you'll mimic what ggplot in R does (this means that in some cases there will be whitespace if data is missing in certain data ranges). you will not implement |
Yes
I was going to implement |
let's tackle |
After clarifying the following with edublancas
do you think we still need to tackle the 3 plots you mentioned from your first comment? |
@neelasha23 |
I think the mention notification got missed, so I'm commenting again with the re-review request. After clarifying the following with edublancas
do you think we still need to tackle the 3 plots you mentioned from your first comment? |
I'm not sure which 3 plots you are referring to. |
I think the code looks good. Just make sure the cases we test match the ggplot R outputs.
I ran a few tests and it seems like we're getting different results (left R, right JupySQL). I plotted body_mass_g and filled by species.
In R, the species with higher counts is Adelie; in ours, it is Chinstrap. Let's double-check this.
It might be better to create some fake data so that we know beforehand how many observations there are.
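One way to set up such fake data is sketched below. Everything in it is hypothetical: the rows, the bin width, the bin start, and the `bin_index` helper are chosen by hand for illustration and are not taken from the test suite. The point is that the expected count per species and bin is known before any plot is drawn, so the Y-axis values from either tool can be checked against it.

```python
from collections import Counter

# Hypothetical fake dataset: (species, body_mass_g) pairs with counts chosen
# by hand, so the expected histogram is known before plotting.
rows = (
    [("Adelie", 3000)] * 5
    + [("Adelie", 4000)] * 2
    + [("Chinstrap", 3000)] * 1
    + [("Chinstrap", 4000)] * 3
)

BIN_WIDTH = 1000  # assumed bin width for this sketch
BIN_START = 2500  # assumed left edge of the first bin


def bin_index(value):
    """Index of the fixed-width bin that contains value."""
    return (value - BIN_START) // BIN_WIDTH


# Count observations per (species, bin) pair.
counts = Counter((species, bin_index(mass)) for species, mass in rows)

expected = {
    ("Adelie", 0): 5,
    ("Adelie", 1): 2,
    ("Chinstrap", 0): 1,
    ("Chinstrap", 1): 3,
}
assert dict(counts) == expected
```

In this made-up data, Adelie has the higher count in the first bin and Chinstrap in the second, so a plot that swaps the two species would be caught immediately.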
Notebooks:
ggplot-notebooks.zip
I will check it. @edublancas
are you referring to their locations (which category is stacked on top of others)? |
no, I mean the counts, the values in the Y axis |
@edublancas As you said, binning in histograms works differently in R and in our implementation. Changing it would change many of the test histogram plots, and that change is about mimicking R ggplot behavior rather than fixing the vertical color breaks. Do you think it's better to change it here or in a different PR? |
@AnirudhVIyer: this PR is having the issue that you're having. @bbeat2782: the error that you're seeing is unrelated to your PR. I included a fix for this in #746, so once it's merged, you can rebase. The problem is an update to sqlglot. |
@bbeat2782 ok, let's keep this as is (once the tests pass, we can merge) and let's tackle the shape of the histogram in a new PR (opened #751) |
update changelog
test images changed
more image changes
ci
documentation and docstring change
ggplot wording change
Describe your changes
Fixed vertical color breaks in histograms by changing the method of getting bin width.
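To illustrate why the way the bin width is obtained matters, here is a minimal sketch based on the 180 vs. 360 spacing reported in the conversation. The bin positions are made up, and `width_from_last_two` and `width_from_min_gap` are hypothetical helper names, not functions from this codebase. It assumes that bins are returned only where data exists, so one gap is twice the regular spacing. Taking the smallest gap is just one possible alternative and is not necessarily the method this PR uses.

```python
import numpy as np

# Hypothetical bin start positions: regular spacing of 180, but one empty
# bin is missing, so the last two returned bins are 360 apart.
bin_starts = np.array([2700.0, 2880.0, 3060.0, 3240.0, 3600.0])


def width_from_last_two(starts):
    """Width taken from the last two bins only (fragile if a bin is missing)."""
    return float(starts[-1] - starts[-2])


def width_from_min_gap(starts):
    """Width taken as the smallest gap between consecutive bins."""
    return float(np.min(np.diff(starts)))


print(width_from_last_two(bin_starts))  # 360.0
print(width_from_min_gap(bin_starts))   # 180.0
```

A width of 360 applied to bars spaced 180 apart makes neighboring bars overlap, which is one way artifacts like the vertical breaks discussed in this thread could arise.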
Issue number
Closes #702
Checklist before requesting a review
pkgmt format
📚 Documentation preview 📚: https://jupysql--711.org.readthedocs.build/en/711/