Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ggplot histograms vertical breaks fix #711

Merged
merged 1 commit into from
Jul 24, 2023
Merged

ggplot histograms vertical breaks fix #711

merged 1 commit into from
Jul 24, 2023

Conversation

bbeat2782
Copy link

@bbeat2782 bbeat2782 commented Jul 5, 2023

Describe your changes

Fixed vertical color breaks in histograms by changing the method of getting bin width.

Issue number

Closes #702

Checklist before requesting a review


📚 Documentation preview 📚: https://jupysql--711.org.readthedocs.build/en/711/

@bbeat2782
Copy link
Author

With the current change in this PR, I was able to make the following fix.

Screen Shot 2023-07-05 at 8 46 43 AM

Summary

Used bin_size returned by _histogram() instead of calculating bin width with bins[-1] - bins[-2].

Cause of the error

After the query below,
Screen Shot 2023-07-05 at 12 55 25 PM

the last two bins are far away from each other more than the other bins. (For other bins, the distance is 180. For the last two, it is 360.)

Thus, using bins[-1] - bins[-2] to get width of bins during plotting causes the error described in #702.
Screen Shot 2023-07-05 at 12 58 03 PM

As you can see below, a similar error occurs in the test plot images.

Solution

_histogram() in plot.py computes bin_size with range_ / bins if min_ and max_ are numeric values. I thought using the bin_size computed here is more appropriate than bins[-1] - bins[-2], and this is the change I made in the plot.py.

@bbeat2782 bbeat2782 marked this pull request as ready for review July 5, 2023 20:26
@bbeat2782 bbeat2782 changed the title ggplot hist draft fix ggplot hist vertical breaks fix Jul 5, 2023
@bbeat2782 bbeat2782 changed the title ggplot hist vertical breaks fix ggplot histograms vertical breaks fix Jul 5, 2023
@edublancas
Copy link

@bbeat2782 did you validate (at least some) of the plots with R's ggplot?

@bbeat2782
Copy link
Author

@bbeat2782 did you validate (at least some) of the plots with R's ggplot?

@edublancas Yes, I was able to replicate all the histograms' shapes through R ggplot2 using the bin create method we use in the plot.py file. I didn't try all the color combinations because the colors are consistent between old and new images.

@neelasha23
Copy link

  1. For the histogram_numeric_categorical_combined_custom_multi_color, the bar width of color looks similar in the new image, but the bar of carat is much narrower. Is this expected? Same for many other images.
  2. In this src/tests/integration/baseline_images/test_questDB/custom_engine_histogram.png, the blue overlay bars seem gone in the new image.
  3. Can we add an example in the docs

@bbeat2782
Copy link
Author

@neelasha23

  1. For the histogram_numeric_categorical_combined_custom_multi_color, the bar width of color looks similar in the new image, but the bar of carat is much narrower. Is this expected? Same for many other images.
  2. In this src/tests/integration/baseline_images/test_questDB/custom_engine_histogram.png, the blue overlay bars seem gone in the new image.

As far as I know, there shouldn't be overlapping areas when plotting a histogram with one variable. Thus, for both cases (carat and bill_length_mm), I believe those overlapping areas should be gone.

And from my understanding, the narrower bar widths are correct because the larger bar widths are due to considering bins[-1] - bins[-2] as bar width when bins[-1] - bins[-2] is larger than distances between other two consecutive bins.

If any of the things I comment on is wrong, please correct me.

  1. Can we add an example in the docs

Could you please elaborate more on what kind of example you are asking? I'm a little bit confused on this point because it seems like the link already has examples of plotting different histograms. Thanks.

@neelasha23
Copy link

Could you please elaborate more on what kind of example you are asking? I'm a little bit confused on this point because it seems like the link already has examples of plotting different histograms. Thanks.

I don't see one similar to the examples in the testcases. (with overlay). Maybe it's because of the dataset

@bbeat2782

Copy link

@yafimvo yafimvo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the change makes sense and it looks good. However, there are three cases where we calculate the width, and we might calculate it differently for some of them.

For example:

%sqlplot histogram --table penguins.csv --column body_mass_g --bins 300

here for example we have 1 color without fill. In this case, when the bars are overlapping each other, it is actually easier to understand the chart.

image

@bbeat2782
Copy link
Author

@yafimvo

%sqlplot histogram --table penguins.csv --column body_mass_g --bins 300

here for example we have 1 color without fill. In this case, when the bars are overlapping each other, it is actually easier to understand the chart.

image

To check what I understand is correct, are you saying that when a user plots a histogram with 1 color and without fill (no stacked), the bars of histograms should overlap with each other no matter the number of bins?

However, there are three cases where we calculate the width, and we might calculate it differently for some of them.

And are these the other two cases you mentioned?

Screen Shot 2023-07-06 at 9 01 17 AM Screen Shot 2023-07-06 at 9 01 31 AM

@edublancas
Copy link

based on @yafimvo's comment, I believe the one on the left is the current implementation and the one on the right is @bbeat2782's new implementation:

I think we need to clarify the three possible ways to represent the bars:

  1. overlapping (example: bar 1 begins at x=0 and ends at x=1, while bar 2 begins at x=0.95, there is some overlap) - we don't want this
  2. spaced (example: bar 1 begins at x=0 and ends at x=1, while bar 2 begins at x=1.05, there is whitespace between the bars) - we don't want this
  3. fitted (example: bar 1 begins at x=0 and ends at x=1, while bar 2 begins at x=1) - we want this! strictly speaking, there is a slightly small overlap but that's fine, because we want the histogram to look "continuous"

I think depending on the user's settings, the bar width calculation might be different, but ultimately, we want all output histograms to be in scenario 3 (fitted).

going through how histograms are calculated and plotted can help: https://en.wikipedia.org/wiki/Histogram

@bbeat2782 does this clarify things?

@bbeat2782
Copy link
Author

bbeat2782 commented Jul 7, 2023

3. fitted (example: bar 1 begins at x=0 and ends at x=1, while bar 2 begins at x=1) - we want this! strictly speaking, there is a slightly small overlap but that's fine, because we want the histogram to look "continuous"

@edublancas

It clarifies that we want to have 3. fitted version of histograms.

The current implementation gives 1. overlapping histograms which produces the vertical color breaks.
My implementation gives the following, which is the 2. spaced version.
Screen Shot 2023-07-07 at 11 50 55 AM

So to confirm my understanding, are you saying that we would like to have all bins stand right next to each other (so that they look "continuous") even if it means the bins in the same histogram have different widths like the following image?
Screen Shot 2023-07-07 at 11 45 29 AM

@edublancas
Copy link

So to confirm my understanding, are you saying that we would like to have all bins stand right next to each other (so that they look "continuous") even if it means the bins in the same histogram have different widths like the following image?

If using %sqlplot histogram, I don't think there's a way for bins to have different width. but I'm unsure if this is a possible scenario with the ggplot API. is this possible? If so, it'd say let's stick with what ggplot in R does, as we want to mimic their API

@bbeat2782
Copy link
Author

is this possible? If so, it'd say let's stick with what ggplot in R does, as we want to mimic their API

By default, a histogram plotted by ggplot in R contains some whitespaces between bins if the number of bins is large and some bins do not contain a data point.
Screen Shot 2023-07-07 at 1 20 01 PM

We can change the width of bins and remove the whitespaces by passing in the breaks argument, which we don't have in our geom_histogram.py
Screen Shot 2023-07-07 at 1 14 29 PM

@edublancas So I will work from there. Thanks for the clarification.

@edublancas
Copy link

ok, to ensure I understand what you'll do:

you'll mimic what ggplot in R does (this means that in some cases there will be whitespace if data is missing in certain data ranges).

you will not implement breaks

@bbeat2782
Copy link
Author

@edublancas

you'll mimic what ggplot in R does (this means that in some cases there will be whitespace if data is missing in certain data ranges).

Yes

you will not implement breaks

I was going to implement breaks since it's in ggplot in R, but if we don't want it, I won't.

@edublancas
Copy link

let's tackle breaks in another PR. I'll open an issue

@bbeat2782
Copy link
Author

@yafimvo

After clarifying the following with edublancas

mimic what ggplot in R does (this means that in some cases there will be whitespace if data is missing in certain data ranges).

do you think we still need to tackle the 3 plots you mentioned from your first comment?

@bbeat2782
Copy link
Author

@neelasha23
Could you take a look at the changes I made based on your comment? Thanks!

neelasha23
neelasha23 previously approved these changes Jul 14, 2023
@bbeat2782 bbeat2782 requested a review from yafimvo July 15, 2023 04:39
@bbeat2782
Copy link
Author

@yafimvo

I think the mention notification got missed, so I'm commenting again with the re-review request.

After clarifying the following with edublancas

mimic what ggplot in R does (this means that in some cases there will be whitespace if data is missing in certain data ranges).

do you think we still need to tackle the 3 plots you mentioned from your first comment?

@yafimvo
Copy link

yafimvo commented Jul 16, 2023

@bbeat2782

do you think we still need to tackle the 3 plots you mentioned from your first comment?

I'm not sure which 3 plots you are referring to.
Generally, as discussed, I agree that we should mimic what ggplot R does and treat it as our source of truth. We should ensure that our plots resemble the ones generated by ggplot R, making sure they are easily readable, and present accurate information.

yafimvo
yafimvo previously approved these changes Jul 16, 2023
Copy link

@yafimvo yafimvo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the code looks good. Just make sure the cases we test match the ggplot R outputs.

@edublancas edublancas dismissed stale reviews from yafimvo and neelasha23 via 1397e65 July 16, 2023 22:47
Copy link

@edublancas edublancas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ran a few tests and seems like we're getting different results (left R, right JupysQL). I plotted body_mass_g and fill by species.

image

in R, the species with higher counts is Adelie, in ours is Chinstrap. Let's double check this.

it might be better to create some fake data to ensure we know beforehand how many observations there are.

Notebooks:
ggplot-notebooks.zip

@bbeat2782
Copy link
Author

I will check it.

@edublancas
Just to make sure, when you said

in R, the species with higher counts is Adelie, in ours is Chinstrap

are you referring to their locations (which category is stacked on top of others)?

@edublancas
Copy link

are you referring to their locations (which category is stacked on top of others)?

no, I mean the counts, the values in the Y axis

@bbeat2782
Copy link
Author

bbeat2782 commented Jul 19, 2023

@edublancas As you said, the binning in histogram works differently in R and our implementation. Thus, changing it will change lots of test histogram plots that is related to mimicking R ggplot behavior instead of fixing the vertical color breaks. Do you think it's better to change it here or in a different open PR?

@edublancas
Copy link

edublancas commented Jul 20, 2023

@AnirudhVIyer: this PR is having the issue that you're having.

@bbeat2782: the error that you're seeing is unrelated to your PR

I included a fix for this in #746, so once it's merged, you can rebase - the problem is an update to sqlglot

@edublancas
Copy link

@bbeat2782 ok, let's keep this as is (once the tests pass, we can merge)

and let's tackle the shape of the histogram in a new PR (opened #751)

update changelog

test images changed

more image changes

ci

documentation and docstring change

ggplot wording change
@edublancas edublancas merged commit a2fcc39 into ploomber:master Jul 24, 2023
22 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

problem with ggplot histograms
4 participants