Redesign the public dashboard to be more compact and to support error bars #86

Closed
shankari opened this issue Jul 11, 2023 · 48 comments

Comments

@shankari
Contributor

The public dashboard (e.g. https://open-access-openpath.nrel.gov/public/ or https://durham-openpath.nrel.gov/public/) currently represents mode and mileage shares as pie charts. Apparently, per visualization best practices, pie charts are bad. I am not 100% convinced about this, but I am not a visualization expert either.

In the future, we also want to be able to represent uncertainty in the metrics (as part of the "count every trip" project), presumably through something like error bars. We want to redesign the metrics to meet these dual goals.

One option is to use stacked bar charts, which are widely promoted as the replacement for pie charts (although we would need to think about how to represent error bars in that case). I am open to other visualizations, and would like to see a simple comparison of potential replacements and their pros/cons.

I will also point out that there is an existing implementation of the metrics using stacked bar charts in #78 so if we do choose to go with that, the implementation should be pretty simple and not take much time.

This was originally tracked in #83, but we are moving the discussion here because that one got too unwieldy.

@shankari shankari changed the title Redesign the dashboard to be more compact and to support error bars Redesign the public dashboard to be more compact and to support error bars Jul 11, 2023
@iantei
Contributor

iantei commented Oct 19, 2023

This article explains why the pie chart is not the preferred visualization tool: https://www.businessinsider.com/pie-charts-are-the-worst-2013-6

Before exploring alternatives to the pie chart, it's important to understand the requirements behind the change.
From my understanding, we want to represent uncertainty in the metrics alongside show the proportion to the whole representation.
From the previous design discussion, stacked bar chart seems like a good alternative, but I feel it will not be able to depict the uncertainty with error bars well since it would overlap. And moreover, in cases where we have many labels to present for a single metric. Stacked bar chart might not provide right clarity.
This question illustrates the possible issue with combining a stacked bar chart with error bars: https://bioinformatics.stackexchange.com/questions/11222/stacked-bargraph-with-error-bars
Another reference: https://stackoverflow.com/questions/58603380/how-to-organize-error-bars-to-relevant-bars-in-a-stacked-bar-plot-in-ggplot

@iantei
Contributor

iantei commented Oct 19, 2023

Screenshot 2023-10-19 at 3 29 59 PM
The above screenshot compares the same metric, "Number of Trips", for different dates. Comparing individual slices across these four charts is not easy.

@shankari
Contributor Author

shankari commented Oct 22, 2023

The above screenshot compares the same metric, "Number of Trips", for different dates. Comparing individual slices across these four charts is not easy.

That is indeed the argument for using stacked bar charts. What is your point here?

From my understanding, we want to represent uncertainty in the metrics alongside show the proportion to the whole representation.

I am not sure what "show the proportion to the whole representation." means. We do want to represent the uncertainty, and we want to give people a quick glance of the relative proportions displayed. That is why non-stacked bar charts are not a great option.

And moreover, in cases where we have many labels to present for a single metric. Stacked bar chart might not provide right clarity.

So what is your suggestion?

Note that a cursory search may not give you the perfect answer, particularly for a negative result (i.e. "error bars on stacked bar charts don't work"). All that means is that the person who responded didn't know how to make it work.

Proving a negative takes a lot more than 2 SO links.

@iantei
Contributor

iantei commented Oct 24, 2023

That is indeed the argument for using stacked bar charts. What is your point here?
I was just justifying why moving from pie charts to bar charts is a better idea.

And moreover, in cases where we have many labels to present for a single metric. Stacked bar chart might not provide right clarity.

So what is your suggestion?

I was leaning more toward using plain bar charts instead of stacked bar charts, but then we would not be able to show the relative proportions, which is equally important.
My only reservation with the stacked bar chart is with scenarios where we have many labels, we might not be able to illustrate the error bars properly.

Note that a cursory search may not give you the perfect answer, particularly for a negative result (i.e. "error bars on stacked bar charts don't work"). All that means is that the person who responded didn't know how to make it work.

Combining values and limiting the number of different labels might be a solution. Let us consider two scenarios:

  1. Scenario 1:
    The distribution of labels are "Walk: 90%", "Bus: 2%", "Car: 2%", "Moped: 1%", "Others: 5%".
    Here, we have limited the labels to 5, so other labels would be clubbed with "Others".
    How can we effectively visualize data when the quantity of one specific "label" is significantly smaller than the others, to the extent that it needs to be grouped as "Others"? This will be an issue if the user is interested in this specific label.

  2. Scenario 2:
    In a scenario where the label distribution is heavily skewed, such as "Walk: 98%", "Bus: 1%", "Car: 0.5%", "Airplane: 0.25%", and "Others: 0.25%", it becomes challenging to effectively visualize error bars within a stacked bar chart. Would this not be an issue with the use of a stacked bar chart with error bars?

@iantei
Contributor

iantei commented Oct 24, 2023

  1. Scenario 1:
    The distribution of labels are "Walk: 90%", "Bus: 2%", "Car: 2%", "Moped: 1%", "Others: 5%".
    Here, we have limited the labels to 5, so other labels would be clubbed with "Others".
    How can we effectively visualize data when the quantity of one specific "label" is significantly smaller than the others, to the extent that it needs to be grouped as "Others"? This will be an issue if the user is interested in this specific label.

One possible solution to this could be to use tooltip such that user can hover over the Others section and see the distribution of labels. In this way, the representation of information is better.

@iantei
Contributor

iantei commented Oct 24, 2023

In relation to the design presented in the below link:
#83 (comment)

The proposed design is to bin different charts into one on the basis of some common parameter.
For example,
Group 1 : (number of trips):
number of trips for each mode
number of commute trips for each mode
number of trips under 10 miles for each mode

I understand you've a better understanding of the usage of these charts, but I am just trying to understand if this is what the end-user of public-dashboard would like to see? Would it be appropriate to involve them or their inputs in the redesign process of public dashboard?
My rationale is that if the end-users are not comparing the charts that we plan to bin, binning would just lead to a cluttered visualization.

@iantei
Contributor

iantei commented Oct 25, 2023

Screenshot 2023-10-19 at 3 29 59 PM

We’re using date-specific snapshots of charts, meaning we have a chart for a particular date with the associated metrics.
Currently, to understand the change in proportions over time, a user has to open multiple charts, place them next to one another, and compare.

So binning multiple charts into a single stacked bar chart might make it harder to compare the individual charts.

This might involve quite a bit of changes and not sure how relevant it would be to the requirements from the end-users, but if we could provide a timeline of changes - like using a stream chart as showcased here. The users might be able to compare different variation of the metric over a period of time more easily.

So, the question is: is it more relevant for users to compare two similar charts/metrics, or to understand the change in an individual metric over a period of time?

@iantei
Contributor

iantei commented Oct 25, 2023

Currently, whenever there is an issue with plotting a particular chart for a selected metric and date, we depict it as follows:
image

I think this takes up unnecessary real estate on the public-dashboard page.

Now we are planning to move away from pie chart and likely into stacked bar chart. In this case, there could be similar scenario, where certain individual chart inside binned charts cannot be generated. What would be the right approach to address it?

@iantei
Contributor

iantei commented Oct 25, 2023

I will also point out that there is an existing implementation of the metrics using stacked bar charts in #78 so if we do choose to go with that, the implementation should be pretty simple and not take much time.

I couldn't identify the actual implementation of stacked bar charts in #78 . I looked into the plots.py file for different charts implementations, there are three variations of bar charts:


1. def barplot_mode(data,x,y,plot_title,file_name)
2. def barplot_mode2(data,x,y,y2,plot_title,file_name)
3. def barplot_day(data,x,y,plot_title,file_name)
 and also a histogram - def proportion_hist_plot(data,x_col,plot_title,ylab,file_name)

But I could not find any implementation of stacked bar charts. Could you please tell me if I am looking in the wrong file/place?

@shankari
Contributor Author

Check the notebooks. There should be a new notebook that is not in master.
The feature never made it to production, so it was never put into plots.py

@shankari
Contributor Author

wrt questions like

Now we are planning to move away from pie chart and likely into stacked bar chart. In this case, there could be similar scenario, where certain individual chart inside binned charts cannot be generated. What would be the right approach to address it?

Figure it out! Read up on visualization materials, including any classes that you took, and make a couple of proposals, write out their pros and cons and then we can pick between them.

@shankari
Contributor Author

shankari commented Oct 25, 2023

One possible solution to this could be to use tooltip such that user can hover over the Others section and see the distribution of labels. In this way, the representation of information is better.

We cannot use tooltips. The plots are static images. Please remember the software architecture.

@shankari
Contributor Author

shankari commented Oct 25, 2023

My only reservation with the stacked bar chart is with scenarios where we have many labels, we might not be able to illustrate the error bars properly.
How can we effectively visualize data when the quantity of one specific "label" is significantly smaller than the others, to the extent that it needs to be grouped as "Others"? This will be an issue if the user is interested in this specific label.

This is the same behavior as the current pie chart, so I don't see any issue with it.

I understand you've a better understanding of the usage of these charts, but I am just trying to understand if this is what the end-user of public-dashboard would like to see? Would it be appropriate to involve them or their inputs in the redesign process of public dashboard?

The end-users are relying on us as technical/visualization experts, so you should figure out what is recommended from a visualization perspective (not easiest to implement, but most correct from a visualization perspective) and then we can see if we have any feedback. We should probably start with feedback from the internal team (UI team) and then can maybe ask others in the group.

This might involve quite a bit of changes and not sure how relevant it would be to the requirements from the end-users, but if we could provide a timeline of changes - like using a stream chart as showcased here. The users might be able to compare different variation of the metric over a period of time more easily.

We do have timeseries plots right now to see the variance over time, but they are typically for a single metric, which makes them not super confusing, and allows us to support error bars. The problem with the stream chart is that it can get very confusing very quickly. We had people try multiple superimposed timeseries (with and without error bars, #49) before and it was very very messy.

ebike usage over time income line conf

We may want to include those as well although again as a stacked chart, but I think it would be too confusing for most of our partners.

@iantei
Contributor

iantei commented Oct 31, 2023

Check the notebooks. There should be a new notebook that is not in master.
The feature never made it to production, so it was never put into plots.py

Found it!

width = 0.5
fig, ax = plt.subplots(1,1, figsize=(15,6))
running_total_mini = [0,0]
running_total_long = [0,0]
fig_data_mini = plot_data[plot_data['Dataset']=='Minipilot']
fig_data_long = plot_data[plot_data['Dataset']=='Long Term']

for mode in pd.unique(fig_data_mini.Mode):
    mini = fig_data_mini[fig_data_mini['Mode']==mode]
    long = fig_data_long[fig_data_long['Mode']==mode]

#     labels = mini['Trip Type']
#     vals = mini['Proportion']*100
#     vals_str = [round(v,1) if v>1 else '' for v in vals]
#     bar = ax[0].barh(labels, vals, width, left=running_total_mini, label=mode)
#     ax[0].bar_label(bar, label_type='center', labels=vals_str, rotation=90)
#     running_total_mini[0] = running_total_mini[0]+vals.iloc[0]
#     running_total_mini[1] = running_total_mini[1]+vals.iloc[1]

    labels = long['Trip Type']
    vals = long['Proportion']*100
    vals_str = [round(v,1) if v>1 else '' for v in vals]
    bar = ax.barh(labels, vals, width, left=running_total_long, label=mode)
    ax.bar_label(bar, label_type='center', labels=vals_str, rotation=90)
    running_total_long[0] = running_total_long[0]+vals.iloc[0]
    running_total_long[1] = running_total_long[1]+vals.iloc[1]

file_name='CanBikeCO_report_mode_share%s'%file_suffix
ax.set_title('Mode Share')
ax.legend(bbox_to_anchor=(1,1), fancybox=True, shadow=True)
plt.subplots_adjust(bottom=0.25)
fig.tight_layout()
plt.show()
fig.savefig(SAVE_DIR+file_name+".png", bbox_inches='tight')

image

@iantei
Contributor

iantei commented Oct 31, 2023

Summary of my investigation into part-to-whole charts (composition charts):

They fall into three categories, based on how each category of data is represented.

  1. Pie Chart or Donut Chart
    Pie charts and donut charts are almost identical.
    A pie chart represents the categories of data as sectors of a circle, while a donut chart represents them as sectors of a ring, leaving space in the middle for extra information. Adding error bars or uncertainty to either does not seem intuitive.
  2. Stacked Bar Charts or 100% Stacked Bar Charts
    These use bars to represent the data in different categories. We can incorporate error bars to show the uncertainty.
  3. Tree Map or Marimekko Chart (Mosaic Chart) or Waffle Chart
    A tree map is used to represent hierarchical data, which is not really applicable in our case.
    A Marimekko chart is similar to a stacked bar chart, except that its x-axis can capture another dimension of the values.
    A waffle chart uses blocks to represent each part of the whole. It will not make our visualization more compact, and representing error bars on it does not look intuitive.

As mentioned in the previous design suggestion, stacked bar charts sound like a good candidate for representing a part-to-whole relationship while also representing the uncertainty with error bars.
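To make the "stacked bars with error bars" idea concrete, here is a minimal matplotlib sketch. The mode names, shares, and uncertainties are made up for illustration, and drawing each error bar at the right edge of its segment (via `xerr` on `barh`) is only one possible convention, not a settled design.

```python
import io

import matplotlib
matplotlib.use("Agg")  # render off-screen, since the dashboard plots are static images
import matplotlib.pyplot as plt
from matplotlib.patches import Patch

# Made-up shares (%) and uncertainties (percentage points), for illustration only
modes = ["walk", "bike", "car"]
shares = {"All trips": [50.0, 30.0, 20.0], "Commute trips": [20.0, 35.0, 45.0]}
errors = {"All trips": [4.0, 3.0, 2.0], "Commute trips": [5.0, 4.0, 6.0]}

colors = plt.get_cmap("tab10").colors
fig, ax = plt.subplots(figsize=(10, 3))
for row, (name, vals) in enumerate(shares.items()):
    left = 0.0  # running total: where the next segment of this bar starts
    for i, (val, err) in enumerate(zip(vals, errors[name])):
        # xerr draws a horizontal error bar at the right edge of the segment
        ax.barh(row, val, height=0.5, left=left, color=colors[i], xerr=err, capsize=3)
        left += val

ax.set_yticks(range(len(shares)))
ax.set_yticklabels(list(shares))
ax.set_xlabel("Proportion (%)")
ax.legend(handles=[Patch(color=colors[i], label=m) for i, m in enumerate(modes)],
          bbox_to_anchor=(1, 1))
fig.tight_layout()

# Write the PNG to an in-memory buffer; a file path would work the same way
buf = io.BytesIO()
fig.savefig(buf, format="png", bbox_inches="tight")
```

Note that in this sketch the error bar of the last segment extends past 100%, and error bars of neighbouring thin segments can overlap, which is the crowding concern raised earlier in this thread.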

References:

  1. https://datavizproject.com/function/part-to-whole/
  2. https://blog.hubspot.com/marketing/types-of-graphs-for-data-visualization#sm.00000rjskeeastdqstj1u6344w1xx
  3. https://medium.com/@wenzhao.li1989/top-50-matplotlib-visualizations-the-master-plots-with-full-python-code-f4f110611257
  4. https://guides.temple.edu/c.php?g=939377&p=6769862
  5. Details about different types of Stacked Bar Charts https://www.sciencedirect.com/science/article/pii/S2468502X18300287

@iantei
Contributor

iantei commented Oct 31, 2023

We want to use a 100% stacked bar chart to show the part-to-whole relationship. But there is an issue when trying to bin the existing multiple pie charts into one stacked bar chart.

The two charts below are not direct translations of one another; rather, they represent a similar dataset as two pie charts and as a single 100% stacked bar chart.

Number of Trips in different Pie Charts 100% Stacked Chart Implementation
NumberOfTrips 100%StackedChartImpl
It is clear that the total sample of data is 22,554 on the left, while it is 4655 on the right. The percentage of individual segment in each chart showcase proportion to the total sample of data. Similar percentage representation of Car on top and bottom can be complete misinterpretation of the actual data. Even though it represents identical proportion of distribution in their dataset, the inference of comparison between two can be wrong.

Since, the total survey sample is not identical. In the above example, Total number of trips is 22554, while Total number of commute trips is 4655. Using proportion on the X-axis in the above showcased Stacked Bar Chart might give ambiguous representation of the actual data representation. For example, 20% of 22554 should not be represented same as 20% of 4655 on the X-axis. But the above representation gives that wrong interpretation.

A Marimekko chart might be a way to show the different total sample sizes on the Y-axis. With a Marimekko (mosaic) chart, the thickness of each bar along the Y-axis can be made proportional to its share of the total trips. But this will lead to a more cluttered chart.

Another alternative is to use stacked bar charts that show the actual counts, instead of proportions, on the X-axis. The issue there arises whenever the total sample sizes are disproportionate: the "Number of commute trips" bar would be much shorter than the "Total number of trips" bar, making the categories in the shorter bar hard to read.

@shankari
Contributor Author

shankari commented Nov 9, 2023

I really don't understand this comment. wrt

It is clear that the total sample of data is 22,554 on the left, while it is 4655 on the right. The percentage of individual segment in each chart showcase proportion to the total sample of data.

I don't see this. The proportion of drove_alone appears to be the same in the two pie charts as well although the numbers are totally different.

For example, 20% of 22554 should not be represented same as 20% of 4655 on the X-axis.

Why? If we are comparing the "parts of the whole" for two bars, then 20% should be represented the same way, just like it is in the pie chart.

Even though it represents identical proportion of distribution in their dataset, the inference of comparison between two can be wrong.

What is the wrong inference?

The reason that it is "clear that the total sample of data is 22,554 on the left, while it is 4655 on the right" is because we include the numbers in addition to the percentage. A trivial fix would be to include the number along with the % in the stacked bar chart as well.
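A minimal sketch of that fix, assuming the per-mode counts for one bar are available as a dict. The helper name `segment_labels`, the `min_pct` cutoff, and the counts are hypothetical and only illustrate the idea.

```python
def segment_labels(counts, min_pct=1.0):
    """Build a 'percent + count' label for each segment of one stacked bar.

    Segments at or below min_pct get an empty label so tiny slivers stay uncluttered.
    """
    total = sum(counts.values())
    labels = {}
    for mode, n in counts.items():
        pct = 100.0 * n / total if total else 0.0
        labels[mode] = f"{pct:.1f}%\n({n})" if pct > min_pct else ""
    return labels


# Made-up counts, for illustration only
all_trips = {"car": 500, "walk": 300, "bike": 195, "e-scooter": 5}
labels = segment_labels(all_trips)
print(labels)
```

The resulting strings could then be passed to something like `ax.bar_label(bar, labels=..., label_type='center')`, which is how the stacked bar code quoted earlier in this thread attaches its percentage labels.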

I will also point out that there is an existing implementation of the metrics using stacked bar charts in #78 so if we do choose to go with that, the implementation should be pretty simple and not take much time.

This just indicates that there is an implementation to work from, not that it is perfect, or that it cannot be changed.

@iantei
Contributor

iantei commented Nov 13, 2023

There are a few issues with canbikeco_report.ipynb from #78.
Some of them, like scaffolding.load_viz_notebook being unpacked into three return values instead of four, can be resolved with small code changes, but further on the notebook requires the Can Do Colorado eBike Program - en.csv file, which I could not find in the repo.
I can still refer to the existing stacked bar chart plots, since the outputs are also committed to the PR, but I couldn't change the code to do a quick prototype for the above case.

@shankari
Contributor Author

@iantei you need to look through the notebook more carefully. The Can Do Colorado eBike Program - en.csv file is required for the demographic plots. It is not needed for the basic mode/purpose plots that we have in the public dashboard. You should be able to comment out irrelevant code to do a quick prototype.

@iantei
Contributor

iantei commented Dec 14, 2023

Here's the output of canbikeco_report.ipynb with a few tweaks and changes.
Here are the individual Mode Share charts for Work Trips vs. All Trips:

Work Trips All Trips
Screenshot 2023-12-14 at 9 52 13 AM Screenshot 2023-12-14 at 9 52 04 AM

Here is the aggregated representation of both the above charts into a single one:

Aggregate Representation
Screenshot 2023-12-14 at 9 51 54 AM

Why? If we are comparing the "parts of the whole" for two bars, then 20% should be represented the same way, just like it is in the pie chart.

Even though it represents identical proportion of distribution in their dataset, the inference of comparison between two can be wrong.

What is the wrong inference?

The reason that it is "clear that the total sample of data is 22,554 on the left, while it is 4655 on the right" is because we include the numbers in addition to the percentage. A trivial fix would be to include the number along with the % in the stacked bar chart as well.

Calculation

From the above data and aggregate representation chart, I had the following concern:
All Trips - 1.4% representation is equal to 19.
Work Trips - 2.6% representation is equal to 10.

Since, we're representing both these bars together. The disparity in the % representation to the actual number might give wrong impression at glance to the end user. However, as you suggested "including number along with the % in the stacked bar chart" would be a probable fix.

And as shown above, it would be better for the end user to compare charts for similar metrics if we bucket them together in one chart, rather than creating two stacked bar charts and placing them next to each other for comparison.

@iantei
Contributor

iantei commented Dec 18, 2023

Here's a sample prototype of the 100% stacked bar chart:

Group (number of trips):

number of trips for each mode
number of commute trips for each mode
number of trips under 10 miles for each mode

image

@iantei
Contributor

iantei commented Dec 18, 2023

The approach I have taken for this implementation is:

  1. Create a data frame for each of these metrics and add a new column that identifies the metric.
  2. Combine all of these data frames into one.
  3. Create the 100% stacked bar chart.
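The three steps above can be sketched with pandas as follows. This is a minimal illustration with made-up counts; the column names `Mode` and `Trip Type` mirror the #78 notebook code quoted earlier in this thread, while `Count` and the frame names are assumptions.

```python
import pandas as pd

# Made-up counts, for illustration only
all_trips = pd.DataFrame({"Mode": ["walk", "bike", "car"], "Count": [50, 30, 20]})
commute_trips = pd.DataFrame({"Mode": ["walk", "bike", "car"], "Count": [2, 3, 5]})

# Step 1: tag each frame with the metric it came from
all_trips["Trip Type"] = "All trips"
commute_trips["Trip Type"] = "Commute trips"

# Step 2: combine the per-metric frames into one
combined = pd.concat([all_trips, commute_trips], ignore_index=True)

# Step 3: one row per bar, one column per mode, each row normalized to 100%
counts = combined.pivot_table(index="Trip Type", columns="Mode",
                              values="Count", aggfunc="sum", fill_value=0)
proportions = counts.div(counts.sum(axis=1), axis=0) * 100
print(proportions.round(1))
```

With matplotlib installed, `proportions.plot.barh(stacked=True)` would draw one 100% stacked bar per row of this table.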

@shankari
Contributor Author

wrt: #86 (comment)
what are your thoughts on how this looks, wrt previous concerns about:

Since, we're representing both these bars together. The disparity in the % representation to the actual number might give wrong impression at glance to the end user. However, as you suggested "including number along with the % in the stacked bar chart" would be a probable fix.

Are you recommending that we use stacked bar charts, or not?
If you have an alternative, what would it look like with this dataset?

I will also note that the graph looks quite ugly. There is a lot of space between the bars, and the numbers are essentially not visible. Were you able to use the modified graphs created by @Abby-Wheelis?

@iantei
Contributor

iantei commented Dec 19, 2023

wrt: #86 (comment)
what are your thoughts on how this looks, wrt previous concerns about:

Since, we're representing both these bars together. The disparity in the % representation to the actual number might give wrong impression at glance to the end user. However, as you suggested "including number along with the % in the stacked bar chart" would be a probable fix.

Are you recommending that we use stacked bar charts, or not?
If you have an alternative, what would it look like with this dataset?

Our goal is to identify a chart that satisfies our two requirements:

  1. Keep the representation of the part-to-whole relationship that the pie chart currently provides.
  2. As a future enhancement, the chart should represent uncertainty in the metrics. A 100% stacked bar chart would support error bars for this.

Given these two requirements, the 100% stacked bar chart seems like the ideal candidate.

Adding to my previous comment, I think a 100% stacked bar chart that labels each segment with its count alongside its percentage makes it sufficiently clear that a percentage in one bar is not directly comparable to a percentage in another. For instance, in the example below, compare Commute Trip against Total Trip for the "Gas Car, drove alone" label:

| Trip Type | Percentage of Each Trip | Number of Each Trip |
|---|---|---|
| Commute Trip | 28% | 651 |
| Total Trip | 26.1% | 2052 |

Because the percentage for "Gas Car, drove alone" is higher in Commute Trip than in Total Trip, the end user could get the wrong impression about the actual number of trips, but providing the count alongside the percentage fixes that issue.

Moreover, this 100% stacked bar chart gives the end user a good way to compare which mode is more popular in each of these trip types.

image

I will also note that the graph looks quite ugly. There is a lot of space between the bars, and the numbers are essentially not visible.

I modified the width of the bars and changed the font size; it looks better now.

Were you able to use the modified graphs created by @Abby-Wheelis?

I worked from the same notebook, canbikeco_report.ipynb, that @Abby-Wheelis used as a reference, and incorporated her way of showing the numbers alongside the percentages.

@Abby-Wheelis
Member

@iantei and I discussed a design concern this afternoon - should we continue to bin modes with a low representation into "Other"?

Often, there are modes that only have a very small percentage of trips, and currently the approach in the pie charts is to bin the modes with less than 2% share of trips into the other category. If we continue this through to the stacked bar charts to minimize change on the dashboard, we run into a bit of a problem. To give a concrete example:

We have a dataset with 5,000 trips, 50 of them labeled "ebike" (1%). Of all the trips, 500 are labeled "work", and it happens that of those 500 "work" trips, 45 are "ebike" (9%). If we enforce the "modes < 2% are 'other'" rule and have a stacked bar chart showing total and commute trips side-by-side, then it would appear as if there were 0 ebike trips in the total set of trips, but 9% in the work trips.

If you know about the "other binning" this is easy to understand, but I think it has the potential to throw people off.
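To make the example above concrete, here is a minimal sketch of the "modes < 2% are 'other'" rule. The function name `bin_small_modes` and the single "car" bucket standing in for all non-ebike trips are assumptions made for illustration; this is not the dashboard's actual code.

```python
def bin_small_modes(counts, threshold_pct=2.0):
    """Fold every mode whose share is below threshold_pct into an 'other' bucket."""
    total = sum(counts.values())
    binned = {}
    for mode, n in counts.items():
        key = mode if 100.0 * n / total >= threshold_pct else "other"
        binned[key] = binned.get(key, 0) + n
    return binned


# Trip counts from the example above; "car" is a stand-in for all non-ebike modes
all_trips = {"car": 4950, "ebike": 50}   # ebike is 1% of 5,000 trips
work_trips = {"car": 455, "ebike": 45}   # ebike is 9% of 500 work trips

binned_all = bin_small_modes(all_trips)
binned_work = bin_small_modes(work_trips)
print(binned_all)   # ebike is folded into "other"
print(binned_work)  # ebike keeps its own segment
```

Side by side, the "all trips" bar shows no ebike segment while the "work trips" bar shows one, which is the apparent discrepancy described above.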

If we completely forgo the binning, however, this raises concerns around crowding for segment labels and future additions of error bars. Some of this could be overcome by omitting elements if the bar segment is too small (like I did with segment labels in the CanBikeCO paper code). There is also a concern for the end user: will it be too big a change if the geometry AND content of the dashboard change overnight?

@iantei if I missed anything from our conversation please feel free to add on!!

I will point out that this is already possible on the dashboard, so maybe it's not as much of a concern as I feel like it is:

@Abby-Wheelis
Member

@iantei @shankari from our discussion today, for handling the design concern in the comment above:

non-graphical options:

  • Change the label of "Other" to a concatenation of modes within that category
    • Restrict the concatenation to only things that appear in the list to prevent overwhelm via user-inputted modes
    • might look something like: "Other + Bike + Unicycle"
  • show a table underneath the chart with a breakdown of "Other"
    • could show: Other: 11, Bike: 3, etc in a small table
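A minimal sketch of the label-concatenation option above, assuming a hypothetical helper `other_label` and a hypothetical set of known (predefined) modes. Modes that are not in the known set, such as free-text user input, are left out of the label.

```python
def other_label(other_modes, known_modes):
    """Build a label like 'Other + Bike + Unicycle' from the modes binned into Other.

    Only modes from the predefined list are spelled out, so user-entered
    modes cannot make the label arbitrarily long.
    """
    shown = [m for m in other_modes if m in known_modes]
    return " + ".join(["Other"] + shown)


# Hypothetical inputs, for illustration only
known = {"Walk", "Bike", "Unicycle"}
label = other_label(["Bike", "Unicycle", "my custom mode"], known)
print(label)  # Other + Bike + Unicycle
```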

graphical/visual options:

  • adding an "explosion" to show the breakdown of the "other" category
    • may be too much to show bars in the same chart, but could keep them in their own charts (then user can compare between charts)
  • keep all modes:
    • with bars, it may be a little easier to show modes with low count - see recent example below

Other concerns:
Error bars - not planned at this point for labeled metrics, so we don't need to worry about this right now; the only concern right now is error bars for sensed metrics

Next steps:

  1. Perform time-bound experiment (~1 day) testing "exploding" the other category to judge how difficult it is and see what that would look like
  2. Gather feedback from wider group in order to decide on course of action

This is what lots of modes with low counts might look like (example from data analysis work for usaid-laos-ev). A limit has been placed such that modes with a low count (which is most of them in this case) do not have count and % labels, but you can use the legend to see visually that a mode has a low count (in an ideal world colors would not repeat, but time was limited when I made this):
image

@iantei
Contributor

iantei commented Feb 18, 2024

Here's a comparison of binning infrequent modes into "Others" vs. showing all available modes, alongside the current pie chart representation in the public dashboard.

For the usaid-laos-ev dataset, with date 11/2023.

| Type | Chart |
|---|---|
| Current Representation | PieChart_New |
| With Others | With_Others |
| With All Modes | With_AllModes |

@iantei
Contributor

iantei commented Feb 19, 2024

Representation in exploded bar chart (without the line of projection)
Stacked_BarChart

Figuring out how to project the lines from "Others" in left bar chart to the right one.

@iantei
Contributor

iantei commented Feb 19, 2024

Dataset - USAID-laos-ev 2023-11.

Representation in exploded bar chart (with the line of projection)
Stacked_BarChart_projection

@iantei
Contributor

iantei commented Feb 20, 2024

An interesting observation for the above representation. Dataset used - usaid-laos-ev All

Date Charts
All Total_Data
2023-09 2023-09_ExplodingBarChart

Though informative, for the 'All' date, the right chart (the exploded version of the 'Other' category from the "Total Trips" bar chart on the left) has many labels with small proportions, which makes the chart inelegant.

And there will be cases where there is no 'Other' label at all, so there will be just a single stacked bar chart. This makes the charts non-uniform: some show two bar charts while others have just one.

@Abby-Wheelis
Member

Good point about the case where there are no "Other" modes - we should be sure whatever solution we come up with handles this as elegantly as possible!

I would be very curious to see what other people think about the "With All Modes" option. Since we don't need error bars, the fact that many modes show up as a small sliver seems like it might not be too much of an issue.

Exploding the "Other" category is also very representative, and I think there would be a way to make sure charts with no "Others" have a blank space next to them, rather than the current widening. However, if we are still going to have similar cases with slivers for modes with a count of only 1-2, I question whether that representation is "worth" the complication of exploding bars and separating the charts.

I look forward to hearing the opinions of the rest of the team in our meeting this afternoon!

@shankari
Contributor Author

@Abby-Wheelis we also discussed today that the exploded graphs here display the percentages for the explosion, which is incorrect. IMHO, we should instead have it as the percentage of the original, or at least see what that looks like.

Screenshot 2024-02-15 at 1 50 28 PM
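
To make the distinction concrete: the exploded bar can label each mode either with its share of the "Other" bucket or with its share of all trips. A minimal sketch with made-up numbers (the function name `other_shares` is hypothetical):

```python
def other_shares(other_counts, total_trips):
    """Return {label: (pct_of_other, pct_of_original)} for the exploded modes."""
    other_total = sum(other_counts.values())
    if other_total == 0 or total_trips == 0:
        return {}
    shares = {}
    for label, count in other_counts.items():
        pct_of_other = 100.0 * count / other_total     # share within the explosion
        pct_of_original = 100.0 * count / total_trips  # share of all trips
        shares[label] = (round(pct_of_other, 1), round(pct_of_original, 1))
    return shares


# Made-up numbers: 1000 trips overall, 40 of them in the "Other" bucket.
shares = other_shares({"bus": 20, "train": 12, "taxi": 8}, total_trips=1000)
print(shares)
```

In this example "bus" is 50% of the exploded bar but only 2% of all trips; labeling the right-hand bar with the second number keeps it consistent with the left-hand bar.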

@iantei
Contributor

iantei commented Feb 20, 2024

These represent the original percentage:

| Scaled percentage of Other | Chart |
| --- | --- |
| Original | Fixed_org |
| 50% | Fixed_500 |
| 100% | Fixed_1000 |

@iantei
Contributor

iantei commented Feb 20, 2024

As per the discussion in the afternoon meeting, we will proceed with binning the charts by category and keeping the expanded version of "Other", while also giving the user a dropdown option to show a table with each label's count and proportion.

@iantei
Contributor

iantei commented Feb 21, 2024

To proceed with the solution proposed above, I would like to understand a few considerations.

  1. How can we show the detailed text information in the charts? We can likely put this in the table we're planning to add.
    This is information such as "Number of trips for each mode (selected by users)" and "Based on 1359 confirmed trips from 34 testers and participants of 5893 total trips from 61 users ()", which is present in each of the pie charts.
  2. Currently, we have charts that I feel should be displayed separately because they are not similar to the other charts. I feel we can categorize them into 7 different bins, listed below.

Bin 1: Based on trip count

  • Number of trips
  • Number of commute trips
  • Trip count under 10 miles

Bin 2: Based on trip count (sensed)

  • Trip count under 10 miles (sensed)
  • Number of trips (sensed)

Bin 3: Based on purpose (mode specific)

  • e-bike specific trip count by purpose

Bin 4: Based on Replaced mode (mode specific)

  • e-bike specific trip miles by replaced mode

Bin 5: Based on Mileage

  • Trip miles by mode

Bin 6: Based on Purpose

  • Trip count by purpose

Bin 7: Based on Mileage (Sensed)

  • Trip miles by mode (sensed)
  • Trip miles by land mode (sensed)
Categories of different charts binned together
Classification_Binning

@Abby-Wheelis
Member

Notes from meeting today:

Proposed combinations - keep the same metrics together and show up to three bars: total(sensed mode), inferred, and labeled

  • trip counts: sensed, labeled, inferred
  • short trips: sensed, labeled, inferred
  • commute trips: labeled and inferred
  • miles by mode: sensed, sensed (land), inferred, labeled
  • count by purpose: labeled and mode specific labeled
  • mode specific replaced mode: labeled and inferred

Steps:

  • Design of charts and text
  • Combining charts into 100% stacked bars (labeled and sensed metrics only)
  • Adding inferred trip bars
  • Adding error bars to inferred and sensed bars

@Abby-Wheelis
Member

Design considerations for text:

Text for these chart/table pairs will need to consider total number of trips, number of trips with inferences, and number of trips with labels, likely alongside rates.

Bar labels - goal to be clear on source of labels

  • sensed by OpenPATH
  • inferred from history
  • labeled by user

example:

Number of trips taken by each mode:
Based on 5,000 total trips from 25 testers and participants
of which 2,500 (50%) also have inferred labels from 20 testers and participants 
and 1,000 (20%) of which have confirmed labels from 2 testers and participants

Concerns about this volume of text:

  • too much data for someone to take in at once
  • 3 bars and 5 lines of text can be overwhelming

Concerns of obscuring this text:

  • sense of scale matters -- what if only 2 users have labeled? We need to know the sample size
  • worried that end users may not access the table, and we don't want them to miss key information

Compromise options:

  • putting the bulk of the text below the chart, so just the main title is above, allows user to digest visual quicker
  • only including these trip counts / user counts in the table, but ALWAYS showing the table and the chart together (ex: I click "Number of trips by mode" in the dropdown and both a chart and a table appear)

Depending on how early mocking and tests go, we can make a final decision on where to put the text about counts of trips and users for each bar.
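
Wherever the text ends up, generating it is mechanical. A minimal sketch that reproduces the example wording above from (trip count, user count) pairs; the function name `describe_trips` and its signature are hypothetical, and the percentages are taken relative to the total trip count as in the example:

```python
def describe_trips(total, inferred, labeled):
    """Build the chart subtitle; each argument is a (trip_count, user_count) tuple."""
    total_trips, total_users = total
    lines = [
        f"Based on {total_trips:,} total trips from {total_users} testers and participants"
    ]
    for (trips, users), kind in ((inferred, "inferred"), (labeled, "confirmed")):
        # Percentages are relative to the total number of trips.
        pct = round(100 * trips / total_trips) if total_trips else 0
        lines.append(
            f"of which {trips:,} ({pct}%) have {kind} labels from {users} testers and participants"
        )
    return "\n".join(lines)


print(describe_trips((5000, 25), (2500, 20), (1000, 2)))
```

The same function could feed either the text below the chart or the table, so the placement decision does not affect how the numbers are computed.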

@iantei
Contributor

iantei commented Feb 23, 2024

Proposed combinations - keep the same metrics together and show up to three bars: total(sensed mode), inferred, and labeled

  • trip counts: sensed, labeled, inferred
  • short trips: sensed, labeled, inferred
  • commute trips: labeled and inferred
  • miles by mode: sensed, sensed (land), inferred, labeled
  • count by purpose: labeled and mode specific labeled
  • mode specific replaced mode: labeled and inferred

According to this design approach, I will merge all three notebooks (generic_metrics, generic_metrics_sensed and mode_specific_metrics) into a single notebook and transform all existing pie_chart representations into 100% stacked bar charts.

On the other hand, I do see the case of creating a super notebook, and losing the ability to just execute a single notebook for each specifics. For example, if I just wanted to run the sensed charts, I would only need to execute one notebook, i.e. generic_metrics_sensed. Now, I will have to execute this single notebook for all cases.

@Abby-Wheelis
Member

I do see the case of creating a super notebook, and losing the ability to just execute a single notebook for each specifics.

I think that keeping the notebook organized and maybe including some markdown text to separate sections and make notes for people running the notebook could help with this concern.

If we were to decide that it is an important feature for testing, we could think about ways to control what metrics are displayed - for example have a way for all of the charts to run, but only show the "sensed" bars. We do want to get automated tests implemented at some point though, so this may be a good argument for automated testing - if we have testing then it may be less of a concern that we can't perform tests one at a time ourselves.

@iantei
Contributor

iantei commented Mar 26, 2024

Prototype sample for the Stacked Bar Chart title text, adhering to the above proposal:

| Changes | Chart with Title |
| --- | --- |
| Before | Before_Final |
| After | After_Final |

@iantei
Contributor

iantei commented Apr 10, 2024

Changes with the usage of subplots instead of a consolidated data frame to validate the proposed change.

| Changes | Charts |
| --- | --- |
| Before (Consolidated DataFrame) | Old_00 |
| After (Using subplots) | New_00 |

Note:
For the After (using subplots):

  1. The color changes have not been incorporated, therefore we have same color choice.
  2. The smaller % of labels have not been merged into the 'Other'

@shankari Does this change adhere to your proposal of using subplots?

@shankari
Contributor Author

@iantei compare the charts - they are very similar except that the x axis is repeated. Do you think that looks good? I don't 😄
I don't see anything else that needs to be fixed

The color changes have not been incorporated, therefore we have same color choice.

The "same color choice" is not due to the color changes not being incorporated, since the sensed mode charts are not affected by color changes anyway. It is because the two bar charts are generated separately.

While incorporating the color changes, I would suggest using the basemode mapping to ensure that the colors are consistent. Similar to the phone code, we should use the same color palette for CAR in the sensed data, and drove_alone and shared_ride in the inferred/labeled data.

I am fine with deferring that change if it is too complex to include now. Ideally, the change would be in e-mission-common since it is the same logic for python and javascript.
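
A minimal sketch of what such a base-mode lookup could look like. The mapping entries, the color names, and the function name `color_for` are all hypothetical placeholders; the real mapping would come from the shared base-mode definitions rather than from this table.

```python
# Hypothetical mapping from labeled/inferred modes to base modes.
BASE_MODE_OF = {
    "drove_alone": "CAR",
    "shared_ride": "CAR",
    "walk": "WALKING",
}

# Hypothetical color per base mode (matplotlib named colors).
BASE_MODE_COLOR = {
    "CAR": "tab:red",
    "WALKING": "tab:blue",
    "OTHER": "tab:gray",
}


def color_for(mode):
    """Look up a color via the base mode; sensed modes are already base modes."""
    base = BASE_MODE_OF.get(mode, mode)
    return BASE_MODE_COLOR.get(base, BASE_MODE_COLOR["OTHER"])


for mode in ["CAR", "drove_alone", "shared_ride", "walk", "unicycle"]:
    print(mode, color_for(mode))
```

With this structure, CAR in the sensed bar and drove_alone/shared_ride in the labeled bar resolve to the same base color; distinguishing the two labeled modes by shade within that color would be a further refinement not shown here.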

@Abby-Wheelis
Member

Future work relevant to this project as discussed in our meeting today based on #123 and the review of it:

  • incorporate base-mode color mapping, likely to involve emcommon
  • Evaluate placement of the html table associated with the stacked bar charts:
    • where should it go - individual cell, replace image in current cell, or be appended to/cause the current cell to grow?
    • once we choose placement, would it make sense to remove the alttext from these images? If so, we must talk to comms before we could make this change

@iantei
Contributor

iantei commented Apr 18, 2024

The colors for Walking (from the sensed labels) and Gas Car, drove alone are identical in every colormap except tab20b, in which the text inside the bars cannot be read easily.

| Color Tab | Chart |
| --- | --- |
| tab20 | tab20 |
| tab20b | tab20b |
| tab20c | tab20c |
| tab10 | tab10 |

This might be addressed easily once the base-mode color implementation is incorporated.

@Abby-Wheelis
Member

The colors for Walking (from the sensed labels) and Gas Car, drove alone are identical in every colormap except tab20b, in which the text inside the bars cannot be read easily.

For now, I think it's alright if the colors repeat but are in different legends. That is much better than if they repeat within the same legend, which we should make sure to avoid. I think you're right that the color-mode mapping would make the color maps a little easier to manage (and perhaps more intuitive to look at).

@shankari
Contributor Author

shankari commented May 4, 2024

At a high level, each plot that we generate in the public dashboard has the following steps:

  • load (done once from scaffolding)
  • filter
  • aggregate
  • plot

Plotting involves some internal pre-processing to make the results meaningful: we convert everything to percentages, combine small entries for the charts, and save both the charts and the related text. All of this is standard and should be encapsulated in a standard function, which @iantei has already done.

The challenge comes with the filtering and aggregating preprocessing. @iantei had structured those as process functions (e.g. process_cutoff or process_distance_metrics or process_dataframe). While it is important to keep the preprocessing simple, I would argue that the pandas library does a good job of providing the appropriate abstractions to preprocess in 1-2 lines of code. As a concrete example, process_cutoff(expanded_ct, cutoff) is just as concise as expanded_ct.query('distance < @cutoff') but much less clear, at least for those who are familiar with pandas. Using pandas directly also avoids a proliferation of extremely simple wrapper functions. If we do want to encapsulate pre-processing functionality (e.g. having a standard query), we can define simple lambdas that we can pass in to the dataframe processing functions while still keeping the goals clear.
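
A small runnable sketch of the two styles described above, on a made-up dataframe. The column names, the sample values, and the `count_by` helper are assumptions for illustration; only `DataFrame.query`, `groupby` and `size` are actual pandas calls.

```python
import pandas as pd

# Made-up data, for illustration only.
expanded_ct = pd.DataFrame({
    "Mode_confirm": ["walk", "e-bike", "drove_alone", "walk"],
    "distance": [0.5, 3.0, 12.0, 1.5],
})
cutoff = 10

# Inline filter: "@cutoff" refers to the local Python variable.
short_trips = expanded_ct.query("distance < @cutoff")

# A reusable "standard query" expressed as a lambda instead of a named wrapper.
# The default argument binds the cutoff so "@limit" is a local inside the lambda.
under_cutoff = lambda df, limit=cutoff: df.query("distance < @limit")


def count_by(df, col, preprocess=None):
    """Hypothetical processing function that accepts an optional filter."""
    if preprocess is not None:
        df = preprocess(df)
    return df.groupby(col).size()


counts = count_by(expanded_ct, "Mode_confirm", preprocess=under_cutoff)
print(short_trips)
print(counts)
```

Either way, the filter condition stays visible at the call site instead of being hidden behind a one-line wrapper.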

Right now, given that all the preprocessing fits that template, I pass in some preprocessing templates, and handle both preprocessing and plotting in the same function. However, as we expand the public dashboard codebase to handle surveys, we may need more complex preprocessing that goes beyond groupby and agg. In that case, we may need to refactor to more explicitly split out the preprocessing and the plotting.

Concretely

plot_and_text_stacked_bar_chart(expanded_ct, "Mode_confirm", {distance_col: 'count'}, "Labeled by user\n (Confirmed trips)", ax[0], text_results[0], colors_mode, debug_df)
->
def plot_and_text_stacked_bar_chart(df, df_col, agg_query, bar_label, ax, text_result, colors, debug_df):
        ....
        grouped_df = df.groupby(df_col).agg(agg_query).reset_index().set_axis(['label', 'vals'], axis='columns').sort_values(by='vals', ascending=False)
....
        bar = ax.barh(y=bar_label, width=mode_prop, height=bar_height, left=bar_width, label=label, color=colors[label])

can be transformed into

processed_df = expanded_ct.groupby("Mode_confirm").agg({distance_col: 'count'}).reset_index().set_axis(['label', 'vals'], axis='columns').sort_values(by='vals', ascending=False)
->
def plot_and_text_stacked_bar_chart(processed_df, bar_label, ax, text_result, colors, debug_df):
...
        bar = ax.barh(y=bar_label, width=mode_prop, height=bar_height, left=bar_width, label=label, color=colors[label])
...

Or, if the reset_index/set_axis/sort_values chain recurs every time, as

processed_df = expanded_ct.groupby("Mode_confirm").agg({distance_col: 'count'})
->
def plot_and_text_stacked_bar_chart(processed_df, bar_label, ax, text_result, colors, debug_df):
...
        label_val_df = processed_df.reset_index().set_axis(['label', 'vals'], axis='columns').sort_values(by='vals', ascending=False)
        ....
        bar = ax.barh(y=bar_label, width=mode_prop, height=bar_height, left=bar_width, label=label, color=colors[label])
...

This still makes it clear what we are doing here (plotting the mode_confirm property as counts) without having to look at the implementation of a library function, and keeping the plot code nicely encapsulated.
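
For reference, a self-contained sketch of the second split on a made-up dataframe: the caller does the groupby/agg, and a helper standing in for the first lines of the plotting function does the rename and sort. The sample data and the helper name `to_label_vals` are assumptions; the pandas chain is the one shown above.

```python
import pandas as pd

# Made-up trips, for illustration only.
expanded_ct = pd.DataFrame({
    "Mode_confirm": ["walk", "drove_alone", "walk", "e-bike", "walk", "drove_alone"],
    "distance": [0.4, 8.0, 1.1, 2.5, 0.7, 15.0],
})
distance_col = "distance"

# Caller side: the aggregation is explicit and readable at the call site.
processed_df = expanded_ct.groupby("Mode_confirm").agg({distance_col: "count"})


def to_label_vals(processed_df):
    """Normalize an aggregated frame to ['label', 'vals'], largest value first."""
    return (
        processed_df.reset_index()
        .set_axis(["label", "vals"], axis="columns")
        .sort_values(by="vals", ascending=False)
    )


label_val_df = to_label_vals(processed_df)
# Percentages are what a 100% stacked bar would actually plot.
label_val_df["pct"] = 100 * label_val_df["vals"] / label_val_df["vals"].sum()
print(label_val_df)
```

Swapping `"count"` for `"sum"` in the caller would give miles by mode without touching the plotting side, which is the point of keeping the aggregation outside the plot function.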

@Abby-Wheelis I am now checking what kind of preprocessing you do for the survey code...

@shankari
Contributor Author

shankari commented May 5, 2024

@iantei note that the two options to split pre-processing and plotting will get us back to the original template for the plot call, which took in labels and values instead of a dataframe. There's a reason we used that structure in the first place 😄
Simply replacing pie_chart_mode() with stacked_bar_chart_mode(..., ax) or stacked_bar_chart(..., ax, colors) would have generated a relatively clean and easy-to-review PR.

shankari added a commit to iantei/em-public-dashboard that referenced this issue May 5, 2024
@shankari
Contributor Author

Closing this since it has moved to production.
Cleanup issue is at: e-mission/e-mission-docs#1051
