Extend PCA Visualizer with Component-Feature Strength #615

Closed
bbengfort opened this issue Sep 17, 2018 · 19 comments
Labels
level: intermediate python coding expertise required type: feature a new visualizer or utility for yb

Comments

@bbengfort
Member

bbengfort commented Sep 17, 2018

Describe the solution you'd like

Provide an optional heatmap and color bar underneath the PCA visualizer (by shifting the lower axes) that shows the magnitude of each feature's contribution to each component. This explains which features contribute the most to which component.

Is your feature request related to a problem? Please describe.

Although we have the biplot mode to plot feature strengths, the arrows and labels can overlap or become unintelligible, particularly when there is a large number of features.

Examples

[Image: image from ios]

Code to generate this:

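import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Assumed setup (not shown in the original snippet): 2-component PCA on scaled cancer data
cancer = load_breast_cancer()
pca = PCA(n_components=2).fit(StandardScaler().fit_transform(cancer.data))

fig, ax = plt.subplots(figsize=(8, 4))
plt.imshow(pca.components_, interpolation='none', cmap='plasma')
feature_names = list(cancer.feature_names)

# One tick at the left edge of each cell, so tick and label counts match
ax.set_xticks(np.arange(len(feature_names)) - 0.5)
ax.set_yticks(np.arange(0.5, 2))
ax.set_xticklabels(feature_names, rotation=90, ha='left', fontsize=12)
ax.set_yticklabels(['First PC', 'Second PC'], va='bottom', fontsize=12)

plt.colorbar(orientation='horizontal', ticks=[pca.components_.min(), 0,
                                              pca.components_.max()], pad=0.65)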

We will probably want to use pcolormesh rather than imshow, though, as in Rank2D, ClassificationReport, and ConfusionMatrix. It might also be a tad nicer if the color bar sat above the feature plot so that the axes names were the last thing in the chart.
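As a rough sketch of that pcolormesh variant (not the eventual Yellowbrick implementation): the helper name `plot_component_strengths`, the scaled cancer data, and the "PC n" row labels are all assumptions made for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def plot_component_strengths(components, feature_names):
    """Hypothetical helper: components-by-features heatmap, colorbar on top."""
    n_components, n_features = components.shape
    fig, ax = plt.subplots(figsize=(8, 4))

    # pcolormesh draws cell (i, j) between integer grid lines, so the
    # cell centers sit at j + 0.5 (x) and i + 0.5 (y).
    mesh = ax.pcolormesh(components, cmap="plasma")

    ax.set_xticks(np.arange(n_features) + 0.5)
    ax.set_xticklabels(feature_names, rotation=90)
    ax.set_yticks(np.arange(n_components) + 0.5)
    ax.set_yticklabels([f"PC {i + 1}" for i in range(n_components)])
    ax.invert_yaxis()  # first component on the top row, matching imshow

    # Place the colorbar above the heatmap so that the feature names
    # are the last thing in the chart.
    fig.colorbar(mesh, ax=ax, location="top")
    return fig, ax


# Assumed setup: 2-component PCA on the scaled breast cancer data.
cancer = load_breast_cancer()
X = StandardScaler().fit_transform(cancer.data)
pca = PCA(n_components=2).fit(X)
fig, ax = plot_component_strengths(pca.components_, cancer.feature_names)
```

Call `plt.show()` or `fig.savefig(...)` afterwards to view the result.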

Notes

This idea comes from pages 55-56 of Data Science Documentation. I would be happy to include a citation to this in our documentation. (HTML version is here). @mapattacker any thoughts?

See also #476 for other updates to the PCA visualizer.

@bbengfort bbengfort added type: feature a new visualizer or utility for yb level: intermediate python coding expertise required hacktoberfest labels Sep 17, 2018
@mapattacker

mapattacker commented Sep 19, 2018

Nice library! The existing biplot in this package already gives feature strengths, but it might not suit everyone and can be difficult to read when too many features overlap each other. This is indeed a good additional feature to add for PCA, @bbengfort.

@rebeccabilbro
Member

Thank you @mapattacker; we really enjoyed reading through your documentation!

@stoff3l

stoff3l commented Oct 3, 2018

Is someone already working on this? If not, can I give it a shot? I haven't worked on an issue before, so this might be my first.

@bbengfort
Member Author

bbengfort commented Oct 4, 2018

@stoff3l you're more than welcome! Looking forward to your PR! Let me know if you have any questions.

@stoff3l

stoff3l commented Oct 16, 2018

Hi there, I unfortunately don't have as much time to dedicate to this as I thought I would. It might be good if someone else can pick this up (seeing that it's still Hacktober). Apologies.

@rebeccabilbro
Member

No worries @stoff3l, thanks for your interest in contributing and feel free to check back in when you have more bandwidth!

@dnabanita7
Contributor

Can I be assigned this issue?

@wagner2010
Contributor

Hi @naba7. We don't assign issues out; however, you are always welcome to work on the problem yourself and submit a PR. Be sure to check out the referenced issues above to fully grasp what we are looking for. Thanks.

@dnabanita7
Contributor

dnabanita7 commented Feb 14, 2019 via email

@dnabanita7
Contributor

@bbengfort Should I update the docs in pca.rst or update pca.py?
If I update pca.py, should I make the changes in the draw method? In the given example, the cancer dataset is used. How should I get the feature names?

@dnabanita7
Contributor

The changes I have made:
def draw(self, **kwargs):
    X = self.pca_features_
    if self.proj_dim == 2:
        fig, ax = plt.subplots(figsize=(8, 4))
        # Draw the heatmap first so the colorbar has a mappable to attach to
        plt.imshow(self.pca_components_, interpolation='none', cmap='plasma')
        plt.colorbar(orientation='horizontal', ticks=[self.pca_components_.min(), 0,
                     self.pca_components_.max()], pad=0.65)
        # Feature names come from self.features_ (also used for the arrow labels below)
        feature_names = list(self.features_)

        # One tick per feature so the tick and label counts match
        ax.set_xticks(np.arange(len(feature_names)) - 0.5)
        ax.set_yticks(np.arange(0.5, 2))
        ax.set_xticklabels(feature_names, rotation=90, ha='left', fontsize=12)
        ax.set_yticklabels(['First PC', 'Second PC'], va='bottom', fontsize=12)

        #self.ax.scatter(X[:, 0], X[:, 1], c=self.color, cmap=self.colormap)
        if self.proj_features:
            x_vector = self.pca_components_[0]
            y_vector = self.pca_components_[1]
            max_x = max(X[:, 0])
            max_y = max(X[:, 1])
            for i in range(self.pca_components_.shape[1]):
                self.ax.arrow(
                    x=0, y=0,
                    dx=x_vector[i] * max_x,
                    dy=y_vector[i] * max_y,
                    color='r', head_width=0.05,
                    width=0.005,
                )
                self.ax.text(
                    x_vector[i] * max_x * 1.05,
                    y_vector[i] * max_y * 1.05,
                    self.features_[i], color='r'
                )

@bbengfort
Member Author

Hi @naba7 - for this issue, we would have to edit yellowbrick/features/pca.py and implement the component feature strengths on the PCADecomposition. The best bet would be to make this an independent method because as you noted in your code snippet it's going to require additional axes, which can be tricky when working with visualizers, plus it should be optional for the user.

This feature is probably going to be a little complicated and will require a little back and forth in order to implement successfully. I would recommend writing out some example code either in a notebook or in a Python script that uses one of the yellowbrick datasets, so we can get a better feel for how this should be implemented more generally in the visualizer.

Before we go too far down this road, however, I would strongly suggest that we clear up the forking and branching issues that you're currently having and resolve your currently open PR, otherwise this will likely be a mess that will be difficult to resolve. It looked like @rebeccabilbro made some excellent suggestions that might help you untangle what seems to be a git-related knot?

We certainly appreciate your enthusiasm and I just want to make sure that you're set up to be successful as a YB contributor!

@dnabanita7
Contributor

Thanks a lot. It is all because of the members and fellow contributors who are so supportive and active.

[Image: credit]

This image shows that after scaling the credit data, the points merged into a single point rather than scattering.

[Image: colorbar]

The colorbar data is shown above.

The code for running the PCA visualizer on the credit dataset is as follows:

PCA_Visualizer.zip

@bbengfort
Member Author

Hi @naba7 - thank you for providing the code snippet and trying the example! I've updated the code to use the new yellowbrick.datasets module - this is available in the develop branch, and if you have set up yellowbrick according to the contributor's guide, then it should be available to you. Alternatively, you should be able to put PCA_Visualizer.py in the root of the yellowbrick project, where it should be able to import the new datasets module.

Here is my updated code snippet for you:

https://gist.github.com/bbengfort/bf59c1f33b1e523ea1f4774bd3272876

When using the new dataset module, the target is excluded and that results in the following PCA image:

[Image: screenshot 2019-02-18 11 00 42]

Which is a tad better!

Unfortunately, this code snippet creates two figures; here is the second figure (using imshow):

[Image: screenshot 2019-02-18 11 00 47]

Ideally, we'd like the strengths, the colorbar, and the scatter plot all in the same figure but in different axes. This is a fairly tricky problem - but you're definitely pushing this issue forward and we really appreciate it!

I've been using make_axes_locatable to do this; see also the following StackOverflow questions:

However, this is a bit tricky. Perhaps for this initial experimentation phase you might want to try GridSpec?
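For experimentation, a GridSpec layout along these lines might be a starting point. This is only a sketch under assumptions: the cancer data stands in for the credit data, and the row heights, spacing, and axis names are arbitrary choices rather than anything Yellowbrick prescribes.

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.gridspec import GridSpec
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Assumed setup: 2-component PCA on the scaled breast cancer data.
cancer = load_breast_cancer()
X = StandardScaler().fit_transform(cancer.data)
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

fig = plt.figure(figsize=(8, 10))
# Three stacked rows in ONE figure: scatter plot, thin colorbar, heatmap.
gs = GridSpec(3, 1, figure=fig, height_ratios=[6, 0.3, 1.5], hspace=0.5)
scatter_ax = fig.add_subplot(gs[0])
cbar_ax = fig.add_subplot(gs[1])
heat_ax = fig.add_subplot(gs[2])

# Projected instances, colored by target class.
scatter_ax.scatter(X_2d[:, 0], X_2d[:, 1], c=cancer.target, cmap="viridis", s=10)
scatter_ax.set_xlabel("First PC")
scatter_ax.set_ylabel("Second PC")

# Component-feature strengths; cell centers sit at index + 0.5.
mesh = heat_ax.pcolormesh(pca.components_, cmap="plasma")
n_components, n_features = pca.components_.shape
heat_ax.set_xticks(np.arange(n_features) + 0.5)
heat_ax.set_xticklabels(cancer.feature_names, rotation=90)
heat_ax.set_yticks(np.arange(n_components) + 0.5)
heat_ax.set_yticklabels(["First PC", "Second PC"])
heat_ax.invert_yaxis()  # first component on the top row

# Draw the colorbar into its own row, between the scatter and the heatmap.
fig.colorbar(mesh, cax=cbar_ax, orientation="horizontal")
```

GridSpec fixes all three axes up front, which is simple for a script but harder to retrofit onto a visualizer that is handed an existing axes.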

Thanks again for all your hard work on this!

@dnabanita7
Contributor

Thank you @bbengfort for figuring out what went wrong, and yes, I have set up yb according to the contributor's guide. I will keep in mind to use yellowbrick.datasets from now on. I will keep working on this issue and try GridSpec too. It is really interesting, amazing, and fun to work on this and to get guidance and support from you. OSS is love.

@dnabanita7
Contributor

Proposal:
We can normalize the axis values and use make_axes_locatable to create a divider that places the colorbar alongside.
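A minimal sketch of the divider part of this proposal, assuming the cancer data as a stand-in and arbitrary size and pad values; the normalization step is not attempted here.

```python
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.axes_grid1 import make_axes_locatable
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Assumed setup: 2-component PCA on the scaled breast cancer data.
cancer = load_breast_cancer()
X = StandardScaler().fit_transform(cancer.data)
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

fig, ax = plt.subplots(figsize=(8, 10))
ax.scatter(X_2d[:, 0], X_2d[:, 1], c=cancer.target, cmap="viridis", s=10)
ax.set_xlabel("First PC")
ax.set_ylabel("Second PC")

# The divider carves new axes out of the space of an existing one, so the
# caller only has to supply a single axes. Sizes are relative to `ax`;
# pads are in inches.
divider = make_axes_locatable(ax)
cbar_ax = divider.append_axes("bottom", size="4%", pad=0.7)
heat_ax = divider.append_axes("bottom", size="15%", pad=0.4)

# Component-feature strengths; cell centers sit at index + 0.5.
mesh = heat_ax.pcolormesh(pca.components_, cmap="plasma")
n_components, n_features = pca.components_.shape
heat_ax.set_xticks(np.arange(n_features) + 0.5)
heat_ax.set_xticklabels(cancer.feature_names, rotation=90)
heat_ax.set_yticks(np.arange(n_components) + 0.5)
heat_ax.set_yticklabels(["First PC", "Second PC"])
heat_ax.invert_yaxis()  # first component on the top row

fig.colorbar(mesh, cax=cbar_ax, orientation="horizontal")
```

Each `append_axes("bottom", ...)` call stacks below the previous one, so appending the colorbar first keeps it above the heatmap.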

@bbengfort
Member Author

@naresh-bachwani this is what you're currently working on, right? Let's make sure this gets closed when it's finished!

@naresh-bachwani
Contributor

@bbengfort I think that we have achieved all of its tasks. Is there anything that I am missing?

@bbengfort
Member Author

I don't think so - let's close this as fixed in #884 and #937

7 participants