ProjectionVisualizer: unifying functionality of PCA and Manifold #874

Closed
bbengfort opened this issue Jun 4, 2019 · 4 comments
Labels
level: expert (deep knowledge of packages required) · priority: medium (can wait until after next release) · type: feature (a new visualizer or utility for yb) · type: technical debt (work to optimize or generalize code)

Comments

@bbengfort
Member

One of the basic high-dimensional visualization techniques that Yellowbrick uses is to decompose or project a high-dimensional space into 2 or 3 dimensions and display the data as a scatter plot. Projections of this kind reduce the amount of space between points (decreasing sparsity) but can still give us some intuition about structures in the higher dimensionality. Currently, we have three primary groups of visualizers that use this technique:

  • Manifold: wraps a non-linear transformer from sklearn.manifold to produce embeddings
  • PCA: uses linear principal component analysis to decompose to lower dimensionality
  • Text: TSNE and UMAP visualizers do the same as Manifold but with text-specific helpers

These visualizers have a lot of shared functionality that can be combined to streamline these kinds of visualizations and make it easier to extend them (e.g. to add ICA, Fast PCA, etc. to the PCA decompositions, or to extend the text visualizers to use the manifold visualizations).

I propose we create a ProjectionVisualizer base class or mixin that knows how to:

  • Wrap a transformer to project X into X' of shape (n_instances, 2) or (n_instances, 3)
  • Create a scatter plot for 2D or 3D plots (implemented in PCA)
  • Identify the type of the target and add colors (implemented in Manifold)
  • Subselect the features to use in X for the projection

This shared functionality could then be easily used by PCA, Manifold, etc.
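As a rough illustration of the four responsibilities listed above, here is a minimal sketch. It is not Yellowbrick code: `ProjectionSketch`, `FirstColumns`, and their parameters are hypothetical names made up for this example, and the only assumption about the wrapped transformer is that it exposes a scikit-learn style `fit_transform(X)`. Target-based coloring is left to the caller here and passed in as `colors`.

```python
class ProjectionSketch:
    """Hypothetical stand-in for the proposed ProjectionVisualizer."""

    def __init__(self, transformer, projection=2, features=None):
        if projection not in (2, 3):
            raise ValueError("projection must be 2 or 3")
        self.transformer = transformer  # any object with fit_transform(X)
        self.projection = projection    # number of output dimensions
        self.features = features        # optional column indices to keep

    def _select(self, X):
        # Sub-select the columns of X that feed the projection.
        if self.features is None:
            return [list(row) for row in X]
        return [[row[j] for j in self.features] for row in X]

    def fit_transform(self, X, y=None):
        # Project X into X' and check it has shape (n_instances, projection).
        Xp = [list(row) for row in self.transformer.fit_transform(self._select(X))]
        if any(len(row) != self.projection for row in Xp):
            raise ValueError("transformer did not return %d columns" % self.projection)
        self.embedding_ = Xp
        return Xp

    def draw(self, Xp, colors=None):
        # Imported lazily so the projection logic works without matplotlib.
        import matplotlib.pyplot as plt

        fig = plt.figure()
        kwargs = {"projection": "3d"} if self.projection == 3 else {}
        ax = fig.add_subplot(111, **kwargs)
        ax.scatter(*zip(*Xp), c=colors)  # unpacks into x, y (and z) columns
        return ax


class FirstColumns:
    """Toy transformer (hypothetical): keeps the first k columns of X."""

    def __init__(self, k):
        self.k = k

    def fit_transform(self, X, y=None):
        return [row[: self.k] for row in X]
```

With something like this in place, a PCA or Manifold subclass would mostly need to choose the transformer; for example, `ProjectionSketch(FirstColumns(2), features=[1, 2, 3]).fit_transform(X)` projects only columns 1 to 3 of `X`.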

Some notes about the class hierarchy:

  • The MultiFeatureVisualizer produces a self.features_ attribute on fit(), which is useful in PCA for biplots and for understanding the original feature set.
  • The DataVisualizer produces self.classes_ from y and is supposed to "provide helper functionality related to target identification" but does not implement this yet (it is implemented on Manifold).
  • yellowbrick.contrib.ScatterVisualizer might be worth moving to yellowbrick.draw.scatter and using as a mixin to handle some of these cases, though I don't want to confuse things too much.
  • The JointPlot visualizer would also benefit from the target color handling described above.

This implies that the ProjectionVisualizer is a DataVisualizer and that the DataVisualizer needs to be updated to handle the target identification that currently lives in Manifold. It also implies that JointPlot should be a DataVisualizer as well.
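To make "target identification" concrete, here is a small sketch of the kind of helper the DataVisualizer could provide. It is an assumption-laden illustration, not the logic in Manifold: the function names, the "single" / "discrete" / "continuous" strings, and the `max_classes` threshold are all hypothetical.

```python
from numbers import Real


def target_type(y, max_classes=10):
    """Guess how a target vector should be colored (hypothetical heuristic)."""
    if y is None:
        return "single"      # no target: draw every point in one color
    values = list(y)
    if not all(isinstance(v, Real) for v in values):
        return "discrete"    # non-numeric labels are always treated as classes
    if len(set(values)) <= max_classes:
        return "discrete"    # few distinct numbers: treat them as class ids
    return "continuous"      # many distinct numbers: use a colormap


def color_values(y, max_classes=10):
    """Return (kind, values) where values can drive a palette or colormap."""
    kind = target_type(y, max_classes)
    if kind == "single":
        return kind, None
    values = list(y)
    if kind == "discrete":
        # Index classes in order of first appearance.
        classes = list(dict.fromkeys(values))
        index = {label: i for i, label in enumerate(classes)}
        return kind, [index[v] for v in values]
    # Continuous: rescale to [0, 1] so any colormap can be applied.
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0
    return kind, [(v - lo) / span for v in values]
```

A JointPlot or projection visualizer could then branch on the returned kind to decide between a legend (discrete) and a colorbar (continuous).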

More investigation on this topic is necessary, but I wanted to propose this solution to allow for further discussion by @DistrictDataLabs/team-oz-maintainers and @naresh-bachwani who is working on PCA this summer.

@bbengfort added the level: expert, priority: medium, type: feature, and type: technical debt labels on Jun 4, 2019
@rebeccabilbro added this to the v1.1 milestone on Jun 4, 2019
@naresh-bachwani
Contributor

Thanks, @bbengfort, for summarizing this. It makes things simpler for me.

@bbengfort
Member Author

This might also be useful for #889

bbengfort pushed a commit that referenced this issue Jul 2, 2019
Updates the DataVisualizer to perform target type identification as implemented in Manifold. This was an original requirement of the DataVisualizer but remained unimplemented since ParallelCoordinates and RadViz were the only main library subclasses. This is the first step in the ProjectionVisualizer high-dimensional visualization base class.

Related to #874
bbengfort pushed a commit that referenced this issue Jul 17, 2019
This is the first major step toward completing #874: the implementation of a ProjectionVisualizer base class to unify the functionality of decomposition visualizers that use PCA and Manifold and to extend support to other decomposition methods. In a follow-up PR, we will reorganize this class and extend the functionality in Manifold and PCA.
@rebeccabilbro
Member

Just a note that this issue would have the potential to close (or at least address portions of) a lot of existing issues:

@bbengfort
Member Author

This was finished in #930 and #937.

3 participants