Background
In their 2018 work, Han and colleagues describe LabelTransfer, a system that annotates point-based representations of collections of text documents. They use TF-IDF to infer a set of characteristic phrases and construct visualizations like those shown below (Figs. 1 and 2 from their paper, see the citation below):
The figure above shows two designs: one where the phrases overlap the data points and one that tries to avoid the overlap. Our group has also designed systems for annotating point-based visualizations in the past; here is our implementation in the single-cell add-on for Orange:
Here, annotations are based on clusters of points identified in the embedding space.
Proposed Solution
I propose implementing an Annotated Corpus Map widget. The widget takes a text corpus as input. We also assume that the data already contains two columns with x and y coordinates, for instance, from the t-SNE widget.
For the clustering, the choices are (1) DBSCAN (with its parameter epsilon, just like in the Annotator widget in the single-cell add-on), (2) Gaussian mixture models (e.g., GaussianMixture from scikit-learn, with the number of clusters set manually but defaulting to the number that maximizes the silhouette score), and (3) a clustering variable. Option 3 is enabled only if the data contains a discrete attribute, possibly among the meta variables.
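To make option 1 concrete, here is a minimal pure-Python sketch of DBSCAN on the x and y coordinates. It is only an illustration, not the code of the Annotator widget: the function name `dbscan` and the `min_samples` default are assumptions, and an actual widget would more likely call an existing implementation such as scikit-learn's `DBSCAN`.

```python
import math

NOISE = -1  # label for points that belong to no cluster


def dbscan(points, eps, min_samples=3):
    """Label each 2D point with a cluster index, or NOISE."""
    n = len(points)
    # Neighbourhood of each point: all points within distance eps
    # (a point counts as its own neighbour).
    neighbours = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n  # None means "not visited yet"
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_samples:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        # i is a core point: start a new cluster and grow it.
        cluster += 1
        labels[i] = cluster
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reachable from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbours[j]) >= min_samples:
                queue.extend(neighbours[j])  # j is also a core point
    return labels


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
print(dbscan(points, eps=1.5))  # [0, 0, 0, 1, 1, 1, -1]
```

A larger epsilon merges nearby groups into fewer clusters, while a smaller one leaves more points unlabelled as noise, which is why exposing it as the widget's single parameter seems sufficient.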
To find characteristic terms in the documents, the widget can use the procedure already implemented in Orange's text add-on. The user can select how many characteristic words to display for each cluster, up to five (default 3). Hovering over a cluster label displays a tooltip with up to ten characteristic words and their scores.
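As a rough illustration of what "characteristic words per cluster" could mean, the sketch below scores words with a plain TF-IDF, treating each cluster as a single pooled document. This is a stand-in and not the procedure from Orange's text add-on; the function name `characteristic_words`, the whitespace tokenization, and the scoring formula are all assumptions made for the example.

```python
import math
from collections import Counter


def characteristic_words(documents, labels, n_words=3):
    """Return {cluster: [(word, score), ...]} with the top-scoring words."""
    # Pool token counts per cluster; each cluster acts as one "document".
    counts = {}
    for text, label in zip(documents, labels):
        counts.setdefault(label, Counter()).update(text.lower().split())
    n_clusters = len(counts)
    # In how many clusters does each word occur?
    cluster_freq = Counter(word for c in counts.values() for word in c)
    result = {}
    for label, c in counts.items():
        total = sum(c.values())
        scores = {
            # term frequency in the cluster times inverse cluster frequency
            word: (freq / total) * math.log(n_clusters / cluster_freq[word])
            for word, freq in c.items()
        }
        # Highest score first; ties are broken alphabetically.
        ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
        result[label] = ranked[:n_words]
    return result


docs = [
    "the apple banana fruit",
    "the apple fruit juice",
    "the engine wheel car",
    "the car engine road",
]
top = characteristic_words(docs, labels=[0, 0, 1, 1], n_words=2)
print([word for word, _ in top[0]])  # ['apple', 'fruit']
print([word for word, _ in top[1]])  # ['car', 'engine']
```

Calling the same function with `n_words=10` would give the longer list, with scores, that the tooltip needs. Words that occur in every cluster, such as "the" here, get a score of zero and drop to the bottom of the ranking.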