Annotated Corpus Map #799

Closed
BlazZupan opened this issue Mar 22, 2022 · 1 comment

@BlazZupan

Background

In their 2018 work, Han and colleagues describe LabelTransfer, a system that annotates point-based representations of collections of text documents. They use TF-IDF to infer a set of characteristic phrases and construct visualizations like those shown below (Figs. 1 and 2 from their paper, see the citation below):

image

The figure above shows two designs: one where phrases overlap the data points and one that tries to avoid the overlap. Our group has also designed systems for annotating point-based visualizations; here is our implementation in the single-cell add-on for Orange:

image

Here, annotations are based on clusters of points identified in the embedding space.

Proposed Solution

I propose implementing an Annotated Corpus Map widget. The widget takes a text corpus as input. We also assume that the data already contains two columns with x and y coordinates, for instance, from the t-SNE widget.

For clustering, the choices are (1) DBSCAN (with its parameter epsilon, just like in the Annotator widget in the single-cell add-on), (2) Gaussian mixture models (e.g., GaussianMixture from scikit-learn, with the number of clusters set manually but defaulting to the number of clusters with maximal silhouette), and (3) a clustering variable. Option 3 is enabled only if the data contains a discrete attribute, possibly among the meta variables.
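
A minimal sketch of options 1 and 2, assuming the x and y coordinates are available as an `(n, 2)` NumPy array. The function names, the default `eps`, `min_samples`, and the candidate range of cluster counts are illustrative assumptions and are not taken from the Annotator widget:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score
from sklearn.mixture import GaussianMixture


def cluster_dbscan(xy, eps=0.5, min_samples=5):
    """Option 1: DBSCAN on the 2D coordinates; label -1 marks unclustered points."""
    return DBSCAN(eps=eps, min_samples=min_samples).fit_predict(xy)


def cluster_gmm(xy, n_components=None, k_range=range(2, 11), random_state=0):
    """Option 2: Gaussian mixture model.

    If n_components is None, try every k in k_range (an assumed default)
    and keep the one with the highest silhouette score.
    """
    if n_components is None:
        best_k, best_score = None, -np.inf
        for k in k_range:
            if k >= len(xy):
                break  # silhouette needs fewer clusters than points
            labels = GaussianMixture(
                n_components=k, random_state=random_state
            ).fit_predict(xy)
            if len(np.unique(labels)) < 2:
                continue  # silhouette is undefined for a single cluster
            score = silhouette_score(xy, labels)
            if score > best_score:
                best_k, best_score = k, score
        n_components = best_k or 1  # fall back to one cluster if nothing scored
    labels = GaussianMixture(
        n_components=n_components, random_state=random_state
    ).fit_predict(xy)
    return labels, n_components
```

Option 3 needs no clustering step: the values of the chosen discrete attribute would serve directly as cluster labels.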

To find characteristic terms in the documents, the widget can use a procedure already implemented in Orange's text add-on. The user can select up to five characteristic words to be displayed for each cluster (default 3). Hovering over a cluster label displays a tooltip with scores and up to ten characteristic words.

@ajdapretnar added enhancement feast This may require a few weeks of work labels Apr 1, 2022
@VesnaT self-assigned this Apr 4, 2022
@VesnaT closed this as completed May 23, 2022
@wvdvegte

The citation seems to be missing. I found the paper here.
