Annotated Corpus Map #799

BlazZupan · 2022-03-22T19:59:35Z

Background

In their 2018 paper, Han and colleagues describe LabelTransfer, a system that annotates point-based representations of collections of text documents. They use TF-IDF to infer a set of characteristic phrases and construct visualizations like those shown below (Figs. 1 and 2 from their paper; see the citation below):

In the figure above, we see two designs: one where the phrases overlap the data points and another that tries to avoid the overlap. Our group has also designed systems for annotating point-based visualizations; here is our implementation in Orange's single-cell add-on:

Here, annotations are based on clusters of points, where the clusters are identified in the embedding space.

Proposed Solution

I propose implementing the Annotated Corpus Map widget. The widget takes a text corpus as input. We also assume that the data already contains two columns with x and y coordinates, for instance, from the t-SNE widget.

For the clustering, the choices are (1) DBSCAN (with its parameter epsilon, just like in the Annotator widget in the single-cell add-on), (2) Gaussian mixture models (e.g., GaussianMixture from scikit-learn, with the number of clusters set manually but defaulting to the number of clusters with maximal silhouette), and (3) a clustering variable. Option 3 is enabled only if the data contains a discrete attribute, possibly among the meta variables.
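To make the first two clustering options concrete, here is a minimal sketch using scikit-learn on the x and y columns. The function names, the default `eps`, the candidate range for the number of components, and the synthetic data are all assumptions for illustration, not the widget's actual code. Option 3 needs no computation, since the labels come straight from the discrete attribute.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score
from sklearn.mixture import GaussianMixture


def dbscan_labels(xy, eps=1.5, min_samples=5):
    """Option 1: density-based clusters; the label -1 marks noise points."""
    return DBSCAN(eps=eps, min_samples=min_samples).fit_predict(xy)


def gmm_labels(xy, n_components=None, max_components=6, seed=0):
    """Option 2: Gaussian mixture; choose k by silhouette when none is given."""
    if n_components is not None:
        # The user set the number of clusters manually.
        model = GaussianMixture(n_components=n_components, n_init=3,
                                random_state=seed)
        return model.fit_predict(xy)

    best_score, best_labels = float("-inf"), None
    for k in range(2, max_components + 1):
        model = GaussianMixture(n_components=k, n_init=3, random_state=seed)
        labels = model.fit_predict(xy)
        if len(set(labels.tolist())) < 2:
            continue  # silhouette is undefined for a single cluster
        score = silhouette_score(xy, labels)
        if score > best_score:
            best_score, best_labels = score, labels
    return best_labels  # None if no candidate produced two or more clusters


# Synthetic stand-in for t-SNE coordinates: three well-separated blobs.
rng = np.random.default_rng(0)
centers = [(0, 0), (10, 0), (0, 10)]
xy = np.vstack([rng.normal(loc=c, scale=0.5, size=(30, 2)) for c in centers])

auto_labels = gmm_labels(xy)        # k chosen by maximal silhouette
density_labels = dbscan_labels(xy)  # k implied by eps and min_samples
```

Scanning a small range of component counts keeps the default cheap to compute. A real widget would probably also need to cap the range for very small corpora.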

The widget can use the procedure for finding characteristic terms in documents that is already implemented in Orange's text add-on. The user can select up to five characteristic words to be displayed for each cluster (default 3). Hovering over a cluster label displays a tooltip with scores and up to ten characteristic words.
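As an illustration of per-cluster characteristic words, the sketch below ranks each word by how much more often it appears in a cluster's documents than in the rest of the corpus. This scoring is a simple stand-in chosen for the example; it is not the procedure from Orange's text add-on, and the function name and whitespace tokenization are assumptions.

```python
from collections import Counter, defaultdict


def characteristic_words(docs, labels, n_words=3):
    """Return {cluster: [(word, score), ...]} with the top n_words per cluster.

    score = share of the cluster's documents containing the word
            minus the share of all other documents containing it.
    """
    doc_tokens = [set(doc.lower().split()) for doc in docs]
    cluster_sizes = Counter(labels)
    total_df = Counter()               # document frequency over the corpus
    cluster_df = defaultdict(Counter)  # document frequency within each cluster
    for tokens, label in zip(doc_tokens, labels):
        total_df.update(tokens)
        cluster_df[label].update(tokens)

    n_docs = len(docs)
    result = {}
    for label, counts in cluster_df.items():
        size = cluster_sizes[label]
        outside = n_docs - size
        scores = {}
        for word, inside_count in counts.items():
            inside_share = inside_count / size
            outside_share = (
                (total_df[word] - inside_count) / outside if outside else 0.0
            )
            scores[word] = inside_share - outside_share
        # Highest score first; ties are broken alphabetically.
        ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
        result[label] = ranked[:n_words]
    return result


docs = [
    "cats chase mice",
    "cats purr loudly",
    "stocks fell sharply",
    "stocks rallied today",
]
labels = [0, 0, 1, 1]

on_map = characteristic_words(docs, labels)                 # default 3 per cluster
in_tooltip = characteristic_words(docs, labels, n_words=10)  # up to ten, with scores
```

The same ranked list can serve both the on-map label and the tooltip; only the cutoff differs.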

wvdvegte · 2023-07-25T14:53:33Z

The citation seems to be missing. I found the paper here.

ajdapretnar added the enhancement and feast ("This may require a few weeks of work") labels Apr 1, 2022

VesnaT self-assigned this Apr 4, 2022

VesnaT mentioned this issue Apr 19, 2022

Annotated Corpus Map: New widget #818

Merged

3 tasks

VesnaT closed this as completed May 23, 2022

wvdvegte mentioned this issue Mar 24, 2023

Documentation for Annotated Corpus Map #958

Closed
