Annotated corpus map: add output option with keywords (cluster labels) per cluster #997

wvdvegte · 2023-08-18T10:09:29Z

Is your feature request related to a problem? Please describe.
After generating clusters with Annotated Corpus Map, I'd like to create a table describing the clusters, e.g. average year of publication, most frequently occurring publisher, etc., but also the words that are typical for the documents in each cluster (the cluster labels in Annotated Corpus Map). I can do most of this with Group By, but there is no way to extract the characteristic words per cluster directly.
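To make the target table concrete, here is a minimal pandas sketch of the part that Group By already covers. It is an illustration only: the column names `Cluster`, `Year` and `Publisher` and the sample rows are assumptions, not the actual output of Annotated Corpus Map.

```python
import pandas as pd

# Hypothetical export of the clustered documents; column names and
# values are made up for illustration.
docs = pd.DataFrame({
    "Cluster": ["C1", "C1", "C1", "C2", "C2"],
    "Year": [2001, 2003, 2005, 2010, 2012],
    "Publisher": ["Elsevier", "Springer", "Elsevier", "Wiley", "Wiley"],
})


def most_frequent(values: pd.Series):
    """Return the most frequent value (first one in sorted order on ties)."""
    return values.mode().iloc[0]


# One row per cluster: average publication year and most common publisher.
summary = docs.groupby("Cluster").agg(
    mean_year=("Year", "mean"),
    top_publisher=("Publisher", most_frequent),
)
print(summary)
```

The missing piece is a per-cluster keywords column that could be joined onto a table like `summary`.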

Describe the solution you'd like
Add an output option for Annotated Corpus Map with keywords (cluster labels) per cluster - preferably with a user-definable maximum (not just the 5 that are produced when cranking up the 'Cluster labels' slider).

Describe alternatives you've considered
Say I have 10 clusters. I could use Select Rows to select a cluster, then Extract Keywords for that cluster. I would have to do this 10x in parallel, then Concatenate and Group By Source ID to get an overview of the words per cluster, which I could merge with the other grouped data per cluster. But this gives me slightly different keywords, and the replication needed to treat each cluster in parallel makes this a cumbersome workaround, especially because it makes it hard to vary the number of clusters (every change means adding or removing parallel branches of Select Rows -> Extract Keywords).

wvdvegte · 2023-08-28T12:11:54Z

I noticed that the latest version now has a Scores output, which at least gives access to the most typical words per cluster. However, the only way to get the most typical words per cluster seems to be to show them in a Data Table and then sort descending by the column Score(Cn). This still has to be repeated manually for each cluster if I want listings per cluster.
Also, I noticed that the order of words by highest score is different from the order shown in the cluster labels in the visualization, which makes me wonder how the order in the cluster label is determined. In my dataset, there are two words with the same score and different p-values, but the one with the higher p-value (the less significant one) comes first in the cluster label...
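For reference, the manual sort-per-column step can be scripted once the Scores table is exported. The following is a rough pandas sketch, not the widget's own method: it assumes a word column named `Word` (a guess) and one `Score(Cn)` column per cluster as seen in the Data Table, uses made-up values, and ignores the p-value columns.

```python
import pandas as pd

# Hypothetical stand-in for the Scores output; "Word" is an assumed
# column name and the values are invented for illustration.
scores = pd.DataFrame({
    "Word": ["energy", "solar", "policy", "market", "grid"],
    "Score(C1)": [0.9, 0.7, 0.1, 0.2, 0.5],
    "Score(C2)": [0.1, 0.2, 0.8, 0.6, 0.3],
})


def top_words_per_cluster(scores, word_col="Word", max_words=3):
    """Build one row per cluster with its highest-scoring words."""
    score_cols = [c for c in scores.columns if c.startswith("Score(")]
    rows = []
    for col in score_cols:
        cluster = col[len("Score("):-1]  # "Score(C1)" -> "C1"
        top = scores.nlargest(max_words, col)  # sorted descending by score
        rows.append({"Cluster": cluster, "Keywords": ", ".join(top[word_col])})
    return pd.DataFrame(rows).set_index("Cluster")


keywords = top_words_per_cluster(scores, max_words=3)
print(keywords)
```

Here `max_words` plays the role of the user-definable maximum requested above. Ordering by score alone will not necessarily reproduce the order shown in the cluster labels.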

wvdvegte mentioned this issue Jul 22, 2024

Annotated corpus map: provide corrected instead of uncorrected p-value in Scores output #1077

Open

wvdvegte mentioned this issue Aug 22, 2024

Annotated Corpus Map: suggestion for more meaningful Scores output #1079

Open
