Annotated corpus map: add output option with keywords (cluster labels) per cluster #997

Open
wvdvegte opened this issue Aug 18, 2023 · 1 comment

Comments

wvdvegte commented Aug 18, 2023

Is your feature request related to a problem? Please describe.
After generating clusters with Annotated Corpus Map, I'd like to create a table describing the clusters, e.g. average year of publication, most frequently occurring publisher, etc., but also the words that are typical for the documents in each cluster (the cluster labels in Annotated Corpus Map). I can do most of this with Group By, but there is no way to extract the characteristic words per cluster directly.

Describe the solution you'd like
Add an output option for Annotated Corpus Map with keywords (cluster labels) per cluster - preferably with a user-definable maximum (not just the 5 that are produced when cranking up the 'Cluster labels' slider).

Describe alternatives you've considered
Let's say I have 10 clusters. I could use Select Rows to select one cluster and then run Extract Keywords on it. I have to do this 10x in parallel, then Concatenate and Group By Source ID to get an overview of the words per cluster, which I could merge with the other grouped data per cluster. But this gives me slightly different keywords, and the replication needed to treat each cluster in parallel makes this a cumbersome workaround - especially because it hardly allows me to vary the number of clusters (which necessitates adding/removing parallel branches of Select Rows -> Extract Keywords).
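
For illustration, the loop that the parallel branches emulate could be sketched in plain Python (standard library only, no Orange API). Everything in it is an assumption made for the example: the function name `keywords_per_cluster`, the `(cluster_id, tokens)` input layout, and the scoring, which is a simple stand-in (share of a cluster's documents containing a word minus that word's share in the whole corpus) and not the method used by Annotated Corpus Map or Extract Keywords.

```python
from collections import Counter, defaultdict


def keywords_per_cluster(docs, max_words=5):
    """Return {cluster_id: [word, ...]} with at most max_words words each.

    docs is an iterable of (cluster_id, tokens) pairs (hypothetical layout).
    """
    doc_freq = defaultdict(Counter)  # cluster -> word -> number of documents
    cluster_sizes = Counter()        # cluster -> number of documents
    for cluster, tokens in docs:
        cluster_sizes[cluster] += 1
        doc_freq[cluster].update(set(tokens))  # count each word once per document

    total_docs = sum(cluster_sizes.values())
    overall = Counter()              # word -> number of documents in the corpus
    for counts in doc_freq.values():
        overall.update(counts)

    result = {}
    for cluster, counts in doc_freq.items():
        size = cluster_sizes[cluster]
        # Stand-in score: document share inside the cluster minus corpus-wide share.
        scores = {w: counts[w] / size - overall[w] / total_docs for w in counts}
        # Highest score first; ties broken alphabetically for a stable order.
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        result[cluster] = ranked[:max_words]
    return result


docs = [
    ("C1", ["solar", "panel", "energy"]),
    ("C1", ["solar", "cell", "energy"]),
    ("C2", ["wind", "turbine", "energy"]),
    ("C2", ["wind", "farm", "energy"]),
]
print(keywords_per_cluster(docs, max_words=2))
```

The point of the sketch is that the number of clusters and the maximum number of words are just parameters here, which is what the requested output option would offer without rebuilding the workflow.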

@wvdvegte
Author

I noticed that the latest version now has a Scores output, which at least gives access to the most typical words per cluster. However, the only way to get the most typical words per cluster seems to be to show them in a Data Table and then sort descending by the column Score(Cn). This still has to be repeated manually for each cluster if I want listings per cluster.
Also, I noticed that the order of words by highest score is different from the one shown in the cluster labels in the visualization. This makes me wonder how the order in the cluster label is determined. In my dataset, there are two words with the same score and different p-values, but the one with the highest p-value (least significance) comes first in the cluster label...
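
To make the request concrete, here is a minimal sketch of the listing I have in mind, assuming the scores have been exported to rows of `(cluster, word, score, p_value)`. That long layout, the function name `top_words_per_cluster`, and the tie-break on the lower p-value are my own assumptions for illustration; I don't know how the widget orders words with equal scores.

```python
def top_words_per_cluster(rows, max_words=5):
    """Return {cluster: [word, ...]} with the best-scoring words per cluster.

    rows is an iterable of (cluster, word, score, p_value) tuples
    (hypothetical layout, not the actual Scores output format).
    """
    by_cluster = {}
    for cluster, word, score, p_value in rows:
        by_cluster.setdefault(cluster, []).append((word, score, p_value))

    result = {}
    for cluster, items in by_cluster.items():
        # Sort by score descending, then p-value ascending (more significant
        # first), then word, so equal scores are ordered deterministically.
        ranked = sorted(items, key=lambda r: (-r[1], r[2], r[0]))
        result[cluster] = [word for word, _, _ in ranked[:max_words]]
    return result


rows = [
    ("C1", "energy", 0.8, 0.20),
    ("C1", "solar", 0.8, 0.01),
    ("C1", "panel", 0.5, 0.03),
    ("C2", "turbine", 0.4, 0.05),
    ("C2", "wind", 0.9, 0.02),
]
print(top_words_per_cluster(rows, max_words=2))
```

With this ordering, of two words with the same score the one with the lower p-value comes first, which is the opposite of what I see in the cluster label.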
