Annotated Corpus Map: suggestion for more meaningful Scores output #1079

Open
wvdvegte opened this issue Aug 22, 2024 · 0 comments



wvdvegte commented Aug 22, 2024

Is your feature request related to a problem? Please describe.
This suggestion is related to #997 and #1077.
The current Scores output of Annotated Corpus Map gives only limited access to the most typical words per cluster. The only way to extract them seems to be to show the output in a Data Table and sort descending by the column Score(Cn). To get a listing for every cluster, this has to be repeated manually, once per cluster.

Describe the solution you'd like
An output like this would be more useful, at least for me:
image
This would make it possible to extract the most meaningful words per cluster using common widgets. For instance, Select Rows can filter for the highest-ranked scores per cluster, and Group By can produce a concatenated list of characteristic words per cluster together with the averages of Score and p-value:
image

The first table screenshot is already filtered with Select Rows to show only the top 10 words per cluster. It would also make sense to keep everything above a certain Score (fixed or user-definable) and below a certain p-value (the FDR threshold, which is already user-definable).
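To make the filtering and grouping concrete, here is a rough plain-Python sketch of what the Select Rows and Group By steps compute. It is only an illustration, not the widgets' implementation: the function name `summarize_clusters`, the row keys (`cluster`, `word`, `score`, `p_value`) and the default thresholds are all assumptions made for this example.

```python
from collections import defaultdict
from statistics import mean


def summarize_clusters(rows, min_score=0.0, max_p=0.05, top_n=10):
    """Summarize the most characteristic words per cluster.

    `rows` is an iterable of dicts with the (assumed) keys
    'cluster', 'word', 'score' and 'p_value', one dict per word/cluster pair.
    """
    # Step 1 (Select Rows): keep only rows that pass both thresholds.
    by_cluster = defaultdict(list)
    for row in rows:
        if row["score"] >= min_score and row["p_value"] <= max_p:
            by_cluster[row["cluster"]].append(row)

    # Step 2 (Group By): rank within each cluster, keep the top N,
    # then concatenate the words and average Score and p-value.
    summary = []
    for cluster in sorted(by_cluster):
        ranked = sorted(
            by_cluster[cluster],
            key=lambda r: (-r["score"], r["p_value"]),  # ties: lowest p-value first
        )[:top_n]
        summary.append({
            "cluster": cluster,
            "words": ", ".join(r["word"] for r in ranked),
            "mean_score": mean(r["score"] for r in ranked),
            "mean_p_value": mean(r["p_value"] for r in ranked),
        })
    return summary
```

Clusters in which no word passes the thresholds are simply absent from the result. Whether they should instead appear with an empty word list is a design choice left open here.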

Describe alternatives you've considered
The attached workflow contains a Python Script that transforms the current Scores output to the suggested output, with the above additional steps added downstream.

Sorted output from Annotated Corpus Map 2.ows.zip
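For anyone who does not want to open the workflow, the reshaping idea can be sketched in plain Python as below. This is not the script from the attachment. It assumes each word is a dict with a `Word` key, one `Score(Cn)` entry per cluster and a matching `p_value(Cn)` entry. Only the `Score(Cn)` naming comes from the current output as described above; the other key names and the function name `scores_to_long` are hypothetical.

```python
import re

# Matches column names such as "Score(C1)" and captures the cluster label.
SCORE_COL = re.compile(r"^Score\((.+)\)$")


def scores_to_long(wide_rows, word_key="Word", p_template="p_value({})"):
    """Turn one-row-per-word data into one row per (cluster, word) pair.

    `word_key` and `p_template` are assumed names for the word column
    and the per-cluster p-value columns.
    """
    long_rows = []
    for row in wide_rows:
        for key, value in row.items():
            match = SCORE_COL.match(key)
            if match is None:
                continue  # not a score column
            cluster = match.group(1)
            long_rows.append({
                "cluster": cluster,
                "word": row[word_key],
                "score": value,
                "p_value": row[p_template.format(cluster)],
            })
    # Sort by cluster, then highest score first, then lowest p-value first.
    # Note: cluster labels sort as strings, so "C10" comes before "C2".
    long_rows.sort(key=lambda r: (r["cluster"], -r["score"], r["p_value"]))
    return long_rows
```

With the data in this long form, a single sort or filter applies to all clusters at once, so the per-cluster manual sorting is no longer needed.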

Note: It seems fairly common for multiple words in a cluster to have the same score. In that case, it would make sense to rank them from lowest to highest p-value (as in the first screenshot). However, I noticed that in the cluster labels of the Annotated Corpus Map visualization, words with the same score are not ranked by p-value.
