Annotated Corpus Map: suggestion for more meaningful Scores output #1079
Comments
We discussed this quite extensively today and couldn't reach a unanimous conclusion, so my vote is to keep it as is. Perhaps @janezd can give his two cents? @SanchoSamba I would still fix the issue of Annotated Corpus Map not ranking words with the same score by p-value.
I must admit I hadn't thought of using Melt. I tried this:
which gives me something like this (with a slightly different dataset and therefore different clusters and words):

Indeed, I see no way of getting the most characteristic words first (as we also see in the nice visualization of the corpus map). I can only get them alphabetically, which doesn't make much sense, since the concatenation can still contain words with a low score and/or a high FDR. But is this indeed what you did?

I now completely agree with you that an average p-value doesn't make sense (I didn't actually use these numbers and hadn't given them a thought). Instead, in the processing after my Python script, it would make more sense for Group By to calculate the minimum of the score and the maximum of the p-value. So the second screenshot in my first post should have been like this:

I realize now that my point is this: the visualization of Annotated Corpus Map 'promises' a list of the most characteristic words ranked by score (and p-value). Unfortunately, that list has a maximum length of 5 words and is very hard to extract in further processing, with or without more flexibility such as more than 5 words.
Is your feature request related to a problem? Please describe.
This suggestion is related to #997 and #1077.
The current Scores output of Annotated Corpus Map gives some access to the most typical words per cluster. However, the only way to extract the most typical words per cluster seems to be to show them in a Data Table and then sort descending by the column Score(Cn). If I want listings per cluster, this has to be repeated manually for each cluster.
Describe the solution you'd like
An output like this would be more useful, at least for me:
It would make it possible to extract the most meaningful words per cluster using common widgets. For instance, Select Rows can filter for the highest-ranked scores per cluster, and Group By can produce a concatenated list of characteristic words per cluster along with averages of Score and p-value:
The first table screenshot is already filtered with Select Rows to show only the top 10 per cluster, but it would make sense to keep everything above a certain Score (fixed or user-definable) and below a certain p-value (the FDR threshold, which is already user-definable).
Describe alternatives you've considered
The attached workflow contains a Python Script that transforms the current Scores output to the suggested output, with the above additional steps added downstream.
Sorted output from Annotated Corpus Map 2.ows.zip
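For readers without the workflow at hand, the reshaping step can be sketched roughly as follows. This is not the script from the attached workflow, only a minimal plain-Python illustration. The column names `Word` and `p-value(Cn)` are assumptions (only `Score(Cn)` is mentioned above), and the sample values are invented.

```python
import re

# Assumed header pattern: "Score(C1)", "Score(C2)", ... with a matching
# "p-value(C1)", "p-value(C2)", ... column. Adjust to the real headers.
SCORE_COLUMN = re.compile(r"^Score\((?P<cluster>[^)]+)\)$")


def wide_to_long(rows, word_key="Word"):
    """Turn one row per word into one row per (cluster, word) pair."""
    long_rows = []
    for row in rows:
        for column, value in row.items():
            match = SCORE_COLUMN.match(column)
            if match is None:
                continue  # skip the word column and the p-value columns
            cluster = match.group("cluster")
            long_rows.append({
                "cluster": cluster,
                "word": row[word_key],
                "score": value,
                "p_value": row[f"p-value({cluster})"],
            })
    return long_rows


# Invented sample data, for illustration only.
wide = [
    {"Word": "forest", "Score(C1)": 3.2, "p-value(C1)": 0.001,
     "Score(C2)": 0.4, "p-value(C2)": 0.60},
    {"Word": "king", "Score(C1)": 0.1, "p-value(C1)": 0.90,
     "Score(C2)": 2.7, "p-value(C2)": 0.002},
]
long_rows = wide_to_long(wide)
```

With the data in this long form, each cluster's words sit in their own rows, so Select Rows and Group By can operate per cluster without any manual sorting.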
Note: It does not seem uncommon for multiple words in a cluster to have the same score. In that case, it would make sense to rank them from lowest to highest p-value (as in the first screenshot). However, I noticed that in the cluster labels of the Annotated Corpus Map visualization, words with the same score are not ranked by p-value.
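The ranking rule in the note can be sketched as a sort on score (descending) with p-value (ascending) as the tie-breaker, followed by a per-cluster summary. This is only an illustration on invented data: the field names `cluster`, `word`, `score` and `p_value` and both function names are hypothetical, and the summary uses the minimum score and maximum p-value instead of averages.

```python
from collections import defaultdict


def top_words_per_cluster(rows, n=10):
    """Rank words within each cluster: highest score first,
    ties broken by lowest p-value; keep the first n."""
    by_cluster = defaultdict(list)
    for row in rows:
        by_cluster[row["cluster"]].append(row)
    return {
        cluster: sorted(members, key=lambda r: (-r["score"], r["p_value"]))[:n]
        for cluster, members in by_cluster.items()
    }


def summarize(ranked):
    """Per cluster: words in rank order, minimum score, maximum p-value."""
    return {
        cluster: {
            "words": ", ".join(r["word"] for r in members),
            "min_score": min(r["score"] for r in members),
            "max_p_value": max(r["p_value"] for r in members),
        }
        for cluster, members in ranked.items()
    }


# Invented sample data: "tree" and "forest" tie on score.
rows = [
    {"cluster": "C1", "word": "tree", "score": 3.2, "p_value": 0.010},
    {"cluster": "C1", "word": "forest", "score": 3.2, "p_value": 0.001},
    {"cluster": "C1", "word": "leaf", "score": 1.5, "p_value": 0.040},
    {"cluster": "C2", "word": "king", "score": 2.7, "p_value": 0.002},
]
summary = summarize(top_words_per_cluster(rows, n=2))
```

Here "forest" is listed before "tree" in cluster C1 because it has the lower p-value at the same score, and the concatenated list keeps rank order, not alphabetical order.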