Annotated Corpus Map: suggestion for more meaningful Scores output #1079

wvdvegte opened this issue Aug 22, 2024 · 2 comments
wvdvegte commented Aug 22, 2024

Is your feature request related to a problem? Please describe.
This suggestion is related to #997 and #1077.
The current Scores output of Annotated Corpus Map gives some access to the most typical words per cluster. However, the only way to extract them seems to be to show the output in a Data Table and sort descending by the Score(Cn) column. If I want listings per cluster, this has to be repeated manually for each cluster.

Describe the solution you'd like
An output like this would be more useful, at least for me:
[screenshot]
It would make it possible to extract the most meaningful words per cluster using common widgets - for instance, Select Rows to filter for the highest-ranked scores per cluster, and Group By to get a concatenated list of characteristic words per cluster together with averages of Score and p-value:
[screenshot]

The first table screenshot is already filtered with Select Rows to show only the top 10 per cluster, but it would make sense to keep everything above a certain Score (fixed or user-definable) and below a certain p-value (the FDR threshold, which is already user-definable).

Describe alternatives you've considered
The attached workflow contains a Python Script that transforms the current Scores output to the suggested output, with the above additional steps added downstream.

Sorted output from Annotated Corpus Map 2.ows.zip
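For illustration, here is a minimal pandas sketch of the kind of reshaping that Python Script performs, plus the downstream filtering and ranking. The word column name and the p-value column names (p(Cn)) are assumptions about the Scores output layout, not the widget's actual names:

```python
import pandas as pd

# Hypothetical export of the wide Scores output: one row per word,
# one Score(Cn) and (assumed) p(Cn) column per cluster.
scores = pd.read_csv("scores.csv")

long_rows = []
for col in scores.columns:
    if col.startswith("Score(") and col.endswith(")"):
        cluster = col[len("Score("):-1]
        p_col = f"p({cluster})"  # assumed name of the matching p-value column
        long_rows.append(pd.DataFrame({
            "Cluster": cluster,
            "Word": scores["Words"],   # assumed name of the word column
            "Score": scores[col],
            "p-value": scores[p_col],
        }))

long = pd.concat(long_rows, ignore_index=True)

# Keep only meaningful rows and rank within each cluster:
# highest Score first, ties broken by lowest p-value.
long = long[(long["Score"] > 0) & (long["p-value"] < 1)]
long = long.sort_values(["Cluster", "Score", "p-value"],
                        ascending=[True, False, True])
print(long.groupby("Cluster").head(10))   # top 10 per cluster
```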

Note: it is not uncommon for multiple words in a cluster to have the same score. In that case it would make sense to rank them from lowest to highest p-value (as in the first screenshot). However, I noticed that in the cluster labels of the Annotated Corpus Map visualization, words with the same score are not ranked by p-value.

@ajdapretnar
Collaborator

We discussed this quite extensively today and couldn't reach a unanimous conclusion.
I feel like we should keep the current Scores output as is. It shows all the available information that can be used for subsequent preprocessing. However, I do agree that further processing to achieve your desired outcome is cumbersome. I noticed you can avoid using Python Script and use Melt instead. Then use Select Rows and Group By (please note that a mean p-value is statistically very shaky). So overall, the above can already be achieved using three Orange widgets, with the exception of selecting the top 10 words for each cluster.

So my vote goes for keeping it as is. Perhaps @janezd can give his two cents?

@SanchoSamba I would still fix the issue of Annotated Corpus Map not ranking words with the same score based on p-value.

@wvdvegte
Author

I must admit I hadn't thought of using Melt. I tried this:

  1. Melt with words as row identifier and exclude zero values (to get rid of rows with score == 0, since the p-value never really reaches 0 anyway)
  2. Select Rows with the condition value is not 1, to get rid of rows with p-value == 1
  3. Group by item, concatenate words and average value.

which gives me something like this (with a slightly different dataset and therefore different clusters and words):

[screenshot]
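Roughly the same pipeline in pandas, as a sketch rather than the widgets themselves (the column names Words, item and value are assumptions about what Melt outputs):

```python
import pandas as pd

# Hypothetical export of Melt's output: one row per (word, melted column),
# with columns "Words" (row identifier), "item" (e.g. "Score(C1)") and "value".
melted = pd.read_csv("melted.csv")

melted = melted[melted["value"] != 0]   # step 1: exclude zero values
melted = melted[melted["value"] != 1]   # step 2: drop rows with p-value == 1
grouped = melted.groupby("item").agg(   # step 3: group by item
    Words=("Words", lambda s: ", ".join(sorted(s))),  # concatenated alphabetically,
                                                       # as Group By appears to do
    value=("value", "mean"),
)
print(grouped)
```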

Indeed, I see no way of getting the most characteristic words first (as we also see in the nice visualization of the corpus map); I can only get them alphabetically, which doesn't seem to make a lot of sense, since the concatenation can still contain words with a low score and/or a high FDR. But was this indeed what you did?
Of course it would also be possible to split the table from Melt into two: (1) rows addressing the score and (2) rows addressing the corrected p-value, then select rows by putting a minimum on the score in the first and a maximum on the p-value in the second, and then merge them back - which unfortunately requires more widgets.

Now I completely agree with you that an average p-value doesn't make sense (I actually didn't use these numbers and didn't give them any thought). Instead, in the processing after my Python script it would make more sense if Group By calculated the minimum of the score and the maximum of the p-value. So the second screenshot in my first post should have looked like this:

[screenshot]
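As a sketch of that aggregation (same hypothetical long-format table as in the first example, with columns Cluster, Word, Score and p-value):

```python
import pandas as pd

# Hypothetical long-format table: one row per (Cluster, Word),
# with columns Cluster, Word, Score, p-value.
long = pd.read_csv("scores_long.csv")

# Rank words within each cluster (highest Score first, ties by lowest p-value),
# then aggregate with the minimum Score and maximum p-value instead of means.
long = long.sort_values(["Cluster", "Score", "p-value"],
                        ascending=[True, False, True])
summary = long.groupby("Cluster").agg(
    Words=("Word", lambda s: ", ".join(s)),
    Min_Score=("Score", "min"),
    Max_p=("p-value", "max"),
)
print(summary)
```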

I realize now that my point is this: the Annotated Corpus Map visualization 'promises' a list of the most characteristic words ranked by score (and p-value), but that list is capped at 5 words and is actually very hard to extract in further processing - with or without more flexibility, such as more than 5 words.
