Annotated Corpus Map: suggestion for more meaningful Scores output #1079

wvdvegte opened this issue Aug 22, 2024 · 2 comments
wvdvegte commented Aug 22, 2024

Is your feature request related to a problem? Please describe.
This suggestion is related to #997 and #1077.
The current Scores output of Annotated Corpus Map gives some access to the most typical words per cluster. However, the only way to extract them seems to be to show the output in a Data Table and sort descending by the Score(Cn) column. If I want listings per cluster, this has to be repeated manually for each cluster.

Describe the solution you'd like
An output like this would be more useful, at least for me:
[screenshot]
It would make it possible to extract the most meaningful words per cluster using common widgets - for instance, Select Rows to filter for the highest-ranked scores per cluster, and Group By to get a concatenated list of characteristic words per cluster together with averages of Score and p-value:
[screenshot]

The first table screenshot is already filtered with Select Rows to show only the top 10 per cluster, but it would make sense to keep everything above a certain Score (fixed or user-definable) and below a certain p-value (the FDR threshold, which is already user-definable).

Describe alternatives you've considered
The attached workflow contains a Python Script that transforms the current Scores output to the suggested output, with the above additional steps added downstream.

Sorted output from Annotated Corpus Map 2.ows.zip
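For illustration, here is a minimal pandas sketch of the kind of reshaping that Python Script performs, plus the downstream filtering and ranking. The word column name and the p-value column names (p(Cn)) are assumptions about the Scores output layout, not the widget's actual names:

```python
import pandas as pd

# Hypothetical export of the wide Scores output: one row per word,
# one Score(Cn) and (assumed) p(Cn) column per cluster.
scores = pd.read_csv("scores.csv")

long_rows = []
for col in scores.columns:
    if col.startswith("Score(") and col.endswith(")"):
        cluster = col[len("Score("):-1]
        p_col = f"p({cluster})"  # assumed name of the matching p-value column
        long_rows.append(pd.DataFrame({
            "Cluster": cluster,
            "Word": scores["Words"],   # assumed name of the word column
            "Score": scores[col],
            "p-value": scores[p_col],
        }))

long = pd.concat(long_rows, ignore_index=True)

# Keep only meaningful rows and rank within each cluster:
# highest Score first, ties broken by lowest p-value.
long = long[(long["Score"] > 0) & (long["p-value"] < 1)]
long = long.sort_values(["Cluster", "Score", "p-value"],
                        ascending=[True, False, True])
print(long.groupby("Cluster").head(10))   # top 10 per cluster
```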

Note: it is not uncommon for multiple words in a cluster to have the same score. In that case it would make sense to rank them from lowest to highest p-value (as in the first screenshot). However, I noticed that in the cluster labels of the Annotated Corpus Map visualization, words with the same score are not ranked by p-value.

@ajdapretnar
Collaborator

We discussed this quite extensively today and couldn't reach a unanimous conclusion.
I feel like we should keep the current Scores output as is. It shows all the available information that can be used for subsequent preprocessing. However, I do agree that further processing to achieve your desired outcome is cumbersome. I noticed you can avoid using Python Script and use Melt instead. Then use Select Rows and Group By (please note that a mean p-value is statistically very shaky). So overall, the above can already be achieved using three Orange widgets, with the exception of selecting the top 10 words for each cluster.

So my vote goes for keeping it as is. Perhaps @janezd can give his two cents?

@SanchoSamba I would still fix the issue of Annotated Corpus Map not ranking words with the same score based on p-value.

@wvdvegte
Author

I must admit I hadn't thought of using Melt. I tried this:

  1. Melt with words as row identifier and exclude zero values (to get rid of rows with score == 0, since the p-value never really reaches 0 anyway)
  2. Select Rows with the condition value is not 1, to get rid of rows with p-value == 1
  3. Group by item, concatenate words and average value.

which gives me something like this (with a slightly different dataset and therefore different clusters and words):

[screenshot]
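Roughly the same pipeline in pandas, as a sketch rather than the widgets themselves (the column names Words, item and value are assumptions about what Melt outputs):

```python
import pandas as pd

# Hypothetical export of Melt's output: one row per (word, melted column),
# with columns "Words" (row identifier), "item" (e.g. "Score(C1)") and "value".
melted = pd.read_csv("melted.csv")

melted = melted[melted["value"] != 0]   # step 1: exclude zero values
melted = melted[melted["value"] != 1]   # step 2: drop rows with p-value == 1
grouped = melted.groupby("item").agg(   # step 3: group by item
    Words=("Words", lambda s: ", ".join(sorted(s))),  # concatenated alphabetically,
                                                       # as Group By appears to do
    value=("value", "mean"),
)
print(grouped)
```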

Indeed, I see no way of getting the most characteristic words first (as we also see in the nice visualization of the corpus map); I can only get them alphabetically, which doesn't seem to make a lot of sense, since the concatenation can still contain words with a low score and/or a high FDR. But was this indeed what you did?
Of course it would also be possible to split the table from Melt into two: (1) rows addressing the score and (2) rows addressing the corrected p-value, then select rows by putting a minimum on the score in the first and a maximum on the p-value in the second, and then merge them back - which unfortunately requires more widgets.

Now I completely agree with you that an average p-value doesn't make sense (I actually didn't use these numbers and didn't give them any thought). Instead, in the processing after my Python script it would make more sense if Group By calculated the minimum of the score and the maximum of the p-value. So the second screenshot in my first post should have looked like this:

[screenshot]
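As a sketch of that aggregation (same hypothetical long-format table as in the first example, with columns Cluster, Word, Score and p-value):

```python
import pandas as pd

# Hypothetical long-format table: one row per (Cluster, Word),
# with columns Cluster, Word, Score, p-value.
long = pd.read_csv("scores_long.csv")

# Rank words within each cluster (highest Score first, ties by lowest p-value),
# then aggregate with the minimum Score and maximum p-value instead of means.
long = long.sort_values(["Cluster", "Score", "p-value"],
                        ascending=[True, False, True])
summary = long.groupby("Cluster").agg(
    Words=("Word", lambda s: ", ".join(s)),
    Min_Score=("Score", "min"),
    Max_p=("p-value", "max"),
)
print(summary)
```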

I realize now that my point is this: the Annotated Corpus Map visualization 'promises' a list of the most characteristic words ranked by score (and p-value), but that list is capped at 5 words and is actually very hard to extract in further processing - with or without more flexibility, such as more than 5 words.
