-
-
Notifications
You must be signed in to change notification settings - Fork 84
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ENH] Statistics widget #503
Conversation
Codecov Report
@@ Coverage Diff @@
## master #503 +/- ##
==========================================
+ Coverage 63.84% 65.41% +1.57%
==========================================
Files 59 61 +2
Lines 6322 6621 +299
Branches 829 872 +43
==========================================
+ Hits 4036 4331 +295
- Misses 2151 2154 +3
- Partials 135 136 +1 |
For the reference, I will put down possible feature construction methods we could consider including:
|
We need to discuss whether we compute statistic based on tokens (n-grams) or raw text. In my opinion, to be consistent with other widgets, statistics must be computed on tokens (n-grams), but on the other side, it is stupid to count words in n-grams. On the other side, if we do that on raw text, preprocessing does not have an effect on the result. What do you think @ajdapretnar? Here is the list of proposed features where the ?? means we need to decide on what basis we need to implement that (n-grams or raw text).
Here are some features that need further discussion:
|
0f10398
to
199cb24
Compare
I would do all on raw text, except for POS tags. For POS tags I would check if POS tags are in tokens, else give an error/warning. Count POS tags: I meant each one individually. The user could select which POS tag to count. There could be an option for all, which would give a column per tag. Stopword count: I don't know why I thought of that. I agree it might be complex. We could just ignore it. Total document length: yes, totally the same as char count, with added white space, but who cares about that? |
ce54a2c
to
b288809
Compare
The widget should be finished now. Maybe two things to discuss:
|
Good point on vowel count. How about providing an line edit that where the user would input the vowels? With a, e, i, o, u as default? Separated by a comma? |
9aa0356
to
4ba9782
Compare
4ba9782
to
7aac113
Compare
@ajdapretnar it is a great idea. I used this solution for both vowels and consonants. From my side widget is ready now. The documentation is missing and will be added when we agree that widget is ok. |
7aac113
to
7e8261f
Compare
This is a good widget! One small issue I found is that it is impossible to count just 'verbs' with POS tag. Verbs have several different POS tags, VB for base form, VBD for past tense, VVZ for 3rd person singular, and so on. With the current implementation only exact matches are computed. Would it be possible to have either all verbs or specific tags? 🤔 Perhaps this is too complicated, I'm just thinking aloud. |
Also, unfortunately it doesn't work with Predictions.
|
Thank you for the report. I probably ruined a compute value functionality. :( Regarding POS tag: I suggest making something similar to vowels and consonants (users can specify more POS tags - they are comma separated). Is it ok? |
7e8261f
to
0135940
Compare
I can't replicate this error with standard workflows anymore. I have updated Orange to the latest version(s) (also widget-base and canvas). Before merging, we need documentation. I can do it or you can, either way. :) |
@ajdapretnar I can make the documentation but I am currently quite busy, so if you have a bit more time you are welcome to add the documentation. |
Sure, no problem. |
0135940
to
5221714
Compare
I considered all suggestions and also changed word count/char count to average word length - hope it is ok. |
I added the documentation. However, the standard create_widget_catalogue script for building widgets.json for some reason removed Topic Modelling from the list. I manually added it back to widgets.json. 🤷 |
69d696c
to
2543856
Compare
Thank you for the documentation. It is great. I fixed what you suggested. Now I think it could be merged. |
2543856
to
964dd6c
Compare
Issue
Implements #229
Description of changes
Statistics widget and tests.
Documentation still need to be written. It will be added when we agree that the widget is ok.
Includes