
[ENH] Statistics widget #503

Merged
merged 3 commits into biolab:master on Apr 22, 2020

Conversation

@PrimozGodec
Collaborator

PrimozGodec commented Mar 1, 2020

Issue

Implements #229

Description of changes

Statistics widget and tests.

Documentation still needs to be written. It will be added once we agree that the widget is ok.

Includes
  • Code changes
  • Tests
  • Documentation

@codecov-io

codecov-io commented Mar 1, 2020

Codecov Report

Merging #503 into master will increase coverage by 1.57%.
The diff coverage is 98.34%.

@@            Coverage Diff             @@
##           master     #503      +/-   ##
==========================================
+ Coverage   63.84%   65.41%   +1.57%     
==========================================
  Files          59       61       +2     
  Lines        6322     6621     +299     
  Branches      829      872      +43     
==========================================
+ Hits         4036     4331     +295     
- Misses       2151     2154       +3     
- Partials      135      136       +1

@ajdapretnar
Collaborator

For reference, I will put down possible feature construction methods we could consider including (a rough sketch of a few of these follows the list):

  • character count
  • word density (word count / character count + 1) [check reference]
  • punctuation count
  • total document length
  • capital count
  • vowel count
  • consonant count
  • percent of unique words
  • count POS tags
  • stopword count
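
Roughly what I have in mind for a few of these, as a plain-Python illustration on a raw string (not widget code; the word-density formula still needs a reference check):

import string

def simple_stats(text):
    # Toy per-document statistics computed on raw text (illustration only).
    words = text.split()
    alnum = sum(c.isalnum() for c in text)
    return {
        "character count": alnum,
        "word density": len(words) / (alnum + 1),
        "punctuation count": sum(c in string.punctuation for c in text),
        "total document length": len(text),
        "capital count": sum(c.isupper() for c in text),
        "percent of unique words": 100 * len(set(words)) / len(words) if words else 0,
    }

simple_stats("The quick brown fox jumps over the lazy dog.")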

@biolab deleted a comment from nikicc Mar 12, 2020
@PrimozGodec
Collaborator Author

We need to discuss whether we compute statistics based on tokens (n-grams) or on raw text. In my opinion, to be consistent with the other widgets, statistics should be computed on tokens (n-grams), but on the other hand it makes little sense to count words in n-grams. Then again, if we compute them on raw text, preprocessing has no effect on the result (a small sketch of this difference follows the two lists below). What do you think @ajdapretnar? Here is the list of proposed features, where ?? means we still need to decide on what basis to implement it (n-grams or raw text).

  1. word count -> ??
  2. Proposed new feature: n-grams count -> number of n-grams
  3. character count -> ??
  4. word density (word count / character count + 1) [check reference] -> this one depends on 1, 3
  5. punctuation count -> ??
  6. capital count -> ??
  7. vowel count -> ??
  8. consonant count -> ??
  9. per cent of unique words -> ??
  10. starts with -> ??
  11. ends with -> ??
  12. contains -> ??
  13. regex -> ??

Here are some features that need further discussion:

  • count POS tags -> what exactly do you mean by this feature? Is it the number of each POS tag, i.e. one column per POS tag?
  • stopword count -> this one is a bit complex, since it requires a dropdown with language selection and opening a file with stop words.
  • total document length -> how is it different from character count?
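
To make the difference concrete, here is a toy example (hypothetical values, not the widget code): on raw text the word count ignores preprocessing entirely, while a token/n-gram based count changes with the tokenizer, stopword and n-gram settings.

text = "Human machine interface for lab ABC computer applications."

# Raw-text word count: preprocessing has no effect on this number.
raw_word_count = len(text.split())            # 8

# Token/n-gram based counts: depend entirely on the preprocessing pipeline.
tokens = ["human", "machine", "interface", "lab", "abc", "computer", "applications"]
bigrams = list(zip(tokens, tokens[1:]))
token_count = len(tokens)                     # 7 after lowercasing and stopword removal
ngram_count = len(bigrams)                    # 6 bigrams - "word count" becomes ambiguous here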

@ajdapretnar
Collaborator

I would compute everything on raw text, except for POS tags. For POS tags I would check whether POS tags are present in the tokens, and otherwise show an error/warning.

Count POS tags: I meant each one individually. The user could select which POS tag to count. There could be an option for all, which would give a column per tag.
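
Roughly like this, just as an illustration of the "one column per tag" idea (hypothetical tagged document):

from collections import Counter

pos_tags = ["DT", "NN", "VBZ", "DT", "JJ", "NN"]      # tags of one document
selected = ["NN", "VBZ", "JJ"]                        # tags the user asked for

counts = Counter(pos_tags)
columns = {tag: counts.get(tag, 0) for tag in selected}   # {'NN': 2, 'VBZ': 1, 'JJ': 1}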

Stopword count: I don't know why I thought of that. I agree it might be complex. We could just ignore it.

Total document length: yes, totally the same as char count, with added white space, but who cares about that?

@PrimozGodec force-pushed the statistics-widget branch 3 times, most recently from ce54a2c to b288809 on March 17, 2020 12:47
@PrimozGodec
Collaborator Author

The widget should be finished now. Maybe two things to discuss:

  • character count: it currently counts all alphanumeric characters (without spaces, punctuation, ...). Is this what we want?
  • vowel count: the situation here is a bit more difficult. In English the vowels are a, e, i, o, u and sometimes y. Y is currently never treated as a vowel, since we also wanted to cover other languages. To handle all languages correctly we should discuss which languages are covered and then do some research about their vowels.

@ajdapretnar
Collaborator

Good point on vowel count. How about providing a line edit where the user would input the vowels, with a, e, i, o, u as the default, separated by commas?
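
Something along these lines, just to illustrate the idea (not a proposal for the actual implementation):

def vowel_count(text, vowels="a, e, i, o, u"):
    # Count characters that appear in a user-editable, comma-separated vowel string.
    vowel_set = {v.strip().lower() for v in vowels.split(",") if v.strip()}
    return sum(ch.lower() in vowel_set for ch in text)

vowel_count("Orange")                              # 3 with the default vowels
vowel_count("rhythm", vowels="a, e, i, o, u, y")   # 1 once y counts as a vowel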

@PrimozGodec changed the title from [WIP] Statistics widget to [ENH] Statistics widget on Mar 18, 2020
@PrimozGodec
Collaborator Author

@ajdapretnar that is a great idea. I used this solution for both vowels and consonants.

From my side, the widget is ready now. The documentation is still missing and will be added once we agree that the widget is ok.

@ajdapretnar
Collaborator

This is a good widget!

One small issue I found is that it is impossible to count just 'verbs' with the POS tag option. Verbs have several different POS tags: VB for the base form, VBD for the past tense, VBZ for the 3rd person singular, and so on. With the current implementation only exact matches are counted. Would it be possible to have either all verbs or specific tags? 🤔 Perhaps this is too complicated, I'm just thinking aloud.
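
For illustration, a prefix match would catch all the verb tags, while the current exact match misses most of them (toy example, not the widget's code):

# Hypothetical document already tagged with Penn Treebank POS tags.
pos_tags = ["DT", "NN", "VBZ", "VBG", "DT", "JJ", "NN", "VBD"]

exact_vb = sum(tag == "VB" for tag in pos_tags)            # 0 - misses VBZ, VBG, VBD
all_verbs = sum(tag.startswith("VB") for tag in pos_tags)  # 3 - counts every verb form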

@ajdapretnar
Collaborator

Also, unfortunately it doesn't work with Predictions.

Traceback (most recent call last):
  File "/Users/ajda/orange/orange-canvas-core/orangecanvas/scheme/signalmanager.py", line 936, in __process_next
    if self.__process_next_helper(use_max_active=True):
  File "/Users/ajda/orange/orange-canvas-core/orangecanvas/scheme/signalmanager.py", line 974, in __process_next_helper
    self.process_node(selected_node)
  File "/Users/ajda/orange/orange-canvas-core/orangecanvas/scheme/signalmanager.py", line 605, in process_node
    self.send_to_node(node, signals_in)
  File "/Users/ajda/orange/orange-widget-base/orangewidget/workflow/widgetsscheme.py", line 792, in send_to_node
    self.process_signals_for_widget(node, widget, signals)
  File "/Users/ajda/orange/orange-widget-base/orangewidget/workflow/widgetsscheme.py", line 833, in process_signals_for_widget
    widget.handleNewSignals()
  File "/Users/ajda/orange/orange3/Orange/widgets/evaluate/owpredictions.py", line 184, in handleNewSignals
    self._call_predictors()
  File "/Users/ajda/orange/orange3/Orange/widgets/evaluate/owpredictions.py", line 209, in _call_predictors
    pred, prob = predictor(classless_data, Model.ValueProbs)
  File "/Users/ajda/orange/orange3/Orange/base.py", line 378, in __call__
    data = data_to_model_domain()
  File "/Users/ajda/orange/orange3/Orange/base.py", line 352, in data_to_model_domain
    return data.transform(self.domain)
  File "/Users/ajda/orange/orange3/Orange/data/table.py", line 520, in transform
    return type(self).from_table(domain, self)
  File "/Users/ajda/orange/orange3-text/orangecontrib/text/corpus.py", line 433, in from_table
    t = super().from_table(domain, source, row_indices)
  File "/Users/ajda/orange/orange3/Orange/data/table.py", line 463, in from_table
    variables=domain.attributes)
  File "/Users/ajda/orange/orange3/Orange/data/table.py", line 397, in get_columns
    col_array = match_density(col(source))
  File "/Users/ajda/orange/orange3/Orange/preprocess/transformation.py", line 30, in __call__
    data = Table.from_table(domain, data)
  File "/Users/ajda/orange/orange3/Orange/data/table.py", line 463, in from_table
    variables=domain.attributes)
  File "/Users/ajda/orange/orange3/Orange/data/table.py", line 397, in get_columns
    col_array = match_density(col(source))
  File "/Users/ajda/orange/orange3-text/orangecontrib/text/widgets/owstatistics.py", line 355, in __call__
    return self.function(data, self.pattern, lambda: True)[0]
TypeError: 'NoneType' object is not subscriptable

@PrimozGodec
Collaborator Author

Thank you for the report. I probably broke the compute value functionality. :(

Regarding POS tags: I suggest making it similar to vowels and consonants (users can specify multiple POS tags, comma separated). Is that ok?

@ajdapretnar
Collaborator

I can't replicate this error with standard workflows anymore. I have updated Orange to the latest version(s) (also widget-base and canvas).

Before merging, we need documentation. I can do it or you can, either way. :)

@PrimozGodec
Collaborator Author

@ajdapretnar I can write the documentation, but I am currently quite busy, so if you have a bit more time you are welcome to add it.

@ajdapretnar
Collaborator

Sure, no problem.

@PrimozGodec
Collaborator Author

I considered all the suggestions and also changed word count / char count to average word length - I hope that is ok.
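
For the record, the new statistic is meant roughly like this (sketch assuming whitespace-split words; the widget may handle edge cases differently):

def average_word_length(text):
    # Average number of characters per word; empty documents give 0.
    words = text.split()
    return sum(len(w) for w in words) / len(words) if words else 0

average_word_length("The quick brown fox")   # (3 + 5 + 5 + 3) / 4 = 4.0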

@ajdapretnar
Collaborator

I added the documentation. However, the standard create_widget_catalogue script for building widgets.json for some reason removed Topic Modelling from the list. I manually added it back to widgets.json. 🤷

@PrimozGodec
Collaborator Author

PrimozGodec commented Apr 17, 2020

Thank you for the documentation. It is great. I fixed what you suggested. Now I think it could be merged.

@ajdapretnar merged commit f69b6e2 into biolab:master on Apr 22, 2020
@PrimozGodec deleted the statistics-widget branch on March 29, 2023 10:38