
[ENH] Statistics widget #503

Merged
merged 3 commits into biolab:master on Apr 22, 2020

Conversation

@PrimozGodec
Collaborator

PrimozGodec commented Mar 1, 2020

Issue

Implements #229

Description of changes

Statistics widget and tests.

Documentation still needs to be written. It will be added once we agree that the widget is ok.

Includes
  • Code changes
  • Tests
  • Documentation

@codecov-io

codecov-io commented Mar 1, 2020

Codecov Report

Merging #503 into master will increase coverage by 1.57%.
The diff coverage is 98.34%.

@@            Coverage Diff             @@
##           master     #503      +/-   ##
==========================================
+ Coverage   63.84%   65.41%   +1.57%     
==========================================
  Files          59       61       +2     
  Lines        6322     6621     +299     
  Branches      829      872      +43     
==========================================
+ Hits         4036     4331     +295     
- Misses       2151     2154       +3     
- Partials      135      136       +1

@ajdapretnar
Collaborator

For reference, I will put down possible feature construction methods we could consider including (a rough sketch of a few of these follows the list):

  • character count
  • word density (word count / character count + 1) [check reference]
  • punctuation count
  • total document length
  • capital count
  • vowel count
  • consonant count
  • percent of unique words
  • count POS tags
  • stopword count
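
Roughly what I have in mind for a few of these, as a plain-Python illustration on a raw string (not widget code; the word-density formula still needs a reference check):

import string

def simple_stats(text):
    # Toy per-document statistics computed on raw text (illustration only).
    words = text.split()
    alnum = sum(c.isalnum() for c in text)
    return {
        "character count": alnum,
        "word density": len(words) / (alnum + 1),
        "punctuation count": sum(c in string.punctuation for c in text),
        "total document length": len(text),
        "capital count": sum(c.isupper() for c in text),
        "percent of unique words": 100 * len(set(words)) / len(words) if words else 0,
    }

simple_stats("The quick brown fox jumps over the lazy dog.")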

@biolab deleted a comment from nikicc Mar 12, 2020
@PrimozGodec
Collaborator Author

We need to discuss whether we compute statistics based on tokens (n-grams) or on raw text. In my opinion, to be consistent with the other widgets, statistics should be computed on tokens (n-grams), but on the other hand it makes little sense to count words in n-grams. Then again, if we compute them on raw text, preprocessing has no effect on the result (a small sketch of this difference follows the two lists below). What do you think @ajdapretnar? Here is the list of proposed features, where ?? means we still need to decide on what basis to implement it (n-grams or raw text).

  1. word count -> ??
  2. Proposed new feature: n-grams count -> number of n-grams
  3. character count -> ??
  4. word density (word count / character count + 1) [check reference] -> this one depends on 1, 3
  5. punctuation count -> ??
  6. capital count -> ??
  7. vowel count -> ??
  8. consonant count -> ??
  9. per cent of unique words -> ??
  10. starts with -> ??
  11. ends with -> ??
  12. contains -> ??
  13. regex -> ??

Here are some features that need further discussion:

  • count POS tags -> what exactly do you mean by this feature? Is it the number of each POS tag, i.e. one column per POS tag?
  • stopword count -> this one is a bit complex, since it requires a dropdown with language selection and opening a file with stop words.
  • total document length -> how is it different from character count?
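
To make the difference concrete, here is a toy example (hypothetical values, not the widget code): on raw text the word count ignores preprocessing entirely, while a token/n-gram based count changes with the tokenizer, stopword and n-gram settings.

text = "Human machine interface for lab ABC computer applications."

# Raw-text word count: preprocessing has no effect on this number.
raw_word_count = len(text.split())            # 8

# Token/n-gram based counts: depend entirely on the preprocessing pipeline.
tokens = ["human", "machine", "interface", "lab", "abc", "computer", "applications"]
bigrams = list(zip(tokens, tokens[1:]))
token_count = len(tokens)                     # 7 after lowercasing and stopword removal
ngram_count = len(bigrams)                    # 6 bigrams - "word count" becomes ambiguous here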

@ajdapretnar
Collaborator

I would compute everything on raw text, except for POS tags. For POS tags I would check whether POS tags are present in the tokens, and otherwise show an error/warning.

Count POS tags: I meant each one individually. The user could select which POS tag to count. There could be an option for all, which would give a column per tag.
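
Roughly like this, just as an illustration of the "one column per tag" idea (hypothetical tagged document):

from collections import Counter

pos_tags = ["DT", "NN", "VBZ", "DT", "JJ", "NN"]      # tags of one document
selected = ["NN", "VBZ", "JJ"]                        # tags the user asked for

counts = Counter(pos_tags)
columns = {tag: counts.get(tag, 0) for tag in selected}   # {'NN': 2, 'VBZ': 1, 'JJ': 1}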

Stopword count: I don't know why I thought of that. I agree it might be complex. We could just ignore it.

Total document length: yes, totally the same as char count, with added white space, but who cares about that?

@PrimozGodec force-pushed the statistics-widget branch 3 times, most recently from ce54a2c to b288809 on March 17, 2020 12:47
@PrimozGodec
Collaborator Author

The widget should be finished now. Maybe two things to discuss:

  • character count: it currently counts all alphanumeric characters (without spaces, punctuation, ...). Is this what we want?
  • vowel count: the situation here is a bit more difficult. In English the vowels are a, e, i, o, u and sometimes y. Y is currently never treated as a vowel, since we also wanted to cover other languages. To handle all languages correctly we should discuss which languages are covered and then do some research about their vowels.

@ajdapretnar
Collaborator

Good point on vowel count. How about providing a line edit where the user would input the vowels, with a, e, i, o, u as the default, separated by commas?
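
Something along these lines, just to illustrate the idea (not a proposal for the actual implementation):

def vowel_count(text, vowels="a, e, i, o, u"):
    # Count characters that appear in a user-editable, comma-separated vowel string.
    vowel_set = {v.strip().lower() for v in vowels.split(",") if v.strip()}
    return sum(ch.lower() in vowel_set for ch in text)

vowel_count("Orange")                              # 3 with the default vowels
vowel_count("rhythm", vowels="a, e, i, o, u, y")   # 1 once y counts as a vowel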

@PrimozGodec changed the title from [WIP] Statistics widget to [ENH] Statistics widget on Mar 18, 2020
@PrimozGodec
Collaborator Author

@ajdapretnar that is a great idea. I used this solution for both vowels and consonants.

From my side, the widget is ready now. The documentation is still missing and will be added once we agree that the widget is ok.

@ajdapretnar
Collaborator

This is a good widget!

One small issue I found is that it is impossible to count just 'verbs' with the POS tag option. Verbs have several different POS tags: VB for the base form, VBD for the past tense, VBZ for the 3rd person singular, and so on. With the current implementation only exact matches are counted. Would it be possible to have either all verbs or specific tags? 🤔 Perhaps this is too complicated, I'm just thinking aloud.
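
For illustration, a prefix match would catch all the verb tags, while the current exact match misses most of them (toy example, not the widget's code):

# Hypothetical document already tagged with Penn Treebank POS tags.
pos_tags = ["DT", "NN", "VBZ", "VBG", "DT", "JJ", "NN", "VBD"]

exact_vb = sum(tag == "VB" for tag in pos_tags)            # 0 - misses VBZ, VBG, VBD
all_verbs = sum(tag.startswith("VB") for tag in pos_tags)  # 3 - counts every verb form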

@ajdapretnar
Collaborator

Also, unfortunately it doesn't work with Predictions.

Traceback (most recent call last):
  File "/Users/ajda/orange/orange-canvas-core/orangecanvas/scheme/signalmanager.py", line 936, in __process_next
    if self.__process_next_helper(use_max_active=True):
  File "/Users/ajda/orange/orange-canvas-core/orangecanvas/scheme/signalmanager.py", line 974, in __process_next_helper
    self.process_node(selected_node)
  File "/Users/ajda/orange/orange-canvas-core/orangecanvas/scheme/signalmanager.py", line 605, in process_node
    self.send_to_node(node, signals_in)
  File "/Users/ajda/orange/orange-widget-base/orangewidget/workflow/widgetsscheme.py", line 792, in send_to_node
    self.process_signals_for_widget(node, widget, signals)
  File "/Users/ajda/orange/orange-widget-base/orangewidget/workflow/widgetsscheme.py", line 833, in process_signals_for_widget
    widget.handleNewSignals()
  File "/Users/ajda/orange/orange3/Orange/widgets/evaluate/owpredictions.py", line 184, in handleNewSignals
    self._call_predictors()
  File "/Users/ajda/orange/orange3/Orange/widgets/evaluate/owpredictions.py", line 209, in _call_predictors
    pred, prob = predictor(classless_data, Model.ValueProbs)
  File "/Users/ajda/orange/orange3/Orange/base.py", line 378, in __call__
    data = data_to_model_domain()
  File "/Users/ajda/orange/orange3/Orange/base.py", line 352, in data_to_model_domain
    return data.transform(self.domain)
  File "/Users/ajda/orange/orange3/Orange/data/table.py", line 520, in transform
    return type(self).from_table(domain, self)
  File "/Users/ajda/orange/orange3-text/orangecontrib/text/corpus.py", line 433, in from_table
    t = super().from_table(domain, source, row_indices)
  File "/Users/ajda/orange/orange3/Orange/data/table.py", line 463, in from_table
    variables=domain.attributes)
  File "/Users/ajda/orange/orange3/Orange/data/table.py", line 397, in get_columns
    col_array = match_density(col(source))
  File "/Users/ajda/orange/orange3/Orange/preprocess/transformation.py", line 30, in __call__
    data = Table.from_table(domain, data)
  File "/Users/ajda/orange/orange3/Orange/data/table.py", line 463, in from_table
    variables=domain.attributes)
  File "/Users/ajda/orange/orange3/Orange/data/table.py", line 397, in get_columns
    col_array = match_density(col(source))
  File "/Users/ajda/orange/orange3-text/orangecontrib/text/widgets/owstatistics.py", line 355, in __call__
    return self.function(data, self.pattern, lambda: True)[0]
TypeError: 'NoneType' object is not subscriptable

@PrimozGodec
Collaborator Author

Thank you for the report. I probably broke the compute value functionality. :(

Regarding POS tags: I suggest making it similar to vowels and consonants (users can specify multiple POS tags, comma separated). Is that ok?

@ajdapretnar
Collaborator

I can't replicate this error with standard workflows anymore. I have updated Orange to the latest version(s) (also widget-base and canvas).

Before merging, we need documentation. I can do it or you can, either way. :)

@PrimozGodec
Collaborator Author

@ajdapretnar I can write the documentation, but I am currently quite busy, so if you have a bit more time you are welcome to add it.

@ajdapretnar
Collaborator

Sure, no problem.

@PrimozGodec
Collaborator Author

I considered all the suggestions and also changed word count / char count to average word length - I hope that is ok.
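
For the record, the new statistic is meant roughly like this (sketch assuming whitespace-split words; the widget may handle edge cases differently):

def average_word_length(text):
    # Average number of characters per word; empty documents give 0.
    words = text.split()
    return sum(len(w) for w in words) / len(words) if words else 0

average_word_length("The quick brown fox")   # (3 + 5 + 5 + 3) / 4 = 4.0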

@ajdapretnar
Collaborator

I added the documentation. However, the standard create_widget_catalogue script for building widgets.json for some reason removed Topic Modelling from the list. I manually added it back to widgets.json. 🤷

@PrimozGodec
Collaborator Author

PrimozGodec commented Apr 17, 2020

Thank you for the documentation. It is great. I fixed what you suggested. Now I think it could be merged.

@ajdapretnar merged commit f69b6e2 into biolab:master on Apr 22, 2020
@PrimozGodec deleted the statistics-widget branch on March 29, 2023 10:38