Add Automated Readability Index (ARI). Closes #20 #46
Add ARI to the visualization module and add unit tests to test_visualization. Also import numpy in visualization and test_visualization so that NaNs can be returned in a Series.
Hi @hf2000510, thank you for your PR and welcome! In general, Texthero's source code should be as minimal, fast, and concise as possible. We can probably change the code a bit to make it clearer, easier to read, and likely faster.

Learn from textstat
Textstat, a Python toolkit that provides some text statistics, has already implemented ARI. Their solution is a bit more concise:

We can probably learn from them.

Pandas way
If you look carefully at almost all of Texthero's functions, when not strictly necessary, we try to avoid using […] For the ARI function, we can probably obtain the same results by first computing the Pandas Series for […]

Check input is string
Using […] Instead, we should probably use […]

Extra comments
To implement this function correctly, we basically need to implement three sub-tasks:

For […] For […]

This was quite a long PR review; I hope you got some interesting and useful hints. Let me know your feedback!
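The "Pandas way" and input-check points above can be sketched as follows. This is only an illustration, not the code from the PR or from Texthero: the helper name `text_counts` and the three regular expressions are assumptions made for the example, and counting sentences by punctuation is a deliberately crude stand-in. The check uses `pd.api.types.is_string_dtype`, which a later commit in this thread reports using.

```python
import pandas as pd


def text_counts(s: pd.Series) -> pd.DataFrame:
    """Hypothetical helper: per-row character, word and sentence counts."""
    # Reject non-string input early instead of failing inside .str methods.
    if not pd.api.types.is_string_dtype(s):
        raise TypeError("expected a Series of strings")
    # Each count is a vectorized Series operation, with no Python-level loop.
    return pd.DataFrame(
        {
            "characters": s.str.count(r"[A-Za-z0-9]"),  # letters and digits only
            "words": s.str.count(r"\S+"),               # whitespace-separated tokens
            "sentences": s.str.count(r"[.!?]+"),        # crude: runs of end punctuation
        }
    )


counts = text_counts(pd.Series(["Hello world. How are you?", "One sentence only."]))
print(counts)
```

Keeping the counts as separate Series means the final score can be computed with ordinary Series arithmetic afterwards.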
Thanks for your help! I've converted this to a draft pull request and opened #51 to first implement a count_sentences function, which makes sense independently and also makes the automated readability index implementation easier. Will finish this when count_sentences is done.
* Added Remove Tags and Replace Tags
* removed contributor
* README.md
* updated README
* added replace hashtags and remove hashtag
* Fixed the Documentation
* Preprocessing Hashtag Regex as a raw string
* Add count_sentences function to nlp.py. Also add tests for the function to test_nlp.py
* Implement suggestions from pull request. Add more tests, change style (docstring, tests naming). Remove unicode-casting to avoid unexpected behaviour.
* Add link to spacy documentation. Additionally update index tests, they're cleaner now.
Co-authored-by: Henri Froese
Now incorporates suggested changes. Input checking done with pd.api.types.is_string_dtype. Not a permanent solution, will be improved by jbesomi#60 etc. Co-authored-by: Maximilian Krahn
New pull request from jbesomi#46 as we had some Git problems. Input checking done with pd.api.types.is_string_dtype. Not a permanent solution, will be improved by jbesomi#60 etc. Co-authored-by: Maximilian Krahn
We had some Git trouble (as you can probably see above 🥉), so we closed this and moved the PR to #74. Sorry about that!
The function (added to visualization.py) returns a new Series in which each entry is the ARI of the entry at the same position in the given Series. It uses exactly the Wikipedia formula and description. NumPy is imported so that NaNs can be returned for invalid entries.
Also added unit tests to test_visualization.py.
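For reference, the commonly cited ARI formula is 4.71 × (characters / words) + 0.5 × (words / sentences) − 21.43. A minimal sketch of that arithmetic, with NaN returned when the ratios are undefined, might look like this. The function name `ari_from_counts` is hypothetical and is not the function added in this PR, and treating zero words or zero sentences as the "invalid" case is an assumption.

```python
import numpy as np


def ari_from_counts(characters: int, words: int, sentences: int) -> float:
    """Hypothetical helper: ARI from raw counts, NaN if it is undefined."""
    # With no words or no sentences the ratios below would divide by zero.
    if words <= 0 or sentences <= 0:
        return np.nan
    return 4.71 * (characters / words) + 0.5 * (words / sentences) - 21.43


# 19 letters, 5 words, 2 sentences ("Hello world. How are you?")
score = ari_from_counts(19, 5, 2)
print(round(score, 3))  # -2.282
```

Very short, simple texts can score below zero, as here; the score is usually rounded up to a whole number when read as a grade level.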