Add Automated Readability Index (ARI). Closes #20 #46

Closed
196 changes: 133 additions & 63 deletions CONTRIBUTING.md
@@ -1,38 +1,151 @@
# Contributing
# CONTRIBUTING

Hello there!
Hello and welcome to Texthero!

Thank you for being here. Texthero is maintained by [jbesomi](https://github.com/jbesomi). He is glad to receive your help.
This document contains all the important information you need to get started contributing.

## Getting started

If you feel you want to help and do not know where to start, you may start with the `good first issue` [issues](https://github.com/jbesomi/texthero/issues).
## Vision

## Development workflow
If you are interested in Texthero's vision and core principles, have a look at [PURPOSE.md](./PURPOSE.md).

The next steps will guide you towards making contributions to this repository. You just have to follow them step by step. If anything is not clear or you have an idea on how to improve this document, feel free to edit it and open a pull request.

If you need a broader view of how contributions work on GitHub, please refer to the [GitHub Guides](https://guides.github.com/). To get started, also read [Creating a pull request from a fork](https://help.github.com/en/github/collaborating-with-issues-and-pull-requests/creating-a-pull-request-from-a-fork).
## Quality

If you are used to the Github workflow, you can find at the end of this document a summary of the most important parts.
Texthero's main goal is to make the NLP developer's life _easier_. It does so by:
1. Providing a simple yet complete tool for NLP and text analytics.
2. Empowering the NLP developer with great documentation, simple getting-started docs, and (work in progress) clear and concise tutorials (blog).

1. Fork the repository
Click the `fork` button in the GitHub repository; this will create a copy of Texthero in your Github account.
To achieve all of this, Texthero's code and documentation must be of high quality. Clean, readable, and **tested** code drastically reduces the likelihood of introducing bugs, and great documentation facilitates the work of many NLP developers as well as the work of Texthero's maintainers.

1. Clone the repository
To do that, you need to have [git](https://git-scm.com/) installed. Open the terminal and type

## Shift-left testing

Texthero follows an approach known as shift-left testing. According to [Wikipedia](https://en.wikipedia.org/wiki/Shift-left_testing):

> Shift-left testing is an approach to software testing and system testing in which testing is performed earlier in the lifecycle.

Shift-left testing reduces the number of bugs by attempting to solve the problem at its origin. Often, programming defects are not uncovered and fixed until after significant effort has been wasted on their implementation. Texthero attempts to avoid this kind of issue.


## Improve documentation!

A very important yet not particularly complex task is improving the documentation: many Texthero users will be deeply grateful for your effort.

For instance, as of now, the documentation of [texthero.representation.nmf](https://texthero.org/docs/api/texthero.representation.nmf) is very poor.

> Interested in improving this? It's pretty easy. Just copy-paste the docstring from texthero.representation.pca and replace 'pca' with 'nmf' :D


## How to create a successful Pull Request on Texthero

Pull requests must not break the code and must bring something valuable to the project, so only _high quality_ pull requests are approved.

The following list gives some advice on how to submit a successful pull request.

1. Submitting a successful PR is not hard. Have a look at the [previous PRs](https://github.com/jbesomi/texthero/pulls?q=is%3Apr+is%3Aclosed) already approved.
1. **Extensively test your code**. Think of all possible edge cases. Look at similar tests for ideas.
1. In most cases, there is an existing function or docstring very similar to your specific use case. Before writing your own code, look at what the other functions look like.
1. Before submitting, **test locally** that you pass all tests (see below under `testing`).
1. Respect the best practices (see below under `best practices`).
1. Make sure your code is black-formatted (`./format.sh`, see `formatting`).

<!--
1. Make use of the PR template (see `PR template` ) -->


## Ask questions!

We are here for you! If anything is unclear, just ask. We will do our best to answer you quickly.

## Propose new ideas!

Texthero is here for the NLP community. If you have an idea on how we can improve it, let us know by opening a new [issue](https://github.com/jbesomi/texthero/issues). We will be glad to hear from you!

## Best practices

1. Read and respect the [numpydoc docstring guide](https://numpydoc.readthedocs.io/en/latest/format.html). Look at the existing code for similar examples.
1. Give your branch a meaningful name. Avoid working directly on the master branch.
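
To make the first point concrete, here is a minimal sketch of a numpydoc-style docstring. The function `squeeze_spaces` is hypothetical and is not part of Texthero; it only illustrates the section layout (`Parameters`, `Returns`, `Examples`) described in the guide.

```python
import re


def squeeze_spaces(text: str) -> str:
    """
    Collapse every run of whitespace into a single space.

    Hypothetical helper, used here only to illustrate the docstring layout.

    Parameters
    ----------
    text : str
        Input string.

    Returns
    -------
    str
        The input with each whitespace run replaced by one space.

    Examples
    --------
    >>> squeeze_spaces("instagram   texthero")
    'instagram texthero'
    """
    return re.sub(r"\s+", " ", text)
```

If this were saved in a file, `python3 -m doctest <file>` would execute the `Examples` section as a doctest.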

## Good first issue

If this is your first time contributing to Texthero, you might start by choosing a `good first issue` [issues](https://github.com/jbesomi/texthero/issues).


## Testing

As you may have gathered, Texthero is serious about testing. We strongly encourage contributors to embrace [test-driven development (TDD)](https://en.wikipedia.org/wiki/Test-driven_development).

Tests are written with `unittest` from the Python standard library: [Unit testing framework](https://docs.python.org/3/library/unittest.html)

To execute all tests, run:
```
$ cd scripts
$ ./tests.sh
```

Calling `./test.sh` is equivalent to executing `python3 -m unittest discover -s tests -t .` from the _root_ of the repository.


**Important.** If you worked on a bug, you should add a test that checks the bug is no longer present. This is extremely useful, as it prevents the same bug from being reintroduced in the future.
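
As an illustration, a regression test could look like the sketch below. Both `count_words` and the bug it describes are hypothetical (they are not taken from Texthero), and a plain `unittest.TestCase` is assumed rather than the project's own `PandasTestCase`.

```python
import unittest


def count_words(text: str) -> int:
    """Hypothetical helper: count whitespace-separated words."""
    # A buggy earlier version used text.split(" "), which returned 1 for "".
    return len(text.split())


class TestCountWordsRegression(unittest.TestCase):
    def test_empty_string_has_zero_words(self):
        # Pins the fixed behaviour so the bug cannot silently come back.
        self.assertEqual(count_words(""), 0)

    def test_repeated_spaces_are_ignored(self):
        self.assertEqual(count_words("one  two"), 2)


def run_regression_tests() -> unittest.TestResult:
    """Load and run the regression tests defined above."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestCountWordsRegression)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

The test name states the bug it guards against, so a future failure points directly at the regression.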


### Passing doctests

Executing `./test.sh` also checks that the examples in the docstrings are correct (doctests).

Passing doctests might be a bit annoying sometimes. Let's look at this example for instance:

```
File "/home/travis/build/jbesomi/texthero/texthero/preprocessing.py", line 700, in texthero.preprocessing.remove_tags
Failed example:
hero.remove_tags(s)
Expected:
0 instagram texthero
dtype: object
Got:
0 instagram texthero
dtype: object
```

The doctest failed. Why? The reason is that somewhere in the `Example` section of the docstring, one or more white spaces ` ` are missing, so the expected output does not match the actual output character for character.
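
The effect can be reproduced with the standard `doctest` module alone. In this sketch, `spaced` and `run_example` are hypothetical names, and the `NORMALIZE_WHITESPACE` option is shown only to demonstrate that whitespace is the cause; whether Texthero's test scripts enable it is not stated in this document.

```python
import doctest


def spaced() -> str:
    """Hypothetical function whose output contains three spaces."""
    return "instagram   texthero"


# The expected output below uses a single space: similar to the eye, not equal.
EXAMPLE = """
>>> print(spaced())
instagram texthero
"""


def run_example(optionflags: int = 0):
    """Run EXAMPLE as a doctest and return its (failed, attempted) result."""
    test = doctest.DocTestParser().get_doctest(
        EXAMPLE, {"spaced": spaced}, "whitespace_example", None, 0
    )
    runner = doctest.DocTestRunner(verbose=False, optionflags=optionflags)
    # Discard the textual failure report; only the counts matter here.
    return runner.run(test, out=lambda text: None)


strict = run_example()  # exact comparison of expected and actual output
relaxed = run_example(doctest.NORMALIZE_WHITESPACE)  # whitespace runs compared as equal
print(strict.failed, relaxed.failed)
```

The strict run reports a failure while the relaxed run does not, which confirms that only the spacing differs.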

### Travis CI

When you submit your code, it will be tested on different operating systems using Travis CI: [TRAVIS CI texthero](https://travis-ci.com/github/jbesomi/texthero).

Make sure you pass all your tests locally before opening a pull request!

## Formatting

Before submitting, make sure your code is formatted. Code formatting is done with [black](https://github.com/psf/black).

```
$ git clone git@github.com:YOUR_USERNAME/texthero.git
cd scripts
./format.sh
```

Travis CI will check that the whole code is black-formatted. Make sure you format before submitting!

> It's handy to install the black formatter directly in your IDE.


## Development workflow

If you need a broader view of how contributions work on GitHub, please refer to the [GitHub Guides](https://guides.github.com/). To get started, you might find [Creating a pull request from a fork](https://help.github.com/en/github/collaborating-with-issues-and-pull-requests/creating-a-pull-request-from-a-fork) useful.

1. Fork the repository

1. Clone the repository

1. Connect your cloned repository to the _original_ repo

```
$ cd texthero
$ git remote add upstream git@github.com:jbesomi/texthero.git
```

> This first step needs to be done only once. If in the future you want to make new changes, make sure your repository is synchronized with the upstream: [Syncing a fork](https://help.github.com/en/github/collaborating-with-issues-and-pull-requests/syncing-a-fork).
> This first step needs to be done only once. However, whenever you want to make new changes in the future, make sure your repository is synchronized with the upstream: [Syncing a fork](https://help.github.com/en/github/collaborating-with-issues-and-pull-requests/syncing-a-fork).

1. Install texthero locally along with its dev-dependencies

@@ -44,7 +157,7 @@ $ pip install -e .

> The `-e` will install the python package in 'development' mode. That way your changes will take effect immediately without the need to reinstall the package again.

1. Install development dependencies
1. Install development dependencies (only required if you want to change the website doc)

Development dependencies need to be installed to update the website documentation, i.e. the content on texthero.org.

@@ -54,7 +167,6 @@ In most cases, you **do not need** to update this. Changes from pull requests wi
pip install -e '.[dev]'
```


1. Create a new working branch

You can name it as you wish. A good practice is to give the branch a meaningful name so others know what you are working on.
@@ -77,17 +189,6 @@ Before opening a new pull-request, you should make sure that all tests still pas

**Important.** If you worked on a bug, you should add a test that checks the bug is no longer present. This is extremely useful, as it prevents the same bug from being reintroduced in the future.

In this part, you need to execute:
- `./format.sh` that will format all code with `black`
- `./test.sh` that will test all unittests and doctests.

> In the scripts folder there is also a `check.sh` shell script. In addition to executing all tests, the `check.sh` script reformats all the repository code and [updates the documentation](#documentation) with the new changes. In most cases, you don't need to execute this one. To properly execute the check command, make sure you have installed all the required dependencies, in particular Sphinx.

```
cd scripts
./format.sh
./test.sh
```

1. Open a Pull Request (PR)

@@ -100,32 +201,14 @@ The time to submit the PR has come. Head to your forked repository on Github. Th
- `./format.sh`
- format all code with [black](https://github.com/psf/black)
- `./check.sh`
- format the code with black (`format.sh`)
- update the Sphinx documentation for the website
- Format the code with black (`format.sh`)
- Update the Sphinx documentation for the website
- Execute all tests with `unittest` (`check.sh`)
- **This is the only and main file that must be called.**

## Good to know

1. Passing doctests might be a bit annoying sometimes. Let's look at this example for instance:

```
File "/home/travis/build/jbesomi/texthero/texthero/preprocessing.py", line 700, in texthero.preprocessing.remove_tags
Failed example:
hero.remove_tags(s)
Expected:
0 instagram texthero
dtype: object
Got:
0 instagram texthero
dtype: object
```

The doctest failed but it's not particularly clear why, right? Here, the reason is that somewhere in the `Example` section of the docstring, one or more white spaces ` ` are missing.

## Conventions

### Documentation and website
## Documentation: docstring

Texthero docstrings follow the [NumPy/SciPy](https://numpydoc.readthedocs.io/en/latest/format.html) docstring style. For example:

@@ -154,24 +237,11 @@ def remove_digits(input: pd.Series, only_blocks=True) -> pd.Series:
...
```


### Git commits

- Strive for atomicity: 1 commit = 1 context.
- Write messages in the present tense `Add XYZ support`
- You can reference relevant issues using a hashtag plus the number of the issue. Example: `#1`


## Test-driven development

Texthero is serious about testing. We strongly encourage contributors to embrace [test-driven development (TDD)](https://en.wikipedia.org/wiki/Test-driven_development).

Tests are made with `unittest` from the python standard library: [Unit testing framework](https://docs.python.org/3/library/unittest.html)

To execute all tests, you can simply
```
$ cd scripts
$ ./tests.sh
```

Calling `./test.sh` is equivalent to executing `python3 -m unittest discover -s tests -t .` from the _root_ of the repository.
**Work in progress:** this document is a work in progress. If you spot a mistake or want to clarify something, open a pull request!
6 changes: 3 additions & 3 deletions README.md
@@ -59,7 +59,7 @@ Texthero is free, open-source and [well documented](https://texthero.org/docs) (

We hope you will find as much pleasure working with Texthero as we had during its development.

<h2 align="center">Hablas español? क्या आप भारतीय बोलते हैं? 日本語が話せるのか?</h2>
<h2 align="center">Hablas español? क्या आप हिंदी बोलते हैं? 日本語が話せるのか?</h2>

Texthero has been developed for the whole NLP community. We know how hard it is to deal with different NLP tools (NLTK, SpaCy, Gensim, TextBlob, Sklearn): that's why we developed Texthero, to simplify things.

@@ -312,14 +312,14 @@ If you have just other questions or inquiry drop me a line at jonathanbesomi__AT
- [Dan Keefe](https://github.com/Peritract)
- [Christian Claus](https://github.com/cclauss)
- [bobfang1992](https://github.com/bobfang1992)
- [Ishan Arora](https://github.com/ishanarora04)


<h2 align="center"><a href="./LICENSE">License</a></h2>

The MIT License (MIT)

Texthero is licensed under the following MIT license: The MIT License (MIT)
Copyright © 2020 Texthero, https://texthero.org
Copyright (c) 2020 Texthero

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
32 changes: 32 additions & 0 deletions tests/test_nlp.py
@@ -1,4 +1,5 @@
import pandas as pd
import numpy as np
from texthero import nlp

from . import PandasTestCase
@@ -36,3 +37,34 @@ def test_noun_chunks(self):
            [[("Today", "NP", 0, 5), ("such a beautiful day", "NP", 9, 29)]]
        )
        self.assertEqual(nlp.noun_chunks(s), s_true)

    """
    Count sentences.
    """

    def test_count_sentences(self):
        s = pd.Series("I think ... it counts correctly. Doesn't it? Great!")
        s_true = pd.Series(3)
        self.assertEqual(nlp.count_sentences(s), s_true)

    def test_count_sentences_numeric(self):
        s = pd.Series([13.0, 42.0])
        self.assertRaises(TypeError, nlp.count_sentences, s)

    def test_count_sentences_missing_value(self):
        s = pd.Series(["Test.", np.nan])
        self.assertRaises(TypeError, nlp.count_sentences, s)

    def test_count_sentences_index(self):
        s = pd.Series(["Test"], index=[5])
        counted_sentences_s = nlp.count_sentences(s)
        t_same_index = pd.Series([""], index=[5])

        self.assertTrue(counted_sentences_s.index.equals(t_same_index.index))

    def test_count_sentences_wrong_index(self):
        s = pd.Series(["Test", "Test"], index=[5, 6])
        counted_sentences_s = nlp.count_sentences(s)
        t_different_index = pd.Series(["", ""], index=[5, 7])

        self.assertFalse(counted_sentences_s.index.equals(t_different_index.index))
38 changes: 38 additions & 0 deletions tests/test_preprocessing.py
@@ -259,3 +259,41 @@ def test_tokenize_with_phrases(self):
        self.assertEqual(
            preprocessing.tokenize_with_phrases(s, min_count=3, threshold=1), s_true
        )

    """
    Test replace and remove tags
    """

    def test_replace_tags(self):
        s = pd.Series("Hi @tag, we will replace you")
        s_true = pd.Series("Hi TAG, we will replace you")

        self.assertEqual(preprocessing.replace_tags(s, symbol="TAG"), s_true)

    def test_remove_tags_alphabets(self):
        s = pd.Series("Hi @tag, we will remove you")
        s_true = pd.Series("Hi , we will remove you")

        self.assertEqual(preprocessing.remove_tags(s), s_true)

    def test_remove_tags_numeric(self):
        s = pd.Series("Hi @123, we will remove you")
        s_true = pd.Series("Hi , we will remove you")

        self.assertEqual(preprocessing.remove_tags(s), s_true)

    """
    Test replace and remove hashtags
    """

    def test_replace_hashtags(self):
        s = pd.Series("Hi #hashtag, we will replace you")
        s_true = pd.Series("Hi HASHTAG, we will replace you")

        self.assertEqual(preprocessing.replace_hashtags(s, symbol="HASHTAG"), s_true)

    def test_remove_hashtags(self):
        s = pd.Series("Hi #hashtag_trending123, we will remove you")
        s_true = pd.Series("Hi , we will remove you")

        self.assertEqual(preprocessing.remove_hashtags(s), s_true)
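
The expected strings in the tag and hashtag tests above are consistent with a plain regular-expression substitution. The sketch below is only an illustration of that idea on ordinary strings: the two patterns and the `*_sketch` function names are assumptions, not Texthero's actual implementation, which operates on `pd.Series`.

```python
import re

# Assumed patterns; Texthero's real regular expressions may differ.
TAG_PATTERN = re.compile(r"@[A-Za-z0-9]+")
HASHTAG_PATTERN = re.compile(r"#[A-Za-z0-9_]+")


def replace_tags_sketch(text: str, symbol: str) -> str:
    """Replace every @tag in a plain string with `symbol`."""
    # A callable replacement keeps backslashes in `symbol` literal.
    return TAG_PATTERN.sub(lambda _match: symbol, text)


def replace_hashtags_sketch(text: str, symbol: str) -> str:
    """Replace every #hashtag in a plain string with `symbol`."""
    return HASHTAG_PATTERN.sub(lambda _match: symbol, text)


print(replace_tags_sketch("Hi @tag, we will replace you", "TAG"))
print(replace_hashtags_sketch("Hi #hashtag_trending123, we will remove you", ""))
```

Removing a tag is the same operation with an empty `symbol`, which is why the "remove" tests expect `"Hi , we will remove you"` with the comma left in place.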