Texthero is a python package to let you work efficiently and quickly with text data.

## Overview

Given a dataset with structured data, it's easy to have a quick understanding of the underlying data. Conversely, given a dataset composed only of text, it's harder to get a quick understanding of the data. Texthero helps you there, providing utility functions to quickly **clean the text data**, **tokenize it**, **map it into a vector space** and gather from it **primary insights**.

##### Pandas integration

One of the main pillars of texthero is that it is designed from the ground up to work with **Pandas DataFrames** and **Series**.

Most of texthero's methods simply apply a transformation to a Pandas Series. As a rule of thumb, the first argument and the output of almost all texthero methods are either a Pandas Series or a Pandas DataFrame.


##### Pipeline
The five different areas are _athletics_, _cricket_, _football_, _rugby_ and _tennis_.

The original dataset comes as a zip file with five different folders containing the articles as text data for each topic.

For convenience, we created this script that simply reads all text data and stores it into a Pandas DataFrame.

Import texthero and pandas.
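A minimal sketch of the imports used throughout this guide:

```python
import texthero as hero
import pandas as pd
```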

Recently, Pandas has introduced the pipe function. You can achieve the same results with:
df['clean_text'] = df['text'].pipe(hero.clean)
```

> Tip: when we need to define a new column returned from a function, we prepend the name of the function to the column name. Example: `df['tsne_col'] = df['col'].pipe(hero.tsne)`. This keeps the code simple to read and allows us to construct complex pipelines.

The default pipeline for the `clean` method is the following:
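As a reference, at the time of writing the default pipeline consists of the following preprocessing functions (check the [api preprocessing](/docs/api-preprocessing) page for the authoritative list):

```python
from texthero import preprocessing

# Default steps applied by hero.clean, in order (as of this writing):
default_pipeline = [
    preprocessing.fillna,              # replace missing values with empty strings
    preprocessing.lowercase,
    preprocessing.remove_digits,
    preprocessing.remove_punctuation,
    preprocessing.remove_diacritics,
    preprocessing.remove_stopwords,
    preprocessing.remove_whitespace,
]
```

You can also pass your own custom pipeline, a list of such functions, to `clean`. For instance, a hypothetical lighter pipeline:

```python
custom_pipeline = [preprocessing.fillna,
                   preprocessing.lowercase,
                   preprocessing.remove_whitespace]

df['clean_text'] = hero.clean(df['text'], custom_pipeline)
```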

or alternatively
df['clean_text'] = df['clean_text'].pipe(hero.clean, custom_pipeline)
```

##### Tokenize

Next, we usually want to tokenize the text (_tokenizing_ means splitting sentences/documents into separate words, the _tokens_). Of course, texthero provides an easy function for that!

```python
df['tokenized_text'] = hero.tokenize(df['clean_text'])
```


##### Preprocessing API

The complete preprocessing API can be found here: [api preprocessing](/docs/api-preprocessing).


### Representation

Once the data is cleaned and tokenized, the next natural step is to map each document to a vector so we can compare documents with mathematical methods to derive insights.

##### TFIDF representation

TFIDF is a formula to calculate the _relative importance_ of the words in a document, taking
into account the words' occurrences in other documents.
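As a toy sketch of the idea (one common TF-IDF variant; texthero's exact weighting may differ):

```python
import math

docs = [["cat", "sat"], ["cat", "ran"]]
vocabulary = {term for doc in docs for term in doc}

# idf: a term that occurs in every document (like "cat") gets weight log(2/2) = 0,
# while a term unique to one document (like "sat") gets a positive weight.
idf = {term: math.log(len(docs) / sum(term in doc for doc in docs))
       for term in vocabulary}
```

In practice, texthero computes these weights for us: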

```python
df = pd.concat([df, hero.tfidf(df['tokenized_text'])], axis=1)
```

Now we have calculated a vector for each document that tells us which words are characteristic for it.
Usually, documents about similar topics use similar terms, so their tfidf-vectors will be similar too.
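For instance, one could check this similarity with a cosine measure between two document vectors (a plain numpy sketch, not a texthero API):

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: 1.0 means identical direction.
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```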

###### Usage of concat

You have probably noticed something odd here: we didn't use the assignment operator to insert the newly created DataFrame. The reason is that for each word in every document we create a new DataFrame column, which makes a column-by-column insertion very expensive; we therefore recommend using `pd.concat` instead. Currently, only the functions `count`, `term_frequency` and `tfidf` return that kind of DocumentTermDF. To read more about the different Pandas types introduced in this library, have a look at the typing tutorial.

##### Normalisation of the data

It is very important to normalize your data before you start to analyse it. Normalisation helps to minimise the variance of your dataset, which is necessary to analyse your data further in a meaningful way, as outliers and different ranges of numbers are then "handled". This is just a generalisation, as every clustering and dimensionality reduction algorithm works differently.
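As a sketch of this step, assuming the `tfidf` columns created by the concat above (`hero.normalize` is the same normalisation step used in the pipeline below; the exact input and output types may vary):

```python
# Hypothetical usage: scale each document's tfidf vector to unit norm.
normalized_tfidf = hero.normalize(df['tfidf'])
```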

##### Dimensionality reduction with PCA

We now want to visualize the data. However, the tfidf-vectors are very high-dimensional (e.g. every document might have a tfidf-vector of length 100). Visualizing 100 dimensions is hard!

Thus, we perform dimensionality reduction (generating vectors with fewer entries from vectors with
many entries). For that, we can use PCA. PCA generates new vectors from the tfidf representation
that showcase the differences among the documents most strongly in fewer dimensions, often 2 or 3.

```python
df['pca'] = hero.pca(df['tfidf'])
```

##### All in one step

We can achieve all the steps shown above, _cleaning_, _tokenizing_, _tf-idf representation_, _normalisation_ and _dimensionality reduction_, in a single step. Isn't that fabulous?

```python
df['pca'] = (
    df['text']
    .pipe(hero.clean)
    .pipe(hero.tokenize)
    .pipe(hero.tfidf)
    .pipe(hero.normalize)
    .pipe(hero.pca)
)
```

##### Representation API

The complete representation module API can be found here: [api representation](/docs/api-representation).

### Visualization
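For instance, we can plot the documents in the PCA space, colored by topic (this is the same call shown again in the summary at the end of this guide):

```python
hero.scatterplot(df, col='pca', color='topic', title="PCA BBC Sport news")
```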

Also, we can "visualize" the most common words for each `topic` with `top_words`.

```python
NUM_TOP_WORDS = 5
df.groupby('topic')['clean_text'].apply(lambda x: hero.top_words(x, normalize=True)[:NUM_TOP_WORDS])
```

```
topic
athletics  said       0.010330
           world      0.009132
           year       0.009075
           olympic    0.007819
           race       0.006392
cricket    test       0.008492
           england    0.008235
           first      0.008016
           cricket    0.007906
           one        0.007760
football   said       0.009709
           chelsea    0.006234
           game       0.006071
           would      0.005866
           club       0.005601
rugby      england    0.012833
           said       0.008512
           wales      0.008025
           ireland    0.007440
           rugby      0.007245
tennis     said       0.013993
           open       0.010575
           first      0.009608
           set        0.009028
           year       0.008447
Name: clean_text, dtype: float64
```


##### Visualization API

The complete visualization module API can be found here: [api visualization](/docs/api-visualization).

## Quick look into hero typing

Texthero introduces some different Pandas Series types for its different categories of functions:
1. __TextSeries__: Every cell is a text, i.e. a string. For example,
`pd.Series(["test", "test"])` is a valid TextSeries. These series are the input and output type of preprocessing functions like `clean`.

2. __TokenSeries__: Every cell is a list of words/tokens, i.e. a list
of strings. For example, `pd.Series([["test"], ["token2", "token3"]])` is a valid TokenSeries. NLP functions like `tfidf` require a TokenSeries as input. The function `tokenize` generates a TokenSeries.

3. __VectorSeries__: Every cell is a vector representing text, i.e.
a list of floats. For example, `pd.Series([[1.0, 2.0], [3.0]])` is a valid VectorSeries. Most dimensionality reduction functions, like `pca`, take a VectorSeries as input and also return a VectorSeries.

4. __DocumentTermDF__: A DataFrame where the rows are the documents and the columns are the words/terms in all the documents. The columns are multiindexed, with level one being the representation name (e.g. "tfidf") and level two being the individual features. For example,
`pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=pd.MultiIndex.from_tuples([("count", "hi"), ("count", "servus"), ("count", "hola")]))`
is a valid DocumentTermDF.
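To make the DocumentTermDF structure concrete, here is that example written out (just a sketch showing the MultiIndex; the values are arbitrary counts):

```python
import pandas as pd

# Two documents, term counts for three terms, all grouped under "count".
df_repr = pd.DataFrame(
    [[1, 2, 3], [4, 5, 6]],
    columns=pd.MultiIndex.from_tuples(
        [("count", "hi"), ("count", "servus"), ("count", "hola")]
    ),
)

# Selecting level one returns the plain document-term matrix.
print(df_repr["count"])
```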

To get more detailed insights into this topic, you can have a look at the typing tutorial. But in general, if you use texthero with the common pipeline:
- cleaning the Series with functions from the preprocessing module
- tokenising the Series and then performing NLP functions
- calculating some clustering
- reducing the dimensions to display the data

you won't need to worry much about it, as the functions are built in such a way that the corresponding input and output types match.

## Summary

We saw how in just a couple of lines of code we can represent and visualize any text dataset. We went from knowing nothing about the dataset to seeing that there are 5 (quite) distinct areas, each representing a topic. We went _from zero to hero_.

```python
import texthero as hero
import pandas as pd

df = pd.read_csv(
    "bbcsport.csv"  # placeholder path: point this at the BBC Sport dataset CSV
)
df['pca'] = (
    df['text']
    .pipe(hero.clean)
    .pipe(hero.tokenize)
    .pipe(hero.tfidf)
    .pipe(hero.normalize)
    .pipe(hero.pca)
)

hero.scatterplot(df, col='pca', color='topic', title="PCA BBC Sport news")
```

![](/img/scatterplot_bccsport.svg)


##### Next section

By now, you should have understood the main building blocks of texthero.

In the next sections, we will review each module, see how we can tune the default settings and we will show other applications where Texthero might come in handy.