Add Visualisation Tutorial #175

Open
wants to merge 5 commits into
base: master
2 changes: 1 addition & 1 deletion .travis.yml
@@ -20,7 +20,7 @@ jobs:
env: PATH=/c/Python38:/c/Python38/Scripts:$PATH
install:
- pip3 install --upgrade pip # all three OSes agree about 'pip3'
- - pip3 install black
+ - pip3 install black==19.10b0
- pip3 install ".[dev]" .
# 'python' points to Python 2.7 on macOS but points to Python 3.8 on Linux and Windows
# 'python3' is a 'command not found' error on Windows but 'py' works on Windows only
2 changes: 1 addition & 1 deletion setup.cfg
@@ -41,7 +41,7 @@ install_requires =
# TODO pick the correct version.
[options.extras_require]
dev =
-black>=19.10b0
+black==19.10b0
pytest>=4.0.0
Sphinx>=3.0.3
sphinx-markdown-builder>=0.5.4
Binary file added website/docs/assets/wordcloud.png
60 changes: 60 additions & 0 deletions website/docs/getting-started-visulization.md
@@ -0,0 +1,60 @@
---
id: getting-started-visualization
title: Getting started - Visualization
---

# Visualization

Before you start processing your dataset, you might want to visualize it first to get the gist of the data and to choose which NLP toolkit and models are most suitable. This tutorial introduces two methods that quickly show you the most frequent words in your dataset.

## 1. Top words


The easiest way to find the most important words is to look at their absolute occurrence in the dataset, that is, how often each word/token occurs. Before Texthero, this seemingly easy task took a fair amount of code: you first had to write your own tokenizer, build a document-term matrix (for example with scikit-learn's `CountVectorizer`), sum over one axis, and finally sort the result. Texthero simplifies this process.

```python
>>> # save the dataset in df
>>> import texthero as hero
>>> import pandas as pd
>>> df = pd.read_csv("https://github.com/jbesomi/texthero/raw/master/dataset/bbcsport.csv")
>>> # now we will extract all top words
>>> top_words = hero.top_words(df["text"])
>>> top_words.head()
the 12790
to 7051
a 5516
in 5271
and 5259
Name: text, dtype: int64
```
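
For comparison, the manual route described above can be sketched in plain Python. This is only an illustration: it swaps the `CountVectorizer` step for `collections.Counter` to stay dependency-free, and both the `manual_top_words` helper and its naive regex tokenizer are hypothetical, so its counts will not necessarily match those of `hero.top_words`.

```python
import re
from collections import Counter


def manual_top_words(texts):
    """Count how often each token occurs across all texts, most frequent first."""
    counts = Counter()
    for text in texts:
        # Naive tokenizer (an assumption): lowercase, then keep runs of
        # letters, digits and apostrophes.
        counts.update(re.findall(r"[a-z0-9']+", text.lower()))
    # most_common() returns (token, count) pairs sorted by descending count.
    return counts.most_common()


# Two made-up example sentences, not taken from the BBC Sport dataset.
docs = ["The game was won by England.", "England lost the first game."]
top = manual_top_words(docs)
print(top[:3])
```

Even this small version needs a tokenizer, a counting step and a sort, which is the work that a single `hero.top_words` call takes over.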

However, the most common words give us less information than we had hoped for. The reason is that the English language contains not only informative words but also stopwords, whose purpose is to connect the important words into grammatically complete sentences. To extract the most relevant parts of the texts, we first clean them with Texthero's `clean` function and then look at the top words again.

```python
>>> df["clean"] = hero.clean(df["text"])
>>> top_words = hero.top_words(df["clean"])
>>> top_words.head()
said 1338
first 790
england 749
game 681
one 671
Name: clean, dtype: int64
```

Now we can see that, for example, "england" and "game" are among the most frequent words in the texts. That is quite useful for further analysis and can be done in just two lines of code.
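
To make the effect of stopword removal concrete, here is a minimal sketch of the idea on two made-up sentences. The `simple_clean` helper and the tiny `STOPWORDS` set are hypothetical and much cruder than what `hero.clean` does; they only show why the top words change once stopwords are dropped.

```python
import string
from collections import Counter

# Hypothetical, deliberately tiny stopword list; real lists are much longer.
STOPWORDS = {"the", "to", "a", "in", "and", "of", "was", "by"}


def simple_clean(text):
    """Lowercase the text, strip punctuation, and drop stopwords."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(word for word in text.split() if word not in STOPWORDS)


# Two made-up example sentences, not taken from the BBC Sport dataset.
docs = ["The game was won by England.", "England lost the first game."]
cleaned = [simple_clean(doc) for doc in docs]
counts = Counter(word for doc in cleaned for word in doc.split())

print(cleaned)
print(counts.most_common(2))
```

After cleaning, "the" no longer appears in the counts, and content words such as "game" and "england" move to the top.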

## 2. Wordcloud

The output above is still quite technical and not very graphical. This can be improved by generating a word cloud. Texthero has a built-in function that calls the word_cloud package API to generate the picture. A word cloud consists of the top words in our dataset arranged in a cloud, where the more frequent words are drawn bigger than the less frequent ones. When you execute the following lines in a Jupyter notebook, they show a word cloud of the most common words:

```python
>>> import texthero as hero
>>> import pandas as pd
>>> df = pd.read_csv("https://github.com/jbesomi/texthero/raw/master/dataset/bbcsport.csv")
>>> df["clean"] = hero.clean(df["text"])
>>> hero.wordcloud(df["clean"])
```

![Word cloud of the most frequent words in the cleaned dataset](/img/wordcloud.png)

Here we can easily recognise the frequent words from before, as they are drawn larger than the others.
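
The idea that more frequent words are drawn larger can be illustrated with a small sketch that maps counts to font sizes by linear interpolation. This is an assumption made purely for illustration: the `font_sizes` helper and its size limits are hypothetical, and the word_cloud package uses its own layout and scaling logic.

```python
def font_sizes(frequencies, min_size=10, max_size=60):
    """Map word counts to font sizes by linear interpolation."""
    lowest = min(frequencies.values())
    highest = max(frequencies.values())
    span = (highest - lowest) or 1  # avoid division by zero if all counts are equal
    return {
        word: min_size + (count - lowest) * (max_size - min_size) / span
        for word, count in frequencies.items()
    }


# Counts taken from the cleaned top words shown earlier in this tutorial.
sizes = font_sizes(
    {"said": 1338, "first": 790, "england": 749, "game": 681, "one": 671}
)
for word, size in sizes.items():
    print(f"{word}: {size:.1f}")
```

With this mapping, the most frequent word gets the maximum size and the least frequent one the minimum, which mirrors what the picture shows.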
3 changes: 2 additions & 1 deletion website/sidebars.json
@@ -1,7 +1,8 @@
{
"docs": {
"Getting Started": [
-"getting-started"
+"getting-started",
+"getting-started-visualization"
]
},
"api": {