
Commit

Added Documentation build
x-tabdeveloping committed Nov 15, 2023
1 parent 99985ad commit 422e663
Showing 55 changed files with 5,355 additions and 0 deletions.
Binary file added docs/_build/doctrees/dashboards.doctree
Binary file added docs/_build/doctrees/environment.pickle
Binary file added docs/_build/doctrees/index.doctree
Binary file added docs/_build/doctrees/projection_clustering.doctree
Binary file added docs/_build/doctrees/semantic_networks.doctree
4 changes: 4 additions & 0 deletions docs/_build/html/.buildinfo
@@ -0,0 +1,4 @@
# Sphinx build info version 1
# This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done.
config: fdd474e5b6f510a34b712d030d4164e2
tags: 645f666f9bcd5a90fca523b33c5a78b7
Binary file added docs/_build/html/_images/clustering_overview.png
Binary file added docs/_build/html/_images/network_screenshot.png
103 changes: 103 additions & 0 deletions docs/_build/html/_sources/dashboards.rst.txt
@@ -0,0 +1,103 @@
.. _dashboards:

Dashboards
==========

If you have multiple embedding models, or want to explore the same embeddings with different tools, while still keeping everything in a single web application,
dashboards are here to help.

Dashboards are made up of a list of cards. Each card represents a page in the application.
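
Before diving in, here is a minimal sketch of the overall shape of the API (the variable names are placeholders; the walkthrough below builds each card for real):

.. code-block:: python

    from embedding_explorer import show_dashboard
    from embedding_explorer.cards import NetworkCard, ClusteringCard

    # Placeholder variables: each card wraps a page title plus a model's data.
    cards = [
        NetworkCard("Words", corpus=vocabulary, embeddings=word_embeddings),
        ClusteringCard("Documents", embeddings=document_embeddings),
    ]
    show_dashboard(cards)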

Let's say for example that you want to examine the semantic relations in a corpus from multiple perspectives.
You could have a word-level GloVe model, and a semantic network app to go along with it.
You also want a document-level clustering/projection app, but you can't decide whether to use tf-idf representations or paragraph embeddings (Doc2Vec).

You can include all of these in a dashboard as cards.
Let's build all of this from scratch.

We will need gensim and glovpy, so let's install those:

.. code-block::

    pip install glovpy gensim

First we load 20Newsgroups:

.. code-block:: python

    from sklearn.datasets import fetch_20newsgroups

    # Loading the dataset
    newsgroups = fetch_20newsgroups(
        remove=("headers", "footers", "quotes"),
    )
    corpus = newsgroups.data

Let's import all card types and initialize our cards to be an empty list:

.. code-block:: python

    from embedding_explorer.cards import NetworkCard, ClusteringCard

    cards = []

Then let's train a word embedding model and add it as a card to the dashboard.

.. code-block:: python

    from glovpy import GloVe
    from gensim.utils import tokenize

    # Tokenizing the dataset
    tokenized_corpus = [
        list(tokenize(text, lower=True, deacc=True)) for text in corpus
    ]
    # Training word embeddings
    model = GloVe(vector_size=25)
    model.train(tokenized_corpus)
    # Adding a Semantic Network card to the dashboard
    vocabulary = model.wv.index_to_key
    embeddings = model.wv.vectors
    cards.append(
        NetworkCard("GloVe Semantic Networks", corpus=vocabulary, embeddings=embeddings)
    )

Next, let's extract tf-idf representations of the documents and add a clustering card to the list.

.. code-block:: python

    from sklearn.feature_extraction.text import TfidfVectorizer

    # We are going to filter out stop words and all terms that occur in less than 10 documents.
    embeddings = TfidfVectorizer(stop_words="english", min_df=10).fit_transform(corpus)
    cards.append(ClusteringCard("tf-idf Clustering and Projection", embeddings=embeddings))

And for the last one we are going to train Doc2Vec representations.

.. code-block:: python

    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    tagged_corpus = [
        TaggedDocument(tokens, [i]) for i, tokens in enumerate(tokenized_corpus)
    ]
    model = Doc2Vec(tagged_corpus)
    embeddings = model.dv.vectors
    # Pass the embeddings along, otherwise the card has nothing to display.
    cards.append(ClusteringCard("Doc2Vec Clustering and Projection", embeddings=embeddings))

Then let's start the dashboard.

.. code-block:: python

    from embedding_explorer import show_dashboard

    show_dashboard(cards)

.. image:: _static/screenshot_dashboard.png
    :width: 800
    :alt: Dashboard.

API Reference
^^^^^^^^^^^^^

.. autofunction:: embedding_explorer.show_dashboard

.. autoclass:: embedding_explorer.cards.NetworkCard

.. autoclass:: embedding_explorer.cards.ClusteringCard
87 changes: 87 additions & 0 deletions docs/_build/html/_sources/index.rst.txt
@@ -0,0 +1,87 @@
.. embedding-explorer documentation master file, created by
   sphinx-quickstart on Wed Nov 15 11:24:20 2023.
   You can adapt this file completely to your liking, but it should at least
   contain the root `toctree` directive.

Getting Started
==============================================

embedding-explorer is a set of tools for interactive exploration of embedding models.
This website contains a user guide and API reference.

Installation
^^^^^^^^^^^^

You can install embedding-explorer from PyPI.

.. code-block::

    pip install embedding-explorer

Usage
^^^^^
As an example let us train a word embedding model on a corpus and then investigate the semantic relations in this model using semantic networks.
We are going to train a GloVe model on the openly available 20Newsgroups dataset.

For this we will also need glovpy, so let's install that.
Glovpy essentially has the same API as gensim's word embedding models, so this example easily extends to gensim models (a short sketch follows the training code below).

.. code-block::

    pip install glovpy

Then we train an embedding model.
We do this by first loading the corpus, then tokenizing each text, then passing it to our embedding model.

.. code-block:: python

    from gensim.utils import tokenize
    from glovpy import GloVe
    from sklearn.datasets import fetch_20newsgroups

    # Loading the dataset
    newsgroups = fetch_20newsgroups(
        remove=("headers", "footers", "quotes"),
    ).data
    # Tokenizing the dataset
    tokenized_corpus = [
        list(tokenize(text, lower=True, deacc=True)) for text in newsgroups
    ]
    # Training word embeddings
    model = GloVe(vector_size=25)
    model.train(tokenized_corpus)

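Since glovpy mirrors gensim's word embedding API, the same tokenized corpus also works with gensim's own models. As an illustrative alternative (assuming gensim 4.x; not required for this tutorial), you could train a word2vec model instead, whose ``wv`` attribute exposes the same vocabulary and vectors used below:

.. code-block:: python

    from gensim.models import Word2Vec

    # Illustrative gensim equivalent of the GloVe training above.
    w2v = Word2Vec(tokenized_corpus, vector_size=25)
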
Now that we have trained a word embedding model,
we can start the semantic network explorer from embedding-explorer and interactively examine semantic relations in the corpus.

.. code-block:: python

    from embedding_explorer import show_network_explorer

    vocabulary = model.wv.index_to_key
    embeddings = model.wv.vectors
    show_network_explorer(vocabulary, embeddings=embeddings)

You will then be presented with a web application, in which you can query word association networks in the embedding model:

.. image:: _static/network_screenshot.png
    :width: 800
    :alt: Screenshot of Semantic Network.
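
The networks you query are built from similarities between word vectors. As a rough offline analogue of such a query (an illustration of the idea, not embedding-explorer's exact algorithm), you can list a word's nearest neighbours by cosine similarity:

.. code-block:: python

    import numpy as np

    def nearest_neighbors(word, vocabulary, embeddings, k=5):
        # Illustrative helper, not part of embedding-explorer's API.
        normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
        similarities = normed @ normed[vocabulary.index(word)]
        # Sort descending; skip rank 0, which is the query word itself.
        return [vocabulary[i] for i in np.argsort(-similarities)[1 : k + 1]]

    print(nearest_neighbors("space", vocabulary, embeddings))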

.. toctree::
    :maxdepth: 2
    :caption: Contents:

    semantic_networks
    projection_clustering
    dashboards



Indices and tables
==================

* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`
112 changes: 112 additions & 0 deletions docs/_build/html/_sources/projection_clustering.rst.txt
@@ -0,0 +1,112 @@
.. _projection_clustering:

Projection and Clustering
=========================

embedding-explorer also comes with a built-in tool for projecting whole embedding spaces into two dimensions and investigating the natural clusters that arise in the data.
Since different projection or clustering techniques might produce different results, embedding-explorer lets you dynamically interact with all parameters and stages of the process.

The following steps are taken when you display an embedding space in the app:

* Embedding the corpus with an embedding model.
* Optional dimensionality reduction.
* Optional clustering of embeddings.
* Projection into 2D space.

.. image:: _static/clustering_overview.png
    :width: 800
    :alt: Schematic Overview of the Clustering Process.
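
To make these stages concrete, here is a minimal offline sketch of one possible pipeline in scikit-learn, given an ``embeddings`` array (the estimator choices are illustrative assumptions; in the app you select models and parameters interactively):

.. code-block:: python

    from sklearn.cluster import KMeans
    from sklearn.decomposition import TruncatedSVD
    from sklearn.manifold import TSNE

    # Optional dimensionality reduction.
    reduced = TruncatedSVD(n_components=50).fit_transform(embeddings)
    # Optional clustering of the reduced embeddings.
    labels = KMeans(n_clusters=20).fit_predict(reduced)
    # Projection into 2D space for plotting.
    coordinates = TSNE(n_components=2).fit_transform(reduced)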

In this tutorial I will demonstrate how to investigate the cluster structure of an openly available corpus using tf-idf embeddings.

First let's load a corpus. In this example I will be using 20Newsgroups.

.. code-block:: python

    from sklearn.datasets import fetch_20newsgroups

    # Loading the dataset
    newsgroups = fetch_20newsgroups(
        remove=("headers", "footers", "quotes"),
    )
    corpus = newsgroups.data

Then we are going to embed this corpus using tf-idf weighted bag-of-words representations.

.. code-block:: python

    from sklearn.feature_extraction.text import TfidfVectorizer

    # We are going to filter out stop words and all terms that occur in less than 10 documents.
    embeddings = TfidfVectorizer(stop_words="english", min_df=10).fit_transform(corpus)

We can then interactively project and cluster these embeddings by starting the web application.

.. code-block:: python

    from embedding_explorer import show_clustering

    show_clustering(embeddings=embeddings)

You will be presented with a page where you can manually select the parameters and models involved in the clustering and projection process.

.. image:: _static/clustering_params_screenshot.png
    :width: 800
    :alt: Clustering and Projection Parameters.

Here are the results:

.. image:: _static/clustering_screenshot.png
    :width: 800
    :alt: Screenshot of clustering 20 newsgroups.

Metadata
^^^^^^^^

Of course, these results are not very useful in the absence of metadata about the individual data points.
We can fix this by passing along metadata about the corpus.

Unfortunately we do not have much metadata on 20Newsgroups by default, but we can gather a couple of useful pieces of information about the corpus into a dataframe.
For example, it would be nice to visualize the length of the texts, and to know which newsgroup they belong to.
We can also add the first section of each text to the metadata, so that we can see its actual content when we hover over a data point.

.. code-block:: python

    import numpy as np
    import pandas as pd

    # Extracting text lengths in number of characters.
    lengths = [len(text) for text in corpus]
    # Extracting first 400 characters from each text.
    text_starts = [text[:400] for text in corpus]
    # Extracting the group each text belongs to.
    # Sklearn gives the labels back as integers; we have to map them back to
    # the actual textual labels.
    group_labels = np.array(newsgroups.target_names)[newsgroups.target]
    # We build a dataframe with the available metadata.
    metadata = pd.DataFrame(dict(length=lengths, text=text_starts, group=group_labels))

We can then pass this metadata along to the app.
We can also select what information should be shown when a data point is hovered over.

.. code-block:: python

    show_clustering(
        embeddings=embeddings,
        metadata=metadata,
        hover_name="group",
        hover_data=["text", "length"],
    )

.. image:: _static/clustering_hover_screenshot.png
    :width: 800
    :alt: Screenshot of hovering over a data point in clustering.

In the app you can also select how data points are labelled, sized and colored.

.. image:: _static/clustering_length_size.png
    :width: 800
    :alt: Screenshot of data points sized by text length.

API Reference
^^^^^^^^^^^^^

.. autofunction:: embedding_explorer.show_clustering
