Commit 422e663 (1 parent: 99985ad)
Showing 55 changed files with 5,355 additions and 0 deletions.
5 binary files not shown.
@@ -0,0 +1,4 @@
# Sphinx build info version 1
# This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done.
config: fdd474e5b6f510a34b712d030d4164e2
tags: 645f666f9bcd5a90fca523b33c5a78b7
@@ -0,0 +1,103 @@
.. _dashboards:

Dashboards
==========

If you have multiple embedding models, or you want to explore the same embeddings with different tools, and you still want to have them all in the same web application, dashboards are here to help.

Dashboards are made up of a list of cards. Each card represents a page in the application.
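
For instance, a dashboard with two pages could be assembled like this. The data here is random toy data, purely to show the card-list structure; the rest of this page builds a real example from scratch.

.. code-block:: python

   import numpy as np

   from embedding_explorer import show_dashboard
   from embedding_explorer.cards import ClusteringCard, NetworkCard

   # Toy vocabulary and embeddings, just to illustrate the structure.
   vocabulary = [f"word_{i}" for i in range(100)]
   word_embeddings = np.random.rand(100, 16)
   document_embeddings = np.random.rand(500, 16)

   cards = [
       NetworkCard("Words", corpus=vocabulary, embeddings=word_embeddings),
       ClusteringCard("Documents", embeddings=document_embeddings),
   ]
   show_dashboard(cards)  # each card becomes its own page in the app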

Let's say, for example, that you want to examine the semantic relations in a corpus from multiple perspectives.
You could have a word-level GloVe model and a semantic network app to go along with it.
You also want a document-level clustering/projection app, but you can't decide whether you should use tf-idf representations or paragraph embeddings (Doc2Vec).

You can include all of these in a dashboard as cards.
Let's build all of this from scratch.

We will need gensim and glovpy, so let's install those:

.. code-block::

   pip install glovpy gensim

First we load 20Newsgroups:

.. code-block:: python

   from sklearn.datasets import fetch_20newsgroups

   # Loading the dataset
   newsgroups = fetch_20newsgroups(
       remove=("headers", "footers", "quotes"),
   )
   corpus = newsgroups.data

Let's import all card types and initialize our cards to be an empty list:

.. code-block:: python

   from embedding_explorer.cards import NetworkCard, ClusteringCard

   cards = []

Then let's train a word embedding model and add it as a card to the dashboard.

.. code-block:: python

   from glovpy import GloVe
   from gensim.utils import tokenize

   # Tokenizing the dataset
   tokenized_corpus = [
       list(tokenize(text, lower=True, deacc=True)) for text in corpus
   ]
   # Training word embeddings
   model = GloVe(vector_size=25)
   model.train(tokenized_corpus)
   # Adding a Semantic Network card to the dashboard
   vocabulary = model.wv.index_to_key
   embeddings = model.wv.vectors
   cards.append(
       NetworkCard("GloVe Semantic Networks", corpus=vocabulary, embeddings=embeddings)
   )

Next let's extract tf-idf representations of the documents and add a clustering card to our cards.

.. code-block:: python

   from sklearn.feature_extraction.text import TfidfVectorizer

   # We are going to filter out stop words and all terms that occur in fewer than 10 documents.
   embeddings = TfidfVectorizer(stop_words="english", min_df=10).fit_transform(corpus)
   cards.append(ClusteringCard("tf-idf Clustering and Projection", embeddings=embeddings))

And for the last one we are going to train Doc2Vec representations.

.. code-block:: python

   from gensim.models.doc2vec import Doc2Vec, TaggedDocument

   tagged_corpus = [
       TaggedDocument(tokens, [i]) for i, tokens in enumerate(tokenized_corpus)
   ]
   model = Doc2Vec(tagged_corpus)
   embeddings = model.dv.vectors
   # Pass the Doc2Vec embeddings to the card; otherwise they would go unused.
   cards.append(ClusteringCard("Doc2Vec Clustering and Projection", embeddings=embeddings))

Then let's start the dashboard.

.. code-block:: python

   from embedding_explorer import show_dashboard

   show_dashboard(cards)

.. image:: _static/screenshot_dashboard.png
   :width: 800
   :alt: Dashboard.

API Reference
^^^^^^^^^^^^^

.. autofunction:: embedding_explorer.show_dashboard

.. autoclass:: embedding_explorer.cards.NetworkCard

.. autoclass:: embedding_explorer.cards.ClusteringCard
@@ -0,0 +1,87 @@
.. embedding-explorer documentation master file, created by
   sphinx-quickstart on Wed Nov 15 11:24:20 2023.
   You can adapt this file completely to your liking, but it should at least
   contain the root `toctree` directive.

Getting Started
===============

embedding-explorer is a set of tools for interactive exploration of embedding models.
This website contains a user guide and API reference.

Installation
^^^^^^^^^^^^

You can install embedding-explorer from PyPI.

.. code-block::

   pip install embedding-explorer

Usage
^^^^^

As an example, let us train a word embedding model on a corpus and then investigate the semantic relations in this model using semantic networks.
We are going to train a GloVe model on the openly available 20Newsgroups dataset.

For this we will also need glovpy, so let's install that.
Glovpy has essentially the same API as gensim's word embedding models, so this example is easily extensible to gensim models.

.. code-block::

   pip install glovpy

Then we train an embedding model.
We do this by first loading the corpus, then tokenizing each text, and then passing the tokenized corpus to our embedding model.

.. code-block:: python

   from gensim.utils import tokenize
   from glovpy import GloVe
   from sklearn.datasets import fetch_20newsgroups

   # Loading the dataset
   newsgroups = fetch_20newsgroups(
       remove=("headers", "footers", "quotes"),
   ).data
   # Tokenizing the dataset
   tokenized_corpus = [
       list(tokenize(text, lower=True, deacc=True)) for text in newsgroups
   ]
   # Training word embeddings
   model = GloVe(vector_size=25)
   model.train(tokenized_corpus)
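
Since glovpy mirrors gensim's word embedding model API, a gensim model can be dropped in with essentially the same code. Purely as an illustrative sketch (the hyperparameters here are assumptions, not recommendations):

.. code-block:: python

   from gensim.models import Word2Vec

   # Word2Vec exposes the same .wv attribute (index_to_key, vectors) used below.
   model = Word2Vec(sentences=tokenized_corpus, vector_size=25, min_count=5)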

Now that we have trained a word embedding model,
we can start the semantic network explorer from embedding-explorer and interactively examine semantic relations in the corpus.

.. code-block:: python

   from embedding_explorer import show_network_explorer

   vocabulary = model.wv.index_to_key
   embeddings = model.wv.vectors
   show_network_explorer(vocabulary, embeddings=embeddings)

You will then be presented with a web application, in which you can query word association networks in the embedding model:

.. image:: _static/network_screenshot.png
   :width: 800
   :alt: Screenshot of Semantic Network.

.. toctree::
   :maxdepth: 2
   :caption: Contents:

   semantic_networks
   projection_clustering
   dashboards


Indices and tables
==================

* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`

docs/_build/html/_sources/projection_clustering.rst.txt: 112 additions, 0 deletions
@@ -0,0 +1,112 @@
.. _projection_clustering:

Projection and Clustering
=========================

embedding-explorer also comes with a built-in tool for projecting whole embedding spaces into two dimensions and investigating the natural clusters that arise in the data.
Since different projection or clustering techniques might produce different results, embedding-explorer lets you dynamically interact with all parameters and stages of the process.

The following steps are followed when you display an embedding space in the app (a rough offline sketch of these stages follows the schematic below):

* Embedding the corpus with an embedding model.
* Optional dimensionality reduction.
* Optional clustering of embeddings.
* Projection into 2D space.

.. image:: _static/clustering_overview.png
   :width: 800
   :alt: Schematic Overview of the Clustering Process.
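
To make these stages concrete, here is a minimal offline sketch of one possible pipeline using scikit-learn. The particular estimators (TruncatedSVD, KMeans, t-SNE) are illustrative assumptions, not necessarily what embedding-explorer uses internally; in the app you choose each stage interactively.

.. code-block:: python

   from sklearn.cluster import KMeans
   from sklearn.datasets import fetch_20newsgroups
   from sklearn.decomposition import TruncatedSVD
   from sklearn.feature_extraction.text import TfidfVectorizer
   from sklearn.manifold import TSNE

   corpus = fetch_20newsgroups(remove=("headers", "footers", "quotes")).data
   # 1. Embedding the corpus with an embedding model (tf-idf here).
   embeddings = TfidfVectorizer(stop_words="english", min_df=10).fit_transform(corpus)
   # 2. Optional dimensionality reduction.
   reduced = TruncatedSVD(n_components=50).fit_transform(embeddings)
   # 3. Optional clustering of the embeddings.
   cluster_labels = KMeans(n_clusters=20).fit_predict(reduced)
   # 4. Projection into two dimensions.
   coordinates = TSNE(n_components=2).fit_transform(reduced)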

In this tutorial I will demonstrate how to investigate cluster structures and embeddings in an openly available corpus using tf-idf embeddings.

First let's load a corpus. In this example I will be using 20Newsgroups.

.. code-block:: python

   from sklearn.datasets import fetch_20newsgroups

   # Loading the dataset
   newsgroups = fetch_20newsgroups(
       remove=("headers", "footers", "quotes"),
   )
   corpus = newsgroups.data

Then we are going to embed this corpus using tf-idf weighted bag-of-words representations.

.. code-block:: python

   from sklearn.feature_extraction.text import TfidfVectorizer

   # We are going to filter out stop words and all terms that occur in fewer than 10 documents.
   embeddings = TfidfVectorizer(stop_words="english", min_df=10).fit_transform(corpus)

We can then interactively project and cluster these embeddings by starting the web application.

.. code-block:: python

   from embedding_explorer import show_clustering

   show_clustering(embeddings=embeddings)

You will be presented with a page where you can manually select the parameters and models involved in the clustering and projection process.

.. image:: _static/clustering_params_screenshot.png
   :width: 800
   :alt: Clustering and Projection Parameters.

Here are the results:

.. image:: _static/clustering_screenshot.png
   :width: 800
   :alt: Screenshot of clustering 20 newsgroups.

Metadata
^^^^^^^^

Of course these results are not very useful in the absence of metadata about the individual data points.
We can fix this by passing along metadata about the corpus.

Unfortunately we do not have too much metadata on 20Newsgroups by default, but we can accumulate a couple of useful pieces of information about the corpus into a dataframe.
It would, for example, be nice to be able to visualize the length of the given texts, as well as to know which newsgroup they belong to.
We can also add the first section of each text to the metadata, so that we can see what the actual content of the text is when we hover over it.

.. code-block:: python

   import numpy as np
   import pandas as pd

   # Extracting text lengths in number of characters.
   lengths = [len(text) for text in corpus]
   # Extracting the first 400 characters from each text.
   text_starts = [text[:400] for text in corpus]
   # Extracting the group each text belongs to.
   # Sklearn gives the labels back as integers; we have to map them back to
   # the actual textual labels.
   group_labels = np.array(newsgroups.target_names)[newsgroups.target]
   # We build a dataframe with the available metadata.
   metadata = pd.DataFrame(dict(length=lengths, text=text_starts, group=group_labels))

We can then pass this metadata along to the app.
We can also select what information should be shown when a data point is hovered over.

.. code-block:: python

   show_clustering(
       embeddings=embeddings,
       metadata=metadata,
       hover_name="group",
       hover_data=["text", "length"],
   )

.. image:: _static/clustering_hover_screenshot.png
   :width: 800
   :alt: Screenshot of hovering over a data point in clustering.

In the app you can also select how data points are labelled, sized and colored.

.. image:: _static/clustering_length_size.png
   :width: 800
   :alt: Screenshot of sizing data points by text length in clustering.

API Reference
^^^^^^^^^^^^^

.. autofunction:: embedding_explorer.show_clustering