Commit 422e663 (1 parent: 99985ad)
Showing 55 changed files with 5,355 additions and 0 deletions.
5 binary files not shown.
@@ -0,0 +1,4 @@
# Sphinx build info version 1
# This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done.
config: fdd474e5b6f510a34b712d030d4164e2
tags: 645f666f9bcd5a90fca523b33c5a78b7
@@ -0,0 +1,103 @@
.. _dashboards:

Dashboards
==========

If you have multiple embedding models, or you want to explore the same embeddings with different tools, and you still want to have them all in the same web application, dashboards are here to help.

Dashboards are made up of a list of cards. Each card represents a page in the application.
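
For instance, a dashboard with two pages could be assembled like this. The data here is random toy data, purely to show the card-list structure; the rest of this page builds a real example from scratch.

.. code-block:: python

   import numpy as np

   from embedding_explorer import show_dashboard
   from embedding_explorer.cards import ClusteringCard, NetworkCard

   # Toy vocabulary and embeddings, just to illustrate the structure.
   vocabulary = [f"word_{i}" for i in range(100)]
   word_embeddings = np.random.rand(100, 16)
   document_embeddings = np.random.rand(500, 16)

   cards = [
       NetworkCard("Words", corpus=vocabulary, embeddings=word_embeddings),
       ClusteringCard("Documents", embeddings=document_embeddings),
   ]
   show_dashboard(cards)  # each card becomes its own page in the app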

Let's say, for example, that you want to examine the semantic relations in a corpus from multiple perspectives.
You could have a word-level GloVe model and a semantic network app to go along with it.
You also want a document-level clustering/projection app, but you can't decide whether you should use tf-idf representations or paragraph embeddings (Doc2Vec).

You can include all of these in a dashboard as cards.
Let's build all of this from scratch.

We will need gensim and glovpy, so let's install those:

.. code-block::

   pip install glovpy gensim

First we load 20Newsgroups:

.. code-block:: python

   from sklearn.datasets import fetch_20newsgroups

   # Loading the dataset
   newsgroups = fetch_20newsgroups(
       remove=("headers", "footers", "quotes"),
   )
   corpus = newsgroups.data

Let's import all card types and initialize our cards to be an empty list:

.. code-block:: python

   from embedding_explorer.cards import NetworkCard, ClusteringCard

   cards = []

Then let's train a word embedding model and add it as a card to the dashboard.

.. code-block:: python

   from glovpy import GloVe
   from gensim.utils import tokenize

   # Tokenizing the dataset
   tokenized_corpus = [
       list(tokenize(text, lower=True, deacc=True)) for text in corpus
   ]
   # Training word embeddings
   model = GloVe(vector_size=25)
   model.train(tokenized_corpus)
   # Adding a Semantic Network card to the dashboard
   vocabulary = model.wv.index_to_key
   embeddings = model.wv.vectors
   cards.append(
       NetworkCard("GloVe Semantic Networks", corpus=vocabulary, embeddings=embeddings)
   )

Next let's extract tf-idf representations of the documents and add a clustering card to our cards.

.. code-block:: python

   from sklearn.feature_extraction.text import TfidfVectorizer

   # We are going to filter out stop words and all terms that occur in fewer than 10 documents.
   embeddings = TfidfVectorizer(stop_words="english", min_df=10).fit_transform(corpus)
   cards.append(ClusteringCard("tf-idf Clustering and Projection", embeddings=embeddings))

And for the last one we are going to train Doc2Vec representations.

.. code-block:: python

   from gensim.models.doc2vec import Doc2Vec, TaggedDocument

   tagged_corpus = [
       TaggedDocument(tokens, [i]) for i, tokens in enumerate(tokenized_corpus)
   ]
   model = Doc2Vec(tagged_corpus)
   embeddings = model.dv.vectors
   # Pass the Doc2Vec embeddings to the card; otherwise they would go unused.
   cards.append(ClusteringCard("Doc2Vec Clustering and Projection", embeddings=embeddings))

Then let's start the dashboard.

.. code-block:: python

   from embedding_explorer import show_dashboard

   show_dashboard(cards)

.. image:: _static/screenshot_dashboard.png
   :width: 800
   :alt: Dashboard.

API Reference
^^^^^^^^^^^^^

.. autofunction:: embedding_explorer.show_dashboard

.. autoclass:: embedding_explorer.cards.NetworkCard

.. autoclass:: embedding_explorer.cards.ClusteringCard
@@ -0,0 +1,87 @@
.. embedding-explorer documentation master file, created by
   sphinx-quickstart on Wed Nov 15 11:24:20 2023.
   You can adapt this file completely to your liking, but it should at least
   contain the root `toctree` directive.

Getting Started
===============

embedding-explorer is a set of tools for interactive exploration of embedding models.
This website contains a user guide and API reference.

Installation
^^^^^^^^^^^^

You can install embedding-explorer from PyPI.

.. code-block::

   pip install embedding-explorer

Usage
^^^^^

As an example, let us train a word embedding model on a corpus and then investigate the semantic relations in this model using semantic networks.
We are going to train a GloVe model on the openly available 20Newsgroups dataset.

For this we will also need glovpy, so let's install that.
Glovpy has essentially the same API as gensim's word embedding models, so this example is easily extensible to gensim models.

.. code-block::

   pip install glovpy

Then we train an embedding model.
We do this by first loading the corpus, then tokenizing each text, and then passing the tokenized corpus to our embedding model.

.. code-block:: python

   from gensim.utils import tokenize
   from glovpy import GloVe
   from sklearn.datasets import fetch_20newsgroups

   # Loading the dataset
   newsgroups = fetch_20newsgroups(
       remove=("headers", "footers", "quotes"),
   ).data
   # Tokenizing the dataset
   tokenized_corpus = [
       list(tokenize(text, lower=True, deacc=True)) for text in newsgroups
   ]
   # Training word embeddings
   model = GloVe(vector_size=25)
   model.train(tokenized_corpus)
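
Since glovpy mirrors gensim's word embedding model API, a gensim model can be dropped in with essentially the same code. Purely as an illustrative sketch (the hyperparameters here are assumptions, not recommendations):

.. code-block:: python

   from gensim.models import Word2Vec

   # Word2Vec exposes the same .wv attribute (index_to_key, vectors) used below.
   model = Word2Vec(sentences=tokenized_corpus, vector_size=25, min_count=5)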

Now that we have trained a word embedding model,
we can start the semantic network explorer from embedding-explorer and interactively examine semantic relations in the corpus.

.. code-block:: python

   from embedding_explorer import show_network_explorer

   vocabulary = model.wv.index_to_key
   embeddings = model.wv.vectors
   show_network_explorer(vocabulary, embeddings=embeddings)

You will then be presented with a web application, in which you can query word association networks in the embedding model:

.. image:: _static/network_screenshot.png
   :width: 800
   :alt: Screenshot of Semantic Network.

.. toctree::
   :maxdepth: 2
   :caption: Contents:

   semantic_networks
   projection_clustering
   dashboards


Indices and tables
==================

* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`

docs/_build/html/_sources/projection_clustering.rst.txt: 112 additions, 0 deletions
@@ -0,0 +1,112 @@
.. _projection_clustering:

Projection and Clustering
=========================

embedding-explorer also comes with a built-in tool for projecting whole embedding spaces into two dimensions and investigating the natural clusters that arise in the data.
Since different projection or clustering techniques might produce different results, embedding-explorer lets you dynamically interact with all parameters and stages of the process.

The following steps are followed when you display an embedding space in the app (a rough offline sketch of these stages follows the schematic below):

* Embedding the corpus with an embedding model.
* Optional dimensionality reduction.
* Optional clustering of embeddings.
* Projection into 2D space.

.. image:: _static/clustering_overview.png
   :width: 800
   :alt: Schematic Overview of the Clustering Process.
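
To make these stages concrete, here is a minimal offline sketch of one possible pipeline using scikit-learn. The particular estimators (TruncatedSVD, KMeans, t-SNE) are illustrative assumptions, not necessarily what embedding-explorer uses internally; in the app you choose each stage interactively.

.. code-block:: python

   from sklearn.cluster import KMeans
   from sklearn.datasets import fetch_20newsgroups
   from sklearn.decomposition import TruncatedSVD
   from sklearn.feature_extraction.text import TfidfVectorizer
   from sklearn.manifold import TSNE

   corpus = fetch_20newsgroups(remove=("headers", "footers", "quotes")).data
   # 1. Embedding the corpus with an embedding model (tf-idf here).
   embeddings = TfidfVectorizer(stop_words="english", min_df=10).fit_transform(corpus)
   # 2. Optional dimensionality reduction.
   reduced = TruncatedSVD(n_components=50).fit_transform(embeddings)
   # 3. Optional clustering of the embeddings.
   cluster_labels = KMeans(n_clusters=20).fit_predict(reduced)
   # 4. Projection into two dimensions.
   coordinates = TSNE(n_components=2).fit_transform(reduced)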

In this tutorial I will demonstrate how to investigate cluster structures and embeddings in an openly available corpus using tf-idf embeddings.

First let's load a corpus. In this example I will be using 20Newsgroups.

.. code-block:: python

   from sklearn.datasets import fetch_20newsgroups

   # Loading the dataset
   newsgroups = fetch_20newsgroups(
       remove=("headers", "footers", "quotes"),
   )
   corpus = newsgroups.data

Then we are going to embed this corpus using tf-idf weighted bag-of-words representations.

.. code-block:: python

   from sklearn.feature_extraction.text import TfidfVectorizer

   # We are going to filter out stop words and all terms that occur in fewer than 10 documents.
   embeddings = TfidfVectorizer(stop_words="english", min_df=10).fit_transform(corpus)

We can then interactively project and cluster these embeddings by starting the web application.

.. code-block:: python

   from embedding_explorer import show_clustering

   show_clustering(embeddings=embeddings)

You will be presented with a page where you can manually select the parameters and models involved in the clustering and projection process.

.. image:: _static/clustering_params_screenshot.png
   :width: 800
   :alt: Clustering and Projection Parameters.

Here are the results:

.. image:: _static/clustering_screenshot.png
   :width: 800
   :alt: Screenshot of clustering 20 newsgroups.

Metadata
^^^^^^^^

Of course these results are not very useful in the absence of metadata about the individual data points.
We can fix this by passing along metadata about the corpus.

Unfortunately we do not have too much metadata on 20Newsgroups by default, but we can accumulate a couple of useful pieces of information about the corpus into a dataframe.
It would, for example, be nice to be able to visualize the length of the given texts, as well as to know which newsgroup they belong to.
We can also add the first section of each text to the metadata, so that we can see what the actual content of the text is when we hover over it.

.. code-block:: python

   import numpy as np
   import pandas as pd

   # Extracting text lengths in number of characters.
   lengths = [len(text) for text in corpus]
   # Extracting the first 400 characters from each text.
   text_starts = [text[:400] for text in corpus]
   # Extracting the group each text belongs to.
   # Sklearn gives the labels back as integers; we have to map them back to
   # the actual textual labels.
   group_labels = np.array(newsgroups.target_names)[newsgroups.target]
   # We build a dataframe with the available metadata.
   metadata = pd.DataFrame(dict(length=lengths, text=text_starts, group=group_labels))

We can then pass this metadata along to the app.
We can also select what information should be shown when a data point is hovered over.

.. code-block:: python

   show_clustering(
       embeddings=embeddings,
       metadata=metadata,
       hover_name="group",
       hover_data=["text", "length"],
   )

.. image:: _static/clustering_hover_screenshot.png
   :width: 800
   :alt: Screenshot of hovering over a data point in clustering.

In the app you can also select how data points are labelled, sized and colored.

.. image:: _static/clustering_length_size.png
   :width: 800
   :alt: Screenshot of sizing data points by text length in clustering.

API Reference
^^^^^^^^^^^^^

.. autofunction:: embedding_explorer.show_clustering