- {% block body %}{% endblock %}
-
- {% if self.comments()|trim %}
-
- {% block comments %}{% endblock %}
-
- {% endif%}
- diff --git a/docs/_static/js/custom.js b/docs/_static/js/custom.js index 19a2d02c7..23582c99a 100644 --- a/docs/_static/js/custom.js +++ b/docs/_static/js/custom.js @@ -1,13 +1,18 @@ function addGithubButton() { const div = `
`; - document.getElementById("github-button").innerHTML = div; + document.getElementsByClassName("logo")[0].parentElement.insertAdjacentHTML("afterend", div); } diff --git a/docs/_themes/sphinx_rtd_theme/__init__.py b/docs/_themes/sphinx_rtd_theme/__init__.py deleted file mode 100644 index 0f739cce4..000000000 --- a/docs/_themes/sphinx_rtd_theme/__init__.py +++ /dev/null @@ -1,33 +0,0 @@ -""" -Sphinx Read the Docs theme. - -From https://github.com/ryan-roemer/sphinx-bootstrap-theme. -""" - -from os import path - -import sphinx - -__version__ = "0.5.0" -__version_full__ = __version__ - - -def get_html_theme_path(): - """Return list of HTML theme paths.""" - cur_dir = path.abspath(path.dirname(path.dirname(__file__))) - return cur_dir - - -# See http://www.sphinx-doc.org/en/stable/theming.html#distribute-your-theme-as-a-python-package -def setup(app): - if sphinx.version_info >= (1, 6, 0): - # Register the theme that can be referenced without adding a theme path - app.add_html_theme("sphinx_rtd_theme", path.abspath(path.dirname(__file__))) - - if sphinx.version_info >= (1, 8, 0): - # Add Sphinx message catalog for newer versions of Sphinx - # See http://www.sphinx-doc.org/en/master/extdev/appapi.html#sphinx.application.Sphinx.add_message_catalog - rtd_locale_path = path.join(path.abspath(path.dirname(__file__)), "locale") - app.add_message_catalog("sphinx", rtd_locale_path) - - return {"parallel_read_safe": True, "parallel_write_safe": True} diff --git a/docs/_themes/sphinx_rtd_theme/breadcrumbs.html b/docs/_themes/sphinx_rtd_theme/breadcrumbs.html deleted file mode 100644 index cf64d1547..000000000 --- a/docs/_themes/sphinx_rtd_theme/breadcrumbs.html +++ /dev/null @@ -1,84 +0,0 @@ -{# Support for Sphinx 1.3+ page_source_suffix, but don't break old builds. 
#} - -{% if page_source_suffix %} -{% set suffix = page_source_suffix %} -{% else %} -{% set suffix = source_suffix %} -{% endif %} - -{% if meta is defined and meta is not none %} -{% set check_meta = True %} -{% else %} -{% set check_meta = False %} -{% endif %} - -{% if check_meta and 'github_url' in meta %} -{% set display_github = True %} -{% endif %} - -{% if check_meta and 'bitbucket_url' in meta %} -{% set display_bitbucket = True %} -{% endif %} - -{% if check_meta and 'gitlab_url' in meta %} -{% set display_gitlab = True %} -{% endif %} - -{% set display_vcs_links = display_vcs_links if display_vcs_links is defined else True %} - - diff --git a/docs/_themes/sphinx_rtd_theme/footer.html b/docs/_themes/sphinx_rtd_theme/footer.html deleted file mode 100644 index 4c4c2b429..000000000 --- a/docs/_themes/sphinx_rtd_theme/footer.html +++ /dev/null @@ -1,62 +0,0 @@ - - diff --git a/docs/_themes/sphinx_rtd_theme/layout.html b/docs/_themes/sphinx_rtd_theme/layout.html deleted file mode 100644 index 4ed2d47c0..000000000 --- a/docs/_themes/sphinx_rtd_theme/layout.html +++ /dev/null @@ -1,243 +0,0 @@ -{# TEMPLATE VAR SETTINGS #} -{%- set url_root = pathto('', 1) %} -{%- if url_root == '#' %}{% set url_root = '' %}{% endif %} -{%- if not embedded and docstitle %} - {%- set titlesuffix = " — "|safe + docstitle|e %} -{%- else %} - {%- set titlesuffix = "" %} -{%- endif %} -{%- set lang_attr = 'en' if language == None else (language | replace('_', '-')) %} -{%- set sphinx_writer = 'writer-html5' if html5_doctype else 'writer-html4' %} - - - - - - {{ metatags }} - - {% block htmltitle %} -{{ _('Your search did not match any documents. Please make sure that all words are spelled correctly and that you\'ve selected enough categories.') }}
- {% endif %} - {% endif %} -{{ context|e }}
-SentenceTransformer.similarity
](../../../docs/package_reference/sentence_transformer/SentenceTransformer.html#sentence_transformers.SentenceTransformer.similarity).
+```{eval-rst}
+For small corpora (up to about 1 million entries), we can perform semantic search with a manual implementation by computing the embeddings for the corpus as well as for our query, and then calculating the `semantic textual similarity <../../../docs/sentence_transformer/usage/semantic_textual_similarity.html>`_ using :func:`SentenceTransformer.similarity util.semantic_search
](../../../docs/package_reference/util.html#sentence_transformers.util.semantic_search) function.
+```{eval-rst}
+Instead of implementing semantic search by yourself, you can use the :func:`util.semantic_search util.semantic_search
](../../../docs/package_reference/util.html#sentence_transformers.util.semantic_search) method, it is advisable to have the `query_embeddings` as well as the `corpus_embeddings` on the same GPU-device. This significantly boost the performance. Further, we can normalize the corpus embeddings so that each corpus embeddings is of length 1. In that case, we can use dot-product for computing scores.
-```python
-corpus_embeddings = corpus_embeddings.to("cuda")
-corpus_embeddings = util.normalize_embeddings(corpus_embeddings)
-
-query_embeddings = query_embeddings.to("cuda")
-query_embeddings = util.normalize_embeddings(query_embeddings)
-hits = util.semantic_search(query_embeddings, corpus_embeddings, score_function=util.dot_score)
+```{eval-rst}
+To get the optimal speed for the :func:`util.semantic_search util.semantic_search
](../../../docs/package_reference/util.html#sentence_transformers.util.semantic_search)).
+```{eval-rst}
+Searching a large corpus with millions of embeddings can be time-consuming if exact nearest neighbor search is used (like it is used by :func:`util.semantic_search util.semantic_search
](../../../docs/package_reference/util.html#sentence_transformers.util.semantic_search) method. As model, we use [distilbert-multilingual-nli-stsb-quora-ranking](https://huggingface.co/sentence-transformers/distilbert-multilingual-nli-stsb-quora-ranking), which was trained to identify similar questions and supports 50+ languages. Hence, the user can input the question in any of the 50+ languages. This is a **symmetric search task**, as the search queries have the same length and content as the questions in the corpus.
+[semantic_search_quora_pytorch.py](semantic_search_quora_pytorch.py) [ [Colab version](https://colab.research.google.com/drive/12cn5Oo0v3HfQQ8Tv6-ukgxXSmT3zl35A?usp=sharing) ] shows an example based on the [Quora duplicate questions](https://www.quora.com/q/quoradata/First-Quora-Dataset-Release-Question-Pairs) dataset. The user can enter a question, and the code retrieves the most similar questions from the dataset using `util.semantic_search`. As the model, we use [distilbert-multilingual-nli-stsb-quora-ranking](https://huggingface.co/sentence-transformers/distilbert-multilingual-nli-stsb-quora-ranking), which was trained to identify similar questions and supports 50+ languages. Hence, the user can input the question in any of the 50+ languages. This is a **symmetric search task**, as the search queries are similar in length and content to the questions in the corpus.
### Similar Publication Retrieval
[semantic_search_publications.py](semantic_search_publications.py) [ [Colab version](https://colab.research.google.com/drive/12hfBveGHRsxhPIUMmJYrll2lFU4fOX06?usp=sharing) ] shows an example of how to find similar scientific publications. As the corpus, we use all publications that have been presented at the EMNLP 2016 - 2018 conferences. As the search query, we input the title and abstract of more recent publications and find related publications from our corpus. We use the [SPECTER](https://huggingface.co/sentence-transformers/allenai-specter) model. This is a **symmetric search task**, as each paper in the corpus consists of title & abstract and we search for title & abstract.
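
The examples above rely on exact nearest-neighbour search over embeddings. As a minimal, dependency-free sketch of that idea (not the library's implementation), the snippet below normalizes each vector to length 1 and ranks corpus entries by dot product. The helper names `normalize` and `top_k_search` and the toy 2-dimensional vectors are assumptions made for illustration only.

```python
import math


def normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def top_k_search(query, corpus, k=2):
    """Return the k best (corpus_index, score) pairs, highest score first.

    With unit-length vectors, the dot product equals cosine similarity.
    """
    q = normalize(query)
    scored = []
    for idx, emb in enumerate(corpus):
        c = normalize(emb)
        score = sum(a * b for a, b in zip(q, c))
        scored.append((idx, score))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy "embeddings"; real ones would come from a model.
corpus = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
hits = top_k_search([2.0, 0.0], corpus, k=2)
for idx, score in hits:
    print(idx, round(score, 3))
```

In practice `util.semantic_search` performs this kind of exact search on tensors; the sketch only illustrates the ranking step.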
diff --git a/examples/domain_adaptation/README.md b/examples/domain_adaptation/README.md
index a5deee55c..6837c7aad 100644
--- a/examples/domain_adaptation/README.md
+++ b/examples/domain_adaptation/README.md
@@ -62,8 +62,8 @@ GPL works in three phases:
- **Query Generation**: For a given text from our domain, we first use a T5 model that generates a possible query for the given text. E.g. when your text is *"Python is a high-level general-purpose programming language"*, the model might generate a query like *"What is Python"*. You can find various query generators on our [doc2query-hub](https://huggingface.co/doc2query).
- **Negative Mining**: Next, for the generated query *"What is Python"* we mine negative passages from our corpus, i.e. passages that are similar to the query but which a user would not consider relevant. Such a negative passage could be *"Java is a high-level, class-based, object-oriented programming language."*. We do this mining using dense retrieval, i.e. we use one of the existing text embedding models and retrieve relevant paragraphs for the given query.
-- **Pseudo Labeling**: It might be that in the negative mining step we retrieve a passage that is actually relevant for the query (like another definition for *"What is Python"*). To overcome this issue, we use a [Cross-Encoder](../applications/cross-encoder/README.html) to score all (query, passage)-pairs.
-- **Training**: Once we have the triplets *(generated query, positive passage, mined negative passage)* and the Cross-Encoder scores for *(query, positive)* and *(query, negative)* we can start training the text embedding model using [MarginMSELoss](../../docs/package_reference/sentence_transformer/losses.html#marginmseloss).
+- **Pseudo Labeling**: It might be that in the negative mining step we retrieve a passage that is actually relevant for the query (like another definition for *"What is Python"*). To overcome this issue, we use a [Cross-Encoder](../applications/cross-encoder/README.md) to score all (query, passage)-pairs.
+- **Training**: Once we have the triplets *(generated query, positive passage, mined negative passage)* and the Cross-Encoder scores for *(query, positive)* and *(query, negative)* we can start training the text embedding model using [MarginMSELoss](../../docs/package_reference/sentence_transformer/losses.md#marginmseloss).
The **pseudo labeling** step is quite important and results in increased performance compared to the previous method QGen, which treated passages just as positive (1) or negative (0). As we see in the following picture, for a generated query (*"what is futures contract"*), the negative mining step retrieves passages that are partly or highly relevant to the generated query. Using MarginMSELoss and the Cross-Encoder, we can identify these passages and teach the text embedding model that these passages are also relevant for the given query.
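
To make the training signal concrete, here is a rough sketch of the margin-based objective, assuming the loss is the mean squared error between the embedding model's score margin and the Cross-Encoder's score margin for each triplet. The function `margin_mse` is a hypothetical helper, not the library's `MarginMSELoss`, and all scores are invented for illustration.

```python
def margin_mse(student_pos, student_neg, teacher_pos, teacher_neg):
    """Mean squared error between student and teacher score margins.

    A margin is score(query, positive) - score(query, negative).
    """
    errors = []
    for sp, sn, tp, tn in zip(student_pos, student_neg, teacher_pos, teacher_neg):
        student_margin = sp - sn
        teacher_margin = tp - tn
        errors.append((student_margin - teacher_margin) ** 2)
    return sum(errors) / len(errors)


# Two illustrative triplets. In the second one, the mined "negative" is
# nearly as relevant as the positive, so the teacher margin is small.
teacher_pos = [9.0, 8.0]
teacher_neg = [1.0, 7.5]
student_pos = [6.0, 4.0]
student_neg = [2.0, 3.5]

loss = margin_mse(student_pos, student_neg, teacher_pos, teacher_neg)
print(loss)
```

Because the target is a margin rather than a 0/1 label, a mined passage that the Cross-Encoder scores as almost relevant yields a small target margin instead of being pushed far away from the query.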
diff --git a/examples/training/adaptive_layer/README.md b/examples/training/adaptive_layer/README.md
index b904c40c8..e1203b68d 100644
--- a/examples/training/adaptive_layer/README.md
+++ b/examples/training/adaptive_layer/README.md
@@ -2,7 +2,7 @@
Embedding models are often encoder models with numerous layers, such as 12 (e.g. [all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2)) or 6 (e.g. [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2)). To get embeddings, every single one of these layers must be traversed. The [2D Matryoshka Sentence Embeddings](https://arxiv.org/abs/2402.14776v1) (2DMSE) preprint revisits this concept by proposing an approach to train embedding models that will perform well when only using a selection of all layers. This results in faster inference speeds at relatively low performance costs.
-```eval_rst
+```{eval-rst}
.. note::
The 2DMSE preprint was later updated and renamed to `ESE: Espresso Sentence Embeddings MatryoshkaLoss
-Additionally, this can be combined with the `AdaptiveLayerLoss` such that the resulting model can be reduced both in the size of the output dimensions, but also in the number of layers for faster inference. See also the [Adaptive Layers](../adaptive_layer/README.html) for more information on reducing the number of model layers. In Sentence Transformers, the combination of these two losses is called `Matryoshka2dLoss`, and a shorthand is provided for simpler training.
+Additionally, this can be combined with the `AdaptiveLayerLoss` such that the resulting model can be reduced both in the size of the output dimensions, but also in the number of layers for faster inference. See also the [Adaptive Layers](../adaptive_layer/README.md) for more information on reducing the number of model layers. In Sentence Transformers, the combination of these two losses is called `Matryoshka2dLoss`, and a shorthand is provided for simpler training.
```python
from sentence_transformers import SentenceTransformer
diff --git a/examples/training/ms_marco/README.md b/examples/training/ms_marco/README.md
index 1ceac4c3d..9f9e5cef3 100644
--- a/examples/training/ms_marco/README.md
+++ b/examples/training/ms_marco/README.md
@@ -5,7 +5,7 @@ This page shows how to **train** Sentence Transformer models on this dataset so
If you are interested in how to use these models, see [Application - Retrieve & Re-Rank](../../applications/retrieve_rerank/README.md).
-There are **pre-trained models** available, which you can directly use without the need of training your own models. For more information, see: [Pretrained Models > MSMARCO Passage Models](../../../docs/sentence_transformer/pretrained_models.html#msmarco-passage-models).
+There are **pre-trained models** available, which you can directly use without the need of training your own models. For more information, see: [Pretrained Models > MSMARCO Passage Models](../../../docs/sentence_transformer/pretrained_models.md#msmarco-passage-models).
## Bi-Encoder
@@ -18,7 +18,7 @@ This page describes two strategies to **train an bi-encoder** on the MS MARCO da
### MultipleNegativesRankingLoss
**Training code: [train_bi-encoder_mnrl.py](train_bi-encoder_mnrl.py)**
-```eval_rst
+```{eval-rst}
When we use :class:`~sentence_transformers.losses.MultipleNegativesRankingLoss`, we provide triplets: ``(query, positive_passage, negative_passage)`` where ``positive_passage`` is the relevant passage to the query and ``negative_passage`` is a non-relevant passage to the query. We compute the embeddings for all queries, positive passages, and negative passages in the corpus and then optimize the following objective: The ``(query, positive_passage)`` pair must be close in the vector space, while ``(query, negative_passage)`` should be distant in vector space.
To further improve the training, we use **in-batch negatives**:
@@ -32,7 +32,7 @@ One way to **improve training** is to choose really good negatives, also know as
We find these hard negatives in the following way: We use existing retrieval systems (e.g. lexical search and other bi-encoder retrieval systems), and for each query we find the most relevant passages. We then use a powerful [cross-encoder/ms-marco-MiniLM-L-6-v2](https://huggingface.co/cross-encoder/ms-marco-MiniLM-L-6-v2) [Cross-Encoder](../../applications/cross-encoder/README.md) to score the found `(query, passage)` pairs. We provide scores for 160 million such pairs in our [MS MARCO Mined Triplet dataset collection](https://huggingface.co/collections/sentence-transformers/ms-marco-mined-triplets-6644d6f1ff58c5103fe65f23).
-```eval_rst
+```{eval-rst}
For :class:`~sentence_transformers.losses.MultipleNegativesRankingLoss`, we must ensure that in the triplet ``(query, positive_passage, negative_passage)`` the ``negative_passage`` is indeed not relevant for the query. The MS MARCO dataset is sadly **highly redundant**, and even though there is on average only one passage marked as relevant for a query, it actually contains many passages that humans would consider as relevant. We must ensure that these passages are **not passed as negatives**: We do this by ensuring a certain threshold in the CrossEncoder scores between the relevant passages and the mined hard negative. By default, we set a threshold of 3: If the ``(query, positive_passage)`` gets a score of 9 from the CrossEncoder, then we will only consider negatives with a score below 6 from the CrossEncoder. This threshold ensures that we actually use negatives in our triplets.
```
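
The threshold rule described above can be sketched in a few lines. This is an illustrative filter under the stated assumption (keep a candidate only if its CrossEncoder score is more than the margin below the positive's score); `filter_hard_negatives` is a hypothetical helper and the passages and scores are made up.

```python
def filter_hard_negatives(positive_score, candidates, margin=3.0):
    """Keep candidates scoring strictly below positive_score - margin.

    `candidates` is a list of (passage, cross_encoder_score) pairs.
    """
    cutoff = positive_score - margin
    return [passage for passage, score in candidates if score < cutoff]


# Positive passage scored 9, so only candidates scoring below 6 survive.
candidates = [
    ("likely an unlabeled positive", 8.5),
    ("borderline passage", 6.0),
    ("usable hard negative", 2.0),
]
negatives = filter_hard_negatives(9.0, candidates)
print(negatives)
```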
@@ -52,7 +52,7 @@ print(train_dataset[0])
### MarginMSE
**Training code: [train_bi-encoder_margin-mse.py](train_bi-encoder_margin-mse.py)**
-```eval_rst
+```{eval-rst}
:class:`~sentence_transformers.losses.MarginMSELoss` is based on the paper of `Hofstätter et al