From 6cc2816cb0d9ae0b454586a41ed7d1aceef4599b Mon Sep 17 00:00:00 2001 From: Jan Heinrich Merker Date: Mon, 2 Dec 2024 09:08:51 +0000 Subject: [PATCH] Improve notebook documentation --- .../baseline-retrieval-system.ipynb | 290 +++++++++++++----- 1 file changed, 210 insertions(+), 80 deletions(-) diff --git a/baseline-retrieval-system/baseline-retrieval-system.ipynb b/baseline-retrieval-system/baseline-retrieval-system.ipynb index e00f035..0e9fae7 100644 --- a/baseline-retrieval-system/baseline-retrieval-system.ipynb +++ b/baseline-retrieval-system/baseline-retrieval-system.ipynb @@ -4,59 +4,141 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# IR Lab WiSe 2024/2025: Baseline Retrieval System\n", + "# Information Retrieval Lab WiSe 2024/2025: Baseline Retrieval System\n", "\n", - "This jupyter notebook serves as baseline retrieval system that you can try to improve upon.\n", - "We will work in the MS MARCO scenario (retrieving passages of web documents). This Jupyter notebook serves as retrieval system and makes a submission to TIRA.\n", + "This Jupyter notebook serves as a baseline retrieval system that you can improve upon.\n", + "We use subsets of the MS MARCO datasets to retrieve passages of web documents.\n", + "We will show you how to create a software submission to TIRA from this notebook.\n", "\n", - "An overview of all corpora that we use in the course is available at [https://tira.io/datasets?query=ir-lab-wise-2024](https://tira.io/datasets?query=ir-lab-wise-2024). The dataset ids with which you can load the datasets are:\n", + "An overview of all corpora that we use in the current course is available at [https://tira.io/datasets?query=ir-lab-wise-2024](https://tira.io/datasets?query=ir-lab-wise-2024). The dataset IDs for loading the datasets are:\n", "\n", - "- `ir-lab-wise-2024/subsampled-ms-marco-deep-learning-20241201-training`: A subsample of the 2019 and 2020 TREC Deep Learning tracks on the MS MARCO v1 passage dataset. 
You can use this dataset to develop your system.\n", - "- `ir-lab-wise-2024/subsampled-ms-marco-rag-20241202-training` (work in progress): A subsample of the 2024 TREC RAG track on the MS MARCO v2.1 passage dataset. You can use this dataset to develop your system.\n", - "- `ir-lab-wise-2024/ms-marco-rag-20241203-test`: (Not ready yet). The test corpus that we all developed together throughout the course on the MS MARCO v2.1 passage dataset. This dataset is the final test dataset, i.e., evaluation scores become available only after the submission deadline.\n" + "- `ir-lab-wise-2024/subsampled-ms-marco-deep-learning-20241201-training`: A subsample of the TREC 2019/2020 Deep Learning tracks on the MS MARCO v1 passage dataset. Use this dataset to tune your system(s).\n", + "- `ir-lab-wise-2024/subsampled-ms-marco-rag-20241202-training` (_work in progress_): A subsample of the TREC 2024 Retrieval-Augmented Generation track on the MS MARCO v2.1 passage dataset. Use this dataset to tune your system(s).\n", + "- `ir-lab-wise-2024/ms-marco-rag-20241203-test` (work in progress): The test corpus that we have created together in the course, based on the MS MARCO v2.1 passage dataset. We will use this dataset as the test dataset, i.e., evaluation scores become available only after the submission deadline." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "### Step 1: Import Libraries\n", + "### Step 1: Import libraries\n", "\n", - "We will use [tira](https://www.tira.io/), an information retrieval shared task platform, for loading the (pre-built) retrieval index and [ir_dataset](https://ir-datasets.com/) to subsequently build a retrieval system with [PyTerrier](https://github.com/terrier-org/pyterrier), an open-source search engine.\n", - "\n", - "Building your own index can be already one way that you can try to improve upon this baseline (if you want to focus on creating good document representations). 
Other ways could include reformulating queries or tuning parameters or building better retrieval pipelines." + "We will use [tira](https://tira.io/), an information retrieval shared task platform, and [ir_dataset](https://ir-datasets.com/) for loading the datasets. Subsequently, we will build a retrieval system with [PyTerrier](https://github.com/terrier-org/pyterrier), an open-source search engine framework." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "First, we need to install the required libraries." ] }, { "cell_type": "code", - "execution_count": null, + "execution_count": 1, "metadata": {}, - "outputs": [], + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Collecting tira>=0.0.139\n", + " Downloading tira-0.0.139-py3-none-any.whl (112 kB)\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m112.2/112.2 KB\u001b[0m \u001b[31m2.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0ma \u001b[36m0:00:01\u001b[0m\n", + "\u001b[?25hRequirement already satisfied: ir-datasets in /usr/local/lib/python3.10/dist-packages (0.5.5)\n", + "Requirement already satisfied: python-terrier==0.10.0 in /usr/local/lib/python3.10/dist-packages (0.10.0)\n", + "Requirement already satisfied: deprecated in /usr/local/lib/python3.10/dist-packages (from python-terrier==0.10.0) (1.2.14)\n", + "Requirement already satisfied: numpy in /usr/local/lib/python3.10/dist-packages (from python-terrier==0.10.0) (1.26.2)\n", + "Requirement already satisfied: scipy in /usr/local/lib/python3.10/dist-packages (from python-terrier==0.10.0) (1.11.4)\n", + "Requirement already satisfied: pyjnius>=1.4.2 in /usr/local/lib/python3.10/dist-packages (from python-terrier==0.10.0) (1.6.1)\n", + "Requirement already satisfied: dill in /usr/local/lib/python3.10/dist-packages (from python-terrier==0.10.0) (0.3.7)\n", + "Requirement already satisfied: joblib in /usr/local/lib/python3.10/dist-packages (from python-terrier==0.10.0) 
(1.3.2)\n", + "Requirement already satisfied: pandas in /usr/local/lib/python3.10/dist-packages (from python-terrier==0.10.0) (2.1.3)\n", + "Requirement already satisfied: more-itertools in /usr/local/lib/python3.10/dist-packages (from python-terrier==0.10.0) (10.1.0)\n", + "Requirement already satisfied: matchpy in /usr/local/lib/python3.10/dist-packages (from python-terrier==0.10.0) (0.5.5)\n", + "Requirement already satisfied: scikit-learn in /usr/local/lib/python3.10/dist-packages (from python-terrier==0.10.0) (1.3.2)\n", + "Requirement already satisfied: chest in /usr/local/lib/python3.10/dist-packages (from python-terrier==0.10.0) (0.2.3)\n", + "Requirement already satisfied: nptyping==1.4.4 in /usr/local/lib/python3.10/dist-packages (from python-terrier==0.10.0) (1.4.4)\n", + "Requirement already satisfied: jinja2 in /usr/local/lib/python3.10/dist-packages (from python-terrier==0.10.0) (3.1.2)\n", + "Requirement already satisfied: pytrec-eval-terrier>=0.5.3 in /usr/local/lib/python3.10/dist-packages (from python-terrier==0.10.0) (0.5.6)\n", + "Requirement already satisfied: requests in /usr/local/lib/python3.10/dist-packages (from python-terrier==0.10.0) (2.31.0)\n", + "Requirement already satisfied: statsmodels in /usr/local/lib/python3.10/dist-packages (from python-terrier==0.10.0) (0.14.0)\n", + "Requirement already satisfied: ir-measures>=0.3.1 in /usr/local/lib/python3.10/dist-packages (from python-terrier==0.10.0) (0.3.3)\n", + "Requirement already satisfied: tqdm in /usr/local/lib/python3.10/dist-packages (from python-terrier==0.10.0) (4.66.1)\n", + "Requirement already satisfied: wget in /usr/local/lib/python3.10/dist-packages (from python-terrier==0.10.0) (3.2)\n", + "Requirement already satisfied: typish>=1.7.0 in /usr/local/lib/python3.10/dist-packages (from nptyping==1.4.4->python-terrier==0.10.0) (1.9.3)\n", + "Requirement already satisfied: packaging in /usr/local/lib/python3.10/dist-packages (from tira>=0.0.139) (23.2)\n", + "Requirement 
already satisfied: docker==7.*,>=7.1.0 in /usr/local/lib/python3.10/dist-packages (from tira>=0.0.139) (7.1.0)\n", + "Requirement already satisfied: urllib3>=1.26.0 in /usr/local/lib/python3.10/dist-packages (from docker==7.*,>=7.1.0->tira>=0.0.139) (2.1.0)\n", + "Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests->python-terrier==0.10.0) (3.6)\n", + "Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests->python-terrier==0.10.0) (2023.11.17)\n", + "Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests->python-terrier==0.10.0) (3.3.2)\n", + "Requirement already satisfied: lxml>=4.5.2 in /usr/local/lib/python3.10/dist-packages (from ir-datasets) (4.9.3)\n", + "Requirement already satisfied: lz4>=3.1.10 in /usr/local/lib/python3.10/dist-packages (from ir-datasets) (4.3.2)\n", + "Requirement already satisfied: trec-car-tools>=2.5.4 in /usr/local/lib/python3.10/dist-packages (from ir-datasets) (2.6)\n", + "Requirement already satisfied: warc3-wet>=0.2.3 in /usr/local/lib/python3.10/dist-packages (from ir-datasets) (0.2.3)\n", + "Requirement already satisfied: unlzw3>=0.2.1 in /usr/local/lib/python3.10/dist-packages (from ir-datasets) (0.2.2)\n", + "Requirement already satisfied: inscriptis>=2.2.0 in /usr/local/lib/python3.10/dist-packages (from ir-datasets) (2.3.2)\n", + "Requirement already satisfied: zlib-state>=0.1.3 in /usr/local/lib/python3.10/dist-packages (from ir-datasets) (0.1.6)\n", + "Requirement already satisfied: pyautocorpus>=0.1.1 in /usr/local/lib/python3.10/dist-packages (from ir-datasets) (0.1.12)\n", + "Requirement already satisfied: pyyaml>=5.3.1 in /usr/local/lib/python3.10/dist-packages (from ir-datasets) (6.0.1)\n", + "Requirement already satisfied: beautifulsoup4>=4.4.1 in /usr/local/lib/python3.10/dist-packages (from ir-datasets) (4.12.2)\n", + "Requirement already satisfied: 
ijson>=3.1.3 in /usr/local/lib/python3.10/dist-packages (from ir-datasets) (3.2.3)\n", + "Requirement already satisfied: warc3-wet-clueweb09>=0.2.5 in /usr/local/lib/python3.10/dist-packages (from ir-datasets) (0.2.5)\n", + "Requirement already satisfied: soupsieve>1.2 in /usr/local/lib/python3.10/dist-packages (from beautifulsoup4>=4.4.1->ir-datasets) (2.5)\n", + "Requirement already satisfied: cwl-eval>=1.0.10 in /usr/local/lib/python3.10/dist-packages (from ir-measures>=0.3.1->python-terrier==0.10.0) (1.0.12)\n", + "Requirement already satisfied: cbor>=1.0.0 in /usr/local/lib/python3.10/dist-packages (from trec-car-tools>=2.5.4->ir-datasets) (1.0.0)\n", + "Requirement already satisfied: heapdict in /usr/local/lib/python3.10/dist-packages (from chest->python-terrier==0.10.0) (1.0.1)\n", + "Requirement already satisfied: wrapt<2,>=1.10 in /usr/local/lib/python3.10/dist-packages (from deprecated->python-terrier==0.10.0) (1.16.0)\n", + "Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.10/dist-packages (from jinja2->python-terrier==0.10.0) (2.1.3)\n", + "Requirement already satisfied: multiset<3.0,>=2.0 in /usr/local/lib/python3.10/dist-packages (from matchpy->python-terrier==0.10.0) (2.1.1)\n", + "Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.10/dist-packages (from pandas->python-terrier==0.10.0) (2023.3.post1)\n", + "Requirement already satisfied: python-dateutil>=2.8.2 in /usr/local/lib/python3.10/dist-packages (from pandas->python-terrier==0.10.0) (2.8.2)\n", + "Requirement already satisfied: tzdata>=2022.1 in /usr/local/lib/python3.10/dist-packages (from pandas->python-terrier==0.10.0) (2023.3)\n", + "Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.10/dist-packages (from scikit-learn->python-terrier==0.10.0) (3.2.0)\n", + "Requirement already satisfied: patsy>=0.5.2 in /usr/local/lib/python3.10/dist-packages (from statsmodels->python-terrier==0.10.0) (0.5.4)\n", + "Requirement already 
satisfied: six in /usr/local/lib/python3.10/dist-packages (from patsy>=0.5.2->statsmodels->python-terrier==0.10.0) (1.16.0)\n", + "Installing collected packages: tira\n", + " Attempting uninstall: tira\n", + " Found existing installation: tira 0.0.138\n", + " Uninstalling tira-0.0.138:\n", + " Successfully uninstalled tira-0.0.138\n", + "Successfully installed tira-0.0.139\n", + "\u001b[33mWARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv\u001b[0m\u001b[33m\n", + "\u001b[0m" + ] + } + ], "source": [ - "# You only need to execute this cell if you are using Google Golab.\n", - "# If you use GitHub Codespaces, everything is already installed.\n", "!pip3 install 'tira>=0.0.139' ir-datasets 'python-terrier==0.10.0'" ] }, { - "cell_type": "code", - "execution_count": 2, + "cell_type": "markdown", "metadata": {}, - "outputs": [], "source": [ - "# Imports\n", - "from tira.third_party_integrations import ensure_pyterrier_is_loaded, persist_and_normalize_run\n", - "from tira.rest_api_client import Client\n", - "import pyterrier as pt" + "Create an API client to interact with the TIRA platform (e.g., to load datasets and submit runs)." 
] }, { "cell_type": "code", - "execution_count": 4, + "execution_count": 2, "metadata": {}, - "outputs": [], + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "PyTerrier 0.10.0 has loaded Terrier 5.8 (built by craigm on 2023-11-01 18:05) and terrier-helper 0.0.8\n", + "\n", + "No etc/terrier.properties, using terrier.default.properties for bootstrap configuration.\n" + ] + } + ], "source": [ - "# Create a REST client to the TIRA platform to load datasets and submit runs.\n", + "from tira.third_party_integrations import ensure_pyterrier_is_loaded\n", + "from tira.rest_api_client import Client\n", + "\n", "ensure_pyterrier_is_loaded()\n", "tira = Client()" ] @@ -65,66 +147,75 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "### Step 2: Load the Dataset" + "### Step 2: Load the dataset\n", + "\n", + "We load the dataset by its ir_datasets ID (as listed in the Readme). Just be sure to add the `irds:` prefix before the dataset ID to tell PyTerrier to load the data from ir_datasets." ] }, { "cell_type": "code", - "execution_count": 7, + "execution_count": 3, "metadata": {}, "outputs": [], "source": [ - "# The dataset: a subsample of the 2019 and 2020 MS MARCO TREC Deep Learning Track\n", - "# This line creates an IRDSDataset object and registers it under the name provided as an argument.\n", - "pt_dataset = pt.get_dataset('irds:ir-lab-wise-2024/subsampled-ms-marco-deep-learning-20241201-training')\n" + "from pyterrier import get_dataset\n", + "\n", + "pt_dataset = get_dataset('irds:ir-lab-wise-2024/subsampled-ms-marco-deep-learning-20241201-training')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "### Step 3: Build an Index\n", - "\n", + "### Step 3: Build an index\n", "\n", - "The type of the index object that we build is ``, in fact a [Java class](http://terrier.org/docs/v3.6/javadoc/org/terrier/structures/Index.html) wrapped into Python. 
However, you do not need to worry about this: at this point, we will simply use the provided Index object to run procedures defined in Python." + "We will then create an index from the documents in the dataset we just loaded." ] }, { "cell_type": "code", - "execution_count": 8, + "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ - "ir-lab-wise-2024/subsampled-ms-marco-deep-learning-20241201-training documents: 38%|███▊ | 25667/68261 [00:11<00:05, 7752.61it/s]" + "ir-lab-wise-2024/subsampled-ms-marco-deep-learning-20241201-training documents: 38%|███▊ | 25893/68261 [00:13<00:11, 3756.37it/s]" ] }, { "name": "stdout", "output_type": "stream", "text": [ - "09:50:42.446 [ForkJoinPool-1-worker-3] WARN org.terrier.structures.indexing.Indexer - Adding an empty document to the index (6114613) - further warnings are suppressed\n" + "09:04:48.630 [ForkJoinPool-1-worker-3] WARN org.terrier.structures.indexing.Indexer - Adding an empty document to the index (6114613) - further warnings are suppressed\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ - "ir-lab-wise-2024/subsampled-ms-marco-deep-learning-20241201-training documents: 100%|██████████| 68261/68261 [00:15<00:00, 4312.86it/s] \n" + "ir-lab-wise-2024/subsampled-ms-marco-deep-learning-20241201-training documents: 100%|██████████| 68261/68261 [00:21<00:00, 3191.28it/s]\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ - "09:50:49.742 [ForkJoinPool-1-worker-3] WARN org.terrier.structures.indexing.Indexer - Indexed 1 empty documents\n" + "09:05:00.637 [ForkJoinPool-1-worker-3] WARN org.terrier.structures.indexing.Indexer - Indexed 1 empty documents\n" ] } ], "source": [ - "indexer = pt.IterDictIndexer(\"./index\", meta={'docno': 50, 'text': 4096}, overwrite=True)\n", + "from pyterrier import IterDictIndexer\n", + "\n", + "indexer = IterDictIndexer(\n", + " # Store the index in the `index` directory.\n", + " \"../data/index\",\n", + " 
meta={'docno': 50, 'text': 4096},\n", "    # If an index already exists there, then overwrite it.\n", "    overwrite=True,\n", ")\n", "index = indexer.index(pt_dataset.get_corpus_iter())" ] }, @@ -132,42 +223,42 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "### Step 4: Define the Retrieval Pipeline\n", "\n", - "We will define a BM25 retrieval pipeline as baseline. For details, see:\n", - "\n", - "- [https://pyterrier.readthedocs.io](https://pyterrier.readthedocs.io)\n", - "- [https://github.com/terrier-org/ecir2021tutorial](https://github.com/terrier-org/ecir2021tutorial)" + "### Step 4: Define the retrieval pipeline\n", + "\n", + "We will define a simple retrieval pipeline using just BM25 as a baseline. For details, refer to the PyTerrier [documentation](https://pyterrier.readthedocs.io) or [tutorial](https://github.com/terrier-org/ecir2021tutorial)." ] }, { "cell_type": "code", - "execution_count": 9, + "execution_count": 5, "metadata": {}, "outputs": [], "source": [ - "bm25 = pt.BatchRetrieve(index, wmodel=\"BM25\")" + "from pyterrier import BatchRetrieve\n", + "\n", + "bm25 = BatchRetrieve(index, wmodel=\"BM25\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "### Step 5: Create the Run\n" + "### Step 5: Create the run\n", + "In the next steps, we apply our retrieval system to some topics to prepare a 'run' file containing the retrieved documents."
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "First, let's have a short look at the first three topics:" ] }, { "cell_type": "code", - "execution_count": 10, + "execution_count": 6, "metadata": {}, "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "First, we have a short look at the first three topics:\n" - ] - }, { "data": { "text/html": [ @@ -220,30 +311,44 @@ "2 1043135 who killed nicholas ii of russia" ] }, - "execution_count": 10, + "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ - "print('First, we have a short look at the first three topics:')\n", - "\n", + "# The `'text'` argument below selects the topics `text` field as the query.\n", "pt_dataset.get_topics('text').head(3)" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now, retrieve results for all the topics (may take a while):" + ] + }, { "cell_type": "code", - "execution_count": 11, + "execution_count": 7, + "metadata": {}, + "outputs": [], + "source": [ + "run = bm25(pt_dataset.get_topics('text'))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "That's it for the retrieval. Here are the first 10 entries of the run:" + ] + }, + { + "cell_type": "code", + "execution_count": 8, "metadata": {}, "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Now we do the retrieval...\n", - "Done. Here are the first 10 entries of the run\n" - ] - }, { "data": { "text/html": [ @@ -382,16 +487,12 @@ "9 1030303 60002 3302257 9 17.832781 who is aziz hashim" ] }, - "execution_count": 11, + "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ - "print('Now we do the retrieval...')\n", - "run = bm25(pt_dataset.get_topics('text'))\n", - "\n", - "print('Done. 
Here are the first 10 entries of the run')\n", "run.head(10)" ] }, @@ -399,29 +500,58 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "### Step 6: Persist and Upload the Run for Subsequent Evaluations\n", + "### Step 6: Persist and upload run to TIRA\n", "\n", - "The output of a prototypical retrieval system is a run file. This run file can later (optimally in a different notebook) be statistically evaluated for which we upload it to TIRA." + "The output of our retrieval system is a run file. This run file can later (e.g., in a different notebook or by a different person) be statistically evaluated. We will therefore first upload the run to TIRA." ] }, { "cell_type": "code", - "execution_count": 12, + "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "The run file is normalized outside the TIRA sandbox, I will store it at \"../runs\".\n", - "Done. run file is stored under \"../runs/run.txt.gz\".\n", - "Run uploaded to TIRA. Claim ownership via: https://www.tira.io/claim-submission/65591c62-80ea-4700-81ec-b5c6ede07707\n" + "The run file is normalized outside the TIRA sandbox, I will store it at \"../data/runs\".\n", + "Done. run file is stored under \"../data/runs/run.txt.gz\".\n", + "Run uploaded to TIRA. Claim ownership via: https://www.tira.io/claim-submission/e27132b6-560f-4d40-8259-00f429f7b88b\n" ] } ], "source": [ - "persist_and_normalize_run(run, system_name='bm25-baseline', default_output='../runs', upload_to_tira=pt_dataset)" + "from tira.third_party_integrations import persist_and_normalize_run\n", + "\n", + "persist_and_normalize_run(\n", + "    run,\n", + "    # Give your approach a short but descriptive name tag.\n", + "    system_name='bm25-baseline', \n", + "    default_output='../data/runs',\n", + "    upload_to_tira=pt_dataset,\n", + ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Click on the link in the cell output above to claim your submission on TIRA." 
] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Step 7: Improve\n", + "\n", + "Building your own index can be already one way that you can try to improve upon this baseline (if you want to focus on creating good document representations). Other ways could include reformulating queries or tuning parameters or building better retrieval pipelines." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [] } ], "metadata": { @@ -440,7 +570,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.12.7" + "version": "3.10.12" } }, "nbformat": 4,