Add experimental preprocessing scripts for semantic scholar dumps
s-jse authored Sep 5, 2024
1 parent 7bc0c8d commit 45fdf2d
Showing 19 changed files with 524 additions and 120 deletions.
16 changes: 13 additions & 3 deletions README.md
@@ -208,6 +208,11 @@ Note that this index contains ~180M vector embeddings and therefore requires at
inv start-retriever --embedding-model BAAI/bge-m3 --retriever-port <port number>
```

3. Start WikiChat by passing in the URL of this retriever. For example:
```bash
inv demo --retriever-endpoint "http://0.0.0.0:<port number>/search"
```

Note that this server and its embedding model run on CPU and do not require a GPU. For better performance on compatible systems, you can add `--use-onnx` to use the ONNX version of the embedding model, which significantly lowers embedding latency.

### Option 3: Build your own index
@@ -228,7 +233,7 @@ inv index-wikipedia-dump --embedding-model BAAI/bge-m3 --workdir ./workdir --la
`block_type` and `language` are only used to filter search results. If you do not need them, you can simply set them to `block_type=text` and `language=en`.
The script will feed `full_section_title` and `content_string` to the embedding model to create embedding vectors.

See `wikipedia_preprocessing/preprocess_html_dump.py` for details on how this is implemented for Wikipedia HTML dumps.
See `preprocessing/preprocess_wikipedia_html_dump.py` for details on how this is implemented for Wikipedia HTML dumps.
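
For illustration only, here is a sketch of what a single collection entry with these fields might look like, written from Python as one JSON Lines record. The field names mirror `Block.to_json()` in `preprocessing/block.py`; the values and the JSONL file name are invented for this example:

```python
import json

# A hypothetical collection entry; the values below are invented for illustration.
entry = {
    "id": 0,
    "content_string": "GPT-4 is a large multimodal model released by OpenAI in 2023.",
    "article_title": "GPT-4",
    "full_section_title": "GPT-4 > Overview",
    "block_type": "text",
    "language": "en",
    "last_edit_date": "2024-09-01",
    "num_tokens": 0,
}

# Append the entry as one line of a JSON Lines collection file.
with open("collection.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(entry, ensure_ascii=False) + "\n")
```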

1. Run the indexing command:

@@ -250,6 +255,11 @@ inv start-retriever --embedding-model BAAI/bge-m3 --retriever-port <port number>
curl -X POST 0.0.0.0:5100/search -H "Content-Type: application/json" -d '{"query": ["What is GPT-4?", "What is LLaMA-3?"], "num_blocks": 3}'
```

5. Start WikiChat by passing in the URL of this retriever. For example:
```bash
inv demo --retriever-endpoint "http://0.0.0.0:<port number>/search"
```
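
For reference, the same request from step 4's `curl` example can also be issued from Python. This is only a sketch, assuming the `requests` package is installed and the retriever is listening on port 5100:

```python
import requests

# Query the retriever's /search endpoint; mirrors the curl example in step 4.
# Replace 5100 with the port number passed to `inv start-retriever`.
response = requests.post(
    "http://0.0.0.0:5100/search",
    json={"query": ["What is GPT-4?", "What is LLaMA-3?"], "num_blocks": 3},
    timeout=30,
)
response.raise_for_status()
print(response.json())
```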


#### To upload a Qdrant index to 🤗 Hub:
1. Split the index into smaller parts:
@@ -258,8 +268,8 @@ tar -cvf - <path to the Qdrant index folder> | pigz -p 14 | split --bytes=10GB -
```

2. Upload the resulting parts:
```
retrieval/upload_folder_to_hf_hub.py --folder_path <path to the output folder> --repo_id <Repo ID on 🤗 Hub>
```bash
python retrieval/upload_folder_to_hf_hub.py --folder_path <path to the output folder> --repo_id <Repo ID on 🤗 Hub>
```


1 change: 0 additions & 1 deletion docs/search_api.md
@@ -18,7 +18,6 @@ The request body should be a JSON object with the following fields:
- `query`: A string or a list of strings representing the search queries.
- `num_blocks`: An integer representing the number of items to retrieve.
- `languages`: (Optional) A string or a list of strings representing the language codes to filter the search results.
- `block_types`: (Optional) A string or a list of strings representing the block types to filter the search results.

### Example
Search for the 3 most relevant text, table, or infobox blocks in any of the 10 Wikipedia languages.
4 changes: 2 additions & 2 deletions pipelines/retriever.py
@@ -73,7 +73,7 @@ def retrieval_results_to_list(results: dict):
results["score"],
results["last_edit_date"],
):
if block_type not in ["text", "table", "infobox", "list"]:
if block_type not in ["text", "table", "infobox"]:
logger.warning(
f"Found invalid block type {str(block_type)} for {passage}."
)
@@ -266,7 +266,7 @@ def _try_to_enforce_block_type_limits(
results: list[RetrievalResult], block_type_limits: dict, target_num: int
) -> list[RetrievalResult]:
block_type_limits_copy = {}
for k in ["table", "text", "list", "infobox"]:
for k in ["table", "text", "infobox"]:
if k not in block_type_limits:
block_type_limits_copy[k] = 1e6 # Infinity
else:
51 changes: 51 additions & 0 deletions preprocessing/block.py
@@ -0,0 +1,51 @@
from preprocessing.utils import (
    extract_english_translations,
    replace_except_first,
)


class Block:
    """
    A paragraph, list, linearized table, or linearized Infobox
    """

    content_string: str
    article_title: str
    full_section_title: str
    block_type: str
    language: str
    last_edit_date: str
    num_tokens: int

    def __init__(
        self,
        content_string: str,
        full_section_title: str,
        block_type: str,
        article_title: str = None,
        language: str = None,
        last_edit_date: str = None,
    ):
        self.content_string = content_string.strip()
        self.article_title = article_title
        self.full_section_title = full_section_title
        self.block_type = block_type
        self.language = language
        self.last_edit_date = last_edit_date
        self.num_tokens = 0

    def to_json(self, _id: int):
        ret = self.__dict__
        ret["id"] = _id
        return ret

    def deduplicate_translations(self) -> None:
        """
        Removes duplicate "(in English: ...)" translations from the block, keeping only the first occurrence
        """
        string = self.full_section_title + " | " + self.content_string
        translation_parenthesis = set(extract_english_translations(string))
        for t in translation_parenthesis:
            string = replace_except_first(string, " " + t, "")

        self.full_section_title, self.content_string = tuple(string.split(" | ", 1))
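
A minimal usage sketch of this class (assuming the repository root is on `PYTHONPATH` so that `preprocessing.block` and `preprocessing.utils` are importable; the article text is invented for illustration):

```python
from preprocessing.block import Block

# Invented example text that repeats the same English translation twice.
block = Block(
    content_string=(
        "Tokio (in English: Tokyo) ist die Hauptstadt Japans. "
        "Tokio (in English: Tokyo) hat etwa 14 Millionen Einwohner."
    ),
    full_section_title="Tokio > Geographie",
    block_type="text",
    article_title="Tokio",
    language="de",
    last_edit_date="2024-09-01",
)

# Keep only the first "(in English: ...)" translation in the block.
block.deduplicate_translations()

# Serialize to a dict suitable for a collection file; the id is assigned here.
print(block.to_json(_id=0))
```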
File renamed without changes.
@@ -26,5 +26,5 @@
word_list.append(w)

print("Extracted %d words from %s" % (len(word_list), word_frequency_page))
with open("./wikipedia_preprocessing/word_list.txt", "w") as f:
with open("./preprocessing/word_list.txt", "w") as f:
f.write("\n".join(word_list))