Tools and utility scripts written as part of a submission to the Kaggle COVID-19 Open Research Dataset Challenge (CORD-19). A report, including examples of the tools in use, can be found in the task submission notebook (winner of a community contribution award) or in CORD19_submission.ipynb (the Kaggle notebook is recommended as it is more readable).

The tools use the Amazon Alexa team's Transfer and Adapt (TandA) BERT framework and a pre-trained TandA RoBERTa Large model, also provided by the team. All files in the transformers directory (adaptations of Hugging Face's popular transformers package) are copied directly from the wqa_tanda repo / original arXiv paper, where all the necessary Ts & Cs can be found. The other files are the author's own:
- `build_kaggle_summaries_df.py`: Short script used to concatenate the summary tables provided by Kaggle into one large table. Used in an (unsuccessful) attempt at semi-supervised learning and for analysis of model output (a minimal sketch follows this list).
- `cord_result_summarizer.py`: Tool to build summary table entries from individual papers. Uses the TandA RoBERTa model to identify sentences from a paper to use as the challenge and solution features, and regex / spaCy POS & NER tagging to find `study_type`, `strength_of_evidence` and `addressed_population` (the tagging technique is sketched after this list).
- `cord_search_qa_tool.py`: Tool to search the CORD-19 corpus to identify papers that answer a given research question. Utilises regex search terms to identify possible results and the TandA RoBERTa model to identify papers with sentence-level answers to the question within the abstract (a usage sketch follows this list).
- `prep_metadata.py`: A short script to clean the metadata provided with the CORD-19 dataset and add missing abstracts to papers that have available body text.
- `summarizer_helpers.py`: Helper functions that facilitate `cord_search_qa_tool.py`.
- `text_search_qa_tool.py`: Generalised search tool from which `cord_search_qa_tool.py` is built. Takes a corpus of texts and utilises regex search and TandA RoBERTa sentence-level QA to find texts that contain an answer to a given question.
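As referenced in the list above, a minimal sketch of the table concatenation performed by `build_kaggle_summaries_df.py`. The input directory and output filename are assumptions for illustration, not the script's actual configuration:

```python
# Illustrative sketch only -- the table directory and output filename
# are assumptions; see build_kaggle_summaries_df.py for the real logic.
from pathlib import Path
import pandas as pd

# Read every Kaggle-provided summary table and stack them into one frame.
tables = [pd.read_csv(p) for p in Path("target_tables").glob("**/*.csv")]
summaries_df = pd.concat(tables, ignore_index=True, sort=False)
summaries_df.to_csv("kaggle_summaries.csv", index=False)
```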
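The POS & NER tagging used by `cord_result_summarizer.py` can be illustrated with a small, self-contained spaCy example. This shows the general technique only, not the project's actual extraction code:

```python
# General technique sketch: spaCy entity tags can surface the sample
# size and location that feed a field like addressed_population.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp(
    "This retrospective cohort study followed 1204 hospitalised "
    "adults in Wuhan, China."
)

# CARDINAL entities hint at sample size; GPE entities at location.
for ent in doc.ents:
    print(ent.text, ent.label_)
```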
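And a hypothetical usage sketch of `cord_search_qa_tool.py`. The class name, constructor arguments and method names below are assumptions for illustration only; consult the module and the submission notebook for the actual API:

```python
# Hypothetical API -- the names below are assumptions, not the
# documented interface of cord_search_qa_tool.py.
from cord_search_qa_tool import CordSearchQATool  # assumed class name

# Point the tool at the cleaned CORD-19 metadata and the extracted
# TandA RoBERTa model directory (see the requirements below).
tool = CordSearchQATool(
    metadata_path="metadata.csv",     # cleaned by prep_metadata.py
    model_dir="tanda_roberta_large",  # extracted TandA checkpoint
)

# Step 1: regex search narrows the corpus to candidate papers.
candidates = tool.search(r"\b(smoking|tobacco|cigarette)\b")

# Step 2: the TandA model scores abstract sentences as answers and
# returns the papers containing high-scoring sentences.
answers = tool.return_answers(
    candidates,
    question="Does smoking increase the risk of severe COVID-19?",
)
```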
Alongside the Python packages in requirements.txt, the tools require the following to function:

- A pre-trained variant of the TandA model, available here. Extract it in the main directory of the project and refer to the directory name when initializing the respective tool. Note: the tools should function with any of the pre-trained TandA models, but have been tested using the "transferred" RoBERTa Base and Large variants only.
- The CORD-19 dataset (if you want to use `cord_search_qa_tool` and `cord_result_summarizer`), again extracted in the main directory of the project.
- The `en_core_web_sm` spaCy model. This must be downloaded separately after installing the spaCy library by running `python -m spacy download en_core_web_sm` at the CLI. A quick environment check is sketched below.
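An illustrative way to confirm these prerequisites are in place before initializing a tool (the model directory name is an assumption; use whatever directory you extracted the model to):

```python
# Illustrative environment check -- adjust MODEL_DIR to match the
# directory you extracted the TandA model into.
from pathlib import Path
import spacy

MODEL_DIR = Path("tanda_roberta_large")  # extracted in the project root
assert MODEL_DIR.is_dir(), f"TandA model not found at {MODEL_DIR}"

spacy.load("en_core_web_sm")  # raises OSError if the model is missing
print("Environment looks ready.")
```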