This is the repository of team Hooli, composed of Yiming Sun (ys3031), Minghao Li (ml4025), Yihan Lin (yl3820) and Yihao Li (yl3744).
Nucleus is a question-answering AI. Give Nucleus a passage and a related question, and it will return an answer shortly.
The core of Nucleus is BERT, a fast and accurate deep neural network that answers a question when given the question and a related context.
Nucleus currently has two modes: context-related and context-free. In context-related mode, BERT does most of the work: you provide a context and a question based on that context, and Nucleus returns the answer.
Context-free mode is more interesting. At the very beginning we only planned the context-related mode. After we finished it, however, we decided to challenge ourselves, and the context-free mode is the result.
In context-free mode, you don't need to provide a context; we find one for you. Nucleus calls several APIs, including the Wikipedia API and rake_nltk, to search for the pages most likely to contain the answer.
If you have any questions during the installation or operation of Nucleus, please feel free to open an issue.
Before you start, remember to create a Python 3.6 virtual environment; we recommend virtualenv. Then install all the packages listed in requirements.txt by running:
pip install -r requirements.txt
Before you fully launch Nucleus, you need two more things:
Configure six AWS credentials and put them in a file named config.py in the root directory:
cognito_userpool_id = <your_userpool_id>
cognito_app_client_id = <your_config_id>
database_user_name = <your_database_username>
database_endpoint = <your_database_endpoint>
port = <your_endpoint_number>
database_pwd = <your_database_password>
Download the model from https://1drv.ms/f/s!AtfKeiTxgnoqjt0M3lrLoowcsjbKcA, name the whole directory model_data, and put it in <root>/models/bert
Please note that the r_net mode is now deprecated. You can still try it if you want to, or if you only have limited computational resources.
If you cannot download the model, please contact us at [email protected]
To run the test cases, go into the ./test folder and run each of the three files to test the BERT mode and the database methods.
To launch Nucleus, simply run:
python application.py
Then open your browser and visit http://127.0.0.1:5000. Please make sure you are not running any other web app on port 5000.
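If you are unsure whether port 5000 is already taken, a quick check along the following lines may help. This is only a minimal sketch using the Python standard library; the helper name `port_in_use` is hypothetical and is not part of Nucleus.

```python
import socket


def port_in_use(host: str = "127.0.0.1", port: int = 5000, timeout: float = 0.5) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        probe.settimeout(timeout)
        # connect_ex returns 0 on success instead of raising an exception
        return probe.connect_ex((host, port)) == 0


if __name__ == "__main__":
    if port_in_use():
        print("Port 5000 is already in use; stop the other app first.")
    else:
        print("Port 5000 looks free.")
```

Run it before `python application.py`; if it reports the port as busy, stop the other application first.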
pre-commit
spacy
tensorflow
tqdm
ujson
flask
warrant
wikipedia
nltk
rake-nltk
mysql-connector-python
All the results and files required by the professor, including the pre-commit and post-commit configs, unit test reports, and bug-finder reports, are in the result folder. These files should be treated as read-only.
The basic workflow of our context-free mode is:
- a user submits a question at the frontend;
- Nucleus' backend extracts keywords from the question with the rake_nltk API;
- these keywords are sent to the Wikipedia API, which returns the pages for these keywords;
- we split these pages into a list of paragraphs, each about 700 characters long;
- we feed the list of paragraphs (as contexts) and the question into the BERT model, which returns an answer and a confidence score for each question-context pair;
- we select the answer with the best confidence, and return it to the user.
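The paragraph-splitting and answer-selection steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code used in Nucleus: the function names `split_into_paragraphs` and `best_answer` are hypothetical, the word-boundary splitting strategy is a guess at one reasonable approach, and `answer_fn` is a placeholder for the real BERT call. Keyword extraction and the Wikipedia lookup are left out so the sketch runs with the standard library alone.

```python
from typing import Callable, List, Tuple

# Hypothetical signature for the model call: (question, context) -> (answer, confidence)
AnswerFn = Callable[[str, str], Tuple[str, float]]


def split_into_paragraphs(text: str, max_chars: int = 700) -> List[str]:
    """Greedily pack words into chunks of at most max_chars characters.

    A single word longer than max_chars is kept whole in its own chunk.
    """
    chunks: List[str] = []
    current = ""
    for word in text.split():
        candidate = word if not current else current + " " + word
        if len(candidate) > max_chars and current:
            chunks.append(current)
            current = word
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def best_answer(question: str, contexts: List[str], answer_fn: AnswerFn) -> Tuple[str, float]:
    """Score every question-context pair and keep the highest-confidence answer."""
    if not contexts:
        return "", 0.0
    scored = [answer_fn(question, context) for context in contexts]
    return max(scored, key=lambda pair: pair[1])


if __name__ == "__main__":
    # Stand-in for BERT (assumption): confidence is the share of question words
    # found in the context, and the "answer" is simply the start of the context.
    def toy_answer_fn(question: str, context: str) -> Tuple[str, float]:
        q_words = set(question.lower().split())
        c_words = set(context.lower().split())
        return context[:40], len(q_words & c_words) / max(len(q_words), 1)

    page = "Nucleus answers questions. " * 100
    paragraphs = split_into_paragraphs(page)
    print(best_answer("what answers questions", paragraphs, toy_answer_fn))
```

Splitting on word boundaries keeps each context readable for the model; a real implementation might prefer sentence boundaries so that an answer is less likely to be cut in half.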
https://github.com/google-research/bert
https://github.com/HKUST-KnowComp/R-Net
https://github.com/tensorflow/tensorflow
https://github.com/pallets/flask
https://github.com/goldsmith/Wikipedia
https://github.com/capless/warrant
https://github.com/csurfer/rake-nltk