This repo contains the code used by team 肉蛋葱鸡
in the intermediate round of the 2020 iFLYTEK text-mining competition.
Results of (Chinese, Italian) parallel text extraction in the intermediate round:
Rank | Team | Score |
---|---|---|
1 | 肉蛋葱鸡 | 7562.11126 |
2 | ====baseline==== | 6694.19942 |
3 | HNwaz8j8x | 2394.87172 |
Under the competition restriction that no additional data and no translation model/API may be used, we treated this task as an unsupervised pair-extraction problem. The crucial part is projecting text in multiple languages into a shared semantic space, so that nearest neighbors in that space can be formed into parallel text pairs.
The main flow has three steps. First, extract candidate text from the raw HTML. Second, convert each text to its vector representation in semantic space. Finally, use each Chinese text as a query and search for its closest neighbor in the Italian dataset in semantic space, then use the search results to form (Chinese, Italian) parallel text pairs. All pairs are sorted by their distance in ascending order.
For candidate text extraction, hand-crafted rules were used to select candidates from the raw HTML.
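The exact rules live in `run_html_parser.py`; the snippet below is only a minimal sketch of the idea, and the tag list, length threshold, and language check are illustrative assumptions rather than the rules used in the competition.

```python
# Minimal sketch of rule-based candidate extraction (illustrative only;
# the real rules are in run_html_parser.py).
from bs4 import BeautifulSoup
from langdetect import detect
from langdetect.lang_detect_exception import LangDetectException


def extract_candidates(html, target_lang):
    """Collect visible text blocks that look like target_lang sentences.

    target_lang is a langdetect code, e.g. "it" or "zh-cn".
    """
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style", "noscript"]):
        tag.decompose()  # drop non-content tags before walking the tree
    candidates = []
    for node in soup.find_all(["p", "li", "td"]):
        text = node.get_text(" ", strip=True)
        if len(text) < 10:  # skip menus, buttons, and other fragments
            continue
        try:
            if detect(text) == target_lang:
                candidates.append(text)
        except LangDetectException:
            continue  # too little signal to detect a language
    return candidates
```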
For text vector representation, a BERT-based model was used to project text into vectors. We tried xlm-roberta-base, m-USE, LaBSE, and others; LaBSE was the most suitable model in our experiments.
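For reference, below is a minimal sketch of encoding sentences with the LaBSE v1 SavedModel through TensorFlow Hub, following the model's published usage pattern. The project's actual pipeline lives in `run_embedding.py`; the model path and maximum sequence length here are assumptions.

```python
# Sketch of LaBSE v1 encoding via TensorFlow Hub; the model path and
# MAX_SEQ_LEN are assumptions, the real pipeline is run_embedding.py.
import tensorflow as tf
import tensorflow_hub as hub
from bert import bert_tokenization  # provided by the bert-for-tf2 package

MAX_SEQ_LEN = 64
labse = hub.KerasLayer("./LaBSE", trainable=False)  # extracted SavedModel

# The SavedModel ships its own vocab; build a matching WordPiece tokenizer.
vocab_file = labse.resolved_object.vocab_file.asset_path.numpy()
do_lower_case = labse.resolved_object.do_lower_case.numpy()
tokenizer = bert_tokenization.FullTokenizer(vocab_file, do_lower_case)


def encode(sentences):
    """Return L2-normalized sentence embeddings, one row per sentence."""
    ids, masks, segments = [], [], []
    for s in sentences:
        tokens = ["[CLS]"] + tokenizer.tokenize(s)[:MAX_SEQ_LEN - 2] + ["[SEP]"]
        token_ids = tokenizer.convert_tokens_to_ids(tokens)
        pad = MAX_SEQ_LEN - len(token_ids)
        ids.append(token_ids + [0] * pad)
        masks.append([1] * len(token_ids) + [0] * pad)
        segments.append([0] * MAX_SEQ_LEN)
    pooled, _ = labse([tf.constant(ids), tf.constant(masks), tf.constant(segments)])
    return tf.nn.l2_normalize(pooled, axis=1).numpy()


print(encode(["ciao mondo", "你好，世界"]).shape)  # (2, 768)
```

Normalizing the embeddings lets the later dot-product search behave as cosine-similarity search.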
For nearest-neighbor search, ScaNN was used to accelerate the process, as there are millions of Chinese and Italian text candidates.
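As a sketch of this step, the snippet below builds a ScaNN 1.0 index over the Italian embeddings and queries it with the Chinese ones. Here `zh_emb` and `it_emb` are hypothetical names for the L2-normalized embedding matrices from the previous step (so dot product equals cosine similarity), and the tree/quantization parameters are illustrative, not the settings from `utl_scann.py`.

```python
# Sketch of the nearest-neighbor step with ScaNN 1.0. zh_emb and it_emb
# are hypothetical names for the L2-normalized embedding matrices; the
# tree and quantization parameters are illustrative only.
import scann

# Index the Italian side, query with the Chinese side (1 neighbor each).
searcher = (
    scann.ScannBuilder(it_emb, 1, "dot_product")
    .tree(num_leaves=2000, num_leaves_to_search=100, training_sample_size=250000)
    .score_ah(2, anisotropic_quantization_threshold=0.2)
    .reorder(100)
    .create_pybind()
)

neighbors, similarities = searcher.search_batched(zh_emb)

# Form (zh, it) index pairs; descending similarity == ascending distance.
pairs = sorted(
    zip(range(len(zh_emb)), neighbors[:, 0], similarities[:, 0]),
    key=lambda p: -p[2],
)
```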
```
.                   (root path)
├── src             (code)
├── LaBSE           (embedding model)
├── mUSE            (embedding model)
├── text_json       (cached results: JSON of string paths in the raw HTML)
├── features_hdf5   (cached results: string features)
├── it              (Italian dataset)
└── zh              (Chinese dataset)
```
The code has been tested on Ubuntu 18.04 with a single GPU. For the CPU version and other information, see ScaNN.
- Get the data from the link and put it into `it` and `zh` respectively.
- Create a conda environment for the HTML parser and the embedding model.

  ```bash
  conda create -n tf2 python=3.6
  conda activate tf2
  ```
- Install the packages required by the HTML pre-processing.

  ```bash
  pip install beautifulsoup4 langdetect tqdm nltk jieba numpy==1.18.5
  ```
- Install TensorFlow Hub and related packages for the sentence embedding model.

  ```bash
  conda install -c anaconda cudatoolkit
  conda install cudnn
  pip install tensorflow_hub bert-for-tf2 tensorflow_text
  ```
- Prepare the embedding model. Below is an example using LaBSE. Download the sentence embedding model, then point `MODEL_PATH` in `./src/config.py` to `LaBSE_PATH`.

  ```bash
  wget -c https://storage.googleapis.com/tfhub-modules/google/LaBSE/1.tar.gz
  tar -xzvf 1.tar.gz -C LaBSE_PATH
  ```
- Create a new environment for ScaNN, then download and install it.

  ```bash
  conda deactivate
  conda create -n scann python=3.6
  conda activate scann
  wget https://storage.googleapis.com/scann/releases/1.0.0/scann-1.0.0-cp36-cp36m-linux_x86_64.whl
  pip install scann-1.0.0-cp36-cp36m-linux_x86_64.whl
  ```
The main steps are divided into three files for independent testing and adjustment; merging them into a single file is straightforward if you prefer.
Note that the default settings are the ones used in the competition. If running the code takes too long on your machine, you can adjust the experiment settings in `config.py` and the search settings in `utl_scann.py` respectively.
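For orientation only, here is a hypothetical sketch of what `./src/config.py` might look like; `MODEL_PATH` and `EXP_NAME` are the only names referenced elsewhere in this README, and the values shown are assumptions.

```python
# Hypothetical sketch of ./src/config.py; the values are assumptions.
MODEL_PATH = "../LaBSE"  # where the downloaded LaBSE SavedModel was extracted
EXP_NAME = "labse_v1"    # tag embedded in the cached .json/.hdf5 filenames
```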
```bash
cd src
conda activate tf2
python run_html_parser.py
```

This will generate `zh_{EXP_NAME}.json` and `it_{EXP_NAME}.json` in `./text_json`.
```bash
python run_embedding.py
```

This will generate `zh_{EXP_NAME}.hdf5` and `it_{EXP_NAME}.hdf5` in `./features_hdf5`.
```bash
conda activate scann
python run_candidates_search.py
```