Related to Model/Framework(s)
PyTorch/LanguageModeling/BERT
Describe the bug
BookCorpus is no longer available from Smashwords.
To Reproduce
The following works perfectly.
git clone https://github.com/NVIDIA/DeepLearningExamples.git
cd DeepLearningExamples
cd PyTorch/LanguageModeling/BERT
bash scripts/docker/build.sh
bash scripts/docker/launch.sh
However, errors start here:
bash data/create_datasets_from_start.sh
root@dgxstation:/workspace/bert# bash data/create_datasets_from_start.sh
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data] Unzipping tokenizers/punkt.zip.
Working Directory: /workspace/bert/data
Action: download
Dataset Name: bookscorpus
Directory Structure:
{ 'download': '/workspace/bert/data/download',
'extracted': '/workspace/bert/data/extracted',
'formatted': '/workspace/bert/data/formatted_one_article_per_line',
'hdf5': '/workspace/bert/data/hdf5_lower_case_1_seq_len_512_max_pred_20_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5',
'sharded': '/workspace/bert/data/sharded_training_shards_256_test_shards_256_fraction_0.2',
'tfrecord': '/workspace/bert/data/tfrecord_lower_case_1_seq_len_512_max_pred_20_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5'}
0 files had already been saved in /workspace/bert/data/download/bookscorpus.
Failed to open https://www.smashwords.com/books/download/246580/6/latest/0/0/silence.txt
HTTPError: HTTP Error 403: Forbidden
Failed to open https://www.smashwords.com/books/download/246580/6/latest/0/0/silence.txt
HTTPError: HTTP Error 403: Forbidden
Failed to open https://www.smashwords.com/books/download/246580/6/latest/0/0/silence.txt
HTTPError: HTTP Error 403: Forbidden
Gave up to open https://www.smashwords.com/books/download/246580/6/latest/0/0/silence.txt
Expected behavior
BookCorpus should download. This looks similar to:
It looks like https://www.smashwords.com/ has stepped up its anti-crawling measures. In fact, after attempting the download, my IP address is now blocked from their website. Users should be warned about this before we ask them to download the BookCorpus dataset, lest they get banned without knowing the risk.
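For anyone who wants to avoid the ban: the log above shows the same URL being retried three times after a 403. Below is a minimal sketch of a download helper that stops on the first 403 instead of retrying. It is not the downloader used by the repository; `fetch`, `DownloadBlocked`, and the `opener`, `max_retries`, and `delay_s` parameters are hypothetical names chosen for illustration, and it uses only the Python standard library.

```python
import time
import urllib.error
import urllib.request


class DownloadBlocked(Exception):
    """Raised when the server refuses access, so the caller can stop the whole crawl."""


def fetch(url, opener=urllib.request.urlopen, max_retries=3, delay_s=1.0):
    """Return the response body, retrying only on HTTP errors that may be transient."""
    for attempt in range(1, max_retries + 1):
        try:
            with opener(url) as response:
                return response.read()
        except urllib.error.HTTPError as err:
            if err.code == 403:
                # A 403 is unlikely to clear on retry; more requests only
                # increase the risk of an IP block, so give up immediately.
                raise DownloadBlocked(f"{url} returned 403 Forbidden") from err
            if attempt == max_retries:
                raise
            # Wait a little longer after each failed attempt.
            time.sleep(delay_s * attempt)
```

A caller looping over many book URLs could catch `DownloadBlocked` once and abort the remaining downloads, rather than sending further requests to a site that is already refusing them.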
paulhendricks changed the title from "[PyTorch/LanguageModeling/BERT] What is the problem?" to "[PyTorch/LanguageModeling/BERT] BookCorpus Data Download - HTTPError: HTTP Error 403: Forbidden" on May 28, 2020
Hi @swethmandava, the script run_pretraining_lamb.sh throws a number of errors because it still references these datasets. Is there another script for pre-training BERT with the datasets that are still available?