[PyTorch/LanguageModeling/BERT] BookCorpus Data Download - HTTPError: HTTP Error 403: Forbidden #536

paulhendricks · 2020-05-28T17:42:41Z

Related to Model/Framework(s)
PyTorch/LanguageModeling/BERT

Describe the bug
BookCorpus is no longer available from Smashwords: the download step fails with HTTP Error 403: Forbidden.

To Reproduce

The following works perfectly.

git clone https://github.com/NVIDIA/DeepLearningExamples.git
cd DeepLearningExamples
cd PyTorch/LanguageModeling/BERT
bash scripts/docker/build.sh
bash scripts/docker/launch.sh

However, errors start here:

bash data/create_datasets_from_start.sh

root@dgxstation:/workspace/bert# bash data/create_datasets_from_start.sh
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
Working Directory: /workspace/bert/data
Action: download
Dataset Name: bookscorpus

Directory Structure:
{ 'download': '/workspace/bert/data/download',
  'extracted': '/workspace/bert/data/extracted',
  'formatted': '/workspace/bert/data/formatted_one_article_per_line',
  'hdf5': '/workspace/bert/data/hdf5_lower_case_1_seq_len_512_max_pred_20_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5',
  'sharded': '/workspace/bert/data/sharded_training_shards_256_test_shards_256_fraction_0.2',
  'tfrecord': '/workspace/bert/data/tfrecord_lower_case_1_seq_len_512_max_pred_20_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5'}

0 files had already been saved in /workspace/bert/data/download/bookscorpus.
Failed to open https://www.smashwords.com/books/download/246580/6/latest/0/0/silence.txt
HTTPError: HTTP Error 403: Forbidden
Failed to open https://www.smashwords.com/books/download/246580/6/latest/0/0/silence.txt
HTTPError: HTTP Error 403: Forbidden
Failed to open https://www.smashwords.com/books/download/246580/6/latest/0/0/silence.txt
HTTPError: HTTP Error 403: Forbidden
 Gave up to open https://www.smashwords.com/books/download/246580/6/latest/0/0/silence.txt

Expected behavior

BookCorpus should download. This looks similar to:

It looks like https://www.smashwords.com/ has stepped up its anti-crawling measures. In fact, after attempting the download, my IP address is now blocked from their website. Users should be warned about this before we ask them to download the BookCorpus dataset, so that they do not get banned without knowing the consequences.
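For illustration, the log above shows the same URL being requested three times after the first 403 before the script gives up. A minimal sketch of a more cautious policy, assuming a plain `urllib` downloader, is to treat 403 as a signal to stop the whole run rather than retry. The names `fetch`, `download_all`, and `BlockedByServer` are hypothetical and are not taken from the repository's download script.

```python
import urllib.error
import urllib.request


class BlockedByServer(Exception):
    """Raised on HTTP 403 so the caller can stop the whole run (hypothetical helper)."""


def fetch(url, opener=urllib.request.urlopen, max_tries=3):
    """Return the response body, or None after max_tries transient failures.

    A 403 is never retried: it raises BlockedByServer on the first occurrence.
    """
    for attempt in range(1, max_tries + 1):
        try:
            with opener(url) as response:
                return response.read()
        except urllib.error.HTTPError as err:
            if err.code == 403:
                raise BlockedByServer(f"{url} returned 403 Forbidden") from err
            print(f"Attempt {attempt}/{max_tries} failed for {url}: HTTP {err.code}")
        except urllib.error.URLError as err:
            print(f"Attempt {attempt}/{max_tries} failed for {url}: {err.reason}")
    return None


def download_all(urls, opener=urllib.request.urlopen):
    """Download URLs in order and stop at the first 403 instead of sending more requests."""
    results = {}
    for url in urls:
        try:
            results[url] = fetch(url, opener=opener)
        except BlockedByServer as err:
            print(f"Stopping early: {err}")
            break
    return results
```

The `opener` parameter exists only so the policy can be exercised without touching the network. Whether stopping early actually avoids an IP block is an assumption; the thread does not say what triggers the block.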

Environment
Please provide at least:

Git commit: c76880b
Container version (e.g. pytorch:19.05-py3):

Step 1/15 : ARG FROM_IMAGE_NAME=nvcr.io/nvidia/pytorch:20.03-py3
Step 2/15 : FROM nvcr.io/nvidia/tritonserver:20.03-py3-clientsdk as trt

GPUs in the system: (e.g. 8x Tesla V100-SXM2-16GB): 4x Tesla V100, DGX Station
CUDA driver version (e.g. 418.67): 418.126.02


swethmandava · 2020-06-22T18:39:42Z

You can just ignore the bookscorpus files that are missing. They don't exist on the web anymore.

#247 #262

vilmara · 2020-07-01T02:14:45Z

Hi @swethmandava, the script run_pretraining_lamb.sh throws a bunch of errors because it still references these datasets. Is there another script for pre-training BERT with the datasets that are still available?

swethmandava · 2020-07-07T06:28:16Z

Could you open another bug with details of your errors? @vilmara

vilmara · 2020-07-14T16:31:28Z

I have found how to work with only the English Wikipedia dataset, which is still available, and ignore the BookCorpus dataset.
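The thread does not show the exact changes used. As a rough sketch, one way to decide which datasets to pass on to later steps is to check which download directories actually contain files. The helper `populated_datasets` is hypothetical, and `wikicorpus_en` is an assumed directory name for English Wikipedia; only `bookscorpus` and the `/workspace/bert/data/download` root appear in the log above.

```python
from pathlib import Path


def populated_datasets(download_root, names):
    """Return the dataset names whose download directory exists and holds at least one file."""
    found = []
    for name in names:
        dataset_dir = Path(download_root) / name
        # rglob("*") also looks inside nested subdirectories of the dataset directory.
        if dataset_dir.is_dir() and any(p.is_file() for p in dataset_dir.rglob("*")):
            found.append(name)
    return found


if __name__ == "__main__":
    # "wikicorpus_en" is an assumption; adjust to the directory names on your system.
    usable = populated_datasets("/workspace/bert/data/download", ["bookscorpus", "wikicorpus_en"])
    print("Datasets with downloaded files:", usable)
```

Any script that still lists BookCorpus inputs would need to be pointed at only the names this returns; how run_pretraining_lamb.sh takes its inputs is not covered in this thread.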

paulhendricks added the bug Something isn't working label May 28, 2020

paulhendricks changed the title ~~[PyTorch/LanguageModeling/BERT] What is the problem?~~ [PyTorch/LanguageModeling/BERT] BookCorpus Data Download - HTTPError: HTTP Error 403: Forbidden May 28, 2020

swethmandava closed this as completed Jun 22, 2020

vilmara mentioned this issue Jul 1, 2020

BERT - Link to download Wikipedia and BookCorpus datasets mlcommons/inference#643

Closed
