
IndexError when I try to continue training #1102

Open
chrkell opened this issue Oct 26, 2023 · 1 comment
chrkell commented Oct 26, 2023

The training process was interrupted the first time, and when I tried to continue training with the same parameters, I got an IndexError.

NOTE: Redirects are currently not supported in Windows or MacOs. [INFO:sockeye.utils] Sockeye: 3.1.34, commit 4c30942ddb523533bccb4d2cbb3e894e45b1db93, path /Users/christyman/miniconda3/envs/ats-program-37/lib/python3.7/site-packages/sockeye/__init__.py [INFO:sockeye.utils] PyTorch: 1.13.1 (/Users/christyman/miniconda3/envs/ats-program-37/lib/python3.7/site-packages/torch/__init__.py) [INFO:sockeye.utils] Command: /Users/christyman/miniconda3/envs/ats-program-37/lib/python3.7/site-packages/sockeye/train.py --source ../sentence_parallel_files/src_train.txt --target ../sentence_parallel_files/tgt_train.txt --validation-source ../sentence_parallel_files/src_validation.txt --validation-target ../sentence_parallel_files/tgt_validation.txt --output ../small_model --shared-vocab --num-words 20000 --num-layers 3:3 --transformer-model-size 128 --transformer-attention-heads 4:4 --transformer-feed-forward-num-hidden 512 --embed-dropout 0.3 --label-smoothing 0.3 --optimized-metric bleu --checkpoint-interval 10 --max-samples 10000000 [INFO:sockeye.utils] Arguments: Namespace(allow_missing_params=False, amp=False, apex_amp=False, batch_sentences_multiple_of=8, batch_size=4096, batch_type='word', bow_task_pos_weight=10, bow_task_weight=1.0, bucket_scaling=False, bucket_width=8, cache_last_best_params=0, cache_metric='perplexity', cache_strategy='best', checkpoint_improvement_threshold=0.0, checkpoint_interval=10, clamp_to_dtype=False, config=None, decode_and_evaluate=500, decoder='transformer', deepspeed_bf16=False, deepspeed_fp16=False, device_id=0, dist=False, dry_run=False, dtype='float32', embed_dropout=(0.3, 0.3), encoder='transformer', end_of_prepending_tag=None, env=None, fixed_param_names=[], fixed_param_strategy=None, gradient_clipping_threshold=1.0, gradient_clipping_type='none', ignore_extra_params=False, initial_learning_rate=0.0002, keep_initializations=False, keep_last_params=-1, label_smoothing=0.3, label_smoothing_impl='mxnet', learning_rate_reduce_factor=0.9, 
learning_rate_reduce_num_not_improved=8, learning_rate_scheduler_type='plateau-reduce', learning_rate_warmup=0, length_task=None, length_task_layers=1, length_task_weight=1.0, lhuc=None, local_rank=None, loglevel='INFO', loglevel_secondary_workers='INFO', max_checkpoints=None, max_num_checkpoint_not_improved=None, max_num_epochs=None, max_samples=10000000, max_seconds=None, max_seq_len=(95, 95), max_updates=None, min_num_epochs=None, min_samples=None, min_updates=None, momentum=0.0, neural_vocab_selection=None, neural_vocab_selection_block_loss=False, no_bucketing=False, no_logfile=False, no_reload_on_learning_rate_reduce=False, num_embed=(None, None), num_layers=(3, 3), num_words=(20000, 20000), optimized_metric='bleu', optimizer='adam', optimizer_betas=(0.9, 0.999), optimizer_eps=1e-08, output='../small_model', overwrite_output=False, pad_vocab_to_multiple_of=8, params=None, prepared_data=None, quiet=False, quiet_secondary_workers=False, seed=1, shared_vocab=True, source='../sentence_parallel_files/src_train.txt', source_factor_vocabs=[], source_factors=[], source_factors_combine=[], source_factors_num_embed=[], source_factors_share_embedding=[], source_factors_use_source_vocab=[], source_vocab=None, stop_training_on_decoder_failure=False, target='../sentence_parallel_files/tgt_train.txt', target_factor_vocabs=[], target_factors=[], target_factors_combine=[], target_factors_num_embed=[], target_factors_share_embedding=[], target_factors_use_target_vocab=[], target_factors_weight=[1.0], target_vocab=None, tf32=True, transformer_activation_type=('relu', 'relu'), transformer_attention_heads=(4, 4), transformer_block_prepended_cross_attention=False, transformer_dropout_act=(0.1, 0.1), transformer_dropout_attention=(0.1, 0.1), transformer_dropout_prepost=(0.1, 0.1), transformer_feed_forward_num_hidden=(512, 512), transformer_feed_forward_use_glu=False, transformer_model_size=(128, 128), transformer_positional_embedding_type='fixed', transformer_postprocess=('dr', 
'dr'), transformer_preprocess=('n', 'n'), update_interval=1, use_cpu=False, validation_source='../sentence_parallel_files/src_validation.txt', validation_source_factors=[], validation_target='../sentence_parallel_files/tgt_validation.txt', validation_target_factors=[], weight_decay=0.0, weight_tying_type='src_trg_softmax', word_min_count=(1, 1)) [INFO:__main__] Adjusting maximum length to reserve space for a BOS/EOS marker. New maximum length: (96, 96) [INFO:sockeye.utils] CUDA not available, defaulting to CPU device [INFO:__main__] Training Device: cpu [INFO:sockeye.utils] Random seed: 1 [INFO:sockeye.utils] PyTorch seed: 1 [INFO:sockeye.vocab] Vocabulary (20008 words) loaded from "/Users/christyman/Documents/Studium/Bachelorarbeit/ats_program/small_model/vocab.src.0.json" [INFO:sockeye.vocab] Vocabulary (20008 words) loaded from "/Users/christyman/Documents/Studium/Bachelorarbeit/ats_program/small_model/vocab.trg.0.json" [INFO:sockeye.data_io] =============================== [INFO:sockeye.data_io] Creating training data iterator [INFO:sockeye.data_io] =============================== [INFO:sockeye.data_io] 17000 sequences of maximum length (96, 96) in '/Users/christyman/Documents/Studium/Bachelorarbeit/ats_program/sentence_parallel_files/src_train.txt' and '/Users/christyman/Documents/Studium/Bachelorarbeit/ats_program/sentence_parallel_files/tgt_train.txt'. 
[INFO:sockeye.data_io] Mean training target/source length ratio: 0.72 (+-0.58) [INFO:sockeye.data_io] Tokens: source 331178 target 184162 [INFO:sockeye.data_io] Number of <unk> tokens: source 24133 target 8003 [INFO:sockeye.data_io] Vocabulary coverage: source 93% target 96% [INFO:sockeye.data_io] 17000 sequences across 12 buckets [INFO:sockeye.data_io] 1 sequences did not fit into buckets and were discarded [INFO:sockeye.data_io] Bucket (8, 8): 946 samples in 2 batches of 664, ~4072.4 target tokens/batch, trg/src length ratio: 1.37 (+-0.86) [INFO:sockeye.data_io] Bucket (16, 16): 6029 samples in 15 batches of 416, ~4097.1 target tokens/batch, trg/src length ratio: 0.94 (+-0.67) [INFO:sockeye.data_io] Bucket (24, 24): 5549 samples in 16 batches of 360, ~4099.9 target tokens/batch, trg/src length ratio: 0.60 (+-0.38) [INFO:sockeye.data_io] Bucket (32, 32): 2774 samples in 9 batches of 336, ~4109.9 target tokens/batch, trg/src length ratio: 0.45 (+-0.24) [INFO:sockeye.data_io] Bucket (40, 40): 1159 samples in 4 batches of 328, ~4134.1 target tokens/batch, trg/src length ratio: 0.36 (+-0.21) [INFO:sockeye.data_io] Bucket (48, 48): 406 samples in 2 batches of 312, ~4135.2 target tokens/batch, trg/src length ratio: 0.33 (+-0.33) [INFO:sockeye.data_io] Bucket (56, 56): 80 samples in 1 batches of 304, ~4062.2 target tokens/batch, trg/src length ratio: 0.28 (+-0.28) [INFO:sockeye.data_io] Bucket (64, 64): 32 samples in 1 batches of 296, ~4060.8 target tokens/batch, trg/src length ratio: 0.57 (+-1.45) [INFO:sockeye.data_io] Bucket (72, 72): 5 samples in 1 batches of 216, ~4060.8 target tokens/batch, trg/src length ratio: 0.27 (+-0.14) [INFO:sockeye.data_io] Bucket (80, 80): 3 samples in 1 batches of 112, ~3994.7 target tokens/batch, trg/src length ratio: 2.34 (+-3.06) [INFO:sockeye.data_io] Bucket (88, 88): 17 samples in 1 batches of 440, ~4089.4 target tokens/batch, trg/src length ratio: 0.11 (+-0.09) [INFO:sockeye.data_io] Created bucketed parallel data set. 
Introduced padding: source=17.3% target=54.0%) [INFO:sockeye.data_io] ================================= [INFO:sockeye.data_io] Creating validation data iterator [INFO:sockeye.data_io] ================================= [INFO:sockeye.data_io] 2125 sequences of maximum length (96, 96) in '/Users/christyman/Documents/Studium/Bachelorarbeit/ats_program/sentence_parallel_files/src_validation.txt' and '/Users/christyman/Documents/Studium/Bachelorarbeit/ats_program/sentence_parallel_files/tgt_validation.txt'. [INFO:sockeye.data_io] Mean training target/source length ratio: 0.73 (+-0.63) [INFO:sockeye.data_io] Tokens: source 40814 target 23208 [INFO:sockeye.data_io] Number of <unk> tokens: source 3513 target 1170 [INFO:sockeye.data_io] Vocabulary coverage: source 91% target 95% [INFO:sockeye.data_io] 2125 sequences across 12 buckets [INFO:sockeye.data_io] 0 sequences did not fit into buckets and were discarded [INFO:sockeye.data_io] Bucket (8, 8): 126 samples in 1 batches of 664, ~4072.4 target tokens/batch, trg/src length ratio: 1.34 (+-0.88) [INFO:sockeye.data_io] Bucket (16, 16): 755 samples in 2 batches of 416, ~4097.1 target tokens/batch, trg/src length ratio: 0.95 (+-0.67) [INFO:sockeye.data_io] Bucket (24, 24): 708 samples in 2 batches of 360, ~4099.9 target tokens/batch, trg/src length ratio: 0.63 (+-0.57) [INFO:sockeye.data_io] Bucket (32, 32): 348 samples in 2 batches of 336, ~4109.9 target tokens/batch, trg/src length ratio: 0.44 (+-0.17) [INFO:sockeye.data_io] Bucket (40, 40): 137 samples in 1 batches of 328, ~4134.1 target tokens/batch, trg/src length ratio: 0.37 (+-0.17) [INFO:sockeye.data_io] Bucket (48, 48): 40 samples in 1 batches of 312, ~4135.2 target tokens/batch, trg/src length ratio: 0.30 (+-0.16) [INFO:sockeye.data_io] Bucket (56, 56): 7 samples in 1 batches of 304, ~4062.2 target tokens/batch, trg/src length ratio: 0.29 (+-0.09) [INFO:sockeye.data_io] Bucket (64, 64): 1 samples in 1 batches of 296, ~4060.8 target tokens/batch, trg/src length ratio: 
0.15 (+-0.00) [INFO:sockeye.data_io] Bucket (80, 80): 2 samples in 1 batches of 112, ~3994.7 target tokens/batch, trg/src length ratio: 0.78 (+-0.62) [INFO:sockeye.data_io] Bucket (88, 88): 1 samples in 1 batches of 440, ~4089.4 target tokens/batch, trg/src length ratio: 0.02 (+-0.00) [INFO:sockeye.data_io] Created bucketed parallel data set. Introduced padding: source=17.2% target=52.9%) [INFO:__main__] Writing data config to '/Users/christyman/Documents/Studium/Bachelorarbeit/ats_program/small_model/data.info' [INFO:__main__] Vocabulary sizes: source=[20008] target=[20008] [INFO:__main__] Source embedding size was not set it will automatically be adjusted to match the Transformer source model size (128). [INFO:__main__] Target embedding size was not set it will automatically be adjusted to match the Transformer target model size (128). [INFO:__main__] OptimizerConfig(name='adam', running_on_gpu=True, lr=0.0002, betas=(0.9, 0.999), eps=1e-08, weight_decay=0.0, momentum=0.0, gradient_clipping_type='none', gradient_clipping_threshold=1.0, update_interval=1) [INFO:__main__] Gradient accumulation over 1 batch(es) by 1 worker(s). 
Effective batch size: 4096 [INFO:sockeye.model] ModelConfig(config_data=DataConfig(data_statistics=DataStatistics(num_sents=17000, num_discarded=1, num_tokens_source=331178, num_tokens_target=184162, num_unks_source=24133, num_unks_target=8003, max_observed_len_source=83, max_observed_len_target=80, size_vocab_source=20008, size_vocab_target=20008, length_ratio_mean=0.7176150999425713, length_ratio_std=0.5843208885732439, buckets=[(8, 8), (16, 16), (24, 24), (32, 32), (40, 40), (48, 48), (56, 56), (64, 64), (72, 72), (80, 80), (88, 88), (96, 96)], num_sents_per_bucket=[946, 6029, 5549, 2774, 1159, 406, 80, 32, 5, 3, 17, 0], average_len_target_per_bucket=[6.133192389006345, 9.848731132857871, 11.388538475400976, 12.231795241528477, 12.603968938740291, 13.253694581280785, 13.3625, 13.71875, 18.8, 35.666666666666664, 9.294117647058822, None], length_ratio_stats_per_bucket=[(1.3741555924695448, 0.8553818317165446), (0.9445814717708868, 0.6699187554363534), (0.6036099868076464, 0.383997103331106), (0.45166851521868245, 0.2433969825782521), (0.36098684385814017, 0.20626066538468174), (0.3268138731615176, 0.3348890488643918), (0.2815973408227163, 0.2824537164805833), (0.5745533030171174, 1.449619159814051), (0.27178585119143306, 0.13837005297982366), (2.3391053391053394, 3.0600525538493457), (0.1119773210489015, 0.08527019839940486), (None, None)]), max_seq_len_source=96, max_seq_len_target=96, num_source_factors=1, num_target_factors=1, eop_id=-1), vocab_source_size=20008, vocab_target_size=20008, config_embed_source=EmbeddingConfig(vocab_size=20008, num_embed=128, dropout=0.3, num_factors=1, factor_configs=None, allow_sparse_grad=False), config_embed_target=EmbeddingConfig(vocab_size=20008, num_embed=128, dropout=0.3, num_factors=1, factor_configs=None, allow_sparse_grad=False), config_encoder=TransformerConfig(model_size=128, attention_heads=4, feed_forward_num_hidden=512, act_type='relu', num_layers=3, dropout_attention=0.1, dropout_act=0.1, dropout_prepost=0.1, 
positional_embedding_type='fixed', preprocess_sequence='n', postprocess_sequence='dr', max_seq_len_source=96, max_seq_len_target=96, decoder_type='transformer', block_prepended_cross_attention=False, use_lhuc=False, depth_key_value=128, use_glu=False), config_decoder=TransformerConfig(model_size=128, attention_heads=4, feed_forward_num_hidden=512, act_type='relu', num_layers=3, dropout_attention=0.1, dropout_act=0.1, dropout_prepost=0.1, positional_embedding_type='fixed', preprocess_sequence='n', postprocess_sequence='dr', max_seq_len_source=96, max_seq_len_target=96, decoder_type='transformer', block_prepended_cross_attention=False, use_lhuc=False, depth_key_value=128, use_glu=False), config_length_task=None, weight_tying_type='src_trg_softmax', lhuc=False, dtype='float32', neural_vocab_selection=None, neural_vocab_selection_block_loss=False) [INFO:sockeye.utils] # of parameters: 3990056 | trainable: 3965480 (99.38%) | shared: 2561024 (64.19%) | fixed: 24576 (0.62%) [INFO:sockeye.utils] Trainable parameters: ['embedding_source.embedding [(20008, 128), float32]', 'embedding_target.embedding [(20008, 128), float32]', 'encoder.layers.0.pre_self_attention.layer_norm [(128,), float32]', 'encoder.layers.0.pre_self_attention.layer_norm [(128,), float32]', 'encoder.layers.0.self_attention.ff_out [(128, 128), float32]', 'encoder.layers.0.self_attention.ff_in [(384, 128), float32]', 'encoder.layers.0.pre_ff.layer_norm [(128,), float32]', 'encoder.layers.0.pre_ff.layer_norm [(128,), float32]', 'encoder.layers.0.ff.ff1 [(512, 128), float32]', 'encoder.layers.0.ff.ff1 [(512,), float32]', 'encoder.layers.0.ff.ff2 [(128, 512), float32]', 'encoder.layers.0.ff.ff2 [(128,), float32]', 'encoder.layers.1.pre_self_attention.layer_norm [(128,), float32]', 'encoder.layers.1.pre_self_attention.layer_norm [(128,), float32]', 'encoder.layers.1.self_attention.ff_out [(128, 128), float32]', 'encoder.layers.1.self_attention.ff_in [(384, 128), float32]', 'encoder.layers.1.pre_ff.layer_norm 
[(128,), float32]', 'encoder.layers.1.pre_ff.layer_norm [(128,), float32]', 'encoder.layers.1.ff.ff1 [(512, 128), float32]', 'encoder.layers.1.ff.ff1 [(512,), float32]', 'encoder.layers.1.ff.ff2 [(128, 512), float32]', 'encoder.layers.1.ff.ff2 [(128,), float32]', 'encoder.layers.2.pre_self_attention.layer_norm [(128,), float32]', 'encoder.layers.2.pre_self_attention.layer_norm [(128,), float32]', 'encoder.layers.2.self_attention.ff_out [(128, 128), float32]', 'encoder.layers.2.self_attention.ff_in [(384, 128), float32]', 'encoder.layers.2.pre_ff.layer_norm [(128,), float32]', 'encoder.layers.2.pre_ff.layer_norm [(128,), float32]', 'encoder.layers.2.ff.ff1 [(512, 128), float32]', 'encoder.layers.2.ff.ff1 [(512,), float32]', 'encoder.layers.2.ff.ff2 [(128, 512), float32]', 'encoder.layers.2.ff.ff2 [(128,), float32]', 'encoder.final_process.layer_norm [(128,), float32]', 'encoder.final_process.layer_norm [(128,), float32]', 'decoder.layers.0.autoregr_layer.ff_out [(128, 128), float32]', 'decoder.layers.0.autoregr_layer.ff_in [(384, 128), float32]', 'decoder.layers.0.pre_autoregr_layer.layer_norm [(128,), float32]', 'decoder.layers.0.pre_autoregr_layer.layer_norm [(128,), float32]', 'decoder.layers.0.pre_enc_attention.layer_norm [(128,), float32]', 'decoder.layers.0.pre_enc_attention.layer_norm [(128,), float32]', 'decoder.layers.0.enc_attention.ff_out [(128, 128), float32]', 'decoder.layers.0.enc_attention.ff_q [(128, 128), float32]', 'decoder.layers.0.enc_attention.ff_kv [(256, 128), float32]', 'decoder.layers.0.pre_ff.layer_norm [(128,), float32]', 'decoder.layers.0.pre_ff.layer_norm [(128,), float32]', 'decoder.layers.0.ff.ff1 [(512, 128), float32]', 'decoder.layers.0.ff.ff1 [(512,), float32]', 'decoder.layers.0.ff.ff2 [(128, 512), float32]', 'decoder.layers.0.ff.ff2 [(128,), float32]', 'decoder.layers.1.autoregr_layer.ff_out [(128, 128), float32]', 'decoder.layers.1.autoregr_layer.ff_in [(384, 128), float32]', 'decoder.layers.1.pre_autoregr_layer.layer_norm 
[(128,), float32]', 'decoder.layers.1.pre_autoregr_layer.layer_norm [(128,), float32]', 'decoder.layers.1.pre_enc_attention.layer_norm [(128,), float32]', 'decoder.layers.1.pre_enc_attention.layer_norm [(128,), float32]', 'decoder.layers.1.enc_attention.ff_out [(128, 128), float32]', 'decoder.layers.1.enc_attention.ff_q [(128, 128), float32]', 'decoder.layers.1.enc_attention.ff_kv [(256, 128), float32]', 'decoder.layers.1.pre_ff.layer_norm [(128,), float32]', 'decoder.layers.1.pre_ff.layer_norm [(128,), float32]', 'decoder.layers.1.ff.ff1 [(512, 128), float32]', 'decoder.layers.1.ff.ff1 [(512,), float32]', 'decoder.layers.1.ff.ff2 [(128, 512), float32]', 'decoder.layers.1.ff.ff2 [(128,), float32]', 'decoder.layers.2.autoregr_layer.ff_out [(128, 128), float32]', 'decoder.layers.2.autoregr_layer.ff_in [(384, 128), float32]', 'decoder.layers.2.pre_autoregr_layer.layer_norm [(128,), float32]', 'decoder.layers.2.pre_autoregr_layer.layer_norm [(128,), float32]', 'decoder.layers.2.pre_enc_attention.layer_norm [(128,), float32]', 'decoder.layers.2.pre_enc_attention.layer_norm [(128,), float32]', 'decoder.layers.2.enc_attention.ff_out [(128, 128), float32]', 'decoder.layers.2.enc_attention.ff_q [(128, 128), float32]', 'decoder.layers.2.enc_attention.ff_kv [(256, 128), float32]', 'decoder.layers.2.pre_ff.layer_norm [(128,), float32]', 'decoder.layers.2.pre_ff.layer_norm [(128,), float32]', 'decoder.layers.2.ff.ff1 [(512, 128), float32]', 'decoder.layers.2.ff.ff1 [(512,), float32]', 'decoder.layers.2.ff.ff2 [(128, 512), float32]', 'decoder.layers.2.ff.ff2 [(128,), float32]', 'decoder.final_process.layer_norm [(128,), float32]', 'decoder.final_process.layer_norm [(128,), float32]', 'output_layer [(20008, 128), float32]', 'output_layer [(20008,), float32]'] [INFO:sockeye.utils] Shared parameters: ['embedding_source.embedding.weight = embedding_target.embedding.weight = output_layer.weight'] [INFO:sockeye.utils] Fixed parameters: ['encoder.pos_embedding [(96, 128), float32]', 
'decoder.pos_embedding [(96, 128), float32]'] [INFO:sockeye.loss] Loss: cross-entropy | weight=1.00 | metric: perplexity (ppl) | output_name: 'logits' | label_name: 'target_label' [WARNING:sockeye.optimizers] Cannot import NVIDIA Apex optimizers (FusedAdam, FusedSGD). Consider installing Apex for faster GPU training: https://github.com/NVIDIA/apex [INFO:sockeye.lr_scheduler] Will reduce the learning rate by a factor of 0.90 whenever the validation score doesn't improve 8 times. [INFO:__main__] Tracing SockeyeModel on a validation batch [INFO:sockeye.training] tensorboard not found. Consider 'pip install tensorboard' to log events to Tensorboard. [INFO:sockeye.inference] Translator (1 model(s) beam_size=5 algorithm=BeamSearch, beam_search_stop=all max_input_length=95 nbest_size=1 ensemble_mode=None max_batch_size=16 dtype=torch.float32 skip_nvs=False nvs_thresh=0.5) [INFO:sockeye.checkpoint_decoder] Created CheckpointDecoder(max_input_len=-1, beam_size=5, num_sentences=500) /Users/christyman/miniconda3/envs/ats-program-37/lib/python3.7/site-packages/torch/jit/_trace.py:983: TracerWarning: Encountering a list at the output of the tracer might cause the trace to be incorrect, this is only valid if the container structure does not change based on the module's inputs. Consider using a constant container instead (e.g. for list, use a tupleinstead. fordict, use a NamedTupleinstead). If you absolutely need this and know the side effects, pass strict=False to trace() to allow this behavior. argument_names, [INFO:sockeye.training] Early stopping by optimizing 'bleu' [INFO:sockeye.training] Found partial training in '../small_model/training_state'. Resuming from saved state. 
[INFO:sockeye.model] Loaded params from "../small_model/training_state/params" to "cpu"
[INFO:sockeye.training] Loaded optimizer state from "../small_model/training_state/optimizer_last.pkl"
[INFO:sockeye.training] Loaded 'LearningRateSchedulerPlateauReduce(reduce_factor=0.90, reduce_num_not_improved=8, num_not_improved=4, base_lr=0.0002, lr=4.437062468924528e-07, warmup=0, warmed_up=True)' from '../small_model/training_state/lr_scheduler_last.pkl'
[ERROR:root] Uncaught exception
Traceback (most recent call last):
  File "/Users/christyman/miniconda3/envs/ats-program-37/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/Users/christyman/miniconda3/envs/ats-program-37/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/Users/christyman/miniconda3/envs/ats-program-37/lib/python3.7/site-packages/sockeye/train.py", line 1225, in <module>
    main()
  File "/Users/christyman/miniconda3/envs/ats-program-37/lib/python3.7/site-packages/sockeye/train.py", line 943, in main
    train(args)
  File "/Users/christyman/miniconda3/envs/ats-program-37/lib/python3.7/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/Users/christyman/miniconda3/envs/ats-program-37/lib/python3.7/site-packages/sockeye/train.py", line 1208, in train
    checkpoint_decoder=checkpoint_decoder)
  File "/Users/christyman/miniconda3/envs/ats-program-37/lib/python3.7/site-packages/sockeye/training.py", line 229, in fit
    self._load_training_state(train_iter)
  File "/Users/christyman/miniconda3/envs/ats-program-37/lib/python3.7/site-packages/sockeye/training.py", line 788, in _load_training_state
    train_iter.load_state(os.path.join(self.training_state_dirname, C.BUCKET_ITER_STATE_NAME))
  File "/Users/christyman/miniconda3/envs/ats-program-37/lib/python3.7/site-packages/sockeye/data_io.py", line 2017, in load_state
    self.data = self.data.permute(self.data_permutations)
  File "/Users/christyman/miniconda3/envs/ats-program-37/lib/python3.7/site-packages/sockeye/data_io.py", line 1582, in permute
    source.append(torch.index_select(self.source[buck_idx], 0, permutation))
IndexError: index out of range in self
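For context, the failing call is `torch.index_select(self.source[buck_idx], 0, permutation)`, and "index out of range in self" means the saved permutation contains indices that exceed the size of the tensor being indexed. One plausible way this can happen when resuming (a minimal sketch, not Sockeye's actual code, with `index_select` as a plain-Python stand-in for `torch.index_select`): the permutation stored in `training_state` was recorded for the original bucketed data, and the freshly rebuilt buckets no longer line up with it.

```python
# Sketch (hypothetical): a stale saved permutation applied to a
# rebuilt, smaller bucket triggers the same class of IndexError.

def index_select(bucket, permutation):
    """Stand-in for torch.index_select(bucket, 0, permutation)."""
    out = []
    for i in permutation:
        if i >= len(bucket):
            # Mirrors the PyTorch error message seen in the traceback.
            raise IndexError("index out of range in self")
        out.append(bucket[i])
    return out

saved_permutation = [2, 0, 3, 1]              # recorded in training_state (hypothetical)
rebuilt_bucket = ["seq-a", "seq-b", "seq-c"]  # only 3 samples after the rebuild

try:
    index_select(rebuilt_bucket, saved_permutation)
except IndexError as e:
    print(e)  # -> index out of range in self
```

This is only an illustration of the failure mode; the actual mismatch could come from changed input files, a different bucketing outcome, or a corrupted/partial `training_state` directory.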

mjdenkowski (Contributor) commented

It looks like there may be an issue with the checkpoint files. You could rerun training from scratch if you have time and it isn't too expensive. Otherwise, you could potentially save some time by using the best parameters from the existing run to initialize a new training run along with early stopping (--params small_model/params.best --output new_model --max-num-checkpoint-not-improved 32 --checkpoint-improvement-threshold 0.001).
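Combining that suggestion with the original command from the log above, a warm-started fresh run might look like this (a sketch: the training/model flags are copied from the log, the last four flags are the ones suggested in this comment, and the `../small_model`/`../new_model` paths assume the same working directory as the original run):

```shell
# Start a new run initialized from the best checkpoint of the
# interrupted run, with early stopping. Adjust paths to your setup.
python -m sockeye.train \
  --source ../sentence_parallel_files/src_train.txt \
  --target ../sentence_parallel_files/tgt_train.txt \
  --validation-source ../sentence_parallel_files/src_validation.txt \
  --validation-target ../sentence_parallel_files/tgt_validation.txt \
  --shared-vocab --num-words 20000 \
  --num-layers 3:3 --transformer-model-size 128 \
  --transformer-attention-heads 4:4 \
  --transformer-feed-forward-num-hidden 512 \
  --embed-dropout 0.3 --label-smoothing 0.3 \
  --optimized-metric bleu --checkpoint-interval 10 --max-samples 10000000 \
  --params ../small_model/params.best \
  --output ../new_model \
  --max-num-checkpoint-not-improved 32 \
  --checkpoint-improvement-threshold 0.001
```

Note that this starts a new optimizer/scheduler state rather than resuming the old one, which is why the early-stopping flags matter: they let the new run terminate once the warm-started model stops improving.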
