For the second time, training got interrupted after several hours, just after completing epoch 31.
Here is the stack trace:
2018-07-22 21:03:20.250391: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1046] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3253 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1050, pci bus id: 0000:01:00.0, compute capability: 6.1)
2018-07-22 21:03:33.019454: W tensorflow/core/common_runtime/bfc_allocator.cc:275] Allocator (GPU_0_bfc) ran out of memory trying to allocate 637.77MiB. Current allocation summary follows.
[...]
2018-07-22 21:03:33.041741: W tensorflow/core/common_runtime/bfc_allocator.cc:279]
[...]
2018-07-22 21:03:33.041534: I tensorflow/core/common_runtime/bfc_allocator.cc:671] Summary of in-use Chunks by size:
2018-07-22 21:03:33.041549: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 55 Chunks of size 256 totalling 13.8KiB
2018-07-22 21:03:33.041557: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 84 Chunks of size 1024 totalling 84.0KiB
2018-07-22 21:03:33.041565: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 2 Chunks of size 1280 totalling 2.5KiB
2018-07-22 21:03:33.041572: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 1 Chunks of size 1792 totalling 1.8KiB
2018-07-22 21:03:33.041580: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 26 Chunks of size 4096 totalling 104.0KiB
2018-07-22 21:03:33.041588: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 1 Chunks of size 10752 totalling 10.5KiB
2018-07-22 21:03:33.041595: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 6 Chunks of size 94208 totalling 552.0KiB
2018-07-22 21:03:33.041603: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 1 Chunks of size 174080 totalling 170.0KiB
2018-07-22 21:03:33.041611: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 12 Chunks of size 1492224 totalling 17.08MiB
2018-07-22 21:03:33.041619: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 1 Chunks of size 1792512 totalling 1.71MiB
2018-07-22 21:03:33.041626: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 1 Chunks of size 2097152 totalling 2.00MiB
2018-07-22 21:03:33.041634: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 1 Chunks of size 2675200 totalling 2.55MiB
2018-07-22 21:03:33.041641: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 9 Chunks of size 3000064 totalling 25.75MiB
2018-07-22 21:03:33.041649: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 1 Chunks of size 4194304 totalling 4.00MiB
2018-07-22 21:03:33.041656: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 1 Chunks of size 4492288 totalling 4.28MiB
2018-07-22 21:03:33.041663: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 1 Chunks of size 5139712 totalling 4.90MiB
2018-07-22 21:03:33.041671: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 1 Chunks of size 5388544 totalling 5.14MiB
2018-07-22 21:03:33.041678: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 1 Chunks of size 5984512 totalling 5.71MiB
2018-07-22 21:03:33.041685: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 2 Chunks of size 668750080 totalling 1.25GiB
2018-07-22 21:03:33.041692: I tensorflow/core/common_runtime/bfc_allocator.cc:678] Sum Total of in-use chunks: 1.32GiB
2018-07-22 21:03:33.041703: I tensorflow/core/common_runtime/bfc_allocator.cc:680] Stats:
Limit: 3411738624
InUse: 1415132672
MaxInUse: 1509176832
NumAllocs: 460689205
MaxAllocSize: 668750080
********************_________________********************____________________*__________________*_**
2018-07-22 21:03:33.042256: W tensorflow/core/framework/op_kernel.cc:1290] CtxFailure at reverse_sequence_op.cc:135: Resource exhausted: OOM when allocating tensor with shape[2675,125,500] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
Unhandled exception
Traceback (most recent call last):
File "/home/user/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1322, in _do_call
return fn(*args)
File "/home/user/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1307, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "/home/user/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1409, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[2675,125,500] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[Node: layer_1/bidirectional_rnn/bw/ReverseSequence = ReverseSequence[T=DT_FLOAT, Tlen=DT_INT32, batch_dim=0, seq_dim=1, _device="/job:localhost/replica:0/task:0/device:GPU:0"](layer_0/concat, _arg_batch_x_lens_0_4/_141)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
[[Node: logits/_217 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_697_logits", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "./train.py", line 14, in <module>
experiment.train_ready(corp)
File "/home/user/.local/lib/python3.6/site-packages/persephone/experiment.py", line 101, in train_ready
model.train(min_epochs=20, early_stopping_steps=3)
File "/home/user/.local/lib/python3.6/site-packages/persephone/model.py", line 384, in train
self.eval(restore_model_path=self.saved_model_path)
File "/home/user/.local/lib/python3.6/site-packages/persephone/model.py", line 182, in eval
feed_dict=feed_dict)
File "/home/user/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 900, in run
run_metadata_ptr)
File "/home/user/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1135, in _run
feed_dict_tensor, options, run_metadata)
File "/home/user/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1316, in _do_run
run_metadata)
File "/home/user/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1335, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[2675,125,500] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
Caused by op 'layer_1/bidirectional_rnn/bw/ReverseSequence', defined at:
File "./train.py", line 14, in <module>
experiment.train_ready(corp)
File "/home/user/.local/lib/python3.6/site-packages/persephone/experiment.py", line 100, in train_ready
model = get_simple_model(exp_dir, corpus)
File "/home/user/.local/lib/python3.6/site-packages/persephone/experiment.py", line 91, in get_simple_model
decoding_merge_repeated=True)
File "/home/user/.local/lib/python3.6/site-packages/persephone/rnn_ctc.py", line 66, in __init__
time_major=False)
File "/home/user/.local/lib/python3.6/site-packages/tensorflow/python/ops/rnn.py", line 424, in bidirectional_dynamic_rnn
seq_dim=time_dim, batch_dim=batch_dim)
File "/home/user/.local/lib/python3.6/site-packages/tensorflow/python/ops/rnn.py", line 417, in _reverse
seq_dim=seq_dim, batch_dim=batch_dim)
File "/home/user/.local/lib/python3.6/site-packages/tensorflow/python/ops/array_ops.py", line 2638, in reverse_sequence
name=name)
File "/home/user/.local/lib/python3.6/site-packages/tensorflow/python/ops/gen_array_ops.py", line 6531, in reverse_sequence
seq_dim=seq_dim, batch_dim=batch_dim, name=name)
File "/home/user/.local/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/home/user/.local/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3417, in create_op
op_def=op_def)
File "/home/user/.local/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1743, in __init__
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access
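As a side note, the hint in the log about report_tensor_allocations_upon_oom can be followed like this (a minimal sketch against the TF 1.x session API; the toy graph is only there to show where the option is passed, it is not persephone's model):

```python
import tensorflow as tf

# Ask TF to list the tensors that are still allocated when an OOM occurs,
# as suggested by the "Hint:" lines in the log above.
run_options = tf.RunOptions(report_tensor_allocations_upon_oom=True)

# Toy graph standing in for the real one; in persephone the relevant call
# would be the sess.run(...) inside model.eval / model.train.
x = tf.placeholder(tf.float32, shape=[None, 3], name="x")
y = tf.reduce_sum(x)

with tf.Session() as sess:
    print(sess.run(y, feed_dict={x: [[1.0, 2.0, 3.0]]}, options=run_options))
```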
(The previous failure was because a couple of files referenced by one of the *_prefix.txt files were missing.)
Anyway, what should I do when this happens? Here is exp/
Using train() and passing a restore_model_path (which actually expects a file) does not work; apparently I don't yet have a fully built model.
Tweaking train() to call load_metagraph('exp/0/model/model_best.ckpt') and then saver.restore(sess, tf.train.latest_checkpoint("exp/0/model")) does not work either: training restarts from epoch 0.
Even though I have a lot of restore/checkpoint files in exp/, and after a deep look at the tf documentation for Saver, I still can't find a way to actually resume the interrupted training.
Hints/docs welcome.
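For reference, this is roughly the standard TF 1.x restore pattern I was trying to reproduce (a minimal sketch; the paths match my exp/ layout above, and the assumption that a matching .meta file sits next to the checkpoint may not hold for persephone's saver):

```python
import tensorflow as tf

ckpt_dir = "exp/0/model"
latest = tf.train.latest_checkpoint(ckpt_dir)   # e.g. "exp/0/model/model_best.ckpt"

# Rebuild the graph from the .meta file written alongside the checkpoint,
# then restore the variable values into a fresh session.
saver = tf.train.import_meta_graph(latest + ".meta")
with tf.Session() as sess:
    saver.restore(sess, latest)
    # The weights are back at this point, but persephone's train() still starts
    # counting from epoch 0, presumably because the epoch counter lives in
    # Python rather than in a graph variable.
```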
So after looking at this myself for a while, it appears really hard to solve this using only TensorFlow checkpoint files. Using tf.keras.save along with an HDF5 dump of all the weights might be easier for enabling a restore. This would require some work, but it might also be substantially better for reproducibility, because as far as I know the model training is not fully deterministic.
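Roughly what I have in mind (a minimal sketch with a toy tf.keras model, not persephone's RNN/CTC graph, just to illustrate the single-file HDF5 round trip):

```python
import tensorflow as tf

# Toy model standing in for the real one.
model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(10, activation="relu", input_shape=(20,)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

# One HDF5 file holds the architecture, weights and optimizer state (needs h5py).
model.save("model_epoch31.h5")

# Reloading gives back a compiled model, so training can resume where it left off.
restored = tf.keras.models.load_model("model_epoch31.h5")
```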