For the second time, training got interrupted after several hours, just after completing epoch 31.
Here is the stack trace:
2018-07-22 21:03:20.250391: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1046] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3253 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1050, pci bus id: 0000:01:00.0, compute capability: 6.1)
2018-07-22 21:03:33.019454: W tensorflow/core/common_runtime/bfc_allocator.cc:275] Allocator (GPU_0_bfc) ran out of memory trying to allocate 637.77MiB. Current allocation summary follows.
[...]
2018-07-22 21:03:33.041741: W tensorflow/core/common_runtime/bfc_allocator.cc:279]
[...]
2018-07-22 21:03:33.041534: I tensorflow/core/common_runtime/bfc_allocator.cc:671] Summary of in-use Chunks by size:
2018-07-22 21:03:33.041549: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 55 Chunks of size 256 totalling 13.8KiB
2018-07-22 21:03:33.041557: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 84 Chunks of size 1024 totalling 84.0KiB
2018-07-22 21:03:33.041565: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 2 Chunks of size 1280 totalling 2.5KiB
2018-07-22 21:03:33.041572: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 1 Chunks of size 1792 totalling 1.8KiB
2018-07-22 21:03:33.041580: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 26 Chunks of size 4096 totalling 104.0KiB
2018-07-22 21:03:33.041588: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 1 Chunks of size 10752 totalling 10.5KiB
2018-07-22 21:03:33.041595: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 6 Chunks of size 94208 totalling 552.0KiB
2018-07-22 21:03:33.041603: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 1 Chunks of size 174080 totalling 170.0KiB
2018-07-22 21:03:33.041611: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 12 Chunks of size 1492224 totalling 17.08MiB
2018-07-22 21:03:33.041619: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 1 Chunks of size 1792512 totalling 1.71MiB
2018-07-22 21:03:33.041626: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 1 Chunks of size 2097152 totalling 2.00MiB
2018-07-22 21:03:33.041634: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 1 Chunks of size 2675200 totalling 2.55MiB
2018-07-22 21:03:33.041641: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 9 Chunks of size 3000064 totalling 25.75MiB
2018-07-22 21:03:33.041649: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 1 Chunks of size 4194304 totalling 4.00MiB
2018-07-22 21:03:33.041656: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 1 Chunks of size 4492288 totalling 4.28MiB
2018-07-22 21:03:33.041663: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 1 Chunks of size 5139712 totalling 4.90MiB
2018-07-22 21:03:33.041671: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 1 Chunks of size 5388544 totalling 5.14MiB
2018-07-22 21:03:33.041678: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 1 Chunks of size 5984512 totalling 5.71MiB
2018-07-22 21:03:33.041685: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 2 Chunks of size 668750080 totalling 1.25GiB
2018-07-22 21:03:33.041692: I tensorflow/core/common_runtime/bfc_allocator.cc:678] Sum Total of in-use chunks: 1.32GiB
2018-07-22 21:03:33.041703: I tensorflow/core/common_runtime/bfc_allocator.cc:680] Stats:
Limit: 3411738624
InUse: 1415132672
MaxInUse: 1509176832
NumAllocs: 460689205
MaxAllocSize: 668750080
********************_________________********************____________________*__________________*_**
2018-07-22 21:03:33.042256: W tensorflow/core/framework/op_kernel.cc:1290] CtxFailure at reverse_sequence_op.cc:135: Resource exhausted: OOM when allocating tensor with shape[2675,125,500] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
Unhandled exception
Traceback (most recent call last):
File "/home/user/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1322, in _do_call
return fn(*args)
File "/home/user/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1307, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "/home/user/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1409, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[2675,125,500] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[Node: layer_1/bidirectional_rnn/bw/ReverseSequence = ReverseSequence[T=DT_FLOAT, Tlen=DT_INT32, batch_dim=0, seq_dim=1, _device="/job:localhost/replica:0/task:0/device:GPU:0"](layer_0/concat, _arg_batch_x_lens_0_4/_141)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
[[Node: logits/_217 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_697_logits", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "./train.py", line 14, in <module>
experiment.train_ready(corp)
File "/home/user/.local/lib/python3.6/site-packages/persephone/experiment.py", line 101, in train_ready
model.train(min_epochs=20, early_stopping_steps=3)
File "/home/user/.local/lib/python3.6/site-packages/persephone/model.py", line 384, in train
self.eval(restore_model_path=self.saved_model_path)
File "/home/user/.local/lib/python3.6/site-packages/persephone/model.py", line 182, in eval
feed_dict=feed_dict)
File "/home/user/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 900, in run
run_metadata_ptr)
File "/home/user/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1135, in _run
feed_dict_tensor, options, run_metadata)
File "/home/user/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1316, in _do_run
run_metadata)
File "/home/user/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1335, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[2675,125,500] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
Caused by op 'layer_1/bidirectional_rnn/bw/ReverseSequence', defined at:
File "./train.py", line 14, in <module>
experiment.train_ready(corp)
File "/home/user/.local/lib/python3.6/site-packages/persephone/experiment.py", line 100, in train_ready
model = get_simple_model(exp_dir, corpus)
File "/home/user/.local/lib/python3.6/site-packages/persephone/experiment.py", line 91, in get_simple_model
decoding_merge_repeated=True)
File "/home/user/.local/lib/python3.6/site-packages/persephone/rnn_ctc.py", line 66, in __init__
time_major=False)
File "/home/user/.local/lib/python3.6/site-packages/tensorflow/python/ops/rnn.py", line 424, in bidirectional_dynamic_rnn
seq_dim=time_dim, batch_dim=batch_dim)
File "/home/user/.local/lib/python3.6/site-packages/tensorflow/python/ops/rnn.py", line 417, in _reverse
seq_dim=seq_dim, batch_dim=batch_dim)
File "/home/user/.local/lib/python3.6/site-packages/tensorflow/python/ops/array_ops.py", line 2638, in reverse_sequence
name=name)
File "/home/user/.local/lib/python3.6/site-packages/tensorflow/python/ops/gen_array_ops.py", line 6531, in reverse_sequence
seq_dim=seq_dim, batch_dim=batch_dim, name=name)
File "/home/user/.local/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/home/user/.local/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3417, in create_op
op_def=op_def)
File "/home/user/.local/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1743, in __init__
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access
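As a side note, the hint in the log about report_tensor_allocations_upon_oom can be followed like this (a minimal sketch against the TF 1.x session API; the toy graph is only there to show where the option is passed, it is not persephone's model):

```python
import tensorflow as tf

# Ask TF to list the tensors that are still allocated when an OOM occurs,
# as suggested by the "Hint:" lines in the log above.
run_options = tf.RunOptions(report_tensor_allocations_upon_oom=True)

# Toy graph standing in for the real one; in persephone the relevant call
# would be the sess.run(...) inside model.eval / model.train.
x = tf.placeholder(tf.float32, shape=[None, 3], name="x")
y = tf.reduce_sum(x)

with tf.Session() as sess:
    print(sess.run(y, feed_dict={x: [[1.0, 2.0, 3.0]]}, options=run_options))
```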
(The previous failure was because a couple of files referenced by one of the *_prefix.txt files were missing.)
Anyway, what should I do when this happens? Here is exp/
Using train() and passing a restore_model_path (which actually expects a file) does not work; apparently I don't yet have a fully built model.
Tweaking train() to call load_metagraph('exp/0/model/model_best.ckpt') and then saver.restore(sess, tf.train.latest_checkpoint("exp/0/model")) does not work either: training restarts from epoch 0.
Even though I have a lot of restore/checkpoint files in exp/, and after a deep look at the tf documentation for Saver, I still can't find a way to actually resume the interrupted training.
Hints/docs welcome.
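For reference, this is roughly the standard TF 1.x restore pattern I was trying to reproduce (a minimal sketch; the paths match my exp/ layout above, and the assumption that a matching .meta file sits next to the checkpoint may not hold for persephone's saver):

```python
import tensorflow as tf

ckpt_dir = "exp/0/model"
latest = tf.train.latest_checkpoint(ckpt_dir)   # e.g. "exp/0/model/model_best.ckpt"

# Rebuild the graph from the .meta file written alongside the checkpoint,
# then restore the variable values into a fresh session.
saver = tf.train.import_meta_graph(latest + ".meta")
with tf.Session() as sess:
    saver.restore(sess, latest)
    # The weights are back at this point, but persephone's train() still starts
    # counting from epoch 0, presumably because the epoch counter lives in
    # Python rather than in a graph variable.
```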
So after looking at this myself for a while, it appears really hard to solve this using only TensorFlow checkpoint files. Using tf.keras.save along with an HDF5 dump of all the weights might be easier for enabling a restore. This would require some work, but it might also be substantially better for reproducibility, because as far as I know the model training is not fully deterministic.
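Roughly what I have in mind (a minimal sketch with a toy tf.keras model, not persephone's RNN/CTC graph, just to illustrate the single-file HDF5 round trip):

```python
import tensorflow as tf

# Toy model standing in for the real one.
model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(10, activation="relu", input_shape=(20,)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

# One HDF5 file holds the architecture, weights and optimizer state (needs h5py).
model.save("model_epoch31.h5")

# Reloading gives back a compiled model, so training can resume where it left off.
restored = tf.keras.models.load_model("model_epoch31.h5")
```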