srcIndex < srcSelectDimSize
When I run `sh scripts/run_dnabert2.sh /home/DNABERT_2`, I get the following error:
```
(base) root@f442fb5fbe89:/home/DNABERT_2/finetune# sh scripts/run_dnabert2.sh /home/DNABERT_2
The provided data_path is /home/DNABERT_2
Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).
WARNING:root:Perform single sequence classification...
WARNING:root:Perform single sequence classification...
WARNING:root:Perform single sequence classification...
/root/.cache/huggingface/modules/transformers_modules/zhihan1996/DNABERT-2-117M/1d020b803b871a976f5f3d5565f0eac8f2c7bb81/bert_layers.py:125: UserWarning: Unable to import Triton; defaulting MosaicBERT attention implementation to pytorch (this will reduce throughput when using this model).
  warnings.warn(
Some weights of the model checkpoint at zhihan1996/DNABERT-2-117M were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.bias', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at zhihan1996/DNABERT-2-117M and are newly initialized: ['classifier.bias', 'bert.pooler.dense.bias', 'classifier.weight', 'bert.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Using cuda_amp half precision backend
***** Running training *****
  Num examples = 11,971
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 1
  Total optimization steps = 1,125
  Number of trainable parameters = 117,070,082
  0%|          | 0/1125 [00:00<?, ?it/s]/root/miniconda3/lib/python3.9/site-packages/torch/nn/parallel/_functions.py:68: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
  warnings.warn('Was asked to gather along dimension 0, but all '
  3%|██▊       | 30/1125 [00:31<15:53, 1.15it/s]
***** Running Evaluation *****
  Num examples = 1497
  Batch size = 64
{'eval_loss': 0.16505089402198792, 'eval_accuracy': 0.5584502338009352, 'eval_f1': 0.5569191296151599, 'eval_matthews_correlation': 0.12047851210237737, 'eval_precision': 0.5607909809982725, 'eval_recall': 0.5596925384326958, 'eval_runtime': 10.8475, 'eval_samples_per_second': 138.004, 'eval_steps_per_second': 2.212, 'epoch': 0.08}
  3%|██▊       | 30/1125 [00:42<15:53, 1.15it/s]
Saving model checkpoint to output/dnabert2/checkpoint-30
Configuration saved in output/dnabert2/checkpoint-30/config.json
Model weights saved in output/dnabert2/checkpoint-30/pytorch_model.bin
tokenizer config file saved in output/dnabert2/checkpoint-30/tokenizer_config.json
Special tokens file saved in output/dnabert2/checkpoint-30/special_tokens_map.json
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [313,0,0], thread: [32,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
(the same assertion repeats for threads [33,0,0] through [63,0,0] of block [313,0,0])
Traceback (most recent call last):
  File "/home/DNABERT_2/finetune/train.py", line 311, in <module>
    train()
  File "/home/DNABERT_2/finetune/train.py", line 291, in train
    trainer.train()
  File "/root/miniconda3/lib/python3.9/site-packages/transformers/trainer.py", line 1664, in train
    return inner_training_loop(
  File "/root/miniconda3/lib/python3.9/site-packages/transformers/trainer.py", line 1940, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/root/miniconda3/lib/python3.9/site-packages/transformers/trainer.py", line 2735, in training_step
    loss = self.compute_loss(model, inputs)
  File "/root/miniconda3/lib/python3.9/site-packages/transformers/trainer.py", line 2767, in compute_loss
    outputs = model(**inputs)
  File "/root/miniconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/miniconda3/lib/python3.9/site-packages/torch/nn/parallel/data_parallel.py", line 171, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/root/miniconda3/lib/python3.9/site-packages/torch/nn/parallel/data_parallel.py", line 181, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/root/miniconda3/lib/python3.9/site-packages/torch/nn/parallel/parallel_apply.py", line 89, in parallel_apply
    output.reraise()
  File "/root/miniconda3/lib/python3.9/site-packages/torch/_utils.py", line 644, in reraise
    raise exception
RuntimeError: Caught RuntimeError in replica 3 on device 3.
Original Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.9/site-packages/torch/nn/parallel/parallel_apply.py", line 64, in _worker
    output = module(*input, **kwargs)
  File "/root/miniconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/zhihan1996/DNABERT-2-117M/1d020b803b871a976f5f3d5565f0eac8f2c7bb81/bert_layers.py", line 862, in forward
    outputs = self.bert(
  File "/root/miniconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/zhihan1996/DNABERT-2-117M/1d020b803b871a976f5f3d5565f0eac8f2c7bb81/bert_layers.py", line 608, in forward
    encoder_outputs = self.encoder(
  File "/root/miniconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/zhihan1996/DNABERT-2-117M/1d020b803b871a976f5f3d5565f0eac8f2c7bb81/bert_layers.py", line 426, in forward
    hidden_states, indices, cu_seqlens, _ = unpad_input(
  File "/root/.cache/huggingface/modules/transformers_modules/zhihan1996/DNABERT-2-117M/1d020b803b871a976f5f3d5565f0eac8f2c7bb81/bert_padding.py", line 104, in unpad_input
    indices = torch.nonzero(attention_mask.flatten(), as_tuple=False).flatten()
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
```
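The traceback itself notes that CUDA kernel errors are reported asynchronously, so the `unpad_input` frame above may not be the real failure site. What I plan to try first (a sketch, not repo code) is forcing synchronous kernel launches so the assert surfaces at the call that actually raised it, either from the shell with `CUDA_LAUNCH_BLOCKING=1 sh scripts/run_dnabert2.sh /home/DNABERT_2` or by setting the variable at the very top of `finetune/train.py`:

```python
# Sketch: these lines would go at the very top of finetune/train.py,
# before any `import torch`, so the setting is in place when CUDA initializes.
import os

os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # synchronous launches -> accurate stack trace for the device-side assert
```

Since the failure is reported from replica 3, running on a single GPU (e.g. `CUDA_VISIBLE_DEVICES=0`) would also help rule out the DataParallel path as a factor.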
For context: training starts fine and the first evaluation on the test set completes normally; the error only appears as training continues (here right after the step-30 checkpoint is saved).
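As I understand it, `srcIndex < srcSelectDimSize` from `indexSelectLargeIndex` usually means an embedding/index_select lookup received an index outside the table, e.g. a token id at or above the vocabulary size, or a sequence longer than the model is configured to handle. Below is a hedged diagnostic sketch (not from the repo) that scans the tokenized inputs for such values; the CSV path, the `sequence` column name, and `MAX_LENGTH = 128` are assumptions about my local setup, not values taken from `run_dnabert2.sh`.

```python
# Diagnostic sketch: flag rows whose tokenized form could trigger an
# out-of-range index in an embedding lookup. Path/column/length are assumptions.
import csv

from transformers import AutoConfig, AutoTokenizer

MODEL_NAME = "zhihan1996/DNABERT-2-117M"
CSV_PATH = "/home/DNABERT_2/sample_data/train.csv"  # hypothetical path to the fine-tuning data
MAX_LENGTH = 128                                    # assumed max_length used by the training script

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)
config = AutoConfig.from_pretrained(MODEL_NAME, trust_remote_code=True)

with open(CSV_PATH) as f:
    for i, row in enumerate(csv.DictReader(f)):
        ids = tokenizer(row["sequence"])["input_ids"]  # no truncation, to see the raw length
        bad = [t for t in ids if t >= config.vocab_size]
        if bad:
            print(f"row {i}: token ids outside vocab_size={config.vocab_size}: {bad}")
        if len(ids) > MAX_LENGTH:
            print(f"row {i}: {len(ids)} tokens, longer than the assumed max_length={MAX_LENGTH}")
```

If nothing is flagged there, my next guess would be the multi-GPU path rather than the data itself.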