Sell the Titans and one 3090, and buy a 4090. Use the 4090 as the main driver and the remaining 3090 as additional memory.
-
Hey everyone, I need some help. I got my hands on two Titan RTX 24GB cards and two RTX 3090 24GB cards. As far as I know, the biggest differentiating factor between them is that the RTX 3090 runs FP16 tensor math with FP32 accumulate at half rate, which leaves it at roughly half the Titan RTX's throughput in that one area. In every other metric it wipes the floor with the Titan RTX.
The RTX 3090 is also Ampere-based, so it supports FlashAttention-2 (and therefore sample packing) as well as BFloat16, while on the Titan RTX I had to run xformers with no sample packing.
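The Ampere gating described above can be sketched as a small lookup. This is a minimal illustration, not a real driver query: the `gpu_features` helper and the `(8, 0)` threshold reflect the publicly stated requirement that FlashAttention-2 and bfloat16 need Ampere (compute capability 8.0) or newer, while xformers attention also runs on Turing.

```python
# Minimal sketch: map a CUDA compute capability tuple to the features
# discussed above. Titan RTX is Turing (7.5); RTX 3090 is Ampere (8.6).

def gpu_features(compute_capability):
    """Return which of the discussed features a GPU generation supports."""
    ampere_or_newer = compute_capability >= (8, 0)
    return {
        "flash_attention_2": ampere_or_newer,  # prerequisite for sample packing here
        "bfloat16": ampere_or_newer,
        "xformers": True,  # works on Turing as well
    }

print(gpu_features((7, 5)))  # Titan RTX
print(gpu_features((8, 6)))  # RTX 3090
```

On a live system the capability tuple would come from `torch.cuda.get_device_capability()`, but the mapping itself is independent of any GPU being present.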
In my testing, with this yaml configuration:
This would result in a 24-step training.
This results in these training times:
Titan RTX: 248 seconds
RTX 3090: 325 seconds
But if I enable sample packing on the RTX 3090, the whole run collapses into one step:
RTX 3090 sample packing on: 28 seconds
I understand this happens because the dataset is tiny, so it can be packed and finished in one step instead of the original 24. But is the Titan RTX inherently faster without this optimization? Is there a way to turn on sample packing with the Titan RTX? I am debating which of the cards to keep and which to sell. Thanks!
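The step-count collapse can be illustrated with a toy first-fit packer. The sample lengths and `max_len` below are made up for illustration (the actual dataset and sequence length aren't shown in this thread); the point is only that many short samples share one max-length sequence instead of each padding out its own.

```python
# Toy illustration of why sample packing shrinks the step count: several
# short samples are concatenated into a single sequence up to max_len,
# so far fewer sequences (and hence steps) are needed per epoch.

def greedy_pack(lengths, max_len):
    """First-fit-decreasing packing: returns lists of sample lengths per sequence."""
    bins = []
    for n in sorted(lengths, reverse=True):
        for b in bins:
            if sum(b) + n <= max_len:  # sample still fits in this sequence
                b.append(n)
                break
        else:
            bins.append([n])  # start a new packed sequence
    return bins

lengths = [80, 120, 60, 200, 90, 150, 70, 110]  # 8 hypothetical short samples
packed = greedy_pack(lengths, max_len=1024)
print(f"unpacked: {len(lengths)} sequences -> packed: {len(packed)} sequence(s)")
# -> unpacked: 8 sequences -> packed: 1 sequence(s)
```

With real data the ratio depends on how short the samples are relative to the context length, which is why a tiny dataset like this one can go from 24 steps to 1.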