[WIP] MLM Training Objective #680

prady-saligram · 2024-07-30T21:22:09Z

Introduces train_mlm.py, a new file adapted from train_lm.py, to support masked language modeling with dynamic masking as used in RoBERTa. A new class, MaskedLmDataset, is implemented in text.py to handle dynamic masking and is instantiated in train_mlm.py. The structural and sharding-related comments from the original train_lm.py are preserved for clarity and continuity. The integration of MaskedLmDataset with the training script has been verified with appropriate parameters to ensure consistency with existing training workflows.
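For readers unfamiliar with the term, "dynamic masking" means the mask pattern is re-sampled each time an example is served instead of being fixed once during preprocessing. The following is a minimal sketch of that idea only, not the code in this PR: `dynamic_mask` and its parameters are hypothetical names, it works on plain Python lists rather than Levanter's named arrays, and it replaces every selected position with the mask id (it leaves out the random-token and keep-original cases that BERT-style masking also uses).

```python
import random
from typing import List, Sequence


def dynamic_mask(
    tokens: Sequence[int],
    mask_id: int,
    seed: int,
    epoch: int,
    mask_prob: float = 0.15,
) -> List[int]:
    """Return a copy of `tokens` with a freshly sampled mask pattern.

    The random stream depends on both `seed` and `epoch`, so the pattern is
    reproducible for a given (seed, epoch) pair and is re-drawn whenever the
    epoch changes.
    """
    # Combine seed and epoch into a single integer seed (illustrative choice).
    rng = random.Random(seed * 1_000_003 + epoch)
    # Each position is masked independently with probability `mask_prob`.
    return [mask_id if rng.random() < mask_prob else tok for tok in tokens]
```

Calling `dynamic_mask(tokens, mask_id, seed, epoch=0)` and then again with `epoch=1` draws from two independent random streams, which is the property that distinguishes dynamic masking from masking once at preprocessing time.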

dlwh · 2024-08-01T18:39:20Z

src/levanter/data/text.py

@@ -64,6 +65,65 @@

 DEFAULT_IGNORE_INDEX = -100  # Mirrors pytorch's default ignore index

+class MaskedLmDataset(ShardableDataset[LmExample]):


(fyi we're gonna do a big refactor on datasets soon, but I'll either handle the refactor or guide you through it)

dlwh · 2024-08-01T19:16:43Z

src/levanter/data/text.py

+            def _create_mlm_example(tokens, key):
+                tokens_array = tokens.array
+
+                example = LmExample.causal(tokens=tokens, ignore_id=self.ignore_id)


you need a non-causal attention mask for RoBERTa, and you need to set a loss_mask to be only the masked tokens

you also can't use the current LmExample actually, because you need a separate targets field (with the non-masked tokens). With more work you could avoid the need for targets (with just masked tokens), but it's probably better to add a targets: Optional[NamedArray] to the class (or make your own class)
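A rough sketch of what this comment asks for, using plain Python lists so it runs on its own. `MlmExample`, `make_mlm_example`, and `masked_mean` are hypothetical names for illustration and are not Levanter's API; a real version would presumably hold named arrays and the library's own attention-mask type.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class MlmExample:  # hypothetical class, not part of Levanter
    tokens: List[int]            # model input, masked positions replaced by mask_id
    targets: List[int]           # the original, uncorrupted tokens
    loss_mask: List[int]         # 1 where the loss is computed, 0 elsewhere
    attn_mask: List[List[int]]   # attn_mask[q][k] == 1 if query q may attend to key k


def make_mlm_example(
    tokens: Sequence[int], masked_positions: Sequence[int], mask_id: int
) -> MlmExample:
    n = len(tokens)
    masked = set(masked_positions)
    inputs = [mask_id if i in masked else tok for i, tok in enumerate(tokens)]
    # Only the masked positions contribute to the loss.
    loss_mask = [1 if i in masked else 0 for i in range(n)]
    # Non-causal: every position attends to every position. A causal mask
    # would instead allow k <= q only (a lower-triangular matrix).
    attn_mask = [[1] * n for _ in range(n)]
    return MlmExample(inputs, list(tokens), loss_mask, attn_mask)


def masked_mean(per_token_loss: Sequence[float], loss_mask: Sequence[int]) -> float:
    """Average the per-token loss over positions where loss_mask == 1."""
    total = sum(loss * m for loss, m in zip(per_token_loss, loss_mask))
    count = sum(loss_mask)
    return total / count if count else 0.0
```

For instance, `make_mlm_example([5, 6, 7, 8], masked_positions=[1, 3], mask_id=0)` keeps the untouched sequence in `targets` while `tokens` carries the mask id at positions 1 and 3, which is why a separate `targets` field is needed once the inputs have been corrupted.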


prady-saligram added 3 commits July 30, 2024 11:33

Implements dynamic masking objective

8f7402e

Implements dynamic masked dataset

670b053

Reintroduced accidentally deleted CausalLMDataset class

42f5404

dlwh reviewed Aug 1, 2024


prady-saligram marked this pull request as draft August 5, 2024 21:50

prady-saligram added 2 commits August 5, 2024 14:51

[WIP] Re-implements MLM training objective

53fd8d2

Adds error handling and reverts LmExample class to original

dcd45b2

prady-saligram changed the title ~~MLM Training Objective~~ [WIP] MLM Training Objective Aug 6, 2024

prady-saligram added 2 commits August 26, 2024 15:03

Sets RobertaConfig as model architecture and creates default config file

027b176

Adds compute_loss to roberta and changes positional ids to begin from 0

399e08c
