Possible Issue with the Mamba Initialization #32

AlirezAkbary · 2024-11-25T23:47:36Z

It seems that you use a specific initialization strategy for Mamba:

Lines 75 to 94 in fcc6af7

    
    # Initialize special dt projection to preserve variance at initialization
    dt_init_std = self.dt_rank**-0.5 * dt_scale
    if dt_init == "constant":
        nn.init.constant_(self.dt_proj.weight, dt_init_std)
    elif dt_init == "random":
        nn.init.uniform_(self.dt_proj.weight, -dt_init_std, dt_init_std)
    else:
        raise NotImplementedError
    # Initialize dt bias so that F.softplus(dt_bias) is between dt_min and dt_max
    dt = torch.exp(
        torch.rand(self.d_inner, **factory_kwargs) * (math.log(dt_max) - math.log(dt_min))
        + math.log(dt_min)
    ).clamp(min=dt_init_floor)
    # Inverse of softplus: https://github.com/pytorch/pytorch/issues/72759
    inv_dt = dt + torch.log(-torch.expm1(-dt))
    with torch.no_grad():
        self.dt_proj.bias.copy_(inv_dt)
    # Our initialization would set all Linear.bias to zero, need to mark this one as _no_reinit
    self.dt_proj.bias._no_reinit = True
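
To illustrate why this bias matters, here is a small pure-Python sketch of the inverse-softplus step. The helper names and the default values for dt_min, dt_max and dt_init_floor are assumptions chosen for illustration, not taken from the repository. It checks that softplus(inv_dt) recovers dt, whereas a bias reset to zero yields softplus(0) = log(2), far outside the sampled range.

```python
import math
import random


def softplus(x: float) -> float:
    # Numerically stable softplus: log(1 + exp(x))
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def inverse_softplus(y: float) -> float:
    # Same expression as in the snippet above: y + log(-expm1(-y)), valid for y > 0
    return y + math.log(-math.expm1(-y))


def sample_dt(dt_min=0.001, dt_max=0.1, dt_init_floor=1e-4, rng=random):
    # Log-uniform sample in [dt_min, dt_max], clamped from below.
    # The default values here are illustrative assumptions.
    u = rng.random()
    dt = math.exp(u * (math.log(dt_max) - math.log(dt_min)) + math.log(dt_min))
    return max(dt, dt_init_floor)


dt = sample_dt()
inv_dt = inverse_softplus(dt)
print(dt, softplus(inv_dt))  # the two values agree up to rounding
print(softplus(0.0))         # log(2): what a zeroed bias would produce
```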

However, the weights are re-initialized when the Mamba-based language model is instantiated:

zoology/zoology/model.py

Line 156 in fcc6af7

self.apply(partial(_init_weights, n_layers=config.n_layers,))

The _init_weights function does not skip parameters marked with _no_reinit = True, so the custom dt_proj weight and bias set above are overwritten:

zoology/zoology/model.py

Lines 73 to 97 in fcc6af7

    
    def _init_weights(
        module,
        n_layers,
        initializer_range=0.02,
        rescale_prenorm_residual=True,
    ):
        if isinstance(module, nn.Linear):
            nn.init.normal_(module.weight, std=initializer_range)
            if module.bias is not None:
                nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            nn.init.normal_(module.weight, std=initializer_range)
        if rescale_prenorm_residual:
            for name, p in module.named_parameters():
                if "out_proj.weight" in name or "fc2.weight" in name:
                    # Special Scaled Initialization --> There are 2 Layer Norms per Transformer Block
                    nn.init.normal_(
                        p, mean=0.0, std=initializer_range / math.sqrt(2 * n_layers)
                    )
                # If using GLU activation for now, we scale the std by 2
                elif "output_linear.0.weight" in name:
                    nn.init.normal_(
                        p, mean=0.0, std=initializer_range / math.sqrt(2 * n_layers)
                    )

If this is correct, the fix would be to check _no_reinit in the _init_weights function, similar to the original Mamba implementation:
https://github.com/state-spaces/mamba/blob/442fab4b1fd5226c1b5939b37d91ede430b5d1ae/mamba_ssm/models/mixer_seq_simple.py#L93-L96
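
A minimal sketch of such a check, assuming PyTorch is available. The names _init_weights_guarded and ToyMixer are hypothetical and exist only for this illustration, and the rescaling of out_proj / fc2 weights is omitted for brevity. Note that only the bias is flagged in the Mamba snippet; the guard on the weight is an optional extension that would matter only if dt_proj.weight were flagged as well.

```python
from functools import partial

import torch
import torch.nn as nn


def _init_weights_guarded(module, n_layers, initializer_range=0.02):
    # Hypothetical variant of _init_weights that honors a `_no_reinit` flag.
    # n_layers is kept only to mirror the original signature; it is unused here.
    if isinstance(module, nn.Linear):
        # Optional extension: skip weights that were flagged as well.
        if not getattr(module.weight, "_no_reinit", False):
            nn.init.normal_(module.weight, std=initializer_range)
        # Skip biases flagged by the layer's own initialization.
        if module.bias is not None and not getattr(module.bias, "_no_reinit", False):
            nn.init.zeros_(module.bias)
    elif isinstance(module, nn.Embedding):
        nn.init.normal_(module.weight, std=initializer_range)


class ToyMixer(nn.Module):
    # Toy stand-in for a Mamba block: one flagged projection, one ordinary one.
    def __init__(self, dt_rank=4, d_inner=16):
        super().__init__()
        self.dt_proj = nn.Linear(dt_rank, d_inner, bias=True)
        self.out_proj = nn.Linear(d_inner, d_inner, bias=True)
        with torch.no_grad():
            self.dt_proj.bias.fill_(-4.0)  # placeholder for inv_dt
        self.dt_proj.bias._no_reinit = True


model = ToyMixer()
model.apply(partial(_init_weights_guarded, n_layers=2))
print(model.dt_proj.bias)   # still the custom values
print(model.out_proj.bias)  # zeroed as usual
```

With the unguarded function quoted above, the same call would reset dt_proj.bias to zero.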
