How can I use training parameter from one custom layer in another custom layer? #941

AzamatB · 2024-09-19T04:28:50Z


Hi, I have two custom Lux layers, say ImportanceScaling and Decoder, that will be Chain-ed to construct the model. I want to use the trainable parameter of ImportanceScaling in the forward pass of Decoder. How can I do this?
Here is a minimal example:

using Lux
using Random  # provides AbstractRNG and Xoshiro used below

struct ImportanceScaling <: Lux.AbstractLuxLayer
    dim::Int
end

function Lux.initialparameters(rng::AbstractRNG, layer::ImportanceScaling)
    importance_weights = rand(rng, Float32, layer.dim)
    return (; importance_weights)
end

function (layer::ImportanceScaling)(x::AbstractVecOrMat, params::NamedTuple, state::NamedTuple)
    y = x .* params.importance_weights
    return (y, state)
end


struct Decoder <: Lux.AbstractLuxLayer
    dim_encoding::Int
    dim_decoding::Int
end


function Lux.initialparameters(rng::AbstractRNG, decoder::Decoder)
    word_embedding = kaiming_uniform(rng, Float32, decoder.dim_decoding, decoder.dim_encoding)
    bias = Lux.init_linear_bias(rng, nothing, decoder.dim_encoding, decoder.dim_decoding)
    return (; word_embedding, bias)
end

function Lux.outputsize(decoder::Decoder, _, ::AbstractRNG)
    return (decoder.dim_decoding,)
end

function (decoder::Decoder)(x, params::NamedTuple, state::NamedTuple)
    x1 = importance_weights .* x # want to use importance_weights from ImportanceScaling layer above
    x2 = params.word_embedding * x1 .+ params.bias
    return (x2, state)
end

model = Chain(ImportanceScaling(10), Decoder(10, 10))

avik-pal · 2024-09-19T04:41:18Z

Maintainer

If you Chain the models then it is assumed that layer[I] cannot interact with layer[I - 1]/layer[I + 1] in any form other than the outputs it generates. My recommendation would be to use a Lux.AbstractLuxContainerLayer and write out the forward pass manually.
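A minimal sketch of the container-layer recommendation, under stated assumptions: `ScaledDecoder` is a hypothetical name, and a stock `Dense` layer stands in for the Decoder's affine map. The container's forward pass sees the parameters of every sublayer, so it can reuse `importance_weights` directly.

```julia
using Lux, Random

struct ImportanceScaling <: Lux.AbstractLuxLayer
    dim::Int
end

function Lux.initialparameters(rng::AbstractRNG, layer::ImportanceScaling)
    return (; importance_weights = rand(rng, Float32, layer.dim))
end

function (layer::ImportanceScaling)(x::AbstractVecOrMat, ps, st::NamedTuple)
    return x .* ps.importance_weights, st
end

# Hypothetical container (not from the thread). The fields listed in the type
# parameter are treated as sublayers, so ps and st get `scaling` and `decoder` entries.
struct ScaledDecoder{S, D} <: Lux.AbstractLuxContainerLayer{(:scaling, :decoder)}
    scaling::S
    decoder::D
end

function (m::ScaledDecoder)(x, ps, st::NamedTuple)
    # Run the scaling sublayer with its own parameters and state.
    y, st_scaling = m.scaling(x, ps.scaling, st.scaling)
    # The container has access to all parameters, so reuse the weights here.
    y = ps.scaling.importance_weights .* y
    z, st_decoder = m.decoder(y, ps.decoder, st.decoder)
    return z, (; scaling = st_scaling, decoder = st_decoder)
end

# Dense is an assumption standing in for the Decoder's affine map.
model = ScaledDecoder(ImportanceScaling(10), Dense(10 => 10))
ps, st = Lux.setup(Xoshiro(0), model)
x = rand(Float32, 10, 4)
y, st_new = model(x, ps, st)
```

With this layout there is a single copy of `importance_weights`, stored under `ps.scaling`, and no parameter linking is needed.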

If you want to hack a solution with Chain, the only way would be to make ImportanceScaling return the tuple (y, params.importance_weights) as its output and have Decoder unpack that tuple to use importance_weights. But I would generally advise against this usage pattern.
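For illustration only, the Chain workaround described above might look like the sketch below. The tuple-destructuring signature on `Decoder` and the zero-initialized bias are assumptions made to keep the example short; the thread itself advises against this pattern.

```julia
using Lux, Random

struct ImportanceScaling <: Lux.AbstractLuxLayer
    dim::Int
end

function Lux.initialparameters(rng::AbstractRNG, layer::ImportanceScaling)
    return (; importance_weights = rand(rng, Float32, layer.dim))
end

# The layer output is a tuple: the scaled input plus the weights themselves.
function (layer::ImportanceScaling)(x::AbstractVecOrMat, ps, st::NamedTuple)
    y = x .* ps.importance_weights
    return (y, ps.importance_weights), st
end

struct Decoder <: Lux.AbstractLuxLayer
    dim_encoding::Int
    dim_decoding::Int
end

function Lux.initialparameters(rng::AbstractRNG, decoder::Decoder)
    word_embedding = kaiming_uniform(rng, Float32, decoder.dim_decoding, decoder.dim_encoding)
    bias = zeros(Float32, decoder.dim_decoding)  # assumption: zero bias for brevity
    return (; word_embedding, bias)
end

# Decoder unpacks the tuple produced by the previous layer.
function (decoder::Decoder)((x, importance_weights)::Tuple, ps, st::NamedTuple)
    x1 = importance_weights .* x
    return ps.word_embedding * x1 .+ ps.bias, st
end

model = Chain(ImportanceScaling(10), Decoder(10, 10))
ps, st = Lux.setup(Xoshiro(0), model)
y, st_new = model(rand(Float32, 10, 4), ps, st)
```

The drawback is that `Decoder` now only accepts input from a layer that emits this exact tuple, which couples the two layers through their interface.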

4 replies

AzamatB Sep 19, 2024
Author

Thank you so much for your response.
And what about Lux.Experimental.share_parameters?
It seems like doing something like

model = Chain(ImportanceScaling(10), Decoder(10, 10))

ps, st = Lux.setup(Xoshiro(0), model);

ps = Lux.Experimental.share_parameters(ps, (("layer_1.importance_weights", "layer_2.importance_weights"),))

after adding importance_weights as a training parameter to Decoder also achieves what I want here, no?

If both options work, which one is preferable: AbstractLuxContainerLayer or share_parameters(..)?

avik-pal Sep 19, 2024
Maintainer

Yes, this would work if you are okay with having importance_weights in the decoder params. For share_parameters to work correctly, you will want to initialize the decoder with its own importance_weights and later link them as you did.

One pointer to make your debugging easier (if needed): construct the Chain as Chain(; importance_scaling=ImportanceScaling(10), decoder=Decoder(10, 10)). The sharing then becomes Lux.Experimental.share_parameters(ps, (("importance_scaling.importance_weights", "decoder.importance_weights"),))
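Putting the two suggestions together, a sketch of the named-Chain variant could look like this. The `ones` placeholder for the decoder's copy of `importance_weights` and the zero bias are assumptions; the placeholder only needs to exist so that there is something to link.

```julia
using Lux, Random

struct ImportanceScaling <: Lux.AbstractLuxLayer
    dim::Int
end

function Lux.initialparameters(rng::AbstractRNG, layer::ImportanceScaling)
    return (; importance_weights = rand(rng, Float32, layer.dim))
end

function (layer::ImportanceScaling)(x::AbstractVecOrMat, ps, st::NamedTuple)
    return x .* ps.importance_weights, st
end

struct Decoder <: Lux.AbstractLuxLayer
    dim_encoding::Int
    dim_decoding::Int
end

function Lux.initialparameters(rng::AbstractRNG, decoder::Decoder)
    word_embedding = kaiming_uniform(rng, Float32, decoder.dim_decoding, decoder.dim_encoding)
    bias = zeros(Float32, decoder.dim_decoding)  # assumption: zero bias for brevity
    # Placeholder that is overwritten by the shared parameter after setup.
    importance_weights = ones(Float32, decoder.dim_encoding)
    return (; word_embedding, bias, importance_weights)
end

# Decoder reads only its own parameters; one of them happens to be shared.
function (decoder::Decoder)(x::AbstractVecOrMat, ps, st::NamedTuple)
    x1 = ps.importance_weights .* x
    return ps.word_embedding * x1 .+ ps.bias, st
end

model = Chain(; importance_scaling = ImportanceScaling(10), decoder = Decoder(10, 10))
ps, st = Lux.setup(Xoshiro(0), model)

# share_parameters returns the updated parameters, so keep the result.
ps = Lux.Experimental.share_parameters(
    ps, (("importance_scaling.importance_weights", "decoder.importance_weights"),))

y, st_new = model(rand(Float32, 10, 4), ps, st)
```

Naming the layers makes the sharing paths readable and keeps them valid if layers are later inserted into the Chain, whereas the auto-generated `layer_1`/`layer_2` names shift with position.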

Answer selected by AzamatB

AzamatB Sep 19, 2024
Author

Thank you so much @avik-pal for your work and for your answers! 👍🏼

P. S. Doesn't the parameter sharing scheme above break the assumption that

layer[I] cannot interact with layer[I - 1]/layer[I + 1] in any form other than the outputs it generates

?

avik-pal Sep 19, 2024
Maintainer

No, because you are only using information that belongs to the current layer. It happens to alias a parameter from a previous layer, but that is inconsequential.
