Add weight support for LigerCrossEntropy #420
base: main
Conversation
Thanks for taking care of this! Had a few minor suggestions.
Another TODO, based on the original paper linked in the original issue for this feature: we also need to support a sample-level weight, i.e. a weight that can be applied to each element of the batch. If we have logits of shape (B, S, V), we'd have sample-level weights of shape (B,). This is what's proposed in the C-RLFT paper: https://arxiv.org/abs/2309.11235
Feel free to push to this branch or even take over it and open a new PR; I won't be able to update that often in the next few months. Just trying to make the first step when I have time.
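For illustration, a minimal sketch (plain PyTorch, not the Triton kernel) of what sample-level weighting could look like on top of a per-token loss; the shapes and the `sample_weight` name are assumptions, and `ignore_index` handling is omitted for brevity:

```python
import torch
import torch.nn.functional as F

B, S, V = 2, 4, 10
logits = torch.randn(B, S, V)
target = torch.randint(0, V, (B, S))
sample_weight = torch.rand(B)  # one weight per sequence in the batch, shape (B,)

# Per-token CE, kept unreduced so the sample-level weight can be applied afterwards.
per_token = F.cross_entropy(
    logits.view(-1, V), target.view(-1), reduction="none"
).view(B, S)

# Scale every token of a sample by that sample's weight, then normalize by the total weight.
loss = (per_token * sample_weight.unsqueeze(1)).sum() / (sample_weight.sum() * S)
print(loss)
```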
```python
        (1.0, torch.float32, 1e-8, 1e-6),
    ],
)
def test_correctness_with_weight_with_other_params_once(
```
This test couldn't pass somehow. I might be missing something.
So, the issue seems to be with combining label_smoothing with weighted loss. I've been staring at the code and equations for a while now but I can't pinpoint anything that's wrong. Simply multiplying the final loss with the weight of the label token seems like the right thing to do to me.
If not, there can only be an issue with the scaled_x_sum term, since all the other terms in the smoothed loss are also part of the plain CE loss, which we know works correctly.
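For what it's worth, the combination in question can be checked against PyTorch directly; any discrepancy in the kernel should show up against this baseline (illustrative shapes only):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
N, V = 16, 32                      # flattened tokens, vocab size
logits = torch.randn(N, V)
target = torch.randint(0, V, (N,))
ce_weight = torch.rand(V)          # per-class weight

# PyTorch's ground truth for weighted CE combined with label smoothing.
ref = F.cross_entropy(
    logits, target, weight=ce_weight, label_smoothing=0.1, reduction="mean"
)
print(ref)
```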
Figuring out where it doesn't work is a big help! I'll take a look on Saturday.
Gotcha! I'll try wrapping it up, you've done most of the heavy lifting already.
I took a look at torch's impl, and here's how they compute it:

```cpp
if (weight.defined()) {
  // Expand weight to the correct number of dims for broadcasting with input / target
  auto weight_broadcast_shape = SmallBuffer<int64_t, 5>(input.dim());
  std::fill(weight_broadcast_shape.begin(), weight_broadcast_shape.end(), 1);
  weight_broadcast_shape[class_dim] = weight.size(0);
  Tensor weight_ = weight.view(weight_broadcast_shape);
  smooth_loss = -(input * weight_).sum(class_dim);
}
```

Related code blocks in Liger (Liger-Kernel/src/liger_kernel/ops/cross_entropy.py, lines 194 to 196 at commit bd65c47):
```python
selected_weight = torch.where(
    target_mask, torch.gather(weight, dim=0, index=target * target_mask), 0.0
)
sum_of_non_ignore_weight = selected_weight.sum().item()
```
We can rewrite it with torch.masked_select:

```python
sum_of_non_ignore_weight = (
    torch.gather(weight, dim=0, index=target.masked_select(target_mask))
    .sum()
    .item()
)
```

Refer to torch's impl mentioned above.
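A quick sanity check (illustrative shapes and values) that the `masked_select` version computes the same sum as the `torch.where` + `gather` version above:

```python
import torch

torch.manual_seed(0)
V, N, ignore_index = 8, 32, -100
weight = torch.rand(V)
target = torch.randint(0, V, (N,))
target[::4] = ignore_index          # sprinkle in some ignored positions
target_mask = target != ignore_index

a = torch.where(
    target_mask, torch.gather(weight, dim=0, index=target * target_mask), 0.0
).sum()
b = torch.gather(weight, dim=0, index=target.masked_select(target_mask)).sum()
assert torch.allclose(a, b)
```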
@pramodith anything I can do to help with this PR?
Hey @winglian I won't be able to look into this any further, feel free to take over and see if you can figure out the source of the discrepancy. The tests fail when combining label smoothing with weighted CE.
I'll make another PR for sample-level weight.
Hi @Tcc0403, thanks for your wonderful work. I left some of my thoughts on this PR, PTAL.
```python
if not HAS_WEIGHT:
    # softmax(x_i)
    X_block = tl.exp(X_block - m) / d
    # derivative of z-loss: 2 * lse_square_scale * lse * softmax(x_i)
    X_block += 2 * lse_square_scale * lse * X_block
    # smoothing term
    X_block += -eps
    # special handle dx_y
    X_block = tl.where(X_offsets != y, X_block, X_block - (1 - label_smoothing))
    # reduction scale
    if reduction == "mean":
        X_block = X_block / n_non_ignore
else:
    weight_block = tl.load(weight_ptr + X_offsets, mask=X_offsets < n_cols)
    softmax_X = tl.exp(X_block - m) / d
    # derivative of original_loss
    dloss_ori = (1 - label_smoothing) * softmax_X
    # specially handle dx_y
    dloss_ori = tl.where(
        X_offsets != y, dloss_ori, dloss_ori - (1 - label_smoothing)
    )
    dloss_ori = dloss_ori * weight_y
    # derivative of smooth_loss
    dloss_smooth = eps * (-weight_block + softmax_X * weight_sum)
    # derivative of z-loss
    dz_loss = 2 * lse_square_scale * lse * softmax_X
    # reduction scale
    if reduction == "mean":
        dloss_ori = dloss_ori / sum_non_ignore_weight
        dloss_smooth = dloss_smooth / sum_non_ignore_weight
        dz_loss = dz_loss / n_non_ignore
    # derivative of total_loss
    X_block = dloss_ori + dloss_smooth + dz_loss
```
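In case it helps with debugging the `HAS_WEIGHT` branch, the analytic gradients can be compared against autograd on the PyTorch reference; this sketch ignores z-loss and softcap, which the reference doesn't cover, and uses illustrative shapes:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
N, V = 8, 16
x = torch.randn(N, V, requires_grad=True)
target = torch.randint(0, V, (N,))
ce_weight = torch.rand(V)

loss = F.cross_entropy(x, target, weight=ce_weight, label_smoothing=0.1, reduction="mean")
loss.backward()
# x.grad is the reference dX that the in-place X_block update should reproduce
# (modulo the z-loss term, which is handled separately in the kernel).
print(x.grad[0])
```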
I think it would be better to use `if HAS_WEIGHT` instead of `if not HAS_WEIGHT`, to align with all the other behaviors in this change.
I thought putting the base case (no weight) first would be better, but I'll consider it, thank you.
Oh, I see your point. Never mind, I was being a little too nitpicky.
if reduction == "mean": | ||
if HAS_WEIGHT: | ||
loss = loss / sum_non_ignore_weight | ||
else: | ||
loss = loss / n_non_ignore | ||
z_loss = z_loss / n_non_ignore | ||
loss = loss / n_non_ignore | ||
loss += z_loss |
Could you clarify the change here? I was wondering whether there's a missing part for z_loss. I am not quite sure whether z_loss should be affected by weights or not.
z_loss isn't scaled by weight right now, so it is divided by the number of non-ignored tokens, unlike the rest of the loss, which is divided by the sum of weights when weight exists.
That's why I have to do the divisions first before summing them.
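In other words, for mean reduction the two terms use different denominators before they are added; a runnable sketch of the intended formula (values are made up for illustration):

```python
import torch

# Illustrative values only.
weighted_ce_sum = torch.tensor(7.5)       # sum of per-token CE, each already scaled by its class weight
z_loss_sum = torch.tensor(0.3)            # sum of per-token z-loss, not scaled by weights
sum_non_ignore_weight = torch.tensor(5.2) # sum of weights of non-ignored targets
n_non_ignore = 10                         # count of non-ignored tokens

# Each term gets its own denominator before they are summed,
# which is why the division happens before `loss += z_loss`.
total = weighted_ce_sum / sum_non_ignore_weight + z_loss_sum / n_non_ignore
print(total)
```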
```diff
 # NOTE: skip .item() here to avoid CUDA synchronization
-total_n_non_ignore = (target != ignore_index).sum()
+target_mask = target != ignore_index
+total_n_non_ignore = target_mask.sum().item()
```
I noticed the comment above about avoiding .item() due to the synchronization issue. Will this change still align with that behavior?
Forgot to remove the comment; it doesn't affect the result.
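For context, a small illustration of the trade-off being discussed: keeping the count as a tensor stays asynchronous, while `.item()` copies to the host and forces a synchronization with the GPU stream (it only affects timing, not the result):

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
ignore_index = -100
target = torch.randint(0, 10, (64,), device=device)
target[::5] = ignore_index
target_mask = target != ignore_index

count_tensor = target_mask.sum()         # stays on device; no host<->device sync
count_scalar = target_mask.sum().item()  # copies to host and synchronizes with the device
print(count_tensor, count_scalar)
```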
```diff
-    _input,
-    weight,
-    target,
-    bias,
-    ignore_index,
-    lse_square_scale,
-    label_smoothing,
-    reduction,
-    softcap,
+    _input=_input,
+    weight=weight,
+    target=target,
+    bias=bias,
+    ce_weight=ce_weight,
+    ignore_index=ignore_index,
+    lse_square_scale=lse_square_scale,
+    label_smoothing=label_smoothing,
+    reduction=reduction,
+    softcap=softcap,
```
I'll update it on Saturday. Thanks for your review.
Summary
Resolve #404.
Note: the current implementation doesn't weight the z-loss.
Reference: PyTorch's CrossEntropyLoss
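A hedged usage sketch of how the new option could be validated against the PyTorch reference; the Liger import and keyword are assumptions based on this PR's diff, so check the merged API:

```python
import torch

V = 32
logits = torch.randn(16, V)
target = torch.randint(0, V, (16,))
class_weight = torch.rand(V)

# Reference: PyTorch's CrossEntropyLoss with per-class weights.
torch_ce = torch.nn.CrossEntropyLoss(weight=class_weight)
ref = torch_ce(logits, target)

# Hypothetical Liger usage (exact keyword may be `weight` or `ce_weight`; verify against the merged API):
# from liger_kernel.transformers import LigerCrossEntropyLoss
# liger_ce = LigerCrossEntropyLoss(weight=class_weight)
# assert torch.allclose(liger_ce(logits, target), ref, atol=1e-6)
print(ref)
```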
Testing Done
It hasn't been fully tested with other params.
- `make test` to ensure correctness
- `make checkstyle` to ensure code style
- `make test-convergence` to ensure convergence