Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug fixing and patching #807

Closed
wants to merge 4 commits into from
Closed

Conversation

vietanh125
Copy link

@vietanh125 vietanh125 commented Oct 31, 2023

In this commit:

  • ChatML bug that leads to double <|im_end|> at the end has been fixed.
  • Model card (REAME) showing "None dataset" has been fixed.
  • Added input prompt printing before training for debugging purposes.

@vietanh125 vietanh125 closed this Oct 31, 2023
@vietanh125 vietanh125 reopened this Oct 31, 2023
@casper-hansen
Copy link
Collaborator

Could you add an example of the prompt format before and after?

@vietanh125
Copy link
Author

Sure. For ChatML format, I set the special tokens as follow

special_tokens:
    bos_token: "<s>"
    eos_token: "<|im_end|>"
    pad_token: "</s>"
    unk_token: "<unk>"

tokens:
    - "<|im_start|>"
    - "<|im_end|>"

and it results to an extra <|im_end|> before each user turn like this:
image

The issue is solved after setting add_eos_token=False on ShareGPTPromptTokenizingStrategy
image

@winglian
Copy link
Collaborator

winglian commented Nov 2, 2023

hi @vietanh125 I can't edit this PR directly, but if you can run the pre-commit hooks, that would be great! pre-commit run --all-files thanks!

@vietanh125
Copy link
Author

I ran it, there were some errors in pylint, mypy, and bandit but everything has passed when pushing.

@vietanh125
Copy link
Author

I will close this PR and open one with more details about the issues.

@vietanh125 vietanh125 closed this Nov 2, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants