Add precheck scoring functionality #356
base: main
Conversation
Signed-off-by: greg pereira <[email protected]>
Signed-off-by: greg pereira <[email protected]>
@Gregory-Pereira Isn't BAM an IBM internal-only service? If so, I am concerned that you may be leaking company-confidential details in this PR. I would also note that separating founding-participant corporate matters from open source project community operations is quite difficult when a community is very young - InstructLab is our 13-week-old puppy as of next Monday, May 20, 2024. However, it is not best practice, nor highly awesome, for IBM or Red Hat - or any company, for that matter - to mention internal-only matters as part of our operations in this open source community space. We InstructLab participants want to be both highly awesome and excellent stewards of this community, so everyone feels welcomed and encouraged to join us as participants and leaders. Perhaps we could have this discussion elsewhere or reconsider how we frame the PR text in the future. Yours in kindness, LH
Apologies @lhawthorn, still getting used to this new paradigm. Thank you for your kind and swift response. Hopefully not much is impacted, as I only referenced the service name and did not provide any details about what it is, how it works, or how to use it.
I'm not sure what was here originally, but it's no secret that this bot uses model endpoint APIs hosted on IBM infrastructure.
Signed-off-by: greg pereira <[email protected]>
this change allows us to use different model names for the precheckEndpoint and precheckScoringEndpoint Signed-off-by: greg pereira <[email protected]>
force-pushed from ca19ae6 to e20843c
Signed-off-by: greg pereira <[email protected]>
RFR cc @nerdalert @vishnoianil
update scoring prompt Signed-off-by: Mingxuan Zhao <[email protected]>
Signed-off-by: Mingxuan Zhao <[email protected]>
I am wondering if we should expose the precheck-score command to the user? I am assuming that the precheck and precheck scoring endpoints are two separate endpoints to hit. @Gregory-Pereira @mingxzhao
I believe the end goal is to tie the precheck score directly to the original precheck. With the current process, scoring is still more expensive than precheck, so if precheck is not open to users, precheck score likely wouldn't be either.
Theoretically, precheck and precheck score hit endpoints that are equally expensive.
I agree with Ming's stance on this. A middle ground I would find acceptable (putting aside the implementation details for now) would be allowing users to select an individual human-answer / endpoint-answer pairing and score that, as sketched below. It feels a bit heavy-handed to give a user access to running the batch scoring of all the seed data.
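To make the single-pair idea concrete, here is a minimal sketch of what scoring one pairing could look like. The function name `scorePair`, the package placement, the prompt wording, and the OpenAI-style `/chat/completions` request shape are all illustrative assumptions, not this PR's actual client code:

```go
package worker

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// scorePair scores a single question / human-answer / model-answer triple
// against a scoring endpoint. The request shape assumes an OpenAI-compatible
// chat completions API; adjust to whatever the bot's client actually uses.
func scorePair(scoringURL, model, question, humanAnswer, endpointAnswer string) (string, error) {
	prompt := fmt.Sprintf(
		"Question: %s\nHuman answer: %s\nModel answer: %s\n"+
			"Briefly summarize how close the model answer is to the human answer.",
		question, humanAnswer, endpointAnswer)

	reqBody, err := json.Marshal(map[string]interface{}{
		"model":    model,
		"messages": []map[string]string{{"role": "user", "content": prompt}},
	})
	if err != nil {
		return "", err
	}

	resp, err := http.Post(scoringURL+"/chat/completions", "application/json", bytes.NewReader(reqBody))
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()

	// Decode only the fields we need from the completion response.
	var out struct {
		Choices []struct {
			Message struct {
				Content string `json:"content"`
			} `json:"message"`
		} `json:"choices"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return "", err
	}
	if len(out.Choices) == 0 {
		return "", fmt.Errorf("scoring endpoint returned no choices")
	}
	return out.Choices[0].Message.Content, nil
}
```

Exposing something like this behind a per-pair bot command would keep the expensive batch path admin-only while still letting users spot-check a single answer.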
Signed-off-by: greg pereira <[email protected]>
Signed-off-by: greg pereira <[email protected]>
force-pushed from 2506bfa to ac9db5c
Swap question target location Signed-off-by: Mingxuan Zhao <[email protected]>
typo Signed-off-by: Mingxuan Zhao <[email protected]>
prompt update Signed-off-by: Mingxuan Zhao <[email protected]>
Signed-off-by: Mingxuan Zhao <[email protected]>
lint error Signed-off-by: Mingxuan Zhao <[email protected]>
Current status for provenance: functionality is there. At this point we're prompt engineering for performance.
Overall it looks good to me. But I am a bit skeptical about the overall user experience, so let's narrow that down. For example, "@instructlab-bot precheck" will give you scoring in the YAML output if the "scoring" endpoint is set, otherwise it won't. That is two different outputs for the same "precheck" command depending on backend configuration, which the user might not be aware of (because it's the backend configuration that we admins set up). So it seems like we are intermingling two separate things here.
We could always set both the precheck-endpoint and the scoring-endpoint, in which case precheck will always give you scoring, which sounds a bit better. But if that's the case, then I think we should just make the assumption that precheck always gives a score and require that both the precheck-endpoint and the scoring-endpoint be configured for the backend.
The other option is to add a new "@instructlab-bot precheck-score" command, which gives the user scoring as well. I think either of these options is probably less confusing. WDYT? Happy to hear any other ideas as well.
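If the project settles on the "precheck always gives a score" option, the fail-fast validation might look like the fragment below. This is a sketch assuming the cobra command setup this repo already uses; `PreRunE` is cobra's standard pre-run hook, and the flag and variable names are taken from this PR:

```go
// Sketch (in the init() that registers these flags): if precheck always
// scores, fail fast when either endpoint is missing rather than silently
// changing the output shape.
generateCmd.PreRunE = func(cmd *cobra.Command, args []string) error {
	if PreCheckEndpointURL == "" || PreCheckScoringEndpointURL == "" {
		return fmt.Errorf("both --precheck-endpoint-url and --precheck-scoring-endpoint-url must be set")
	}
	return nil
}
```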
@@ -118,6 +121,7 @@ func init() {
	generateCmd.Flags().StringVarP(&WorkDir, "work-dir", "w", "", "Directory to work in")
	generateCmd.Flags().StringVarP(&VenvDir, "venv-dir", "v", "", "The virtual environment directory")
	generateCmd.Flags().StringVarP(&PreCheckEndpointURL, "precheck-endpoint-url", "e", "http://localhost:8000/v1", "Endpoint hosting the model API. Default, it assumes the model is served locally.")
	generateCmd.Flags().StringVarP(&PreCheckScoringEndpointURL, "precheck-scoring-endpoint-url", "", PreCheckEndpointURL, "Endpoint hosting the model API that will be scoring the output of precheck against the answers supplied in the PR. Default, it assumes the model is the same as precheck model and is served locally.")
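One subtlety worth noting about this line: `PreCheckEndpointURL` is evaluated when `init()` runs, so the scoring flag's default is the compile-time default (`http://localhost:8000/v1`), not whatever the user later passes via `--precheck-endpoint-url`. A sketch of one way to resolve the fallback after flags are parsed (`cmd` is assumed to be the cobra command inside its `Run`/`RunE`):

```go
// Sketch: after flag parsing, fall back to the user-supplied precheck
// endpoint when no scoring endpoint was given, instead of relying on the
// init()-time default captured at flag registration.
if !cmd.Flags().Changed("precheck-scoring-endpoint-url") {
	PreCheckScoringEndpointURL = PreCheckEndpointURL
}
```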
Question for my own understanding: if the user doesn't provide the ScoringEndpoint and it ends up using the default PrecheckEndpoint, who does the scoring?
You nailed it. A model will do the scoring, whether the same model as precheck or a different one. Note, by scoring we do not mean anything used in training, just a summary of whether the model was close to the human answer, for triager convenience.
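For readers following along, the scoring prompt is roughly of the shape sketched below. This is illustrative only, since the actual template has been tuned across several commits in this PR; the signature mirrors the call site in the diff that follows (error first):

```go
// Illustrative sketch of the scoring prompt builder; the real template in
// this PR differs. Argument order matches the call site below.
func generatePrecheckScoringPrompt(prAnswer, endpointAnswer, question string) (error, string) {
	if question == "" || prAnswer == "" || endpointAnswer == "" {
		return fmt.Errorf("scoring prompt requires a question and both answers"), ""
	}
	prompt := fmt.Sprintf(
		"Given the question:\n%s\n\nReference (human) answer:\n%s\n\nModel answer:\n%s\n\n"+
			"Briefly summarize how closely the model answer matches the reference answer.",
		question, prAnswer, endpointAnswer)
	return nil, prompt
}
```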
for i := 0; i < len(precheckPRAnswers); i++ {
	err, promptTemplate := generatePrecheckScoringPrompt(precheckPRAnswers[i], precheckEndpointAnswers[i], precheckPRQuestions[i])
	if err != nil {
		w.logger.Errorf("Failed to generate a prompt for precheck scorring: %v", err)
scorring -> scoring
@@ -378,6 +472,12 @@ func (w *Worker) runPrecheck(lab, outputDir, modelName string) error {
	w.logger.Error("Question not found or not a string")
	continue
}
answer, ok := example["answer"].(string)
if !ok {
	w.logger.Error("Question not found or not a string")
Need to fix the log?
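Presumably the fix is a one-word change to the message that was copied from the question branch above, along the lines of:

```go
answer, ok := example["answer"].(string)
if !ok {
	// was: "Question not found or not a string", copied from the question branch
	w.logger.Error("Answer not found or not a string")
	continue
}
```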
@@ -450,7 +553,8 @@ func (w *Worker) runPrecheck(lab, outputDir, modelName string) error {
		time.Sleep(1 * time.Second)
	}
}
return nil
return nil, precheckPRAnswers, precheckEndpointAnswers, precheckPRQuestions
// return nil, precheckPRAnswers, precheckPRQuestions
Suggested change: remove the commented-out `// return nil, precheckPRAnswers, precheckPRQuestions` line.
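As an aside on the new signature: Go convention puts the error value last, so a follow-up could reshape the multi-value return along these lines (a sketch only; the body stands in for the existing precheck loop):

```go
// Sketch: error-last return values, per Go convention, instead of
// (error, answers, endpointAnswers, questions).
func (w *Worker) runPrecheck(lab, outputDir, modelName string) ([]string, []string, []string, error) {
	var prAnswers, endpointAnswers, prQuestions []string
	// ... existing precheck loop appends to the three slices ...
	return prAnswers, endpointAnswers, prQuestions, nil
}
```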
I understand the cost aspect. But I think the user experience is a bit confusing the way it's currently implemented; I added some more details about that in the comment above. On the question of single-pair vs. batch scoring, I think we just need to see how a user can actually do that from a GitHub comment. Overall the intention is to keep the end-to-end gobot UX simple and less confusing.
Signed-off-by: Mingxuan Zhao <[email protected]>
Signed-off-by: Mingxuan Zhao <[email protected]>
Signed-off-by: Mingxuan Zhao <[email protected]>
Draft implementation of scoring answers based on LAB: LARGE-SCALE ALIGNMENT FOR CHATBOTS.
/cc @mingxzhao @nerdalert @vishnoianil