
[DEPRIORITIZED][AAQ-765] Retry LLM generation when AlignScore fails #399

Open · wants to merge 15 commits into main

Conversation

@lickem22 (Contributor) commented Aug 19, 2024

Reviewer: @amiraliemami
Estimate: 40 mins

Ticket

Fixes: AAQ-675

Description

Goal

The goal of this PR is to allow retrying LLM response generation up to N times (default is 0) when AlignScore fails because of a low score.

Changes

The following changes have been made:

  • Added the backoff library dependency
  • Updated the endpoint to retry when the response is of type QueryResponseError and the error is a low alignment score. Also added the previous failure reason to response.debug_info["past_failure"]

Future Tasks (optional)

How has this been tested?

Testing this is tricky because, for the change to be observed, we need the LLM response to succeed while AlignScore fails, and such cases are not straightforward to produce.
It was tested in two ways.
The first way:

  • Set ALIGN_SCORE_THRESHOLD to an unrealistic score (e.g. 1.5) so that AlignScore always fails.
  • Run /search with generate_llm_response set to true, with a content and a question relevant to that content (a request sketch is shown after this list). An example is content: "Here we are going to talk about pineapples because of their pine shape and their apple-like taste"
    and question: "Are apples related to pineapples?"
  • Make sure the LLM response is run twice (by checking the logs) when ALIGN_SCORE_N_RETRIES is set to the default value (1).
  • Make sure debug_info["past_failure"] is in the returned response.

The second way was to still set ALIGN_SCORE_THRESHOLD to a value > 1, but add logic in the code that reduces ALIGN_SCORE_THRESHOLD to a reasonable value every time the LLM response is regenerated, so that the AlignScore check passes after the second retry; on the second run, ALIGN_SCORE_THRESHOLD should therefore be below 0.8. This approach is not straightforward, and I am open to more efficient ways of testing this feature.
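For reference, the first approach can be reproduced with a request along these lines (a rough sketch only; the /search payload field names, port, and auth header are assumptions, not the exact API contract):

    import requests

    # Hypothetical request shape for the manual test; field names and auth are
    # assumptions and may differ from the actual /search API.
    payload = {
        "query_text": "Are apples related to pineapples?",
        "generate_llm_response": True,
    }
    resp = requests.post(
        "http://localhost:8000/search",
        json=payload,
        headers={"Authorization": "Bearer <API_KEY>"},
        timeout=60,
    )
    data = resp.json()
    # With ALIGN_SCORE_THRESHOLD set to an impossible value (e.g. 1.5) and
    # ALIGN_SCORE_N_RETRIES=1, the logs should show two LLM generations and the
    # response should contain debug_info["past_failure"].
    print(data.get("debug_info", {}).get("past_failure"))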

Checklist

Fill with x for completed.

  • My code follows the style guidelines of this project
  • I have reviewed my own code to ensure good quality
  • I have tested the functionality of my code to ensure it works as intended
  • I have resolved merge conflicts

(Delete any items below that are not relevant)

  • I have updated the automated tests
  • I have updated the scripts in scripts/
  • I have updated the requirements
  • I have updated the README file
  • I have updated affected documentation
  • I have added a blogpost in Latest Updates
  • I have updated the CI/CD scripts in .github/workflows/
  • I have updated the Terraform code

asession=asession,
exclude_archived=True,
)
response.debug_info["past_failure"] = failure_reason

lickem22 (Contributor Author):

Added this so that, if it works after the second try, we can still understand why it failed the first time.

@@ -307,6 +320,39 @@ async def search_base(
return response


def is_unable_to_generate_response(response: QueryResponse) -> bool:

lickem22 (Contributor Author):

Added this function so that we retry only if that condition is met.
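For context, a minimal sketch of what this predicate could look like (the error_type field and its value are illustrative assumptions; the actual check in the diff may differ):

    def is_unable_to_generate_response(response: QueryResponse) -> bool:
        """Return True when the response is an error caused by a low alignment
        score, i.e. the case in which LLM generation should be retried."""
        # "error_type" and "low_alignment_score" are illustrative names only.
        return (
            isinstance(response, QueryResponseError)
            and getattr(response, "error_type", None) == "low_alignment_score"
        )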



@backoff.on_predicate(
backoff.expo,

lickem22 (Contributor Author):

What backoff.expo does is basically wait a little longer, in an exponential way, every time the function is rerun, just to handle the load better.
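As a rough sketch of the wiring (ALIGN_SCORE_N_RETRIES and is_unable_to_generate_response come from this PR; the max_tries handling and the wrapped function name are assumptions):

    import backoff

    # Retry while the predicate returns True for the return value; backoff.expo
    # waits roughly 1s, 2s, 4s, ... (plus jitter) between attempts.
    @backoff.on_predicate(
        backoff.expo,
        is_unable_to_generate_response,
        max_tries=ALIGN_SCORE_N_RETRIES + 1,  # one initial attempt plus N retries
    )
    async def generate_response_with_check(query, response):
        # Generate the LLM answer and run the AlignScore check; return the
        # (possibly error) QueryResponse so the predicate can decide on a retry.
        ...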

Collaborator:

Should we just have logic that retries once, instead of adding a config (num retries) that we don't know if we'll use 🤔?

lickem22 (Contributor Author):

I guess it depends on how useful the approach is, since we haven't done any analysis to see how well it works.
But personally, I think that since it doesn't add a dependency (backoff is already used by litellm), and since the only code we would change to retry just once is the decorator and the config variable, the cost is pretty low, so we can just keep it.

@@ -177,6 +177,11 @@ def get_prompt(cls) -> str:
You are a helpful question-answering AI. You understand user question and answer their \
question using the REFERENCE TEXT below.
"""
RETRY_PROMPT_SUFFIX = """

lickem22 (Contributor Author):

Added a suffix to the prompt to incorporate the failure reason.
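Roughly, the suffix would be appended like this (a sketch only; the exact call site and attribute access are assumptions, though RAG.prompt, RETRY_PROMPT_SUFFIX, and metadata["failure_reason"] appear in the diff):

    # Sketch: build the base RAG prompt, then append the retry suffix when a
    # previous attempt failed the AlignScore check.
    prompt = RAG.prompt.format(context=context, original_language=original_language)
    if metadata.get("failure_reason"):
        prompt += RETRY_PROMPT_SUFFIX.format(failure_reason=metadata["failure_reason"])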

@suzinyou (Collaborator) left a comment:

Thanks Carlos! I think we need to think about and validate the prompt a bit. Should we discuss it in the next tech session?

@@ -37,7 +37,14 @@ async def get_llm_rag_answer(
"""

metadata = metadata or {}
prompt = RAG.prompt.format(context=context, original_language=original_language)
if "failure_reason" in metadata and metadata["failure_reason"]:

Collaborator:

How about we create a new arg, "retry=False"?

Collaborator:

The downsides are:

  1. We would have to create it for all the parent functions, and
  2. We need both is_retry and metadata["failure_reason"] to actually do the retry.

But I think it would make the code easier to understand, and we won't be hiding any unexpected actions! What do you think?

Something like

    if is_retry:
        if "failure_reason" not in metadata:
            raise ValueError("failure_reason is required for retry requests")
        
        prompt = RAG.retry_prompt.format(
            context=context,
            original_language=original_language,
            failure_reason=metadata["failure_reason"],
        )

lickem22 (Contributor Author):

My initial understanding was that we are using this to try out the functionality. What if we keep it like this while testing, and if it turns out to be something we want to keep, then we will explicitly set it up as a feature by adding the is_retry parameter. What do you think?

Comment on lines +180 to +184
RETRY_PROMPT_SUFFIX = """
If the response above is not aligned with the question, please rectify this by \
considering the following reason(s) for misalignment: "{failure_reason}".
Make necessary adjustments to ensure the answer is aligned with the question.
"""

Collaborator:

Right now, we are only passing failure_reason, which is response.debug_info["factual_consistency"]["reason"], but we should also include the LLM response in this prompt.

Collaborator:

Also, shouldn't the prompt define what we mean by alignment?

lickem22 (Contributor Author):

That makes sense. To be honest, I was just having a go at updating the prompt to take the output into consideration. I am not exactly an expert in prompt engineering. Should we discuss that in a tech session?




@lickem22 changed the title from "[AAQ-765] Retry LLM generation when AlignScore fails" to "[DEPRIORITIZED][AAQ-765] Retry LLM generation when AlignScore fails" on Nov 6, 2024