
[DEPRIORITIZED][AAQ-765] Retry LLM generation when AlignScore fails #399

Open · wants to merge 15 commits into main

Conversation

@lickem22 (Contributor) commented Aug 19, 2024

Reviewer: @amiraliemami
Estimate: 40 mins

Ticket

Fixes: AAQ-675

Description

Goal

The goal of this PR is to allow retrying LLM response generation up to N times (default is 0) when AlignScore fails because of a low score.

Changes

The following changes have been made:

  • Added the backoff library dependency
  • Updated the endpoint to retry when the response is of type QueryResponseError and the error is a low alignment score. Also added the previous failure reason to response.debug_info["past_failure"]

Future Tasks (optional)

How has this been tested?

Testing this is tricky because, for the change to be observed, we need the LLM response to succeed while AlignScore fails, and such cases are not straightforward to produce.
It was tested in two ways.
The first way:

  • Set ALIGN_SCORE_THRESHOLD to an unrealistic score (e.g. 1.5) so that AlignScore always fails.
  • Run /search with generate_llm_response set to true, with a content and a question relevant to that content (a request sketch is shown after this list). An example is content: "Here we are going to talk about pineapples because of their pine shape and their apple-like taste"
    and question: "Are apples related to pineapples?"
  • Make sure the LLM response is run twice (by checking the logs) when ALIGN_SCORE_N_RETRIES is set to the default value (1).
  • Make sure debug_info["past_failure"] is in the returned response.

The second way was to still set ALIGN_SCORE_THRESHOLD to a value > 1, but add logic in the code that reduces ALIGN_SCORE_THRESHOLD to a reasonable value every time the LLM response is regenerated, so that the AlignScore check passes after the second retry; on the second run, ALIGN_SCORE_THRESHOLD should therefore be below 0.8. This approach is not straightforward, and I am open to more efficient ways of testing this feature.
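For reference, the first approach can be reproduced with a request along these lines (a rough sketch only; the /search payload field names, port, and auth header are assumptions, not the exact API contract):

    import requests

    # Hypothetical request shape for the manual test; field names and auth are
    # assumptions and may differ from the actual /search API.
    payload = {
        "query_text": "Are apples related to pineapples?",
        "generate_llm_response": True,
    }
    resp = requests.post(
        "http://localhost:8000/search",
        json=payload,
        headers={"Authorization": "Bearer <API_KEY>"},
        timeout=60,
    )
    data = resp.json()
    # With ALIGN_SCORE_THRESHOLD set to an impossible value (e.g. 1.5) and
    # ALIGN_SCORE_N_RETRIES=1, the logs should show two LLM generations and the
    # response should contain debug_info["past_failure"].
    print(data.get("debug_info", {}).get("past_failure"))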

Checklist

Fill with x for completed.

  • My code follows the style guidelines of this project
  • I have reviewed my own code to ensure good quality
  • I have tested the functionality of my code to ensure it works as intended
  • I have resolved merge conflicts

(Delete any items below that are not relevant)

  • I have updated the automated tests
  • I have updated the scripts in scripts/
  • I have updated the requirements
  • I have updated the README file
  • I have updated affected documentation
  • I have added a blogpost in Latest Updates
  • I have updated the CI/CD scripts in .github/workflows/
  • I have updated the Terraform code

asession=asession,
exclude_archived=True,
)
response.debug_info["past_failure"] = failure_reason

lickem22 (Contributor Author):

Added this so that, if it works after the second try, we can still understand why it failed the first time.

@@ -307,6 +320,39 @@ async def search_base(
return response


def is_unable_to_generate_response(response: QueryResponse) -> bool:

lickem22 (Contributor Author):

Added this function so that we retry only if that condition is met.
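For context, a minimal sketch of what this predicate could look like (the error_type field and its value are illustrative assumptions; the actual check in the diff may differ):

    def is_unable_to_generate_response(response: QueryResponse) -> bool:
        """Return True when the response is an error caused by a low alignment
        score, i.e. the case in which LLM generation should be retried."""
        # "error_type" and "low_alignment_score" are illustrative names only.
        return (
            isinstance(response, QueryResponseError)
            and getattr(response, "error_type", None) == "low_alignment_score"
        )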



@backoff.on_predicate(
backoff.expo,

lickem22 (Contributor Author):

What backoff.expo does is basically wait a little longer, in an exponential way, every time the function is rerun, just to handle the load better.
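As a rough sketch of the wiring (ALIGN_SCORE_N_RETRIES and is_unable_to_generate_response come from this PR; the max_tries handling and the wrapped function name are assumptions):

    import backoff

    # Retry while the predicate returns True for the return value; backoff.expo
    # waits roughly 1s, 2s, 4s, ... (plus jitter) between attempts.
    @backoff.on_predicate(
        backoff.expo,
        is_unable_to_generate_response,
        max_tries=ALIGN_SCORE_N_RETRIES + 1,  # one initial attempt plus N retries
    )
    async def generate_response_with_check(query, response):
        # Generate the LLM answer and run the AlignScore check; return the
        # (possibly error) QueryResponse so the predicate can decide on a retry.
        ...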

Collaborator:

Should we just have logic that retries once, instead of adding a config (num retries) that we don't know if we'll use 🤔?

lickem22 (Contributor Author):

I guess it depends on how useful the approach is, since we haven't done any analysis to see how well it works.
But personally, I think that since it doesn't add a dependency (backoff is already used by litellm), and since the only code we would change to retry just once is the decorator and the config variable, the cost is pretty low, so we can just keep it.

@@ -177,6 +177,11 @@ def get_prompt(cls) -> str:
You are a helpful question-answering AI. You understand user question and answer their \
question using the REFERENCE TEXT below.
"""
RETRY_PROMPT_SUFFIX = """

lickem22 (Contributor Author):

Added a suffix to the prompt to incorporate the failure reason.
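Roughly, the suffix would be appended like this (a sketch only; the exact call site and attribute access are assumptions, though RAG.prompt, RETRY_PROMPT_SUFFIX, and metadata["failure_reason"] appear in the diff):

    # Sketch: build the base RAG prompt, then append the retry suffix when a
    # previous attempt failed the AlignScore check.
    prompt = RAG.prompt.format(context=context, original_language=original_language)
    if metadata.get("failure_reason"):
        prompt += RETRY_PROMPT_SUFFIX.format(failure_reason=metadata["failure_reason"])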

@suzinyou (Collaborator) left a comment:

Thanks Carlos! I think we need to think about and validate the prompt a bit. Should we discuss it in the next tech session?

@@ -37,7 +37,14 @@ async def get_llm_rag_answer(
"""

metadata = metadata or {}
prompt = RAG.prompt.format(context=context, original_language=original_language)
if "failure_reason" in metadata and metadata["failure_reason"]:

Collaborator:

How about we create a new arg, "retry=False"?

Collaborator:

The downsides are:

  1. We would have to create it for all the parent functions, and
  2. We need both is_retry and metadata["failure_reason"] to actually do the retry.

But I think it would make the code easier to understand, and we won't be hiding any unexpected actions! What do you think?

Something like

    if is_retry:
        if "failure_reason" not in metadata:
            raise ValueError("failure_reason is required for retry requests")
        
        prompt = RAG.retry_prompt.format(
            context=context,
            original_language=original_language,
            failure_reason=metadata["failure_reason"],
        )

lickem22 (Contributor Author):

My initial understanding was that we are using this to try out the functionality. What if we keep it like this while testing, and if it turns out to be something we want to keep, then we will explicitly set it up as a feature by adding the is_retry parameter. What do you think?

Comment on lines +180 to +184
RETRY_PROMPT_SUFFIX = """
If the response above is not aligned with the question, please rectify this by \
considering the following reason(s) for misalignment: "{failure_reason}".
Make necessary adjustments to ensure the answer is aligned with the question.
"""

Collaborator:

Right now, we are only passing failure_reason, which is response.debug_info["factual_consistency"]["reason"], but we should also include the LLM response in this prompt.

Collaborator:

Also, shouldn't the prompt define what we mean by alignment?

lickem22 (Contributor Author):

That makes sense. To be honest, I was just having a go at updating the prompt to take the output into consideration. I am not exactly an expert in prompt engineering. Should we discuss that in a tech session?




@lickem22 changed the title from "[AAQ-765] Retry LLM generation when AlignScore fails" to "[DEPRIORITIZED][AAQ-765] Retry LLM generation when AlignScore fails" on Nov 6, 2024