Calibrated prompt generated is completely different from initial prompt. #57

Open
shobhnadhami opened this issue Apr 3, 2024 · 3 comments

Comments

@shobhnadhami

I have a prompt that is used to generate an SQL query from text entered by a user.
I am trying to optimize the prompt using run_generation_pipeline.py, but the calibrated prompt I get back is completely different from the initial one.
Below are the inputs I provided:

--task_description:
Assistant is a large language model that is tasked to generate SQL query based on details and examples provided in prompt.

--prompt: 
We have 2 tables:
    Employee: Employee table have information regarding all the employees in a company.
    Below are the attributes of Employee table
        empid: empid column contains employee id. empid is a primary key of Employee table.
        name: Name column contains name of the employee
        salary: salary column contains salary of the employee
        department_id: department_id column contains employee's department id. It is a foreign key from Department table.
    Department: Department table have information regarding all the department of a company.
    Below are the attributes of Department table
        department_id: department_id contains the id of the department. department_id is primary key of Department table.
        department_name: department_name contains name of the department.
***Below are few examples***:
##Example 1
user query: what is empid of employees in department A?
output: Select Employee.empid
        From Employee 
        Join Department 
        on Employee.department_id = Department.department_id
        Where Department.department_id = 'A';
##Example 2
user query: what is salary of employee with empid=1?
output: Select salary
        From Employee 
        Where empid = 1;
** End of Examples **

Your task is to generate SQL query from natural language input provided by user.
Your task is to understand natural language input and provide SQL query to fetch information asked in natural language input from above tables.

annotator instruction in config_default.yml:
        instruction:
            'We have two tables Employee and Department.
            Employee table have empid, name, salary, department_id as columns
            Department table have department_id, department_name as columns
            You will be given a query in natural language and its interpreted sql query to fetch data from above table. 
            Asses interpreted SQL query with respect to natural language input and table provided. Answer 1 if SQL query is relevant 
           and correct otherwise 0.'

output given by AutoPrompt:

Calibrated prompt score: 1.0
Calibrated prompt: Your task is to generate accurate and context-specific SQL queries based on natural language input provided by the user. Please include specific examples of natural language input and the corresponding expected SQL queries. Additionally, describe the database schema and table structure to provide more context for query generation. Aim for a higher score by improving the model's understanding and accuracy in generating SQL queries. 

The output is not relevant to the task. Am I providing the wrong inputs, or am I missing some inputs that need to be provided?

@Eladlev
Owner

Eladlev commented Apr 3, 2024

Hi,
A few remarks:

  1. Your ranker seems too 'weak': it only requires that the SQL query be relevant, so it is very hard to generate synthetic data on which GPT-3.5/4 will fail (with any relevant prompt).
  2. To avoid divergence, you should move part of the initial prompt into the task description.
For example:
--task_description:
Assistant is a large language model that is tasked to generate SQL query based on details and examples provided in prompt.
We have 2 tables:
    Employee: Employee table have information regarding all the employees in a company.
    Below are the attributes of Employee table
        empid: empid column contains employee id. empid is a primary key of Employee table.
        name: Name column contains name of the employee
        salary: salary column contains salary of the employee
        department_id: department_id column contains employee's department id. It is a foreign key from Department table.
    Department: Department table have information regarding all the department of a company.
    Below are the attributes of Department table
        department_id: department_id contains the id of the department. department_id is primary key of Department table.
        department_name: department_name contains name of the department.

--prompt:
Your task is to generate SQL query from natural language input provided by user.
Your task is to understand natural language input and provide SQL query to fetch information asked in natural language input from above tables.
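
The split above can be sketched as a small Python helper that assembles the command line. This is only an illustration: the script name and the `--task_description` / `--prompt` flags come from this thread, while `build_command`, the condensed schema text, and the variable names are hypothetical and not part of AutoPrompt.

```python
import shlex
from typing import List

# Condensed paraphrase of the schema from the issue (an assumption made for brevity).
# It lives in the task description, so it stays fixed while the prompt is optimized.
SCHEMA = (
    "We have 2 tables:\n"
    "Employee(empid PK, name, salary, department_id FK -> Department)\n"
    "Department(department_id PK, department_name)"
)

TASK_DESCRIPTION = (
    "Assistant is a large language model that is tasked to generate SQL query "
    "based on details and examples provided in prompt.\n" + SCHEMA
)

# Only the short instruction is left in the prompt that gets rewritten.
PROMPT = (
    "Your task is to generate SQL query from natural language input "
    "provided by user."
)


def build_command(task_description: str, prompt: str) -> List[str]:
    """Return an argv list for the pipeline script (hypothetical helper)."""
    return [
        "python", "run_generation_pipeline.py",
        "--task_description", task_description,
        "--prompt", prompt,
    ]


cmd = build_command(TASK_DESCRIPTION, PROMPT)
# shlex.join quotes each argument so the line can be pasted into a POSIX shell.
print(shlex.join(cmd))
```

The argv list can also be passed to `subprocess.run(cmd)` directly, which avoids shell quoting problems with the multi-line task description.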

Lastly, note that you are using an LLM ranker, so you should skip the ranker training process.
You should change this line:

generation_config_params.eval.function_params.instruction = best_prompt['prompt']

To:
generation_config_params.eval.function_params.instruction = ranker_config_params.annotator.config.instruction
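
To make the effect of that one-line change concrete, here is a self-contained sketch. The two config objects are replaced by `SimpleNamespace` stand-ins (an assumption: the real objects are built by the pipeline script from its config files, and their actual types are not shown in this thread). Only the attribute paths are taken from the lines above.

```python
from types import SimpleNamespace

# Stand-in for the ranker config; the instruction text is shortened from the issue.
ranker_config_params = SimpleNamespace(
    annotator=SimpleNamespace(
        config=SimpleNamespace(
            instruction="Answer 1 if SQL query is relevant and correct otherwise 0."
        )
    )
)

# Stand-in for the generation config, with the evaluator instruction still unset.
generation_config_params = SimpleNamespace(
    eval=SimpleNamespace(function_params=SimpleNamespace(instruction=None))
)

# Placeholder for the result of the ranker-fitting phase (hypothetical value).
best_prompt = {"prompt": "<ranker prompt fitted in the first phase>"}

# Original line: evaluate generations with the *fitted* ranker prompt.
# generation_config_params.eval.function_params.instruction = best_prompt['prompt']

# Suggested line: evaluate generations with the LLM ranker instruction as written
# in the annotator config, so the fitted approximation is never used.
generation_config_params.eval.function_params.instruction = (
    ranker_config_params.annotator.config.instruction
)

print(generation_config_params.eval.function_params.instruction)
```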

@shobhnadhami
Author

shobhnadhami commented Apr 12, 2024 via email

@Eladlev
Owner

Eladlev commented Apr 12, 2024

The purpose of the first phase is to fit an LLM ranker to the user's intent (by treating it as a classification task). This saves a lot of human effort, since the alternative is for the user to rank the whole dataset at each phase.

If you already start with an LLM ranker, the first stage (fitting an LLM ranker) can only produce a sub-optimal approximation of it, so it is better to skip this step and use the LLM ranker directly.
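
Coming back to the earlier remark that the ranker is too 'weak': one possible stricter criterion than "relevant" is to execute the generated SQL and compare its result with a reference query. The sketch below is not an AutoPrompt feature and is not an LLM ranker; it is a plain Python illustration using the standard `sqlite3` module. The column types, the toy rows, the `execution_match` function, and the availability of a reference query per example are all assumptions.

```python
import sqlite3

# Toy database following the schema in the issue; types and rows are assumed.
SCHEMA_SQL = """
CREATE TABLE Department (department_id TEXT PRIMARY KEY, department_name TEXT);
CREATE TABLE Employee (
    empid INTEGER PRIMARY KEY,
    name TEXT,
    salary REAL,
    department_id TEXT REFERENCES Department(department_id)
);
INSERT INTO Department VALUES ('A', 'Sales'), ('B', 'Support');
INSERT INTO Employee VALUES (1, 'Ann', 100.0, 'A'), (2, 'Bob', 80.0, 'B');
"""


def execution_match(candidate_sql: str, reference_sql: str) -> int:
    """Return 1 if both queries yield the same rows on the toy data, else 0."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(SCHEMA_SQL)
        try:
            got = conn.execute(candidate_sql).fetchall()
        except sqlite3.Error:
            return 0  # invalid SQL, unknown table, or unknown column
        expected = conn.execute(reference_sql).fetchall()
        # Compare as unordered collections of rows; key=repr avoids type clashes.
        return int(sorted(got, key=repr) == sorted(expected, key=repr))
    finally:
        conn.close()


reference = "Select salary From Employee Where empid = 1;"
print(execution_match("SELECT salary FROM Employee WHERE empid = 1", reference))
print(execution_match("SELECT name FROM Employee WHERE empid = 1", reference))
```

A check like this rejects queries that are on-topic but return the wrong column, which a relevance-only instruction would accept. With such small toy data, two different queries can still agree by coincidence, so it is a filter rather than a proof of correctness.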
