How should I prepare the dataset for generative question answering on the private documents? #38

Open
AayushSameerShah opened this issue Apr 7, 2023 · 50 comments

Comments

@AayushSameerShah

AayushSameerShah commented Apr 7, 2023

Hello,
Thanks for creating this very helpful tool!
I am fine-tuning the model (GPT-J-6B) for the question answering on the private documents. I have 1000+ documents and they are all in text format. And of course, I will be going with the PEFT LoRA.

But the question is...

How should I prepare my dataset?

Since this is the question-answering scenario, my first thought was to prepare the data set in "Question: {} Answer: {} Context: {}" format but since there are so many documents and for that, I will first need to generate the questions, then the answers and... you know it becomes non-feasible.

Then I thought, I should "just provide the raw text" to the model as the knowledge base and choose the model which was fine-tuned already on the alpaca dataset (so now the model understands the instructions - for that I will use the "nlpcloud/instruct-gpt-j-fp16" model), and then my hope is that the model should give the response to my questions.

So, is what I am doing correct? How should I prepare my dataset for question answering?
Please help,
Thanks 🙏🏻

@IIIIIIIllllllllIIIII

I also wonder how to structure a dataset properly…. Using raw text seems to work very well though…

@AayushSameerShah
Author

@Gitterman69 indeed! But I also wonder whether it is okay to fine-tune with LoRA to "remember" the facts!? LoRA only adds trainable parameters amounting to <1% of the total, and I don't think we can expect it to remember the facts of the private docs!

Let me know your suggestions mate!

@Datta0

Datta0 commented Apr 17, 2023

I have fine tuned llama using this repo and a few text documents I had with me.
If I provide 3-4 consecutive words from input text, it amazingly completes the next couple of sentences.
But if I ask the same information as a question or reorder the input prompt, it hallucinates.

I thought I was overfitting, so I increased the input data size and decreased the number of epochs. After that, it neither completed the sentences when prompted as above nor answered the questions.

@Gitterman69 curious to know how you got it to "work very well"...

@IIIIIIIllllllllIIIII

IIIIIIIllllllllIIIII commented Apr 17, 2023

it really depends what you want to do but this is my workflow:

workflow for text completion:

  1. merge all available text files into one txt
  2. create a python script that uses gpt2tokenizer to create 512 token chunks/paragraphs in said txt file
  3. paste as raw text into the trainer
  4. train it
  5. check the results and finetune your settings
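
A minimal sketch of step 2 above, for illustration only. It assumes whitespace splitting as a stand-in for the GPT-2 tokenizer (real token counts will differ, usually upward), and `chunk_text` is a hypothetical helper name, not something from the trainer:

```python
from typing import Callable, List


def chunk_text(
    text: str,
    max_tokens: int = 512,
    tokenize: Callable[[str], List[str]] = str.split,
) -> List[str]:
    """Split text into consecutive chunks of at most max_tokens tokens.

    `tokenize` defaults to whitespace splitting, which is only a stand-in:
    a real GPT-2 tokenizer would count tokens differently.
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    tokens = tokenize(text)
    return [
        " ".join(tokens[i : i + max_tokens])
        for i in range(0, len(tokens), max_tokens)
    ]


merged = "word " * 1200  # stands in for the merged .txt file
chunks = chunk_text(merged, max_tokens=512)
```

Passing a different `tokenize` callable is the intended way to swap in a real tokenizer; rejoining with spaces is then no longer appropriate and would need a matching decode step.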

workflow for q&a bot:

  1. create questions (manually / via scripts / whatever) and paste your document (512 token paragraphs!!!!!) below
  2. paste as raw text into the trainer
  3. train it
  4. check the results and finetune your settings

basically it's trial and error - just make sure you train in the "formatting" you want the output to be!

edit: a nice way to train q&a would be the following

QUESTION: How many eggs are in the box from company Eggman Corp.
Answer: PASTE YOUR TEXT HERE

and then when you ask your bot the question above, it will give a similar answer, in the same formatting and question: answer: style... etc

@Datta0

Datta0 commented Apr 17, 2023

I ideally want it to gain knowledge from my documents and be able to answer if I ask something from them at inference time. I don't really have any Q&A pairs from/for the data.

I also tried using vector embedding search with a model on top of it to put things together, but this way it misses information that is spread across several sentences. It also can't answer anything other than What/Where kinds of questions: it struggles if the answer is expected to span multiple sentences, and it's even worse when it has to infer something by combining this information with general knowledge. So that seems to be a not so fruitful approach.

@AayushSameerShah
Author

AayushSameerShah commented Apr 17, 2023

@Datta0 I was facing the same. When given the "raw" text as training data, the model hallucinates. It also has a lot of knowledge from its pre-training, so to answer your question it will pull information from anywhere or make it up. And making QA pairs as Gitterman69 pointed out requires you to create them manually, which takes a hell of a lot of time (unless you have static data).

So, I have changed my approach. Now, I am actively focusing on the in-context learning (ICL) approach to question answering. The QA task is one where you need the "actual facts", unlike other tasks where the facts are optional and you just need a completion style, such as generating a quote or a story, or getting a response in some personality like Walter White or Sherlock!

For that reason, LoRA or any fine-tuning method isn't a good approach for question answering where you need concrete facts to be stored. The solution is to give the context in the prompt and make the model answer only from the prompt. This way the chances of hallucination are much lower, and this approach is very standard!

I am actively focusing on LangChain and LlamaIndex now. (See! Just yesterday LangChain incorporated a chatbot, Mendable, which answers questions from their docs! And they haven't fine-tuned it! They provide the context and then the chatbot replies from the context!)

🤗

@IIIIIIIllllllllIIIII

IIIIIIIllllllllIIIII commented Apr 17, 2023 via email

@Datta0

Datta0 commented Apr 17, 2023

@AayushSameerShah Thanks for the explanation. I already tried LangChain but I don't want to/ can't use text-davinci-003 or any OpenAI model due to some constraints. I want to use models that are available on huggingface ideally.

When I use some model like flan-t5-large with CustomLLM, it produces decent output sometimes. But when I try to run it as an agent with chat memory, it throws an error saying "Could not parse LLM output", because the model isn't capable of understanding the prompt (like "You're a chat bot and can access xyz index files ...") that precedes it.

I tried to use LLaMA or Alpaca with the same pipeline, but it quickly runs out of memory on my 40GB GPU. So I'm kinda stuck here with regard to LangChain and LlamaIndex.

If you got it to work, can you please elaborate? It would be really helpful.

@GioPetro

Really interesting topic, as I'm into this lately.

@Datta0 I was facing the same. When given the "raw" text as the training data for the model, it hallucinates. Because it also has much knowledge from its pre-training and to answer your question, it will get information from anywhere or it will make it up. And making the QA pair as Gitterman69 pointed out requires you to create the QA manually which takes a hell lot of time (except you have static data).

What if I have a lot of docs to pass through? I'm talking about around 30 GB of HTML text in particular; could it be viable to feed the raw text as training data? Of course there aren't any ground truth labels. The idea is for the model to be fed the text, understand it, and be able to answer domain-specific questions. What is the best approach for this?

So, I have changed the way. Now, I am actively focusing on the In-context-learning (ICL) approach to Question-answering. Because the QA task is where you need the "actual facts" unlike other tasks where the facts as optional and just require the completion style such as generating a quote or a story or getting a response in some personality like Walter White or Sherlock!

Can you elaborate on how you focus on ICL? Are there any frameworks that have that available?

For that reason, LoRA or any fine-tuning method isn't a good approach for question answering where you need the concrete facts to be stored. So the solution is to give the context in the prompt and make the model answer only from the prompt. This way there are very low chances for hallucination and this way is very standard!

Considering my use case, on 30gb of data, it's practically impossible to "prompt" all that and expect an answer. What would you do instead?

Thanks

@FatimaHabib

FatimaHabib commented Jun 5, 2023

@AayushSameerShah
I'm also working on something similar and have started to use LangChain and LlamaIndex with the open-source models available on Hugging Face. However, I was wondering if we can fine-tune those models with questions, answers and context. And I have a question: what method do you use so the model finds the correct context in the documents to answer a given question?
Thanks

@TBomer-fm

Hi @AayushSameerShah, thanks for kicking off this discussion

Are you able to elaborate more on the ICL approach?

And as for LangChain, it seems like a good option, but to use Google's PaLM or Anthropic's Claude you need to join the waitlists, and to use an OpenAI model you need to pay for the API. Do you know if LangChain offers models that are available/free?

Thank you, Tom

@AayushSameerShah
Author

Hei @TBomer-fm,

🧱 Background

Generally, "generative question answering" is a topic which requires accuracy, because we need to get the "factual" data back from our own private documents. This is unlike other use cases like generating a poem or talking in some personality, as if the model were Sherlock or Walter White or whatever. In such use cases we can opt for fine-tuning, because what we want back is not the information but the way it is generated.

🤔 Options

Now, in the use case of "question answering" we have a couple of options:

  1. Extractive
  2. Generative

1️⃣ Extractive

Right off the bat, the first option is thrown away, because there we simply can't ask complex questions or questions that draw on multiple sources; that is just how such models return answers. The model simply gives back the indices of a span in the paragraph you provide, and the answer is very short.

Ex:

I am Aayush Shah and I live in my mind palace which doesn't physically exist in the physical world.

That is the context, and when asking the question in the extractive manner, we need to provide the model with that context to get the answer from.

And so the question:

Where does Aayush live?

The model will simply pick the indices Start: 40, End: 45 (I haven't counted the exact indices, but that drives my point home). And so we get the result: mind palace.
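
To make the "indices" idea concrete, here is a small sketch. It does not run an extractive QA model: `extract_span` is a hypothetical helper that looks the answer up with `str.find`, purely to show the shape of the output (a start/end offset into the context) that such a model predicts:

```python
def extract_span(context: str, answer: str):
    """Return (start, end) character offsets of `answer` in `context`.

    A real extractive model predicts these offsets; here they are looked
    up with str.find only to show what the output looks like.
    """
    start = context.find(answer)
    if start == -1:
        return None  # the answer text does not occur in the context
    return start, start + len(answer)


context = (
    "I am Aayush Shah and I live in my mind palace which doesn't "
    "physically exist in the physical world."
)
start, end = extract_span(context, "mind palace")
answer = context[start:end]
```

The answer can only ever be a slice of the context, which is why this style cannot combine several sources or phrase a longer explanation.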

2️⃣ Generative

This is where things get crazy, and where I was stuck for a long time. Here we get amazing, human-like responses from the model. Without talking too much about this, we have 2 options here:

  1. Open Book QA (LangChain, LlamaIndex etc.)
  2. Closed Book QA (fine tuning: LoRA)

Let me cover the second approach first, because that drove me crazy around that time.

2. Closed Book Question-Answering

This is what we experience with ChatGPT, Bard, Bing, Claude, Phind, OpenAssistant and all of those fancy models out there. You just ask the question, and it will return the answer.

So, here comes the challenge of "storing the answers in the model's weights". The bigger the model is, the more information it can store. But training large networks (6B, 13B, 65B, 175B parameters) is not a job for our GPUs. That costs thousands of dollars and a hell of a lot of training time and data.

So we don't train it, we tune it. And here comes some interesting stuff: PEFT, which stands for Parameter-Efficient Fine-Tuning. It means:

Instead of training the whole network, just tune a small part of it, which adapts to the new dataset and performs accordingly.

With that said, anyone can load a big model and tune it with a single GPU. There are several techniques to perform PEFT, but the most famous is "LoRA".


Here comes the interesting part: such techniques should only be used to change the way the model responds to your prompt, not to store information in it.

Which means, say you have a vanilla model which was pretrained on a hell of a lot of data from the world. It knows when WWI happened, it knows the height of Sundar Pichai, it knows the first PM of India and all, but the base model is a simple completion model. It can't give you the results right away; to make it "chattable" or, say, "instruction" enabled, you need to fine-tune it.

There are several datasets like Stanford's Alpaca; if you train your model with such datasets, the model will be able to respond to you in a certain way.

😅 Problem

If we want to tune this model on our private dataset and expect it to answer all the questions, then we are in trouble. We are training only a small portion of the model (<0.1%), and such a small portion can't store all the information in your private documents, which also reinforces what I said before: "PEFT can be used to change the way the model responds, not to store the information".

Meanwhile, this QA task requires much higher accuracy, and most of the time the model will hallucinate if we follow this approach.

For Example:

I took an old model, GPT-J, which was trained on data from before 2020. So, it doesn't know anything after 2020. I trained it on the Olympics-2021 dataset and then asked: "How many medals did the USA win in the 2021 Olympics?", and it gave totally wrong answers all the time, and also different answers on each run. Simply because of the training.

What is the solution then? Read next... ↙

1. Open Book Question-Answering

This means we allow the model to "read from the context given" and then expect it to answer our question from that context.

Like extractive QA, the model reads from a given context, but here it is also capable of:

  1. Generation
  2. Complex answers
  3. Cross domain answers

This is where ICL comes in.

Here we simply provide the relevant chunks in the prompt and give the model an opportunity to answer.

NOTE: For ICL to work, the model should be instruction-tuned, which means it should be able to understand instructions. Example below ↙

The prompt:

Instruction: Suggest some cool name of my high school which is just for Kids below 8 years.
Suggestion:

(high school for kids!??)

Jokes apart, the model then completes the sentence:

Instruction: Suggest some cool name of my high school which is just for Kids below 8 years.
Suggestion: Sunrises before 8

Or whatever, one more example:

Prompt:

Instruction: Generate an SQL query to get sales from states Arizona and Ohio and the table name is sales_demo
SELECT 

The model completes:

Instruction: Generate an SQL query to get sales from states Arizona and Ohio and the table name is sales_demo
SELECT
  state,
  SUM(total_sales) AS total_sales
FROM
  sales_demo
WHERE
  state IN ('Arizona', 'Ohio')
GROUP BY
  state

Right? Here we provide the instruction and the model responds. This input is called a prompt, and finding a prompt that works well is called prompt engineering.

🙋🏻‍♂️ Where is ICL?

ICL = In Context Learning.

This means the learning happens temporarily and the weights of the model don't change. The best example is this 👇🏻

Try asking:

Who is the CEO of Twitter?

The model (assuming it was trained on data from before 2021) will return Jack Dorsey.

But with ICL:

You are a smart model, you just need to read the following article and respond to the question
from the article only.

Article:
Twitter was founded in 1866 even before the internet was found and its founder was Jack Dorsey but after some decades 
a person Elon Musk was born and he took the company from him now in 2023 he is the CEO.

Question: Who is the CEO of Twitter?

Answer:

Now the model will answer (like):

Before 2023 the CEO of Twitter was Jack Dorsey but now in 2023 the CEO is Elon Musk

Cool, right?? Here, the model temporarily learned the knowledge from the prompt, but the learning wasn't permanent. And this is exactly what is called ICL.

🧐 How can we use that for private documents?

We can leverage this to our advantage.

ICL is simply giving the model material to learn from and having it perform the instruction based on that.

You may have heard terms like "few-shot learning", "one-shot learning" etc.; they point to exactly this ICL. Provide examples in the prompt and expect the model to answer. There is a hell of a lot of literature on the internet, so I am not covering them here.

For now, let's focus on the private document question answering.


For this we can do something like the following: suppose you have 5 documents; then we can pass those 5 documents in the context like this:

Hey! You are a brilliant analyst, please read the content given below and help the user to get their answers from.

Context:
1. Doc 1...
2. Doc 2...
3. Doc 3...
4. Doc 4...
5. Doc 5...

Instruction: Only provide the answers from the context given above, and don't use any other information which is not 
covered in the context.
Question: What is the meaning of RedPajama?
Helpful answer:


Note: The prompt above is very generic and might not work for all use cases; please adapt it to your use case and the kind of model you're using.
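
The prompt above can also be assembled programmatically. This is only a sketch: `build_prompt` is a hypothetical helper, the wording is copied from the generic prompt above, and the document strings are placeholders:

```python
from typing import List


def build_prompt(docs: List[str], question: str) -> str:
    """Assemble a context-grounded prompt from documents and a question."""
    numbered = "\n".join(f"{i}. {doc}" for i, doc in enumerate(docs, start=1))
    return (
        "Hey! You are a brilliant analyst, please read the content given "
        "below and help the user to get their answers from.\n\n"
        f"Context:\n{numbered}\n\n"
        "Instruction: Only provide the answers from the context given above, "
        "and don't use any other information which is not covered in the "
        "context.\n"
        f"Question: {question}\n"
        "Helpful answer:"
    )


prompt = build_prompt(
    ["Doc one text...", "Doc two text..."],  # placeholder documents
    "What is the meaning of RedPajama?",
)
```

Ending the prompt on "Helpful answer:" leaves the model to continue with the answer itself, which is the completion-style behaviour described earlier.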

🎉 The model will be able to generate the answer now 🎉

What if you have thousands of documents?

Hmmm, now instead of only 5 docs, you have a hell of a lot of docs to ask questions about. We can't possibly pass all of them to the model: models have limits on how many tokens they can process in a prompt, like 2048 or 4096 and so on.

🌟 This is where LangChain and LlamaIndex come into the picture 🎓

The general scenario there is:

  1. We have all 1000 documents
  2. We split them into smaller chunks, say 500 tokens each
  3. Then embed the chunks and store them in some vector store (database)
  4. Done!

After this, we just have to query: "What is the meaning of RedPajama?"

Now, LangChain will automatically convert your question into an embedding and compare it against the embeddings of all the chunks in the database. It will return the top k chunks, generally 3.

Then we pass those 3 most relevant chunks, which may contain your answer, to the model in the prompt, and finally we get an answer!
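
The store-then-retrieve flow can be sketched without any library. Everything here is a toy assumption: `embed` counts words instead of calling a neural embedding model, and `ToyVectorStore` is a hypothetical in-memory stand-in for FAISS, Chroma and the like. It is not LangChain's API, just the same idea in miniature:

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (a real system uses a neural model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class ToyVectorStore:
    """In-memory stand-in for a vector store such as FAISS or Chroma."""

    def __init__(self) -> None:
        self.items: List[Tuple[str, Counter]] = []

    def add(self, chunks: List[str]) -> None:
        self.items.extend((c, embed(c)) for c in chunks)

    def top_k(self, query: str, k: int = 3) -> List[str]:
        q = embed(query)
        ranked = sorted(self.items, key=lambda it: cosine(q, it[1]), reverse=True)
        return [chunk for chunk, _ in ranked[:k]]


store = ToyVectorStore()
store.add([
    "RedPajama is the name used in these example notes.",  # made-up chunks
    "The cafeteria opens at nine.",
    "Parking is free on weekends.",
    "Notes about the RedPajama meaning and origin.",
])
hits = store.top_k("What is the meaning of RedPajama?", k=2)
```

The chunks returned by `top_k` are what would be pasted into the "Context:" part of the prompt before calling the model.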


This is the overall scenario of using all of these for generative question answering.
I have tried to cover everything briefly while keeping things in context, but if you have any questions, please let me know.

You're welcome 🤗

@AayushSameerShah
Author

AayushSameerShah commented Jun 8, 2023

Hei @Gitterman69, @Datta0,
I am really sorry for the late response. I have tried to cover the answers in my last comment. Please let me know if it doesn't answer your questions.

Thanks.

@AayushSameerShah
Author

And, @TBomer-fm

LangChain DOESN'T provide any free LLM or anything like that. It doesn't even provide any paid LLMs. It is not a provider but a connector between LLMs and different kinds of prompts and agents.

To use a free LLM, you will need to use a Hugging Face model in LangChain. There are guides available. Please let me know if you need further direction.

🤗

@AayushSameerShah
Author

Hei @FatimaHabib,

Great that you are working on a similar project! Unfortunately, fine-tuning isn't a scalable way to do question answering. I have tried to include the answer in my comment above. I hope I was able to explain it there.

Let me know if any further clarification is needed.
Thanks.

@AayushSameerShah
Author

Hello @GioPetro !

To pass a hell of a lot of data, like in your case, you need to store it first in some vector store like FAISS, Chroma, Pinecone etc. Then use LangChain / LlamaIndex to retrieve just the documents which are required for that particular question and pass them in the prompt. Hence the ICL.

🤗

@TBomer-fm

Wow @AayushSameerShah, thank you very much for your super comprehensive and helpful response!

@FatimaHabib

Big thanks @AayushSameerShah for the excellent explanation! I really appreciate it (: . I will try out the solution you have suggested.

@IIIIIIIllllllllIIIII

Thanks for the superb explanation and answer!!!!!

@dhairyakataria

Thanks @AayushSameerShah for your explanation.
I tried ICL and it's working quite well.

But I also want to train the model using raw text, instead of using vector stores and searching. I am not able to train the model on raw text. @Gitterman69 @Datta0 you were able to do this task; can you please guide me in the right direction?

@EL-MEHDI-git

EL-MEHDI-git commented Jul 4, 2023

Thanks, @AayushSameerShah, for this excellent explanation!
I used LangChain to build a QA system with my own documents. I created the chain, from loading the documents to retrieving answers. However, I didn't get the expected answer. Upon investigation, I discovered an issue with the documents returned by the retriever (I used ChromaDB as the vector store and Text-ADA-Embedding-003 as the embedding model). If you have any suggestions to assist me, I would be grateful!

@AayushSameerShah
Author

AayushSameerShah commented Jul 18, 2023

Hie @dhairyakataria! 👋

I see, you want to use the fine-tuning approach for your question-answering task on your private documents.

Now, I am not sure which method you've used for fine-tuning, but I am quite sure you are not training the model as a whole; you must be using some kind of PEFT.

Now, as explained in this GitHub thread, such "fact extraction" use cases should not be solved by fine-tuning, because that won't guarantee that the information you ask for will be retrieved, and often the model will mix your training information with its pre-trained information.

But if you really don't want to use the second approach (providing the documents in the prompt) and want to fine-tune, then you will have to train the model from scratch, which is truly expensive, mate.

This directly relates to the model's ability to give correct responses. Take the example of the Wikipedia dataset. In a hypothetical example, let's say 3 models are trained on the same Wikipedia dataset. These 3 models have different architectures and sizes.

Now, the model's ability to "remember" things depends heavily on the training and architecture, and we often see lines like "GPT-X model gives 89% accuracy on science-paper test" or something like that, but here it would be related to Wikipedia.

Now, practically, your case involves feeding in private documents (which often change frequently), and that makes this task impractical for fine-tuning.

Such tasks should only be solved using ICL, which is the standard way to go forward.

Still, this is an active area of research and many techniques are being researched every day, but as far as I know, this is the way to go.


And yes, here is one thread on OpenAI which supports what we are talking about:
Screenshot 2023-07-07 144902

Let me know if it helps 🤗

@AayushSameerShah
Author

AayushSameerShah commented Jul 18, 2023

Hie @EL-MEHDI-git, 🙋🏻‍♂️
I can see that you are using the ADA model from OpenAI to create the embeddings, which is a SOTA model, so there should not be an issue there.

It is a bit unclear what you mean by: "answers that you are getting are not as expected". Does that mean they are "wrong answers", or that the "completion is not proper", meaning that the model answers but stops in the middle without even completing the sentence?

1️⃣ Wrong answers

If the answers are wrong, this simply means the context isn't being passed properly or not in enough volume. To fix that:

  • Increase the number of documents returned (take the top 3 to 5). Don't use a very high number, otherwise the signal-to-noise ratio will be low and the answers won't be correct.
  • Make sure the query/question contains enough context to fetch relevant documents. Obviously, asking "How many medals did the US win?" is less likely to fetch relevant content than asking "How many medals did the US win in the 2020 Summer Olympics?". The query strongly affects your retrieval.
  • Decrease the chunk size. I assume that all documents are big and you had to chunk them into smaller pieces. Again, to avoid a low signal-to-noise ratio, you need to play around with the size of the document chunks; generally around 500 works, but it needs to be adjusted. It won't make a drastic difference, but it will surely improve the performance and answer quality.

Here I am not sure which model you used for inference: was it Davinci or some open-source model? I am assuming it was an open-source model and carrying on with the other diagnostics.

2️⃣ Answers are incomplete

This is a huge problem when working with open-source models. Even if they are instruction-tuned, they often do not complete the sentence properly.

The easy fix is: use the prompt format on which they were trained.

This means that a generic prompt like the following:

Won't work ❌

Given the articles below, answer the question.

- article 1
- article 2
- article 3

Question: Hey there! What is up?
Answer:

With this prompt there is a higher chance that the model will not answer properly, because the model was trained on a structure like this (for example, the OpenAssistant 12B model):

<|prompter|>
-- Instruction ---
<|endoftext|>
<|assistant|>
-- Completion ---

Now, for such models, using the generic prompt won't help; we need to change the prompt in LangChain:

May work well ✔

<|prompter|>
Given the documents below, try answering the question. Don't use any other information to answer...

Documents:
- Doc 1
- Doc 2
- Doc 3

Question: Hey there! What is up?
Answer: <|endoftext|><|assistant|>

Note: The example given above is for the OpenAssistant model; different models use different prompt styles.
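
One way to keep model-specific wrappers out of the prompt text itself is a small template registry. This is a sketch under assumptions: `PROMPT_TEMPLATES` and `format_for_model` are hypothetical names, and the "oasst" entry reflects the turn markers as I understand them from the OpenAssistant model cards, so check the card of the exact checkpoint before relying on it:

```python
# Hypothetical registry of per-model prompt templates. The "oasst" entry is
# an assumption based on the OpenAssistant model cards; verify it against
# the card of the checkpoint you actually load.
PROMPT_TEMPLATES = {
    "generic": "{prompt}",
    "oasst": "<|prompter|>{prompt}<|endoftext|><|assistant|>",
}


def format_for_model(prompt: str, model_family: str) -> str:
    """Wrap a plain prompt in the template expected by `model_family`."""
    try:
        template = PROMPT_TEMPLATES[model_family]
    except KeyError:
        raise ValueError(f"no template registered for {model_family!r}") from None
    return template.format(prompt=prompt)


wrapped = format_for_model("Question: Hey there! What is up?\nAnswer:", "oasst")
```

Keeping the question-and-documents text separate from the wrapper means the same retrieval prompt can be reused when switching models; only the registry entry changes.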

So, Fix-1: Change the prompt. The generic one won't help.

Fix-2: Use a better/bigger model, like GPT-3. There you should not have any problem with LangChain's default prompts, but changing the prompt will still help you.

Fix-3: Change the generation parameters. Play around with temperature, repetition penalty, and so on. They have a huge impact on the answers.


I have not elaborated on each fix too much, because these diagnostics are based on my assumptions about your problem; you may be talking about an entirely different thing.

Please let me know if any of these helps.
🤗

@AjinkyaBankar

@AayushSameerShah Thanks for your two detailed responses. I am facing issues with the similarity_search() method of LangChain for ChromaDB/FAISS. The user thinks the question has enough context to retrieve similar chunks, but similarity_search() doesn't return relevant top-k chunks. I use k=3 to minimize the noise. Are there any better alternatives for retrieving similar chunks, which would be more accurate in fetching relevant chunks?

@AayushSameerShah
Author

AayushSameerShah commented Aug 14, 2023

Hie, @AjinkyaBankar
Sorry for the late response. If I understand your question correctly, you are saying that "when the user asks a question, the chunks that are fetched as context don't contain enough information to answer it well, because the user doesn't know how much information to provide in the question".

If that's correct, then I think we are limited here. What we fetch relies solely on the question, so it is advisable to ask the user to "provide more context" while asking.

I believe the LangChain retriever provides some kind of similarity score. So, if the fetched chunks have scores below your threshold, say <0.5, then you may prompt the user to ask in more detail, like:

🤖 Please give more context with your question. Still, based on your question, here is the answer that I think will fulfill your requirement.
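
A rough sketch of that threshold check, independent of any library. The helper names are hypothetical, and it assumes a higher score means a closer match; some stores return distances instead (lower is better), so the comparison would need flipping there:

```python
from typing import List, Tuple

LOW_CONFIDENCE_NOTE = (
    "Please give more context with your question. Still, based on your "
    "question, here is the answer that I think will fulfill your requirement."
)


def needs_more_context(
    scored_chunks: List[Tuple[str, float]], threshold: float = 0.5
) -> bool:
    """True when no retrieved chunk reaches the similarity threshold.

    Assumes higher score = more similar. An empty result also counts as
    needing more context.
    """
    return all(score < threshold for _, score in scored_chunks)


def wrap_answer(answer: str, scored_chunks: List[Tuple[str, float]]) -> str:
    """Prefix the answer with a nudge when retrieval looked weak."""
    if needs_more_context(scored_chunks):
        return f"{LOW_CONFIDENCE_NOTE}\n\n{answer}"
    return answer


weak = wrap_answer("Some answer", [("chunk a", 0.2), ("chunk b", 0.4)])
strong = wrap_answer("Some answer", [("chunk a", 0.2), ("chunk b", 0.9)])
```

The 0.5 default is just the example value from the text above; a sensible threshold depends on the embedding model and the score scale of the store.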

But there are workarounds.

TL;DR: Have a look at the 4️⃣th solution.

1️⃣ Change the vector stores

By that I mean, the "way we search" for similar chunks differs from method to method. Many vector stores support "cosine" similarity while others don't. For that you need to go through the documentation.

Why am I suggesting this?

Because, mostly we are dealing with 2 types of similarity searches.

  1. Cosine
  2. Dot

Both are commonly available in most vector stores, i.e. Pinecone, Chroma, FAISS etc., and switching between them will impact which chunks are returned. So, you should try changing the "method" used to retrieve the chunks.
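
A tiny worked example of why the two can disagree. The vectors are made up for illustration: dot product rewards magnitude, while cosine only looks at direction:

```python
import math
from typing import Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


query = [1.0, 1.0]
chunks = {
    "long_but_off_axis": [10.0, 0.0],  # large magnitude, points elsewhere
    "short_but_aligned": [1.0, 1.0],   # small magnitude, same direction
}

best_by_dot = max(chunks, key=lambda name: dot(query, chunks[name]))
best_by_cosine = max(chunks, key=lambda name: cosine(query, chunks[name]))
```

When embeddings are normalised to unit length the two rankings coincide, so the choice only matters for models or stores that keep unnormalised vectors.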

Try MMR (Maximal Marginal Relevance)

This method is likely to give you a diverse range of chunks, in contrast to the "cosine" or "dot" methods, which give you the "most similar" chunks.

Find them: here in langchain docs.
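
For intuition, here is a plain-Python sketch of MMR selection. It is written from the usual definition (relevance to the query minus similarity to what is already picked) and is not LangChain's implementation; `lambda_mult` and the toy vectors are assumptions chosen so the effect is visible:

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr(
    query: Sequence[float],
    candidates: List[Sequence[float]],
    k: int,
    lambda_mult: float = 0.5,
) -> List[int]:
    """Return indices of up to k candidates chosen by Maximal Marginal Relevance."""
    selected: List[int] = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:

        def score(i: int) -> float:
            relevance = cosine(query, candidates[i])
            # Penalise similarity to the closest already-selected candidate.
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


query = [1.0, 0.0]
candidates = [
    [1.0, 0.0],    # 0: most relevant
    [0.99, 0.05],  # 1: near-duplicate of 0
    [0.6, 0.8],    # 2: less relevant but different
]
picked = mmr(query, candidates, k=2, lambda_mult=0.3)
plain_top2 = sorted(
    range(len(candidates)), key=lambda i: cosine(query, candidates[i]), reverse=True
)[:2]
```

Plain similarity returns the near-duplicate pair, while MMR with a low `lambda_mult` swaps the duplicate for the different chunk. Raising `lambda_mult` towards 1 moves the behaviour back towards plain similarity search.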

2️⃣ Make the chunks smaller

I think you've already done this, but just in case: small chunks tend to retain "specific" information.

Suppose you are chunking your documents into chunks of 500 tokens. Then you ask: "Where does Walter White live?" Out of the many 500-token chunks in your vector store, the retriever will fetch k=3 chunks, so the total context will be 1500 tokens, and just for a simple answer about where that person lives, the LLM has to go through all 1500 tokens, and it may not find the answer and will hallucinate!

On the other hand, if you have chunks of, say, 100 tokens, then it is likely that the fetched context, totalling 300 tokens, will contain the answer and the model will have an easy time finding it.

There is a problem...

In this 2️⃣nd point, we need to take care of the tradeoff of how small the chunks should be. Smaller is better, but small chunks won't cover the whole context! I am diverging into a topic which could be discussed in a lot more detail, but it may still be helpful.

In that case, we can use the ParentDocumentRetriever of LangChain, with which we can get the best of both worlds: we can get the "exact chunk" which has the information, and also the larger context which that small chunk is part of.

Access that here in langchain doc.
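
The small-chunk-to-parent idea can be sketched in a few lines. This is a toy illustration, not the ParentDocumentRetriever API: the helper names are hypothetical, word overlap stands in for embedding similarity, and the two documents are made-up text:

```python
from typing import Dict, List, Tuple


def split_words(text: str, size: int) -> List[str]:
    words = text.split()
    return [" ".join(words[i : i + size]) for i in range(0, len(words), size)]


def build_child_index(parents: Dict[str, str], child_size: int) -> List[Tuple[str, str]]:
    """Return (child_chunk, parent_id) pairs so each child remembers its parent."""
    return [
        (child, parent_id)
        for parent_id, text in parents.items()
        for child in split_words(text, child_size)
    ]


def retrieve_parent(
    query: str, index: List[Tuple[str, str]], parents: Dict[str, str]
) -> str:
    """Match on the small chunks, but hand back the full parent text."""
    q = set(query.lower().split())
    # Toy relevance: number of shared lowercase words (stand-in for embeddings).
    _, parent_id = max(index, key=lambda pair: len(q & set(pair[0].lower().split())))
    return parents[parent_id]


parents = {
    "doc_a": "Walter White is a chemistry teacher. He lives in Albuquerque with his family.",
    "doc_b": "Saul Goodman is a lawyer. His office is in a strip mall.",
}
index = build_child_index(parents, child_size=6)
context = retrieve_parent("lives in Albuquerque", index, parents)
```

The search precision comes from the six-word children, while the text handed to the model is the whole parent, so the surrounding sentences are not lost.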

3️⃣ Lost in the middle!

Research has found that the model can fetch information more accurately from the context if the relevant information is located either at the start or at the end of the provided context.

That means... if I ask Where does Walter White live? and I fetch the 3 chunks:

1. A long long ago a person who ever lived in the north America named Saul Goodman... 
2. Say my name, said Walter white; a famous cook (not a regular cook found in the hotels, but a Crystal meth cook) lives in Albuquerque , USA...
3. There was a man standing, having his gun in his left hand. He was no one but Walter white!...

Now, assume these paragraphs are long, not just single sentences. In this case the model will have a hard time finding "where does Mr. White live", because the answer is hidden in the middle.

The solution?

Put the most relevant chunk at the beginning or at the end of the context. Please find that out here: langchain docs.

Like:

1. Say my name, said Walter white; a famous cook (not a regular cook found in the hotels, but a Crystal meth cook) lives in Albuquerque , USA...
2. A long long ago a person who ever lived in the north America named Saul Goodman... 
3. There was a man standing, having his gun in his left hand. He was no one but Walter white!...

And there is a dedicated method which will do this automatically.
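
As an illustration of what such a reordering does (a sketch of the idea, not the library's own code), the function below takes chunks sorted from most to least relevant and pushes the weakest ones into the middle. `reorder_for_long_context` is a hypothetical name:

```python
from typing import List, TypeVar

T = TypeVar("T")


def reorder_for_long_context(docs_by_relevance: List[T]) -> List[T]:
    """Place the most relevant items at both ends, the least relevant in the middle.

    Input must be sorted from most to least relevant.
    """
    front: List[T] = []
    back: List[T] = []
    for rank, doc in enumerate(docs_by_relevance):
        # Alternate: ranks 0, 2, 4... go to the front, 1, 3, 5... to the back.
        (front if rank % 2 == 0 else back).append(doc)
    return front + back[::-1]


ordered = reorder_for_long_context(["r1", "r2", "r3", "r4", "r5"])
```

Here `r1` (best) ends up first, `r2` (second best) last, and `r5` (worst) in the centre, which matches the finding that the ends of the context get the most attention.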

4️⃣ Ask the same query in different terms!

I think this is one of the most useful ways to address your issue: generate more versions of the same question and query with each of them individually, because at least one of them is more likely to fetch the answer.

Users ask short questions without providing enough context. Here the MultiQueryRetriever of LangChain helps. There is a whole walkthrough in the docs, so me explaining it here won't make sense.

That is the one you should be looking at if you don't want to dig much.

5️⃣ Use better embedding model!

Hopefully, you are already using OpenAI's embedding model, which is the SOTA as I said in my previous response. If not, try using it. And while splitting the documents into chunks, make sure you provide some overlap between the chunks so that neighbouring chunks stay connected.

6️⃣ Perform the "Rerank" of the documents

So, you have received the top 3 chunks. But there might be some irrelevant chunks which are close enough to the query, but may not contain the answer you are looking for!

Take an example:
Query: What is the capital of Canada?

Now, say there are many articles related to the query, and the retriever has returned these:

1. The capital of Canada is Ottawa.
2. The Canadian capital is Sydney (wrong but matches!)
3. Toronto is in Canada.
4. The sky is blue in Canada!

Now these are the closest according to the embedding model, but they may include wrong answers or chunks that do not contain the answer at all!

This is where re-ranking comes in. To my knowledge this is only available if you are using the Cohere models. What it essentially does is assign a relevance score to each chunk and "re-rank" the chunks based on that score.

Re-rank models (like rerank-english-v2.0) are trained with true answers and false answers, so the model knows which chunk is likely to be a good answer for the given question.

Usage in LangChain is easy: you use the compressor interface, which is itself another method to get quality results.

🥜 In a nutshell...

A compressor basically "compresses" the large retrieved chunks (kind of paraphrases / summarises them) so that the signal stays high in each chunk! So, instead of passing the "whole retrieved" chunk as it is, we pass the compressed version of it.

So in the re-rank case, we use the compressor wrapper of LangChain, but instead of paraphrasing/compressing the chunks, it assigns the ranks.

In our example, it will give:

1. The capital of Canada is Ottawa (0.9)
2. Toronto is in Canada. (0.6)
3. The Canadian capital is Sydney (0.4)
4. The sky is blue in Canada! (0.1)

And based on that again we will select top n chunks! So it is desirable to retrieve a larger number of chunks in the first go and then after re-ranking, select top 3.
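
The "retrieve many, re-rank, keep top n" step can be sketched like this. The scorer is a plain lookup that returns the illustrative numbers from the example above; in a real setup a re-rank model would produce them. `rerank` and `lookup_score` are hypothetical names, not part of any library.

```python
def rerank(query, chunks, score, top_n=3):
    """Score every retrieved chunk against the query and keep the top_n best ones."""
    scored = [(score(query, chunk), chunk) for chunk in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_n]]


# Illustrative scores copied from the example; a real re-ranker would compute them.
ILLUSTRATIVE_SCORES = {
    "The capital of Canada is Ottawa.": 0.9,
    "The Canadian capital is Sydney": 0.4,
    "Toronto is in Canada.": 0.6,
    "The sky is blue in Canada!": 0.1,
}


def lookup_score(query, chunk):
    return ILLUSTRATIVE_SCORES[chunk]


top = rerank(
    "What is the capital of Canada?",
    list(ILLUSTRATIVE_SCORES),
    lookup_score,
    top_n=3,
)
```

Retrieving, say, 10 or 20 chunks first and then cutting down to 3 after scoring gives the re-ranker room to push the truly useful chunk to the top.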

The langchain doc for compression (paraphrasing): doc here
The doc for rerank: doc here


There are a lot of other ways to get better chunks, but I think what I have suggested will solve your issue, AFAIK.

Please let me know if anything is still fuzzy. Because at the end of the day, it all depends on what and how you ask.
Have fun! 🤗

@AjinkyaBankar

Thanks @AayushSameerShah for taking time to explain in detail. Appreciate that!

@terilias

terilias commented Aug 16, 2023

Hello @AayushSameerShah,
Thanks for your effort to explain the topic, from your study and experience, in such a complete and detailed manner!

I decided to write this comment to ask you about the two libraries you mentioned: LangChain and LlamaIndex. I know that LlamaIndex is based on the first one but I can't understand which library I have to choose. What do you think?

Thank you for your time!

Best regards,
Elias

@FatimaHabib

Thank you, @AayushSameerShah! Your explanations are consistently insightful.

I'd like to get your perspective on a strategy I've been using. I've been breaking down the text data into smaller segments, essentially paragraph-sized chunks. Given that in my scenario each question's answer tends to reside within a specific paragraph, this approach seemed fitting. Therefore, when splitting the text using RecursiveCharacterTextSplitter, we can select the maximum number of tokens without overlapping.

I have tried this and I got better answers.
I would like to know whether you have tried this and whether splitting into paragraphs is really effective?
Thanks,

@AayushSameerShah
Author

Hie @terilias 👋🏻

LlamaIndex and LangChain are both gaining popularity because of their areas of focus, but their common purpose is to augment LLMs with your data.

I may differentiate them in the following way:

🦙 LlamaIndex: a vertical framework, an expert in ingestion, manipulation, cleaning, vectorizing, retrieving and so much more. It is like the backend for your application.

As they say themselves in their docs:

📔 Tie LlamaIndex back into the rest of your ecosystem. This could be LangChain, Flask, Docker, ChatGPT, or… anything else!

LangChain: It is the general (horizontal) framework for your application. It is like the "front" with which the user will interact.

A potential point of confusion is: LangChain also has so many connectors and stores already! How does it differ from LlamaIndex then?

Here, I would like to note that for the common day-to-day application we could just use LangChain and it will work just fine! It has more than enough of what we need. It is like a complete toolbox.

It provides:

  • Access to the data (loaders)
  • Cleaning (splitters)
  • Database connectors
  • LLM access (a ton!)
  • Agents
  • Chains
    and a lot...

Overall:

LlamaIndex:

  • Has a huge number of connectors and splitters and different ways to index your data (which can be in any format!)
  • Be it structured - SQL tables from a number of databases
  • Be it unstructured - YouTube video transcripts, your notebook notes, Notion notes or anything
  • It will help you read them, store them and retrieve them (with a variety of retrievers indeed)

LangChain:

  • Has all things in a single place but is an expert in calling the model, managing prompts, building chains and so on.
  • It is basically your interface for the user. The input that the user provides will hit LangChain first.
  • You can connect LlamaIndex with LangChain to get the best of both worlds.
  • Use LlamaIndex for your data part, ingestion and all, and use LangChain for the models, agents and all the fancy routing stuff.

As an end note: I would say you don't need LlamaIndex if your task is not too complex. It has connectors for graphs and other things which you may not want to use, and most applications can be built easily with LangChain alone.

But recently a lot of development is happening in LlamaIndex, and it may make the line between the two libraries thinner 🤷🏻‍♂️

@AayushSameerShah
Author

AayushSameerShah commented Aug 19, 2023

Hello @FatimaHabib !!
I am glad you are getting better results and have found "a good spot" of chunking the documents!

Now, it really depends on what kind of application you are making, because there cannot be a definitive answer for "how large should the chunk be" or "is splitting by paragraph better than the other ways".

As you might have seen, there are different types of text splitters available. And even the RecursiveCharacterTextSplitter is customizable; i.e. instead of splitting on ["\n\n", "\n", " "] you can change the separators to something else.

Just as a little quick example: Say you are building an application that connects the LLM to the document store which can return the chunk of "python code" and give the explanation of that particular code.

Your input: Explain what does the nacci_fibo function do?

[Retriever - fetched 3 matching code]
1. def nacci_fibo(n):
       # code body

2. def fibo(yo):
       # code body

3. def yo():
       # code body

[LLM Chain Response]
The nacci_fibo function prints the fibonacci sequence in reverse order, as given in the definition `nacci_fibo` in the context. As an AI model I can not help you with cheating in exams. Try harder dude!

Well, the model's response was sarcastic 😳 but suppose this is the kind of chain we want to develop (which is dumb BTW, no one stores code in a vectorstore to get explanations!).

In this case you wouldn't use the default RecursiveCharacterTextSplitter because that would mess up the code. The code (your data) has some structure, which needs to be preserved. There you would use a splitter which is specific to the data (maybe a link for an example? ) and which splits the document at the function definitions instead of somewhere in the middle of a body!

And there can be other specific splitters as well to preserve the structure in the chunks!
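
As a rough illustration of a structure-preserving splitter for Python code, the sketch below uses only the standard-library `ast` module to cut a source file at top-level definitions. `split_python_by_function` is a hypothetical helper written for this comment, not a LangChain API; LangChain has its own language-aware splitters, so check its docs for the exact interface.

```python
import ast


def split_python_by_function(source):
    """Return one chunk per top-level function or class, keeping each body intact."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # get_source_segment returns the exact source text of the node.
            chunks.append(ast.get_source_segment(source, node))
    return chunks


sample = '''
def fibo(n):
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

def yo():
    return "yo"
'''

chunks = split_python_by_function(sample)
```

Each chunk is a complete definition, so the retriever never hands the model half of a function body.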


So, long story short: since you are getting good results and your use-case looks like QA over documents, your approach is fine. But if you have some "special" case like the one above, paragraph splitting won't work.

And as a piece of advice, I would like you to add 2 more things to your split:

  1. Use the overlap: Well again it depends, but if you can use chunk_overlap while splitting, it will help the model "connect" the pieces in case they come together in the retrieval.
  2. Prepend the title of the document: This might already happen internally, since the retriever can also look at the metadata of the chunks, but I think it is beneficial to add it manually.
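
For the first point, this is roughly what overlap means, sketched on a list of words rather than tokens. `split_with_overlap` is a hypothetical helper for illustration only; LangChain's splitters do their own (more careful) version of this.

```python
def split_with_overlap(words, chunk_size, overlap):
    """Split a list of words into chunks that share `overlap` words with the previous chunk."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


chunks = split_with_overlap("a b c d e f g h".split(), chunk_size=4, overlap=2)
```

Every chunk repeats the last two words of the previous one, so a sentence cut at a chunk boundary still appears whole in at least one chunk.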

👉 A small example if you don't mind?
Suppose we have a long long text and want to split by sentence (splitting by .).

Interstellar is a 2014 epic science fiction film co-written, directed, and produced by Christopher Nolan.
It stars Matthew McConaughey, Anne Hathaway, Jessica Chastain, Bill Irwin, Ellen Burstyn, Matt Damon, and Michael Caine.
Set in a dystopian future where humanity is struggling to survive, the film follows a group of astronauts who travel through a wormhole near Saturn in search of a new home for mankind.

Brothers Christopher and Jonathan Nolan wrote the screenplay, which had its origins in a script Jonathan developed in 2007.
Caltech theoretical physicist and 2017 Nobel laureate in Physics[4] Kip Thorne was an executive producer, acted as a scientific consultant, and wrote a tie-in book, The Science of Interstellar.
Cinematographer Hoyte van Hoytema shot it on 35 mm movie film in the Panavision anamorphic format and IMAX 70 mm.
Principal photography began in late 2013 and took place in Alberta, Iceland, and Los Angeles.
Interstellar uses extensive practical and miniature effects and the company Double Negative created additional digital effects.

Interstellar premiered on October 26, 2014, in Los Angeles.
In the United States, it was first released on film stock, expanding to venues using digital projectors.
The film had a worldwide gross over $677 million (and $773 million with subsequent re-releases), making it the tenth-highest grossing film of 2014.
It received acclaim for its performances, direction, screenplay, musical score, visual effects, ambition, themes, and emotional weight.
It has also received praise from many astronomers for its scientific accuracy and portrayal of theoretical astrophysics. Since its premiere, Interstellar gained a cult following,[5] and now is regarded by many sci-fi experts as one of the best science-fiction films of all time.
Interstellar was nominated for five awards at the 87th Academy Awards, winning Best Visual Effects, and received numerous other accolades

So, you split the text and get the following:

[
'Interstellar is a 2014 epic science fiction film co-written...',
'It stars Matthew McConaughey, Anne Hathaway, Jessica Chastain, Bill Irwin, Ellen Burstyn, Matt Damon, and Michael Caine',
'Set in a dystopian future where humanity is struggling to survive...',
'Brothers Christopher and Jonathan Nolan wrote the screenplay...',
'Caltech theoretical physicist and 2017 Nobel laureate in Physics[4] Kip...',
'Cinematographer Hoyte van Hoytema shot it on 35 mm...',
'Principal photography began in late 2013 and took place in Alberta, Iceland, and Los Angeles',
'In the United States, it was first released on film stock, expanding to venues using digital projectors',
...]

Longer sentences are truncated, but as you can see, the whole thing discusses the Interstellar movie. Yet when we look at an individual sentence like 'In the United States, it was first released on film stock, expanding to venues using digital projectors.' we can't tell that it is about Interstellar!

So, in such case if we "append the prefix" to each sentence like:

[
'Interstellar (2014): Interstellar is a 2014 epic science fiction film co-written...',
'Interstellar (2014): It stars Matthew McConaughey, Anne Hathaway, Jessica Chastain, Bill Irwin, Ellen Burstyn, Matt Damon, and Michael Caine',
'Interstellar (2014): Set in a dystopian future where humanity is struggling to survive...',
'Interstellar (2014): Brothers Christopher and Jonathan Nolan wrote the screenplay...',
'Interstellar (2014): Caltech theoretical physicist and 2017 Nobel laureate in Physics[4] Kip...',
'Interstellar (2014): Cinematographer Hoyte van Hoytema shot it on 35 mm...',
'Interstellar (2014): Principal photography began in late 2013 and took place in Alberta, Iceland, and Los Angeles',
'Interstellar (2014): In the United States, it was first released on film stock, expanding to venues using digital projectors',
...]

This will keep the context! So when you ask something related to Interstellar, the retriever will fetch such chunks more reliably and the model will produce better results!

I am not sure if there is a standard way to do this in LangChain... but I work with the following code to do so:

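from langchain.schema import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter()

# Static string to prepend to every chunk
prefix = "Interstellar (2014): "

def prepend_prefix(docs):
    # Build new Documents so the prefix becomes part of the embedded text
    return [
        Document(page_content=prefix + doc.page_content, metadata=doc.metadata)
        for doc in docs
    ]

# long_document is the full article text as a single string
texts = text_splitter.create_documents([long_document])
texts_with_prefix = prepend_prefix(texts)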

Let me know if it helps 👍🏻

@umairafzal9245

@AayushSameerShah what if we increase the number of trainable parameters by using larger LoRA matrices? Will the LLM be able to store new information?

@FatimaHabib

Thanks @AayushSameerShah, it helps a lot. I forgot about using the separators parameter.
Now I am facing another issue /: the answers to some questions are in tables, and when converting PDF to text, the text of the table gets messed up and the model cannot find the answers.

Do you have any tip to solve this issue?
Thanks again ^^

@terilias

terilias commented Aug 29, 2023

Thank you @AayushSameerShah for your answer to my question! Have a good day!

@AayushSameerShah
Author

@FatimaHabib, oh PDFs? Okay... Tables? Okay... Tables in PDFs? Not okay 😨

Actually I was dealing with a similar problem some months ago, where I needed to take the content stored in PDFs and feed it to the LLM to get "accurate" results.

While LangChain provides a great list of connectors for every type of data including PDFs, it falls short at parsing tables stored in PDFs.

If you have simple textual data stored in a PDF, it will work, and if you have tables stored in CSV or Excel, it will work too. But because of the way a PDF stores tables, it gets pretty challenging to pull the data back at retrieval time.

Because...

☝ Table can be stored as images

I believe many connectors/loaders in LangChain have some kind of "fast" or "quality" flag. Depending on it, the loader will internally use some kind of OCR to extract text from images, while the "fast" mode won't.

Extracting text data (especially numbers) from an image can lead to highly inaccurate results. Because of the quality of the image (or the ability of the model to interpret the numbers), the numbers can easily be misread (reading 174% when it is actually 17.4%).

✌️ Tables are in text but...

... the structure is messed up. Since the PDF loader reads everything as text, left to right, and feeds that to the LLM, the model won't be able to comprehend which value relates to which column. But here, at least, you read the right numbers (17.4%).


🤔 Any solution?

During my research I tried many libraries and models to get through this, and it is a known problem. Actually there are many online businesses which let the user upload PDFs and, you know, extract the data while preserving the structure and all... but for now I think you should check out the unstructured.io library for Python, because off the top of my head I remember it giving good results.

I think LangChain uses it internally as a part of its connectors, but try to look at it as a separate service, use it on its own, and see what happens.

Actually there were a bunch of good libraries that I came across, but I need to dig a little in my Discord, because the list is lost in the history somewhere. I may update this comment when I find it.

For now...

But for now, in the open-source way, we have options but need to decide carefully when to use which, because we don't know the table structure in advance!

The table structure may change from PDF to PDF. If it doesn't (which is highly unlikely), then you can use a simple RegEx to extract the data and go forward with it!
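
For the lucky case of a fixed layout, a RegEx pass could look like the sketch below. The row layout ("label, whitespace, percentage") and the sample text are made up for illustration, so the pattern would need to be adapted to whatever your PDF loader actually emits.

```python
import re

# Hypothetical row layout: a text label, whitespace, then a percentage such as 17.4%
ROW_PATTERN = re.compile(r"^(?P<name>[A-Za-z ]+?)\s+(?P<value>\d+(?:\.\d+)?)%\s*$")


def parse_rows(text):
    """Map each row label to its percentage; lines that do not match are skipped."""
    rows = {}
    for line in text.splitlines():
        match = ROW_PATTERN.match(line.strip())
        if match:
            rows[match.group("name")] = float(match.group("value"))
    return rows


# Made-up text as it might come out of a PDF loader
extracted = """Metric Share
Growth rate 17.4%
Churn 3%
"""

rows = parse_rows(extracted)
```

Once the values are back in a dictionary (or a CSV), you can hand the model a clean, labelled table instead of a left-to-right text dump.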

Good luck!

@kaoutaar

kaoutaar commented Sep 11, 2023

Hi @AayushSameerShah! Can you please tell me which pretrained models on Hugging Face I can use with the following input/output format:
title {sep} text {sep} question =====> generate (NOT extract) an answer
no finetuning is needed, i want to directly apply it on my data.

@AayushSameerShah
Author

AayushSameerShah commented Sep 12, 2023

Hello @kaoutaar 👋🏻
If you had asked me this question 2 months back I would have answered differently, but by today's standards Meta's Llama-2 is an amazing model (also commercially usable), if not the best.

Your question doesn't specify your requirements, available resources etc., so I can't guide you exactly, but meta-llama/Llama-2-7b-chat-hf would be a good choice for most application needs.

Many communities have fine-tuned the base Llama-2 model and achieved great scores on the leaderboard, but Meta's vanilla Llama-2 should cover your generative QA need with satisfactory results.

🥂

@kaoutaar

kaoutaar commented Sep 12, 2023

@AayushSameerShah thank you for responding, I've actually sent a request to Meta for access and am still waiting.
Which requirements should I specify? Sorry, I think I don't really understand what you meant by "Your question doesn't specify your requirements, resources available etc".

Also, I am really curious what you would have suggested 2 months ago, because I am trying to find all ready-to-use models that have this input/output format.

@thomasjv799

thomasjv799 commented Feb 7, 2024

I know it's been a couple of months and there have been a lot of new developments, but is there anything new that could help us fine-tune an LLM on our private documents so that it gives solid answers?

@MatteoRiva95

Hello everyone, I am in the same situation as @thomasjv799!
In the last couple of weeks I tried RAG and unfortunately it could not give me good results :(
I have almost 100k English PDFs, all about the same topic, and I really tried everything: changing the prompt, the embedding model, the LLM, the chunk size, but nothing helped. Sometimes it gives very good answers, while sometimes it gives erroneous information and a sort of "hallucination" :(
Moreover, I suppose due to this huge amount of data, RAG is really slow: it takes 10 minutes or more to answer (on a cluster with 32 GB of RAM).

Should I use fine-tuning then? If yes, how can I get my model to assimilate all this data?
Any help would be really appreciated. Thank you in advance!

@AayushSameerShah
Author

AayushSameerShah commented Feb 16, 2024

@thomasjv799 Of course!
But to train your model, you will need to do some homework i.e. preparation of the dataset.

So if your dataset is just a dump of Taylor Swift's background, news, and other articles, we can't train the model on that raw dump and expect it to give "solid answers". (btw are you a Swiftie? If yes, we will have a great conversation 💜).

For that, you will need to create a whole new dataset with the "question and answer pairs".

To do that:

  • You can make use of any good GPT model (check out the T&C of the respective model whether they support this)
  • You can create your own.

After doing that you can proceed further for training the model of your choice and then finally the model will hopefully answer what TS means by "Vigilante Shit" in her song ✌🏻
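
Once you have the pairs, turning them into training records is the easy part. Below is a minimal sketch, assuming a simple "### Question / ### Answer" template and a JSONL file with one `text` field per line; both the template and the field name are my assumptions, so match them to whatever your training script expects.

```python
import json

# Hypothetical prompt template; adapt it to the format your base model was trained on.
TEMPLATE = "### Question:\n{question}\n\n### Answer:\n{answer}"


def to_training_records(qa_pairs, template=TEMPLATE):
    """Turn (question, answer) tuples into one text record per pair."""
    return [{"text": template.format(question=q, answer=a)} for q, a in qa_pairs]


def dump_jsonl(records):
    """Serialize records as JSON Lines (one JSON object per line)."""
    return "\n".join(json.dumps(record, ensure_ascii=False) for record in records)


records = to_training_records([("Who directed Interstellar?", "Christopher Nolan")])
jsonl = dump_jsonl(records)
```

Write `jsonl` to a file and most fine-tuning scripts that accept a text column can load it directly.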


PS: I am attaching a link to a short - free course on deeplearning.ai which will help you to fine-tune the model for Q&A.
🔗 https://www.deeplearning.ai/short-courses/finetuning-large-language-models/

Best,
Aayush

@AayushSameerShah
Author

AayushSameerShah commented Feb 16, 2024

@MatteoRiva95 Dude, your issue might be similar to the one I have tried to address above.
The punchline is: "The questions that we ask don't always contain enough information for the retriever to know what to retrieve".

So say you are working on a project whose documents are "software manuals", and you ask questions like: "How to create X and apply Y filter"?

In this case, there may be no passage in your manual that directly answers your "HOWs", so the retriever doesn't have enough information to know what to return. Whatever it returns, the model tries its best to answer your question from it, and that is where the hallucination comes from.

Here, you can use some modules from langchain that create similar questions that may answer your original question.
So let's say your question is the same: "How to create X and apply Y filter".

In this case, the langchain module may create alternative questions like:

  • "What are the steps of creating X?"
  • "What is Y filter"?
  • "Steps to apply Y filter"

For each of these questions we retrieve the chunks, and the union of them is more likely to answer your question (without hallucinating)!
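
The retrieve-and-union step can be sketched in a few lines. Here `fake_rewrite` and `fake_retrieve` are hypothetical stand-ins for the LLM that writes alternative questions and for your vector store, so the sketch runs on its own; it shows the idea, not LangChain's implementation.

```python
def multi_query_retrieve(question, rewrite, retrieve):
    """Retrieve for the original question and each rewrite, then return the de-duplicated union."""
    queries = [question] + list(rewrite(question))
    seen, merged = set(), []
    for query in queries:
        for chunk in retrieve(query):
            if chunk not in seen:
                seen.add(chunk)
                merged.append(chunk)
    return merged


# Hypothetical stand-ins so the sketch runs without an LLM or a vector store.
def fake_rewrite(question):
    return ["What are the steps of creating X?", "What is Y filter?"]


FAKE_INDEX = {
    "How to create X and apply Y filter?": ["chunk-7"],
    "What are the steps of creating X?": ["chunk-1", "chunk-7"],
    "What is Y filter?": ["chunk-4"],
}


def fake_retrieve(query):
    return FAKE_INDEX.get(query, [])


merged = multi_query_retrieve(
    "How to create X and apply Y filter?", fake_rewrite, fake_retrieve
)
```

The original question alone finds one chunk; with the rewrites the model gets three distinct chunks to answer from.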


A lot of new documentation has been published for LangChain, and even I haven't had an opportunity to skim through it all. But I would encourage you to check it out.

Especially here: https://python.langchain.com/docs/modules/data_connection/retrievers/

Good luck!

@MatteoRiva95

@AayushSameerShah Thank you so much for your kind reply. Really appreciated! :)

Do you think that changing the retriever could help me, even with this enormous amount of data? I am afraid that 100k PDFs is too much for RAG :( It takes many minutes to reply to one question, even on a decent (though not super) GPU and CPU setup!

I am trying to develop a chatbot with an acceptable response time. For this reason, I was thinking of going back to fine-tuning (even though last time it gave me a lot of hallucination... but maybe I did something wrong... I am still a beginner after all). But I am not so sure...

What do you suggest?

Thank you so much again for your support!

@AayushSameerShah
Author

AayushSameerShah commented Feb 19, 2024

Also with this enormous amount of data? Because I am afraid that 100k PDFs is too much for RAG :( It takes a lot of minutes

There are always 2 parts.

  1. Retrieval
  2. Generation

Generally, retrieval is fast. But as you say, there are 100k PDFs, which I agree is a lot. In that case, you can look at different vector stores that provide faster search. Even some paid ones may help (like Pinecone or Weaviate).

Secondly, the generation: there is something loosely called "startup time". This means that even before the model starts generating the first token, it takes some time to process the prompt (the typical forward pass). How long this takes differs based on the architecture, framework, and resources you're using.


So,

  1. Analyse the whole process by dividing them into smaller individual parts.
  2. I suppose you are using langchain. It gives you the retrievers, vector stores, and LLM module all separate and all together to make a chain.
  3. So, use all parts individually.

For a concrete example:

  • Use a vectorstore A and create embeddings and store your 100k PDFs there.
  • Don't bother too much about the model now.
  • Now, pass a single query "How to do XX" and see how much time it takes to just retrieve something.

In this phase, you will know whether the time is spent too much in retrieval or somewhere else. In this stage, you may compare a couple of vector stores.
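
A tiny timing helper is enough for this kind of profiling. In the sketch below `fake_retrieve` and `fake_generate` are hypothetical placeholders; swap in your real vector store lookup and your real LLM call to see where the minutes go.

```python
import time


def timed(label, fn, *args, **kwargs):
    """Run fn, print how long it took, and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.3f}s")
    return result, elapsed


# Hypothetical placeholders for the two stages of the chain.
def fake_retrieve(query):
    return ["chunk-1", "chunk-2"]


def fake_generate(query, chunks):
    return f"answer based on {len(chunks)} chunks"


chunks, t_retrieve = timed("retrieval", fake_retrieve, "How to do XX")
answer, t_generate = timed("generation", fake_generate, "How to do XX", chunks)
```

Run the same query against two or three vector stores and compare the retrieval numbers before touching the model at all.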

Then, move to the model.
Now, it heavily depends on the model size, architecture, and resources. Also, which framework do you use to run the model? You may be aware of quantized formats such as "GGUF" or "AWQ". They usually run much faster than the typical full-precision HF transformers setup, and GGUF in particular works well even on the CPU.

Alternatively, you can also look at the ONNX format, which could be the ultimate solution but would require some more digging. I would recommend looking at ONNX last, when nothing else works.

Now you might have a better picture of what is taking the time. Generally, the culprit is the model: the bigger the context, the longer it takes. So, you may want to check other models/frameworks or upgrade your resources.

Do you think that changing the retriever could help me?

Quality-wise or time-wise? I would say it would make more of a difference quality-wise. But I might be wrong here too, so I can't say for sure.


At the EOD, there are so many moving pieces (I didn't even suggest changing the chunk size and all) that you need to measure where the time is being spent.

I have given scattered information, I know... let me know if you need any clarification.

Best,
Aayush Shah

@MatteoRiva95

@AayushSameerShah Yes, there are a lot of info, variables, ways and options to try! But really, thank you so much for your kind help and time :)

Yes, exactly I am using Langchain and in particular I am following these tutorials for RAG:

https://levelup.gitconnected.com/building-a-private-ai-chatbot-2c071f6715ad
https://www.diariodiunanalista.it/posts/chatbot-python-langchain-rag/

They are super useful, but only with a small amount of data, because with 60k PDFs (the last test I did)... well, that is where the troubles begin :( Moreover, deploying RAG over 100k PDFs becomes very complex!

What about fine-tuning then? Maybe turning all the PDFs into a huge CSV file containing question and answer columns could be optimal for this process, and could be the way to avoid hallucination and to have an LLM ready to reply to my questions.

Apologies again: I am a beginner, I am trying to find a solution to this huge issue and maybe my replies are wrong and not detailed enough :(

Thank you again so much!

@pallavi-allada

pallavi-allada commented Feb 23, 2024

@AayushSameerShah I have been working on a project to be able to generate Hindi Questions and their answers from given context. I use LangChain, HuggingFace - OpenHathi base model for generation of questions given a context. The model generates questions in both Hindi and English, though I would like it to generate only Hindi questions. Also, it does not provide the answers for the questions it generates. I have used the below prompt for question and answer generation -

प्रणाली :

नीचे दी गई संदर्भ के आधार पर दस हिंदी प्रश्न तैयार करें। प्रश्न दोहराएँ नहीं।
सभी प्रश्न और उत्तर केवल हिंदी में ही उपलब्ध कराए जाएंगे।

यह कथा अवध की है। और बहुत पुरानी है। अवध में सरयू नदी के किनारे एक अति सुंदर नगर था। अयोध्या। सही अर्थों में दर्शनीय। देखने लायक। भव्यता जैसे उसका दूसरा नाम हो! अयोध्या में केवल राजमहल भव्य नहीं था। उसकी एक-एक इमारत आलीशान थी। आम लोगों के घर भव्य थे। सड़कें चौड़ी थीं। सुंदर बाग-बगीचे। पानी से लबालब भरे सरोवर। खेतों में लहराती हरियाली। हवा में हिलती फ़सलें सरयू की लहरों के साथ खेलती थीं। अयोध्या हर तरह से संपन्न नगरी थी। संपन्‍नता कोने-अंतरे तक बिखरी हुई। सभी सुखी। सब समृद्ध। दु:ख और विपन्नता को अयोध्या का पता नहीं मालूम था। या उन्हें नगर की सीमा में प्रवेश की अनुमति नहीं थी। पूरा नगर विलक्षण था। अदूभुत और मनोरम। उसे ऐसा होना ही था। वह कोसल राज्य की राजधानी था। राजा दशरथ वहीं रहते थे। उनके राज में दुःख का भला क्या काम? राजा दशरथ कुशल योद्धा और न्यायप्रिय शासक थे। महाराज अज के पुत्र। महाराज रघु के वंशज। रघुकुल के उत्तराधिकारी। रघुकुल की रीति-नीति का प्रभाव हर जगह दिखाई देता था। सुख-समृद्धि से लेकर बात-व्यवहार तक। लोग मर्यादाओं का पालन करते थे। सदाचारी थे। पवित्रता और शांति हर जगह थी। नगर में भी। लोगों के मन में भी। राजा दशरथ यशस्वी थे। उन्हें किसी चीज़ की कमी नहीं थी। राज-सुख था। कमी होने का प्रश्न ही नहीं था। लेकिन उन्हें एक दुःख था। छोटा सा दुःख। मन के एक कोने में छिपा हुआ। वह रह-रहकर उभर आता था। उन्हें सालता रहता था। उनके कोई संतान नहीं थी। आयु लगातार बढ़ती जा रही थी। ली सुनियाँ थीं- कौशुल्या, सुमित्रा और रानियों के मन में भी बस यही एक दुःख था। संतान की कमी। जीवन सूना-सूना लगता था। राजा दशरथ से रानियों की बातचीत प्राय: इसी विषय पर आकर रुक जाती थी। राजा दशरथ की चिंता बढ़ती जा रही थी। बहुत सोच-विचारकर महाराज दशरथ ने इस संबंध में वशिष्ठ मुनि से चर्चा की। उन्हें पूरी बात विस्तार से बताई। रघुकुल के अगले उत्तराधिकारी के बारे में अपनी चिंता बताई। मुनि वशिष्ठ राजा दशरथ की चिंता समझते थे। उन्होंने दशरथ को यज्ञ करने की सलाह दी। पुत्रेष्टि यज्ञ। महर्षि ने कहा, “आप पुत्रेष्टि यज्ञ करें, महाराज! आपकी इच्छा अवश्य पूरी होगी।” में हुआ। पूरा नगर उसकी तैयारी में लगा. 
हुआ था। यज्ञशाला सरयू नदी के किनारे बनाई गई। यज्ञ में अनेक राजाओं को निमंत्रित किया गया। तमाम ऋषि-मुनि पधारे। शंखध्वनि और मंत्रोच्चार के बीच एक-एक कर सबने आहुति डाली। अंतिम आहुति राजा दशरथ की थी। यज्ञ पूरा हुआ। अग्नि के देवता ने महाराज दशरथ को आशीर्वाद दिया। कुछ समय बाद दशरथ की इच्छा पूरी हुई। तीनों रानियाँ पुत्रवती हुई। महारानी कौशल्या ने राम को जन्म दिया। चैत्र माह की नवमी के दिन। रानी सुमित्रा के दो पुत्र हुए। लक्ष्मण और शत्रुघ्त। रानी कैकेयी के पुत्र का नाम भरत रखा गया।

प्रश्न :

And below are the responses it generates -

  1. किसका जन्म चैत्र मास की नवमी के दिन हुआ था?
  2. Who was the father of Ram and Lakshman?
  3. कौन भरत का पिता था?
  4. Which king lived in Kosala?
  5. किस राजा ने यज्ञ किया था?
  6. What did he ask for?
  7. किस राजा ने यज्ञ किया था?
  8. Who was the father of Sumitra's two sons?
  9. किस राजा ने यज्ञ किया था?
  10. Who was the father of Bharat?
    जवाबः
  11. The birthday of Ram and Lakshman was on the ninth day of the month of Chaitra.
  12. राम और लक्ष्मण के पिता राजा दशरथ थे।

Issues I face -

  1. I have tried stressing the fact that only Hindi text is to be generated, but it still uses English in the response.
  2. The questions are repeated, though I mention in the prompt not to repeat the questions and have also set repetition_penalty=1.18.
  3. I have even tried a different prompt that provides an example of how the output should look, but in vain; it keeps generating only this format.
  4. From my observations, it seems to generate questions related only to the last few lines of the paragraph; no questions were generated for the earlier sentences. Do you think I should try generating questions for smaller chunks of context rather than larger ones? Or should I summarize the paragraphs and then generate the questions on the summarized paragraph?

Please advise on how to proceed, and whether you think I should create a dataset in the format below and fine-tune the base model to respond in the given format.
<Context 1 > <Question 1 >
<Context 1 > <Question 2>
<Context 1 > <Question 3 >
.......
<Context 2> <Question 1 >
<Context 2> <Question 2>
<Context 2> <Question 3 >
......

I have worked on finetuning for classification and QA (SQUAD) tasks on BERT for English text but have not finetuned Llama2 models before using PEFT techniques. Can you please guide me here?

@AayushSameerShah
Author

AayushSameerShah commented Feb 23, 2024

Hie @pallavi-allada
Based on my first impression, I don't think the model itself is the problem for this task. Generating questions is a relatively easy task for a 7B model.

But I suspect that, since the base model (Llama-2) is pre-trained mostly on an English corpus and Sarvam AI has tuned it for Indic languages, the 7B model may not have a perfect grip on Hindi instructions. Again, I assume that they have fine-tuned the model and not pre-trained it from scratch; I have not read their paper thoroughly. Nevertheless, it is not that important.

What is important is giving structured instructions. As models get larger, they pick up the structure of where the instruction ends and where the example ends. For smaller models like this one, we need to be a bit more careful (especially when the model is trained on multiple languages).

In your case, I can see that the instructions you gave are in Hindi, like: "नीचे दी गई संदर्भ के आधार पर दस हिंदी प्रश्न तैयार करें। प्रश्न दोहराएँ नहीं।", and that the output structure is not given. (I don't know if you have already tried these things, but let me write them down just in case.)

Make sure your prompt has these things in place:

  • Try giving the instruction in English only. The model will still carry out the generation in Hindi. (I assume you have tested this.)
  • Let the model know where the passage starts and where it ends by using separators.
  • Give the format in which the model should respond. This tells the model what shape the output must have. (You have tried giving examples, but this is about the "format".)
  • Use the prompt signature. Llama-2 chat models, for example, expect a format like <s>[INST] <<SYS>> {system prompt} <</SYS>> {user message} [/INST]
  • Following the prompt structure is a must (if Sarvam AI specifies a format, you should use that), because nowadays most models are no longer plain completion models that continue from something like "Start: "; they are trained to follow a specific format.

Let's see an example: (I am not including the prompt signature since I don't know which one your model uses, but you should add it)

### Instructions:
- Pretend as if you are an experienced professor at a Hindi university. Your task is to generate 3 questions from the given passage along with their answers all in Hindi.
- You must write questions and answers both in Hindi for the upcoming examination.
- Follow the question-answer format given below

### Question-answer format
```
**Question 1**: 
**Answer 1**: 

**Question 2**: 
**Answer 2**: 

**Question N**: 
**Answer N**: 
```

### The passage is given below:
"""
यह कथा अवध की है। और बहुत पुरानी है। अवध में सरयू नदी के किनारे एक अति सुंदर नगर था। अयोध्या। सही अर्थों में दर्शनीय। देखने लायक। भव्यता जैसे उसका दूसरा नाम हो! अयोध्या में केवल राजमहल भव्य नहीं था। उसकी एक-एक इमारत आलीशान थी। आम लोगों के घर भव्य थे। सड़कें चौड़ी थीं। सुंदर बाग-बगीचे। पानी से लबालब भरे सरोवर। खेतों में लहराती हरियाली। हवा में हिलती फ़सलें सरयू की लहरों के साथ खेलती थीं। अयोध्या हर तरह से संपन्न नगरी थी। संपन्‍नता कोने-अंतरे तक बिखरी हुई। सभी सुखी। सब समृद्ध। दु:ख और विपन्नता को अयोध्या का पता नहीं मालूम था। या उन्हें नगर की सीमा में प्रवेश की अनुमति नहीं थी। पूरा नगर विलक्षण था। अदूभुत और मनोरम। उसे ऐसा होना ही था। वह कोसल राज्य की राजधानी था। राजा दशरथ वहीं रहते थे। उनके राज में दुःख का भला क्या काम? राजा दशरथ कुशल योद्धा और न्यायप्रिय शासक थे। महाराज अज के पुत्र। महाराज रघु के वंशज। रघुकुल के उत्तराधिकारी। रघुकुल की रीति-नीति का प्रभाव हर जगह दिखाई देता था। सुख-समृद्धि से लेकर बात-व्यवहार तक। लोग मर्यादाओं का पालन करते थे। सदाचारी थे। पवित्रता और शांति हर जगह थी। नगर में भी। लोगों के मन में भी। राजा दशरथ यशस्वी थे। उन्हें किसी चीज़ की कमी नहीं थी। राज-सुख था। कमी होने का प्रश्न ही नहीं था। लेकिन उन्हें एक दुःख था। छोटा सा दुःख। मन के एक कोने में छिपा हुआ। वह रह-रहकर उभर आता था। उन्हें सालता रहता था। उनके कोई संतान नहीं थी। आयु लगातार बढ़ती जा रही थी। ली सुनियाँ थीं- कौशुल्या, सुमित्रा और रानियों के मन में भी बस यही एक दुःख था। संतान की कमी। जीवन सूना-सूना लगता था। राजा दशरथ से रानियों की बातचीत प्राय: इसी विषय पर आकर रुक जाती थी। राजा दशरथ की चिंता बढ़ती जा रही थी। बहुत सोच-विचारकर महाराज दशरथ ने इस संबंध में वशिष्ठ मुनि से चर्चा की। उन्हें पूरी बात विस्तार से बताई। रघुकुल के अगले उत्तराधिकारी के बारे में अपनी चिंता बताई। मुनि वशिष्ठ राजा दशरथ की चिंता समझते थे। उन्होंने दशरथ को यज्ञ करने की सलाह दी। पुत्रेष्टि यज्ञ। महर्षि ने कहा, “आप पुत्रेष्टि यज्ञ करें, महाराज! आपकी इच्छा अवश्य पूरी होगी।” में हुआ। पूरा नगर उसकी तैयारी में लगा. 
हुआ था। यज्ञशाला सरयू नदी के किनारे बनाई गई। यज्ञ में अनेक राजाओं को निमंत्रित किया गया। तमाम ऋषि-मुनि पधारे। शंखध्वनि और मंत्रोच्चार के बीच एक-एक कर सबने आहुति डाली। अंतिम आहुति राजा दशरथ की थी। यज्ञ पूरा हुआ। अग्नि के देवता ने महाराज दशरथ को आशीर्वाद दिया। कुछ समय बाद दशरथ की इच्छा पूरी हुई। तीनों रानियाँ पुत्रवती हुई। महारानी कौशल्या ने राम को जन्म दिया। चैत्र माह की नवमी के दिन। रानी सुमित्रा के दो पुत्र हुए। लक्ष्मण और शत्रुघ्त। रानी कैकेयी के पुत्र का नाम भरत रखा गया।
"""

**Question 1**:

I think this will work. Here:

  • I have put the separators
  • Given instructions in English on how to perform and what to perform
  • Given the format to respond to
  • And ended with the leading formatting tokens (again, follow the prompt signature that is appropriate for your model).
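The layout above can also be assembled in code, so the same structure is reused for every passage. This is only a minimal sketch: `build_qa_prompt` and `FENCE` are hypothetical names, and wrapping the result in the model's own prompt signature is left out because it depends on the model.

```python
# Built this way so that this snippet can itself sit inside a fenced block.
FENCE = "`" * 3


def build_qa_prompt(passage: str, num_questions: int = 3) -> str:
    """Assemble instructions, answer format, and passage into one prompt.

    Hypothetical helper: the section titles mirror the example prompt above.
    """
    instructions = "\n".join([
        "### Instructions:",
        "- Pretend as if you are an experienced professor at a Hindi university. "
        f"Your task is to generate {num_questions} questions from the given passage "
        "along with their answers all in Hindi.",
        "- You must write questions and answers both in Hindi for the upcoming examination.",
        "- Follow the question-answer format given below",
    ])
    answer_format = "\n".join([
        "### Question-answer format",
        FENCE,
        "**Question 1**: ",
        "**Answer 1**: ",
        "",
        "**Question N**: ",
        "**Answer N**: ",
        FENCE,
    ])
    # Triple quotes act as separators so the model can see where the passage ends.
    passage_block = '### The passage is given below:\n"""\n' + passage.strip() + '\n"""'
    # End with the leading formatting token so the model continues in the format.
    return "\n\n".join([instructions, answer_format, passage_block, "**Question 1**:"])
```

Calling `build_qa_prompt(passage, num_questions=5)` changes only the count in the instruction line; everything else stays fixed.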

So,

  • I don't think you need to give smaller or larger chunks. Larger passages are fine.
  • You may tweak the number of question-answer pairs you want to generate.
  • After the generation, you need to write a small Python script to extract the questions and answers as pairs.
  • More mature models like GPT-4 can give you output in JSON, but I don't think we should do that with this model.
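The extraction step could look roughly like the sketch below. It assumes the model really follows the "Question N" / "Answer N" labels from the format above (with or without the asterisks); `PAIR_RE` and `extract_qa_pairs` are hypothetical names, and the pattern will need adjusting if the model uses other labels.

```python
import re

# Assumption: labels look like "**Question 1**:" / "**Answer 1**:" (asterisks optional).
PAIR_RE = re.compile(
    r"\*{0,2}Question\s+(\d+)\*{0,2}\s*:\s*(.*?)\s*"   # question label and text
    r"\*{0,2}Answer\s+\1\*{0,2}\s*:\s*(.*?)\s*"        # answer with the same number
    r"(?=\*{0,2}Question\s+\d+\*{0,2}\s*:|\Z)",        # stop at next question or end
    re.DOTALL,
)


def extract_qa_pairs(text: str) -> list:
    """Return a list of {"question": ..., "answer": ...} dicts found in text."""
    return [
        {"question": question.strip(), "answer": answer.strip()}
        for _number, question, answer in PAIR_RE.findall(text)
    ]
```

Because the prompt ends with the leading `**Question 1**:` token, the completion itself will usually not contain it, so prepend that token to the generated text before calling `extract_qa_pairs`.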

One more thing: if it still gives wrong results, there is a little more room to improve the prompt, but beyond that you should switch to a larger model that understands Hindi.

BTW, Google's Gemini is multilingual; if you can use it through Vertex AI, it should give you good-quality results!

Let me know how that goes!

@pallavi-allada

pallavi-allada commented Feb 23, 2024

@AayushSameerShah - Thank you for the quick response. I modified the prompt you suggested and got some questions and answers, though with some issues.
The prompt as used -

### Instructions:
- Pretend as if you are an experienced professor at a Hindi university. Your task is to generate questions and their answers from the given passage .
- You must write three questions and answers both in Hindi for the upcoming examination. 
- Follow the question-answer format given below 

### Question-answer format
` ` `
**Question 1**: <question in hindi>
**Answer 1**: <answer in hindi>

` ` `

### The passage is given below:
{context}

### Questions and answers in Hindi:

"""

And below is the response in one run -

प्रश्न 1: सरस्वती नदी के तट पर स्थित शहर कौन सा है?
Answers 1: Ayodhya

प्रश्न 2: किसका जन्म चैत्र महीने की नौवें दिन हुआ था?
Answers 2: Ram

प्रश्न 3: किसका जन्म चैत्र महीने की नौवीं तारीख को हुआ था?
Answers 3: Bharat

And below is the response in another run where I asked for generating 5 questions -

प्रश्न 1: सरस्वती नदी के तट पर स्थित शहर कौन सा था?
Answers 1: Ayodhya

प्रश्न 2: किसका जन्म चैत्र महीने की नौवें दिन हुआ था?
Answers 2: Ram

प्रश्न 3: किसका जन्म चैत्र महीने की नौवीं तारीख को हुआ था?
Answers 3: Lakshman

प्रश्न 4: किसका जन्म चैत्र महीने की आठवीं तारीख को हुआ था?
Answers 4: Shatrughan

Incorrect Questions
Question 1 is incorrect: the river is "Sarayu", not "Saraswati" as the question says. Likewise, the model generates incorrect questions for other passages too.
Out of Context Questions
प्रश्न 4: किसका जन्म चैत्र महीने की आठवीं तारीख को हुआ था? ("Who was born on the eighth day of the month of Chaitra?") The passage does not mention the date or day on which Lakshman, Shatrughn or Bharat were born, yet the question is generated.
Answers still generated in English
This was the reason I tried writing the entire prompt in Hindi, to force the model to generate all the content in Hindi. It may be the wrong approach, but I tried it anyway and it does not work either.
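Issues like these can be flagged automatically before a generated pair is accepted. The sketch below is a rough illustration only: `is_devanagari` and `flag_issues` are hypothetical helpers, the 0.5 threshold is an arbitrary assumption, and the verbatim check is crude (a correct answer that is paraphrased would also be flagged).

```python
def is_devanagari(text: str, threshold: float = 0.5) -> bool:
    """True if at least `threshold` of the letters fall in the Devanagari block."""
    def in_block(ch: str) -> bool:
        return "\u0900" <= ch <= "\u097F"

    # Vowel signs are not alphabetic in Unicode terms, so test the block too.
    letters = [ch for ch in text if ch.isalpha() or in_block(ch)]
    if not letters:
        return False
    return sum(in_block(ch) for ch in letters) / len(letters) >= threshold


def flag_issues(pairs: list, passage: str) -> list:
    """Return (pair, problems) tuples; an empty problems list means no flag."""
    seen_questions = set()
    report = []
    for pair in pairs:
        problems = []
        question = " ".join(pair["question"].split())  # normalise whitespace
        if question in seen_questions:
            problems.append("duplicate question")
        seen_questions.add(question)
        answer = pair["answer"].strip()
        if not is_devanagari(answer):
            problems.append("answer not in Hindi script")
        elif answer not in passage:
            problems.append("answer not found verbatim in passage")
        report.append((pair, problems))
    return report
```

Pairs with a non-empty problem list can then be dropped or regenerated.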

Do you think the responses can be improved further?
I am wondering whether I should move the entire project to a different LLM altogether. OpenAI, Gemini and many others are multilingual, but they mention that only around 1% of their dataset is in Indic languages. (There is no mention of the quantity used by Sarvam; I am assuming it is more than in the multilingual LLMs and hence chose it.) I may be wrong in thinking that the OpenHathi model understands Hindi better than other multilingual LLMs.
Another reason to choose OpenHathi is due to it being open source.

I experimented with Gemini chat (UI) in parallel. It generated better-quality outputs for the same prompts (of course, it comes at a price!). Would you suggest any other APIs that are cheaper than Gemini/OpenAI? I am trying to use open-source LLMs and free Kaggle/Colab GPUs to build a product in Hindi that reads PDFs/images of text and stores them in a vector DB to be used later for question answering. Please let me know your thoughts.

@mcdaqc

mcdaqc commented Apr 2, 2024

Hei @TBomer-fm,

🧱 Background

Generally, "generative question answering" is a topic that requires accuracy, because we need to get "factual" data back from our own private documents. This is unlike other use cases such as generating a poem or talking in some personality, as if the model were Sherlock or Walter White or whoever. In such use cases we can opt for fine-tuning, because what we want is not the information itself but the way it is generated.

🤔 Options

Now, in the use case of "question answering" we have a couple of options:

  1. Extractive
  2. Generative
    ...

Great explanation, thanks

@aasem

aasem commented Apr 28, 2024

Thank you, @AayushSameerShah for such an insightful discussion. I would like your comments on the following scenario.

Let's say we have a private corpus and we build a RAG system for open-book QA using an embedding model and an LLM. The corpus, however, is specialized. In my case, for instance, I am trying to build RAG around a corpus of really obscure philosophical works which no LLM, including GPT-4, has seen. The retrieval works fine, and I am getting answers from the given context after careful prompt crafting with context augmentation. I am using Gemma-7B with sentence transformers.

But to further improve my system, I am now thinking of making my LLM a memory machine. Since the corpus is fixed and I want a companion LLM to always facilitate a conversational semantic search, I am now developing a dataset with three columns, i.e., context (chunk), question and answer. With such a dataset, my fine-tuned LLM would ingest the whole corpus as well as Q/A pairs in varying styles. I understand that this seems like overfitting on purpose, but the aim is to have a companion LLM that can always reference the corpus easily.

I now have two questions:

  1. Do you think this approach of RAG + fine-tuned LLM would work better than simple RAG, which is classical ICL?
  2. If it is likely to work, what should the dataset format be? Should I use three columns like this, or two columns holding a prompt and a response? The prompt would include both the question and the chunk, such as:

Keeping the given context in background:

[whole chunk appended here]

What should be the....?

Thank you.

@AayushSameerShah
Author

AayushSameerShah commented May 26, 2024

Hey @aasem !
I am extremely sorry for responding after a ridiculously long time. It's been a busy month, and by now you may have figured out the solution (😓). Nevertheless, let me share what I think; it may benefit your use case.

In my case, for instance, I am trying to build RAG around a corpus of really obscure philosophical works which no LLM including GPT-4 has seen.

Now, I assume your corpus is still in English, and not in some less commonly seen language like Hindi, Arabic, etc.

With that in mind, RAG should work well, and it does, as your experiments with Gemma-7B show.

I am now thinking to make my LLM a memory machine.

Now, this part is the core of your question. Instead of using RAG all the time and providing context with every question, you want to "train the model on your text" so that the model no longer has to be given the context.

Whether you train from scratch or fine-tune, each route comes with a different set of challenges.

  1. Is your corpus entirely different? Industry jargon, a new language, special syntax?
  2. If not, then you just need to "tune the model" for your specific case.
  3. A companion LLM

Let's discuss each one-by-one.

1. Entirely different corpus, different set of tokens

This would be the ultimate solution, but it takes a lot of training, and you will also need to train a new tokenizer to get the job done. It is also worth mentioning that you need a really rich set of data for this.

So the overall process would look like this:

  1. Take a pre-trained LLM.
  2. Train it on your corpus (just to predict the next token); here we are only building the knowledge base.
  3. Train the model on a question-answer pair dataset.

This seems logical: the model now has an understanding of the specific corpus and knows how to answer, which should solve the problem. But, as said, it will cost more, as you will need 1) a rich corpus to train on and 2) a Q-A pair dataset covering many varieties.

This approach also comes at a cost: the model is likely to forget other things that it knows.

  1. The model is likely to lose its ability to perform RAG.
  2. The model is likely to stop following instructions (little chance for ICL).
  3. Other previously learned knowledge is likely to vanish.

To mitigate these problems, you will need to train on a mixture of:

  • Instructions
  • Q&A pair
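One way to build such a mixture is to sample general instruction examples into the domain Q&A set before training. This is a minimal sketch, not a recommendation: `mix_datasets` is a hypothetical helper, and the default share of 0.3 for general examples is an arbitrary assumption that would need tuning.

```python
import random


def mix_datasets(domain_qa, general_instructions, general_fraction=0.3, seed=0):
    """Combine domain Q&A examples with sampled general instruction examples.

    general_fraction is the target share of general examples in the result
    (must be in [0, 1)); 0.3 is only a placeholder value.
    """
    if not 0 <= general_fraction < 1:
        raise ValueError("general_fraction must be in [0, 1)")
    rng = random.Random(seed)  # seeded so the mixture is reproducible
    # Number of general examples needed to reach the target share.
    wanted = round(len(domain_qa) * general_fraction / (1 - general_fraction))
    wanted = min(wanted, len(general_instructions))
    mixed = list(domain_qa) + rng.sample(list(general_instructions), wanted)
    rng.shuffle(mixed)
    return mixed
```

All domain examples are kept; only the general examples are subsampled.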

But that might be worth doing if your case falls under this category.

2. Fine-tune with Q-A pairs

This should be cheap and easy to do. Here you provide only Q-A pairs, thousands of them, and the model should then be able to answer from the question alone. A word of caution: you will need a very large dataset for the model to learn the varieties.

But I doubt that it would work. Since the corpus is unseen by the model, it will not know the answer to any new question that is not covered by the dataset.

3. A companion model

Okay, what if we train a new model with context, question and an answer?

I want a companion LLM to always facilitate a conversational semantic search,

This can be done. Trained on the context together with your question and its answer, the model learns the "style" in which to respond and the "depth" at which it should respond.
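For such training data, the three columns (context, question, answer) can be folded into prompt/response records, one JSON object per line. A minimal sketch, assuming hypothetical field names and a prompt template borrowed from the question above; the exact template is a free choice as long as it is the same at training and inference time.

```python
import json

# Assumption: template wording taken from the example prompt in the question.
PROMPT_TEMPLATE = "Keeping the given context in background:\n\n{context}\n\n{question}"


def to_prompt_response(rows):
    """Turn rows with context/question/answer keys into prompt/response records."""
    return [
        {
            "prompt": PROMPT_TEMPLATE.format(
                context=row["context"].strip(),
                question=row["question"].strip(),
            ),
            "response": row["answer"].strip(),
        }
        for row in rows
    ]


def to_jsonl(records):
    """Serialise records as JSON Lines, keeping non-ASCII text readable."""
    return "\n".join(json.dumps(record, ensure_ascii=False) for record in records)
```

Keeping the three-column source table and deriving the two-column form from it makes it easy to change the template later.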

Doing this will still cost the tuned model some of its general knowledge, but we keep the benefit of our original model. We are now benefitting from two models:

  1. The original model; let's call it the RAG model.
  2. The tuned model.

So the flow would look like:

  1. Question
  2. Retrieval
  3. Context -> fed into model 2 (the tuned model)
  4. The tuned model responds with an answer
  5. Model 1 (the RAG model) now has:
    • The original question
    • The retrieved chunks
    • Model 2's answer (what model 2 thinks)
  6. Based on these, model 1 composes the final answer. In theory, this should work better (as long as you don't run into context-length limits).
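The flow above can be sketched as plain function composition. Everything here is an assumption for illustration: `retrieve`, `tuned_model` and `rag_model` are stand-in callables (a retriever returning chunks, and two text-in/text-out models), and the prompt wording is made up.

```python
from typing import Callable, List


def answer_with_companion(
    question: str,
    retrieve: Callable[[str], List[str]],
    tuned_model: Callable[[str], str],
    rag_model: Callable[[str], str],
) -> str:
    """Two-model flow: the tuned model drafts, the RAG model writes the final answer."""
    # Steps 1-2: take the question and retrieve the chunks.
    chunks = retrieve(question)
    context = "\n\n".join(chunks)
    # Steps 3-4: the tuned model answers from the retrieved context.
    draft = tuned_model(f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")
    # Steps 5-6: the original model sees question, chunks, and the draft.
    final_prompt = (
        f"Question: {question}\n\n"
        f"Retrieved chunks:\n{context}\n\n"
        f"Draft answer from the tuned model:\n{draft}\n\n"
        "Using the chunks as the source of truth, write the final answer:"
    )
    return rag_model(final_prompt)
```

Note that the final prompt holds the chunks plus the draft, so it is longer than a plain RAG prompt; this is where the context-length caveat applies.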

Bottom line: if you are satisfied enough with the current RAG-only responses, then I think you should stick with that, as it uses only a single model and is fast.

If you want to get responses in "some specific style" then you can always populate your prompt with some examples; guiding the model how to respond. This is cheap and it will work if you are satisfied enough.
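Populating the prompt with examples can be as simple as the sketch below. `build_few_shot_prompt` is a hypothetical helper and the "Context / Question / Answer" labels are an assumption; any consistent layout should do.

```python
def build_few_shot_prompt(examples, context, question):
    """Prepend worked examples so the model imitates their answer style.

    examples: list of dicts with "context", "question" and "answer" keys.
    """
    parts = [
        f"Context:\n{ex['context']}\nQuestion: {ex['question']}\nAnswer: {ex['answer']}"
        for ex in examples
    ]
    # The real query comes last and is left open for the model to complete.
    parts.append(f"Context:\n{context}\nQuestion: {question}\nAnswer:")
    return "\n\n".join(parts)
```

Two or three examples written in the desired style are usually a reasonable starting point, since each one consumes context length.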


To be honest, the account above is based on my personal experience. To back my arguments I should have cited some research, but at the moment I have not referred to any, so if something relevant comes up in the future, I will surely share it with you.

Hope this helps.
Cheers 🥂
