How should I prepare the dataset for generative question answering on the private documents? #38
Comments
I also wonder how to structure a dataset properly…. Using raw text seems to work very well though… |
@Gitterman69 indeed! But I also wonder whether it is okay to fine-tune with LoRA to "remember" the facts!? Because LoRA adds <1% of the total trainable parameters, and I don't think we can expect it to remember the facts of the private docs! Let me know your suggestions mate! |
I have fine-tuned llama using this repo and a few text documents I had with me. I thought I was overfitting and hence increased the input data size and decreased the number of epochs, but the model was neither completing the sentences when given input as above nor answering the questions. @Gitterman69 curious to know how you got it to "work very well"... |
It really depends on what you want to do, but this is my workflow. Workflow for text completion:
workflow for q&a bot:
Basically it's trial and error - just make sure you train in the "formatting" you want the output to be in! Edit: a nice way to train Q&A would be the following: QUESTION: How many eggs are in the box from company Eggman Corp.? And then when you ask your bot the question above, it will give a similar answer, also in the formatting and question:/answer: style... etc |
I ideally want it to gain knowledge from my documents and be able to answer if I ask something from them while I run inference. I don't really have any QnA pairs for the data. I also tried using vector embedding search with a model on top of it to put things together, but this way it lacks information that spans a few sentences. Also, it can't answer anything other than What/Where etc. kinds of questions if the answer is expected to span multiple sentences, and it's even worse when it has to infer something by combining this information with general knowledge. So that seems to be a not-so-fruitful approach. |
@Datta0 I was facing the same. When given the "raw" text as the training data for the model, it hallucinates, because it also has a lot of knowledge from its pre-training and, to answer your question, it will get information from anywhere or it will make it up. And making the QA pairs as Gitterman69 pointed out requires you to create the QA manually, which takes a hell of a lot of time (unless you have static data). So, I have changed the approach. Now, I am actively focusing on the In-context-learning (ICL) approach to question answering. Because the QA task is where you need the "actual facts", unlike other tasks where the facts are optional and which just require a completion style, such as generating a quote or a story or getting a response in some personality like Walter White or Sherlock! For that reason, LoRA or any fine-tuning method isn't a good approach for question answering, where you need concrete facts to be stored. So the solution is to give the context in the prompt and make the model answer only from the prompt. This way there are very low chances of hallucination, and this way is very standard! I am actively focusing on LangChain and LlamaIndex now. (See! Just yesterday LangChain incorporated the chatbot - Mendable - which answers the questions from their docs! And they haven't fine-tuned it! They provide the context and then the chatbot replies from the context!). 🤗 |
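As a minimal sketch of the ICL idea described above - put the context in the prompt and make the model answer only from it. It assumes an OpenAI key is available (any LangChain-compatible LLM would work); the document snippet and question are made-up placeholders:

```python
# Minimal ICL sketch: the model is instructed to answer only from the supplied context.
from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate

llm = OpenAI(temperature=0)  # assumes OPENAI_API_KEY is set; any LLM wrapper works here

icl_prompt = PromptTemplate(
    input_variables=["context", "question"],
    template=(
        "Answer the question using ONLY the context below. "
        "If the answer is not in the context, say you don't know.\n\n"
        "Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    ),
)

# Placeholder private-document snippet and question
context = "Eggman Corp. ships its eggs in boxes of 12."
question = "How many eggs are in a box from Eggman Corp.?"

print(llm(icl_prompt.format(context=context, question=question)))
```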
Sounds super interesting! Where could I find more info about your approach, langchain as well as llama indexes for training purposes?! Any hint would be highly appreciated!!! :)
@AayushSameerShah Thanks for the explanation. I already tried LangChain, but I don't want to / can't use text-davinci-003 or any OpenAI model due to some constraints. I want to use models that are available on HuggingFace, ideally. When I use some model with CustomLLM, like flan-t5-large, it produces decent output sometimes. But when I try to run it as an agent with chat memory, it throws an error saying I tried to use LLaMA or Alpaca with the same pipeline; it quickly runs out of memory on my 40GB GPU. So I'm kinda stuck here with regards to LangChain / LlamaIndex. If you got it to work, can you please elaborate? Would be really helpful. |
Really interesting topic, as I'm into this lately.
What if it happens that I have a lot of docs to pass through? Talking about around 30GB of HTML text in particular, could it be viable to feed the raw text as training data? Of course there aren't any ground truth labels. The idea is to feed it, have it understand it, and be able to answer domain-specific questions. What is the best approach to this?
Can you elaborate how you focus on ICL ? Are there any frameworks that have that available?
Thanks |
@AayushSameerShah |
Hi @AayushSameerShah, thanks for kicking off this discussion. Are you able to elaborate more on the ICL approach? And as for LangChain, it seems like a good option, but to use Google's PaLM or Anthropic's Claude you need to join the waitlists, and to use an OpenAI model you need to pay for the API. Do you know if LangChain offers models that are available/free? Thank you, Tom |
Hei @TBomer-fm,

🧱 Background
Generally, "generative question answering" is a topic which requires accuracy, because we need to get the "factual" data back from our own private documents. This is unlike other use cases like generating a poem or talking in some personality, as if the model were Sherlock or Walter White or whatever. In such use cases we can opt for fine-tuning, because we don't want to get the information back but rather the way it is generated.

🤔 Options
Now, in the use case of "question answering" we have a couple of options:
1️⃣ Extractive
Right off the bat, the first option is thrown away, because there we simply can't ask complex questions across multiple sources, just because of the way the model returns the answers. It simply gives the indices back from the paragraph you provide, and the answer is very small. Ex:
That is the context and while asking the question in that extractive manner, we need to provide the model that context to get the answer from. And so the question:
The model will simply pick the indices.

2️⃣ Generative
This is where things get crazy, and where I was stuck for a long time. Here we get amazing and **human-like** responses from the model. Without talking too much about this, we have 2 options here:
Let me cover the second approach first because that made me crazy around that time.
|
Hei @Gitterman69, @Datta0, Thanks. |
And, @TBomer-fm, Langchain DOESN'T provide any free LLM or anything like that. It doesn't even provide any paid LLMs. It is not the provider but a connector of LLMs with different kinds of prompts and agents. To use any free LLM, you will need to use a HuggingFace LLM in Langchain. There are guides available. Please let me know if you need further direction. 🤗 |
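A hedged sketch of the "HuggingFace LLM in Langchain" route mentioned above; the model name is only an example, not a recommendation from the thread:

```python
# Wire a free HuggingFace model into LangChain instead of a paid OpenAI model.
from langchain.llms import HuggingFacePipeline
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

llm = HuggingFacePipeline.from_model_id(
    model_id="google/flan-t5-large",   # example model; pick one that fits your GPU
    task="text2text-generation",
)

prompt = PromptTemplate.from_template(
    "Answer the question using only this context:\n{context}\n\nQuestion: {question}"
)
chain = LLMChain(llm=llm, prompt=prompt)
print(chain.run(context="Eggman Corp. ships eggs in boxes of 12.",
                question="How many eggs per box?"))
```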
Hei @FatimaHabib, Great! You are working on a similar project. Unfortunately, fine-tuning isn't a scalable way to do question answering. I have tried to include the answer in my comment above; hope I was able to explain it there. Let me know if any further clarification is needed. |
Hello @GioPetro ! To pass a hell of a lot of data, like in your case, you need to store it first in some vector store like FAISS, Chroma, Pinecone etc. Then use Langchain / LlamaIndex to retrieve just the documents which are required for that particular question and pass them in the prompt. Hence the ICL. 🤗 |
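A minimal sketch of that flow (store first, retrieve per question), assuming a local FAISS store and a sentence-transformers embedding model; the file path and query are placeholders:

```python
# Store private docs in a vector store, then pull only the relevant chunks per question.
from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS

docs = TextLoader("my_private_doc.txt").load()                     # placeholder path
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.split_documents(docs)

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vectorstore = FAISS.from_documents(chunks, embeddings)

# At question time: retrieve, then stuff the chunks into the prompt (the ICL step).
relevant = vectorstore.similarity_search("What does clause 4.2 say?", k=3)
context = "\n\n".join(doc.page_content for doc in relevant)
```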
Wow @AayushSameerShah, thank you very much for your super comprehensive and helpful response! |
Big thanks @AayushSameerShah for the excellent explanation! I really appreciate it (: . I will try out the solution you have suggested. |
Thanks for the superb explanation and answer!!!!! |
Thanks @AayushSameerShah for your explanation. But I also want to train the model using raw text, instead of using vector stores and searching. I am not able to train the model on raw text. @Gitterman69 @Datta0, you were able to do this task; can you please guide me in the right direction? |
Thanks, @AayushSameerShah, for this excellent explanation! |
Hie @dhairyakataria! 👋 I see you are willing to use the fine-tuning approach for your question-answering task on your private documents. Now, I am not sure which method you've used for fine-tuning, but I am quite sure you are not training the model as a whole; you must be using some kind of PEFT.

Now, as explained in this GitHub thread, such "fact extraction" use cases should not be tackled with fine-tuning, because it basically won't guarantee that the information you ask for will be retrieved, and often the model will mix your trained information with its pre-trained information. But if you are not willing to use the second approach (providing the documents in the prompt) and want to fine-tune anyway, then you will have to train the model from scratch. Which is truly expensive, mate.

This directly refers to the model's ability to give correct responses. Take the example of the Wikipedia dataset. In a hypothetical example, let's say 3 models are trained on the same Wikipedia dataset. These 3 models have different architectures and sizes. Now, a model's ability to "remember" things depends heavily on the training & architecture, and we often see lines like "GPT-X model gives 89% accuracy on the science-paper test" or something like that, but here it would be related to Wikipedia.

Now, practically, your case involves private documents to be fed (which often change frequently), and that makes this task impractical for fine-tuning. Such tasks should only be solved using ICL, which is the standard way to go forward. Still, this is an active area of research and many techniques are being researched every day, but as far as I know, this is the way to go. And yes, here is one thread on OpenAI which supports what we are talking about: Let me know if it helps 🤗 |
Hie @EL-MEHDI-git, 🙋🏻♂️ It is a bit unclear what you mean by: "answers that you are getting are not as expected". Does that mean they are "wrong answers", or that the "completion is not proper", meaning that the model answers but ends the answer in the middle without even completing the sentence?

1️⃣ Wrong answers
If the answers are wrong, this simply means the context isn't passed properly or not in enough volume. To fix that:
2️⃣ Answers are incomplete
This is a huge problem when working with open-source models. Even if they are instruction-tuned, they often do not complete the sentence properly. The easy fix is: use the prompt on which they were trained. This means, using the generic prompt: Won't work ❌
And there are higher chances the model will not give the answer properly, because the model was trained on a structure like this (for example, the OpenAssistant 12B model):
Now, with such models, using the generic prompt won't help; there we will need to change the prompt in langchain: May work well ✔
So:
Fix-1: Change the prompt. The generic one won't help. (A small sketch follows below.)
Fix-2: Use a better/bigger model, like GPT-3; there you should not have any problem with langchain's default prompts, but changing the prompt will still help you there.
Fix-3: Change the generation parameters. Play around with
I have not elaborated on each fix too much, because these diagnostics were based on my assumption of your problem; you may be talking about entirely different things. Please let me know if any of these helps. |
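A hedged sketch of Fix-1 in LangChain, reusing an `llm` and `vectorstore` built as in the earlier sketches; the special tokens below only illustrate an OpenAssistant-style format and should be replaced with whatever your model was actually trained on:

```python
# Override LangChain's default QA prompt with the structure the model expects.
from langchain.prompts import PromptTemplate
from langchain.chains import RetrievalQA

custom_prompt = PromptTemplate(
    input_variables=["context", "question"],
    template=(
        "<|prompter|>Use the context to answer the question.\n"
        "Context: {context}\nQuestion: {question}<|endoftext|><|assistant|>"
    ),
)

qa = RetrievalQA.from_chain_type(
    llm=llm,                                   # your open-source LLM (assumed built earlier)
    chain_type="stuff",
    retriever=vectorstore.as_retriever(),      # vector store from the earlier sketch
    chain_type_kwargs={"prompt": custom_prompt},
)
print(qa.run("How many eggs are in a box from Eggman Corp.?"))
```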
@AayushSameerShah Thanks for your two detailed responses. I am facing issues with |
Hie @AjinkyaBankar, If that's correct, then I think we are limited here, because what we fetch relies solely on the question, so it is advisable to ask the user to "provide more context" while asking. I believe the langchain retriever provides some kind of similarity score. So, if the fetched chunks all have scores worse than your chosen threshold, you can fall back to asking the user to rephrase (a hedged sketch follows below).
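A hedged sketch of that threshold idea, reusing a `vectorstore` like the FAISS one sketched earlier; note that score semantics depend on the store (FAISS returns a distance, so lower means closer), and the cutoff value is arbitrary:

```python
# If every retrieved chunk scores worse than a cutoff, ask for a more specific question.
THRESHOLD = 0.8   # arbitrary example value; tune per embedding model and store

query = "Tell me about it"   # a vague user question
docs_and_scores = vectorstore.similarity_search_with_score(query, k=3)

if all(score > THRESHOLD for _, score in docs_and_scores):   # all chunks are "far away"
    print("Could you please provide more context in your question?")
else:
    context = "\n\n".join(doc.page_content for doc, _ in docs_and_scores)
```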
But there are workarounds.
1️⃣ Change the vector stores
By that I mean, the "way we search" for similar chunks differs from method to method, and many vector stores support "cosine" similarity while others don't. For that you need to go through the documentation.

Why am I suggesting this?
Because mostly we are dealing with 2 types of similarity searches.
Both are commonly available in most vector stores, i.e. Pinecone, Chroma, FAISS etc., and changing either will impact which chunks are returned. So, you should try changing the "method" used to retrieve the articles.

Try MMR (Maximum Marginal Relevance Retrieval)
This method is likely to give you a diverse range of chunks, in contrast to the "most similar" chunks found with the "cosine" or "dot" methods (a short sketch appears at the end of this comment). Find them here in the langchain docs.

2️⃣ Make the chunks smaller
I think you've already done this, but still, just in case. Small chunks tend to retain "specific" information. Suppose you are chunking your document of
On the other hand, if you have chunks of say

There is a problem...
In this 2️⃣ point, we need to take care of the tradeoff of how small the chunk should be. Because smaller is better, but then it won't cover the whole context! I am diverging into a discussion which could be covered in a lot more detail, but it may still be helpful. In that case, we can use
Access that here in the langchain doc.

3️⃣ Lost in the middle!
Research has found that the model can fetch information more accurately from the context if the relevant information is found either at the start or at the end of the provided context. That means... if I ask Where does Walter White live? and I fetch the
Now, assume these paragraphs are long, and not just a single sentence. In this case the model will have a hard time finding "where does Mr. White live", because the answer is hidden in the middle.

The solution?
Put the context at the beginning or at the end. Please find that out here: langchain docs. Like:
And there is a dedicated method which will do this automatically.

4️⃣ Ask the same query in different terms!
I think this is one of the most useful ways to address your issue: generate more versions of the same question and query them individually - because at least one of them is more likely to have the answer. Users ask short questions without providing enough context. Here the
That is the one that you should be looking for if you don't want to dig much.

5️⃣ Use a better embedding model!
Hopefully you are already using OpenAI's model, which is the SOTA as said in my previous response. So try using that. And make sure that while splitting the document into chunks, you provide some overlap between the chunks so they keep some connection.

6️⃣ Perform a "re-rank" of the documents
So, you have received the top 3 chunks. But there might be some irrelevant chunks which are close enough to the query but may not involve the answer you are looking for! Take an example: now, there are, say, many articles which are related to the query, and say it has retrieved these:
Now, these are the closest according to the model, but they may involve wrong answers or chunks that may not include the answer at all! There comes the re-rank model. Usage in Langchain is easy, where you have to use the compressor, which again is another method to get quality results.

🥜 In a nutshell...
So, in our example, it will give:
And based on that, again we will select the top
The langchain doc for compression (paraphrasing): doc here
There are a lot of other ways which you can use to get better chunks, but I think whatever I have suggested will solve your issue, AFAIK. Please let me know if anything is still fuzzy. Because at the end of the day, it all depends on what and how you ask. |
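A hedged sketch combining two of the tips above - MMR retrieval (1️⃣) and re-ranking with a compressor (6️⃣) - again reusing a `vectorstore` as before; the Cohere reranker is only one possible compressor and requires a COHERE_API_KEY:

```python
# Over-fetch diverse chunks with MMR, then rerank and keep only the best few.
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import CohereRerank

mmr_retriever = vectorstore.as_retriever(
    search_type="mmr",
    search_kwargs={"k": 20, "fetch_k": 50},   # fetch many candidates, keep 20 diverse ones
)

compressor = CohereRerank(top_n=3)            # rerank and keep the 3 most relevant chunks
rerank_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=mmr_retriever,
)

best_chunks = rerank_retriever.get_relevant_documents("Where does Walter White live?")
```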
Thanks @AayushSameerShah for taking time to explain in detail. Appreciate that! |
Hello @AayushSameerShah, I decided to write this comment to ask you about the two libraries you mentioned: LangChain and LlamaIndex. I know that LlamaIndex is based on the first one but I can't understand which library I have to choose. What do you think? Thank you for your time! Best regards, |
Thank you, @AayushSameerShah! Your explanations are consistently insightful. I'd like to get your perspective on a strategy I've been using. I've been breaking down the text data into smaller segments, essentially paragraph-sized chunks. Given that in my scenario each question's answer tends to reside within a specific paragraph, this approach seemed fitting. Therefore, when splitting the text using RecursiveCharacterTextSplitter, we can select the maximum number of tokens without overlapping. I have tried this, and I got better answers. |
Hie @terilias 👋🏻 LlamaIndex and LangChain are both gaining popularity because of their areas of focus, but their common purpose is to augment LLMs with data. I may differentiate them in the following way:

🦙 LlamaIndex: A vertical framework which is like an expert in ingestion, manipulation, cleaning, vectorizing, retrieving and so much more. It is like the backend for your application. As they say themselves in their docs:
⛓ LangChain: It is the general framework (horizontal) for your application. It is like the "front" with which the user will interact.
Here, I would like to note that in day-to-day... common applications we could just use LangChain and it will work just fine! It has more than enough of what we need. It is like a complete toolbox. It provides:
Overall:

LlamaIndex:
LangChain:
As an end note: I would say you wouldn't need LlamaIndex if your task is not too complex. It has connectors for graphs and other things which we may not want to use, and most applications can be made easily with LangChain. But recently a lot of development has been done in LlamaIndex, and it may make the line between these two libraries thinner 🤷🏻♂️ |
Hello @FatimaHabib !! Now, it really depends on what kind of application you are making - because there cannot be a definitive answer to "how large should the chunk be" or "is splitting by paragraph a better way to go than other ways". As you might have seen, there are different types of text splitters available there. And even the
Just as a little quick example: say you are building an application that connects the LLM to a document store which can return chunks of "python code" and give the explanation of that particular code.
Well, the model's response was sarcastic 😳 but suppose this is the kind of chain we want to develop (which is dumb BTW, no one stores code in a vectorstore to get explanations!). In this case you wouldn't use the default
And there can be other, more specific splitters as well to preserve the structure in the chunks! So, long story short, it looks fine if you are getting good results, as your use-case looks like QA with documents, but if you have some "special" case like the above, then paragraph splitting won't work. And as advice, I would like you to add 2 more things to your split:
👉 A small example if you don't mind?
So, you split the text and get the following:
Longer sentences are truncated, but as you can see, the whole thing discusses the Interstellar movie; yet if we look at some sentence as an individual
So, in such a case, if we "append the prefix" to each sentence, like:
Will keep the context! So when you ask something related to Interstellar, it will fetch such chunks better and will produce quality results! I am not sure if there is a standard way to do this in langchain... but I work with the following code to do so:

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.docstore.document import Document

text_splitter = RecursiveCharacterTextSplitter()

# Static string to prepend to every chunk
prefix = "Interstellar (2014): "

def prepend_prefix(texts):
    # Rebuild each Document with the prefix added to its content, keeping metadata
    return [
        Document(page_content=prefix + text.page_content, metadata=text.metadata)
        for text in texts
    ]

texts = text_splitter.create_documents([long_document])  # long_document: the raw text
texts_with_prefix = prepend_prefix(texts)
```

Let me know if it helps 👍🏻 |
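On the "more specific splitters" point above: a hedged sketch of LangChain's language-aware splitter for the earlier "chunks of python code" example; the file path and chunk size are placeholders:

```python
# Split source code along Python syntax boundaries instead of the default separators,
# so chunks keep whole functions/classes together where possible.
from langchain.text_splitter import RecursiveCharacterTextSplitter, Language

python_source = open("my_module.py").read()   # placeholder path to your code

code_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,
    chunk_size=500,        # arbitrary example size
    chunk_overlap=0,
)
code_chunks = code_splitter.create_documents([python_source])
```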
@AayushSameerShah what if we increase the number of trainable parameters by using a larger LoRA matrix (rank)? Will the LLM be able to store new information? |
Thanks @AayushSameerShah, it helps a lot. I forgot about using the separators parameter. Do you have any tip to solve this issue? |
Thank you @AayushSameerShah for your answer to my question! Have a good day! |
@FatimaHabib, oh PDFs? Okay... Tables? Okay... Tables in PDFs? Not okay 😨 Actually I was dealing with a similar problem some months ago, where I needed to use the content stored in PDFs and fuse it into the LLM to get the "accurate" result. While LangChain provides a great list of connectors for every type of data including PDFs, it lacks in parsing tables stored in PDFs. If you have simple textual data stored in a PDF it will work, and if you have tables stored in CSV or Excel it will also work; but with the way PDF stores tables, at retrieval it gets pretty challenging to pull the data back. Because...

☝ Tables can be stored as images
I believe many connectors/loaders in LangChain have some "fast" or "quality" kind of flag, depending on which it will internally use some kind of OCR to extract text from the image, whereas "fast" won't. Extracting text data (especially numbers) from an image can lead to highly inaccurate results. Because of the quality of the image (or the ability of the model to interpret the numbers), the numbers can easily be misread (reading

✌️ Tables are in text but...
... the structure is messed up. Since the PDF loader reads everything as text, left to right, and feeds the data into the LLM, the model won't be able to comprehend which value relates to which column - but here, at least you can read the right numbers (

🤔 Any solution?
During my research I tried many libraries and models to help me get through this, and it is a known problem. Actually, there are many online businesses which let the user upload PDFs and, you know, extract the data preserving the structure and all... but for now I think you should check out: I think LangChain uses it internally as a part of its connectors, but try to have a look at it as a separate service, use it there and see what happens. Actually there were a bunch of good libraries that I came across, but I need to dig down a little bit in my Discord, because the list is lost in the history somewhere; I may update this comment when I find it.

For now...
But for now, in the open-source way, we have options, but we need to take appropriate decisions on when to use which, because we don't know the table structure! The table structure may change in the PDF, and if it doesn't (which is highly unlikely) then you can use simple RegEx to extract the data and go forward with it! Good luck! |
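As one hedged, illustrative option for the text-based-table case (not a recommendation taken from the thread): pdfplumber can pull text-based tables out of a PDF while keeping rows and columns together, although it won't help with image-only tables, which still need OCR.

```python
# Extract text-based tables from a PDF, preserving row/column structure.
import pdfplumber

with pdfplumber.open("report.pdf") as pdf:      # placeholder file name
    for page in pdf.pages:
        for table in page.extract_tables():     # each table is a list of rows
            for row in table:
                print(row)                      # e.g. ['Year', 'Revenue', ...]
```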
Hi @AayushSameerShah ! Can you please tell me what pretrained models on HuggingFace I can use with the following input/output architecture:
Hello @kaoutaar 👋🏻 Your question doesn't specify your requirements, available resources etc., so I can't guide you exactly, but there are many communities that have fine-tuned the base Llama-2 model and have achieved great scores on the leaderboard; then again, Meta's vanilla Llama-2 would suffice for your generative QA need with satisfactory results. 🥂 |
@AayushSameerShah thank you for responding to me. I've actually sent a request to Meta to give me access, still waiting though. Also, I am really curious what you would have suggested 2 months ago, because I am trying to find all the ready-to-use models that have this input/output architecture. |
I know it's been a couple of months and there have been a lot of new developments, but is there anything new that could help us fine-tune an LLM on our private documents to give solid answers? |
Hello everyone, I am in the same situation as @thomasjv799 ! Should I use fine-tuning then? If yes, how can I allow my model to assimilate all these data? |
@thomasjv799 Of course! So if your dataset is supposed to be made of a dump of Taylor Swift's background, news, and other articles, then we won't be able to train the model from that dataset and expect it to give "solid answers". (btw are you a swiftie? If yes, we will have a great conversation 💜). For that, you will need to create a whole new dataset with "question and answer pairs" (a small sketch of the dataset format appears after this comment). To do that:
After doing that you can proceed further with training the model of your choice, and then finally the model will hopefully answer what TS means by "Vigilante Shit" in her song ✌🏻 PS: I am attaching a link to a short, free course on deeplearning.ai which will help you fine-tune the model for Q&A. Best, |
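A hedged sketch of what such a question-answer dataset might look like on disk (JSONL is just one common choice; the records are invented placeholders you would replace with pairs generated and reviewed from your article dump):

```python
# Write question/answer pairs as JSONL for a fine-tuning pipeline.
import json

qa_pairs = [
    {"question": "Which album contains 'Vigilante Shit'?", "answer": "..."},   # placeholder
    {"question": "When was the album released?", "answer": "..."},             # placeholder
]

with open("qa_dataset.jsonl", "w", encoding="utf-8") as f:
    for pair in qa_pairs:
        f.write(json.dumps(pair, ensure_ascii=False) + "\n")
```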
@MatteoRiva95 Dude, your issue might be similar to the one I have tried to address above. So say you are working on a project that has "software manual" documents, okay, and then you ask questions like: "How to create X and apply Y filter?" In this case, there are no direct references in your manual to answer your "HOWs", simply because the retriever doesn't have enough information on what to return. And whatever it returns, the model tries its best to answer your questions, and in turn you get hallucination. Here, you can use some modules from langchain that create similar questions that may answer your original question. In this case, the langchain module may create alternative questions like:
And for each of these questions we will retrieve the chunks, and a union of them is more likely to answer (and not hallucinate on) your question! There is a lot of new material that has been posted in the langchain docs and even I haven't had an opportunity to skim through it all. But I would encourage you to check it out, especially here: https://python.langchain.com/docs/modules/data_connection/retrievers/ Good luck! |
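One LangChain module that fits this description (named here as an assumption, not taken from the thread) is the MultiQueryRetriever, which asks an LLM to rewrite the question into several variants, retrieves for each, and returns the union of chunks. A minimal sketch, reusing `llm` and `vectorstore` objects as in the earlier sketches:

```python
# Generate alternative phrasings of the user question and retrieve for each of them.
from langchain.retrievers.multi_query import MultiQueryRetriever

multi_retriever = MultiQueryRetriever.from_llm(
    retriever=vectorstore.as_retriever(),
    llm=llm,   # the LLM that rewrites the question into variants
)
chunks = multi_retriever.get_relevant_documents("How do I create X and apply the Y filter?")
```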
@AayushSameerShah Thank you so much for your kind reply. Really appreciated! :) Do you think that changing the retriever could help me, even with this enormous amount of data? Because I am afraid that 100k PDFs is too much for RAG :( It takes many minutes to reply to one question, even on a decent (still not super) GPU and CPU setup! I am trying to develop a chatbot with an acceptable time to answer. For this reason, I was thinking of coming back to fine-tuning (even though last time it gave me a lot of hallucination... but maybe I did something wrong... I am still a beginner after all). But I am not so sure... What do you suggest? Thank you so much again for your support! |
There are always 2 parts.
Generally, retrieval is fast. But as you are saying, there are 100k PDFs, which I agree is a lot. In that case, you can look up different data stores that provide faster search (even some paid ones may help, like Pinecone or Weaviate). Secondly, the generation: there is something loosely called "startup time". This means that even before the model starts generating the first token, it takes some time to process the prompt (the typical forward pass). This takes time, and it differs based on what architecture, framework, and resource setting you're using. So,
For a concrete example:
In this phase, you will know whether too much time is being spent in retrieval or somewhere else (a small timing sketch follows below). In this stage, you may compare a couple of vector stores. Then, move to the model. Alternatively, you can also look at the ONNX format, which would be the ultimate solution but would require some more digging; I would recommend looking at ONNX last, when nothing else works. Now you might have a better picture of what is taking time. Generally, the culprit is the model. Depending on how big the context is, it takes time. So, you may want to check other model alternatives/frameworks or upgrade resources.
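A simple hedged sketch of that kind of profiling, assuming a `retriever` and a `qa_chain` already exist (as in the earlier sketches); timing the full chain includes a second retrieval, so treat the numbers as rough:

```python
# Measure retrieval and end-to-end answering separately to find the bottleneck.
import time

question = "What does clause 4.2 say?"   # placeholder question

t0 = time.perf_counter()
docs = retriever.get_relevant_documents(question)   # retrieval only
t1 = time.perf_counter()
answer = qa_chain.run(question)                     # retrieval + prompt processing + generation
t2 = time.perf_counter()

print(f"retrieval: {t1 - t0:.2f}s, full chain: {t2 - t1:.2f}s")
```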
Quality-wise or time-wise? I would say it would make more of a difference quality-wise. But I might be wrong here too, so I can't say for sure. At the EOD, there are so many moving pieces (I didn't even suggest you change the chunk size and all) that you need to see where it takes more time. I have given scattered information, I know... let me know if you need any clarification. Best, |
@AayushSameerShah Yes, there is a lot of info, variables, ways and options to try! But really, thank you so much for your kind help and time :) Yes, exactly, I am using LangChain and in particular I am following these tutorials for RAG: https://levelup.gitconnected.com/building-a-private-ai-chatbot-2c071f6715ad They are super useful, but with a small amount of data, because with 60k PDFs (the last test I did)... well, there the troubles begin :( Moreover, the deployment of RAG with 100k PDFs becomes very, very complex! What about fine-tuning then? Maybe turning all the PDFs into a huge CSV file containing question and answer columns could be optimal for this particular process and could be the solution to avoid hallucination and to have an LLM ready to reply to my questions. Apologies again: I am a beginner, I am trying to find a solution to this huge issue and maybe my replies are wrong and not detailed enough :( Thank you again so much! |
@AayushSameerShah I have been working on a project to be able to generate Hindi Questions and their answers from given context. I use LangChain, HuggingFace - OpenHathi base model for generation of questions given a context. The model generates questions in both Hindi and English, though I would like it to generate only Hindi questions. Also, it does not provide the answers for the questions it generates. I have used the below prompt for question and answer generation - प्रणाली :नीचे दी गई संदर्भ के आधार पर दस हिंदी प्रश्न तैयार करें। प्रश्न दोहराएँ नहीं। यह कथा अवध की है। और बहुत पुरानी है। अवध में सरयू नदी के किनारे एक अति सुंदर नगर था। अयोध्या। सही अर्थों में दर्शनीय। देखने लायक। भव्यता जैसे उसका दूसरा नाम हो! अयोध्या में केवल राजमहल भव्य नहीं था। उसकी एक-एक इमारत आलीशान थी। आम लोगों के घर भव्य थे। सड़कें चौड़ी थीं। सुंदर बाग-बगीचे। पानी से लबालब भरे सरोवर। खेतों में लहराती हरियाली। हवा में हिलती फ़सलें सरयू की लहरों के साथ खेलती थीं। अयोध्या हर तरह से संपन्न नगरी थी। संपन्नता कोने-अंतरे तक बिखरी हुई। सभी सुखी। सब समृद्ध। दु:ख और विपन्नता को अयोध्या का पता नहीं मालूम था। या उन्हें नगर की सीमा में प्रवेश की अनुमति नहीं थी। पूरा नगर विलक्षण था। अदूभुत और मनोरम। उसे ऐसा होना ही था। वह कोसल राज्य की राजधानी था। राजा दशरथ वहीं रहते थे। उनके राज में दुःख का भला क्या काम? राजा दशरथ कुशल योद्धा और न्यायप्रिय शासक थे। महाराज अज के पुत्र। महाराज रघु के वंशज। रघुकुल के उत्तराधिकारी। रघुकुल की रीति-नीति का प्रभाव हर जगह दिखाई देता था। सुख-समृद्धि से लेकर बात-व्यवहार तक। लोग मर्यादाओं का पालन करते थे। सदाचारी थे। पवित्रता और शांति हर जगह थी। नगर में भी। लोगों के मन में भी। राजा दशरथ यशस्वी थे। उन्हें किसी चीज़ की कमी नहीं थी। राज-सुख था। कमी होने का प्रश्न ही नहीं था। लेकिन उन्हें एक दुःख था। छोटा सा दुःख। मन के एक कोने में छिपा हुआ। वह रह-रहकर उभर आता था। उन्हें सालता रहता था। उनके कोई संतान नहीं थी। आयु लगातार बढ़ती जा रही थी। ली सुनियाँ थीं- कौशुल्या, सुमित्रा और रानियों के मन में भी बस यही एक दुःख था। संतान की कमी। जीवन सूना-सूना लगता था। राजा दशरथ से रानियों की बातचीत प्राय: इसी विषय पर आकर रुक जाती थी। राजा दशरथ की चिंता बढ़ती जा रही थी। बहुत सोच-विचारकर महाराज दशरथ ने इस संबंध में वशिष्ठ मुनि से चर्चा की। उन्हें पूरी बात विस्तार से बताई। रघुकुल के अगले उत्तराधिकारी के बारे में अपनी चिंता बताई। मुनि वशिष्ठ राजा दशरथ की चिंता समझते थे। उन्होंने दशरथ को यज्ञ करने की सलाह दी। पुत्रेष्टि यज्ञ। महर्षि ने कहा, “आप पुत्रेष्टि यज्ञ करें, महाराज! आपकी इच्छा अवश्य पूरी होगी।” में हुआ। पूरा नगर उसकी तैयारी में लगा. हुआ था। यज्ञशाला सरयू नदी के किनारे बनाई गई। यज्ञ में अनेक राजाओं को निमंत्रित किया गया। तमाम ऋषि-मुनि पधारे। शंखध्वनि और मंत्रोच्चार के बीच एक-एक कर सबने आहुति डाली। अंतिम आहुति राजा दशरथ की थी। यज्ञ पूरा हुआ। अग्नि के देवता ने महाराज दशरथ को आशीर्वाद दिया। कुछ समय बाद दशरथ की इच्छा पूरी हुई। तीनों रानियाँ पुत्रवती हुई। महारानी कौशल्या ने राम को जन्म दिया। चैत्र माह की नवमी के दिन। रानी सुमित्रा के दो पुत्र हुए। लक्ष्मण और शत्रुघ्त। रानी कैकेयी के पुत्र का नाम भरत रखा गया। प्रश्न :And below are the responses it generates -
Issues I face -
Please advise on how to proceed - if you think I should create a dataset in the format below and fine tune the base model to respond in the given format. I have worked on finetuning for classification and QA (SQUAD) tasks on BERT for English text but have not finetuned Llama2 models before using PEFT techniques. Can you please guide me here? |
Hie @pallavi-allada But, I would suspect that since the model itself (Llama-2) is pre-trained on an English corpus and Sarvam AI has tuned this model for Indic languages, the 7B may not get a perfect grip on Hindi instructions and all. But again, I assume that they have fine-tuned the model and not pre-trained it; I have not read their paper thoroughly. Nevertheless, it is not that important. What is important is giving structured instructions. As the model gets larger it picks up the structure of where the instruction ends and where the example ends. For smaller models like this, we need to be a bit conscious (especially when it is trained with multiple languages). In your case, I can see that the instructions you gave are in Hindi, like: "नीचे दी गई संदर्भ के आधार पर दस हिंदी प्रश्न तैयार करें। प्रश्न दोहराएँ नहीं।" and also the output structure is not given. (I don't know if you have already tried these, but let me write them just in case.) Make sure your prompt has these things in place:
Let's see an example: (I am not putting the prompt signatures since I don't know what it does, but you should)

### Instructions:
- Pretend as if you are an experienced professor at a Hindi university. Your task is to generate 3 questions from the given passage along with their answers all in Hindi.
- You must write questions and answers both in Hindi for the upcoming examination.
- Follow the question-answer format given below

### Question-answer format
```
**Question 1**:
**Answer 1**:

**Question 2**:
**Answer 2**:

**Question N**:
**Answer N**:
```

### The passage is given below:
"""
यह कथा अवध की है। और बहुत पुरानी है। अवध में सरयू नदी के किनारे एक अति सुंदर नगर था। अयोध्या। सही अर्थों में दर्शनीय। देखने लायक। भव्यता जैसे उसका दूसरा नाम हो! अयोध्या में केवल राजमहल भव्य नहीं था। उसकी एक-एक इमारत आलीशान थी। आम लोगों के घर भव्य थे। सड़कें चौड़ी थीं। सुंदर बाग-बगीचे। पानी से लबालब भरे सरोवर। खेतों में लहराती हरियाली। हवा में हिलती फ़सलें सरयू की लहरों के साथ खेलती थीं। अयोध्या हर तरह से संपन्न नगरी थी। संपन्नता कोने-अंतरे तक बिखरी हुई। सभी सुखी। सब समृद्ध। दु:ख और विपन्नता को अयोध्या का पता नहीं मालूम था। या उन्हें नगर की सीमा में प्रवेश की अनुमति नहीं थी। पूरा नगर विलक्षण था। अदूभुत और मनोरम। उसे ऐसा होना ही था। वह कोसल राज्य की राजधानी था। राजा दशरथ वहीं रहते थे। उनके राज में दुःख का भला क्या काम? राजा दशरथ कुशल योद्धा और न्यायप्रिय शासक थे। महाराज अज के पुत्र। महाराज रघु के वंशज। रघुकुल के उत्तराधिकारी। रघुकुल की रीति-नीति का प्रभाव हर जगह दिखाई देता था। सुख-समृद्धि से लेकर बात-व्यवहार तक। लोग मर्यादाओं का पालन करते थे। सदाचारी थे। पवित्रता और शांति हर जगह थी। नगर में भी। लोगों के मन में भी। राजा दशरथ यशस्वी थे। उन्हें किसी चीज़ की कमी नहीं थी। राज-सुख था। कमी होने का प्रश्न ही नहीं था। लेकिन उन्हें एक दुःख था। छोटा सा दुःख। मन के एक कोने में छिपा हुआ। वह रह-रहकर उभर आता था। उन्हें सालता रहता था। उनके कोई संतान नहीं थी। आयु लगातार बढ़ती जा रही थी। ली सुनियाँ थीं- कौशुल्या, सुमित्रा और रानियों के मन में भी बस यही एक दुःख था। संतान की कमी। जीवन सूना-सूना लगता था। राजा दशरथ से रानियों की बातचीत प्राय: इसी विषय पर आकर रुक जाती थी। राजा दशरथ की चिंता बढ़ती जा रही थी। बहुत सोच-विचारकर महाराज दशरथ ने इस संबंध में वशिष्ठ मुनि से चर्चा की। उन्हें पूरी बात विस्तार से बताई। रघुकुल के अगले उत्तराधिकारी के बारे में अपनी चिंता बताई। मुनि वशिष्ठ राजा दशरथ की चिंता समझते थे। उन्होंने दशरथ को यज्ञ करने की सलाह दी। पुत्रेष्टि यज्ञ। महर्षि ने कहा, “आप पुत्रेष्टि यज्ञ करें, महाराज! आपकी इच्छा अवश्य पूरी होगी।” में हुआ। पूरा नगर उसकी तैयारी में लगा. हुआ था। यज्ञशाला सरयू नदी के किनारे बनाई गई। यज्ञ में अनेक राजाओं को निमंत्रित किया गया। तमाम ऋषि-मुनि पधारे। शंखध्वनि और मंत्रोच्चार के बीच एक-एक कर सबने आहुति डाली। अंतिम आहुति राजा दशरथ की थी। यज्ञ पूरा हुआ। अग्नि के देवता ने महाराज दशरथ को आशीर्वाद दिया। कुछ समय बाद दशरथ की इच्छा पूरी हुई। तीनों रानियाँ पुत्रवती हुई। महारानी कौशल्या ने राम को जन्म दिया। चैत्र माह की नवमी के दिन। रानी सुमित्रा के दो पुत्र हुए। लक्ष्मण और शत्रुघ्त। रानी कैकेयी के पुत्र का नाम भरत रखा गया।
"""

**Question 1**:

I think this will work. Here:
So,
Another thing: if it still gives wrong results, there is a little scope left with the prompt, but then you should switch to a larger model that understands Hindi. BTW, Google's Gemini is multilingual; if you can use it in their "Vertex AI" then it will provide you with quality results! Let me know how that goes! |
@AayushSameerShah - Thank you for the quick response. Modified the prompt mentioned by you to get some questions and answers with some issues -
And below is the response in one run -
प्रश्न 1: सरस्वती नदी के तट पर स्थित शहर कौन सा है?
प्रश्न 2: किसका जन्म चैत्र महीने की नौवें दिन हुआ था?
प्रश्न 3: किसका जन्म चैत्र महीने की नौवीं तारीख को हुआ था?

And below is the response in another run where I asked for generating 5 questions -
प्रश्न 1: सरस्वती नदी के तट पर स्थित शहर कौन सा था?
प्रश्न 2: किसका जन्म चैत्र महीने की नौवें दिन हुआ था?
प्रश्न 3: किसका जन्म चैत्र महीने की नौवीं तारीख को हुआ था?
प्रश्न 4: किसका जन्म चैत्र महीने की आठवीं तारीख को हुआ था?
(Incorrect questions)

Do you think the responses can be improved further? I experimented with Gemini chat (UI) in parallel - it generated better quality outputs for the same prompts (of course, it comes at a price!). Would you suggest any other APIs which are cheaper than Gemini/OpenAI? I am trying to use open-source LLMs and free Kaggle/Colab GPUs to build a product in Hindi which reads PDFs/images of text and stores them in a vectorDB to be used later for question answering. Please let me know your thoughts. |
Great explanation, thanks |
Thank you, @AayushSameerShah, for such an insightful discussion. I would like your comments on the following scenario. Let's say we have a private corpus and we build a RAG for open-book QA using an embedding model and an LLM. The corpus, however, is specialized. In my case, for instance, I am trying to build RAG around a corpus of really obscure philosophical works which no LLM, including GPT-4, has seen. Now the retrieval works fine, and I am getting answers given the context after optimal prompt crafting with context augmentation. I am using Gemma-7B with sentence transformers. But to further improve my system, I am now thinking of making my LLM a memory machine. Since the corpus is fixed and I want a companion LLM to always facilitate a conversational semantic search, I am now developing a dataset with three columns, i.e., context (chunk), Question and Answer. With such a dataset, my fine-tuned LLM would ingest the whole corpus as well as Q/A pairs with varying styles. I understand that this seems like overfitting on purpose, but the aim is to have a companion LLM that always references the corpus easily. I now have two questions:
Thank you. |
Hey @aasem !
Now, I assume your corpus is "still in English" and not in some other less-seen language like Hindi, Arabic, etc. With that in mind, RAG should work great, and it does, as you've said about your experiments with Gemma-7B.
Now, this part is the core of your question. Instead of using RAG all the time, providing context with every question, you want to "train the model on your text" so that the model doesn't have to be given the context anymore. Now, training from scratch and fine-tuning each come with a different set of challenges.
Let's discuss each one-by-one.
|
Hello,
Thanks for creating this very helpful tool!
I am fine-tuning the model (GPT-J-6B) for the question answering on the private documents. I have 1000+ documents and they are all in text format. And of course, I will be going with the PEFT LoRA.
But the question is...
Since this is the question-answering scenario, my first thought was to prepare the data set in
"Question: {} Answer: {} Context: {}"
format, but since there are so many documents, for that I will first need to generate the questions, then the answers and... you know, it becomes infeasible. Then I thought I should "just provide the raw text" to the model as the knowledge base and choose a model which was already fine-tuned on the Alpaca dataset (so the model understands instructions - for that I will use the "nlpcloud/instruct-gpt-j-fp16" model), and then my hope is that the model should give the response to my questions.
So, is what I am doing correct? How should I prepare my dataset for question answering?
Please help,
Thanks 🙏🏻