
About vicuna_dummy_data.json lack 'example_id' #14

Open
Harry-mic opened this issue Nov 21, 2023 · 5 comments

@Harry-mic

Hi! I encountered a bug in step 3 (Principle Engraving). I used self_align_merged.json, which is created from "self_align_32shards_*.jsonl" and "vicuna_dummy_data.json", to fine-tune the base model.

However, I found that the items in vicuna_dummy_data.json do not have 'example_id' labels, which causes a bug when executing the function "extract_dromedary_dataset":

def extract_dromedary_dataset(example, meta_prompts):
    assert "example_id" in example
    total_meta_prompt = len(meta_prompts)
    meta_prompt = meta_prompts[int(example["example_id"]) % total_meta_prompt]
    if example.get("input", "") != "":
        prompt_format = DROMEDARY_PROMPT_DICT["prompt_input"]
    else:
        prompt_format = DROMEDARY_PROMPT_DICT["prompt_no_input"]
    return {
        "input": prompt_format.format(meta_prompt=meta_prompt, **example),
        "output": "\n" + example["output"],
    }

The vicuna_dummy_data entries all have "example_id" set to None, which results in an int() error.

Therefore, I wonder how to deal with this issue and correctly get the example_ids for vicuna_dummy_data. Thanks a lot for your reply!
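
For reference, a minimal sketch of the failure (the field values below are placeholders standing in for a vicuna_dummy_data entry):

# Placeholder entry mirroring a vicuna_dummy_data item whose "example_id" is None.
example = {"instruction": "What is your name?", "input": "", "output": "...", "example_id": None}

# This is effectively what the failing line in extract_dromedary_dataset does:
int(example["example_id"])  # raises TypeError, since int(None) is not allowed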

@Edward-Sun
Contributor

Hi,

For now, please use the following code to replace the line meta_prompt = meta_prompts[int(example["example_id"]) % total_meta_prompt]. We will add a commit to fix the issue soon.

# Fall back to 0 when "example_id" is missing or None (as in vicuna_dummy_data).
example_id = 0
try:
    example_id = int(example["example_id"])
except (KeyError, TypeError, ValueError):
    pass
meta_prompt = meta_prompts[example_id % total_meta_prompt]
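
With this fallback, any entry whose "example_id" is missing or None (as in vicuna_dummy_data.json) is treated as example_id 0, so it always uses the first meta prompt.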

Best,
Zhiqing

@Harry-mic
Author

Harry-mic commented Nov 22, 2023

Thanks a lot for your reply and quick revision!

So in the original code, you tag all the unlabeled vicuna_dummy_data with example_id = 0? I wonder what the point is of tagging the vicuna_dummy_data with the same example_id while the self_align data is tagged with different example_ids. Also, I notice the vicuna_dummy_data are nearly all short conversations, so there seems to be a significant difference in quality between vicuna_dummy_data and the self_align data.

By the way, do you do inference with llama-2-70b while doing the fine-tuning with llama-2-70b-hf? I notice there is a difference in how the model is loaded between the inference and fine-tuning processes.

I'd appreciate your help!

@Edward-Sun
Contributor

Hi Harry,

In our codebase, "example_id" only affects which prompt template is used, so it won't affect the performance much.

Also, if you inspect the data, you will find that the vicuna_dummy_data are only about identity questions, so that the model generates correct outputs when asked about its name or developers. In this case, it is guaranteed not to affect the model's performance.
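
As a rough illustration (the meta prompts below are placeholders, not the real ones shipped in the repo), example_id only decides which meta prompt prefixes the example:

# Placeholder meta prompts; the real ones come from the repo's prompt files.
meta_prompts = ["META PROMPT A", "META PROMPT B"]
total_meta_prompt = len(meta_prompts)

def pick_meta_prompt(example_id):
    # Same selection rule as in extract_dromedary_dataset.
    return meta_prompts[int(example_id) % total_meta_prompt]

print(pick_meta_prompt(0))  # dummy data (fallback id 0) -> META PROMPT A
print(pick_meta_prompt(3))  # a self-align example with id 3 -> META PROMPT B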

> By the way, do you do inference with llama-2-70b while doing the fine-tuning with llama-2-70b-hf?

We use the original llama checkpoint (i.e., llama-2-70b) for model-parallel inference (from the original llama codebase). For fine-tuning, llama-2-70b-hf is used, since we rely on DeepSpeed (in Dromedary-1) or QLoRA (in Dromedary-2).
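
As a rough sketch of the two loading paths (the paths and model ids below are placeholders, not the exact ones in our scripts):

import torch
from llama import Llama  # original llama codebase (facebookresearch/llama)
from transformers import AutoModelForCausalLM, AutoTokenizer

# 1) Inference: native llama-2-70b checkpoint, launched with torchrun so the
#    weights are sharded across model-parallel ranks.
generator = Llama.build(
    ckpt_dir="llama-2-70b/",           # consolidated.*.pth shards
    tokenizer_path="tokenizer.model",
    max_seq_len=2048,
    max_batch_size=4,
)

# 2) Fine-tuning: llama-2-70b-hf checkpoint, loaded with transformers so that
#    DeepSpeed (Dromedary-1) or QLoRA (Dromedary-2) can wrap it.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-70b-hf")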

@Harry-mic
Author

Thanks a lot for your explanation!

Is it because of the faster inference of the llama-2-70b checkpoint code that you chose to use it rather than the Hugging Face code? The past_key_values cache in the Hugging Face code is also a problem.

@Edward-Sun
Contributor

Yes, when we developed this project around March/April, the faster inference frameworks for llama (e.g., TGI and vLLM) had not yet been developed, so we did our best with a customized llama using native model parallelism to improve generation throughput.
