
About vicuna_dummy_data.json lack 'example_id' #14

Open
Harry-mic opened this issue Nov 21, 2023 · 5 comments

@Harry-mic

Hi! I encountered a bug in step 3 (Principle Engraving). I used self_align_merged.json, which is created from "self_align_32shards_*.jsonl" and "vicuna_dummy_data.json", to fine-tune the base model.

However, I found that the items in vicuna_dummy_data.json do not have 'example_id' labels, which causes a bug when executing the function "extract_dromedary_dataset":

def extract_dromedary_dataset(example, meta_prompts):
    assert "example_id" in example
    total_meta_prompt = len(meta_prompts)
    meta_prompt = meta_prompts[int(example["example_id"]) % total_meta_prompt]
    if example.get("input", "") != "":
        prompt_format = DROMEDARY_PROMPT_DICT["prompt_input"]
    else:
        prompt_format = DROMEDARY_PROMPT_DICT["prompt_no_input"]
    return {
        "input": prompt_format.format(meta_prompt=meta_prompt, **example),
        "output": "\n" + example["output"],
    }

The vicuna_dummy_data entries all have "example_id" set to None, which results in an int() error.

Therefore, I wonder how to deal with this issue and correctly get the example_ids for vicuna_dummy_data. Thanks a lot for your reply!
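
For reference, a minimal sketch of the failure (the field values below are placeholders standing in for a vicuna_dummy_data entry):

# Placeholder entry mirroring a vicuna_dummy_data item whose "example_id" is None.
example = {"instruction": "What is your name?", "input": "", "output": "...", "example_id": None}

# This is effectively what the failing line in extract_dromedary_dataset does:
int(example["example_id"])  # raises TypeError, since int(None) is not allowed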

@Edward-Sun
Contributor

Hi,

For now, please use the following code to replace the line meta_prompt = meta_prompts[int(example["example_id"]) % total_meta_prompt]. We will add a commit to fix the issue soon.

# Fall back to 0 when "example_id" is missing or None (as in vicuna_dummy_data).
example_id = 0
try:
    example_id = int(example["example_id"])
except (KeyError, TypeError, ValueError):
    pass
meta_prompt = meta_prompts[example_id % total_meta_prompt]
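
With this fallback, any entry whose "example_id" is missing or None (as in vicuna_dummy_data.json) is treated as example_id 0, so it always uses the first meta prompt.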

Best,
Zhiqing

@Harry-mic
Author

Harry-mic commented Nov 22, 2023

Thanks a lot for your reply and quick revision!

So in the original code, you tag all the unlabeled vicuna_dummy_data with example_id = 0? I wonder what the point is of tagging the vicuna_dummy_data with the same example_id while the self_align data is tagged with different example_ids. Also, I notice the vicuna_dummy_data are nearly all short conversations, so there seems to be a significant difference in quality between vicuna_dummy_data and the self_align data.

By the way, do you do inference with llama-2-70b while doing the fine-tuning with llama-2-70b-hf? I notice there is a difference in how the model is loaded between the inference and fine-tuning processes.

I'd appreciate your help!

@Edward-Sun
Contributor

Hi Harry,

In our codebase, "example_id" only affects which prompt template is used, so it won't affect the performance much.

Also, if you inspect the data, you will find that the vicuna_dummy_data are only about identity questions, so that the model generates correct outputs when asked about its name or developers. In this case, it is guaranteed not to affect the model's performance.
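
As a rough illustration (the meta prompts below are placeholders, not the real ones shipped in the repo), example_id only decides which meta prompt prefixes the example:

# Placeholder meta prompts; the real ones come from the repo's prompt files.
meta_prompts = ["META PROMPT A", "META PROMPT B"]
total_meta_prompt = len(meta_prompts)

def pick_meta_prompt(example_id):
    # Same selection rule as in extract_dromedary_dataset.
    return meta_prompts[int(example_id) % total_meta_prompt]

print(pick_meta_prompt(0))  # dummy data (fallback id 0) -> META PROMPT A
print(pick_meta_prompt(3))  # a self-align example with id 3 -> META PROMPT B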

> By the way, do you do inference with llama-2-70b while doing the fine-tuning with llama-2-70b-hf?

We use the original llama checkpoint (i.e., llama-2-70b) for model-parallel inference (from the original llama codebase). For fine-tuning, llama-2-70b-hf is used, since we rely on DeepSpeed (in Dromedary-1) or QLoRA (in Dromedary-2).
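
As a rough sketch of the two loading paths (the paths and model ids below are placeholders, not the exact ones in our scripts):

import torch
from llama import Llama  # original llama codebase (facebookresearch/llama)
from transformers import AutoModelForCausalLM, AutoTokenizer

# 1) Inference: native llama-2-70b checkpoint, launched with torchrun so the
#    weights are sharded across model-parallel ranks.
generator = Llama.build(
    ckpt_dir="llama-2-70b/",           # consolidated.*.pth shards
    tokenizer_path="tokenizer.model",
    max_seq_len=2048,
    max_batch_size=4,
)

# 2) Fine-tuning: llama-2-70b-hf checkpoint, loaded with transformers so that
#    DeepSpeed (Dromedary-1) or QLoRA (Dromedary-2) can wrap it.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-70b-hf")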

@Harry-mic
Author

Thanks a lot for your explanation!

Is it because of the faster inference of the llama-2-70b checkpoint code that you chose to use it rather than the Hugging Face code? The past_key_values cache in the Hugging Face code is also a problem.

@Edward-Sun
Contributor

Yes, when we developed this project around March/April, the faster inference frameworks for llama (e.g., TGI and vLLM) had not yet been developed, so we did our best with a customized llama using native model parallelism to improve generation throughput.
