
Is sharegpt-format data supported? Or multi-turn dialogue data with a "history" field? #306

Open
jiejie1993 opened this issue Jul 31, 2024 · 1 comment

Comments

@jiejie1993

When running SFT, the `preprocess` function in the source file megatron_patch/data/llama.py looks like this:

```python
def preprocess(self, examples):
    """
    Preprocess the data by tokenizing.
    Args:
        sources (List[str]): a list of source strings
        targets (List[str]): a list of target strings
        tokenizer (Tokenizer): a tokenizer object used for tokenization
    Returns:
        dict: a dictionary containing the input_ids and labels for the examples
    """

    prompt_input, prompt_no_input = PROMPT_DICT[
        'prompt_input'], PROMPT_DICT['prompt_no_input']

    sources = []
    if 'input' not in examples:
        if 'instruction' in examples:
            for instruction in examples['instruction']:
                sources.append(prompt_no_input.format_map({"instruction": instruction}))
        elif 'query' in examples:
            for query in examples['query']:
                sources.append(prompt_no_input.format_map({"instruction": query}))
    else:
        if 'instruction' in examples:
            for instruction, minput in zip(examples['instruction'], examples['input']):
                sources.append(prompt_input.format_map({"instruction": instruction, "input": minput}))
        elif 'query' in examples:
            for query, minput in zip(examples['query'], examples['input']):
                sources.append(prompt_input.format_map({"instruction": query, "input": minput}))

    if 'output' in examples:
        key = 'output'
    elif 'content' in examples:
        key = 'content'
    elif 'response' in examples:
        key = 'response'

    targets = [
        example + self.tokenizer.eos_token
        for example in examples[key]
    ]

    examples_raw = [s + t for s, t in zip(sources, targets)]
    examples_tokenized, sources_tokenized = [
        self.tokenize(strings, self.tokenizer)
        for strings in (examples_raw, sources)
    ]
    input_ids = examples_tokenized['input_ids']
    labels = copy.deepcopy(input_ids)
    for label, source_len in zip(labels,
                                 sources_tokenized['input_ids_lens']):
        label[:source_len] = self.IGNORE_INDEX
    return dict(input_ids=input_ids, labels=labels)
```

As you can see, Alpaca-format data with a `history` field of past dialogue turns is not handled. Will this be supported later?

@jerryli1981
Collaborator

Hi, you can fork the patch and develop this yourself.
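
For anyone who does fork the patch: below is a minimal sketch (not part of megatron_patch; the helper name and assumed data layout are hypothetical) of how the prompt construction could be extended to fold an Alpaca-style `history` field, i.e. a list of `[user_turn, assistant_turn]` pairs, into each source string before the existing tokenization and label masking run.

```python
# A minimal sketch, assuming each example carries a 'history' field that is a
# list of [user_turn, assistant_turn] pairs (Alpaca-with-history style).
# The helper name is hypothetical and not part of megatron_patch.
def build_source_with_history(instruction, history, prompt_no_input):
    """Concatenate past turns and the current instruction into one source string."""
    parts = []
    for user_turn, assistant_turn in (history or []):
        # Each past user turn is formatted like a normal prompt, followed by
        # the assistant's past answer as plain text.
        parts.append(prompt_no_input.format_map({"instruction": user_turn}))
        parts.append(assistant_turn)
    # The current turn is formatted last; the existing preprocess logic still
    # appends the 'output' field (plus EOS) as the target.
    parts.append(prompt_no_input.format_map({"instruction": instruction}))
    return "".join(parts)


# Example of how the 'instruction' branch inside preprocess could use it:
#
#     histories = examples.get('history', [[]] * len(examples['instruction']))
#     for instruction, history in zip(examples['instruction'], histories):
#         sources.append(build_source_with_history(instruction, history, prompt_no_input))
```

Note that because `preprocess` masks `label[:source_len]` with `IGNORE_INDEX`, folding history into the source excludes earlier assistant turns from the loss; training on every assistant turn would require per-turn label masking instead.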
