
Is sharegpt-format data supported? Or multi-turn dialogue data with a "history" field? #306

Open
jiejie1993 opened this issue Jul 31, 2024 · 1 comment

Comments

@jiejie1993

When running SFT, the `preprocess` function in the source file megatron_patch/data/llama.py looks like this:

```python
def preprocess(self, examples):
    """
    Preprocess the data by tokenizing.
    Args:
        sources (List[str]): a list of source strings
        targets (List[str]): a list of target strings
        tokenizer (Tokenizer): a tokenizer object used for tokenization
    Returns:
        dict: a dictionary containing the input_ids and labels for the examples
    """

    prompt_input, prompt_no_input = PROMPT_DICT[
        'prompt_input'], PROMPT_DICT['prompt_no_input']

    sources = []
    if 'input' not in examples:
        if 'instruction' in examples:
            for instruction in examples['instruction']:
                sources.append(prompt_no_input.format_map({"instruction": instruction}))
        elif 'query' in examples:
            for query in examples['query']:
                sources.append(prompt_no_input.format_map({"instruction": query}))
    else:
        if 'instruction' in examples:
            for instruction, minput in zip(examples['instruction'], examples['input']):
                sources.append(prompt_input.format_map({"instruction": instruction, "input": minput}))
        elif 'query' in examples:
            for query, minput in zip(examples['query'], examples['input']):
                sources.append(prompt_input.format_map({"instruction": query, "input": minput}))

    if 'output' in examples:
        key = 'output'
    elif 'content' in examples:
        key = 'content'
    elif 'response' in examples:
        key = 'response'

    targets = [
        example + self.tokenizer.eos_token
        for example in examples[key]
    ]

    examples_raw = [s + t for s, t in zip(sources, targets)]
    examples_tokenized, sources_tokenized = [
        self.tokenize(strings, self.tokenizer)
        for strings in (examples_raw, sources)
    ]
    input_ids = examples_tokenized['input_ids']
    labels = copy.deepcopy(input_ids)
    for label, source_len in zip(labels,
                                 sources_tokenized['input_ids_lens']):
        label[:source_len] = self.IGNORE_INDEX
    return dict(input_ids=input_ids, labels=labels)
```

As you can see, Alpaca-format data with a `history` field of past dialogue turns is not handled. Will this be supported later?

@jerryli1981
Collaborator

Hi, you can fork the patch and develop this yourself.
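
For anyone who does fork the patch: below is a minimal sketch (not part of megatron_patch; the helper name and assumed data layout are hypothetical) of how the prompt construction could be extended to fold an Alpaca-style `history` field, i.e. a list of `[user_turn, assistant_turn]` pairs, into each source string before the existing tokenization and label masking run.

```python
# A minimal sketch, assuming each example carries a 'history' field that is a
# list of [user_turn, assistant_turn] pairs (Alpaca-with-history style).
# The helper name is hypothetical and not part of megatron_patch.
def build_source_with_history(instruction, history, prompt_no_input):
    """Concatenate past turns and the current instruction into one source string."""
    parts = []
    for user_turn, assistant_turn in (history or []):
        # Each past user turn is formatted like a normal prompt, followed by
        # the assistant's past answer as plain text.
        parts.append(prompt_no_input.format_map({"instruction": user_turn}))
        parts.append(assistant_turn)
    # The current turn is formatted last; the existing preprocess logic still
    # appends the 'output' field (plus EOS) as the target.
    parts.append(prompt_no_input.format_map({"instruction": instruction}))
    return "".join(parts)


# Example of how the 'instruction' branch inside preprocess could use it:
#
#     histories = examples.get('history', [[]] * len(examples['instruction']))
#     for instruction, history in zip(examples['instruction'], histories):
#         sources.append(build_source_with_history(instruction, history, prompt_no_input))
```

Note that because `preprocess` masks `label[:source_len]` with `IGNORE_INDEX`, folding history into the source excludes earlier assistant turns from the loss; training on every assistant turn would require per-turn label masking instead.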
