When doing SFT, looking at the function `preprocess` in the source file megatron_patch/data/llama.py:

```python
def preprocess(self, examples):
    """
    Preprocess the data by tokenizing.

    Args:
        examples (dict): a batch of raw examples as a dict of lists, with
            keys such as 'instruction'/'query', optional 'input',
            and 'output'/'content'/'response'

    Returns:
        dict: a dictionary containing the input_ids and labels for the examples
    """
    prompt_input, prompt_no_input = PROMPT_DICT[
        'prompt_input'], PROMPT_DICT['prompt_no_input']
    sources = []
    if 'input' not in examples:
        if 'instruction' in examples:
            for instruction in examples['instruction']:
                sources.append(prompt_no_input.format_map({"instruction": instruction}))
        elif 'query' in examples:
            for query in examples['query']:
                sources.append(prompt_no_input.format_map({"instruction": query}))
    else:
        if 'instruction' in examples:
            for instruction, minput in zip(examples['instruction'], examples['input']):
                sources.append(prompt_input.format_map({"instruction": instruction, "input": minput}))
        elif 'query' in examples:
            for query, minput in zip(examples['query'], examples['input']):
                sources.append(prompt_input.format_map({"instruction": query, "input": minput}))
    if 'output' in examples:
        key = 'output'
    elif 'content' in examples:
        key = 'content'
    elif 'response' in examples:
        key = 'response'
    targets = [
        example + self.tokenizer.eos_token
        for example in examples[key]
    ]
    examples_raw = [s + t for s, t in zip(sources, targets)]
    examples_tokenized, sources_tokenized = [
        self.tokenize(strings, self.tokenizer)
        for strings in (examples_raw, sources)
    ]
    input_ids = examples_tokenized['input_ids']
    labels = copy.deepcopy(input_ids)
    for label, source_len in zip(labels,
                                 sources_tokenized['input_ids_lens']):
        label[:source_len] = self.IGNORE_INDEX
    return dict(input_ids=input_ids, labels=labels)
```

it is clear that Alpaca-format data carrying a `history` field (multi-turn dialogue) is not handled. Will this be supported in a future release?
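For reference, here is a minimal, standalone sketch of how `history` could be folded into the prompt before the existing tokenization step. It assumes each example carries a `history` list of `[query, response]` pairs, as in common Alpaca-with-history datasets; the template and the helper `build_source_with_history` are hypothetical and not part of megatron_patch:

```python
# Hypothetical template in the spirit of PROMPT_DICT['prompt_no_input']
# (assumption: not the actual template used by megatron_patch).
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)

def build_source_with_history(instruction, history):
    """Prepend past (query, response) turns to the current instruction.

    history: list of [query, response] pairs from earlier dialogue turns.
    Returns a single source string that ends with the current prompt,
    ready to be concatenated with the target and tokenized as before.
    """
    turns = []
    for past_query, past_response in history:
        turns.append(
            f"### Instruction:\n{past_query}\n\n"
            f"### Response:\n{past_response}\n\n"
        )
    return "".join(turns) + PROMPT_NO_INPUT.format_map({"instruction": instruction})

# Example: one prior turn plus the current instruction.
example = {
    "instruction": "And in Fahrenheit?",
    "history": [["What is the boiling point of water?", "100 degrees Celsius."]],
    "output": "212 degrees Fahrenheit.",
}
source = build_source_with_history(example["instruction"], example["history"])
print(source)
```

Inside `preprocess`, such a helper would replace the plain `prompt_no_input.format_map(...)` call whenever a `history` key is present; the downstream masking of `label[:source_len]` would then automatically ignore the historical turns as part of the source.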
The text was updated successfully, but these errors were encountered: