
ValueError: a must be greater than 0 unless no samples are taken #110

shuyu-rich opened this issue Dec 6, 2024 · 7 comments

@shuyu-rich

I'm using a local model, and I don't have a dataset; there's no data in my Argilla account either. I can only generate some data myself based on the keywords the code uses to read datasets. Now I'm getting an error, and I don't know where the problem lies.
Here is my error message:

D:\lla\AutoPrompt-main\AutoPrompt\utils\config.py:6: LangChainDeprecationWarning: Importing HuggingFacePipeline from langchain.llms is deprecated. Please replace
deprecated imports:

from langchain.llms import HuggingFacePipeline

with new imports of:

Starting step 0
Processing samples: 0it [00:00, ?it/s]
Processing samples: 0it [00:00, ?it/s]
Setting pad_token_id to eos_token_id:50256 for open-end generation.
Setting pad_token_id to eos_token_id:50256 for open-end generation.
Previous prompt score:
nan
#########

Get new prompt:
D:/lla/AutoPrompt-main/prompts/predictor_completion/prediction.prompt
Processing samples: 0%| | 0/1 [00:00<?, ?it/s]
Setting pad_token_id to eos_token_id:50256 for open-end generation.
Processing samples: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:05<00:00, 5.39s/it]
Starting step 1
Processing samples: 0%| | 0/2 [00:00<?, ?it/s]
Setting pad_token_id to eos_token_id:50256 for open-end generation.
Setting pad_token_id to eos_token_id:50256 for open-end generation.
Processing samples: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:07<00:00, 4.00s/it]
Processing samples: 0it [00:00, ?it/s]
Setting pad_token_id to eos_token_id:50256 for open-end generation.
Setting pad_token_id to eos_token_id:50256 for open-end generation.
Previous prompt score:
nan
#########

Get new prompt:
D:/lla/AutoPrompt-main/prompts/predictor_completion/prediction.prompt
Processing samples: 0%| | 0/1 [00:00<?, ?it/s]
Setting pad_token_id to eos_token_id:50256 for open-end generation.
Processing samples: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:05<00:00, 5.02s/it]
Starting step 2
Processing samples: 0%| | 0/2 [00:00<?, ?it/s]
Setting pad_token_id to eos_token_id:50256 for open-end generation.
Setting pad_token_id to eos_token_id:50256 for open-end generation.
Processing samples: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:06<00:00, 3.35s/it]
Processing samples: 0it [00:00, ?it/s]
Setting pad_token_id to eos_token_id:50256 for open-end generation.
Setting pad_token_id to eos_token_id:50256 for open-end generation.
Previous prompt score:
nan
#########

Get new prompt:
D:/lla/AutoPrompt-main/prompts/predictor_completion/prediction.prompt
Processing samples: 0%| | 0/1 [00:00<?, ?it/s]
Setting pad_token_id to eos_token_id:50256 for open-end generation.
Processing samples: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:04<00:00, 4.95s/it]
Starting step 3
Processing samples: 0%| | 0/1 [00:00<?, ?it/s]
Setting pad_token_id to eos_token_id:50256 for open-end generation.
Processing samples: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:04<00:00, 4.63s/it]
Processing samples: 0it [00:00, ?it/s]
Setting pad_token_id to eos_token_id:50256 for open-end generation.
Setting pad_token_id to eos_token_id:50256 for open-end generation.
Previous prompt score:
nan
#########

Get new prompt:
D:/lla/AutoPrompt-main/prompts/predictor_completion/prediction.prompt
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ D:\lla\AutoPrompt-main\AutoPrompt\dataset\base_dataset.py:145 in sample_records │
│ │
│ 142 │ │ │ │ df_samples = self.records.head(n) │
│ 143 │ │ else: │
│ 144 │ │ │ try: │
│ ❱ 145 │ │ │ │ df_samples = self.records.sample(n) │
│ 146 │ │ │ except: │
│ 147 │ │ │ │ n = 1 # ensure the sample size is at least 1 │
│ 148 │ │ │ │ df_samples = self.records.sample(n=n) │
│ │
│ C:\Users\PS\AppData\Roaming\Python\Python310\site-packages\pandas\core\generic.py:5773 in sample │
│ │
│ 5770 │ │ if weights is not None: │
│ 5771 │ │ │ weights = sample.preprocess_weights(self, weights, axis) │
│ 5772 │ │ │
│ ❱ 5773 │ │ sampled_indices = sample.sample(obj_len, size, replace, weights, rs) │
│ 5774 │ │ result = self.take(sampled_indices, axis=axis) │
│ 5775 │ │ │
│ 5776 │ │ if ignore_index: │
│ │
│ C:\Users\PS\AppData\Roaming\Python\Python310\site-packages\pandas\core\sample.py:150 in sample │
│ │
│ 147 │ │ else: │
│ 148 │ │ │ raise ValueError("Invalid weights: weights sum to zero") │
│ 149 │ │
│ ❱ 150 │ return random_state.choice(obj_len, size=size, replace=replace, p=weights).astype( │
│ 151 │ │ np.intp, copy=False │
│ 152 │ ) │
│ 153 │
│ │
│ in numpy.random.mtrand.RandomState.choice:909 │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
ValueError: a must be greater than 0 unless no samples are taken

During handling of the above exception, another exception occurred:

╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ D:\lla\AutoPrompt-main\AutoPrompt\run_pipeline.py:44 in │
│ │
│ 41 pipeline = OptimizationPipeline(config_params, task_description, initial_prompt, output_ │
│ 42 if (opt.load_path != ''): │
│ 43 │ pipeline.load_state(opt.load_path) │
│ ❱ 44 best_prompt = pipeline.run_pipeline(opt.num_steps) │
│ 45 print('\033[92m' + 'Calibrated prompt score:', str(best_prompt['score']) + '\033[0m') │
│ 46 print('\033[92m' + 'Calibrated prompt:', best_prompt['prompt'] + '\033[0m') │
│ 47 │
│ │
│ D:\lla\AutoPrompt-main\AutoPrompt\optimization_pipeline.py:281 in run_pipeline │
│ │
│ 278 │ │ # Run the optimization pipeline for num_steps │
│ 279 │ │ num_steps_remaining = num_steps - self.batch_id │
│ 280 │ │ for i in range(num_steps_remaining): │
│ ❱ 281 │ │ │ stop_criteria = self.step(i, num_steps_remaining) │
│ 282 │ │ │ if stop_criteria: │
│ 283 │ │ │ │ break │
│ 284 │ │ final_result = self.extract_best_prompt() │
│ │
│ D:\lla\AutoPrompt-main\AutoPrompt\optimization_pipeline.py:273 in step │
│ │
│ 270 │ │ │ self.log_and_print('Stop criteria reached') │
│ 271 │ │ │ return True │
│ 272 │ │ if current_iter != total_iter-1: │
│ ❱ 273 │ │ │ self.run_step_prompt() │
│ 274 │ │ self.save_state() │
│ 275 │ │ return False │
│ 276 │
│ │
│ D:\lla\AutoPrompt-main\AutoPrompt\optimization_pipeline.py:137 in run_step_prompt │
│ │
│ 134 │ │ │ │ │ batch['extra_samples'] = extra_samples_text │
│ 135 │ │ │ else: │
│ 136 │ │ │ │ for batch in batch_inputs: │
│ ❱ 137 │ │ │ │ │ extra_samples = self.dataset.sample_records() │
│ 138 │ │ │ │ │ extra_samples_text = DatasetBase.samples_to_text(extra_samples) │
│ 139 │ │ │ │ │ batch['history'] = 'No previous errors information' │
│ 140 │ │ │ │ │ batch['extra_samples'] = extra_samples_text │
│ │
│ D:\lla\AutoPrompt-main\AutoPrompt\dataset\base_dataset.py:148 in sample_records │
│ │
│ 145 │ │ │ │ df_samples = self.records.sample(n) │
│ 146 │ │ │ except: │
│ 147 │ │ │ │ n = 1 # ensure the sample size is at least 1 │
│ ❱ 148 │ │ │ │ df_samples = self.records.sample(n=n) │
│ 149 │ │ │
│ 150 │ │ return df_samples │
│ 151 │
│ │
│ C:\Users\PS\AppData\Roaming\Python\Python310\site-packages\pandas\core\generic.py:5773 in sample │
│ │
│ 5770 │ │ if weights is not None: │
│ 5771 │ │ │ weights = sample.preprocess_weights(self, weights, axis) │
│ 5772 │ │ │
│ ❱ 5773 │ │ sampled_indices = sample.sample(obj_len, size, replace, weights, rs) │
│ 5774 │ │ result = self.take(sampled_indices, axis=axis) │
│ 5775 │ │ │
│ 5776 │ │ if ignore_index: │
│ │
│ C:\Users\PS\AppData\Roaming\Python\Python310\site-packages\pandas\core\sample.py:150 in sample │
│ │
│ 147 │ │ else: │
│ 148 │ │ │ raise ValueError("Invalid weights: weights sum to zero") │
│ 149 │ │
│ ❱ 150 │ return random_state.choice(obj_len, size=size, replace=replace, p=weights).astype( │
│ 151 │ │ np.intp, copy=False │
│ 152 │ ) │
│ 153 │
│ │
│ in numpy.random.mtrand.RandomState.choice:909 │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
ValueError: a must be greater than 0 unless no samples are taken
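
For reference, both tracebacks point to the same root cause: self.records is an empty DataFrame, so even the except branch that retries with n = 1 calls DataFrame.sample on zero rows, and pandas re-raises the same ValueError from numpy.random.choice. A minimal defensive sketch (a hypothetical standalone version of sample_records, not the repository's actual code) that guards against the empty case:

import pandas as pd

def sample_records(records: pd.DataFrame, n: int = 3) -> pd.DataFrame:
    # Sampling an empty DataFrame makes pandas call numpy.random.choice(0, ...),
    # which raises "ValueError: a must be greater than 0 unless no samples are taken".
    if records.empty:
        return records  # nothing to sample from; hand the empty frame back to the caller
    # Never request more rows than actually exist.
    return records.sample(n=min(n, len(records)))

This only avoids the crash; why no records were generated in the first place is addressed in the replies below.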

@shuyu-rich (Author)

Here is my configuration file information. I hope it can help you understand the issue more clearly.

use_wandb: False
dataset:
    name: 'dataset'
    records_path: null
    initial_dataset: 'D:\lla\AutoPrompt-main\dataset\reviews.csv'
    label_schema: ["Yes", "No"]
    max_samples: 50
    semantic_sampling: False # Change to True in case you don't have M1. Currently there is an issue with faiss and M1

annotator:
    method: 'llm'
    config:
        llm:
            type: 'HuggingFacePipeline'
            name: 'D:/lla/AutoPrompt-main/GPT2Model' # local GPT-2 model
            max_new_tokens: 100 # cap on the number of generated tokens
        instruction:
            'Assess whether the text contains a harmful topic.
            Answer Yes if it does and No otherwise.'
        num_workers: 5
        prompt: 'prompts/predictor_completion/prediction.prompt'
        mini_batch_size: 1
        mode: 'annotation'

predictor:
    method: 'llm'
    config:
        llm:
            type: 'HuggingFacePipeline'
            name: 'D:/lla/AutoPrompt-main/GPT2Model' # local GPT-2 model
            model_path: 'D:/lla/AutoPrompt-main/GPT2Model'
            temperature: 0.8 # controls generation diversity
            max_new_tokens: 100 # cap on the number of generated tokens
            model_kwargs: {"seed": 220}
        num_workers: 5
        prompt: 'D:/lla/AutoPrompt-main/prompts/predictor_completion/prediction.prompt'
        mini_batch_size: 1 # change to >1 if you want to include multiple samples in one prompt
        mode: 'prediction'

meta_prompts:
    folder: 'prompts/meta_prompts_classification'
    num_err_prompt: 1 # Number of error examples per sample in the prompt generation
    num_err_samples: 2 # Number of error examples per sample in the sample generation
    history_length: 4 # Number of samples in the meta-prompt history
    num_generated_samples: 10 # Number of generated samples at each iteration
    num_initialize_samples: 10 # Number of generated samples at iteration 0, in the zero-shot case
    samples_generation_batch: 10 # Number of samples generated in one call to the LLM
    num_workers: 5 # Number of parallel workers
    warmup: 4 # Number of warmup steps

eval:
    function_name: 'accuracy'
    num_large_errors: 4
    num_boundary_predictions: 0
    error_threshold: 0.5

llm:
    name: 'D:/lla/AutoPrompt-main/GPT2Model' # local GPT-2 model
    type: 'huggingfacepipeline' # Hugging Face model type
    temperature: 0.8
    model_path: 'D:/lla/AutoPrompt-main/GPT2Model' # local model path
    max_new_tokens: 100 # cap on the number of generated tokens

stop_criteria:
    max_usage: 2 # In $ in case of OpenAI models, otherwise number of tokens
    patience: 10 # Number of patience steps
    min_delta: 0.01 # Delta for the improvement definition
@Eladlev (Owner) commented Dec 6, 2024

The llm part in the configuration is the meta-prompt LLM.
This must be a very strong model, and should be a proprietary model (OpenAI/Anthropic/Google).
You can use a local model for the 'predictor' LLM (depending on your task).
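
For example, the top-level llm section (the meta-prompt LLM, outside annotator/predictor) could be pointed at a hosted model along these lines; this is only a sketch assuming the OpenAI backend used in the repository's default configuration, and the model name is illustrative:

llm:
    type: 'OpenAI'
    name: 'gpt-4-1106-preview' # meta-prompt LLM: should be a strong hosted model
    temperature: 0.8

The predictor (and, if you like, the annotator) section can keep pointing at the local HuggingFacePipeline model.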

@shuyu-rich (Author)

Is this the configuration section I need to modify to change the meta-prompt LLM? Can I download the google/flan-t5-xl model to my local computer and use it? Also, please check the error message above; I suspect it may be a problem with the dataset, but I don't have a dataset. Could you share an open-source dataset for me to test with?

annotator:
    method: 'llm'
    config:
        llm:
            type: 'HuggingFacePipeline'
            name: 'D:/lla/AutoPrompt-main/GPT2Model' # local GPT-2 model
            max_new_tokens: 100 # cap on the number of generated tokens
        instruction:
            'Assess whether the text contains a harmful topic.
            Answer Yes if it does and No otherwise.'
        num_workers: 5
        prompt: 'prompts/predictor_completion/prediction.prompt'
        mini_batch_size: 1
        mode: 'annotation'

@Eladlev (Owner) commented Dec 9, 2024

  1. In order to modify the meta-prompt LLM, you need to change the llm section that is outside the annotator/predictor. Use GPT-2 only in the predictor part (all the other LLMs should be strong models).
  2. Flan-T5 might be good as a predictor, but not as the meta-prompt LLM.
  3. You don't need a dataset. The system supports a zero-shot setting and generates synthetic samples for the dataset. The issue you encounter is that the meta-prompt LLM, which is also responsible for generating the dataset, is not strong enough (therefore it doesn't generate the dataset).

@shuyu-rich (Author)

If I want to run everything locally, where should I modify the meta-prompt LLM?

@Eladlev (Owner) commented Dec 9, 2024

If you want to run everything locally, then your modifications are correct.
However, as I said, unless you have access to GPT-4 and can run it locally, this is simply not going to work. The meta-prompt task is too challenging for basic open-source models, and you will not get a valid response.

@shuyu-rich (Author)

Alright, now I know where the problem lies. Although it cannot run entirely locally, I still really appreciate your response.
