
ValueError: a must be greater than 0 unless no samples are taken #110

shuyu-rich opened this issue Dec 6, 2024 · 7 comments

@shuyu-rich

I'm using a local model, and I don't have a dataset; there's no data in my Argilla account either. I can only generate some data myself based on the keywords the code uses to read datasets. Now I'm getting an error, and I don't know where the problem lies.
Here is my error message:

D:\lla\AutoPrompt-main\AutoPrompt\utils\config.py:6: LangChainDeprecationWarning: Importing HuggingFacePipeline from langchain.llms is deprecated. Please replace
deprecated imports:

from langchain.llms import HuggingFacePipeline

with new imports of:

Starting step 0
Processing samples: 0it [00:00, ?it/s]
Processing samples: 0it [00:00, ?it/s]
Setting pad_token_id to eos_token_id:50256 for open-end generation.
Setting pad_token_id to eos_token_id:50256 for open-end generation.
Previous prompt score:
nan
#########

Get new prompt:
D:/lla/AutoPrompt-main/prompts/predictor_completion/prediction.prompt
Processing samples: 0%| | 0/1 [00:00<?, ?it/s]
Setting pad_token_id to eos_token_id:50256 for open-end generation.
Processing samples: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:05<00:00, 5.39s/it]
Starting step 1
Processing samples: 0%| | 0/2 [00:00<?, ?it/s]
Setting pad_token_id to eos_token_id:50256 for open-end generation.
Setting pad_token_id to eos_token_id:50256 for open-end generation.
Processing samples: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:07<00:00, 4.00s/it]
Processing samples: 0it [00:00, ?it/s]
Setting pad_token_id to eos_token_id:50256 for open-end generation.
Setting pad_token_id to eos_token_id:50256 for open-end generation.
Previous prompt score:
nan
#########

Get new prompt:
D:/lla/AutoPrompt-main/prompts/predictor_completion/prediction.prompt
Processing samples: 0%| | 0/1 [00:00<?, ?it/s]
Setting pad_token_id to eos_token_id:50256 for open-end generation.
Processing samples: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:05<00:00, 5.02s/it]
Starting step 2
Processing samples: 0%| | 0/2 [00:00<?, ?it/s]
Setting pad_token_id to eos_token_id:50256 for open-end generation.
Setting pad_token_id to eos_token_id:50256 for open-end generation.
Processing samples: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:06<00:00, 3.35s/it]
Processing samples: 0it [00:00, ?it/s]
Setting pad_token_id to eos_token_id:50256 for open-end generation.
Setting pad_token_id to eos_token_id:50256 for open-end generation.
Previous prompt score:
nan
#########

Get new prompt:
D:/lla/AutoPrompt-main/prompts/predictor_completion/prediction.prompt
Processing samples: 0%| | 0/1 [00:00<?, ?it/s]
Setting pad_token_id to eos_token_id:50256 for open-end generation.
Processing samples: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:04<00:00, 4.95s/it]
Starting step 3
Processing samples: 0%| | 0/1 [00:00<?, ?it/s]
Setting pad_token_id to eos_token_id:50256 for open-end generation.
Processing samples: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:04<00:00, 4.63s/it]
Processing samples: 0it [00:00, ?it/s]
Setting pad_token_id to eos_token_id:50256 for open-end generation.
Setting pad_token_id to eos_token_id:50256 for open-end generation.
Previous prompt score:
nan
#########

Get new prompt:
D:/lla/AutoPrompt-main/prompts/predictor_completion/prediction.prompt
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ D:\lla\AutoPrompt-main\AutoPrompt\dataset\base_dataset.py:145 in sample_records │
│ │
│ 142 │ │ │ │ df_samples = self.records.head(n) │
│ 143 │ │ else: │
│ 144 │ │ │ try: │
│ ❱ 145 │ │ │ │ df_samples = self.records.sample(n) │
│ 146 │ │ │ except: │
│ 147 │ │ │ │ n = 1 # ensure the sample size is at least 1 │
│ 148 │ │ │ │ df_samples = self.records.sample(n=n) │
│ │
│ C:\Users\PS\AppData\Roaming\Python\Python310\site-packages\pandas\core\generic.py:5773 in sample │
│ │
│ 5770 │ │ if weights is not None: │
│ 5771 │ │ │ weights = sample.preprocess_weights(self, weights, axis) │
│ 5772 │ │ │
│ ❱ 5773 │ │ sampled_indices = sample.sample(obj_len, size, replace, weights, rs) │
│ 5774 │ │ result = self.take(sampled_indices, axis=axis) │
│ 5775 │ │ │
│ 5776 │ │ if ignore_index: │
│ │
│ C:\Users\PS\AppData\Roaming\Python\Python310\site-packages\pandas\core\sample.py:150 in sample │
│ │
│ 147 │ │ else: │
│ 148 │ │ │ raise ValueError("Invalid weights: weights sum to zero") │
│ 149 │ │
│ ❱ 150 │ return random_state.choice(obj_len, size=size, replace=replace, p=weights).astype( │
│ 151 │ │ np.intp, copy=False │
│ 152 │ ) │
│ 153 │
│ │
│ in numpy.random.mtrand.RandomState.choice:909 │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
ValueError: a must be greater than 0 unless no samples are taken

During handling of the above exception, another exception occurred:

╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ D:\lla\AutoPrompt-main\AutoPrompt\run_pipeline.py:44 in │
│ │
│ 41 pipeline = OptimizationPipeline(config_params, task_description, initial_prompt, output_ │
│ 42 if (opt.load_path != ''): │
│ 43 │ pipeline.load_state(opt.load_path) │
│ ❱ 44 best_prompt = pipeline.run_pipeline(opt.num_steps) │
│ 45 print('\033[92m' + 'Calibrated prompt score:', str(best_prompt['score']) + '\033[0m') │
│ 46 print('\033[92m' + 'Calibrated prompt:', best_prompt['prompt'] + '\033[0m') │
│ 47 │
│ │
│ D:\lla\AutoPrompt-main\AutoPrompt\optimization_pipeline.py:281 in run_pipeline │
│ │
│ 278 │ │ # Run the optimization pipeline for num_steps │
│ 279 │ │ num_steps_remaining = num_steps - self.batch_id │
│ 280 │ │ for i in range(num_steps_remaining): │
│ ❱ 281 │ │ │ stop_criteria = self.step(i, num_steps_remaining) │
│ 282 │ │ │ if stop_criteria: │
│ 283 │ │ │ │ break │
│ 284 │ │ final_result = self.extract_best_prompt() │
│ │
│ D:\lla\AutoPrompt-main\AutoPrompt\optimization_pipeline.py:273 in step │
│ │
│ 270 │ │ │ self.log_and_print('Stop criteria reached') │
│ 271 │ │ │ return True │
│ 272 │ │ if current_iter != total_iter-1: │
│ ❱ 273 │ │ │ self.run_step_prompt() │
│ 274 │ │ self.save_state() │
│ 275 │ │ return False │
│ 276 │
│ │
│ D:\lla\AutoPrompt-main\AutoPrompt\optimization_pipeline.py:137 in run_step_prompt │
│ │
│ 134 │ │ │ │ │ batch['extra_samples'] = extra_samples_text │
│ 135 │ │ │ else: │
│ 136 │ │ │ │ for batch in batch_inputs: │
│ ❱ 137 │ │ │ │ │ extra_samples = self.dataset.sample_records() │
│ 138 │ │ │ │ │ extra_samples_text = DatasetBase.samples_to_text(extra_samples) │
│ 139 │ │ │ │ │ batch['history'] = 'No previous errors information' │
│ 140 │ │ │ │ │ batch['extra_samples'] = extra_samples_text │
│ │
│ D:\lla\AutoPrompt-main\AutoPrompt\dataset\base_dataset.py:148 in sample_records │
│ │
│ 145 │ │ │ │ df_samples = self.records.sample(n) │
│ 146 │ │ │ except: │
│ 147 │ │ │ │ n = 1 # ensure the sample size is at least 1 │
│ ❱ 148 │ │ │ │ df_samples = self.records.sample(n=n) │
│ 149 │ │ │
│ 150 │ │ return df_samples │
│ 151 │
│ │
│ C:\Users\PS\AppData\Roaming\Python\Python310\site-packages\pandas\core\generic.py:5773 in sample │
│ │
│ 5770 │ │ if weights is not None: │
│ 5771 │ │ │ weights = sample.preprocess_weights(self, weights, axis) │
│ 5772 │ │ │
│ ❱ 5773 │ │ sampled_indices = sample.sample(obj_len, size, replace, weights, rs) │
│ 5774 │ │ result = self.take(sampled_indices, axis=axis) │
│ 5775 │ │ │
│ 5776 │ │ if ignore_index: │
│ │
│ C:\Users\PS\AppData\Roaming\Python\Python310\site-packages\pandas\core\sample.py:150 in sample │
│ │
│ 147 │ │ else: │
│ 148 │ │ │ raise ValueError("Invalid weights: weights sum to zero") │
│ 149 │ │
│ ❱ 150 │ return random_state.choice(obj_len, size=size, replace=replace, p=weights).astype( │
│ 151 │ │ np.intp, copy=False │
│ 152 │ ) │
│ 153 │
│ │
│ in numpy.random.mtrand.RandomState.choice:909 │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
ValueError: a must be greater than 0 unless no samples are taken
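
For reference, both tracebacks point to the same root cause: self.records is an empty DataFrame, so even the except branch that retries with n = 1 calls DataFrame.sample on zero rows, and pandas re-raises the same ValueError from numpy.random.choice. A minimal defensive sketch (a hypothetical standalone version of sample_records, not the repository's actual code) that guards against the empty case:

import pandas as pd

def sample_records(records: pd.DataFrame, n: int = 3) -> pd.DataFrame:
    # Sampling an empty DataFrame makes pandas call numpy.random.choice(0, ...),
    # which raises "ValueError: a must be greater than 0 unless no samples are taken".
    if records.empty:
        return records  # nothing to sample from; hand the empty frame back to the caller
    # Never request more rows than actually exist.
    return records.sample(n=min(n, len(records)))

This only avoids the crash; why no records were generated in the first place is addressed in the replies below.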

@shuyu-rich (Author)

Here is my configuration file information. I hope it can help you understand the issue more clearly.

use_wandb: False
dataset:
    name: 'dataset'
    records_path: null
    initial_dataset: 'D:\lla\AutoPrompt-main\dataset\reviews.csv'
    label_schema: ["Yes", "No"]
    max_samples: 50
    semantic_sampling: False # Change to True in case you don't have M1. Currently there is an issue with faiss and M1

annotator:
    method: 'llm'
    config:
        llm:
            type: 'HuggingFacePipeline'
            name: 'D:/lla/AutoPrompt-main/GPT2Model' # local GPT-2 model
            max_new_tokens: 100 # cap on the number of generated tokens
        instruction:
            'Assess whether the text contains a harmful topic.
            Answer Yes if it does and No otherwise.'
        num_workers: 5
        prompt: 'prompts/predictor_completion/prediction.prompt'
        mini_batch_size: 1
        mode: 'annotation'

predictor:
    method: 'llm'
    config:
        llm:
            type: 'HuggingFacePipeline'
            name: 'D:/lla/AutoPrompt-main/GPT2Model' # local GPT-2 model
            model_path: 'D:/lla/AutoPrompt-main/GPT2Model'
            temperature: 0.8 # controls generation diversity
            max_new_tokens: 100 # cap on the number of generated tokens
            model_kwargs: {"seed": 220}
        num_workers: 5
        prompt: 'D:/lla/AutoPrompt-main/prompts/predictor_completion/prediction.prompt'
        mini_batch_size: 1 # change to >1 if you want to include multiple samples in one prompt
        mode: 'prediction'

meta_prompts:
    folder: 'prompts/meta_prompts_classification'
    num_err_prompt: 1 # Number of error examples per sample in the prompt generation
    num_err_samples: 2 # Number of error examples per sample in the sample generation
    history_length: 4 # Number of samples in the meta-prompt history
    num_generated_samples: 10 # Number of generated samples at each iteration
    num_initialize_samples: 10 # Number of generated samples at iteration 0, in the zero-shot case
    samples_generation_batch: 10 # Number of samples generated in one call to the LLM
    num_workers: 5 # Number of parallel workers
    warmup: 4 # Number of warmup steps

eval:
    function_name: 'accuracy'
    num_large_errors: 4
    num_boundary_predictions: 0
    error_threshold: 0.5

llm:
    name: 'D:/lla/AutoPrompt-main/GPT2Model' # local GPT-2 model
    type: 'huggingfacepipeline' # Hugging Face model type
    temperature: 0.8
    model_path: 'D:/lla/AutoPrompt-main/GPT2Model' # local model path
    max_new_tokens: 100 # cap on the number of generated tokens

stop_criteria:
    max_usage: 2 # In $ in case of OpenAI models, otherwise number of tokens
    patience: 10 # Number of patience steps
    min_delta: 0.01 # Delta for the improvement definition
@Eladlev (Owner) commented Dec 6, 2024

The llm part in the configuration is the meta-prompt LLM.
This must be a very strong model, and should be a proprietary model (OpenAI/Anthropic/Google).
You can use a local model for the 'predictor' LLM (depending on your task).
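
For example, the top-level llm section (the meta-prompt LLM, outside annotator/predictor) could be pointed at a hosted model along these lines; this is only a sketch assuming the OpenAI backend used in the repository's default configuration, and the model name is illustrative:

llm:
    type: 'OpenAI'
    name: 'gpt-4-1106-preview' # meta-prompt LLM: should be a strong hosted model
    temperature: 0.8

The predictor (and, if you like, the annotator) section can keep pointing at the local HuggingFacePipeline model.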

@shuyu-rich (Author)

Is this the configuration section I need to modify to change the meta-prompt LLM? Can I download the google/flan-t5-xl model to my local computer and use it? Also, please check the error message above; I suspect it may be a problem with the dataset, but I don't have a dataset. Could you share an open-source dataset for me to test with?

annotator:
    method: 'llm'
    config:
        llm:
            type: 'HuggingFacePipeline'
            name: 'D:/lla/AutoPrompt-main/GPT2Model' # local GPT-2 model
            max_new_tokens: 100 # cap on the number of generated tokens
        instruction:
            'Assess whether the text contains a harmful topic.
            Answer Yes if it does and No otherwise.'
        num_workers: 5
        prompt: 'prompts/predictor_completion/prediction.prompt'
        mini_batch_size: 1
        mode: 'annotation'

@Eladlev (Owner) commented Dec 9, 2024

  1. In order to modify the meta-prompt LLM, you need to change the llm section that is outside the annotator/predictor. Use GPT-2 only in the predictor part (all the other LLMs should be strong models).
  2. Flan-T5 might be good as a predictor, but not as the meta-prompt LLM.
  3. You don't need a dataset. The system supports a zero-shot setting and generates synthetic samples for the dataset. The issue you encounter is that the meta-prompt LLM, which is also responsible for generating the dataset, is not strong enough (therefore it doesn't generate the dataset).

@shuyu-rich (Author)

If I want to run everything locally, where should I modify the meta-prompt LLM?

@Eladlev (Owner) commented Dec 9, 2024

If you want to run everything locally, then your modifications are correct.
However, as I said, unless you have access to GPT-4 and can run it locally, this is simply not going to work. The meta-prompt task is too challenging for basic open-source models, and you will not get a valid response.

@shuyu-rich (Author)

Alright, now I know where the problem lies. Although it cannot run entirely locally, I still really appreciate your response.
