KeyError: 'samples' #51

Closed
YukiChen-yuxin opened this issue Mar 12, 2024 · 18 comments
Labels
good first issue Good for newcomers


@YukiChen-yuxin

Hi! I tried to run the pipeline using Azure OpenAI with an LLM as the annotator, but got this error.

Processing samples: 100%|##########| 1/1 [00:24<00:00, 24.04s/it]
Traceback (most recent call last):
  File "prompt_model\AutoPrompt\run_pipeline.py", line 44, in <module>
    best_prompt = pipeline.run_pipeline(opt.num_steps)
  File "prompt_model\AutoPrompt\optimization_pipeline.py", line 274, in run_pipel
    stop_criteria = self.step()
  File "prompt_model\AutoPrompt\optimization_pipeline.py", line 233, in step
    self.generate_initial_samples()
  File "prompt_model\AutoPrompt\optimization_pipeline.py", line 194, in generate_tial_samples
    samples_list = [element for sublist in samples_batches for element in sublist['samples']]
  File "prompt_model\AutoPrompt\optimization_pipeline.py", line 194, in <listcomp
    samples_list = [element for sublist in samples_batches for element in sublist['samples']]
KeyError: 'samples'

@YukiChen-yuxin
Author

I got the same error when I tried to use Argilla. Is there any place I need to store my own sample data for this part? Thanks

@Eladlev
Owner

Eladlev commented Mar 12, 2024

Can you provide more details:

  1. Are you running the classification pipeline or the generation pipeline?
  2. Please provide all the config modifications.

Also, please verify that you are not loading dumps (this error might be due to dump issues).

@YukiChen-yuxin
Author

I'm running a classification pipeline. The config file is:

use_wandb: True
dataset:
    name: 'dataset'
    records_path: null
    initial_dataset: ''
    label_schema: ["toxic", "non-toxic"]
    max_samples: 50
    semantic_sampling: True # Change to True in case you don't have M1. Currently there is an issue with faiss and M1

annotator:
    method: 'llm'
    config:
        llm:
            type: 'Azure'
            name: 'gpt-4-32k-0613'
        instruction:
            'Assess whether the text contains a spam topic. 
            Answer toxic if it does and non-toxic otherwise.'
        num_workers: 5
        prompt: 'prompts/predictor_completion/prediction.prompt'
        mini_batch_size: 1
        mode: 'annotation'


predictor:
    method : 'llm'
    config:
        llm:
            type: 'Azure'
            name: 'gpt-4-32k-0613'
#            async_params:
#                retry_interval: 10
#                max_retries: 2
            model_kwargs: {"seed": 220}
        num_workers: 5
        prompt: 'prompts/predictor_completion/prediction.prompt'
        mini_batch_size: 1  #change to >1 if you want to include multiple samples in the one prompt
        mode: 'prediction'

meta_prompts:
    folder: 'prompts/meta_prompts_classification'
    num_err_prompt: 1  # Number of error examples per sample in the prompt generation
    num_err_samples: 2 # Number of error examples per sample in the sample generation
    history_length: 4 # Number of sample in the meta-prompt history
    num_generated_samples: 10 # Number of generated samples at each iteration
    num_initialize_samples: 10 # Number of generated samples at iteration 0, in zero-shot case
    samples_generation_batch: 10 # Number of samples generated in one call to the LLM
    num_workers: 5 #Number of parallel workers
    warmup: 4 # Number of warmup steps

eval:
    function_name: 'accuracy'
    num_large_errors: 4
    num_boundary_predictions : 0
    error_threshold: 0.5

llm:
    type: 'Azure'
    name: 'gpt-4-32k-0613'
    temperature: 0

stop_criteria:
    max_usage: 2 #In $ in case of OpenAI models, otherwise number of tokens
    patience: 10 # Number of patience steps
    min_delta: 0.01 # Delta for the improvement definition


Sorry, where do I check whether I'm loading dumps or not?
Thanks

@YukiChen-yuxin
Author

It seems something goes wrong when the pipeline tries to generate initial samples. I wonder whether we can use our own sample datasets. I changed initial_dataset in the config file to 'dump/validated_test_file_temp.csv', which contains one column with several samples, and then I got this error. Is there any other config I need to modify if I want to use my own sample data?

prompt_model\AutoPrompt\run_pipeline.py:44 in <module>

    41 pipeline = OptimizationPipeline(config_params, task_description, initi...
    42 if (opt.load_path != ''):
    43     pipeline.load_state(opt.load_path)
  > 44 best_prompt = pipeline.run_pipeline(opt.num_steps)
    45 print('\033[92m' + 'Calibrated prompt score:', str(best_prompt['score'...
    46 print('\033[92m' + 'Calibrated prompt:', best_prompt['prompt'] + '\033...
    47

prompt_model\AutoPrompt\optimization_pipeline.py:274 in run_pipeline

   271         # Run the optimization pipeline for num_steps
   272         num_steps_remaining = num_steps - self.batch_id
   273         for i in range(num_steps_remaining):
 > 274             stop_criteria = self.step()
   275             if stop_criteria:
   276                 break
   277         final_result = self.extract_best_prompt()

prompt_model\AutoPrompt\optimization_pipeline.py:242 in step

   239                 step=self.batch_id)
   240
   241         logging.info('Running annotator')
 > 242         records = self.annotator.apply(self.dataset, self.batch_id)
   243         self.dataset.update(records)
   244
   245         self.predictor.cur_instruct = self.cur_prompt

AutoPrompt\estimator\estimator_argilla.py:92 in apply

    89         except:
    90             self.initialize_dataset(dataset.name, dataset.label_schem...
    91             rg_dataset = current_api.datasets.find_by_name(dataset.na...
 >  92         batch_records = dataset[batch_id]
    93         if batch_records.empty:
    94             return []
    95         self.upload_missing_records(dataset.name, batch_id, batch_rec...

AutoPrompt\dataset\base_dataset.py:41 in __getitem__

    38         """
    39         Return the batch idx.
    40         """
 >  41         extract_records = self.records[self.records['batch_id'] == ba...
    42         extract_records = extract_records.reset_index(drop=True)
    43         return extract_records
    44

AutoPrompt\lib\site-packages\pandas\core\frame.py:4090 in __getitem__

   4087         if is_single_key:
   4088             if self.columns.nlevels > 1:
   4089                 return self._getitem_multilevel(key)
 > 4090             indexer = self.columns.get_loc(key)
   4091             if is_integer(indexer):
   4092                 indexer = [indexer]
   4093         else:

AutoPrompt\lib\site-packages\pandas\core\indexes\base.py:3812 in get_loc

   3809                 and any(isinstance(x, slice) for x in casted_key)
   3810             ):
   3811                 raise InvalidIndexError(key)
 > 3812             raise KeyError(key) from err
   3813         except TypeError:
   3814             # If we have a listlike key, _check_indexing_error will ...
   3815             #  InvalidIndexError. Otherwise we fall through and re-r...
KeyError: 'batch_id'

@YukiChen-yuxin
Author

YukiChen-yuxin commented Mar 12, 2024

I think I solved this based on the dataset info you provided in example.md. Just want to check: what is the difference between the prediction column and the annotation column? As I understand it, we need to provide both of them in the input dataset, right? And what are metadata and score for here?

Thanks for your help.

id,text,prediction,annotation,metadata,score,batch_id
0,"The cinematography was mesmerizing, especially during the scene where they finally reveal the mysterious room that captivated the main character.",No,Yes,,,0
1,"The director's bold choice to leave the world's fate unclear until the final frame will spark audience discussions.",No,Yes,,,0

@Eladlev
Owner

Eladlev commented Mar 13, 2024

  • Annotation: The annotator's response. This is the ground truth (GT) provided by the annotator.
  • Prediction: The predictor's response. This is the result of the current prompt.
  • Score: The score of the current sample, calculated by evaluating the score function on the prediction value and the annotation value (in the case of classification it's 1 if they are the same and 0 otherwise).
  • Metadata: The metadata for the Argilla dataset; in practice it contains the batch_id value, and it's not necessary.
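For the classification case the score function is an exact-match check between these two columns. A minimal sketch of that logic (the helper name accuracy_score and the column handling are illustrative assumptions, not the repository's exact implementation):

import pandas as pd

def accuracy_score(dataset: pd.DataFrame) -> pd.DataFrame:
    # Sketch: score is 1 if the prediction equals the ground-truth annotation,
    # 0 otherwise (classification case).
    dataset = dataset.copy()
    dataset['score'] = (dataset['prediction'] == dataset['annotation']).astype(int)
    return dataset

# Example on rows like the csv above: prediction 'No' vs annotation 'Yes' scores 0.
df = pd.DataFrame({'prediction': ['No', 'Yes'], 'annotation': ['Yes', 'Yes']})
print(accuracy_score(df)['score'].mean())  # 0.5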

@YukiChen-yuxin
Author

Oh, got it. So if I want to input my own sample dataset file, can I leave the prediction column completely empty?

@Eladlev
Owner

Eladlev commented Mar 13, 2024

Yes.
Another important thing:

  1. If you completely remove the annotator from the config file, it will use the annotation column in the csv.
  2. Otherwise it will update the annotation column (according to the annotator you choose), so you can also leave this column empty.

@YukiChen-yuxin
Author

Hi, I get an AttributeError if I remove the whole annotator section from the config file:

prompt_model\AutoPrompt\optimization_pipeline.py:57 in __init__

   54 self.cur_prompt = initial_prompt
   55
   56 self.predictor = give_estimator(config.predictor)
 > 57 self.annotator = give_estimator(config.annotator)
   58 self.eval = Eval(config.eval, self.meta_chain.error_analysis,
   59 self.batch_id = 0
   60 self.patient = 0
AttributeError: 'EasyDict' object has no attribute 'annotator'

@Eladlev
Owner

Eladlev commented Mar 13, 2024

Yes, you are right, it should not be removed completely; instead you should set the method to an empty string:

annotator:
   method : ''

@YukiChen-yuxin
Author

Thanks.
I also ran into this error; does it mean my LLM didn't return any new prompt?
prompt_model\AutoPrompt\optimization_pipeline.py:116 in run_step_prompt

   113 prompt_suggestion = self.meta_chain.step_prompt_chain.invoke(
   114 print(f'prompt_suggestion: {prompt_suggestion}')
   115 self.log_and_print(f'Previous prompt score:\n{self.eval.mean_...
 > 116 self.log_and_print(f'Get new prompt:\n{prompt_suggestion["pro...
   117 self.batch_id += 1
   118 if len(self.dataset) < self.config.dataset.max_samples:
   119 batch_input = {"num_samples": self.config.meta_prompts.sa...
KeyError: 'prompt'

@Eladlev
Owner

Eladlev commented Mar 13, 2024

It might be an issue with OpenAI functions (although the model you are using should support functions).
You can try to use the completion pipeline:

meta_prompts:
    folder: 'prompts/meta_prompts_completion'

@YukiChen-yuxin
Author

In this part of evaluator.py, it seems everything in my prediction column gets labeled 'Discarded' and then all my data is deleted, so self.dataset is empty and all the metrics are empty because of this. I'm not sure what I changed, so I re-downloaded the code repo and only changed the dataset file and the two config files, but I still get this. Do you have any idea what is going on here?

def eval_score(self) -> float:
        """
        Calculate the score on each row and return the mean score.
        :return: The mean score
        """
        # filter out the discarded samples
        self.dataset = self.dataset[(self.dataset['prediction'] != 'Discarded') &
                                    (self.dataset['annotation'] != 'Discarded')]
        self.dataset = self.score_func(self.dataset)
        self.mean_score = self.dataset['score'].mean()
        return self.mean_score
prompt_model\AutoPrompt\eval\evaluator.py:126 in add_history

   123         analysis = self.analyzer.invoke(prompt_input)
   124
   125         self.history.append({'prompt': prompt, 'score': self.mean_sco...
 > 126                              'errors': self.errors, 'confusion_matrix...
   127
   128     def extract_errors(self) -> pd.DataFrame:
   129         """
TypeError: 'NoneType' object is not subscriptable

@Eladlev
Owner

Eladlev commented Mar 13, 2024

It seems like a dataset structure issue.
We will soon add to the readme an example of how to use your own annotated dataset and apply only the optimization part to this data.
Meanwhile, I suggest iterating on your exact use case and data in our Discord channel; I think it will be easier and faster.

@danielliu99

It seems like a dataset structure issue. We will soon add to the readme an example of how to use your own annotated dataset and apply only the optimization part to this data. Meanwhile, I suggest iterating on your exact use case and data in our Discord channel; I think it will be easier and faster.

Has the example of adding my own annotated dataset been updated? : )

@danielliu99

Yes. Another important thing:

  1. If you completely remove the annotator from the config file, it will use the annotation column in the csv.
  2. Otherwise it will update the annotation column (according to the annotator you choose), so you can also leave this column empty.

If I have 30 samples with text and annotation, is it possible to use these samples in the first few iterations, while using LLM-generated samples in the following iterations? How should I modify the config files?

@Eladlev
Owner

Eladlev commented Mar 27, 2024

Hi @danielliu99.
I still haven't had time to update the readme file.
But I will describe the exact steps here, and soon we will organize them and add them, with an example, to the readme:

In order to iterate on your own dataset you need to:

1. Transfer your data to the AutoPrompt dataset format:
   "Id","text","prediction","annotation","metadata","score","batch_id"
   where Id is a unique row id, 'text' is the input to the prompt, 'prediction' should be empty, 'annotation' should be the GT (the class label), 'score' should be empty and 'batch_id' should be 0 for all the rows (a pandas sketch of this conversion follows after this list).
2. Put this csv in an otherwise empty folder (the important thing is that history.pickle is not in this folder).
3. Make the following changes in the default_config file: as always, modify the label schema to your label schema. The field
   max_samples: 50
   should be modified to the number of samples in your csv (from 50). Lastly, modify the annotator and change the method to the empty string:
   method: ''
4. In run_pipeline, put the location of the folder with the csv in the --load_path parameter.
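A small pandas sketch of the conversion in step 1 (the input file my_data.csv and its 'text'/'label' column names are assumptions; the output columns follow the format above, and the file is written as dataset.csv, as noted later in this thread):

import pandas as pd

raw = pd.read_csv('my_data.csv')  # assumed input with 'text' and 'label' columns

autoprompt = pd.DataFrame({
    'id': range(len(raw)),        # unique row id
    'text': raw['text'],          # the input to the prompt
    'prediction': '',             # left empty
    'annotation': raw['label'],   # the GT class label
    'metadata': '',               # not needed
    'score': '',                  # left empty
    'batch_id': 0,                # 0 for all the rows
})

# Write it into an otherwise empty folder (no history.pickle next to it).
autoprompt.to_csv('my_dump/dataset.csv', index=False)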

If you want to start with your dataset and then continue with synthetic data:

You need to follow the same steps as above and simply change max_samples to be 30 + the number of synthetic samples you want to add.
Additionally, in this case the annotator method should be either argilla (human) or llm. This means that the model will ask you to re-annotate your samples (we do this for consistency). If you want to skip the annotation of these samples, you need to add, at this location:
https://github.com/Eladlev/AutoPrompt/blob/cdddccf9f2105d8bf8e688818932b18e645f5136/estimator/estimator_argilla.py#L92C1-L93C1
another condition that returns an empty array in case all the samples are already annotated (a rough sketch of this is below).
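For reference, a minimal sketch of what that extra condition could look like, built around the apply() lines shown in the traceback earlier in this thread (the exact check on the annotation column is an assumption, not the repository's code):

# Inside estimator_argilla.py, in apply(), around the linked lines (sketch only):
batch_records = dataset[batch_id]
if batch_records.empty:
    return []
# Assumed extra condition: skip re-annotation when every sample in this
# batch already carries a ground-truth annotation.
if batch_records['annotation'].notna().all() and (batch_records['annotation'] != '').all():
    return []
self.upload_missing_records(dataset.name, batch_id, batch_records)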

@Natasha-Databricks

Natasha-Databricks commented Aug 6, 2024

Hi @Eladlev, thank you for the instructions on how to use AutoPrompt with a ground-truth dataset. I have tested the approach, and here are a few amendments to the instructions you shared above:

Changes in dependencies
Add the following dependencies in environment_dev.yml and/or requirements.txt:

- langchain-community==0.0.8
- langchain-core==0.2.25

Name of the ground truth dataset
The dataset needs to be called dataset.csv, which is what is read in optimization_pipeline.py.

Example dataset.csv
Let me share a working example of the dataset, which differs slightly from the schema you shared above in that the "id" field is lowercase.

"id","text","prediction","annotation","metadata","score","batch_id"
0,"The cinematography was mesmerizing, especially during the scene where they finally reveal the mysterious room that captivated the main character.",,Yes,,,0
1,"The director's bold choice to leave the world's fate unclear until the final frame will spark audience discussions.",,Yes,,,0
