KeyError: 'samples' #51

Closed
YukiChen-yuxin opened this issue Mar 12, 2024 · 18 comments
Labels
good first issue Good for newcomers


@YukiChen-yuxin

Hi! I tried to run the pipeline using Azure OpenAI with an LLM as the annotator, but got this error.

Processing samples: 100%|##########| 1/1 [00:24<00:00, 24.04s/it]
Traceback (most recent call last):
  File "prompt_model\AutoPrompt\run_pipeline.py", line 44, in <module>
    best_prompt = pipeline.run_pipeline(opt.num_steps)
  File "prompt_model\AutoPrompt\optimization_pipeline.py", line 274, in run_pipel
    stop_criteria = self.step()
  File "prompt_model\AutoPrompt\optimization_pipeline.py", line 233, in step
    self.generate_initial_samples()
  File "prompt_model\AutoPrompt\optimization_pipeline.py", line 194, in generate_tial_samples
    samples_list = [element for sublist in samples_batches for element in sublist['samples']]
  File "prompt_model\AutoPrompt\optimization_pipeline.py", line 194, in <listcomp
    samples_list = [element for sublist in samples_batches for element in sublist['samples']]
KeyError: 'samples'

@YukiChen-yuxin
Author

I got the same error when I tried to use Argilla. Is there any place I need to store my own sample data for this part? Thanks

@Eladlev
Owner

Eladlev commented Mar 12, 2024

Can you provide more details:

  1. Are you running the classification pipeline or the generation pipeline?
  2. Please provide all the config modifications.

Also, please verify that you are not loading dumps (this error might be due to dump issues).

@YukiChen-yuxin
Author

I'm running a classification pipeline. The config file is:

use_wandb: True
dataset:
    name: 'dataset'
    records_path: null
    initial_dataset: ''
    label_schema: ["toxic", "non-toxic"]
    max_samples: 50
    semantic_sampling: True # Change to True in case you don't have M1. Currently there is an issue with faiss and M1

annotator:
    method: 'llm'
    config:
        llm:
            type: 'Azure'
            name: 'gpt-4-32k-0613'
        instruction:
            'Assess whether the text contains a spam topic. 
            Answer toxic if it does and non-toxic otherwise.'
        num_workers: 5
        prompt: 'prompts/predictor_completion/prediction.prompt'
        mini_batch_size: 1
        mode: 'annotation'


predictor:
    method : 'llm'
    config:
        llm:
            type: 'Azure'
            name: 'gpt-4-32k-0613'
#            async_params:
#                retry_interval: 10
#                max_retries: 2
            model_kwargs: {"seed": 220}
        num_workers: 5
        prompt: 'prompts/predictor_completion/prediction.prompt'
        mini_batch_size: 1  #change to >1 if you want to include multiple samples in the one prompt
        mode: 'prediction'

meta_prompts:
    folder: 'prompts/meta_prompts_classification'
    num_err_prompt: 1  # Number of error examples per sample in the prompt generation
    num_err_samples: 2 # Number of error examples per sample in the sample generation
    history_length: 4 # Number of sample in the meta-prompt history
    num_generated_samples: 10 # Number of generated samples at each iteration
    num_initialize_samples: 10 # Number of generated samples at iteration 0, in zero-shot case
    samples_generation_batch: 10 # Number of samples generated in one call to the LLM
    num_workers: 5 #Number of parallel workers
    warmup: 4 # Number of warmup steps

eval:
    function_name: 'accuracy'
    num_large_errors: 4
    num_boundary_predictions : 0
    error_threshold: 0.5

llm:
    type: 'Azure'
    name: 'gpt-4-32k-0613'
    temperature: 0

stop_criteria:
    max_usage: 2 #In $ in case of OpenAI models, otherwise number of tokens
    patience: 10 # Number of patience steps
    min_delta: 0.01 # Delta for the improvement definition


Sorry, where do I check whether I'm loading dumps or not?
Thanks

@YukiChen-yuxin
Author

It seems something goes wrong when the pipeline tries to generate initial samples. I wonder whether we can use our own sample datasets. I changed initial_dataset in the config file to 'dump/validated_test_file_temp.csv', which contains one column with several samples, and then I got this error. Is there any other config I need to modify if I want to use my own sample data?

prompt_model\AutoPrompt\run_pipeline.py:44 in <module>

    41 pipeline = OptimizationPipeline(config_params, task_description, initi...
    42 if (opt.load_path != ''):
    43     pipeline.load_state(opt.load_path)
  > 44 best_prompt = pipeline.run_pipeline(opt.num_steps)
    45 print('\033[92m' + 'Calibrated prompt score:', str(best_prompt['score'...
    46 print('\033[92m' + 'Calibrated prompt:', best_prompt['prompt'] + '\033...
    47

prompt_model\AutoPrompt\optimization_pipeline.py:274 in run_pipeline

   271         # Run the optimization pipeline for num_steps
   272         num_steps_remaining = num_steps - self.batch_id
   273         for i in range(num_steps_remaining):
 > 274             stop_criteria = self.step()
   275             if stop_criteria:
   276                 break
   277         final_result = self.extract_best_prompt()

prompt_model\AutoPrompt\optimization_pipeline.py:242 in step

   239                 step=self.batch_id)
   240
   241         logging.info('Running annotator')
 > 242         records = self.annotator.apply(self.dataset, self.batch_id)
   243         self.dataset.update(records)
   244
   245         self.predictor.cur_instruct = self.cur_prompt

AutoPrompt\estimator\estimator_argilla.py:92 in apply

    89         except:
    90             self.initialize_dataset(dataset.name, dataset.label_schem...
    91             rg_dataset = current_api.datasets.find_by_name(dataset.na...
 >  92         batch_records = dataset[batch_id]
    93         if batch_records.empty:
    94             return []
    95         self.upload_missing_records(dataset.name, batch_id, batch_rec...

AutoPrompt\dataset\base_dataset.py:41 in __getitem__

    38         """
    39         Return the batch idx.
    40         """
 >  41         extract_records = self.records[self.records['batch_id'] == ba...
    42         extract_records = extract_records.reset_index(drop=True)
    43         return extract_records
    44

AutoPrompt\lib\site-packages\pandas\core\frame.py:4090 in __getitem__

   4087         if is_single_key:
   4088             if self.columns.nlevels > 1:
   4089                 return self._getitem_multilevel(key)
 > 4090             indexer = self.columns.get_loc(key)
   4091             if is_integer(indexer):
   4092                 indexer = [indexer]
   4093         else:

AutoPrompt\lib\site-packages\pandas\core\indexes\base.py:3812 in get_loc

   3809                 and any(isinstance(x, slice) for x in casted_key)
   3810             ):
   3811                 raise InvalidIndexError(key)
 > 3812             raise KeyError(key) from err
   3813         except TypeError:
   3814             # If we have a listlike key, _check_indexing_error will ...
   3815             #  InvalidIndexError. Otherwise we fall through and re-r...
KeyError: 'batch_id'

@YukiChen-yuxin
Author

YukiChen-yuxin commented Mar 12, 2024

I think I solved this based on the dataset info you provided in example.md. Just want to check: what is the difference between the prediction column and the annotation column? As I understand it, we need to provide both of them in the input dataset, right? And what are metadata and score for here?

Thanks for your help.

id,text,prediction,annotation,metadata,score,batch_id
0,"The cinematography was mesmerizing, especially during the scene where they finally reveal the mysterious room that captivated the main character.",No,Yes,,,0
1,"The director's bold choice to leave the world's fate unclear until the final frame will spark audience discussions.",No,Yes,,,0

@Eladlev
Owner

Eladlev commented Mar 13, 2024

  • Annotation: The annotator's response. This is the ground truth (GT) provided by the annotator.
  • Prediction: The predictor's response. This is the result of the current prompt.
  • Score: The score of the current sample, calculated by evaluating the score function on the prediction value and the annotation value (in the case of classification it's 1 if they are the same and 0 otherwise).
  • Metadata: The metadata for the Argilla dataset; in practice it contains the batch_id value, and it's not necessary.
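For the classification case the score function is an exact-match check between these two columns. A minimal sketch of that logic (the helper name accuracy_score and the column handling are illustrative assumptions, not the repository's exact implementation):

import pandas as pd

def accuracy_score(dataset: pd.DataFrame) -> pd.DataFrame:
    # Sketch: score is 1 if the prediction equals the ground-truth annotation,
    # 0 otherwise (classification case).
    dataset = dataset.copy()
    dataset['score'] = (dataset['prediction'] == dataset['annotation']).astype(int)
    return dataset

# Example on rows like the csv above: prediction 'No' vs annotation 'Yes' scores 0.
df = pd.DataFrame({'prediction': ['No', 'Yes'], 'annotation': ['Yes', 'Yes']})
print(accuracy_score(df)['score'].mean())  # 0.5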

@YukiChen-yuxin
Author

Oh, got it. So if I want to input my own sample dataset file, can I leave the prediction column completely empty?

@Eladlev
Owner

Eladlev commented Mar 13, 2024

Yes.
Another important thing:

  1. If you completely remove the annotator from the config file, it will use the annotation column in the csv.
  2. Otherwise it will update the annotation column (according to the annotator you choose), so you can also leave this column empty.

@YukiChen-yuxin
Author

Hi, I get an AttributeError if I remove the whole annotator section from the config file:

prompt_model\AutoPrompt\optimization_pipeline.py:57 in __init__

   54 self.cur_prompt = initial_prompt
   55
   56 self.predictor = give_estimator(config.predictor)
 > 57 self.annotator = give_estimator(config.annotator)
   58 self.eval = Eval(config.eval, self.meta_chain.error_analysis,
   59 self.batch_id = 0
   60 self.patient = 0
AttributeError: 'EasyDict' object has no attribute 'annotator'

@Eladlev
Owner

Eladlev commented Mar 13, 2024

Yes, you are right, it should not be removed completely; instead you should set the method to an empty string:

annotator:
   method : ''

@YukiChen-yuxin
Author

Thanks.
I also ran into this error; does it mean my LLM didn't return any new prompt?
prompt_model\AutoPrompt\optimization_pipeline.py:116 in run_step_prompt

   113 prompt_suggestion = self.meta_chain.step_prompt_chain.invoke(
   114 print(f'prompt_suggestion: {prompt_suggestion}')
   115 self.log_and_print(f'Previous prompt score:\n{self.eval.mean_...
 > 116 self.log_and_print(f'Get new prompt:\n{prompt_suggestion["pro...
   117 self.batch_id += 1
   118 if len(self.dataset) < self.config.dataset.max_samples:
   119 batch_input = {"num_samples": self.config.meta_prompts.sa...
KeyError: 'prompt'

@Eladlev
Owner

Eladlev commented Mar 13, 2024

It might be an issue with OpenAI functions (although the model you are using should support functions).
You can try to use the completion pipeline:

meta_prompts:
    folder: 'prompts/meta_prompts_completion'

@YukiChen-yuxin
Author

In this part of evaluator.py, it seems everything in my prediction column gets labeled 'Discarded' and then all my data is deleted, so self.dataset is empty and all the metrics are empty because of this. I'm not sure what I changed, so I re-downloaded the code repo and only changed the dataset file and the two config files, but I still get this. Do you have any idea what is going on here?

def eval_score(self) -> float:
        """
        Calculate the score on each row and return the mean score.
        :return: The mean score
        """
        # filter out the discarded samples
        self.dataset = self.dataset[(self.dataset['prediction'] != 'Discarded') &
                                    (self.dataset['annotation'] != 'Discarded')]
        self.dataset = self.score_func(self.dataset)
        self.mean_score = self.dataset['score'].mean()
        return self.mean_score
prompt_model\AutoPrompt\eval\evaluator.py:126 in add_history

   123         analysis = self.analyzer.invoke(prompt_input)
   124
   125         self.history.append({'prompt': prompt, 'score': self.mean_sco...
 > 126                              'errors': self.errors, 'confusion_matrix...
   127
   128     def extract_errors(self) -> pd.DataFrame:
   129         """
TypeError: 'NoneType' object is not subscriptable

@Eladlev
Owner

Eladlev commented Mar 13, 2024

It seems like a dataset structure issue.
We will soon add to the readme an example of how to use your own annotated dataset and apply only the optimization part to this data.
Meanwhile, I suggest iterating on your exact use case and data in our Discord channel; I think it will be easier and faster.

@danielliu99

It seems like a dataset structure issue. We will soon add to the readme an example of how to use your own annotated dataset and apply only the optimization part to this data. Meanwhile, I suggest iterating on your exact use case and data in our Discord channel; I think it will be easier and faster.

Has the example of adding my own annotated dataset been updated? : )

@danielliu99

Yes. Another important thing:

  1. If you completely remove the annotator from the config file, it will use the annotation column in the csv.
  2. Otherwise it will update the annotation column (according to the annotator you choose), so you can also leave this column empty.

If I have 30 samples with text and annotation, is it possible to use these samples in the first few iterations, while using LLM-generated samples in the following iterations? How should I modify the config files?

@Eladlev
Owner

Eladlev commented Mar 27, 2024

Hi @danielliu99.
I still haven't had time to update the readme file.
But I will describe the exact steps here, and soon we will organize them and add them, with an example, to the readme:

In order to iterate on your own dataset you need to:

1. Transfer your data to the AutoPrompt dataset format:
   "Id","text","prediction","annotation","metadata","score","batch_id"
   where Id is a unique row id, 'text' is the input to the prompt, 'prediction' should be empty, 'annotation' should be the GT (the class label), 'score' should be empty and 'batch_id' should be 0 for all the rows (a pandas sketch of this conversion follows after this list).
2. Put this csv in an otherwise empty folder (the important thing is that history.pickle is not in this folder).
3. Make the following changes in the default_config file: as always, modify the label schema to your label schema. The field
   max_samples: 50
   should be modified to the number of samples in your csv (from 50). Lastly, modify the annotator and change the method to the empty string:
   method: ''
4. In run_pipeline, put the location of the folder with the csv in the --load_path parameter.
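A small pandas sketch of the conversion in step 1 (the input file my_data.csv and its 'text'/'label' column names are assumptions; the output columns follow the format above, and the file is written as dataset.csv, as noted later in this thread):

import pandas as pd

raw = pd.read_csv('my_data.csv')  # assumed input with 'text' and 'label' columns

autoprompt = pd.DataFrame({
    'id': range(len(raw)),        # unique row id
    'text': raw['text'],          # the input to the prompt
    'prediction': '',             # left empty
    'annotation': raw['label'],   # the GT class label
    'metadata': '',               # not needed
    'score': '',                  # left empty
    'batch_id': 0,                # 0 for all the rows
})

# Write it into an otherwise empty folder (no history.pickle next to it).
autoprompt.to_csv('my_dump/dataset.csv', index=False)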

If you want to start with your dataset and then continue with synthetic data:

You need to follow the same steps as above and simply change max_samples to be 30 + the number of synthetic samples you want to add.
Additionally, in this case the annotator method should be either argilla (human) or llm. This means that the model will ask you to re-annotate your samples (we do this for consistency). If you want to skip the annotation of these samples, you need to add, at this location:
https://github.com/Eladlev/AutoPrompt/blob/cdddccf9f2105d8bf8e688818932b18e645f5136/estimator/estimator_argilla.py#L92C1-L93C1
another condition that returns an empty array in case all the samples are already annotated (a rough sketch of this is below).
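For reference, a minimal sketch of what that extra condition could look like, built around the apply() lines shown in the traceback earlier in this thread (the exact check on the annotation column is an assumption, not the repository's code):

# Inside estimator_argilla.py, in apply(), around the linked lines (sketch only):
batch_records = dataset[batch_id]
if batch_records.empty:
    return []
# Assumed extra condition: skip re-annotation when every sample in this
# batch already carries a ground-truth annotation.
if batch_records['annotation'].notna().all() and (batch_records['annotation'] != '').all():
    return []
self.upload_missing_records(dataset.name, batch_id, batch_records)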

@Natasha-Databricks

Natasha-Databricks commented Aug 6, 2024

Hi @Eladlev, thank you for the instructions on how to use AutoPrompt with a ground-truth dataset. I have tested the approach, and here are a few amendments to the instructions you shared above:

Changes in dependencies
Add the following dependencies in environment_dev.yml and/or requirements.txt:

- langchain-community==0.0.8
- langchain-core==0.2.25

Name of the ground truth dataset
The dataset needs to be called dataset.csv, which is what is read in optimization_pipeline.py.

Example dataset.csv
Let me share a working example of the dataset, which differs slightly from the schema you shared above in that the "id" field is lowercase.

"id","text","prediction","annotation","metadata","score","batch_id"
0,"The cinematography was mesmerizing, especially during the scene where they finally reveal the mysterious room that captivated the main character.",,Yes,,,0
1,"The director's bold choice to leave the world's fate unclear until the final frame will spark audience discussions.",,Yes,,,0
