Ludwig's evaluation scaffolding #2

Merged 26 commits from the Ludwig branch into main on Mar 3, 2022

Conversation

@dirkgr (Member) commented Feb 21, 2022

This PR adds scaffolding for computing a big matrix of model X evaluated on task Y, for many Xs and Ys.

Getting started

You can try an evaluation by running python ludwig/main.py --model deepset/roberta-base-squad2 --task squad -d workspace. The final -d selects a directory where results (both intermediate and final) are written.

Since we're trying to sort out the APIs, I did not yet put in the capability to run on GPUs.

Tasks

Look in ludwig/tasks/ to see the definitions of the tasks. There are five different types of tasks there. Only two of them have an implementation (multiple choice and question answering).

The Task class also defines the types of the instances. Instances are always @dataclass classes. There are also InstanceResult classes, which store a reference to the original instance along with the predictions.
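
To make that shape concrete, here is a minimal sketch of one task type (field names are illustrative, not the actual definitions in ludwig/tasks/):

from abc import ABC
from dataclasses import dataclass
from typing import Sequence


class MultipleChoiceTask(ABC):
    @dataclass
    class Instance:
        question: str
        choices: Sequence[str]
        correct_choice: int

    @dataclass
    class InstanceResult:
        instance: "MultipleChoiceTask.Instance"  # reference back to the original instance
        predicted_choice: int                    # the model's prediction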

Models

Models live in ludwig/models/. The model base class, ModelForEvaluation, has one method for each of the tasks we want to perform. Many models will only perform one of those tasks and raise a NotImplementedError otherwise, but other models (like GPT-3) should be expected to perform multiple tasks.

There is no example of a few-shot model in here yet. My thinking is that models have access to a Task object when they do their work, so they can find the training instances they need there.
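
A minimal sketch of that base-class pattern (the do_* naming follows the PR, but the signatures here are simplified guesses):

from abc import ABC
from typing import Iterator, Sequence


class ModelForEvaluation(ABC):
    # One method per task type; most models override only one or two of these.
    def do_multiple_choice(self, task, instances: Sequence) -> Iterator:
        raise NotImplementedError

    def do_qa(self, task, instances: Sequence) -> Iterator:
        raise NotImplementedError

    def do_generation(self, task, instances: Sequence) -> Iterator:
        raise NotImplementedError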

Two Tango Steps

Evaluations are performed in two Tango steps:

  1. Create model predictions alongside the instances from a dataset.
  2. Calculate metrics from the predictions.

The results of each of these steps are cached.

It's a little annoying to have to break this into two steps like this, but it gives us flexibility down the line when we want to inspect the predictions, or do something else with them.
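
Roughly, the two steps look something like this (a hedged sketch, not the actual step classes in this PR; get_instances, do_task, and calculate_metrics are assumed names):

from tango import Step
from tango.format import JsonFormat


@Step.register("predict")
class PredictStep(Step):
    # Step 1: run the model over a task's instances; the returned InstanceResults are cached.
    def run(self, model, task, split: str = "validation"):
        instances = task.get_instances(split)        # assumed Task API
        return list(model.do_task(task, instances))  # assumed dispatch to the right do_* method


@Step.register("calculate_metrics")
class CalculateMetricsStep(Step):
    # Step 2: turn the cached predictions into a small JSON blob of metrics.
    FORMAT = JsonFormat()

    def run(self, task, results) -> dict:
        return task.calculate_metrics(results)       # assumed Task API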

@dirkgr self-assigned this Feb 21, 2022
@dirkgr marked this pull request as ready for review February 22, 2022 21:07
Comment on lines 12 to 13
text1: str
text2: str

@ibeltagy (Feb 23, 2022):

Here you lose the names of the fields. What happens to tasks like boolq, which has a question and a context, or amazon polarity, which has a title and content?

@dirkgr (Member, Author):

I am not sure it buys us much to have these as separate classes, especially when they are one-off. When you implement the model you have to remember which is which.

We could do this:

class QuestionClassificationTask(PairClassificationTask):
    @property
    def question(self):
        return self.text1

    @property
    def context(self):
        return self.text2

Currently there is a 1:1 correspondence between the methods on ModelForEvaluation and these task classes. I'm not sure it makes sense to add more methods on ModelForEvaluation for one-off cases, especially if the implementation would likely be the same for many tasks anyways.

@ibeltagy (Feb 23, 2022):

No need for a separate class. I was referring to a generic dict as you suggested here #2 (comment)
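
i.e., roughly this shape (illustrative only, not the actual suggestion in the linked comment):

from dataclasses import dataclass
from typing import Dict


@dataclass
class ClassificationInstance:
    # keep the dataset's own field names instead of text1/text2
    texts: Dict[str, str]  # e.g. {"question": ..., "passage": ...} for boolq
    label: str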

InstanceT = TypeVar('InstanceT')


class Task(ABC, Registrable, DetHashWithVersion):


might be useful to include HF dataset source


and other metadata

@dirkgr (Member, Author):

I'm a big fan of having a generic Dict[str, Any] field as metadata, just in case.
3ba0ac6
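
Roughly this (a sketch; the actual field added in 3ba0ac6 may look different):

from abc import ABC
from typing import Any, Dict, Optional


class Task(ABC):
    def __init__(self, name: str, *, metadata: Optional[Dict[str, Any]] = None):
        self.name = name
        # free-form metadata, e.g. the HF dataset source, just in case
        self.metadata: Dict[str, Any] = metadata or {}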

class QATask(Task, ABC):
    @dataclass
    class Instance(Task.Instance):
        context: str


this is optional

@dirkgr (Member, Author):

Of course. 3180e31



class ModelForEvaluation(ABC, Registrable, DetHashWithVersion):
    def do_summarization(

@ibeltagy (Feb 23, 2022):

We probably also need a do_conditional_generation method. The user can probably repurpose do_summarization for such tasks, but that would be wrong.

Contributor:

I am not too familiar with how we want to evaluate summarization, but would it be possible to have just a do_conditional_generation function which also does summarization using a summarization prompt?


That would be my preference as well. I would prefer to use a small number of model interfaces so that the model implementer writes as little code as possible. If they want, the model implementer can still customize the implementation to specific tasks using information from the meta-data.
If we take this route, we merge do_summarization, do_conditional_generation, do_qa and also merge do_pair_classification, do_classification and maybe even do_multiple_choice. Otherwise, we keep them separate.
Anyway, this is a small difference. If you decide to keep them separate, the first model we implement will just combine them into two functions.

@dirkgr (Member, Author):

Classification, pair classification, and multiple choice are all different from one another, because the data formats of the inputs and outputs are different from one another. The formats have to preserve enough fidelity that we can implement models that rely on those differences. While it's true that you can always express one task as another, that's not how many models are implemented, and as long as we want to support those models, we have to keep the structure.

Another consideration is the evaluation. It would be nice if every task category can have only one metrics calculation, i.e., all summarization tasks are evaluated the same way, all MC tasks are evaluated the same way, etc. So we might find that some tasks have the same format, but different evaluations, and therefore we want them to be different.

With all of that in mind, it seems to me that summarization and conditional generation should indeed be the same. They are both str -> str. But QA is Tuple[str, str] -> str, at least as long as you have a context string.
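
In rough type terms, the distinction being argued for is (a sketch, not the actual signatures):

from typing import Sequence

def do_generation(text: str) -> str: ...                                    # summarization == conditional generation
def do_qa(question: str, context: str) -> str: ...                         # QA carries an extra context string
def do_multiple_choice(question: str, choices: Sequence[str]) -> int: ...  # MC returns an index into the choices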

@dirkgr (Member, Author):

ce7f443 renames "summarization" to "generation".

    ) -> Iterator['SummarizationTask.InstanceResult']:
        raise NotImplementedError

    def do_multiple_choice(


The do_task interface right now takes test instances, which is enough for zero-shot evaluation. What is the plan regarding few-shot evaluation?

Contributor:

Yep, I think when Dirk and I talked we thought this could be added through another train set for models that support few-shot evaluation, and we would sample k examples from that dataset to append to the prompt for evaluation.

@dirkgr (Member, Author):

The do_task() interface takes a list of instances, but also the Task object itself. We can use the Task object to get few-shot instances.

I agree that this is a little awkward. In the first version I just passed the Task to the do_* methods. I changed it this way because it allows us to recover from evaluations that failed half-way through. We don't want to pack a bunch of recovery logic into the model implementations. This way the recovery logic is outside, and the model just has to run on the instances presented to it.


I am not sure about the pros and cons of the different choices, but one thing to keep in mind is reproducibility. You want to be able to sample the same sequence of training shots every time. You also want to control how the sampling happens: do you use different training shots for every test instance, the same training shots for all test instances, or something in between?
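
A sketch of what reproducible shot sampling from the Task could look like (get_train_instances and the sampling policy here are assumptions, not the actual API):

import random
from typing import List, Optional


def sample_shots(task, k: int, seed: int = 0, instance_index: Optional[int] = None) -> List:
    """Deterministically sample k training shots from the task's train split."""
    train = list(task.get_train_instances())  # assumed Task API
    if instance_index is None:
        rng = random.Random(seed)                        # same shots for every test instance
    else:
        rng = random.Random(f"{seed}-{instance_index}")  # different, but still reproducible, shots per test instance
    return rng.sample(train, k)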

@dirkgr (Member, Author):

Let's implement a few-shot model sooner rather than later, so we can catch potential issues soon.

@@ -0,0 +1,54 @@
from typing import Union, Sequence, Optional


please rename the root directory :D

@dirkgr (Member, Author):

We'll call it the "Beltagy Framework" :-)

I think I'll move the whole thing up one level and replace the stuff that's already there. It was nice for exploration, but we don't need it for this I think.

@@ -0,0 +1,54 @@
from typing import Union, Sequence, Optional

from tango import Step


Command line is
python ludwig/main.py --model deepset/roberta-base-squad2 --task squad -d workspace

Does it support evaluating on a list of tasks?

@dirkgr (Member, Author):

Now it does: 0c2fb5f

"bert-large-uncased": HFAutoModelForEvaluation("bert-large-uncased"),
"bert-large-cased": HFAutoModelForEvaluation("bert-large-cased"),
"roberta-base": HFAutoModelForEvaluation("roberta-base"),
"roberta-large": HFAutoModelForEvaluation("roberta-large"),

Contributor:

I think these are just for testing, but note that while the deepset model will have a trained QA head when initialized with the AutoModelForQuestionAnswering.from_pretrained(self.model_name) HF function, the other models do not specify weights for that head, so it will be randomly initialized.

@dirkgr (Member, Author):

I know, these are kind of nonsensical. HF prints a warning when you initialize them.

There are a million "Roberta trained on Squad" models on the HF hub, but they are not official and we can't be sure how exactly they were trained and evaluated. For a select group of high-profile models and tasks, it might make sense for us to train them ourselves, just running the HF code.

class HFAutoModelForEvaluation(ModelForEvaluation):
    VERSION = "003"

    def __init__(self, model_name: str):

Contributor:

It seems like model_name = model_path in this case? We may want to decouple these so that we have more flexibility in naming.

@dirkgr (Member, Author):

I renamed it: a4725c8

tokenizer = AutoTokenizer.from_pretrained(self.model_name)
instances = Tqdm.tqdm(instances, desc="Processing instances")
with torch.inference_mode():
    for batch in more_itertools.chunked(instances, batch_size):

Contributor:

I am wondering if this structure for preprocessing the dataset would reduce computational efficiency. More specifically, it seems like between batches we are preprocessing and encoding each set of examples, which I imagine adds some downtime between the GPU's processing of each batch. I think it might be best to do the pre/post-processing outside of the evaluation loop so that we can quickly push batches to the GPU and be more efficient?
I totally agree that the model should pre/post-process its own data, but I think we could create a data loader from the preprocessed samples before entering the evaluation loop, and that way we make better use of GPU time. Lmk if I reached the wrong conclusion here :)

Contributor:

I think you mentioned having to do the evaluation in two steps for Tango; is that maybe why this is written this way?

@dirkgr (Member, Author):

It's definitely more efficient to do all the tokenizing up front, at the cost of memory. I kept this implementation simple because I imagine we'll replace it with HF pipelines anyways.

In practice, it doesn't make that much of a difference. Fast tokenizers are fast.

It has nothing to do with Tango. If we thought that tokenization is a really heavy process, so heavy that we want to cache the tokenized dataset, then it would make sense to break this up into two Tango steps. But I don't think that's the case. I can create 1M word pieces in 700ms on my laptop. It'll be hard to find speed gains by caching this.
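
For comparison, the up-front approach being discussed would look roughly like this (a sketch with assumed field names, not what the PR does):

from typing import List, Tuple

from torch.utils.data import DataLoader, TensorDataset
from transformers import AutoTokenizer


def build_qa_loader(pairs: List[Tuple[str, str]], model_name: str, batch_size: int) -> DataLoader:
    """Tokenize all (question, context) pairs up front, then serve fixed batches to the GPU."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    enc = tokenizer(
        [q for q, _ in pairs],
        [c for _, c in pairs],
        truncation=True,
        padding=True,
        return_tensors="pt",
    )
    dataset = TensorDataset(enc["input_ids"], enc["attention_mask"])
    return DataLoader(dataset, batch_size=batch_size)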


+1. Also, moving the tokenizer logic outside this function complicates things a lot; it is better to keep it here.

dataset="squad",
context_field="context",
question_field="question",
answer_field="answers.text.0",

@anas-awadalla (Contributor) commented Feb 23, 2022:

I think it is ok if we just take the first answer as ground truth for squad, as most of the datapoints in squad list the same ground-truth answers. However, from working with MRQA, I find that other RC datasets in the squad format sometimes have very different answer possibilities. We would want to compare a model's answer to the best-matching ground-truth answer.

@dirkgr (Member, Author):

Alright, that sounds like we need to change our definition of the QA task to allow for a set of answer choices.
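
The usual SQuAD-style way to score against multiple references is to take the max over all ground-truth answers; a minimal sketch of exact match (the metric code in the repo may differ):

import re
import string
from typing import Iterable


def _normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace (SQuAD-style)."""
    text = text.lower()
    text = "".join(c for c in text if c not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold_answers: Iterable[str]) -> float:
    # compare against the best-matching ground truth, not just the first one
    return max(float(_normalize(prediction) == _normalize(gold)) for gold in gold_answers)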

@dirkgr (Member, Author):

fc3216d has that change.

@@ -0,0 +1,54 @@
from typing import Union, Sequence, Optional


Do you want to add the infrastructure for a default prompt as well, or keep it in a separate PR?

@dirkgr (Member, Author):

If all we need is a few extra fields on the task, that's easy enough to do now.


The default prompt is a dataset-specific field, not a task-specific field. It might even be a dataset x model field. The default prompt is a list of functions that take an instance and return a prompted string.
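
For example (purely illustrative; the instance fields here are assumptions):

from typing import Callable, List

# A "prompt" here is just a function from an instance to the string the model sees.
def qa_prompt_v1(instance) -> str:
    return f"Question: {instance.question}\nContext: {instance.context}\nAnswer:"

def qa_prompt_v2(instance) -> str:
    return f"{instance.context}\n\nQ: {instance.question}\nA:"

default_prompts: List[Callable[..., str]] = [qa_prompt_v1, qa_prompt_v2]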

@dirkgr (Member, Author):

Then let's do it when we write the first model that takes advantage of this.

@dirkgr (Member, Author):

#3


from tango import Step
from tango.format import JsonFormat
from tango.format import SqliteSequenceFormat


Do we want to think about task adaptation methods in this PR? Probably not. It is probably enough to implement them inside the model API.

@dirkgr (Member, Author):

I want to put TA methods into model implementations, exactly. We should think about them now in so far as we have to change the API to allow for it.

@dirkgr (Member, Author):

I added #5 and #6, but I think it's unlikely that we'll forget about this.

from tango import Step
from tango.format import JsonFormat
from tango.format import SqliteSequenceFormat
from tango.common.sequences import SqliteSparseSequence


Another thing to do now or keep for a different PR is ingesting the CrossFit datasets.

@dirkgr (Member, Author):

I'm a little uneasy about letting CrossFit squash away all the structure and then parsing it back out. But the benefit of importing all those datasets with a single class is undeniable.

We have a lot of people on the project. Any interest? I'd be happy to walk you through the process.

@dirkgr (Member, Author):

#7

@@ -0,0 +1,54 @@
from typing import Union, Sequence, Optional

@ibeltagy (Feb 24, 2022):

One more thing I forgot to ask about. How are we addressing result storage and reporting?

@dirkgr (Member, Author):

Results are stored in the Tango workspace. We should decide on an NFS directory to use until @epwalsh can give us the cloud workspace. How about /net/nfs.cirrascale/allennlp/ai2-lm-eval?

@dirkgr (Member, Author):

Reporting is a different story. The current main() method just prints a JSON blob. How do we want to see this? We could write a tool that will read the results from the cache and make us a graph or a table.
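
e.g. something as simple as this, assuming per-run metrics land as JSON blobs somewhere under the workspace (the layout here is a guess, not how Tango actually organizes its cache):

import json
from pathlib import Path


def print_results_table(workspace: Path) -> None:
    """Print one row per metrics.json found under the workspace."""
    for path in sorted(workspace.glob("**/metrics.json")):  # hypothetical layout
        with open(path) as f:
            metrics = json.load(f)
        cells = ", ".join(f"{k}={v:.3f}" for k, v in metrics.items() if isinstance(v, (int, float)))
        print(f"{path.parent.name:40s} {cells}")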

@dirkgr merged commit 7a0e50d into main Mar 3, 2022
@dirkgr deleted the Ludwig branch March 3, 2022 22:42
OyvindTafjord added a commit that referenced this pull request May 19, 2023