Ludwig's evaluation scaffolding #2

Merged 26 commits from the Ludwig branch into main on Mar 3, 2022

Conversation

@dirkgr (Member) commented Feb 21, 2022

This PR adds scaffolding for computing a big matrix of model X evaluated on task Y, for many Xs and Ys.

Getting started

You can try an evaluation by running python ludwig/main.py --model deepset/roberta-base-squad2 --task squad -d workspace. The final -d selects a directory where results (both intermediate and final) are written.

Since we're trying to sort out the APIs, I did not yet put in the capability to run on GPUs.

Tasks

Look in ludwig/tasks/ to see the definitions of the tasks. There are five different types of tasks there. Only two of them have an implementation (multiple choice and question answering).

The Task class also defines the types of the instances. Instances are always @dataclass classes. There are also InstanceResult classes, which store a reference to the original instance along with the predictions.
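
To make that shape concrete, here is a minimal sketch of one task type (field names are illustrative, not the actual definitions in ludwig/tasks/):

from abc import ABC
from dataclasses import dataclass
from typing import Sequence


class MultipleChoiceTask(ABC):
    @dataclass
    class Instance:
        question: str
        choices: Sequence[str]
        correct_choice: int

    @dataclass
    class InstanceResult:
        instance: "MultipleChoiceTask.Instance"  # reference back to the original instance
        predicted_choice: int                    # the model's prediction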

Models

Models live in ludwig/models/. The model base class, ModelForEvaluation, has one method for each of the tasks we want to perform. Many models will only perform one of those tasks and raise a NotImplementedError otherwise, but other models (like GPT-3) should be expected to perform multiple tasks.

There is no example of a few-shot model in here yet. My thinking is that models have access to a Task object when they do their work, so they can find the training instances they need there.
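
A minimal sketch of that base-class pattern (the do_* naming follows the PR, but the signatures here are simplified guesses):

from abc import ABC
from typing import Iterator, Sequence


class ModelForEvaluation(ABC):
    # One method per task type; most models override only one or two of these.
    def do_multiple_choice(self, task, instances: Sequence) -> Iterator:
        raise NotImplementedError

    def do_qa(self, task, instances: Sequence) -> Iterator:
        raise NotImplementedError

    def do_generation(self, task, instances: Sequence) -> Iterator:
        raise NotImplementedError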

Two Tango Steps

Evaluations are performed in two Tango steps:

  1. Create model predictions alongside the instances from a dataset.
  2. Calculate metrics from the predictions.

The results of each of these steps are cached.

It's a little annoying to have to break this into two steps like this, but it gives us flexibility down the line when we want to inspect the predictions, or do something else with them.
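
Roughly, the two steps look something like this (a hedged sketch, not the actual step classes in this PR; get_instances, do_task, and calculate_metrics are assumed names):

from tango import Step
from tango.format import JsonFormat


@Step.register("predict")
class PredictStep(Step):
    # Step 1: run the model over a task's instances; the returned InstanceResults are cached.
    def run(self, model, task, split: str = "validation"):
        instances = task.get_instances(split)        # assumed Task API
        return list(model.do_task(task, instances))  # assumed dispatch to the right do_* method


@Step.register("calculate_metrics")
class CalculateMetricsStep(Step):
    # Step 2: turn the cached predictions into a small JSON blob of metrics.
    FORMAT = JsonFormat()

    def run(self, task, results) -> dict:
        return task.calculate_metrics(results)       # assumed Task API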

@dirkgr self-assigned this Feb 21, 2022
@dirkgr marked this pull request as ready for review February 22, 2022 21:07
Comment on lines 12 to 13
text1: str
text2: str

@ibeltagy (Feb 23, 2022):

Here you lose the names of the fields. What happens to tasks like boolq, which has a question and a context, or amazon polarity, which has a title and content?

@dirkgr (Member, Author):

I am not sure it buys us much to have these as separate classes, especially when they are one-off. When you implement the model you have to remember which is which.

We could do this:

class QuestionClassificationTask(PairClassificationTask):
    @property
    def question(self):
        return self.text1

    @property
    def context(self):
        return self.text2

Currently there is a 1:1 correspondence between the methods on ModelForEvaluation and these task classes. I'm not sure it makes sense to add more methods on ModelForEvaluation for one-off cases, especially if the implementation would likely be the same for many tasks anyways.

@ibeltagy (Feb 23, 2022):

No need for a separate class. I was referring to a generic dict as you suggested here #2 (comment)
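
i.e., roughly this shape (illustrative only, not the actual suggestion in the linked comment):

from dataclasses import dataclass
from typing import Dict


@dataclass
class ClassificationInstance:
    # keep the dataset's own field names instead of text1/text2
    texts: Dict[str, str]  # e.g. {"question": ..., "passage": ...} for boolq
    label: str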

InstanceT = TypeVar('InstanceT')


class Task(ABC, Registrable, DetHashWithVersion):


might be useful to include HF dataset source


and other metadata

@dirkgr (Member, Author):

I'm a big fan of having a generic Dict[str, Any] field as metadata, just in case.
3ba0ac6
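
Roughly this (a sketch; the actual field added in 3ba0ac6 may look different):

from abc import ABC
from typing import Any, Dict, Optional


class Task(ABC):
    def __init__(self, name: str, *, metadata: Optional[Dict[str, Any]] = None):
        self.name = name
        # free-form metadata, e.g. the HF dataset source, just in case
        self.metadata: Dict[str, Any] = metadata or {}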

class QATask(Task, ABC):
    @dataclass
    class Instance(Task.Instance):
        context: str


this is optional

@dirkgr (Member, Author):

Of course. 3180e31



class ModelForEvaluation(ABC, Registrable, DetHashWithVersion):
    def do_summarization(

@ibeltagy (Feb 23, 2022):

We probably also need a do_conditional_generation method. The user can probably repurpose do_summarization for such tasks, but that would be wrong.

Contributor:

I am not too familiar with how we want to evaluate summarization, but would it be possible to have just a do_conditional_generation function which also does summarization using a summarization prompt?


That would be my preference as well. I would prefer to use a small number of model interfaces so that the model implementer writes as little code as possible. If they want, the model implementer can still customize the implementation to specific tasks using information from the meta-data.
If we take this route, we merge do_summarization, do_conditional_generation, do_qa and also merge do_pair_classification, do_classification and maybe even do_multiple_choice. Otherwise, we keep them separate.
Anyway, this is a small difference. If you decide to keep them separate, the first model we implement will just combine them into two functions.

@dirkgr (Member, Author):

Classification, pair classification, and multiple choice are all different from one another, because the data formats of the inputs and outputs are different from one another. The formats have to preserve enough fidelity that we can implement models that rely on those differences. While it's true that you can always express one task as another, that's not how many models are implemented, and as long as we want to support those models, we have to keep the structure.

Another consideration is the evaluation. It would be nice if every task category can have only one metrics calculation, i.e., all summarization tasks are evaluated the same way, all MC tasks are evaluated the same way, etc. So we might find that some tasks have the same format, but different evaluations, and therefore we want them to be different.

With all of that in mind, it seems to me that summarization and conditional generation should indeed be the same. They are both str -> str. But QA is Tuple[str, str] -> str, at least as long as you have a context string.
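
In rough type terms, the distinction being argued for is (a sketch, not the actual signatures):

from typing import Sequence

def do_generation(text: str) -> str: ...                                    # summarization == conditional generation
def do_qa(question: str, context: str) -> str: ...                         # QA carries an extra context string
def do_multiple_choice(question: str, choices: Sequence[str]) -> int: ...  # MC returns an index into the choices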

@dirkgr (Member, Author):

ce7f443 renames "summarization" to "generation".

    ) -> Iterator['SummarizationTask.InstanceResult']:
        raise NotImplementedError

    def do_multiple_choice(


The do_task interface right now takes test instances, which is enough for zero-shot evaluation. What is the plan regarding few-shot evaluation?

Contributor:

Yep, I think when Dirk and I talked we thought this could be added through another train set for models that support few-shot evaluation, and we would sample k examples from that dataset to append to the prompt for evaluation.

@dirkgr (Member, Author):

The do_task() interface takes a list of instances, but also the Task object itself. We can use the Task object to get few-shot instances.

I agree that this is a little awkward. In the first version I just passed the Task to the do_* methods. I changed it this way because it allows us to recover from evaluations that failed half-way through. We don't want to pack a bunch of recovery logic into the model implementations. This way the recovery logic is outside, and the model just has to run on the instances presented to it.


I am not sure about the pros and cons of the different choices, but one thing to keep in mind is reproducibility. You want to be able to sample the same sequence of training shots every time. You also want to control how the sampling happens: do you use different training shots for every test instance, the same training shots for all test instances, or something in between?
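
A sketch of what reproducible shot sampling from the Task could look like (get_train_instances and the sampling policy here are assumptions, not the actual API):

import random
from typing import List, Optional


def sample_shots(task, k: int, seed: int = 0, instance_index: Optional[int] = None) -> List:
    """Deterministically sample k training shots from the task's train split."""
    train = list(task.get_train_instances())  # assumed Task API
    if instance_index is None:
        rng = random.Random(seed)                        # same shots for every test instance
    else:
        rng = random.Random(f"{seed}-{instance_index}")  # different, but still reproducible, shots per test instance
    return rng.sample(train, k)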

@dirkgr (Member, Author):

Let's implement a few-shot model sooner rather than later, so we can catch potential issues soon.

@@ -0,0 +1,54 @@
from typing import Union, Sequence, Optional


please rename the root directory :D

@dirkgr (Member, Author):

We'll call it the "Beltagy Framework" :-)

I think I'll move the whole thing up one level and replace the stuff that's already there. It was nice for exploration, but we don't need it for this I think.

@@ -0,0 +1,54 @@
from typing import Union, Sequence, Optional

from tango import Step


Command line is
python ludwig/main.py --model deepset/roberta-base-squad2 --task squad -d workspace

Does it support evaluating on a list of tasks?

@dirkgr (Member, Author):

Now it does: 0c2fb5f

"bert-large-uncased": HFAutoModelForEvaluation("bert-large-uncased"),
"bert-large-cased": HFAutoModelForEvaluation("bert-large-cased"),
"roberta-base": HFAutoModelForEvaluation("roberta-base"),
"roberta-large": HFAutoModelForEvaluation("roberta-large"),

Contributor:

I think these are just for testing, but note that while the deepset model will have a trained QA head when initialized with the AutoModelForQuestionAnswering.from_pretrained(self.model_name) HF function, the other models do not specify weights for that head, so it will be randomly initialized.

@dirkgr (Member, Author):

I know, these are kind of nonsensical. HF prints a warning when you initialize them.

There are a million "Roberta trained on Squad" models on the HF hub, but they are not official and we can't be sure how exactly they were trained and evaluated. For a select group of high-profile models and tasks, it might make sense for us to train them ourselves, just running the HF code.

class HFAutoModelForEvaluation(ModelForEvaluation):
    VERSION = "003"

    def __init__(self, model_name: str):

Contributor:

It seems like model_name = model_path in this case? We may want to decouple these so that we have more flexibility in naming.

@dirkgr (Member, Author):

I renamed it: a4725c8

tokenizer = AutoTokenizer.from_pretrained(self.model_name)
instances = Tqdm.tqdm(instances, desc="Processing instances")
with torch.inference_mode():
    for batch in more_itertools.chunked(instances, batch_size):

Contributor:

I am wondering if this structure for preprocessing the dataset would reduce computational efficiency. More specifically, it seems like between batches we are preprocessing and encoding each set of examples, which I imagine adds some downtime between the GPU's processing of each batch. I think it might be best to do the pre/post-processing outside of the evaluation loop so that we can quickly push batches to the GPU and be more efficient?
I totally agree that the model should pre/post-process its own data, but I think we could create a data loader from the preprocessed samples before entering the evaluation loop, and that way we make better use of GPU time. Lmk if I reached the wrong conclusion here :)

Contributor:

I think you mentioned having to do the evaluation in two steps for Tango; is that maybe why this is written this way?

@dirkgr (Member, Author):

It's definitely more efficient to do all the tokenizing up front, at the cost of memory. I kept this implementation simple because I imagine we'll replace it with HF pipelines anyways.

In practice, it doesn't make that much of a difference. Fast tokenizers are fast.

It has nothing to do with Tango. If we thought that tokenization is a really heavy process, so heavy that we want to cache the tokenized dataset, then it would make sense to break this up into two Tango steps. But I don't think that's the case. I can create 1M word pieces in 700ms on my laptop. It'll be hard to find speed gains by caching this.
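
For comparison, the up-front approach being discussed would look roughly like this (a sketch with assumed field names, not what the PR does):

from typing import List, Tuple

from torch.utils.data import DataLoader, TensorDataset
from transformers import AutoTokenizer


def build_qa_loader(pairs: List[Tuple[str, str]], model_name: str, batch_size: int) -> DataLoader:
    """Tokenize all (question, context) pairs up front, then serve fixed batches to the GPU."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    enc = tokenizer(
        [q for q, _ in pairs],
        [c for _, c in pairs],
        truncation=True,
        padding=True,
        return_tensors="pt",
    )
    dataset = TensorDataset(enc["input_ids"], enc["attention_mask"])
    return DataLoader(dataset, batch_size=batch_size)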


+1. Also, moving the tokenizer logic outside this function complicates things a lot; it is better to keep it here.

dataset="squad",
context_field="context",
question_field="question",
answer_field="answers.text.0",

@anas-awadalla (Contributor) commented Feb 23, 2022:

I think it is ok if we just take the first answer as ground truth for squad, as most of the datapoints in squad list the same ground-truth answers. However, from working with MRQA, I find that other RC datasets in the squad format sometimes have very different answer possibilities. We would want to compare a model's answer to the best-matching ground-truth answer.

@dirkgr (Member, Author):

Alright, that sounds like we need to change our definition of the QA task to allow for a set of answer choices.
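
The usual SQuAD-style way to score against multiple references is to take the max over all ground-truth answers; a minimal sketch of exact match (the metric code in the repo may differ):

import re
import string
from typing import Iterable


def _normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace (SQuAD-style)."""
    text = text.lower()
    text = "".join(c for c in text if c not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold_answers: Iterable[str]) -> float:
    # compare against the best-matching ground truth, not just the first one
    return max(float(_normalize(prediction) == _normalize(gold)) for gold in gold_answers)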

@dirkgr (Member, Author):

fc3216d has that change.

@@ -0,0 +1,54 @@
from typing import Union, Sequence, Optional


Do you want to add the infrastructure for a default prompt as well, or keep it in a separate PR?

@dirkgr (Member, Author):

If all we need is a few extra fields on the task, that's easy enough to do now.


The default prompt is a dataset-specific field, not a task-specific field. It might even be a dataset x model field. The default prompt is a list of functions that take an instance and return a prompted string.
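
For example (purely illustrative; the instance fields here are assumptions):

from typing import Callable, List

# A "prompt" here is just a function from an instance to the string the model sees.
def qa_prompt_v1(instance) -> str:
    return f"Question: {instance.question}\nContext: {instance.context}\nAnswer:"

def qa_prompt_v2(instance) -> str:
    return f"{instance.context}\n\nQ: {instance.question}\nA:"

default_prompts: List[Callable[..., str]] = [qa_prompt_v1, qa_prompt_v2]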

@dirkgr (Member, Author):

Then let's do it when we write the first model that takes advantage of this.

@dirkgr (Member, Author):

#3


from tango import Step
from tango.format import JsonFormat
from tango.format import SqliteSequenceFormat


Do we want to think about task adaptation methods in this PR? Probably not. It is probably enough to implement them inside the model API.

@dirkgr (Member, Author):

I want to put TA methods into model implementations, exactly. We should think about them now in so far as we have to change the API to allow for it.

@dirkgr (Member, Author):

I added #5 and #6, but I think it's unlikely that we'll forget about this.

from tango import Step
from tango.format import JsonFormat
from tango.format import SqliteSequenceFormat
from tango.common.sequences import SqliteSparseSequence


Another thing to do now or keep for a different PR is ingesting the CrossFit datasets.

@dirkgr (Member, Author):

I'm a little uneasy about letting CrossFit squash away all the structure and then parsing it back out. But the benefit of importing all those datasets with a single class is undeniable.

We have a lot of people on the project. Any interest? I'd be happy to walk you through the process.

@dirkgr (Member, Author):

#7

@@ -0,0 +1,54 @@
from typing import Union, Sequence, Optional

@ibeltagy (Feb 24, 2022):

One more thing I forgot to ask about. How are we addressing result storage and reporting?

@dirkgr (Member, Author):

Results are stored in the Tango workspace. We should decide on an NFS directory to use until @epwalsh can give us the cloud workspace. How about /net/nfs.cirrascale/allennlp/ai2-lm-eval?

@dirkgr (Member, Author):

Reporting is a different story. The current main() method just prints a JSON blob. How do we want to see this? We could write a tool that will read the results from the cache and make us a graph or a table.
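
e.g. something as simple as this, assuming per-run metrics land as JSON blobs somewhere under the workspace (the layout here is a guess, not how Tango actually organizes its cache):

import json
from pathlib import Path


def print_results_table(workspace: Path) -> None:
    """Print one row per metrics.json found under the workspace."""
    for path in sorted(workspace.glob("**/metrics.json")):  # hypothetical layout
        with open(path) as f:
            metrics = json.load(f)
        cells = ", ".join(f"{k}={v:.3f}" for k, v in metrics.items() if isinstance(v, (int, float)))
        print(f"{path.parent.name:40s} {cells}")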

@dirkgr merged commit 7a0e50d into main Mar 3, 2022
@dirkgr deleted the Ludwig branch March 3, 2022 22:42
OyvindTafjord added a commit that referenced this pull request May 19, 2023