Ludwig's evaluation scaffolding #2
Conversation
…into two different steps
    text1: str
    text2: str
Here you lose the names of the fields. What happens to tasks like boolq, which has a question and a context, or amazon polarity, which has a title and content?
I am not sure it buys us much to have these as separate classes, especially when they are one-off. When you implement the model, you have to remember which is which.
We could do this:
class QuestionClassificationTask(PairClassificationTask):
    @property
    def question(self):
        return self.text1

    @property
    def context(self):
        return self.text2
Currently there is a 1:1 correspondence between the methods on ModelForEvaluation and these task classes. I'm not sure it makes sense to add more methods on ModelForEvaluation for one-off cases, especially if the implementation would likely be the same for many tasks anyway.
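For concreteness, a sketch of what that 1:1 correspondence looks like; the exact signature is an assumption, patterned on the do_summarization stub further down this diff, not the actual code:

# Sketch only: one ModelForEvaluation method per task class.
from abc import ABC
from typing import Iterator, Sequence

class ModelForEvaluation(ABC):
    def do_pair_classification(
        self,
        task: "PairClassificationTask",
        instances: Sequence["PairClassificationTask.Instance"],
    ) -> Iterator["PairClassificationTask.InstanceResult"]:
        # Models that don't support this task category just leave this unimplemented.
        raise NotImplementedError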
No need for a separate class. I was referring to a generic dict as you suggested here #2 (comment)
ludwig/tasks/task.py
Outdated
InstanceT = TypeVar('InstanceT')


class Task(ABC, Registrable, DetHashWithVersion):
might be useful to include HF dataset source
and other metadata
I'm a big fan of having a generic Dict[str, Any] field as metadata, just in case.
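As a sketch, that could look something like this; the attribute names here are assumptions, not the actual change:

# Sketch only: HF dataset source plus a catch-all metadata field on Task.
from typing import Any, Dict, Optional

class Task:
    def __init__(
        self,
        name: str,
        hf_dataset: Optional[str] = None,           # e.g. the HF datasets identifier
        metadata: Optional[Dict[str, Any]] = None,  # anything else we might need later
    ):
        self.name = name
        self.hf_dataset = hf_dataset
        self.metadata = metadata or {}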
3ba0ac6
ludwig/tasks/qa_task.py
Outdated
class QATask(Task, ABC):
    @dataclass
    class Instance(Task.Instance):
        context: str
this is optional
Of course. 3180e31
ludwig/models/model.py
Outdated
class ModelForEvaluation(ABC, Registrable, DetHashWithVersion):
    def do_summarization(
We probably also need a do_conditional_generation task. The user can probably repurpose do_summarization for such tasks, but that would be wrong.
I am not too familiar with how we want to evaluate summarization, but would it be possible to have just a do_conditional_generation function that also does summarization using a summarization prompt?
That would be my preference as well. I would prefer to use a small number of model interfaces so that the model implementer writes as little code as possible. If they want, the model implementer can still customize the implementation to specific tasks using information from the meta-data.
If we take this route, we merge do_summarization, do_conditional_generation, and do_qa, and also merge do_pair_classification, do_classification, and maybe even do_multiple_choice. Otherwise, we keep them separate.
Anyway, this is a small difference. If you decide to keep them separate, the first model we implement will just combine them into two functions.
Classification, pair classification, and multiple choice are all different from one another, because the data formats of the inputs and outputs are different from one another. The formats have to preserve enough fidelity that we can implement models that rely on those differences. While it's true that you can always express one task as another, that's not how many models are implemented, and as long as we want to support those models, we have to keep the structure.
Another consideration is the evaluation. It would be nice if every task category can have only one metrics calculation, i.e., all summarization tasks are evaluated the same way, all MC tasks are evaluated the same way, etc. So we might find that some tasks have the same format, but different evaluations, and therefore we want them to be different.
With all of that in mind, it seems to me that summarization and conditional generation should indeed be the same. They are both str -> str. But QA is Tuple[str, str] -> str, at least as long as you have a context string.
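In sketch form, the distinction being argued for; these signatures are illustrative only, not the ones in this PR:

# Illustrative only: the point is the shape of the inputs, not the exact API.
from typing import Iterator, Sequence, Tuple

class ModelForEvaluation:
    def do_generation(self, inputs: Sequence[str]) -> Iterator[str]:
        # Summarization and conditional generation share this shape: str -> str.
        raise NotImplementedError

    def do_qa(self, inputs: Sequence[Tuple[str, str]]) -> Iterator[str]:
        # QA takes (context, question) pairs: Tuple[str, str] -> str.
        raise NotImplementedError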
ce7f443 renames "summarization" to "generation".
ludwig/models/model.py
Outdated
    ) -> Iterator['SummarizationTask.InstanceResult']:
        raise NotImplementedError

    def do_multiple_choice(
The do_task interface right now takes test instances, which is enough for zero-shot evaluation. What is the plan regarding few-shot evaluation?
Yep, I think when Dirk and I talked we thought this could be added by providing a train set for models that support few-shot evaluation; we would sample k examples from that dataset to be appended to the prompt for evaluation.
The do_task() interface takes a list of instances, but also the Task object itself. We can use the Task object to get few-shot instances.
I agree that this is a little awkward. In the first version I just passed the Task to the do_* methods. I changed it this way because it allows us to recover from evaluations that failed half-way through. We don't want to pack a bunch of recovery logic into the model implementations. This way the recovery logic is outside, and the model just has to run on the instances presented to it.
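A rough sketch of how a few-shot model could use that; the accessor for training instances and the two helpers are hypothetical, not part of this PR:

# Hypothetical sketch: get few-shot examples from the Task, not from the test instances.
import random
from typing import Iterator, Sequence

class FewShotModel:
    def do_qa(self, task, instances: Sequence, num_shots: int = 4) -> Iterator[str]:
        train_instances = list(task.get_instances(split="train"))  # hypothetical accessor
        shots = random.sample(train_instances, num_shots)
        for instance in instances:
            prompt = self.build_prompt(shots, instance)  # hypothetical helper
            yield self.generate(prompt)                  # hypothetical helper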
I am not sure about the pros and cons of the different choices, but one thing to keep in mind is reproducibility. You want to be able to sample the same sequence of training shots every time. You also want to control how the sampling happens: do you have different training shots for every test instance, the same training shots for all test instances, or something in between?
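For example, seeding the sampler makes the shots reproducible, and a per-instance seed versus a single shared seed is exactly the "different shots per test instance vs. the same shots for all" choice. A minimal sketch with stand-in data, not code from this PR:

# Minimal sketch of reproducible shot sampling.
import random
from typing import List, Sequence

def sample_shots(train: Sequence[str], k: int, seed: int) -> List[str]:
    # The same seed always yields the same shots.
    return random.Random(seed).sample(list(train), k)

train = [f"train example {i}" for i in range(100)]  # stand-in training instances
test = [f"test example {i}" for i in range(10)]     # stand-in test instances

# The same shots for every test instance:
shared_shots = sample_shots(train, k=4, seed=42)

# Different, but still reproducible, shots for each test instance:
per_instance_shots = [sample_shots(train, k=4, seed=42 + i) for i in range(len(test))]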
Let's implement a few-shot model sooner rather than later, so we can catch potential issues soon.
ludwig/__init__.py
Outdated
@@ -0,0 +1,54 @@
from typing import Union, Sequence, Optional
please rename the root directory :D
We'll call it the "Beltagy Framework" :-)
I think I'll move the whole thing up one level and replace the stuff that's already there. It was nice for exploration, but we don't need it for this, I think.
ludwig/__init__.py
Outdated
@@ -0,0 +1,54 @@
from typing import Union, Sequence, Optional

from tango import Step
Command line is
python ludwig/main.py --model deepset/roberta-base-squad2 --task squad -d workspace
Does it support evaluating on a list of tasks?
Now it does: 0c2fb5f
ludwig/models/__init__.py
Outdated
"bert-large-uncased": HFAutoModelForEvaluation("bert-large-uncased"), | ||
"bert-large-cased": HFAutoModelForEvaluation("bert-large-cased"), | ||
"roberta-base": HFAutoModelForEvaluation("roberta-base"), | ||
"roberta-large": HFAutoModelForEvaluation("roberta-large"), |
I think these are just for testing, but note that while the deepset model will have a trained QA head when initialized with the AutoModelForQuestionAnswering.from_pretrained(self.model_name) HF function, the other models do not specify weights for that head, so it will be randomly initialized.
I know, these are kind of nonsensical. HF prints a warning when you initialize them.
There are a million "Roberta trained on Squad" models on the HF hub, but they are not official and we can't be sure how exactly they were trained and evaluated. For a select group of high-profile models and tasks, it might make sense for us to train them ourselves, just running the HF code.
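For reference, the difference shows up directly in the HF calls (model names taken from the snippet above); this is just an illustration:

# deepset/roberta-base-squad2 ships trained QA-head weights; for plain roberta-base
# the qa_outputs head is newly (randomly) initialized and HF logs a warning about it.
from transformers import AutoModelForQuestionAnswering

qa_model = AutoModelForQuestionAnswering.from_pretrained("deepset/roberta-base-squad2")
untrained_head = AutoModelForQuestionAnswering.from_pretrained("roberta-base")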
ludwig/models/hf_auto_model.py
Outdated
class HFAutoModelForEvaluation(ModelForEvaluation):
    VERSION = "003"

    def __init__(self, model_name: str):
It seems like model_name = model_path in this case? We may want to decouple these so that we have more flexibility in naming.
I renamed it: a4725c8
ludwig/models/hf_auto_model.py
Outdated
tokenizer = AutoTokenizer.from_pretrained(self.model_name)
instances = Tqdm.tqdm(instances, desc="Processing instances")
with torch.inference_mode():
    for batch in more_itertools.chunked(instances, batch_size):
I am wondering if this structure for preprocessing the dataset would reduce computational efficiency. More specifically, it seems like between batches we are preprocessing and encoding each set of examples, which I imagine adds some downtime between the GPU's processing of each batch. I think it might be best to do the pre/post-processing outside of the evaluation loop so that we can quickly push batches to the GPU and be more efficient?
I totally agree that the model should pre/post-process its own data, but I think we could create a data loader from the preprocessed samples before entering the evaluation loop, and that way make better use of GPU time. Lmk if I reached the wrong conclusion here :)
I think you mentioned having to do the evaluation in two steps for Tango; is this maybe why it's written this way?
It's definitely more efficient to do all the tokenizing up front, at the cost of memory. I kept this implementation simple because I imagine we'll replace it with HF pipelines anyway.
In practice, it doesn't make that much of a difference. Fast tokenizers are fast.
It has nothing to do with Tango. If we thought that tokenization were a really heavy process, so heavy that we want to cache the tokenized dataset, then it would make sense to break this up into two Tango steps. But I don't think that's the case. I can create 1M word pieces in 700ms on my laptop. It'll be hard to find speed gains by caching this.
+1. Also, moving the tokenizer logic outside this function complicates things a lot; it is better to keep it here.
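For reference, a simplified, self-contained version of the per-batch pattern in the diff above; the instance fields and batch size here are assumptions:

# Simplified sketch of tokenizing each chunk right before the forward pass.
from collections import namedtuple

import more_itertools
import torch
from transformers import AutoModelForQuestionAnswering, AutoTokenizer

Instance = namedtuple("Instance", ["question", "context"])
instances = [Instance("Who wrote Hamlet?", "Hamlet is a tragedy by William Shakespeare.")]

tokenizer = AutoTokenizer.from_pretrained("deepset/roberta-base-squad2")
model = AutoModelForQuestionAnswering.from_pretrained("deepset/roberta-base-squad2")

with torch.inference_mode():
    for batch in more_itertools.chunked(instances, 16):
        # Fast tokenizers make this per-batch encoding cheap relative to the forward pass.
        encoded = tokenizer(
            [i.question for i in batch],
            [i.context for i in batch],
            padding=True,
            truncation=True,
            return_tensors="pt",
        )
        outputs = model(**encoded)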
ludwig/tasks/__init__.py
Outdated
dataset="squad", | ||
context_field="context", | ||
question_field="question", | ||
answer_field="answers.text.0", |
I think it is OK if we just take the first answer as ground truth for SQuAD, since most of the datapoints in SQuAD list the same ground-truth answers. However, from working with MRQA, I find that other RC datasets in the SQuAD format sometimes have very different answer possibilities. We would want to compare a model's answer to the best-matching ground-truth answer.
Alright, that sounds like we need to change our definition of the QA task to allow for a set of answer choices.
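As a sketch, the shape that implies; the field and function names here are assumptions:

# Sketch only: multiple acceptable ground-truth answers per instance.
from dataclasses import dataclass
from typing import Set

@dataclass
class QAInstance:
    question: str
    context: str
    answers: Set[str]  # all acceptable ground-truth answers

def exact_match(prediction: str, instance: QAInstance) -> bool:
    # Score against the best-matching ground truth: any acceptable answer counts.
    return any(prediction.strip().lower() == a.strip().lower() for a in instance.answers)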
fc3216d has that change.
ludwig/__init__.py
Outdated
@@ -0,0 +1,54 @@
from typing import Union, Sequence, Optional
Do you want to add the infrastructure for default prompts as well, or keep it in a separate PR?
If all we need is a few extra fields on the task, that's easy enough to do now.
The default prompt is a dataset-specific field, not a task-specific field. It might even be a dataset x model field. The default prompt is a list of functions that take an instance and return a prompted string.
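In type terms, something like this; the names and example prompts are assumptions:

# Sketch only: a default prompt is a list of functions from an instance to a prompted string.
from typing import Any, Callable, List

PromptFunction = Callable[[Any], str]

squad_default_prompts: List[PromptFunction] = [
    lambda instance: f"Question: {instance.question}\nContext: {instance.context}\nAnswer:",
    lambda instance: f"{instance.context}\n\nQ: {instance.question}\nA:",
]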
Then let's do it when we write the first model that takes advantage of this.
ludwig/__init__.py
Outdated
from tango import Step
from tango.format import JsonFormat
from tango.format import SqliteSequenceFormat
Do we want to think about task adaptation methods in this PR? Probably not. It is probably enough to implement them inside the model API.
I want to put TA methods into model implementations, exactly. We should think about them now insofar as we have to change the API to allow for it.
ludwig/__init__.py
Outdated
from tango import Step
from tango.format import JsonFormat
from tango.format import SqliteSequenceFormat
from tango.common.sequences import SqliteSparseSequence
Another thing to do now or keep for a different PR is ingesting the CrossFit datasets.
I'm a little uneasy about letting CrossFit squash away all the structure and then parsing it back out. But the benefit of importing all those datasets with a single class is undeniable.
We have a lot of people on the project. Any interest? I'd be happy to walk you through the process.
ludwig/__init__.py
Outdated
@@ -0,0 +1,54 @@
from typing import Union, Sequence, Optional
One more thing I forgot to ask about. How are we addressing result storage and reporting?
Results are stored in the Tango workspace. We should decide on an NFS directory to use until @epwalsh can give us the cloud workspace. How about /net/nfs.cirrascale/allennlp/ai2-lm-eval?
Reporting is a different story. The current main() method just prints a JSON blob. How do we want to see this? We could write a tool that will read the results from the cache and make us a graph or a table.
This PR adds scaffolding for computing a big matrix of model X evaluated on task Y, for many Xs and Ys.
Getting started
You can try running an evaluation by running:
python ludwig/main.py --model deepset/roberta-base-squad2 --task squad -d workspace
The final -d selects a directory where results (both intermediate and final) are written. Since we're trying to sort out the APIs, I did not yet put in the capability to run on GPUs.
Tasks
Look in ludwig/tasks/ to see the definitions of the tasks. There are five different types of tasks there. Only two of them have an implementation (multiple choice and question answering).
The Task class also defines the types of the instances. Instances are always @dataclass. There are also InstanceResult classes there, which store a reference to the original instance, and also the predictions.
Models
Models live in ludwig/models/. The model base class, ModelForEvaluation, has one method for each of the tasks we want to perform. Many models will only perform one of those tasks and raise a NotImplementedError otherwise, but other models (like GPT-3) should be expected to perform multiple tasks.
There is no example of a few-shot model in here yet. My thinking is, models have access to a Task object when they do their work, so they can find the training instances they need there.
Two Tango Steps
Evaluations are performed in two Tango steps: one step runs the model on the task's instances to produce predictions, and a second step computes metrics from those predictions.
The results of each of these steps are cached.
It's a little annoying to have to break this into two steps like this, but it gives us flexibility down the line when we want to inspect the predictions, or do something else with them.
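For anyone skimming, a hypothetical sketch of how the two steps could be laid out with Tango; the step names, registrations, and the metric hook on Task are assumptions, not code from this PR:

# Hypothetical layout of the two cached steps.
from typing import Any, Dict, Sequence

from tango import Step
from tango.format import JsonFormat, SqliteSequenceFormat

@Step.register("predict")  # assumed registration name
class PredictStep(Step):
    DETERMINISTIC = True
    CACHEABLE = True
    FORMAT = SqliteSequenceFormat()  # predictions stored as a sequence

    def run(self, model, task, instances: Sequence[Any]) -> Sequence[Any]:
        # Run the model on the task's instances and return the InstanceResults.
        return list(model.do_qa(task, instances))

@Step.register("calculate_metrics")  # assumed registration name
class CalculateMetricsStep(Step):
    DETERMINISTIC = True
    CACHEABLE = True
    FORMAT = JsonFormat()  # final metrics as a JSON blob

    def run(self, task, results: Sequence[Any]) -> Dict[str, float]:
        # Compute the task's metrics from the cached predictions.
        return task.evaluate(results)  # hypothetical metric hook on Task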