[WIP] DocBERT Colab notebook #58
base: master
```diff
@@ -136,5 +136,4 @@ def evaluate_split(model, processor, tokenizer, args, split='dev'):
     model = model.to(device)

     evaluate_split(model, processor, tokenizer, args, split='dev')
-    evaluate_split(model, processor, tokenizer, args, split='test')
-
+    evaluate_split(model, processor, tokenizer, args, split='test')
```
Reviewer: Can you please remove this change? It's affecting an unrelated file. Also, as a convention, we have newlines at the end of files.
```diff
@@ -58,7 +58,6 @@ def evaluate_split(model, vectorizer, processor, args, split='dev'):
     args.n_gpu = n_gpu
     args.num_labels = dataset_map[args.dataset].NUM_CLASSES
     args.is_multilabel = dataset_map[args.dataset].IS_MULTILABEL
-    args.vocab_size = min(args.max_vocab_size, dataset_map[args.dataset].VOCAB_SIZE)

     train_examples = None
     processor = dataset_map[args.dataset]()
```
```diff
@@ -71,6 +70,12 @@ def evaluate_split(model, vectorizer, processor, args, split='dev'):
     save_path = os.path.join(args.save_path, dataset_map[args.dataset].NAME)
     os.makedirs(save_path, exist_ok=True)

+    if train_examples:
+        train_features = vectorizer.fit_transform([x.text for x in train_examples])
+
+        dataset_map[args.dataset].VOCAB_SIZE = train_features.shape[1]
+
+    args.vocab_size = min(args.max_vocab_size, dataset_map[args.dataset].VOCAB_SIZE)

     model = LogisticRegression(args)
     model.to(device)
```

Reviewer: Hmm, but this means that we would

Author: Yes, the same fit transform would also be applied in
Would it be ok to pass the
Not sure what's a better way to do this.
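As context for the vocabulary-size discussion above, here is a minimal sketch of the idea in the diff, using scikit-learn's CountVectorizer as a stand-in for the vectorizer the script builds (the project's actual vectorizer class may differ). It shows the vocabulary size being read off the fitted matrix at run time, with dev/test splits reusing the vectorizer fitted on train rather than being re-fit, which is the standard scikit-learn fit/transform split.

```python
# Minimal sketch (not the project's actual code): infer the vocabulary size at
# run time from the training split instead of hard-coding it per dataset.
from sklearn.feature_extraction.text import CountVectorizer

train_texts = ["the cat sat", "the dog barked", "a cat and a dog"]
dev_texts = ["the cat barked"]

vectorizer = CountVectorizer()

# Fit on the training split only; the number of columns is the vocabulary size.
train_features = vectorizer.fit_transform(train_texts)
vocab_size = train_features.shape[1]  # number of distinct tokens seen in train

# Dev/test must reuse the already-fitted vectorizer (transform, not fit_transform),
# so every split is encoded against the same vocabulary as the training data.
dev_features = vectorizer.transform(dev_texts)
assert dev_features.shape[1] == vocab_size
```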
Reviewer: This should be due to a mismatch in the scikit library version in your environment. In any case, rather than hard-coding the vocabulary size, would it be possible to infer it at run time?
Author: The vocab size is hard-coded for all the datasets. Should I add a method to calculate the vocab size in the BagOfWordsProcessor class that's extended for each dataset?

Reviewer: Yup, that would work.
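To illustrate the approach being agreed on here, below is a rough sketch of what a run-time vocab-size method on BagOfWordsProcessor could look like. The method name compute_vocab_size, the get_train_examples hook, and the use of CountVectorizer are illustrative assumptions, not the project's actual API; only the class name and the per-dataset subclassing pattern come from the discussion above.

```python
# Sketch only: one possible shape for a processor-level vocab-size calculation.
# Method and helper names below are hypothetical, not the project's real API.
from sklearn.feature_extraction.text import CountVectorizer


class BagOfWordsProcessor:
    """Base processor; dataset-specific subclasses provide the training examples."""

    def get_train_examples(self, data_dir):
        # Each dataset's subclass would load its own training split here.
        raise NotImplementedError

    def compute_vocab_size(self, data_dir):
        # Fit a vectorizer on the training split and read the vocabulary size
        # off the fitted matrix, instead of relying on a hard-coded constant.
        train_examples = self.get_train_examples(data_dir)
        vectorizer = CountVectorizer()
        train_features = vectorizer.fit_transform([x.text for x in train_examples])
        return train_features.shape[1]
```

The caller could then set `args.vocab_size = min(args.max_vocab_size, processor.compute_vocab_size(args.data_dir))`, keeping the min() clamp that already appears in the diff.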