[WIP] DocBERT Colab notebook #58

Status: Open. Wants to merge 4 commits into master.
2 changes: 1 addition & 1 deletion in datasets/bow_processors/reuters_processor.py

@@ -6,7 +6,7 @@
 class ReutersProcessor(BagOfWordsProcessor):
     NAME = 'Reuters'
     NUM_CLASSES = 90
-    VOCAB_SIZE = 36308
+    VOCAB_SIZE = 36311
Member: This is likely due to a mismatch in the scikit-learn version in your environment. In any case, rather than hard-coding the vocabulary size, would it be possible to infer it at run time?

Author: The vocab size is hard-coded for all the datasets. Should I add a method to calculate the vocab size in the BagOfWordsProcessor class that is extended for each dataset?

Member: Yup, that would work.

     IS_MULTILABEL = True

     def get_train_examples(self, data_dir):
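The fix agreed on above, inferring the vocabulary size at run time instead of hard-coding it, could look something like this minimal sketch. The method name `infer_vocab_size` and the use of `CountVectorizer` are assumptions for illustration, not the repo's actual API:

```python
from sklearn.feature_extraction.text import CountVectorizer

class BagOfWordsProcessor:
    """Hypothetical base-class sketch; the real class lives in datasets/bow_processors."""

    def infer_vocab_size(self, texts):
        # Fit a fresh vectorizer on the training texts and report how many
        # distinct tokens it found, instead of hard-coding VOCAB_SIZE.
        vectorizer = CountVectorizer()
        vectorizer.fit(texts)
        return len(vectorizer.vocabulary_)
```

Each dataset processor would then call this on its training split once, which also removes the dependence on a specific scikit-learn version's tokenization.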
3 changes: 1 addition & 2 deletions in models/bert/__main__.py

@@ -136,5 +136,4 @@ def evaluate_split(model, processor, tokenizer, args, split='dev'):
     model = model.to(device)

     evaluate_split(model, processor, tokenizer, args, split='dev')
-    evaluate_split(model, processor, tokenizer, args, split='test')
-
+    evaluate_split(model, processor, tokenizer, args, split='test')
Member: Can you please remove this change? It's affecting an unrelated file. Also, as a convention, we keep newlines at the end of files.

7 changes: 6 additions & 1 deletion in models/lr/__main__.py

@@ -58,7 +58,6 @@ def evaluate_split(model, vectorizer, processor, args, split='dev'):
     args.n_gpu = n_gpu
     args.num_labels = dataset_map[args.dataset].NUM_CLASSES
     args.is_multilabel = dataset_map[args.dataset].IS_MULTILABEL
-    args.vocab_size = min(args.max_vocab_size, dataset_map[args.dataset].VOCAB_SIZE)

     train_examples = None
     processor = dataset_map[args.dataset]()

@@ -71,6 +70,12 @@ def evaluate_split(model, vectorizer, processor, args, split='dev'):
     save_path = os.path.join(args.save_path, dataset_map[args.dataset].NAME)
     os.makedirs(save_path, exist_ok=True)

+    if train_examples:
+        train_features = vectorizer.fit_transform([x.text for x in train_examples])
Member: Hmm, but this means we would fit_transform twice, right? I'm not sure if you ran this on our larger datasets (IMDB or Yelp), but it's a pretty memory- and compute-intensive operation.

Author: Yes, the same fit_transform would also be applied in trainers/bow_trainer.py. It's tricky because you need the vocab size to initialize the LR model, and the BagOfWordsTrainer requires that model.

Would it be OK to pass the train_features to the BagOfWordsTrainer? They're only used by the LR model anyway. It just feels clunky, since the vectorizer is only passed to BagOfWordsTrainer to compute train/eval features.

Not sure what's a better way to do this.

+        dataset_map[args.dataset].VOCAB_SIZE = train_features.shape[1]
+
+    args.vocab_size = min(args.max_vocab_size, dataset_map[args.dataset].VOCAB_SIZE)

     model = LogisticRegression(args)
     model.to(device)
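The author's suggestion above, fitting the vectorizer once and handing the resulting matrix to the trainer, could be sketched as follows. The `Trainer` class is a hypothetical stand-in for `BagOfWordsTrainer`, not the repo's actual implementation:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

class Trainer:
    """Hypothetical stand-in for BagOfWordsTrainer that accepts precomputed features."""

    def __init__(self, vectorizer, train_features):
        # Reuse the matrix produced by the one-and-only fit_transform call
        # instead of refitting on the (potentially large) training corpus.
        self.vectorizer = vectorizer
        self.train_features = train_features

    def eval_features(self, texts):
        # Dev/test splits only need transform, which is cheap by comparison.
        return self.vectorizer.transform(texts)

train_texts = ["good movie", "bad movie", "great plot"]
vectorizer = TfidfVectorizer()
train_features = vectorizer.fit_transform(train_texts)  # fit exactly once

# The vocab size falls out of the fitted matrix, so no hard-coded constant.
vocab_size = train_features.shape[1]
trainer = Trainer(vectorizer, train_features)
```

This avoids the double fit_transform the reviewer flagged: the model gets its vocab size from the fitted matrix, and the trainer never refits the vectorizer.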