Smart classnum #29

Closed · wants to merge 49 commits

Commits
ff04385
Update setup.py package list
j-cahill Jul 20, 2019
a4e8311
Update README.md
j-cahill Jul 20, 2019
b4778e9
modified training code to create learning curve figures
j-cahill Jul 21, 2019
5531f0d
bug fix - learning curves
j-cahill Jul 22, 2019
6402478
Merge pull request #1 from j-cahill/learning_curves
j-cahill Jul 22, 2019
2c2beff
Update requirements.txt
j-cahill Jul 22, 2019
3ef89c3
fixed cuda bug for HAN
j-cahill Jul 23, 2019
c2b1fcf
attempt to make CUDA usage for HAN resemble BERT code
j-cahill Jul 23, 2019
454c5f3
attempt to make CUDA usage for HAN resemble BERT code -2
j-cahill Jul 23, 2019
9825e79
CUDA fix for LSTM
j-cahill Jul 23, 2019
43d0b98
Merge pull request #2 from j-cahill/cuda_fix
j-cahill Jul 23, 2019
4ee01c7
Add files via upload
naotominakawa Jul 23, 2019
15c6f65
Merge pull request #3 from j-cahill/naotominakawa-patch-1
j-cahill Jul 24, 2019
cf8e101
modified args and dataset map to include lyrics arguments
j-cahill Jul 24, 2019
b55cdc9
Changed class number to correct number, 10
j-cahill Jul 24, 2019
e402455
Merge pull request #4 from j-cahill/classnum_bug
j-cahill Jul 24, 2019
e02c98f
Update lyrics_processor.py
naotominakawa Jul 25, 2019
edfc7cd
Update lyrics_processor.py
naotominakawa Jul 25, 2019
e13fe20
Update __main__.py
j-cahill Jul 31, 2019
d0ac4b0
fixed cuda loading for all models
j-cahill Aug 1, 2019
c188fb5
Fix for weight_drop error
j-cahill Aug 1, 2019
2b5b609
added local-rank arg
j-cahill Aug 1, 2019
ccd8d8a
fixed local rank to only be in models/args
j-cahill Aug 1, 2019
11e6d00
attempt at char_cnn file not found fix
j-cahill Aug 1, 2019
2f34d5c
char-cnn fix
j-cahill Aug 2, 2019
eb51e2b
char_cnn fix
j-cahill Aug 2, 2019
42c596b
char_cnn fix
j-cahill Aug 2, 2019
b8215c7
char_cnn fix
j-cahill Aug 2, 2019
aac329a
LSTM fix
j-cahill Aug 2, 2019
ad2b8c9
LSTM fix
j-cahill Aug 2, 2019
9dba709
LSTM fix
j-cahill Aug 2, 2019
fd8276f
added Lyrics arg
j-cahill Aug 2, 2019
51cd6a2
added lyrics dataset to all __main__ files
j-cahill Aug 2, 2019
ed7882a
added lyrics to evaluators
j-cahill Aug 2, 2019
8221788
added lyrics to train
j-cahill Aug 2, 2019
dba4907
Merge pull request #5 from j-cahill/cuda_fix
j-cahill Aug 3, 2019
587f705
Update args.py
j-cahill Aug 3, 2019
d41615f
fix for num_classes for bert
j-cahill Aug 3, 2019
bdaec31
removed local-rank argument from bert.args
j-cahill Aug 3, 2019
ef4d618
monitoring class number and multilabel
j-cahill Aug 3, 2019
8ee8bb2
test
j-cahill Aug 3, 2019
b3837c6
fixed sst_processor code
j-cahill Aug 3, 2019
3b53a5b
multilabel true for testing
j-cahill Aug 3, 2019
46f60d8
fixed evaluation metrics for 2 class problem
j-cahill Aug 3, 2019
37e7ad2
changed pos_label to 1
j-cahill Aug 3, 2019
1ce9b95
removed testing print statements
j-cahill Aug 3, 2019
52f7bf2
Merge branch 'master' into num_classes
j-cahill Aug 3, 2019
4a0faff
Merge pull request #6 from j-cahill/num_classes
j-cahill Aug 3, 2019
8883909
infer class number from actual dataset
j-cahill Aug 4, 2019
4 changes: 4 additions & 0 deletions README.md
@@ -4,6 +4,10 @@
 
 This repo contains PyTorch deep learning models for document classification, implemented by the Data Systems Group at the University of Waterloo.
 
+# Modifications from Original at castorini/hedwig
+- added 'models/' in setup.py
+- add boto3 in requirements.txt
+
 ## Models
 
 + [DocBERT](models/bert/) : DocBERT: BERT for Document Classification [(Adhikari et al., 2019)](https://arxiv.org/abs/1904.08398v1)
1 change: 1 addition & 0 deletions common/evaluate.py
@@ -11,6 +11,7 @@ class EvaluatorFactory(object):
         'AAPD': ClassificationEvaluator,
         'IMDB': ClassificationEvaluator,
         'Yelp2014': ClassificationEvaluator,
+        'Lyrics': ClassificationEvaluator,
         'Robust04': RelevanceTransferEvaluator,
         'Robust05': RelevanceTransferEvaluator,
         'Robust45': RelevanceTransferEvaluator
13 changes: 10 additions & 3 deletions common/evaluators/bert_evaluator.py
@@ -86,11 +86,18 @@ def get_scores(self, silent=False):
             nb_eval_examples += input_ids.size(0)
             nb_eval_steps += 1
 
+        if self.args.is_multilabel:
+            score_method = 'micro'
+            pos_label = None
+        else:
+            score_method = 'binary'
+            pos_label = 1
+
         predicted_labels, target_labels = np.array(predicted_labels), np.array(target_labels)
         accuracy = metrics.accuracy_score(target_labels, predicted_labels)
-        precision = metrics.precision_score(target_labels, predicted_labels, average='micro')
-        recall = metrics.recall_score(target_labels, predicted_labels, average='micro')
-        f1 = metrics.f1_score(target_labels, predicted_labels, average='micro')
+        precision = metrics.precision_score(target_labels, predicted_labels, average=score_method, pos_label=pos_label)
+        recall = metrics.recall_score(target_labels, predicted_labels, average=score_method, pos_label=pos_label)
+        f1 = metrics.f1_score(target_labels, predicted_labels, average=score_method, pos_label=pos_label)
         avg_loss = total_loss / nb_eval_steps
 
         return [accuracy, precision, recall, f1, avg_loss], ['accuracy', 'precision', 'recall', 'f1', 'avg_loss']
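
Why the averaging mode matters, in a standalone sketch (the toy arrays below are invented for illustration, not taken from the PR): for single-label binary data, micro-averaged precision/recall/F1 all collapse to plain accuracy, so the old code could never surface a positive-class F1 that differs from accuracy. The 'binary' mode scores the positive class alone, which is what the patched branch selects when is_multilabel is false.

import numpy as np
from sklearn import metrics

target = np.array([1, 1, 0, 0, 0, 0])     # made-up gold labels
predicted = np.array([1, 0, 0, 0, 0, 0])  # made-up predictions

# Micro-averaging over both classes reduces to accuracy on binary data
assert np.isclose(metrics.f1_score(target, predicted, average='micro'),
                  metrics.accuracy_score(target, predicted))  # both ~0.833

# Binary scoring looks only at the positive class, as the patched code does
print(metrics.f1_score(target, predicted, average='binary', pos_label=1))  # ~0.667
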
1 change: 1 addition & 0 deletions common/train.py
@@ -11,6 +11,7 @@ class TrainerFactory(object):
         'AAPD': ClassificationTrainer,
         'IMDB': ClassificationTrainer,
         'Yelp2014': ClassificationTrainer,
+        'Lyrics': ClassificationTrainer,
         'Robust04': RelevanceTransferTrainer,
         'Robust05': RelevanceTransferTrainer,
         'Robust45': RelevanceTransferTrainer,
24 changes: 24 additions & 0 deletions common/trainers/bert_trainer.py
@@ -1,6 +1,11 @@
 # noinspection PyPackageRequirements
 import datetime
 import os
 
+import numpy as np
+import pandas as pd
+import matplotlib.pyplot as plt
+
 import torch
 import torch.nn.functional as F
 from torch.utils.data import DataLoader, RandomSampler, TensorDataset
@@ -108,6 +113,8 @@ def train(self):
 
         train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=self.args.batch_size)
 
+        # results for graphing learning curves
+        results = []
         for epoch in trange(int(self.args.epochs), desc="Epoch"):
             self.train_epoch(train_dataloader)
             dev_evaluator = BertEvaluator(self.model, self.processor, self.args, split='dev')
@@ -118,6 +125,8 @@ def train(self):
             tqdm.write(self.log_template.format(epoch + 1, self.iterations, epoch + 1, self.args.epochs,
                                                 dev_acc, dev_precision, dev_recall, dev_f1, dev_loss))
 
+            results.append([epoch + 1, dev_acc, dev_precision, dev_recall, dev_f1, dev_loss])
+
             # Update validation results
             if dev_f1 > self.best_dev_f1:
                 self.unimproved_iters = 0
@@ -130,3 +139,18 @@ def train(self):
                     self.early_stop = True
                     tqdm.write("Early Stopping. Epoch: {}, Best Dev F1: {}".format(epoch, self.best_dev_f1))
                     break
+
+        # create learning curves
+        results_frame = pd.DataFrame(data=np.array(results),
+                                     columns=['Epoch', 'Accuracy', 'Precision', 'Recall', 'F1', 'Loss']) \
+            .set_index('Epoch')
+
+        ax_acc = results_frame[['Accuracy', 'Precision', 'Recall', 'F1']].plot()
+        ax_loss = results_frame[['Loss']].plot()
+
+        ax_acc.get_figure().savefig('accuracy_curves.png')
+        ax_loss.get_figure().savefig('loss_curves.png')
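
One portability note, as an aside rather than part of the diff: the pandas .plot() calls above go through matplotlib, which on a headless training server can fail when it tries to select an interactive backend. A minimal guard, assuming no display is available, is to force the non-interactive Agg backend before pyplot is first imported:

import matplotlib
matplotlib.use('Agg')  # render to files only; safe without a display
import matplotlib.pyplot as plt
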
1 change: 1 addition & 0 deletions common/trainers/classification_trainer.py
@@ -103,6 +103,7 @@ def train(self, epochs):
                 torch.save(self.model, self.snapshot_path)
             else:
                 self.iters_not_improved += 1
+                torch.save(self.model, self.snapshot_path)
                 if self.iters_not_improved >= self.patience:
                     self.early_stop = True
                     print("Early Stopping. Epoch: {}, Best Dev F1: {}".format(epoch, self.best_dev_f1))
41 changes: 41 additions & 0 deletions datasets/bert_processors/lyrics_processor.py
@@ -0,0 +1,41 @@
import os

from datasets.bert_processors.abstract_processor import BertProcessor, InputExample


class LyricsProcessor(BertProcessor):
    def __init__(self):
        self.NAME = 'Lyrics'

    def set_num_classes_(self, data_dir):
        with open(os.path.join(data_dir, 'Lyrics', 'train.tsv'), 'r') as f:
            l1 = f.readline().split('\t')

        # from one-hot class vector
        self.NUM_CLASSES = len(l1[0])
        self.IS_MULTILABEL = self.NUM_CLASSES > 2

    def get_train_examples(self, data_dir):
        return self._create_examples(
            self._read_tsv(os.path.join(data_dir, 'Lyrics', 'train.tsv')), 'train')

    def get_dev_examples(self, data_dir):
        return self._create_examples(
            self._read_tsv(os.path.join(data_dir, 'Lyrics', 'dev.tsv')), 'dev')

    def get_test_examples(self, data_dir):
        return self._create_examples(
            self._read_tsv(os.path.join(data_dir, 'Lyrics', 'test.tsv')), 'test')

    def _create_examples(self, lines, set_type):
        """Creates examples for the training and dev sets."""
        examples = []
        for (i, line) in enumerate(lines):
            if i == 0:
                continue
            guid = "%s-%s" % (set_type, i)
            text_a = line[1]
            label = line[0]
            examples.append(
                InputExample(guid=guid, text_a=text_a, text_b=None, label=label))
        return examples
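
What set_num_classes_ relies on, shown on a made-up first row of train.tsv: the label column is a one-hot (or multi-hot) string, so its character count is the number of classes. The convention is assumed, not checked - a dataset whose labels are plain digits ('0'/'1') rather than one-hot strings ('01'/'10') would yield NUM_CLASSES = 1.

# Hypothetical first line of Lyrics/train.tsv (tab-separated: label, then text)
line = "0001000000\tsome lyrics go here"

label = line.split('\t')[0]
NUM_CLASSES = len(label)          # 10 - one character per class
IS_MULTILABEL = NUM_CLASSES > 2   # True for the 10-genre setup
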
15 changes: 11 additions & 4 deletions datasets/bert_processors/reuters_processor.py
@@ -4,10 +4,17 @@
 
 
 class ReutersProcessor(BertProcessor):
-    NAME = 'Reuters'
-    NUM_CLASSES = 90
-    IS_MULTILABEL = True
+    def __init__(self):
+        self.NAME = 'Reuters'
+
+    def set_num_classes_(self, data_dir):
+        with open(os.path.join(data_dir, 'Reuters', 'train.tsv'), 'r') as f:
+            l1 = f.readline().split('\t')
+
+        # from one-hot class vector
+        self.NUM_CLASSES = len(l1[0])
+        self.IS_MULTILABEL = self.NUM_CLASSES > 2
 
     def get_train_examples(self, data_dir):
         return self._create_examples(
             self._read_tsv(os.path.join(data_dir, 'Reuters', 'train.tsv')), 'train')
13 changes: 10 additions & 3 deletions datasets/bert_processors/sst_processor.py
@@ -4,9 +4,16 @@
 
 
 class SST2Processor(BertProcessor):
-    NAME = 'SST-2'
-    NUM_CLASSES = 2
-    IS_MULTILABEL = False
+    def __init__(self):
+        self.NAME = 'SST-2'
+
+    def set_num_classes_(self, data_dir):
+        with open(os.path.join(data_dir, 'SST-2', 'train.tsv'), 'r') as f:
+            l1 = f.readline().split('\t')
+
+        # from one-hot class vector
+        self.NUM_CLASSES = len(l1[0])
+        self.IS_MULTILABEL = self.NUM_CLASSES > 2
 
     def get_train_examples(self, data_dir):
         return self._create_examples(
105 changes: 105 additions & 0 deletions datasets/lyrics.py
@@ -0,0 +1,105 @@
import os
import re

import numpy as np
import torch
from torchtext.data import NestedField, Field, TabularDataset
from torchtext.data.iterator import BucketIterator
from torchtext.vocab import Vectors


def clean_string(string):
    """
    Performs tokenization and string cleaning for the Lyrics dataset
    """
    string = re.sub(r"[^A-Za-z0-9(),!?\'`]", " ", string)
    string = re.sub(r"\s{2,}", " ", string)
    return string.lower().strip().split()


def split_sents(string):
    string = re.sub(r"[!?]", " ", string)
    return string.strip().split('.')


def char_quantize(string, max_length=1000):
    identity = np.identity(len(LyricsCharQuantized.ALPHABET))
    quantized_string = np.array([identity[LyricsCharQuantized.ALPHABET[char]] for char in list(string.lower()) if char in LyricsCharQuantized.ALPHABET], dtype=np.float32)
    if len(quantized_string) > max_length:
        return quantized_string[:max_length]
    else:
        return np.concatenate((quantized_string, np.zeros((max_length - len(quantized_string), len(LyricsCharQuantized.ALPHABET)), dtype=np.float32)))


def process_labels(string):
    """
    Returns the label string as a list of floats
    :param string:
    :return:
    """
    return [float(x) for x in string]


class Lyrics(TabularDataset):
    NAME = 'Lyrics'
    NUM_CLASSES = 2  # 10
    IS_MULTILABEL = False  # True

    TEXT_FIELD = Field(batch_first=True, tokenize=clean_string, include_lengths=True)
    LABEL_FIELD = Field(sequential=False, use_vocab=False, batch_first=True, preprocessing=process_labels)

    @staticmethod
    def sort_key(ex):
        return len(ex.text)

    @classmethod
    def splits(cls, path, train=os.path.join('Lyrics', 'train.tsv'),
               validation=os.path.join('Lyrics', 'dev.tsv'),
               test=os.path.join('Lyrics', 'test.tsv'), **kwargs):
        return super(Lyrics, cls).splits(
            path, train=train, validation=validation, test=test,
            format='tsv', fields=[('label', cls.LABEL_FIELD), ('text', cls.TEXT_FIELD)]
        )

    @classmethod
    def iters(cls, path, vectors_name, vectors_cache, batch_size=64, shuffle=True, device=0, vectors=None,
              unk_init=torch.Tensor.zero_):
        """
        :param path: directory containing train, test, dev files
        :param vectors_name: name of word vectors file
        :param vectors_cache: path to directory containing word vectors file
        :param batch_size: batch size
        :param device: GPU device
        :param vectors: custom vectors - either predefined torchtext vectors or your own custom Vector classes
        :param unk_init: function used to generate vector for OOV words
        :return:
        """
        if vectors is None:
            vectors = Vectors(name=vectors_name, cache=vectors_cache, unk_init=unk_init)

        train, val, test = cls.splits(path)
        cls.TEXT_FIELD.build_vocab(train, val, test, vectors=vectors)
        return BucketIterator.splits((train, val, test), batch_size=batch_size, repeat=False, shuffle=shuffle,
                                     sort_within_batch=True, device=device)


class LyricsCharQuantized(Lyrics):
    ALPHABET = dict(map(lambda t: (t[1], t[0]), enumerate(list("""abcdefghijklmnopqrstuvwxyz0123456789,;.!?:'\"/\\|_@#$%^&*~`+-=<>()[]{}"""))))
    TEXT_FIELD = Field(sequential=False, use_vocab=False, batch_first=True, preprocessing=char_quantize)

    @classmethod
    def iters(cls, path, vectors_name, vectors_cache, batch_size=64, shuffle=True, device=0, vectors=None,
              unk_init=torch.Tensor.zero_):
        """
        :param path: directory containing train, test, dev files
        :param batch_size: batch size
        :param device: GPU device
        :return:
        """
        train, val, test = cls.splits(path)
        return BucketIterator.splits((train, val, test), batch_size=batch_size, repeat=False, shuffle=shuffle, device=device)


class LyricsHierarchical(Lyrics):
    NESTING_FIELD = Field(batch_first=True, tokenize=clean_string)
    TEXT_FIELD = NestedField(NESTING_FIELD, tokenize=split_sents)
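
A quick illustration of what char_quantize produces (the input string is invented): a fixed-size (max_length, len(ALPHABET)) float32 matrix, one-hot per in-alphabet character, truncated or zero-padded to max_length.

# Assuming the file above is importable as datasets.lyrics
from datasets.lyrics import char_quantize, LyricsCharQuantized

m = char_quantize("Hello, world!", max_length=20)
print(m.shape, m.dtype)  # (20, len(LyricsCharQuantized.ALPHABET)) float32

# First row one-hot encodes 'h' (the input is lowercased first)
assert m[0].argmax() == LyricsCharQuantized.ALPHABET['h']

# 12 characters survive the alphabet filter (the space is dropped); the rest is zero padding
assert not m[12:].any()
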
1 change: 1 addition & 0 deletions models/args.py
@@ -13,6 +13,7 @@ def get_args():
     parser.add_argument('--seed', type=int, default=3435)
     parser.add_argument('--patience', type=int, default=5)
     parser.add_argument('--log-every', type=int, default=10)
+    parser.add_argument('--local-rank', type=int, default=-1, help='local rank for distributed training')
     parser.add_argument('--data-dir', default=os.path.join(os.pardir, 'hedwig-data', 'datasets'))
 
     return parser
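
The PR only moves this flag into the shared models/args.py (see the removal from models/bert/args.py below), so a sketch of how --local-rank is conventionally consumed may help; the device-setup pattern here is the standard one, not code from this diff. torch.distributed.launch passes --local-rank to each worker process, and the default of -1 keeps single-process behavior:

import torch

local_rank = -1  # default from models/args.py; set per process by the launcher
if local_rank != -1:
    torch.cuda.set_device(local_rank)      # one GPU per worker
    device = torch.device('cuda', local_rank)
else:
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
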
15 changes: 10 additions & 5 deletions models/bert/__main__.py
@@ -14,6 +14,8 @@
 from datasets.bert_processors.sogou_processor import SogouProcessor
 from datasets.bert_processors.sst_processor import SST2Processor
 from datasets.bert_processors.yelp2014_processor import Yelp2014Processor
+from datasets.bert_processors.lyrics_processor import LyricsProcessor
+
 from models.bert.args import get_args
 from models.bert.model import BertForSequenceClassification
 from utils.io import PYTORCH_PRETRAINED_BERT_CACHE
@@ -67,7 +69,8 @@ def evaluate_split(model, processor, args, split='dev'):
         'AAPD': AAPDProcessor,
         'AGNews': AGNewsProcessor,
         'Yelp2014': Yelp2014Processor,
-        'Sogou': SogouProcessor
+        'Sogou': SogouProcessor,
+        'Lyrics': LyricsProcessor
     }
 
     if args.gradient_accumulation_steps < 1:
@@ -77,17 +80,19 @@ def evaluate_split(model, processor, args, split='dev'):
     if args.dataset not in dataset_map:
         raise ValueError('Unrecognized dataset')
 
+    processor = dataset_map[args.dataset]()
+    processor.set_num_classes_(args.data_dir)
+
     args.batch_size = args.batch_size // args.gradient_accumulation_steps
     args.device = device
     args.n_gpu = n_gpu
-    args.num_labels = dataset_map[args.dataset].NUM_CLASSES
-    args.is_multilabel = dataset_map[args.dataset].IS_MULTILABEL
+    args.num_labels = processor.NUM_CLASSES
+    args.is_multilabel = processor.IS_MULTILABEL
 
     if not args.trained_model:
-        save_path = os.path.join(args.save_path, dataset_map[args.dataset].NAME)
+        save_path = os.path.join(args.save_path, processor.NAME)
         os.makedirs(save_path, exist_ok=True)
 
-    processor = dataset_map[args.dataset]()
     args.is_lowercase = 'uncased' in args.model
     args.is_hierarchical = False
     tokenizer = BertTokenizer.from_pretrained(args.model, is_lowercase=args.is_lowercase)
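
The new flow, condensed into a runnable sketch (the data directory path is hypothetical): the processor is instantiated once, inspects the dataset on disk, and everything downstream reads the class count from the instance instead of from hard-coded class attributes.

from datasets.bert_processors.lyrics_processor import LyricsProcessor

processor = LyricsProcessor()
processor.set_num_classes_('hedwig-data/datasets')  # reads Lyrics/train.tsv
print(processor.NUM_CLASSES, processor.IS_MULTILABEL)  # e.g. 10 True
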
4 changes: 2 additions & 2 deletions models/bert/args.py
@@ -7,11 +7,11 @@ def get_args():
     parser = models.args.get_args()
 
     parser.add_argument('--model', default=None, type=str, required=True)
-    parser.add_argument('--dataset', type=str, default='SST-2', choices=['SST-2', 'AGNews', 'Reuters', 'AAPD', 'IMDB', 'Yelp2014'])
+    parser.add_argument('--dataset', type=str, default='SST-2', choices=['SST-2', 'AGNews', 'Reuters', 'AAPD', 'IMDB',
+                                                                         'Yelp2014', 'Lyrics'])
     parser.add_argument('--save-path', type=str, default=os.path.join('model_checkpoints', 'bert'))
     parser.add_argument('--cache-dir', default='cache', type=str)
     parser.add_argument('--trained-model', default=None, type=str)
-    parser.add_argument('--local-rank', type=int, default=-1, help='local rank for distributed training')
     parser.add_argument('--fp16', action='store_true', help='use 16-bit floating point precision')
 
     parser.add_argument('--max-seq-length',