diff --git a/README.md b/README.md
index b07eae3..e9f6385 100644
--- a/README.md
+++ b/README.md
@@ -1,18 +1,34 @@
+<<<<<<< HEAD
+# Ubuntu Dialogue Corpus v2.0
+=======
 # README -- Ubuntu Dialogue Corpus v2.0
+>>>>>>> 481182591b91173713f23812ce726fe64ad91209
 
-We describe the files for generating the Ubuntu Dialogue Corpus, and the dataset itself.
+Scripts for generating the Ubuntu Dialogue Corpus, and information about corpus contents.
 
 ## UPDATES FROM UBUNTU CORPUS v1.0:
 
-There are several updates and bug fixes that are present in v2.0. The updates are significant enough that results on the two datasets will not be equivalent, and should not be compared. However, models that do well on the first dataset should transfer to the second dataset (with perhaps a new hyperparameter search).
+Version 2.0 of the corpus is not compatible with Version 1.0, and performance results should not be compared between the two. However, models that do well on the first dataset should also do well on the second, though they may require a new hyperparameter search.
 
-- Separated the train/validation/test sets by time. The training set goes from the beginning (2004) to about April 27, 2012, the validation set goes from April 27 to August 7, 2012, and the test set goes from August 7 to December 1, 2012. This more closely mimics real life implementation, where you are training a model on past data to predict future data.
+- Train/validation/test sets are separated by date, to more closely mimic a real-life setting where you train a model on past data to predict future data.
+  - *Training set* -- Jan 2004 to April 27, 2012
+  - *Validation set* -- April 27 to August 7, 2012
+  - *Test set* -- August 7 to December 1, 2012
 - Changed the sampling procedure for the context length in the validation and test sets, from an inverse distribution to a uniform distribution (between 2 and the max context size). This increases the average context length, which we consider desirable since we would like to model long-term dependencies.
 - Changed the tokenization and entity replacement procedure. After complaints stating v1 was too aggressive, we've decided to remove these. It is up to each person using the dataset to come up with their own tokenization/ entity replacement scheme. We plan to use the tokenization internally.
 - Added differentiation between the end of an utterance (`__eou__`) and end of turn (`__eot__`). In the original dataset, we concatenated all consecutive utterances by the same user in to one utterance, and put `__EOS__` at the end. Here, we also denote where the original utterances were (with `__eou__`). Also, the terminology should now be consistent between the training and test set (instead of both `__EOS__` and ``).
 - Fixed a bug that caused the distribution of false responses in the test and validation sets to be different from the true responses. In particular, the number of words in the false responses was shorter on average than for the true responses, which could have been exploited by some models.
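+
+The `__eou__` and `__eot__` markers above make it possible to recover the turn structure of a generated context. A minimal sketch (the example dialogue below is illustrative only):
+
+```python
+def split_context(context):
+    """Split a generated context into a list of turns, each a list of utterances."""
+    turns = []
+    for turn in context.split("__eot__"):
+        utterances = [u.strip() for u in turn.split("__eou__") if u.strip()]
+        if utterances:
+            turns.append(utterances)
+    return turns
+
+example = "hi __eou__ can anyone help with grub ? __eou__ __eot__ try sudo update-grub __eou__ __eot__"
+print(split_context(example))
+# [['hi', 'can anyone help with grub ?'], ['try sudo update-grub']]
+```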
 
 ## UBUNTU CORPUS GENERATION FILES:
+<<<<<<< HEAD
+
+### `generate.sh`
+
+Script that calls `create_ubuntu_dataset.py`. This is the script you should run in order to download the dataset. The parameters passed to this script will be passed to `create_ubuntu_dataset.py`. Example usage: `./generate.sh -t -s -l`.
+
+### `create_ubuntu_dataset.py`
+
+=======
 ### generate.sh:
 
 #### DESCRIPTION:
@@ -20,11 +36,18 @@ Script that calls `create_ubuntu_dataset.py`. This is the script you should run
 ### create_ubuntu_dataset.py:
 
 #### DESCRIPTION:
+>>>>>>> 481182591b91173713f23812ce726fe64ad91209
 
 Script for generation of train, test and valid datasets from Ubuntu Corpus 1 on 1 dialogs.
 The script downloads 1on1 dialogs from internet and then it randomly samples all the datasets with positive and negative examples.
 
-Copyright IBM 2015
+<<<<<<< HEAD
+(C) IBM 2015
+
+#### ARGUMENTS:
+
+=======
 #### ARGUMENTS:
+>>>>>>> 481182591b91173713f23812ce726fe64ad91209
 - `--data_root`: directory where 1on1 dialogs will downloaded and extracted, the data will be downloaded from [cs.mcgill.ca/~jpineau/datasets/ubuntu-corpus-1.0/ubuntu_dialogs.tgz](http://cs.mcgill.ca/~jpineau/datasets/ubuntu-corpus-1.0/ubuntu_dialogs.tgz) (default = '.')
 - `--seed`: seed for random number generator (default = 1234)
 - `-o`, `--output`: output file for writing to csv (default = None)
@@ -35,6 +58,10 @@ Copyright IBM 2015
 *Note:* if both `-s` and `-l` are present, the stemmer is applied before the lemmatizer.
 
 #### Subparsers:
+<<<<<<< HEAD
+
+=======
+>>>>>>> 481182591b91173713f23812ce726fe64ad91209
 `train`: train set generator
 - `-p`: positive example probability, ie. the ratio of positive examples to total examples in the training set (default = 0.5)
 - `-e`, `--examples`: number of training examples to generate. Note that this will generate slightly fewer examples than desired, as there is a 'post-processing' step that filters (default = 1000000)
@@ -46,6 +73,37 @@ Copyright IBM 2015
 - `-n`: number of distractor examples for each context (default = 9)
 
+<<<<<<< HEAD
+### meta folder
+
+trainfiles.csv
+valfiles.csv
+testfiles.csv
+
+#### DESCRIPTION:
+
+Maps the original dialogue files to the training, validation, and test sets.
+
+
+## UBUNTU CORPUS FILES (after generating)
+
+### train.csv
+
+Contains the training set. It is separated into 3 columns: the context of the conversation, the candidate response or 'utterance', and a flag or 'label' (0 or 1) denoting whether the response is a 'true response' to the context (flag = 1) or a randomly drawn response from elsewhere in the dataset (flag = 0). This triples format is described in the paper. When generated with the default settings, train.csv is 463 MB, with 1,000,000 lines (i.e. examples, corresponding to 449,071 dialogues) and a vocabulary size of ~~1,344,621~~. Note that, to generate the full dataset, you should use the `--examples` argument of `create_ubuntu_dataset.py`.
+
+### valid.csv
+
+Contains the validation set. Each row represents one question and is separated into 11 columns: the context, the true response or 'ground truth utterance', and 9 false responses or 'distractors' that were randomly sampled from elsewhere in the dataset. Your model gets a question correct if it selects the ground truth utterance from amongst the 10 possible responses. When generated with the default settings, `valid.csv` is 27 MB, with 19,561 lines and a vocabulary size of 115,688.
+
+### test.csv
+
+Contains the test set. Formatted in the same way as the validation set. When generated with the default settings, test.csv is 27 MB, with 18,921 lines and a vocabulary size of 115,623.
+
+## BASELINE RESULTS
+
+#### Dual Encoder LSTM model
+
+=======
 ### meta folder: trainfiles.csv, valfiles.csv, testfiles.csv:
 #### DESCRIPTION:
 Maps the original dialogue files to the training, validation, and test sets.
@@ -65,6 +123,7 @@ Contains the test set. Formatted in the same way as the validation set. 
When gen ## BASELINE RESULTS #### Dual Encoder LSTM model: +>>>>>>> 481182591b91173713f23812ce726fe64ad91209 ``` 1 in 2: recall@1: 0.868730970907 @@ -74,7 +133,12 @@ Contains the test set. Formatted in the same way as the validation set. When gen recall@5: 0.924285351827 ``` +<<<<<<< HEAD +#### Dual Encoder RNN model + +======= #### Dual Encoder RNN model: +>>>>>>> 481182591b91173713f23812ce726fe64ad91209 ``` 1 in 2: recall@1: 0.776539210705, @@ -84,7 +148,12 @@ Contains the test set. Formatted in the same way as the validation set. When gen recall@5: 0.836350355691, ``` +<<<<<<< HEAD +#### TF-IDF model + +======= #### TF-IDF model: +>>>>>>> 481182591b91173713f23812ce726fe64ad91209 ``` 1 in 2: recall@1: 0.749260042283 @@ -98,7 +167,12 @@ Contains the test set. Formatted in the same way as the validation set. When gen Code for the model can be found here (might not be up to date with the new dataset): https://github.com/npow/ubottu +<<<<<<< HEAD +#### Dual Encoder LSTM model + +======= #### Dual Encoder LSTM model: +>>>>>>> 481182591b91173713f23812ce726fe64ad91209 ``` act_penalty=500 batch_size=256 @@ -128,7 +202,12 @@ use_pv=False xcov_penalty=0.0 ``` +<<<<<<< HEAD +#### Dual Encoder RNN model + +======= #### Dual Encoder RNN model: +>>>>>>> 481182591b91173713f23812ce726fe64ad91209 ``` act_penalty=500 batch_size=512 diff --git a/src/create_ubuntu_dataset.py b/src/create_ubuntu_dataset.py index 88ebf30..3e89ec2 100644 --- a/src/create_ubuntu_dataset.py +++ b/src/create_ubuntu_dataset.py @@ -23,36 +23,41 @@ end_of_turn_symbol = "__eot__" +def unicode_csv_reader(utf8_data, dialect=csv.excel, **kwargs): + """ python2-3 csv.reader that can handle unicode (utf-8) files """ + csv_reader = csv.reader(utf8_data, dialect=dialect, **kwargs) + for row in csv_reader: + yield [unicode(cell, 'utf-8') for cell in row] + def translate_dialog_to_lists(dialog_filename): - """ - Translates the dialog to a list of lists of utterances. In the first - list each item holds subsequent utterances from the same user. The second level - list holds the individual utterances. + """ Translates a dialog into a list of lists of utterances. + + In the first list each item holds subsequent utterances from the same user. + The second level list holds the individual utterances. :param dialog_filename: :return: """ - dialog_file = open(dialog_filename, 'r') - dialog_reader = unicodecsv.reader(dialog_file, delimiter='\t',quoting=csv.QUOTE_NONE) - - # go through the dialog first_turn = True dialog = [] same_user_utterances = [] - #last_user = None dialog.append(same_user_utterances) - for dialog_line in dialog_reader: + with open(dialog_filename, 'r') as fin: + dialog_reader = unicode_csv_reader( + fin, delimiter='\t', quoting=csv.QUOTE_NONE) - if first_turn: - last_user = dialog_line[1] - first_turn = False + for dialog_line in dialog_reader: - if last_user != dialog_line[1]: - # user has changed - same_user_utterances = [] - dialog.append(same_user_utterances) + if first_turn: + last_user = dialog_line[1] + first_turn = False + + if last_user != dialog_line[1]: + # user has changed + same_user_utterances = [] + dialog.append(same_user_utterances) same_user_utterances.append(dialog_line[3]) @@ -63,28 +68,30 @@ def translate_dialog_to_lists(dialog_filename): return dialog -def get_random_utterances_from_corpus(candidate_dialog_paths,rng,utterances_num=9,min_turn=3,max_turn=20): - """ - Sample multiple random utterances from the whole corpus. 
- :param candidate_dialog_paths: - :param rng: - :param utterances_num: number of utterances to generate - :param min_turn: minimal index of turn that the utterance is selected from - :return: +def get_random_utterances_from_corpus(candidate_dialog_paths, rng, + utterances_num=9, min_turn=3, max_turn=20): + """ Sample multiple random utterances from the whole corpus. + + Args: + candidate_dialog_paths: + rng: + utterances_num: number of utterances to generate + min_turn: minimal index of turn that the utterance is selected from """ utterances = [] dialogs_num = len(candidate_dialog_paths) - for i in xrange(0,utterances_num): - # sample random dialog - dialog_path = candidate_dialog_paths[rng.randint(0,dialogs_num-1)] + for i in range(0, utterances_num): + # sample a random dialog + dialog_path = candidate_dialog_paths[rng.randint(0, dialogs_num - 1)] # load the dialog dialog = translate_dialog_to_lists(dialog_path) # we do not count the last _dialog_end__ urn dialog_len = len(dialog) - 1 - if(dialog_len rng.random(): # use the next utterance as positive example - response = singe_user_utterances_to_string(dialog[next_utterance_ix]) + response = single_user_utterances_to_string(dialog[next_utterance_ix]) label = 1.0 else: - response = get_random_utterances_from_corpus(candidate_dialog_paths,rng,1, + response = get_random_utterances_from_corpus(candidate_dialog_paths, rng, 1, min_turn=minimum_context_length+1, max_turn=max_context_length)[0] label = 0.0 @@ -179,12 +191,15 @@ def create_single_dialog_test_example(context_dialog_path, candidate_dialog_path dialog = translate_dialog_to_lists(context_dialog_path) - context_str, next_utterance_ix = create_random_context(dialog, rng, max_context_length=max_context_length) + context_str, next_utterance_ix = create_random_context( + dialog, rng, max_context_length=max_context_length) # use the next utterance as positive example - positive_response = singe_user_utterances_to_string(dialog[next_utterance_ix]) + positive_response = single_user_utterances_to_string( + dialog[next_utterance_ix]) - negative_responses = get_random_utterances_from_corpus(candidate_dialog_paths,rng,distractors_num) + negative_responses = get_random_utterances_from_corpus( + candidate_dialog_paths, rng, distractors_num) return context_str, positive_response, negative_responses @@ -200,12 +215,15 @@ def create_examples_train(candidate_dialog_paths, rng, positive_probability=0.5, examples = [] for context_dialog in candidate_dialog_paths: if i % 1000 == 0: - print str(i) + print(str(i)) dialog_path = candidate_dialog_paths[i] examples.append(create_single_dialog_train_example(dialog_path, candidate_dialog_paths, rng, positive_probability, max_context_length=max_context_length)) i+=1 - #return map(lambda dialog_path : create_single_dialog_train_example(dialog_path, candidate_dialog_paths, rng, positive_probability), candidate_dialog_paths) + # return map(lambda dialog_path : + # create_single_dialog_train_example(dialog_path, candidate_dialog_paths, + # rng, positive_probability), candidate_dialog_paths) + def create_examples(candidate_dialog_paths, examples_num, creator_function): """ @@ -222,13 +240,15 @@ def create_examples(candidate_dialog_paths, examples_num, creator_function): context_dialog = candidate_dialog_paths[i % unique_dialogs_num] # counter for tracking progress if i % 1000 == 0: - print str(i) + print(str(i)) i+=1 - examples.append(creator_function(context_dialog, candidate_dialog_paths)) + examples.append(creator_function( + context_dialog, 
candidate_dialog_paths)) return examples + def convert_csv_with_dialog_paths(csv_file): """ Converts CSV file with comma separated paths to filesystem paths. @@ -236,61 +256,61 @@ def convert_csv_with_dialog_paths(csv_file): :return: """ def convert_line_to_path(line): - file, dir = map(lambda x : x.strip(), line.split(",")) + file, dir = [x.strip() for x in line.split(",")] return os.path.join(dir, file) - return map(convert_line_to_path, csv_file) + return list(map(convert_line_to_path, csv_file)) def prepare_data_maybe_download(directory): - """ - Download and unpack dialogs if necessary. - """ - filename = 'ubuntu_dialogs.tgz' - url = 'http://cs.mcgill.ca/~jpineau/datasets/ubuntu-corpus-1.0/ubuntu_dialogs.tgz' - dialogs_path = os.path.join(directory, 'dialogs') - - # test it there are some dialogs in the path - if not os.path.exists(os.path.join(directory,"10","1.tst")): - # dialogs are missing - archive_path = os.path.join(directory,filename) - if not os.path.exists(archive_path): - # archive missing, download it - print("Downloading %s to %s" % (url, archive_path)) - filepath, _ = urllib.request.urlretrieve(url, archive_path) - print "Successfully downloaded " + filepath - - # unpack data - if not os.path.exists(dialogs_path): - print("Unpacking dialogs ...") - with tarfile.open(archive_path) as tar: + """Download ubuntu_dialogs.tgz file from cs.mcgill.ca/ and unpack it if dialogs/ doesn't exist. """ + archive_filename = 'ubuntu_dialogs.tgz' + url = 'http://cs.mcgill.ca/~jpineau/datasets/ubuntu-corpus-1.0/ubuntu_dialogs.tgz' + dialogs_path = os.path.join(directory, 'dialogs') + + # test it there are some dialogs in the path + if not os.path.exists(os.path.join(directory, "10", "1.tst")): + # dialogs are missing + archive_path = os.path.join(directory, archive_filename) + if not os.path.exists(archive_path): + # archive missing, download it + print(("Downloading %s to %s" % (url, archive_path))) + filepath, _ = urllib.request.urlretrieve(url, archive_path) + print("Successfully downloaded " + filepath) + + # unpack data + if not os.path.exists(dialogs_path): + print("Unpacking dialogs ...") + with tarfile.open(archive_path) as tar: tar.extractall(path=directory) - print("Archive unpacked.") + print("Archive unpacked.") - return + return -##################################################################################### +########################################################################## # Command line script related code -##################################################################################### +########################################################################## if __name__ == '__main__': - def create_eval_dataset(args, file_list_csv): rng = random.Random(args.seed) # training dataset f = open(os.path.join("meta", file_list_csv), 'r') - dialog_paths = map(lambda path: os.path.join(args.data_root, "dialogs", path), convert_csv_with_dialog_paths(f)) + dialog_paths = [os.path.join(args.data_root, "dialogs", path) + for path in convert_csv_with_dialog_paths(f)] + + def creator_function(context_dialog, candidates): + return create_single_dialog_test_example(context_dialog, candidates, rng, args.n, args.max_context_length) - data_set = create_examples(dialog_paths, - len(dialog_paths), - lambda context_dialog, candidates : create_single_dialog_test_example(context_dialog, candidates, rng, - args.n, args.max_context_length)) # output the dataset + data_set = create_examples(dialog_paths, + examples_num=len(dialog_paths), + creator_function=creator_function) w = 
unicodecsv.writer(open(args.output, 'w'), encoding='utf-8') # header header = ["Context", "Ground Truth Utterance"] - header.extend(map(lambda x: "Distractor_{}".format(x), xrange(args.n))) + header.extend(["Distractor_{}".format(x) for x in range(args.n)]) w.writerow(header) stemmer = SnowballStemmer("english") @@ -299,19 +319,20 @@ def create_eval_dataset(args, file_list_csv): for row in data_set: translated_row = [row[0], row[1]] translated_row.extend(row[2]) - + if args.tokenize: - translated_row = map(nltk.word_tokenize, translated_row) + translated_row = list(map(nltk.word_tokenize, translated_row)) if args.stem: - translated_row = map(lambda sub: map(stemmer.stem, sub), translated_row) + translated_row = [list(map(stemmer.stem, sub)) + for sub in translated_row] if args.lemmatize: - translated_row = map(lambda sub: map(lambda tok: lemmatizer.lemmatize(tok, pos='v'), sub), translated_row) - - translated_row = map(lambda x: " ".join(x), translated_row) + translated_row = [[lemmatizer.lemmatize( + tok, pos='v') for tok in sub] for sub in translated_row] - w.writerow(translated_row) - print("Dataset stored in: {}".format(args.output)) + translated_row = [" ".join(x) for x in translated_row] + w.writerow(translated_row) + print(("Dataset stored in: {}".format(args.output))) def train_cmd(args): @@ -319,11 +340,12 @@ def train_cmd(args): # training dataset f = open(os.path.join("meta", "trainfiles.csv"), 'r') - dialog_paths = map(lambda path: os.path.join(args.data_root, "dialogs", path), convert_csv_with_dialog_paths(f)) + dialog_paths = [os.path.join(args.data_root, "dialogs", path) + for path in convert_csv_with_dialog_paths(f)] train_set = create_examples(dialog_paths, args.examples, - lambda context_dialog, candidates : + lambda context_dialog, candidates: create_single_dialog_train_example(context_dialog, candidates, rng, args.p, max_context_length=args.max_context_length)) @@ -338,18 +360,20 @@ def train_cmd(args): translated_row = row if args.tokenize: - translated_row = [nltk.word_tokenize(row[i]) for i in [0,1]] + translated_row = [nltk.word_tokenize(row[i]) for i in [0, 1]] if args.stem: - translated_row = map(lambda sub: map(stemmer.stem, sub), translated_row) + translated_row = [list(map(stemmer.stem, sub)) + for sub in translated_row] if args.lemmatize: - translated_row = map(lambda sub: map(lambda tok: lemmatizer.lemmatize(tok, pos='v'), sub), translated_row) + translated_row = [[lemmatizer.lemmatize( + tok, pos='v') for tok in sub] for sub in translated_row] - translated_row = map(lambda x: " ".join(x), translated_row) + translated_row = [" ".join(x) for x in translated_row] translated_row.append(int(float(row[2]))) w.writerow(translated_row) - print("Train dataset stored in: {}".format(args.output)) + print(("Train dataset stored in: {}".format(args.output))) def valid_cmd(args): create_eval_dataset(args, "valfiles.csv") @@ -357,7 +381,6 @@ def valid_cmd(args): def test_cmd(args): create_eval_dataset(args, "testfiles.csv") - parser = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter, description="Script that creates train, valid and test set from 1 on 1 dialogs in Ubuntu Corpus. 
" + "The script downloads 1on1 dialogs from internet and then it randomly samples all the datasets with positive and negative examples.") @@ -386,16 +409,20 @@ def test_cmd(args): subparsers = parser.add_subparsers(help='sub-command help') parser_train = subparsers.add_parser('train', help='trainset generator') - parser_train.add_argument('-p', type=float, default=0.5, help='positive example probability') - parser_train.add_argument('-e', '--examples', type=int, default=1000000, help='number of examples to generate') + parser_train.add_argument( + '-p', type=float, default=0.5, help='positive example probability') + parser_train.add_argument('-e', '--examples', type=int, + default=1000000, help='number of examples to generate') parser_train.set_defaults(func=train_cmd) - parser_test = subparsers.add_parser('test', help='testset generator') - parser_test.add_argument('-n', type=int, default=9, help='number of distractor examples for each context') + parser_test = subparsers.add_parser('test', help='test set generator') + parser_test.add_argument('-n', type=int, default=9, + help='number of distractor examples for each context') parser_test.set_defaults(func=test_cmd) - parser_valid = subparsers.add_parser('valid', help='validset generator') - parser_valid.add_argument('-n', type=int, default=9, help='number of distractor examples for each context') + parser_valid = subparsers.add_parser('valid', help='validation set generator') + parser_valid.add_argument( + '-n', type=int, default=9, help='number of distractor examples for each context') parser_valid.set_defaults(func=valid_cmd) args = parser.parse_args() @@ -405,4 +432,3 @@ def test_cmd(args): # create dataset args.func(args) - diff --git a/src/create_ubuntu_dataset.py.bak b/src/create_ubuntu_dataset.py.bak new file mode 100644 index 0000000..88ebf30 --- /dev/null +++ b/src/create_ubuntu_dataset.py.bak @@ -0,0 +1,408 @@ +import argparse +import os +import unicodecsv +import random +from six.moves import urllib +import tarfile +import csv + +import nltk +from nltk.stem import SnowballStemmer, WordNetLemmatizer + +__author__ = 'rkadlec' + +""" +Script for generation of train, test and valid datasets from Ubuntu Corpus 1 on 1 dialogs. +Copyright IBM Corporation 2016 +LICENSE: Apache License 2.0 URL: ttp://www.apache.org/licenses/LICENSE-2.0 +Contact: Rudolf Kadlec (rudolf_kadlec@cz.ibm.com) +""" + +dialog_end_symbol = "__dialog_end__" +end_of_utterance_symbol = "__eou__" +end_of_turn_symbol = "__eot__" + + + +def translate_dialog_to_lists(dialog_filename): + """ + Translates the dialog to a list of lists of utterances. In the first + list each item holds subsequent utterances from the same user. The second level + list holds the individual utterances. 
+ :param dialog_filename: + :return: + """ + + dialog_file = open(dialog_filename, 'r') + dialog_reader = unicodecsv.reader(dialog_file, delimiter='\t',quoting=csv.QUOTE_NONE) + + # go through the dialog + first_turn = True + dialog = [] + same_user_utterances = [] + #last_user = None + dialog.append(same_user_utterances) + + for dialog_line in dialog_reader: + + if first_turn: + last_user = dialog_line[1] + first_turn = False + + if last_user != dialog_line[1]: + # user has changed + same_user_utterances = [] + dialog.append(same_user_utterances) + + same_user_utterances.append(dialog_line[3]) + + last_user = dialog_line[1] + + dialog.append([dialog_end_symbol]) + + return dialog + + +def get_random_utterances_from_corpus(candidate_dialog_paths,rng,utterances_num=9,min_turn=3,max_turn=20): + """ + Sample multiple random utterances from the whole corpus. + :param candidate_dialog_paths: + :param rng: + :param utterances_num: number of utterances to generate + :param min_turn: minimal index of turn that the utterance is selected from + :return: + """ + utterances = [] + dialogs_num = len(candidate_dialog_paths) + + for i in xrange(0,utterances_num): + # sample random dialog + dialog_path = candidate_dialog_paths[rng.randint(0,dialogs_num-1)] + # load the dialog + dialog = translate_dialog_to_lists(dialog_path) + + # we do not count the last _dialog_end__ urn + dialog_len = len(dialog) - 1 + if(dialog_len rng.random(): + # use the next utterance as positive example + response = singe_user_utterances_to_string(dialog[next_utterance_ix]) + label = 1.0 + else: + response = get_random_utterances_from_corpus(candidate_dialog_paths,rng,1, + min_turn=minimum_context_length+1, + max_turn=max_context_length)[0] + label = 0.0 + return context_str, response, label + + +def create_single_dialog_test_example(context_dialog_path, candidate_dialog_paths, rng, distractors_num, max_context_length): + """ + Creates a single example for testing or validation. Each line contains a context, one positive example and N negative examples. + :param context_dialog_path: + :param candidate_dialog_paths: + :param rng: + :param distractors_num: + :return: triple (context, positive response, [negative responses]) + """ + + dialog = translate_dialog_to_lists(context_dialog_path) + + context_str, next_utterance_ix = create_random_context(dialog, rng, max_context_length=max_context_length) + + # use the next utterance as positive example + positive_response = singe_user_utterances_to_string(dialog[next_utterance_ix]) + + negative_responses = get_random_utterances_from_corpus(candidate_dialog_paths,rng,distractors_num) + return context_str, positive_response, negative_responses + + +def create_examples_train(candidate_dialog_paths, rng, positive_probability=0.5, max_context_length=20): + """ + Creates single training example. 
+ :param candidate_dialog_paths: + :param rng: + :param positive_probability: probability of selecting positive training example + :return: + """ + i = 0 + examples = [] + for context_dialog in candidate_dialog_paths: + if i % 1000 == 0: + print str(i) + dialog_path = candidate_dialog_paths[i] + examples.append(create_single_dialog_train_example(dialog_path, candidate_dialog_paths, rng, positive_probability, + max_context_length=max_context_length)) + i+=1 + #return map(lambda dialog_path : create_single_dialog_train_example(dialog_path, candidate_dialog_paths, rng, positive_probability), candidate_dialog_paths) + +def create_examples(candidate_dialog_paths, examples_num, creator_function): + """ + Creates a list of training examples from a list of dialogs and function that transforms a dialog to an example. + :param candidate_dialog_paths: + :param creator_function: + :return: + """ + i = 0 + examples = [] + unique_dialogs_num = len(candidate_dialog_paths) + + while i < examples_num: + context_dialog = candidate_dialog_paths[i % unique_dialogs_num] + # counter for tracking progress + if i % 1000 == 0: + print str(i) + i+=1 + + examples.append(creator_function(context_dialog, candidate_dialog_paths)) + + return examples + +def convert_csv_with_dialog_paths(csv_file): + """ + Converts CSV file with comma separated paths to filesystem paths. + :param csv_file: + :return: + """ + def convert_line_to_path(line): + file, dir = map(lambda x : x.strip(), line.split(",")) + return os.path.join(dir, file) + + return map(convert_line_to_path, csv_file) + + +def prepare_data_maybe_download(directory): + """ + Download and unpack dialogs if necessary. + """ + filename = 'ubuntu_dialogs.tgz' + url = 'http://cs.mcgill.ca/~jpineau/datasets/ubuntu-corpus-1.0/ubuntu_dialogs.tgz' + dialogs_path = os.path.join(directory, 'dialogs') + + # test it there are some dialogs in the path + if not os.path.exists(os.path.join(directory,"10","1.tst")): + # dialogs are missing + archive_path = os.path.join(directory,filename) + if not os.path.exists(archive_path): + # archive missing, download it + print("Downloading %s to %s" % (url, archive_path)) + filepath, _ = urllib.request.urlretrieve(url, archive_path) + print "Successfully downloaded " + filepath + + # unpack data + if not os.path.exists(dialogs_path): + print("Unpacking dialogs ...") + with tarfile.open(archive_path) as tar: + tar.extractall(path=directory) + print("Archive unpacked.") + + return + +##################################################################################### +# Command line script related code +##################################################################################### + +if __name__ == '__main__': + + + def create_eval_dataset(args, file_list_csv): + rng = random.Random(args.seed) + # training dataset + f = open(os.path.join("meta", file_list_csv), 'r') + dialog_paths = map(lambda path: os.path.join(args.data_root, "dialogs", path), convert_csv_with_dialog_paths(f)) + + data_set = create_examples(dialog_paths, + len(dialog_paths), + lambda context_dialog, candidates : create_single_dialog_test_example(context_dialog, candidates, rng, + args.n, args.max_context_length)) + # output the dataset + w = unicodecsv.writer(open(args.output, 'w'), encoding='utf-8') + # header + header = ["Context", "Ground Truth Utterance"] + header.extend(map(lambda x: "Distractor_{}".format(x), xrange(args.n))) + w.writerow(header) + + stemmer = SnowballStemmer("english") + lemmatizer = WordNetLemmatizer() + + for row in data_set: + translated_row 
= [row[0], row[1]] + translated_row.extend(row[2]) + + if args.tokenize: + translated_row = map(nltk.word_tokenize, translated_row) + if args.stem: + translated_row = map(lambda sub: map(stemmer.stem, sub), translated_row) + if args.lemmatize: + translated_row = map(lambda sub: map(lambda tok: lemmatizer.lemmatize(tok, pos='v'), sub), translated_row) + + translated_row = map(lambda x: " ".join(x), translated_row) + + w.writerow(translated_row) + print("Dataset stored in: {}".format(args.output)) + + + def train_cmd(args): + + rng = random.Random(args.seed) + # training dataset + + f = open(os.path.join("meta", "trainfiles.csv"), 'r') + dialog_paths = map(lambda path: os.path.join(args.data_root, "dialogs", path), convert_csv_with_dialog_paths(f)) + + train_set = create_examples(dialog_paths, + args.examples, + lambda context_dialog, candidates : + create_single_dialog_train_example(context_dialog, candidates, rng, + args.p, max_context_length=args.max_context_length)) + + stemmer = SnowballStemmer("english") + lemmatizer = WordNetLemmatizer() + + # output the dataset + w = unicodecsv.writer(open(args.output, 'w'), encoding='utf-8') + # header + w.writerow(["Context", "Utterance", "Label"]) + for row in train_set: + translated_row = row + + if args.tokenize: + translated_row = [nltk.word_tokenize(row[i]) for i in [0,1]] + + if args.stem: + translated_row = map(lambda sub: map(stemmer.stem, sub), translated_row) + if args.lemmatize: + translated_row = map(lambda sub: map(lambda tok: lemmatizer.lemmatize(tok, pos='v'), sub), translated_row) + + translated_row = map(lambda x: " ".join(x), translated_row) + translated_row.append(int(float(row[2]))) + + w.writerow(translated_row) + print("Train dataset stored in: {}".format(args.output)) + + def valid_cmd(args): + create_eval_dataset(args, "valfiles.csv") + + def test_cmd(args): + create_eval_dataset(args, "testfiles.csv") + + + parser = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter, + description="Script that creates train, valid and test set from 1 on 1 dialogs in Ubuntu Corpus. 
" + + "The script downloads 1on1 dialogs from internet and then it randomly samples all the datasets with positive and negative examples.") + + parser.add_argument('--data_root', default='.', + help='directory where 1on1 dialogs will be downloaded and extracted, the data will be downloaded from cs.mcgill.ca/~jpineau/datasets/ubuntu-corpus-1.0/ubuntu_dialogs.tgz') + + parser.add_argument('--seed', type=int, default=1234, + help='seed for random number generator') + + parser.add_argument('--max_context_length', type=int, default=20, + help='maximum number of dialog turns in the context') + + parser.add_argument('-o', '--output', default=None, + help='output csv') + + parser.add_argument('-t', '--tokenize', action='store_true', + help='tokenize the output') + + parser.add_argument('-l', '--lemmatize', action='store_true', + help='lemmatize the output by nltk.stem.WorldNetLemmatizer (applied only when -t flag is present)') + + parser.add_argument('-s', '--stem', action='store_true', + help='stem the output by nltk.stem.SnowballStemmer (applied only when -t flag is present)') + + subparsers = parser.add_subparsers(help='sub-command help') + + parser_train = subparsers.add_parser('train', help='trainset generator') + parser_train.add_argument('-p', type=float, default=0.5, help='positive example probability') + parser_train.add_argument('-e', '--examples', type=int, default=1000000, help='number of examples to generate') + parser_train.set_defaults(func=train_cmd) + + parser_test = subparsers.add_parser('test', help='testset generator') + parser_test.add_argument('-n', type=int, default=9, help='number of distractor examples for each context') + parser_test.set_defaults(func=test_cmd) + + parser_valid = subparsers.add_parser('valid', help='validset generator') + parser_valid.add_argument('-n', type=int, default=9, help='number of distractor examples for each context') + parser_valid.set_defaults(func=valid_cmd) + + args = parser.parse_args() + + # download and unpack data if necessary + prepare_data_maybe_download(args.data_root) + + # create dataset + args.func(args) +