Adds the multi-headed pointer model.
PiperOrigin-RevId: 325046592
adityakanade authored and copybara-github committed Aug 5, 2020
1 parent c5eb4f4 commit 0bfff50
Showing 4 changed files with 395 additions and 44 deletions.
40 changes: 0 additions & 40 deletions cubert/BUILD.bazel

This file was deleted.

50 changes: 49 additions & 1 deletion cubert/README.md
@@ -1 +1,49 @@
This is a repository for code and data accompanying the ICML 2020 paper [Learning and Evaluating Contextual Embedding of Source Code](https://proceedings.icml.cc/static/paper_files/icml/2020/5401-Paper.pdf).
# CuBERT

## Introduction

This is a repository for code, models and data accompanying the ICML 2020 paper
[Learning and Evaluating Contextual Embedding of Source Code](https://proceedings.icml.cc/static/paper_files/icml/2020/5401-Paper.pdf).

**The model checkpoints and datasets will be linked from this README within
the next few weeks.**

If you use the code, models or data released through this repository, please
cite the following paper:
```
@inproceedings{cubert,
  author    = {Aditya Kanade and
               Petros Maniatis and
               Gogul Balakrishnan and
               Kensen Shi},
  title     = {Learning and evaluating contextual embedding of source code},
  booktitle = {Proceedings of the 37th International Conference on Machine Learning,
               {ICML} 2020, 12-18 July 2020},
  series    = {Proceedings of Machine Learning Research},
  publisher = {{PMLR}},
  year      = {2020},
}
```

## The CuBERT Tokenizer

The CuBERT tokenizer for Python is implemented in `cubert_tokenizer.py`, whereas
`unified_tokenizer.py` contains a language-agnostic tokenization mechanism that
can be extended to other languages along the lines of the Python tokenizer.
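
For orientation, usage might look roughly like the sketch below; the class and
method names are illustrative assumptions, so consult `cubert_tokenizer.py` for
the actual interface:

```python
# Hypothetical usage sketch; `PythonTokenizer` and `tokenize` are assumed
# names for illustration, not confirmed against cubert_tokenizer.py.
from cubert import cubert_tokenizer

tokenizer = cubert_tokenizer.PythonTokenizer()  # assumed class name
tokens = tokenizer.tokenize("def add(a, b):\n    return a + b\n")
print(tokens)  # language-level CuBERT tokens, prior to subword splitting
```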

The code within the `code_to_subtokenized_sentences.py` script can be used to
convert Python code into CuBERT sentences. The script can be run on the
`source_code.py.test` file along with the CuBERT subword vocabulary
(**to be released**), and should produce output as illustrated in the
`subtokenized_source_code.json` file. To obtain token-ID sequences for use with
TensorFlow models, the `decode_list` logic from `code_to_subtokenized_sentences.py`
can be skipped.
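
Assuming the released vocabulary is a tensor2tensor subword vocabulary (an
assumption suggested by the `decode_list` reference above, since that is a
`SubwordTextEncoder` method), the ID conversion might look roughly like this;
the vocabulary path is a placeholder:

```python
# Sketch of mapping CuBERT tokens to model-ready integer IDs. Assumes a
# tensor2tensor SubwordTextEncoder vocabulary file; path is a placeholder.
from tensor2tensor.data_generators import text_encoder

subword_encoder = text_encoder.SubwordTextEncoder("cubert_vocab.txt")

# Encode one language-level token into subword IDs without re-tokenizing it.
token_ids = subword_encoder.encode_without_tokenizing("def ")

# decode_list maps IDs back to subtoken strings; skipping this step, as the
# paragraph above notes, leaves the integer IDs for TensorFlow models.
subtokens = subword_encoder.decode_list(token_ids)
```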

## The Multi-Headed Pointer Model

The `finetune_varmisuse_pointer_lib.py` file provides an implementation of the
multi-headed pointer model described in
[Neural Program Repair by Jointly Learning to Localize and Repair](https://openreview.net/pdf?id=ByloJ20qtm)
on top of the pre-trained CuBERT model. The `model_fn_builder` function should
be integrated into an appropriate fine-tuning script along the lines of the
[fine-tuning script of the BERT model](https://github.com/google-research/bert/blob/eedf5716ce1268e56f0a50264a88cafad334ac61/run_classifier.py#L847).
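
For intuition, the core of such a pointer head can be sketched as follows. This
is a minimal illustration in the spirit of the paper above, not the actual code
in `finetune_varmisuse_pointer_lib.py`; names and shapes are assumptions:

```python
import tensorflow as tf  # TF1-style, matching the BERT codebase of that era


def pointer_logits(sequence_output, hidden_size, num_heads=2):
  """Per-position logits for each pointer head (illustrative sketch).

  Args:
    sequence_output: float Tensor [batch, seq_len, hidden_size], the output
      of the pre-trained encoder.
    hidden_size: size of the encoder's hidden dimension.
    num_heads: two heads here: one to localize a variable misuse and one to
      point at the repair token.

  Returns:
    float Tensor [batch, num_heads, seq_len]; a softmax over the last axis
    gives each head's distribution over token positions.
  """
  # One learned query vector per head, dotted against every position.
  queries = tf.get_variable(
      "pointer_queries", [num_heads, hidden_size],
      initializer=tf.truncated_normal_initializer(stddev=0.02))
  return tf.einsum("bsh,nh->bns", sequence_output, queries)
```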