Commit 0bfff50 (1 parent: c5eb4f4)

Adds the multi-headed pointer model.

PiperOrigin-RevId: 325046592

Showing 4 changed files with 395 additions and 44 deletions.
@@ -1 +1,49 @@
# CuBERT

## Introduction

This is a repository for code, models and data accompanying the ICML 2020 paper
[Learning and Evaluating Contextual Embedding of Source Code](https://proceedings.icml.cc/static/paper_files/icml/2020/5401-Paper.pdf).

**The model checkpoints and datasets will be linked from this README within
the next few weeks.**

If you use the code, models or data released through this repository, please
cite the following paper:
```
@inproceedings{cubert,
  author    = {Aditya Kanade and
               Petros Maniatis and
               Gogul Balakrishnan and
               Kensen Shi},
  title     = {Learning and evaluating contextual embedding of source code},
  booktitle = {Proceedings of the 37th International Conference on Machine Learning,
               {ICML} 2020, 12-18 July 2020},
  series    = {Proceedings of Machine Learning Research},
  publisher = {{PMLR}},
  year      = {2020},
}
```

## The CuBERT Tokenizer

The CuBERT tokenizer for Python is implemented in `cubert_tokenizer.py`, whereas
`unified_tokenizer.py` contains a language-agnostic tokenization mechanism,
which can be extended, along the lines of the Python tokenizer, to other
languages.
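
As a rough illustration of the intended use, here is a hedged sketch of
tokenizing a Python snippet. The `PythonTokenizer` class name and its
`tokenize` method are assumptions about the interface in `cubert_tokenizer.py`,
not confirmed API; consult the module itself for the actual entry points.

```python
# Hedged sketch only: `PythonTokenizer` and `tokenize` are assumed names,
# standing in for whatever interface cubert_tokenizer.py actually exposes.
import cubert_tokenizer

tokenizer = cubert_tokenizer.PythonTokenizer()  # hypothetical class name
tokens = tokenizer.tokenize("def add(a, b):\n    return a + b\n")
print(tokens)  # a list of language-level CuBERT tokens for the snippet
```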

The code within the `code_to_subtokenized_sentences.py` script can be used to
convert Python code into CuBERT sentences. This script can be evaluated on the
`source_code.py.test` file along with the CuBERT subword vocabulary (**to be
released**). It should produce output as illustrated in the
`subtokenized_source_code.json` file. To obtain token-ID sequences for use with
TensorFlow models, the `decode_list` logic from
`code_to_subtokenized_sentences.py` can be skipped, as sketched below.
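
The sketch below shows the distinction between subtoken strings and token IDs,
assuming the released subword vocabulary loads into a tensor2tensor
`SubwordTextEncoder`. The vocabulary filename is a placeholder, and calling
`encode` on raw code stands in for the script's full tokenization pipeline.

```python
# Hedged sketch: the vocabulary filename is a placeholder for the unreleased
# CuBERT subword vocabulary, and `encode` on raw text stands in for the
# script's full CuBERT tokenization pipeline.
from tensor2tensor.data_generators import text_encoder

subword_tokenizer = text_encoder.SubwordTextEncoder("cubert_subword_vocab.txt")

# Token-ID sequence, ready to feed to a TensorFlow model:
token_ids = subword_tokenizer.encode("def add(a, b): return a + b")

# `decode_list` maps the IDs back to readable subtoken strings; skipping this
# step, as described above, leaves you with the IDs themselves.
subtokens = subword_tokenizer.decode_list(token_ids)
print(token_ids, subtokens)
```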

## The Multi-Headed Pointer Model

The `finetune_varmisuse_pointer_lib.py` file provides an implementation of the
multi-headed pointer model described in
[Neural Program Repair by Jointly Learning to Localize and Repair](https://openreview.net/pdf?id=ByloJ20qtm)
on top of the pre-trained CuBERT model. The `model_fn_builder` function should
be integrated into an appropriate fine-tuning script, along the lines of the
[fine-tuning script of the BERT model](https://github.com/google-research/bert/blob/eedf5716ce1268e56f0a50264a88cafad334ac61/run_classifier.py#L847).
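
For intuition, here is a schematic NumPy sketch of a two-headed pointer
computation of the kind the paper describes: one head scores every token as the
possible error location, the other scores every token as the repair. The
shapes, weights, and query construction are illustrative assumptions, not the
actual code in `finetune_varmisuse_pointer_lib.py`.

```python
# Schematic two-headed pointer sketch; all shapes and weights are illustrative.
import numpy as np

def pointer_logits(hidden, w_query, w_key):
    """hidden: [seq_len, dim] token encodings; returns one score per token."""
    query = hidden[0] @ w_query  # derive a query from the first ([CLS]) position
    keys = hidden @ w_key        # [seq_len, dim]
    return keys @ query          # [seq_len] attention-style pointer logits

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
seq_len, dim = 16, 8
hidden = rng.normal(size=(seq_len, dim))  # stand-in for CuBERT's output layer

# Separate projection weights for the two heads.
w_loc_q, w_loc_k = rng.normal(size=(dim, dim)), rng.normal(size=(dim, dim))
w_rep_q, w_rep_k = rng.normal(size=(dim, dim)), rng.normal(size=(dim, dim))

location_probs = softmax(pointer_logits(hidden, w_loc_q, w_loc_k))  # where is the bug?
repair_probs = softmax(pointer_logits(hidden, w_rep_q, w_rep_k))    # which token fixes it?
print(location_probs.argmax(), repair_probs.argmax())
```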