Pre-training PLBART using CodeSearchNet

In our work, we pre-trained PLBART on a large collection of source code and natural language descriptions from GitHub and StackOverflow. In contrast, other pre-trained models, such as CodeBERT and GraphCodeBERT, are pre-trained on the CodeSearchNet dataset. We therefore investigate how PLBART performs when it is pre-trained on CodeSearchNet instead.

To pre-train PLBART on CodeSearchNet, run the following scripts in order.

bash setup.sh
bash binarize.sh
bash pretrain.sh
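The three steps depend on one another, so it can help to stop at the first failure. The wrapper below is a minimal sketch, not part of the repository: the function name `run_pretraining_pipeline` and its directory argument are hypothetical, and only the three script names come from the steps above.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper (not shipped with PLBART): runs the three README steps
# in order and stops as soon as one of them fails.

run_pretraining_pipeline() {
    local script_dir="${1:-.}"   # assumed location of the three scripts
    local step
    for step in setup.sh binarize.sh pretrain.sh; do
        if [ ! -f "${script_dir}/${step}" ]; then
            echo "missing script: ${script_dir}/${step}" >&2
            return 1
        fi
        echo "running ${step}"
        # Run in a subshell so the caller's working directory is unchanged.
        ( cd "${script_dir}" && bash "${step}" ) || {
            echo "${step} failed; stopping" >&2
            return 1
        }
    done
}
```

Calling `run_pretraining_pipeline .` from the directory that holds the scripts is equivalent to the three commands above, except that a failing step prevents the later ones from running.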

Pre-training Data Statistics

We used 1,880,853 docstrings. The number of functions used per language is listed below.

Language	Num Examples
Java	1,524,722
Python	1,069,208
Javascript	1,841,822
PHP	921,770
Go	696,935
Ruby	159,342
Total	6,213,799
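As a quick sanity check, the per-language counts can be summed to confirm the reported total. This is a small sketch that hard-codes the numbers from the table; it does not read any dataset files.

```shell
#!/usr/bin/env bash
# Per-language function counts copied from the table above
# (Java, Python, Javascript, PHP, Go, Ruby).
counts="1,524,722 1,069,208 1,841,822 921,770 696,935 159,342"

total=0
for n in $counts; do
    # Strip the thousands separators before doing integer arithmetic.
    total=$(( total + ${n//,/} ))
done

echo "total functions: ${total}"   # prints "total functions: 6213799"
```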

[Note]

We pre-trained PLBART on CodeSearchNet using 8 GeForce RTX 2080 (11 GB) GPUs; pre-training took about 11.5 days.
We have published the checkpoint here.

Experiments

We fine-tuned PLBART-CSNet on all the downstream tasks on which PLBART was evaluated.
The scripts are provided in the root_directory/scripts/plbart_csnet directory.
We compare PLBART-CSNet with PLBART; the results are as follows.

Code to Text Generation

Dataset: CodeSearchNet

Methods	Ruby	Javascript	Go	Python	Java	PHP	Overall
CodeBERT	12.16	14.90	18.07	19.06	17.65	25.16	17.83
PLBART	14.11	15.56	18.91	19.30	18.45	23.58	18.32
PLBART-CSNet	14.48	16.00	17.61	20.07	19.81	24.48	18.74
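The Overall column in this table matches the unweighted mean of the six per-language scores, rounded to two decimals. The sketch below recomputes it from the table values; the helper name `overall` is hypothetical, and the averaging rule is inferred from the numbers rather than stated by the authors.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the unweighted mean of its arguments,
# rounded to two decimal places.
overall() {
    printf '%s\n' "$@" | awk '{ sum += $1 } END { printf "%.2f\n", sum / NR }'
}

# Scores in table order: Ruby, Javascript, Go, Python, Java, PHP.
overall 12.16 14.90 18.07 19.06 17.65 25.16   # CodeBERT     -> 17.83
overall 14.11 15.56 18.91 19.30 18.45 23.58   # PLBART       -> 18.32
overall 14.48 16.00 17.61 20.07 19.81 24.48   # PLBART-CSNet -> 18.74
```

Because the mean is unweighted, a small language such as Ruby contributes as much to Overall as a large one such as Java.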

Text to Code Generation

Dataset: Concode

Methods	EM	BLEU	CodeBLEU
GPT-2	17.35	25.37	29.69
CodeGPT-2	18.25	28.69	32.71
CodeGPT-adapted	20.10	32.79	35.98
PLBART	18.75	36.69	38.52
PLBART-CSNet	18.60	36.79	38.81

Code to Code Generation

Task: Translation

Methods	Java to C# BLEU	Java to C# EM	Java to C# CodeBLEU	C# to Java BLEU	C# to Java EM	C# to Java CodeBLEU
CodeBERT	79.9	59.0	85.1	72.1	58.8	79.4
GraphCodeBERT	80.6	59.4	-	72.6	58.8	-
PLBART	83.0	64.6	87.9	78.4	65.0	85.3
PLBART-CSNet	81.6	61.6	86.8	78.0	63.5	84.9

Task: Defect Detection, Clone Detection

Methods	Vulnerability Detection	Clone Detection
CodeBERT	62.08	96.5
GraphCodeBERT	-	97.1
PLBART	63.18	97.2
PLBART-CSNet	59.44	97.4

Task: Code Refinement

Methods	Small EM	Small BLEU	Medium EM	Medium BLEU
CodeBERT	16.40	77.42	5.16	91.07
GraphCodeBERT	17.30	80.58	9.10	72.64
PLBART	19.21	77.02	8.98	88.50
PLBART-CSNet	19.13	76.95	11.60	88.08
