In our work, we pre-trained PLBART on a large collection of source code and natural language descriptions from GitHub and StackOverflow. In contrast, other pre-trained models, such as CodeBERT and GraphCodeBERT, are pre-trained on the CodeSearchNet dataset. We therefore investigate how PLBART performs when it is pre-trained on the CodeSearchNet dataset instead.
To pre-train PLBART on CodeSearchNet, run the following scripts in order.
bash setup.sh
bash binarize.sh
bash pretrain.sh
We used 1,880,853 docstrings. The number of functions used per language is detailed below.
Language | Num Examples |
---|---|
Java | 1,524,722 |
Python | 1,069,208 |
JavaScript | 1,841,822 |
PHP | 921,770 |
Go | 696,935 |
Ruby | 159,342 |
Total | 6,213,799 |
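
As a quick sanity check, the per-language counts above can be summed to confirm the reported total. The sketch below is not part of the repository: the name `function_counts` is made up for illustration, and the numbers are simply copied from the table.

```python
# Per-language function counts copied from the table above.
# `function_counts` is an illustrative name, not something defined in the repo.
function_counts = {
    "Java": 1_524_722,
    "Python": 1_069_208,
    "JavaScript": 1_841_822,
    "PHP": 921_770,
    "Go": 696_935,
    "Ruby": 159_342,
}

total = sum(function_counts.values())
reported_total = 6_213_799  # "Total" row of the table

# Share of each language in the pre-training corpus, largest first.
shares = {
    lang: count / total
    for lang, count in sorted(function_counts.items(), key=lambda kv: -kv[1])
}

print(f"computed total: {total:,} (reported: {reported_total:,})")
for lang, share in shares.items():
    print(f"{lang:<10} {share:6.1%}")
```

The computed sum matches the reported total. JavaScript and Java together account for a bit over half of the functions, while Ruby contributes under 3%.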
[Note]
- We pre-trained PLBART on CodeSearchNet using 8 GeForce RTX 2080 (11 GB) GPUs (took ~11.5 days).
- We have published the checkpoint here.
- We fine-tuned PLBART-CSNet on all the downstream tasks on which PLBART was evaluated.
- The scripts are provided in the root_directory/scripts/plbart_csnet directory.
- We compare PLBART-CSNet to PLBART; the experiment results are as follows.
Dataset: CodeSearchNet
Methods | Ruby | JavaScript | Go | Python | Java | PHP | Overall |
---|---|---|---|---|---|---|---|
CodeBERT | 12.16 | 14.90 | 18.07 | 19.06 | 17.65 | 25.16 | 17.83 |
PLBART | 14.11 | 15.56 | 18.91 | 19.30 | 18.45 | 23.58 | 18.32 |
PLBART-CSNet | 14.48 | 16.00 | 17.61 | 20.07 | 19.81 | 24.48 | 18.74 |
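
The Overall column appears to be the unweighted mean of the six per-language scores. This is an inference from the numbers, not something the table states. A minimal check, with scores copied from the table (the names `scores`, `reported_overall`, and `overall` are illustrative):

```python
# Per-language scores copied from the table above, in column order:
# Ruby, JavaScript, Go, Python, Java, PHP.
scores = {
    "CodeBERT": [12.16, 14.90, 18.07, 19.06, 17.65, 25.16],
    "PLBART": [14.11, 15.56, 18.91, 19.30, 18.45, 23.58],
    "PLBART-CSNet": [14.48, 16.00, 17.61, 20.07, 19.81, 24.48],
}
reported_overall = {"CodeBERT": 17.83, "PLBART": 18.32, "PLBART-CSNet": 18.74}


def overall(values):
    """Unweighted mean across languages, rounded to two decimals."""
    return round(sum(values) / len(values), 2)


for model, values in scores.items():
    print(
        f"{model:<13} computed={overall(values):.2f} "
        f"reported={reported_overall[model]:.2f}"
    )
```

Under this reading, PLBART-CSNet improves on PLBART for five of the six languages and trails only on Go.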
Dataset: Concode
Methods | EM | BLEU | CodeBLEU |
---|---|---|---|
GPT-2 | 17.35 | 25.37 | 29.69 |
CodeGPT-2 | 18.25 | 28.69 | 32.71 |
CodeGPT-adapted | 20.10 | 32.79 | 35.98 |
PLBART | 18.75 | 36.69 | 38.52 |
PLBART-CSNet | 18.60 | 36.79 | 38.81 |
Task: Translation
Methods | Java to C# BLEU | Java to C# EM | Java to C# CodeBLEU | C# to Java BLEU | C# to Java EM | C# to Java CodeBLEU |
---|---|---|---|---|---|---|
CodeBERT | 79.9 | 59.0 | 85.1 | 72.1 | 58.8 | 79.4 |
GraphCodeBERT | 80.6 | 59.4 | - | 72.6 | 58.8 | - |
PLBART | 83.0 | 64.6 | 87.9 | 78.4 | 65.0 | 85.3 |
PLBART-CSNet | 81.6 | 61.6 | 86.8 | 78.0 | 63.5 | 84.9 |
Task: Defect Detection, Clone Detection
Methods | Vulnerability Detection | Clone Detection |
---|---|---|
CodeBERT | 62.08 | 96.5 |
GraphCodeBERT | - | 97.1 |
PLBART | 63.18 | 97.2 |
PLBART-CSNet | 59.44 | 97.4 |
Task: Code Refinement
Methods | Small EM | Small BLEU | Medium EM | Medium BLEU |
---|---|---|---|---|
CodeBERT | 16.40 | 77.42 | 5.16 | 91.07 |
GraphCodeBERT | 17.30 | 80.58 | 9.10 | 72.64 |
PLBART | 19.21 | 77.02 | 8.98 | 88.50 |
PLBART-CSNet | 19.13 | 76.95 | 11.60 | 88.08 |