This is an official PyTorch implementation of the CodeTransformer
model proposed in:
D. Zügner, T. Kirschstein, M. Catasta, J. Leskovec, and S. Günnemann, “Language-agnostic representation learning of source code from structure and context”
which appeared at ICLR'2021.
An online demo is available at https://code-transformer.org.
The CodeTransformer is a Transformer-based architecture that jointly learns from source code (Context) and parsed abstract syntax trees (AST; Structure).
It does so by linking source code tokens to AST nodes and using pairwise distances (e.g., Shortest Paths, PPR) between the nodes to represent the AST.
This combined representation is processed in the model by adding the contributions of each distance type to the raw self-attention score between two input tokens (See the paper for more details).
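The idea of adding distance contributions to the raw attention score can be illustrated with a small toy example. The sketch below is a simplification and not the model's actual implementation: it assumes one scalar bias per distance value (the `bias` table is hypothetical), computes shortest-path lengths on a tiny hand-written AST, and adds the bias for each token pair to the scaled dot-product score before the softmax. See the paper for the exact formulation.

```python
import math
from collections import deque


def shortest_path_lengths(adjacency):
    """All-pairs shortest path lengths on an undirected graph via BFS."""
    distances = {}
    for source in adjacency:
        seen = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen[neighbor] = seen[node] + 1
                    queue.append(neighbor)
        distances[source] = seen
    return distances


def biased_attention(queries, keys, token_to_node, distance_tables, bias_tables):
    """Softmax attention weights with one bias per distance type added to each raw score."""
    dim = len(queries[0])
    weights = []
    for i, query in enumerate(queries):
        row = []
        for j, key in enumerate(keys):
            # Raw scaled dot-product score between token i and token j.
            score = sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            node_i, node_j = token_to_node[i], token_to_node[j]
            # Add the contribution of every distance type (assumed scalar here).
            for name, table in distance_tables.items():
                score += bias_tables[name].get(table[node_i][node_j], 0.0)
            row.append(score)
        peak = max(row)
        exps = [math.exp(s - peak) for s in row]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# Toy AST given as an undirected adjacency list (hypothetical node names).
ast = {
    "method": ["name", "body"],
    "name": ["method"],
    "body": ["method", "call"],
    "call": ["body"],
}
sp = shortest_path_lengths(ast)
token_to_node = ["name", "call"]  # two input tokens linked to AST nodes
queries = [[1.0, 0.0], [0.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0]]
bias = {"shortest_paths": {0: 0.5, 1: 0.2, 2: 0.1, 3: -0.3}}  # hypothetical values
attention = biased_attention(queries, keys, token_to_node, {"shortest_paths": sp}, bias)
```

Further distance types such as PPR would be passed as additional entries in `distance_tables` and `bias_tables`; in this sketch their contributions simply add up.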
Strengths of the CodeTransformer:
- Outperforms other approaches on the source code summarization task.
- Effectively leverages similarities among different programming languages when trained in a multi-lingual setting.
- Produces useful embeddings that may be employed for other down-stream tasks such as finding similar code snippets across languages.
Please cite our paper if you use the model, experimental results, or our code in your own work:
@inproceedings{zuegner_code_transformer_2021,
title = {Language-Agnostic Representation Learning of Source Code from Structure and Context},
author = {Z{\"u}gner, Daniel and Kirschstein, Tobias and Catasta, Michele and Leskovec, Jure and G{\"u}nnemann, Stephan},
booktitle={International Conference on Learning Representations (ICLR)},
year = {2021} }
To run experiments with the CodeTransformer you can either:
- Create a new dataset from raw code snippets, or
- Download the already preprocessed datasets we conducted our experiments on
To use our pipeline to generate a new dataset for code summarization, a collection of methods in the target language is needed. In our experiments, we use the following unprocessed datasets:
Name | Description | Obtain from |
---|---|---|
code2seq | We use java-small for Code Summarization as well as java-medium and java-large for Pretraining | code2seq repository |
CodeSearchNet (CSN) | For our (multilingual) experiments on Python, JavaScript, Ruby and Go, we employ the raw data from the CSN challenge | CodeSearchNet repository |
java-pretrain | For our Pretraining experiments, we compiled and deduplicated a large code method dataset based on java-small, java-medium and java-large. | |
We make our preprocessed datasets available for a quick setup and easy reproducibility of our results: https://dataserv.ub.tum.de/s/m1647000/download?path=%2Fdata&files=python.tar.gz
Name | Language(s) | Based on | Download |
---|---|---|---|
Python | Python | CSN | python.tar.gz |
JavaScript | JavaScript | CSN | javascript.tar.gz |
Ruby | Ruby | CSN | ruby.tar.gz |
Go | Go | CSN | go.tar.gz |
Multi-language | Python, JavaScript, Ruby, Go | CSN | multi-language.tar.gz |
java-small | Java | code2seq | |
java-pretrain | Java | code2seq | Full dataset available on request due to its enormous size |
The notebooks/ folder contains two example notebooks that showcase the CodeTransformer:
- interactive_prediction.ipynb: Lets you load any of the models and specify an arbitrary code snippet to get a real-time prediction for its method name. Also showcases stage 1 and stage 2 preprocessing.
- deduplicate_java_pretrain.ipynb: Explains how we deduplicated the large java-pretrain dataset that we created
All environment variables (and thus external dependencies on the host machine) used in the project have to be specified in an .env configuration file.
These have to be set to suit your local setup before anything can be run.
The .env.example file gives an example configuration. The actual configuration has to be put into ${HOME}/.config/code_transformer/.env.
Alternatively, you can also directly specify the paths as environment variables, e.g., by sourcing the .env file.
Variable (+ CODE_TRANSFORMER_ prefix) | Description | Preprocessing | Training | Inference/Evaluation |
---|---|---|---|---|
Mandatory | ||||
DATA_PATH | Location for storing datasets | X | X | X |
BINARY_PATH | Location for storing executables | X | - | - |
MODELS_PATH | Location for storing model configs, snapshots and predictions | - | X | X |
LOGS_PATH | Location for logging train metrics | - | X | - |
Optional | ||||
CSN_RAW_DATA_PATH | Location of the downloaded raw CSN dataset files | X | - | - |
CODE2SEQ_RAW_DATA_PATH | Location of the downloaded raw code2seq dataset files (Java classes) | X | - | - |
CODE2SEQ_EXTRACTED_METHODS_DATA_PATH | Location of the code snippets extracted from the raw code2seq dataset with the JavaMethodExtractor | X | - | - |
DATA_PATH_STAGE_1 | Location of stage 1 preprocessing result (parsed ASTs) | X | - | - |
DATA_PATH_STAGE_2 | Location of stage 2 preprocessing result (computed distances) | X | X | X |
JAVA_EXECUTABLE | Command for executing java on the machine | X | - | - |
JAVA_METHOD_EXTRACTOR_EXECUTABLE | Path to the built .jar from the java-method-extractor submodule used for extracting methods from raw .java files | X | - | - |
JAVA_PARSER_EXECUTABLE | Path to the built .jar from the java-parser submodule used for parsing Java ASTs | X | - | - |
SEMANTIC_EXECUTABLE | Path to the built semantic executable used for parsing Python, JavaScript, Ruby and Go ASTs | X | - | - |
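Before running anything, it can help to check that the mandatory variables are set. The following is a minimal sketch and not part of the project: the helper names are hypothetical, the file is assumed to contain plain `KEY=VALUE` lines (optionally prefixed with `export`), and letting process environment variables override the file is an assumption about precedence.

```python
import os
from pathlib import Path

PREFIX = "CODE_TRANSFORMER_"
# The four variables listed as mandatory in the table above.
MANDATORY = ("DATA_PATH", "BINARY_PATH", "MODELS_PATH", "LOGS_PATH")


def parse_env_file(path):
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if line.startswith("export "):
            line = line[len("export "):].lstrip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def load_config(env_file=None, environ=None):
    """Merge the .env file with prefixed environment variables (environment wins)."""
    environ = os.environ if environ is None else environ
    if env_file is None:
        env_file = Path.home() / ".config" / "code_transformer" / ".env"
    config = {}
    if Path(env_file).is_file():
        config.update(parse_env_file(env_file))
    config.update({k: v for k, v in environ.items() if k.startswith(PREFIX)})
    return config


def missing_mandatory(config):
    """Return the full names of mandatory variables that are unset or empty."""
    return [PREFIX + name for name in MANDATORY if not config.get(PREFIX + name)]
```

Calling `missing_mandatory(load_config())` returns an empty list when all four mandatory variables are available. Note that, according to the table, not every mandatory variable is needed for every task (e.g., BINARY_PATH is only used during preprocessing).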
├── notebooks  # Example notebooks that showcase the CodeTransformer
├── code_transformer  # Main python package containing most functionality
│   ├── configuration  # Dict-style Configurations of ML models
│   ├── experiments  # Experiment setups for running preprocessing or training
│   │   ├── mixins  # Lightweight experiment modules that can be easily
│   │   │  # combined for different datasets and models
│   │   ├── code_transformer  # Experiment configs for training the CodeTransformer
│   │   ├── great  # Experiment configs for training GREAT
│   │   ├── xl_net  # Experiment configs for training XLNet
│   │   ├── preprocessing  # Implementation scripts for stage 1 and stage 2 preprocessing
│   │   │   ├── preprocess-1.py  # Parallelized stage 1 preprocessing (Generating of ASTs from methods + word counts)
│   │   │   └── preprocess-2.py  # Parallelized stage 2 preprocessing (Calculating of distances in AST + vocabulary)
│   │   ├── paper  # Train configs for reproducing results of all models mentioned in the paper
│   │   └── experiment.py  # Main experiment setup containing training loop, evaluation,
│   │      # metrics, logging and loading of pretrained models
│   ├── modeling  # PyTorch implementations of the Code Transformer,
│   │   │  # GREAT and XLNet with different heads
│   │   ├── code_transformer  # CodeTransformer implementation
│   │   ├── decoder  # Transformer Decoder implementation with Pointer Network
│   │   ├── great_transformer  # Adapted implementation of GREAT for code summarization
│   │   ├── xl_net  # Adapted implementation of XLNet for code summarization
│   │   ├── modelmanager  # Easy loading of stored model parameters
│   │   └── constants.py  # Several constants affecting preprocessing and vocabularies
│   ├── preprocessing  # Implementation of preprocessing pipeline + data loading
│   │   │  # modeling, e.g., special tokens or number of method name tokens
│   │   ├── pipeline  # Stage 1 and Stage 2 preprocessing of CodeSearchNet code snippets
│   │   │   ├── code2seq.py  # Adaptation of code2seq AST path walks for CSN datasets and languages
│   │   │   ├── filter.py  # Low-level textual code snippet processing used during stage 1 preprocessing
│   │   │   ├── stage1.py  # Applies filters to code snippets and calls semantic to generate ASTs
│   │   │   └── stage2.py  # Contains only definitions of training samples, actual
│   │   │      # graph distance calculation is contained in preprocessing/graph/distances.py
│   │   ├── datamanager  # Easy loading and storing of preprocessed code snippets
│   │   │   ├── c2s  # Loading of raw code2seq dataset files
│   │   │   ├── csn  # Loading and storing of CSN dataset files
│   │   │   └── preprocessed.py  # Dataset-agnostic loading of stage 1 and stage 2 preprocessed samples
│   │   ├── dataset  # Final task-specific preprocessing before data is fed into model, i.e.,
│   │   │   │  # python modules to be used with torch.utils.data.DataLoader
│   │   │   ├── base.py  # task-agnostic preprocessing such as mapping sequence tokens to graph nodes
│   │   │   ├── ablation.py  # only-AST ablation
│   │   │   ├── code_summarization.py  # Main preprocessing for the Code Summarization task.
│   │   │   │  # Masking the function name in input, drop punctuation tokens
│   │   │   └── lm.py  # Language Modeling pretraining task. Generate permutations
│   │   ├── graph  # Algorithms on ASTs
│   │   │   ├── alg.py  # Graph distance metrics such as next siblings
│   │   │   ├── ast.py  # Generalized AST as graph that handles semantic and Java-parser ASTs.
│   │   │   │  # Allows assigning tokens that have no corresponding AST node
│   │   │   ├── binning.py  # Graph distance binning (equal and exponential)
│   │   │   ├── distances.py  # Higher level modularized distance and binning wrappers for use in preprocessing
│   │   │   └── transform.py  # Core of stage2 preprocessing that calculates general graph distances
│   │   └── nlp  # Algorithms on text
│   │       ├── javaparser.py  # Python wrapper of java-parser to generate ASTs from Java methods
│   │       ├── semantic.py  # Python wrapper of semantic parser to generate ASTs
│   │       │  # from languages supported by semantic
│   │       ├── text.py  # Simple text handling such as positions of words in documents
│   │       ├── tokenization.py  # Mostly wrapper around Pygments Tokenizer to tokenize (and sub-tokenize) code snippets
│   │       └── vocab.py  # Mapping of the most frequent tokens to IDs understandable for ML models
│   ├── utils
│   │   └── metrics.py  # Evaluation metrics. Mostly, different F1-scores
│   └── env.py  # Defines several environment variables such as paths to executables
├── scripts  # (Python) scripts intended to be run from the command line "python -m scripts.{SCRIPT_NAME}"
│   ├── code2seq
│   │   ├── combine-vocabs-code2seq.sh  # Creates code2seq vocabularies for multi-language setting
│   │   ├── preprocess-code2seq.py  # Preprocessing for code2seq (Generating of tree paths).
│   │   │  # Works with any datasets created by preprocess-1.py
│   │   ├── preprocess-code2seq.sh  # Calls preprocess-code2seq.py and preprocess-code2seq-helper.py.
│   │   │  # Generates everything else code2seq needs, such as vocabularies
│   │   └── preprocess-code2seq-helper.py  # Copied from code2seq. Performs vocabulary generation and normalization of snippets
│   ├── evaluate.py  # Loads a trained model and evaluates it on a specified dataset
│   ├── evaluate-multilanguage.py  # Loads a trained multi-language model and evaluates it on a multi-language database
│   ├── deduplicate-java-pretrain.py  # De-duplicates a directory of .java files (used for java-pretrain)
│   ├── extract-java-methods.py  # Extracts Java methods from raw .java files to feed into stage 1 preprocessing
│   ├── run-experiment.py  # Parses a .yaml config file and starts training of a CodeTransformer/GREAT/XLNet model
│   └── run-preprocessing.py  # Parses a .yaml config file and starts stage 1 or stage 2 preprocessing
├── sub_modules  # Separate modules
│   ├── code2seq  # code2seq adaptation: Mainly modifies code2seq for multi-language setting
│   ├── java-method-extractor  # code2seq adaptation: Extracts Java methods from .java files as JSON
│   └── java-parser  # Simple parser wrapper for generating Java ASTs
├── tests  # Unit Tests for parts of the project
└── .env.example  # Example environment variables configuration file
The first stage of our preprocessing pipeline makes use of semantic to generate ASTs from code snippets that are written in Python, JavaScript, Ruby or Go.
semantic is a command line tool written in Haskell that is capable of parsing source code in a variety of languages.
The generated ASTs mostly share a common set of node types, which is important for multi-lingual experiments.
For Java, we employ a separate AST parser, as the language is currently not supported by semantic.
To obtain the ASTs, we rely on the --json-graph option that has been dropped temporarily from semantic.
As such, the stage 1 preprocessing requires a semantic executable built from a revision before Mar 27, 2020, e.g., revision 34ea0d1dd6.
To enable stage 1 preprocessing, you can either:
- Build semantic on your machine using a revision with the --json-graph option. We refer to the semantic documentation for build instructions, or
- Use the statically linked semantic executable that we built for our experiments: semantic.tar.gz
As Java is not currently supported by semantic, we employ a separate AST parser based on the javaparser project. Our adaptation can be found in the /sub_modules/java-parser directory, which also contains a prebuilt java-parser-1.0-SNAPSHOT.jar.
If you want to reproduce our experiments on the code2seq Java dataset or assemble your own Java dataset to train the CodeTransformer, you can also make use of the JavaMethodExtractor-1.0.0-SNAPSHOT.jar that gathers Java methods from a folder structure of class files.
Download the raw CSN dataset files as described in the raw data section.
- Compute ASTs (stage 1)
python -m scripts.run-preprocessing code_transformer/experiments/preprocessing/preprocess-1-csn.yaml {python|javascript|ruby|go} {train|valid|test}
- Compute graph distances (stage 2)
python -m scripts.run-preprocessing code_transformer/experiments/preprocessing/preprocess-2.yaml {python|javascript|ruby|go} {train|valid|test}
The .yaml files contain configurations for preprocessing (e.g., which distance metrics to use and how the vocabularies are generated).
It is important to run the preprocessing for the train partition first, as the statistics needed to generate the vocabulary for the other partitions are computed there.
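The ordering constraint for the CSN steps above can be made explicit in a small driver. This is a sketch under assumptions, not a script shipped with the project: the helper name `preprocessing_commands` is hypothetical, and it only builds the command lines shown above (stage 1 for all partitions, then stage 2, with train first in each stage) without running them.

```python
import shlex

# Config files of the two stages, taken from the commands above.
STAGE_CONFIGS = (
    "code_transformer/experiments/preprocessing/preprocess-1-csn.yaml",
    "code_transformer/experiments/preprocessing/preprocess-2.yaml",
)
# train must come first because the vocabulary statistics are computed there.
PARTITIONS = ("train", "valid", "test")


def preprocessing_commands(language):
    """Build the stage 1 and stage 2 command lines for one CSN language."""
    commands = []
    for config in STAGE_CONFIGS:
        for partition in PARTITIONS:
            commands.append(
                ["python", "-m", "scripts.run-preprocessing", config, language, partition]
            )
    return commands


# Print the commands for inspection instead of executing them.
for command in preprocessing_commands("ruby"):
    print(shlex.join(command))
```

To execute the commands, each list could be passed to `subprocess.run(command, check=True)` from the repository root, so that a failing step stops the sequence.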
Download the raw code2seq dataset files (Java classes) as described in the raw data section.
- Extract methods from the raw Java class files via
python -m scripts.extract-java-methods {java-small|java-medium|java-large}
- Compute ASTs (stage 1)
python -m scripts.run-preprocessing code_transformer/experiments/preprocessing/preprocess-1-code2seq.yaml {java-small|java-medium|java-large} {train|valid|test}
- Compute graph distances (stage 2)
python -m scripts.run-preprocessing code_transformer/experiments/preprocessing/preprocess-2.yaml {java-small|java-medium|java-large} {train|valid|test}
The .yaml files contain configurations for preprocessing (e.g., which distance metrics to use and how the vocabularies are generated).
Make sure to run the preprocessing for the train partition first, as the statistics needed to generate the vocabulary for the other partitions are computed there.
The multi-language dataset builds upon the stage 1 CSN datasets computed as shown above. The datasets are combined by simply running the stage 2 preprocessing with a comma-separated string containing the desired languages. For the experiments in our paper we combined Python, JavaScript, Ruby and Go:
python -m scripts.run-preprocessing code_transformer/experiments/preprocessing/preprocess-2.yaml python,javascript,ruby,go {train|valid|test}
Similar to how preprocessing works, training is configured via .yaml files that describe the data representation that is used, the model hyperparameters and how training should proceed.
Train metrics are logged to a tensorboard in LOGS_PATH.
python -m scripts.run-experiment code_transformer/experiments/code_transformer/code_summarization.yaml
This will start training of a CodeTransformer model. The .yaml file contains model and training configurations (e.g., which dataset to use, model hyperparameters or when to store checkpoints).
We adapted the Graph Relational Embedding Attention Transformer (GREAT) for comparison.
python -m scripts.run-experiment code_transformer/experiments/great/code_summarization.yaml
This will start training of a GREAT model. The .yaml file contains model and training configurations (e.g., which dataset to use, model hyperparameters or when to store checkpoints).
For comparison, we also employed an XLNet architecture that only learns from the source code token sequence.
python -m scripts.run-experiment code_transformer/experiments/xl_net/code_summarization.yaml
This will start training of an XLNet model. The .yaml file contains model and training configurations (e.g., which dataset to use, model hyperparameters or when to store checkpoints).
The performance of Transformer architectures can often be further improved by first pretraining the model on a language modeling task.
In our case, we make use of XLNet's permutation-based masked language modeling.
Language Modeling pretraining can be run via:
python -m scripts.run-experiment code_transformer/experiments/code_transformer/language_modeling.yaml
The pretrained model can then be finetuned on the Code Summarization task using the regular training script as described above.
The transfer_learning section in the .yaml configuration file is used to define the model and snapshot to be finetuned.
python -m scripts.evaluate {code_transformer|great|xl_net} {run_id} {snapshot} {valid|test}
where run_id is the unique name of the run as printed during training. This also corresponds to the folder name that contains the respective stored snapshots of the model.
snapshot is the training iteration in which the snapshot was stored, e.g., 50000.
python -m scripts.evaluate-multilanguage {code_transformer|great|xl_net} {run_id} {snapshot} {valid|test} [--filter-language {language}]
The --filter-language option can be used to run the evaluation on only one of the languages that the respective multilanguage dataset comprises (used for CT-[11-14]).
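When evaluating many runs, assembling the two command lines above programmatically avoids typos in the positional arguments. The helper below is a hypothetical sketch and not part of the repository: it only validates the model type and partition against the choices shown above and returns the argument list. Passing a lowercase language name such as `python` to `--filter-language` is an assumption based on the names used in the preprocessing commands.

```python
MODELS = ("code_transformer", "great", "xl_net")
PARTITIONS = ("valid", "test")


def evaluate_command(model, run_id, snapshot, partition,
                     multilanguage=False, filter_language=None):
    """Build the argument list for scripts.evaluate or scripts.evaluate-multilanguage."""
    if model not in MODELS:
        raise ValueError(f"model must be one of {MODELS}, got {model!r}")
    if partition not in PARTITIONS:
        raise ValueError(f"partition must be one of {PARTITIONS}, got {partition!r}")
    if filter_language is not None and not multilanguage:
        raise ValueError("--filter-language only applies to the multi-language script")
    script = "scripts.evaluate-multilanguage" if multilanguage else "scripts.evaluate"
    command = ["python", "-m", script, model, run_id, str(snapshot), partition]
    if filter_language is not None:
        command += ["--filter-language", filter_language]
    return command


# Run IDs and snapshots as listed in the tables of this README.
single = evaluate_command("code_transformer", "CT-5", 500000, "test")
multi = evaluate_command("code_transformer", "CT-11", 120000, "test",
                         multilanguage=True, filter_language="python")
```

The returned lists can be handed to `subprocess.run(..., check=True)` from the repository root.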
Name in Paper | Run ID | Snapshot | Language | Hyperparameters |
---|---|---|---|---|
Single Language | ||||
GREAT (Python) | GT-1 | 350000 | Python | great_python.yaml |
GREAT (Javascript) | GT-2 | 60000 | JavaScript | great_javascript.yaml |
GREAT (Ruby) | GT-3 | 30000 | Ruby | great_ruby.yaml |
GREAT (Go) | GT-4 | 220000 | Go | great_go.yaml |
Ours w/o structure (Python) | XL-1 | 400000 | Python | xl_net_python.yaml |
Ours w/o structure (Javascript) | XL-2 | 260000 | JavaScript | xl_net_javascript.yaml |
Ours w/o structure (Ruby) | XL-3 | 60000 | Ruby | xl_net_ruby.yaml |
Ours w/o structure (Go) | XL-4 | 200000 | Go | xl_net_go.yaml |
Ours w/o pointer net (Python) | CT-1 | 280000 | Python | ct_no_pointer_python.yaml |
Ours w/o pointer net (Javascript) | CT-2 | 120000 | JavaScript | ct_no_pointer_javascript.yaml |
Ours w/o pointer net (Ruby) | CT-3 | 520000 | Ruby | ct_no_pointer_ruby.yaml |
Ours w/o pointer net (Go) | CT-4 | 320000 | Go | ct_no_pointer_go.yaml |
Ours (Python) | CT-5 | 500000 | Python | ct_python.yaml |
Ours (Javascript) | CT-6 | 90000 | JavaScript | ct_javascript.yaml |
Ours (Ruby) | CT-7 | 40000 | Ruby | ct_ruby.yaml |
Ours (Go) | CT-8 | 120000 | Go | ct_go.yaml |
Multilanguage Models | ||||
Great (Multilang.) | GT-5 | 320000 | Python, JavaScript, Ruby and Go | great_multilang.yaml |
Ours w/o structure (Mult.) | XL-5 | 420000 | Python, JavaScript, Ruby and Go | xl_net_multilang.yaml |
Ours w/o pointer (Mult.) | CT-9 | 570000 | Python, JavaScript, Ruby and Go | ct_no_pointer_multilang.yaml |
Ours (Multilanguage) | CT-10 | 650000 | Python, JavaScript, Ruby and Go | ct_multilang.yaml |
Mult. Pretraining | ||||
Ours (Mult. + Finetune Python) | CT-11 | 120000 | Python, JavaScript, Ruby and Go | ct_multilang.yaml → ct_multilang_python.yaml |
Ours (Mult. + Finetune Javascript) | CT-12 | 20000 | Python, JavaScript, Ruby and Go | ct_multilang.yaml → ct_multilang_javascript.yaml |
Ours (Mult. + Finetune Ruby) | CT-13 | 10000 | Python, JavaScript, Ruby and Go | ct_multilang.yaml → ct_multilang_ruby.yaml |
Ours (Mult. + Finetune Go) | CT-14 | 60000 | Python, JavaScript, Ruby and Go | ct_multilang.yaml → ct_multilang_go.yaml |
Ours (Mult. + LM Pretrain) | CT-15 | 280000 | Python, JavaScript, Ruby and Go | ct_multilang_lm.yaml → ct_multilang_lm_pretrain.yaml |
Name in Paper | Run ID | Snapshot | Language | Hyperparameters |
---|---|---|---|---|
Without Pointer Net | ||||
Ours w/o structure | XL-6 | 400000 | Java | xl_net_no_pointer_java_small.yaml |
Ours w/o context | CT-16 | 150000 | Java | ct_no_pointer_java_small_only_ast.yaml |
Ours | CT-17 | 410000 | Java | ct_no_pointer_java_small.yaml |
With Pointer Net | ||||
GREAT | GT-6 | 170000 | Java | great_java_small.yaml |
Ours w/o structure | XL-7 | 170000 | Java | xl_net_java_small.yaml |
Ours w/o context | CT-18 | 90000 | Java | ct_java_small_only_ast.yaml |
Ours | CT-19 | 250000 | Java | ct_java_small.yaml |
Ours + Pretrain | CT-20 | 30000 | Java | ct_java_pretrain_lm.yaml → ct_java_small_pretrain.yaml |
Name in Paper | Run ID | Snapshot | Language | Hyperparameters |
---|---|---|---|---|
Sibling Shortest Paths | CT-21 | 310000 | Java | ct_java_small_ablation_only_sibling_sp.yaml |
Ancestor Shortest Paths | CT-22 | 250000 | Java | ct_java_small_ablation_only_ancestor_sp.yaml |
Shortest Paths | CT-23 | 190000 | Java | ct_java_small_ablation_only_shortest_paths.yaml |
Personalized Page Rank | CT-24 | 210000 | Java | ct_java_small_ablation_only_ppr.yaml |
We also make all our trained models that are mentioned in the paper available for easy reproducibility of our results:
Name | Description | Models | Download |
---|---|---|---|
CSN Single Language | All models trained on one of the Python, JavaScript, Ruby or Go datasets | GT-[1-4], XL-[1-4], CT-[1-8] | csn-single-language-models.tar.gz |
CSN Multi-Language | All models trained on the multi-language dataset + Pretraining | GT-5, XL-5, CT-[9-15], CT-LM-2 | csn-multi-language-models.tar.gz |
code2seq | All models trained on the code2seq java-small dataset + Pretraining | XL-[6+7], GT-6, CT-[16-20], CT-LM-1 | code2seq-models.tar.gz |
Ablation | The models trained for ablation purposes on java-small | CT-[21-24] | ablation-models.tar.gz |
Once downloaded, you can test any of the above models in the interactive_prediction.ipynb notebook.