Accurate and efficient two-stage hashing code search based on GraphCodeBERT
Xin Huang
Email : [email protected]
WeChat: is_HuangXin
- Connect to Node 13
ssh our_gpu_server_ip_address -p 50013
- Environment
# conda virtual env
conda activate py37-1.7
- Dataset Path
# original processed dataset
# C++
/mnt/silver/wucai/code_search/src/target_new/c++
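- Dataset processing scripts
# Training data is built for a triplet loss: each query gets one positive and one negative sample
- cpp-negative-construct.py: randomly picks one non-repeating negative sample for each example in every dataset
- cpp-train-valid-test.py: splits the data into train, valid, and test sets by a fixed ratio
- graphcodebert-train-valid-test-negative-construct.py: constructs one negative sample for the codebase dataset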
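The two preprocessing steps above (random negative sampling, then a ratio-based split) can be sketched as follows. This is a minimal illustration, not the code of the scripts listed: the field names `docstring`, `code`, and `negative_code`, the 80/10/10 ratio, and the reading of "non-repeating" as "the negative differs from the example's own code" are all assumptions.

```python
import random


def add_random_negatives(examples, rng):
    """Attach one randomly chosen negative code snippet to every example.

    Assumption: a valid negative is any snippet whose code differs from
    the example's own (positive) code.
    """
    if len({ex["code"] for ex in examples}) < 2:
        raise ValueError("need at least two distinct code snippets")
    with_negatives = []
    for ex in examples:
        # Rejection sampling: redraw until the snippet differs from the positive.
        while True:
            candidate = rng.choice(examples)
            if candidate["code"] != ex["code"]:
                break
        with_negatives.append({**ex, "negative_code": candidate["code"]})
    return with_negatives


def split_dataset(examples, rng, ratios=(0.8, 0.1, 0.1)):
    """Shuffle a copy of the data and cut it into train, valid, and test parts."""
    shuffled = list(examples)
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * ratios[0])
    n_valid = int(len(shuffled) * ratios[1])
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]
    return train, valid, test


# Hypothetical toy data standing in for the processed C++ dataset.
rng = random.Random(123456)
toy = [{"docstring": f"query {i}", "code": f"int f{i}();"} for i in range(10)]
train, valid, test = split_dataset(add_random_negatives(toy, rng), rng)
```

Passing an explicit `random.Random` instance keeps the sampling and the split reproducible for a given seed.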
- Code Path
# Without Hashing and Two-Stage
/home/wanyao/huangxin/graphcodebert-cpp
# With Hashing and Two-Stage Faiss
/home/wanyao/huangxin/graphcodebert-cpp-hash
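The idea behind the hashing, two-stage variant can be sketched in plain Python: a cheap Hamming-distance pass over binary codes recalls a small candidate set, and the dense embeddings then re-rank only those candidates. This is an illustration under stated assumptions, not the repository's implementation: the sign-based `sign_hash` stands in for whatever hash layer the model learns, the linear scan stands in for the Faiss index mentioned above, and all function and parameter names are hypothetical.

```python
def sign_hash(vec):
    """Binarize a dense embedding: bit i is set when vec[i] > 0 (assumed hash)."""
    bits = 0
    for i, value in enumerate(vec):
        if value > 0:
            bits |= 1 << i
    return bits


def hamming(a, b):
    """Number of differing bits between two integer-encoded hash codes."""
    return bin(a ^ b).count("1")


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))


def two_stage_search(query_vec, code_vecs, code_hashes, recall_k=100, top_k=10):
    """Return indices of the top_k code snippets for a query embedding."""
    query_hash = sign_hash(query_vec)
    # Stage 1: recall the recall_k codes closest to the query in Hamming distance.
    candidates = sorted(
        range(len(code_hashes)),
        key=lambda i: hamming(query_hash, code_hashes[i]),
    )[:recall_k]
    # Stage 2: re-rank only the recalled candidates by dense inner product.
    reranked = sorted(
        candidates,
        key=lambda i: dot(query_vec, code_vecs[i]),
        reverse=True,
    )
    return reranked[:top_k]


# Hypothetical 3-dimensional embeddings; real ones come from the encoder.
code_vecs = [
    [1.0, 1.0, -1.0],
    [-1.0, -1.0, 1.0],
    [0.9, 0.2, -0.1],
    [1.0, -1.0, 1.0],
]
code_hashes = [sign_hash(v) for v in code_vecs]
result = two_stage_search([1.0, 0.5, -0.5], code_vecs, code_hashes, recall_k=2, top_k=2)
```

The trade-off sits in `recall_k`: a larger value makes stage 1 less likely to drop the true match, at the cost of more dense comparisons in stage 2.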
- How to Train
python /home/wanyao/huangxin/graphcodebert-cpp-hash/run.py \
    --output_dir=./saved_models/cpp-without-dfg-triple-loss-hash \
    --config_name=microsoft/graphcodebert-base \
    --model_name_or_path=microsoft/graphcodebert-base \
    --tokenizer_name=microsoft/graphcodebert-base \
    --lang=java \
    --do_train \
    --train_data_file=/home/wanyao/huangxin/graphcodebert-cpp-hash/dataset/cpp-dataset-split/train_with_negative.jsonl \
    --eval_data_file=/home/wanyao/huangxin/graphcodebert-cpp-hash/dataset/cpp-dataset-split/valid_with_negative.jsonl \
    --test_data_file=/home/wanyao/huangxin/graphcodebert-cpp-hash/dataset/cpp-dataset-split/test_with_negative.jsonl \
    --codebase_file=/home/wanyao/huangxin/graphcodebert-cpp-hash/dataset/cpp-dataset-split/codebase_with_negative.jsonl \
    --num_train_epochs 10 \
    --code_length 320 \
    --data_flow_length 32 \
    --nl_length 128 \
    --train_batch_size 16 \
    --eval_batch_size 64 \
    --learning_rate 2e-5 \
    --seed 123456
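The dataset notes and the output directory name indicate that training uses a triplet loss over (query, positive code, negative code) triples. A minimal sketch of such a loss on plain Python vectors is given below; the cosine similarity, the margin value of 0.5, and the function names are assumptions for illustration and may differ from what `run.py` actually computes.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    return sum(x * y for x, y in zip(u, v)) / (norm_u * norm_v)


def triplet_margin_loss(query, positive, negative, margin=0.5):
    """Hinge loss that is zero once the positive beats the negative by `margin`."""
    return max(0.0, margin - cosine(query, positive) + cosine(query, negative))


# Hypothetical 2-dimensional embeddings for one training triple.
easy = triplet_margin_loss([1.0, 0.0], [1.0, 0.0], [0.0, 1.0])
hard = triplet_margin_loss([1.0, 0.0], [0.0, 1.0], [1.0, 0.0])
```

In the `easy` triple the positive is already far more similar to the query than the negative, so the loss vanishes; in the `hard` triple the roles are reversed and the loss is large.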
- How to Run Inference
python /home/wanyao/huangxin/graphcodebert-cpp-hash/run.py \
    --output_dir=./saved_models/cpp-without-dfg-triple-loss-hash \
    --config_name=microsoft/graphcodebert-base \
    --model_name_or_path=microsoft/graphcodebert-base \
    --tokenizer_name=microsoft/graphcodebert-base \
    --lang=java \
    --do_demo \
    --train_data_file=/home/wanyao/huangxin/graphcodebert-cpp-hash/dataset/cpp-dataset-split/train_with_negative.jsonl \
    --eval_data_file=/home/wanyao/huangxin/graphcodebert-cpp-hash/dataset/cpp-dataset-split/valid_with_negative.jsonl \
    --test_data_file=/home/wanyao/huangxin/graphcodebert-cpp-hash/dataset/cpp-dataset-split/test_with_negative.jsonl \
    --codebase_file=/home/wanyao/huangxin/graphcodebert-cpp-hash/dataset/cpp-dataset-split/codebase_with_negative.jsonl \
    --num_train_epochs 10 \
    --code_length 320 \
    --data_flow_length 32 \
    --nl_length 128 \
    --train_batch_size 16 \
    --eval_batch_size 64 \
    --learning_rate 2e-5 \
    --seed 123456
- Deep Code Search
- CodeBERT: A Pre-Trained Model for Programming and Natural Languages
- GraphCodeBERT: Pre-training Code Representations with Data Flow
- Accelerating Code Search with Deep Hashing and Code Classification
- R2PS: Retriever and Ranker Framework with Probabilistic Hard Negative Sampling for Code Search (in folder './paper')
- Efficient Passage Retrieval with Hashing for Open-domain Question Answering
- The Stack: 3 TB of permissively licensed source code