
Implement hash-based fast code-to-code search with graphcodebert-based-code2code


isHuangXin/R2CC


GraphCodeBERT-based two-stage hash fast code search

This project implements accurate and efficient two-stage, hash-based code search on top of GraphCodeBERT.
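As a rough illustration of what "two-stage" hash search can look like, here is a minimal sketch in plain Python. It is not the code from this repository: the sign-based binarization, the function names, and the `recall_k` / `top_k` parameters are all assumptions made for the example, and a real setup would typically use Faiss and embeddings produced by the model rather than random vectors.

```python
import math
import random


def binarize(vec):
    """Pack the sign pattern of a real-valued vector into one integer (assumed hashing scheme)."""
    code = 0
    for x in vec:
        code = (code << 1) | (1 if x > 0 else 0)
    return code


def hamming(a, b):
    """Number of differing bits between two integer hash codes."""
    return bin(a ^ b).count("1")


def cosine(u, v):
    """Cosine similarity between two real-valued vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return dot / (nu * nv) if nu and nv else 0.0


def two_stage_search(query, codebase, recall_k=10, top_k=3):
    """Stage 1: cheap Hamming recall on hash codes. Stage 2: cosine re-ranking of the candidates."""
    q_code = binarize(query)
    # In practice the codebase hash codes would be computed once and cached.
    codes = [binarize(v) for v in codebase]
    candidates = sorted(range(len(codebase)), key=lambda i: hamming(q_code, codes[i]))[:recall_k]
    reranked = sorted(candidates, key=lambda i: cosine(query, codebase[i]), reverse=True)
    return reranked[:top_k]


if __name__ == "__main__":
    rng = random.Random(0)
    # Random vectors stand in for GraphCodeBERT embeddings.
    codebase = [[rng.gauss(0, 1) for _ in range(32)] for _ in range(100)]
    query = list(codebase[42])
    print(two_stage_search(query, codebase))
```

The design idea is that the Hamming stage only touches compact bit codes, so the more expensive real-valued comparison runs on a small candidate set instead of the whole codebase.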


If you have any questions, please contact me.

Xin Huang
Email : [email protected]
WeChat: is_HuangXin

Script

  • Connect to Node 13
ssh our_gpu_server_ip_address -p 50013
  • Environment
# conda virtual env
conda activate py37-1.7
  • Dataset Path
# original processed dataset
# C++
/mnt/silver/wucai/code_search/src/target_new/c++
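  • Dataset processing scripts
# Training data is built for a triplet loss: one positive and one negative per sample
- cpp-negative-construct.py randomly picks one non-duplicate negative sample for every dataset
- cpp-train-valid-test.py splits the data into train, valid, and test sets by a fixed ratio
- graphcodebert-train-valid-test-negative-construct.py constructs one negative sample for the codebase dataset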
  • Code Path
# Without Hashing and Two-Stage
/home/wanyao/huangxin/graphcodebert-cpp

# With Hashing and Two-Stage Faiss
/home/wanyao/huangxin/graphcodebert-cpp-hash
  • How to Train
python /home/wanyao/huangxin/graphcodebert-cpp-hash/run.py \
    --output_dir=./saved_models/cpp-without-dfg-triple-loss-hash \
    --config_name=microsoft/graphcodebert-base \
    --model_name_or_path=microsoft/graphcodebert-base \
    --tokenizer_name=microsoft/graphcodebert-base \
    --lang=java \
    --do_train \
    --train_data_file=/home/wanyao/huangxin/graphcodebert-cpp-hash/dataset/cpp-dataset-split/train_with_negative.jsonl \
    --eval_data_file=/home/wanyao/huangxin/graphcodebert-cpp-hash/dataset/cpp-dataset-split/valid_with_negative.jsonl \
    --test_data_file=/home/wanyao/huangxin/graphcodebert-cpp-hash/dataset/cpp-dataset-split/test_with_negative.jsonl \
    --codebase_file=/home/wanyao/huangxin/graphcodebert-cpp-hash/dataset/cpp-dataset-split/codebase_with_negative.jsonl \
    --num_train_epochs 10 \
    --code_length 320 \
    --data_flow_length 32 \
    --nl_length 128 \
    --train_batch_size 16 \
    --eval_batch_size 64 \
    --learning_rate 2e-5 \
    --seed 123456
  • How to Run Inference
python /home/wanyao/huangxin/graphcodebert-cpp-hash/run.py \
    --output_dir=./saved_models/cpp-without-dfg-triple-loss-hash \
    --config_name=microsoft/graphcodebert-base \
    --model_name_or_path=microsoft/graphcodebert-base \
    --tokenizer_name=microsoft/graphcodebert-base \
    --lang=java \
    --do_demo \
    --train_data_file=/home/wanyao/huangxin/graphcodebert-cpp-hash/dataset/cpp-dataset-split/train_with_negative.jsonl \
    --eval_data_file=/home/wanyao/huangxin/graphcodebert-cpp-hash/dataset/cpp-dataset-split/valid_with_negative.jsonl \
    --test_data_file=/home/wanyao/huangxin/graphcodebert-cpp-hash/dataset/cpp-dataset-split/test_with_negative.jsonl \
    --codebase_file=/home/wanyao/huangxin/graphcodebert-cpp-hash/dataset/cpp-dataset-split/codebase_with_negative.jsonl \
    --num_train_epochs 10 \
    --code_length 320 \
    --data_flow_length 32 \
    --nl_length 128 \
    --train_batch_size 16 \
    --eval_batch_size 64 \
    --learning_rate 2e-5 \
    --seed 123456
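
The dataset scripts listed under Script build triplet-style training data (one positive and one negative per sample) and split it by ratio. The sketch below shows one way that could be done; it is not the content of `cpp-negative-construct.py` or `cpp-train-valid-test.py`. The record fields `query` and `positive`, the function names, and the 8:1:1 ratio are hypothetical and chosen only for illustration.

```python
import random


def add_negatives(records, seed=0):
    """Attach one random negative to each record, never reusing the record's own positive.

    Assumes each record is a dict with hypothetical keys "query" and "positive".
    """
    if len(records) < 2:
        raise ValueError("need at least two records to draw a negative")
    rng = random.Random(seed)
    out = []
    for i, rec in enumerate(records):
        j = rng.randrange(len(records) - 1)
        if j >= i:
            j += 1  # shift past index i so a record is never its own negative
        out.append({**rec, "negative": records[j]["positive"]})
    return out


def split(records, ratios=(0.8, 0.1, 0.1), seed=0):
    """Shuffle and split into train, valid, test; the ratios here are an assumed example."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * ratios[0])
    n_valid = int(len(shuffled) * ratios[1])
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]
    return train, valid, test


if __name__ == "__main__":
    data = [{"query": f"q{i}", "positive": f"code{i}"} for i in range(10)]
    train, valid, test = split(add_negatives(data))
    print(len(train), len(valid), len(test))
```

Fixing the seed keeps the negatives and the split reproducible across runs, which makes results easier to compare.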

Dataset statistics


Result

Experiment Result


C++ Library

Reference


CPP Code Search Demo

  • Code Search Demo in CPP cpp_search_demo

  • Why 0.04s → 0.4s time_caluate

  • Why 0.04s → 0.4s: tokenization also takes time (tokenizer)

  • Chrome 0.37 seconds chrome
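
The notes above attribute the gap between 0.04s and 0.4s to tokenization time. One way to check where the time goes is to time each stage of the pipeline separately. The following is a minimal sketch; `timed_stages` and the stand-in stage functions are hypothetical and do not come from this repository. In a real measurement, the stages would be the actual tokenizer, the model forward pass, and the index lookup.

```python
import time


def timed_stages(stages, value):
    """Run (name, function) stages in order and record the wall-clock time of each."""
    timings = {}
    for name, fn in stages:
        start = time.perf_counter()
        value = fn(value)
        timings[name] = time.perf_counter() - start
    return value, timings


if __name__ == "__main__":
    # Stand-in stages; replace them with the real tokenizer, encoder, and search calls.
    stages = [
        ("tokenize", str.split),
        ("encode", lambda tokens: [len(t) for t in tokens]),
        ("search", lambda vec: sum(vec)),
    ]
    result, timings = timed_stages(stages, "int main ( ) { return 0 ; }")
    for name, seconds in timings.items():
        print(f"{name}: {seconds:.6f}s")
```

Reporting per-stage times makes it clear whether the search itself or the preprocessing around it dominates the end-to-end latency.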
