Code for ISSTA'23 paper "CONCORD: Clone-aware Contrastive Learning for Source Code"
conda create -n concord Python=3.8.12;
conda activate concord;
# install torch
conda install pytorch==1.10.1 torchvision==0.11.2 torchaudio==0.10.1 cudatoolkit=11.3 -c pytorch -c conda-forge;
# install apex
git clone https://github.com/NVIDIA/apex.git;
cd apex/;
git checkout feae3851a5449e092202a1c692d01e0124f977e4;
pip install -v --disable-pip-version-check --no-cache-dir ./;
cd ../;
# install pip packages
cd CONCORD_ISSTA23;
pip install -r requirements.txt;
export PYTHONPATH=$(pwd);
You can find the pre-trained weights of the main CONCORD model and the task-specific data here:
During CONCORD's pre-training, we need to align the syntax labels with the code tokens, which requires pre-processing the data. To avoid a distribution shift, the task-specific fine-tuning data must go through the same pre-processing, which consists of two steps:
- Parse the source code with Tree-sitter and tokenize the sequence following the grammar of the corresponding programming language.
- Sub-tokenize with the pre-trained BPE model.
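As a rough, self-contained illustration of these two steps (not the actual Tree-sitter grammar or the released SentencePiece model — the toy lexer and toy subword vocabulary below are stand-ins), the pipeline amounts to something like:

```python
import re

# Step 1 (stand-in for Tree-sitter): a toy lexer that splits C-like code
# into identifiers, numbers, and punctuation. The real pipeline walks the
# Tree-sitter parse tree of the corresponding language instead.
def toy_tokenize(code: str):
    return re.findall(r"[A-Za-z_]\w*|\d+|[^\s\w]", code)

# Step 2 (stand-in for the pre-trained BPE model): greedy longest-match
# sub-tokenization against a toy subword vocabulary. The real pipeline
# loads vocab/multilingual_5k_repo_50k_vocab.model with SentencePiece.
TOY_VOCAB = {"print", "f", "return"}

def toy_subtokenize(token: str):
    pieces, i = [], 0
    while i < len(token):
        for j in range(len(token), i, -1):  # try the longest match first
            if token[i:j] in TOY_VOCAB:
                pieces.append(token[i:j])
                i = j
                break
        else:
            pieces.append(token[i])  # fall back to a single character
            i += 1
    return pieces

tokens = toy_tokenize('printf("%d", x); return 0;')
subtokens = [p for t in tokens for p in toy_subtokenize(t)]
```

The real scripts perform the same token-then-subtoken decomposition, so that syntax labels computed on grammar-level tokens can be projected onto the BPE sub-tokens.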
You can refer to the following steps and customize your own data processing accordingly, or check our data samples (finetune_data.zip) for the expected format of the pre-processed code.
Build Tree-sitter
cd data_processing;
bash build_tree_sitter.sh;
Example-1: Process CodeXGLUE Defects Detection Data
python process_code.py \
--task_name cxg_vuldetect \
--input_file <path_to_CodeXGLUE>/Code-Code/Defect-detection/dataset/test.jsonl \
--output_file cxg_vd_test.csv \
--spm_model ../vocab/multilingual_5k_repo_50k_vocab.model
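Conceptually, this command reads the CodeXGLUE defect-detection jsonl and emits one CSV row per example. The sketch below is a hypothetical simplification: it assumes each jsonl line carries a "func" source string and a "target" label (as in the CodeXGLUE release), and assumes a simple (index, tokens, label) column layout — the actual schema is defined by process_code.py.

```python
import csv
import io
import json

# Hypothetical sketch of the defect-detection path in process_code.py:
# parse each jsonl line, tokenize the "func" field, and write one CSV row
# of (example index, space-joined tokens, "target" label).
def process_defect_jsonl(jsonl_text: str, tokenize):
    out = io.StringIO()
    writer = csv.writer(out)
    for i, line in enumerate(jsonl_text.splitlines()):
        ex = json.loads(line)
        tokens = tokenize(ex["func"])
        writer.writerow([i, " ".join(tokens), ex["target"]])
    return out.getvalue()

# Whitespace splitting stands in for the Tree-sitter + BPE tokenizer.
sample = '{"func": "int f() { return 0; }", "target": 0}'
csv_text = process_defect_jsonl(sample, str.split)
```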
Example-2: Process CodeXGLUE Clone Detection Data
python process_code.py \
--task_name cxg_clone \
--input_file <path_to_CodeXGLUE>/Code-Code/Clone-detection-POJ-104/dataset/test.jsonl \
--output_file cxg_clone_test.jsonl \
--spm_model ../vocab/multilingual_5k_repo_50k_vocab.model
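For the clone-detection (POJ-104) case, the input and output are both jsonl. As a hypothetical sketch: each input line is assumed to carry a "code" string plus label/index fields, and the output keeps the record but replaces the code with its pre-processed token sequence — field names and the exact output schema come from process_code.py and may differ.

```python
import json

# Hypothetical sketch of the clone-detection path in process_code.py:
# re-emit each jsonl record with its "code" field replaced by the
# space-joined pre-processed tokens, leaving other fields untouched.
def process_clone_jsonl(jsonl_text: str, tokenize):
    out_lines = []
    for line in jsonl_text.splitlines():
        ex = json.loads(line)
        ex["code"] = " ".join(tokenize(ex["code"]))
        out_lines.append(json.dumps(ex))
    return "\n".join(out_lines)
```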
Note that the process_code.py script only illustrates the basic data-processing steps using CodeXGLUE raw data; you may need to customize it for your specific goal.