In this work, we propose CODESCRIBE, which models the hierarchical syntax structure of code by introducing a novel triplet position for code summarization. Specifically, CODESCRIBE leverages a graph neural network and a Transformer to preserve the structural and sequential information of code, respectively. In addition, we propose a pointer-generator network that attends to both the structure and the sequential tokens of code for better summary generation. Experiments on two real-world datasets in Java and Python demonstrate the effectiveness of our approach compared with several state-of-the-art baselines.
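For intuition, below is a minimal sketch of the standard pointer-generator mixture at one decoding step (in the spirit of See et al., 2017). It is not CODESCRIBE's exact implementation (CODESCRIBE attends over both graph nodes and token sequences), and all tensor names here are illustrative:

```python
import torch
import torch.nn.functional as F

def pointer_generator_step(vocab_logits, attn_weights, src_token_ids, p_gen):
    """One decoding step of a pointer-generator mixture.

    vocab_logits:  (batch, vocab_size) decoder scores over the vocabulary
    attn_weights:  (batch, src_len)    attention over source positions (sums to 1)
    src_token_ids: (batch, src_len)    vocabulary ids of the source tokens
    p_gen:         (batch, 1)          generation probability in [0, 1]
    """
    vocab_dist = F.softmax(vocab_logits, dim=-1)      # P_vocab(w): generate
    copy_dist = torch.zeros_like(vocab_dist)
    # P_copy(w): total attention mass on source positions holding token w.
    copy_dist.scatter_add_(1, src_token_ids, attn_weights)
    # Mix generation and copying into the final output distribution.
    return p_gen * vocab_dist + (1.0 - p_gen) * copy_dist
```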
- 4 NVIDIA 2080 Ti GPUs
- Ubuntu 16.04
- CUDA 10.2 (with the corresponding cuDNN version)
- Anaconda
- Python 3.9 (base environment)
- Python 2.7 (in a virtual environment named python27)
- PyTorch 1.9 for Python 3.9
- PyTorch Geometric 1.7 for Python 3.9
- Specifically, install our package with `pip install my-lib-0.0.6.tar.gz` in both the Python 3.9 and Python 2.7 environments. The package can be downloaded from Baidu Netdisk or Google Drive.
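Before preprocessing, it may help to confirm that the Python 3.9 environment matches the versions above. A minimal sanity check (not part of the authors' scripts):

```python
import torch
import torch_geometric

print("PyTorch:", torch.__version__)            # expect 1.9.x
print("PyG:", torch_geometric.__version__)      # expect 1.7.x
print("CUDA available:", torch.cuda.is_available())
print("GPU count:", torch.cuda.device_count())  # 4 on the setup listed above
```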
The whole Python and Java datasets can be downloaded from Baidu Netdisk or Google Drive.
Note that we provide 100 samples for each of the train/valid/test sets in the directories data/python/raw_data/ and data/java/raw_data/. To run on the whole datasets, replace these samples with the downloaded data files.
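To check that the downloaded files have actually replaced the bundled samples, you can list the raw-data directories (a simple verification snippet, not an official script):

```python
from pathlib import Path

# Print each raw-data file and its size; the bundled 100-sample
# placeholders should be noticeably smaller than the full datasets.
for lang in ("python", "java"):
    for f in sorted((Path("data") / lang / "raw_data").iterdir()):
        print(f, f.stat().st_size, "bytes")
```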
- Step into the directory src_code/python/: `cd src_code/python`
- Preprocess the train/valid/test data (the second stage runs under Python 2.7; see the one-shot driver sketch after this list):
  - `python s1_preprocessor.py`
  - `conda activate python27`
  - `python s2_preprocessor_py27.py`
  - `conda deactivate`
  - `python s3_preprocessor.py`
- Run the model for training and testing: `python s4_model.py`
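If you prefer not to activate and deactivate environments by hand, the three preprocessing stages can be driven from one script with `conda run` (a convenience sketch, assuming the python27 environment from the requirements above):

```python
import subprocess

# Stages 1 and 3 run in the base Python 3.9 environment;
# stage 2 runs inside the python27 conda environment via `conda run`.
subprocess.run(["python", "s1_preprocessor.py"], check=True)
subprocess.run(["conda", "run", "-n", "python27",
                "python", "s2_preprocessor_py27.py"], check=True)
subprocess.run(["python", "s3_preprocessor.py"], check=True)
```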
After running s4_model.py, the performance will be printed to the console, and the predicted results on the test data will be saved to data/python/result/result.json, with the ground truth and code included for comparison.
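To eyeball individual predictions against their references, the result file can be loaded with the standard json module. The field names below are guesses based on the description above, not a documented schema:

```python
import json

with open("data/python/result/result.json") as f:
    results = json.load(f)

# NOTE: "code", "ground_truth", and "prediction" are assumed key names;
# check the actual keys in result.json before relying on them.
for item in list(results)[:3]:
    print("code:      ", item.get("code"))
    print("reference: ", item.get("ground_truth"))
    print("prediction:", item.get("prediction"))
    print("-" * 40)
```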
We have provided the results on the test dataset, so you can get the evaluation results directly by running `python s5_eval_res.py`.
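The metrics reported by s5_eval_res.py are not listed here; code-summarization work commonly reports BLEU, METEOR, and ROUGE-L. As an illustrative reference (not the repository's evaluation code), corpus-level BLEU can be computed with NLTK:

```python
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# Toy example: one reference (a list of alternatives) per hypothesis.
references = [[["returns", "the", "sum", "of", "two", "numbers"]]]
hypotheses = [["return", "the", "sum", "of", "two", "numbers"]]

smooth = SmoothingFunction().method4  # smoothing for short sentences
print("BLEU-4:", corpus_bleu(references, hypotheses, smoothing_function=smooth))
```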
Note that:
- All the parameters are set in src_code/python/config.py and src_code/python/config_py27.py.
- If the model has already been trained, you can set the parameter "train_mode" in line 83 of config.py to "False". Then you can predict on the test data directly using the model saved in data/python/model/.
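If you want to reload the saved model outside s4_model.py, PyTorch's standard serialization should apply, though the checkpoint file name and format depend on how s4_model.py saves it (an assumption-heavy sketch; the file name model.pt and the state-dict format are guesses):

```python
import torch

# Hypothetical checkpoint name; check data/python/model/ for the real file.
state = torch.load("data/python/model/model.pt", map_location="cpu")

# If s4_model.py saved a state_dict, it must be loaded into a model of the
# same architecture before prediction, e.g. (constructor is hypothetical):
# model = CodeScribe(**config)
# model.load_state_dict(state)
# model.eval()
```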
- Step into the directory src_code/java/: `cd src_code/java`
- Preprocess the train/valid/test data: `python s1_preprocessor.py`
- Run the model for training and testing: `python s2_model.py`
After running s2_model.py, the performance will be printed to the console, and the predicted results on the test data will be saved to data/java/result/result.json, with the ground truth and code included for comparison.
We have provided the results on the test dataset, so you can get the evaluation results directly by running `python s3_eval_res.py`.
Note that:
- All the parameters are set in src_code/java/config.py.
- If the model has already been trained, you can set the parameter "train_mode" in line 113 of config.py to "False". Then you can predict on the test data directly using the model saved in data/java/model/.
- The main parameter settings for CODESCRIBE are as follows:
- The per-epoch training time and memory usage are as follows:
We provide part of our experimental results below.
- Comparison with state-of-the-arts.
- Qualitative examples.