This repository contains the evaluation scripts and detailed results of our paper 'Memorization and Generalization in Neural Code Intelligence Models', accepted at the Information and Software Technology (IST) journal, Elsevier, 2022.
```
├── code/
│   ├── noise_generation    # adding input/output noise to data (sketch below)
│   └── model_prediction    # logging predictions by models (sketch below)
├── data/
│   ├── logs                # predictions on valid samples
│   └── summary             # summary results and plots
└── others/                 # other files
```
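As a rough illustration of the `noise_generation` step, here is a minimal sketch of output-label noise (the function name, data format, and noise scheme are assumptions for illustration, not the repository's actual scripts), which corrupts a fraction of target labels with randomly chosen alternatives:

```python
import random

def add_label_noise(examples, labels, noise_rate, seed=0):
    """Hypothetical helper: replace the target label of a random
    fraction of (input, label) pairs with a different label drawn
    uniformly from the remaining label set."""
    rng = random.Random(seed)
    noisy = []
    for inputs, label in examples:
        if rng.random() < noise_rate:
            alternatives = [l for l in labels if l != label]
            if alternatives:
                label = rng.choice(alternatives)
        noisy.append((inputs, label))
    return noisy

# Example: corrupt 20% of method-name labels.
data = [("void f() { /* ... */ }", "sort"), ("int g() { /* ... */ }", "search")]
label_set = sorted({lbl for _, lbl in data})
noisy_data = add_label_noise(data, label_set, noise_rate=0.2)
```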
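Similarly, a minimal sketch of the `model_prediction` step (the `model.predict` call and the JSONL output format are assumptions, not the repository's actual interface): recording target vs. prediction per sample is what allows clean and noisy runs to be compared afterwards.

```python
import json

def log_predictions(model, dataset, out_path):
    """Hypothetical helper: run the model on every sample and record
    target vs. prediction so per-sample (mis)matches can be compared
    between clean and noisy training runs."""
    with open(out_path, "w") as f:
        for sample_id, (inputs, target) in enumerate(dataset):
            pred = model.predict(inputs)  # assumed model API
            f.write(json.dumps({
                "id": sample_id,
                "target": target,
                "prediction": pred,
                "correct": pred == target,
            }) + "\n")
```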
*Figure: Workflow for evaluating memorization and generalization of models.*
Method Name Prediction (MethodName):
- Models: code2vec & code2seq
- Datasets: Java-Small, Java-Med, and Java-Large (Top10)

Variable Misuse (VarMisuse):
- Models: Transformer, GGNN, and Great
- Dataset: Py150

Code-to-Text Generation (Code-to-Text):
- Model: CodeBERT
- Dataset: CodeSearchNet (cleaned)

Natural Language Code Search (CodeSearch):
- Model: CodeBERT
- Dataset: CodeSearchNet (preprocessed)
For the Great variable misuse experiments, check the following repositories from @AftabHussain:
- https://github.com/AftabHussain/noise-gen_great-varmisuse
- https://github.com/AftabHussain/save-trainstats_great-varmisuse
If you use our work, please cite our paper, *Memorization and Generalization in Neural Code Intelligence Models*:
```bibtex
@article{rabin2023memorization,
  title   = {Memorization and generalization in neural code intelligence models},
  author  = {Md Rafiqul Islam Rabin and Aftab Hussain and Mohammad Amin Alipour and Vincent J. Hellendoorn},
  journal = {Information and Software Technology},
  volume  = {153},
  pages   = {107066},
  year    = {2023},
  issn    = {0950-5849},
  doi     = {10.1016/j.infsof.2022.107066},
  url     = {https://www.sciencedirect.com/science/article/pii/S0950584922001756}
}
```