Group scientific article citations (references) using sentence-transformers embeddings + agglomerative clustering so that each cluster contains citations referring to the same article/book/report/etc.
Pipeline:
- For each citation, compute embedding vector using Huggingface sentence-transformers. Default model is
Snowflake/snowflake-arctic-embed-l
- Cluster the embeddings using scikit-learn's agglomerative clustering implementation with cosine distance
Number of clusters is controlled using distance_threshold
(between 0.0 and 1.0). Small values of distance_threshold
(e.g. 0.05) result in high clustering precision (references in same cluster are likely to refer to same article). Higher values of distance_threshold (e.g. 0.2) result in high clustering recall (references to same article are very likely assigned to same cluster). Default value is distance_threshold = 0.05
.
Input citations:
De Matos C.A., Rossi C.A.V., Word-of-mouth communications in marketing: A meta-analytic review of the antecedents and moderators, Journal of the Academy of Marketing Science, 36, 4, pp. 578-596, (2008)
Matos C.A.D., Rossi C.A.V., Word-of-mouth communications in marketing: a meta-analytic review of the antecedents and moderators, J. Acad. Market. Sci., 36, 4, pp. 578-596, (2008)
Eisingerich A.B., Bell S.J., Maintaining customer relationships in high credence services, Journal of Services Marketing, 21, 4, pp. 253-262, (2007)
Eisingerich A., Bell S., Maintaining Customer Relationships in High Credence Services, Journal of Service Marketing, 21, 4, pp. 253-262, (2007)
Eisingerich A., Bell S., Maintaining customer relationships in high credence services, Journal of Service Marketing, 21, 4, pp. 253-262, (2007)
Eisingerich A., Bell S., Maintaining customer relationships in high credence services, J. Serv. Market., 21, 4, pp. 253-262, (2007)
Hair J.F., Black W.C., Babin B.J., Anderson R.E., Multivariate Data Analysis, (2019)
Resulting clusters:
Cluster 1:
De Matos C.A., Rossi C.A.V., Word-of-mouth communications in marketing: A meta-analytic review of the antecedents and moderators, Journal of the Academy of Marketing Science, 36, 4, pp. 578-596, (2008)
Matos C.A.D., Rossi C.A.V., Word-of-mouth communications in marketing: a meta-analytic review of the antecedents and moderators, J. Acad. Market. Sci., 36, 4, pp. 578-596, (2008)
Cluster 2:
Eisingerich A.B., Bell S.J., Maintaining customer relationships in high credence services, Journal of Services Marketing, 21, 4, pp. 253-262, (2007)
Eisingerich A., Bell S., Maintaining Customer Relationships in High Credence Services, Journal of Service Marketing, 21, 4, pp. 253-262, (2007)
Eisingerich A., Bell S., Maintaining customer relationships in high credence services, Journal of Service Marketing, 21, 4, pp. 253-262, (2007)
Eisingerich A., Bell S., Maintaining customer relationships in high credence services, J. Serv. Market., 21, 4, pp. 253-262, (2007)
Cluster 3:
Hair J.F., Black W.C., Babin B.J., Anderson R.E., Multivariate Data Analysis, (2019)
Setup on Aalto Triton.
Clone this repo and change directory with
cd $WRKDIR
git clone https://github.com/AaltoRSE/citation-grouping.git
cd citation-grouping
Create and activate conda environment with GPU
module load mamba
mamba env create -f env-gpu.yml -p env-gpu/
source activate env-gpu/
NOTE: The GPU will be used to speed up embedding computation. The agglomerative clustering will always be run on CPUs.
Clone this repo and change directory with
cd $WRKDIR
git clone https://github.com/AaltoRSE/citation-grouping.git
cd citation-grouping
Create and activate conda environment with CPUs only
module load mamba
mamba env create -f env-cpu.yml -p env-cpu/
source activate env-cpu/
Run using GPUs
cd $WRKDIR/citation-grouping
sbatch submit_gpu.sh /path/to/inputfile
where inputfile
is a CSV file with column "References"
Results are written to (check the script output)
/path/to/data/results/
including the final result file which is the original CSV file with added "References augmented" column and intermediate files.
The optional probability threshold distance_threshold
(default is 0.05) can be tweaked inside the submit_gpu.sh
file.
Run using CPUs only
module load mamba
module load model-huggingface
cd $WRKDIR/citation-grouping
sbatch submit_cpu.sh /path/to/inputfile
where inputfile
is a CSV file with column "References".
Results are written to (check the script output)
/path/to/data/results/
including the final result file which is the original CSV file with added "References augmented" column and intermediate files.
The optional probability threshold distance_threshold
(default is 0.05) can be tweaked inside the submit_cpu.sh
file.
Agglomerative clustering uses O(N^2) memory but the implementation does a "preclustering" phase which reduces the memory requirements significantly if most of the references comprise a singleton cluster (which they do).
One could also use other, less memory intensive clustering methods but agglomerative works well and is easy to interpret.