This repository contains the datasets and source code used in our paper The Effect of Metadata on Scientific Literature Tagging: A Cross-Field Cross-Model Study.
- Datasets
- Additional Datasets with MeSH Labels
- Running Parabel
- Running Transformer
- Running OAG-BERT
- References
The MAPLE benchmark constructed by us contains 20 datasets across 19 fields for scientific literature tagging. You can download the datasets from HERE. Once you unzip the downloaded file, you can see a folder MAPLE/
. Please put the folder under the main directory ./
of this code repository.
There are 23 folders under MAPLE/
, corresponding to 23 datasets. 20 of them with MAG labels are mentioned in the main text of our paper; the other 3 datasets with MeSH labels will be introduced in the next section. Statistics of the 20 "main" datasets are as follows:
Folder | Field | #Papers | #Labels | #Venues | #Authors | #References |
---|---|---|---|---|---|---|
Art |
Art | 58,373 | 1,990 | 98 | 54,802 | 115,343 |
Philosophy |
Philosophy | 59,296 | 3,758 | 98 | 36,619 | 198,010 |
Geography |
Geography | 73,883 | 3,285 | 98 | 157,423 | 884,632 |
Business |
Business | 84,858 | 2,392 | 97 | 100,525 | 685,034 |
Sociology |
Sociology | 90,208 | 1,935 | 98 | 85,793 | 842,561 |
History |
History | 113,147 | 2,689 | 99 | 84,529 | 284,739 |
Political_Science |
Political Science | 115,291 | 4,990 | 98 | 93,393 | 480,136 |
Environmental_Science |
Environmental Science | 123,945 | 694 | 100 | 265,728 | 1,217,268 |
Economics |
Economics | 178,670 | 5,205 | 97 | 135,247 | 1,042,253 |
CSRankings |
Computer Science (Conference) | 263,393 | 13,613 | 75 | 331,582 | 1,084,440 |
Engineering |
Engineering | 270,006 | 10,683 | 100 | 430,046 | 1,867,276 |
Psychology |
Psychology | 372,954 | 7,641 | 100 | 460,123 | 2,313,701 |
Computer_Science |
Computer Science (Journal) | 410,603 | 15,540 | 96 | 634,506 | 2,751,996 |
Geology |
Geology | 431,834 | 7,883 | 100 | 471,216 | 1,753,762 |
Mathematics |
Mathematics | 490,551 | 14,271 | 98 | 404,066 | 2,150,584 |
Materials_Science |
Materials Science | 1,337,731 | 6,802 | 99 | 1,904,549 | 5,457,773 |
Physics |
Physics | 1,369,983 | 16,664 | 91 | 1,392,070 | 3,641,761 |
Biology |
Biology | 1,588,778 | 64,267 | 100 | 2,730,547 | 7,086,131 |
Chemistry |
Chemistry | 1,849,956 | 35,538 | 100 | 2,721,253 | 8,637,438 |
Medicine |
Medicine | 2,646,105 | 36,619 | 100 | 4,345,385 | 7,405,779 |
In each folder (e.g., Art/
), you can see four files: authors.txt
, labels.txt
, papers.json
, and venues.txt
.
authors.txt
has 3 columns: author id, normalized author name, and original author name:
12035 stephen rickerby Stephen Rickerby
127649 clementine deliss Clementine Deliss
1395514 tomas garciasalgado Tomás García-Salgado
...
venues.txt
has 3 columns: venue id, normalized venue name, and original venue name:
26308392 the journal of aesthetics and art criticism The Journal of Aesthetics and Art Criticism
93676754 modern language review Modern Language Review
998751717 classical world Classical World
...
labels.txt
has 3 columns: label id, label name, and depth of the label (1-5, with 1 being the coarsest and 5 being the finest):
2780583484 papyrus 2
2778949450 scientific writing 2
2780412351 purgatory 2
...
papers.json
has text and metadata information of each paper. Each line is a json record representing one paper. For example,
{
"paper": "2333162778",
"venue": "103229351",
"year": "1987",
"title": "the life and unusual ideas of adelbert ames jr",
"label": [
"554144382", "153349607"
],
"author": [
"2162173344"
],
"reference": [
"132232344", "378964350", "562124327", ...
],
"abstract": "this paper is a summary of the life and major achievements of adelbert ames jr an american ...",
"title_raw": "The Life and Unusual Ideas of Adelbert Ames, Jr.",
"abstract_raw": "This paper is a summary of the life and major achievements of Adelbert Ames, Jr., an American ..."
}
The three additional datasets: Biology_MeSH
, Chemistry_MeSH
, and Medicine_MeSH
are constructed from Biology
, Chemistry
, and Medicine
, respectively, by obtaining the MeSH labels of each paper (and removing those papers without MeSH labels).
Folder | Field | #Papers | #Labels | #Venues | #Authors | #References |
---|---|---|---|---|---|---|
Biology_MeSH |
Biology-MeSH | 1,379,393 | 25,039 | 100 | 2,486,814 | 6,876,739 |
Chemistry_MeSH |
Chemistry-MeSH | 762,129 | 21,585 | 87 | 1,498,358 | 5,928,908 |
Medicine_MeSH |
Medicine-MeSH | 1,536,660 | 25,188 | 100 | 2,791,165 | 7,190,021 |
In each folder (e.g., Biology_MeSH/
), you can see five files: authors.txt
, labels.txt
, labels_mesh.txt
, papers.json
, and venues.txt
.
authors.txt
and venues.txt
have the same format as in the 20 "main" datasets.
labels.txt
has 2 columns: MeSH label id and original MeSH label name:
D000818 Animals
D001824 Body Constitution
D005075 Biological Evolution
...
labels_mesh.txt
has >=2 columns: MeSH label id, normalized MeSH label name, and all entry terms (i.e., synonyms) of the MeSH label:
D000818 animals animalia
D001824 body constitution body constitutions constitution body constitutions body
D005075 biological evolution evolution biological
...
papers.json
has the same format as in the 20 "main" datasets. The only difference is that the "label" field now contains all MeSH labels of the paper. For example,
{
"paper": "1816482797",
"venue": "166515463",
"year": "2015",
"title": "proteins linked to autosomal dominant and autosomal recessive disorders harbor characteristic ...",
"label": [
"D005810", "D005808", "D020125", "D019295", "D030541", ...
],
"author": [
"2303839782", "2953263946", "2160643821", ...
],
"reference": [
"80748578", "1563940013", "1570281893", ...
],
"abstract": "the role of rare missense variants in disease causation remains difficult to interpret ...",
"title_raw": "Proteins linked to autosomal dominant and autosomal recessive disorders harbor characteristic ...",
"abstract_raw": "The role of rare missense variants in disease causation remains difficult to interpret ..."
}
The code of Parabel is written in C++. It is adapted from the original implementation by Prabhu et al. You need to run the following script.
cd ./Parabel/
./run.sh
P@k and NDCG@k scores (k=1,3,5) will be shown in the last several lines of the output as well as in ./Parabel/scores.txt
. The prediction results can be found in ./Parabel/Sandbox/Results/{dataset}/score_mat.txt
.
GPUs are required. We use one NVIDIA GeForce GTX 1080 Ti GPU in our experiments.
The code of Transformer is written in Python 3.6. It is adapted from the original implementation by Xun et al. You need to first install the dependencies like this:
cd ./Transformer/
pip3 install -r requirements.txt
Then, you need to download the GloVe embeddings (originally from here). Once you unzip the downloaded file, please put it (i.e., the data/
folder) under ./Transformer/
. Then, you can run the code.
./run.sh
P@k and NDCG@k scores (k=1,3,5) will be shown in the last several lines of the output as well as in ./Transformer/scores.txt
. The prediction results can be found in ./Transformer/predictions.txt
.
GPUs are required. We use one NVIDIA GeForce GTX 1080 Ti GPU in our experiments.
The code of OAG-BERT is written in Python 3.7. It is adapted from the original implementation by Liu et al. You need to first install PyTorch >= 1.7.1, and then the CogDL package. These two steps can be done by running the following:
cd ./OAGBERT/
./setup.sh
Then, you can run the code.
./run.sh
P@k and NDCG@k scores (k=1,3,5) will be shown in the last several lines of the output as well as in ./OAGBERT/Parabel/scores.txt
. The prediction results can be found in ./OAGBERT/Parabel/Sandbox/Results/{dataset}/score_mat.txt
.
If you find the MAPLE benchmark or this repository useful, please cite our paper:
@inproceedings{zhang2023effect,
title={The effect of metadata on scientific literature tagging: A cross-field cross-model study},
author={Zhang, Yu and Jin, Bowen and Zhu, Qi and Meng, Yu and Han, Jiawei},
booktitle={WWW'23},
year={2023}
}
The MAPLE benchmark is constructed from the Microsoft Academic Graph:
@inproceedings{sinha2015overview,
title={An overview of microsoft academic service (mas) and applications},
author={Sinha, Arnab and Shen, Zhihong and Song, Yang and Ma, Hao and Eide, Darrin and Hsu, Bo-June and Wang, Kuansan},
booktitle={WWW'15},
pages={243--246},
year={2015}
}
The three classifiers in this repository are from the following three papers:
@inproceedings{prabhu2018parabel,
title={Parabel: Partitioned label trees for extreme classification with application to dynamic search advertising},
author={Prabhu, Yashoteja and Kag, Anil and Harsola, Shrutendra and Agrawal, Rahul and Varma, Manik},
booktitle={WWW'18},
pages={993--1002},
year={2018}
}
@inproceedings{xun2020correlation,
title={Correlation networks for extreme multi-label text classification},
author={Xun, Guangxu and Jha, Kishlay and Sun, Jianhui and Zhang, Aidong},
booktitle={KDD'20},
pages={1074--1082},
year={2020}
}
@inproceedings{liu2022oag,
title={Oag-bert: Towards a unified backbone language model for academic knowledge services},
author={Liu, Xiao and Yin, Da and Zheng, Jingnan and Zhang, Xingjian and Zhang, Peng and Yang, Hongxia and Dong, Yuxiao and Tang, Jie},
booktitle={KDD'22},
pages={3418--3428},
year={2022}
}