Error Analysis of using BART for Multi-Document Summarization: A Study for English and German Language
Authors: Timo Johner, Abhik Jana, Chris Biemann
Language Technology Group, Dept. of Informatics, Universität Hamburg, Germany
Paper: Link
Accepted at the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa 2021), held from May 31st to June 2nd, 2021. Published in the NEALT Proceedings Series by Linköping University Electronic Press and in the ACL Anthology.
This repository documents the implementation of the approach proposed in the paper.
name | language | topics | type | paper | source |
---|---|---|---|---|---|
CNN/DailyMail | en | 311,971 | single-document | Link | retrieved from here |
Multi-News | en | 56,216 | multi-document | Link | adaptation for BART by Hokamp et al. (2020) |
auto-hMDS | de | 2,100 | multi-document | Link | not publicly available; can be reproduced here |
The checkpoint of the BART model fine-tuned on the German auto-hMDS dataset can be downloaded here. It can be used to reproduce our results with the following setup.
We used the BART implementation from the fairseq library. More information can be found here.
For fine-tuning on the three datasets (see above), we used the following parameters:
CUDA_VISIBLE_DEVICES=1 fairseq-train hMDS_2-bin \
--restore-file bart.large/model.pt \
--max-tokens 1024 \
--task translation \
--source-lang source --target-lang target \
--truncate-source \
--layernorm-embedding \
--share-all-embeddings \
--share-decoder-input-output-embed \
--reset-optimizer --reset-dataloader --reset-meters \
--required-batch-size-multiple 1 \
--arch bart_large \
--criterion label_smoothed_cross_entropy \
--label-smoothing 0.1 \
--dropout 0.1 --attention-dropout 0.1 \
--weight-decay 0.01 --optimizer adam --adam-betas "(0.9, 0.999)" --adam-eps 1e-08 \
--clip-norm 0.1 \
--lr-scheduler polynomial_decay --lr 3e-05 \
--update-freq 1 \
--skip-invalid-size-inputs-valid-test \
--find-unused-parameters;
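Once a checkpoint is available (either the downloaded one or one produced by the command above), summaries can be generated through fairseq's Python interface. The following is a minimal sketch, not the exact script used for the paper. It assumes that fairseq and PyTorch are installed, that the checkpoint sits in a hypothetical directory `checkpoints/` under the hypothetical file name `checkpoint_best.pt`, and that each input string is one already-concatenated source line in the same format as the training data. The decoding parameters (`beam`, `lenpen`, `max_len_b`, `min_len`, `no_repeat_ngram_size`) are taken from fairseq's CNN/DailyMail summarization example and are an assumption here, not the values reported in the paper.

```python
from typing import Iterator, List


def batched(lines: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `lines` with at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    for start in range(0, len(lines), batch_size):
        yield lines[start:start + batch_size]


def summarize(
    sources: List[str],
    checkpoint_dir: str,
    checkpoint_file: str,
    data_dir: str,
    batch_size: int = 8,
) -> List[str]:
    """Generate one summary per source line with a fine-tuned BART checkpoint."""
    # Imported lazily so that `batched` stays usable without fairseq installed.
    import torch
    from fairseq.models.bart import BARTModel

    bart = BARTModel.from_pretrained(
        checkpoint_dir,
        checkpoint_file=checkpoint_file,
        data_name_or_path=data_dir,  # binarized data directory holding the dictionaries
    )
    bart.eval()
    if torch.cuda.is_available():
        bart.cuda()

    summaries: List[str] = []
    with torch.no_grad():
        for batch in batched(sources, batch_size):
            # Decoding settings are assumptions borrowed from fairseq's
            # CNN/DailyMail example, not values confirmed by the paper.
            summaries.extend(
                bart.sample(
                    batch,
                    beam=4,
                    lenpen=2.0,
                    max_len_b=140,
                    min_len=55,
                    no_repeat_ngram_size=3,
                )
            )
    return summaries


# Example call (paths are hypothetical; `hMDS_2-bin` is the data directory
# used in the fine-tuning command above):
# summaries = summarize(
#     ["<concatenated source documents>"],
#     checkpoint_dir="checkpoints/",
#     checkpoint_file="checkpoint_best.pt",
#     data_dir="hMDS_2-bin",
# )
```

The model is loaded once and reused across batches, since loading a `bart.large` checkpoint dominates the runtime for small inputs. The batch size may need to be lowered on GPUs with limited memory.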
If you find this paper interesting, please cite:
@inproceedings{johner-etal-2021-error,
title = "Error Analysis of using {BART} for Multi-Document Summarization: A Study for {E}nglish and {G}erman Language",
author = "Johner, Timo and Jana, Abhik and Biemann, Chris",
booktitle = "Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa)",
month = may # " 31--2 " # jun,
year = "2021",
address = "Reykjavik, Iceland (Online)",
publisher = {Link{\"o}ping University Electronic Press, Sweden},
url = "https://aclanthology.org/2021.nodalida-main.43",
pages = "391--397",
}