ACE-2005 Translation and Annotation Alignment Pipeline

This pipeline translates the ACE 2005 corpus into Portuguese and aligns the translated annotations with the corresponding text. Additionally, it can be easily adapted for translating the ACE-2005 corpus into other languages.

Overview

This repository contains a Python pipeline that machine-translates the ACE 2005 dataset into Portuguese and then aligns the translated annotations with the translated text. It was developed for Portuguese but can be adapted to other languages with little effort.

It is composed of two main components:

- Text translation: We used DeepL Translator and Google Translator to translate the ACE-2005 texts and annotations into European and Brazilian Portuguese.
- Annotation alignment: We developed an alignment pipeline that locates each translated annotation within the translated text.

Prerequisites

Prepare ACE 2005 dataset

Download: https://catalog.ldc.upenn.edu/LDC2006T06 (note that the ACE 2005 dataset is not freely accessible).
ACE-2005 Pre-processing

We adopted a commonly used ACE-2005 pre-processing approach that can be found in this repository.

Install the packages. Optionally, create a Python virtual environment first:

python3 -m venv myenv

Activate it on Windows:

myenv\Scripts\activate

Activate it on Linux or macOS:

source myenv/bin/activate

Install the Python requirements:

pip install -r ./src/requirements.txt

Annotation Alignment Modules

Our pipeline is composed of a total of five annotation alignment components:

- Lemmatization
- Multiple word translation
- Synonyms
- BERT-based word aligner
- Fuzzy Match (Gestalt Pattern Matching and Levenshtein distance)
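To illustrate the last component, here is a minimal sketch of fuzzy matching a translated annotation against the tokens of a translated sentence. It is not the repository's implementation: the function names (`levenshtein`, `gestalt_score`, `levenshtein_score`, `fuzzy_find`), the whitespace tokenization, and the thresholds are all assumptions made for this example. Gestalt pattern matching is taken from Python's standard `difflib.SequenceMatcher`, which is based on the Ratcliff/Obershelp approach.

```python
from difflib import SequenceMatcher


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def gestalt_score(a: str, b: str) -> float:
    """Similarity in [0, 1] from difflib's gestalt pattern matching."""
    return SequenceMatcher(None, a, b).ratio()


def levenshtein_score(a: str, b: str) -> float:
    """Similarity in [0, 1] derived from the normalized edit distance."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def fuzzy_find(annotation, sentence, scorer=gestalt_score, threshold=0.8):
    """Return (start, end, score) of the best token window, or None.

    Hypothetical helper: slides a window with as many whitespace tokens
    as the annotation over the sentence and keeps the best-scoring span.
    """
    tokens = sentence.split()
    n = len(annotation.split())
    # Character offsets of every token in the sentence.
    offsets, pos = [], 0
    for tok in tokens:
        start = sentence.index(tok, pos)
        offsets.append((start, start + len(tok)))
        pos = start + len(tok)
    best = None
    for i in range(len(tokens) - n + 1):
        start, end = offsets[i][0], offsets[i + n - 1][1]
        score = scorer(annotation.lower(), sentence[start:end].lower())
        if score >= threshold and (best is None or score > best[2]):
            best = (start, end, score)
    return best


sentence = "Os rebeldes atacaram a cidade"
match = fuzzy_find("atacar", sentence)
print(match, sentence[match[0]:match[1]])
```

In this toy sentence, an annotation translated in isolation as "atacar" is matched to the inflected form "atacaram". Passing `scorer=levenshtein_score` with a lower threshold selects the same span using edit distance instead.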

The pipeline operates sequentially: annotations aligned by an earlier component are not passed to later ones. In our experiments, the order listed above performed best.
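The sequential behaviour described above can be sketched as a cascade in which each aligner only sees the annotations that earlier aligners failed to place. This is an illustrative sketch, not the repository's code: `run_cascade`, `exact_token_aligner` and `shared_prefix_aligner` are hypothetical names, and the two toy aligners merely stand in for the real components.

```python
import re
from typing import Callable, Dict, List, Optional, Tuple

Span = Tuple[int, int]
Aligner = Callable[[str, str], Optional[Span]]


def _tokens(sentence: str) -> List[Tuple[str, int, int]]:
    """Whitespace tokens with their character offsets."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", sentence)]


def exact_token_aligner(annotation: str, sentence: str) -> Optional[Span]:
    """Toy aligner: a token identical to the annotation (case-insensitive)."""
    for tok, start, end in _tokens(sentence):
        if tok.lower() == annotation.lower():
            return (start, end)
    return None


def shared_prefix_aligner(annotation: str, sentence: str, k: int = 5) -> Optional[Span]:
    """Toy stand-in for a lemma-based aligner: tokens sharing a k-char prefix."""
    if len(annotation) < k:
        return None
    for tok, start, end in _tokens(sentence):
        if tok.lower()[:k] == annotation.lower()[:k]:
            return (start, end)
    return None


def run_cascade(
    annotations: List[str],
    sentence: str,
    aligners: List[Tuple[str, Aligner]],
) -> Tuple[Dict[str, Tuple[str, Span]], List[str]]:
    """Apply aligners in order; aligned annotations are never revisited."""
    aligned: Dict[str, Tuple[str, Span]] = {}
    pending = list(annotations)
    for name, aligner in aligners:
        still_pending = []
        for annotation in pending:
            span = aligner(annotation, sentence)
            if span is None:
                still_pending.append(annotation)
            else:
                aligned[annotation] = (name, span)
        pending = still_pending
    return aligned, pending


aligned, unaligned = run_cascade(
    ["cidade", "atacar", "ontem"],
    "Os rebeldes atacaram a cidade",
    [("exact", exact_token_aligner), ("prefix", shared_prefix_aligner)],
)
print(aligned)
print(unaligned)
```

Because "cidade" is placed by the first aligner, the second one never sees it, even though it would also match. Annotations that no aligner can place are returned separately so they can be inspected or handled manually.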

Usage

Translate ACE-2005 to Portuguese

By default, we use Google Translate for the translation process. An API key is needed in order to use DeepL Translator.
```
Usage: python3 translation.py <input_file> <output_dir>
Example: python src/translate.py data/sample_en.json data/sample_pt.json
```
Run the Annotation Alignment Pipeline

To align the translated annotations, the alignment pipeline can be executed with the following command:
```
Usage: python3 pipeline.py <input_file> <output_dir>
Example: python src/pipeline.py data/sample_pt.json data/sample_pt_aligned.json
```
The pipeline can be configured in the config.yaml file. Users can select the aligners they intend to use and must indicate the path for the alignment resources for each alignment component, such as multiple translations of annotations, previously calculated lemmas, synonyms, etc. All of these resources are already pre-calculated for the Portuguese language in the resources folder. Additionally, the input and output files can be configured in the config.yaml file as well.

Evaluation

To measure the effectiveness of the alignment pipeline, the entire ACE-2005-PT test set, which includes 1,310 annotations (triggers and arguments), was aligned manually. These alignments were performed by an expert linguist to ensure high-quality annotations, following the same annotation guidelines as the original ACE-2005 corpus.
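As a rough illustration of how automatic alignments could be compared against manual ones, the sketch below computes the share of annotations whose predicted span exactly matches the manual span, both overall and per component. The exact-match criterion, the `evaluate` function, the data layout and the toy values are all assumptions made for this example; they are not the metric or data used for the reported results.

```python
from collections import Counter
from typing import Dict, Tuple

Span = Tuple[int, int]


def evaluate(
    predicted: Dict[str, Tuple[str, Span]],
    gold: Dict[str, Span],
) -> Tuple[Dict[str, float], float]:
    """Exact-match accuracy per component and overall.

    predicted maps an annotation id to (component name, predicted span);
    gold maps an annotation id to the manually aligned span.
    Annotations missing from predicted are counted under "unaligned".
    """
    total: Counter = Counter()
    correct: Counter = Counter()
    for ann_id, gold_span in gold.items():
        component, span = predicted.get(ann_id, ("unaligned", None))
        total[component] += 1
        if span == gold_span:
            correct[component] += 1
    per_component = {name: correct[name] / total[name] for name in total}
    overall = sum(correct.values()) / len(gold) if gold else 0.0
    return per_component, overall


# Toy values for illustration only, not taken from the corpus.
gold = {"a1": (12, 20), "a2": (23, 29), "a3": (0, 2), "a4": (3, 11)}
predicted = {
    "a1": ("lemmatization", (12, 20)),
    "a2": ("fuzzy", (23, 28)),
    "a3": ("lemmatization", (0, 2)),
}
per_component, overall = evaluate(predicted, gold)
print(per_component, overall)
```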

The evaluation results are presented in Table 1:

Table 1: Evaluation Results by pipeline component

Contributing

Contributions are welcome! Feel free to open issues or submit pull requests.

License

This project is licensed under the MIT License.
