OSC Transformer Based Extractor

OS-Climate Data Extraction Tool

This project provides a CLI tool and Python scripts to train Transformer models (via Hugging Face) for two primary tasks:

1. Relevance Detection: determines whether a question-context pair is relevant.
2. KPI Detection: fine-tunes models to extract key performance indicators (KPIs) from datasets such as annual reports, and runs inference with them.

Quick Start

To install the tool, use pip:

$ pip install osc-transformer-based-extractor

After installation, you can access the CLI tool with:

$ osc-transformer-based-extractor

Run without arguments, this command lists the available subcommands and their help text, which is generated by Typer, our CLI library.

Commands and Workflow

1. Relevance Detection

Fine-tuning the Model:

Assume your project structure looks like this:

project/
│
├── kpi_mapping.csv
├── training_data.csv
├── data/
│   └── (JSON files for inference)
├── model/
│   └── (Model-related files)
├── saved__model/
│   └── (Output from training)
├── output/
│   └── (Results from inference)

Use the following command to fine-tune the model:

$ osc-transformer-based-extractor relevance-detector fine-tune \
  --data_path "project/training_data.csv" \
  --model_name "bert-base-uncased" \
  --num_labels 2 \
  --max_length 128 \
  --epochs 3 \
  --batch_size 16 \
  --output_dir "project/saved__model/" \
  --save_steps 500
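
If you would rather launch fine-tuning from a Python script, for example to try several epoch counts, the command above can be assembled as an argument list and passed to `subprocess.run`. The following is a minimal sketch that reuses only the flags shown above; the helper names `build_fine_tune_command` and `main` are hypothetical and not part of the package.

```python
import shlex
import subprocess


def build_fine_tune_command(data_path, output_dir, model_name="bert-base-uncased",
                            num_labels=2, max_length=128, epochs=3,
                            batch_size=16, save_steps=500):
    """Return the relevance-detector fine-tune command as an argv list.

    Hypothetical helper: it only mirrors the flags shown in the command above.
    """
    return [
        "osc-transformer-based-extractor", "relevance-detector", "fine-tune",
        "--data_path", data_path,
        "--model_name", model_name,
        "--num_labels", str(num_labels),
        "--max_length", str(max_length),
        "--epochs", str(epochs),
        "--batch_size", str(batch_size),
        "--output_dir", output_dir,
        "--save_steps", str(save_steps),
    ]


def main(dry_run=True):
    cmd = build_fine_tune_command("project/training_data.csv",
                                  "project/saved__model/")
    print(shlex.join(cmd))  # show the exact command line first
    if not dry_run:
        # Requires the package to be installed; raises if the CLI fails.
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    main()
```

Passing a list rather than a single string avoids shell quoting problems when paths contain spaces.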

Running Inference:

$ osc-transformer-based-extractor relevance-detector perform-inference \
  --folder_path "project/data/" \
  --kpi_mapping_path "project/kpi_mapping.csv" \
  --output_path "project/output/" \
  --model_path "project/model/" \
  --tokenizer_path "project/model/" \
  --threshold 0.5
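
The `--threshold` flag suggests that a question-context pair is reported as relevant once the model's probability for the positive class reaches the cutoff. The tool's internals are not shown here, so the following is only an illustrative sketch of that common pattern. It assumes a two-class model in which index 1 means "relevant"; the function names are hypothetical.

```python
import math


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def is_relevant(logits, threshold=0.5):
    """Assumed rule: relevant when P(class 1) >= threshold."""
    return softmax(logits)[1] >= threshold


print(is_relevant([0.2, 1.5]))       # class 1 has the larger logit
print(is_relevant([2.0, -1.0]))      # class 0 has the larger logit
print(is_relevant([0.2, 1.5], 0.9))  # same logits, stricter cutoff
```

Under this reading, raising the threshold trades recall for precision: fewer pairs are kept, but those kept are more confidently relevant.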

2. KPI Detection

The KPI detection functionality includes fine-tuning and inference.

Fine-tuning the KPI Model:

Assume your project structure looks like this:

project/
│
├── kpi_mapping.csv
├── training_data.csv
│
├── model/
│   └── (model-related files, e.g., tokenizer, config, checkpoints)
│
├── saved__model/
│   └── (Folder to store output from fine-tuning)
│
├── output/
│   └── (output files, e.g., inference_results.xlsx)

Use the following command to fine-tune the model:

$ osc-transformer-based-extractor kpi-detection fine-tune \
    --data_path "project/training_data.csv" \
    --model_name "bert-base-uncased" \
    --max_length 128 \
    --epochs 3 \
    --batch_size 16 \
    --learning_rate 5e-5 \
    --output_dir "project/saved__model/" \
    --save_steps 500

Performing Inference:

$ osc-transformer-based-extractor kpi-detection inference \
    --data_file_path "project/data/input_dataset.csv" \
    --output_path "project/output/inference_results.xlsx" \
    --model_path "project/model/"

Training Data Requirements

Relevance Detection Training File:

The training file should have the following columns:

- Question
- Context
- Label

Example:

Training Data Example

Question	Context	Label
What is the company name?	The Company is exposed to a risk...	0
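
Before starting a training run, it can help to check that the file really has these columns. The following is a minimal sketch using only the standard library. It assumes a comma-delimited file with a header row and a Label of 0 or 1 (matching `--num_labels 2` in the fine-tune command); the function name `check_training_rows` is hypothetical.

```python
import csv
import io

REQUIRED_COLUMNS = {"Question", "Context", "Label"}


def check_training_rows(handle, delimiter=","):
    """Return a list of problems found in a relevance training file.

    Assumptions: the file has a header row, and Label is 0 or 1.
    """
    reader = csv.DictReader(handle, delimiter=delimiter)
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        return [f"missing columns: {sorted(missing)}"]
    problems = []
    for line_no, row in enumerate(reader, start=2):  # header is line 1
        if row["Label"] not in {"0", "1"}:
            problems.append(f"line {line_no}: unexpected Label {row['Label']!r}")
        if not row["Question"] or not row["Context"]:
            problems.append(f"line {line_no}: empty Question or Context")
    return problems


sample = io.StringIO(
    "Question,Context,Label\n"
    "What is the company name?,The Company is exposed to a risk...,0\n"
)
print(check_training_rows(sample))  # an empty list means no problems found
```

For a file on disk, pass `open(path, newline="", encoding="utf-8")` instead of the in-memory sample, and set `delimiter="\t"` if the file is tab-separated.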

KPI Detection Training File:

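For KPI detection, the training file needs the Question, Context, and Label columns plus these additional columns: Company, Source File, KPI ID, Year, Answer, and Data Type.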

KPI Detection Training Example

Question	Context	Label	Company	Source File	KPI ID	Year	Answer	Data Type
What is the company name?	...	0	NOVATEK	04_NOVATEK_AR_2016_ENG_11.pdf	0	2016	PAO NOVATEK	TEXT

KPI Mapping File:

KPI Mapping File Example

kpi_id	question	sectors	add_year	kpi_category
1	In which year was the annual report...	OG, CM, CU	FALSE	TEXT
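
To inspect a mapping file outside the CLI, the rows can be read into typed records. This is a sketch under stated assumptions rather than the tool's own loader: it assumes a comma-delimited CSV in which the `sectors` cell is quoted because it contains commas, and in which `add_year` is the string TRUE or FALSE. The `KpiMapping` class and `load_kpi_mapping` function are hypothetical names.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class KpiMapping:
    kpi_id: int
    question: str
    sectors: list
    add_year: bool
    kpi_category: str


def load_kpi_mapping(handle):
    """Parse KPI mapping rows into KpiMapping records."""
    records = []
    for row in csv.DictReader(handle):
        records.append(KpiMapping(
            kpi_id=int(row["kpi_id"]),
            question=row["question"],
            # "OG, CM, CU" becomes ["OG", "CM", "CU"]
            sectors=[s.strip() for s in row["sectors"].split(",")],
            # Assumed encoding: the literal strings TRUE / FALSE
            add_year=row["add_year"].strip().upper() == "TRUE",
            kpi_category=row["kpi_category"],
        ))
    return records


sample = io.StringIO(
    "kpi_id,question,sectors,add_year,kpi_category\n"
    '1,In which year was the annual report...,"OG, CM, CU",FALSE,TEXT\n'
)
for record in load_kpi_mapping(sample):
    print(record.kpi_id, record.sectors, record.add_year)
```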

Developer Notes

Local Development

Clone the repository:

$ git clone https://github.com/os-climate/osc-transformer-based-extractor/

We use pdm for package management and tox for testing.

Install pdm:
```
$ pip install pdm
```
Sync dependencies:
```
$ pdm sync
```
Add new packages (e.g., numpy):
```
$ pdm add numpy
```

Run tox for linting and testing:

$ pip install tox
$ tox -e lint
$ tox -e test

Contributing

We welcome contributions! Please fork the repository and submit a pull request. Ensure you sign off each commit with the Developer Certificate of Origin (DCO). Read more: http://developercertificate.org/.

Governance Transition

On June 26, 2024, the Linux Foundation announced the merger of FINOS with OS-Climate. Projects are now transitioning to the [FINOS governance framework](https://community.finos.org/docs/governance).
