This project provides a CLI tool and Python scripts to train Transformer models (via Hugging Face) for two primary tasks:
1. Relevance Detection: Determines whether a question-context pair is relevant.
2. KPI Detection: Fine-tunes models to extract key performance indicators (KPIs) from datasets such as annual reports, and runs inference with the fine-tuned models.
To install the tool, use pip:
$ pip install osc-transformer-based-extractor
After installation, you can access the CLI tool with:
$ osc-transformer-based-extractor
This command lists the available commands and their help text, which are generated by Typer, our CLI library.
Fine-tuning the Model:
Assume your project structure looks like this:
project/
│
├── kpi_mapping.csv
├── training_data.csv
├── data/
│ └── (JSON files for inference)
├── model/
│ └── (Model-related files)
├── saved__model/
│ └── (Output from training)
├── output/
│ └── (Results from inference)
Use the following command to fine-tune the model:
$ osc-transformer-based-extractor relevance-detector fine-tune \
--data_path "project/training_data.csv" \
--model_name "bert-base-uncased" \
--num_labels 2 \
--max_length 128 \
--epochs 3 \
--batch_size 16 \
--output_dir "project/saved__model/" \
--save_steps 500
Running Inference:
$ osc-transformer-based-extractor relevance-detector perform-inference \
--folder_path "project/data/" \
--kpi_mapping_path "project/kpi_mapping.csv" \
--output_path "project/output/" \
--model_path "project/model/" \
--tokenizer_path "project/model/" \
--threshold 0.5
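How the tool applies `--threshold` internally is not described here. As a minimal sketch, assuming the model emits two logits per question-context pair and that index 1 is the "relevant" class, a threshold of this kind is typically compared against the softmax probability. The function names below are hypothetical and not part of the package.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def is_relevant(logits, threshold=0.5):
    """Label a pair relevant when P(class 1) meets the threshold.

    Assumption: index 1 is the "relevant" class, matching --num_labels 2.
    """
    return softmax(logits)[1] >= threshold


print(is_relevant([0.2, 1.5]))       # True  (P(class 1) is about 0.79)
print(is_relevant([0.2, 1.5], 0.9))  # False
```

Under this assumption, raising the threshold trades recall for precision: fewer pairs are marked relevant, but those that are carry higher model confidence.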
The KPI detection functionality includes fine-tuning and inference.
Fine-tuning the KPI Model:
Assume your project structure looks like this:
project/
│
├── kpi_mapping.csv
├── training_data.csv
│
├── model/
│ └── (model-related files, e.g., tokenizer, config, checkpoints)
│
├── saved__model/
│ └── (Folder to store output from fine-tuning)
│
├── output/
│ └── (output files, e.g., inference_results.xlsx)
$ osc-transformer-based-extractor kpi-detection fine-tune \
--data_path "project/training_data.csv" \
--model_name "bert-base-uncased" \
--max_length 128 \
--epochs 3 \
--batch_size 16 \
--learning_rate 5e-5 \
--output_dir "project/saved__model/" \
--save_steps 500
Performing Inference:
$ osc-transformer-based-extractor kpi-detection inference \
--data_file_path "project/data/input_dataset.csv" \
--output_path "project/output/inference_results.xlsx" \
--model_path "project/model/"
- Relevance Detection Training File:
The training file should have the following columns:
- Question
- Context
- Label
Example:
Question | Context | Label |
---|---|---|
What is the company name? | The Company is exposed to a risk... | 0 |
- KPI Detection Training File:
For KPI detection, the dataset needs the three columns above plus six additional ones:
Question | Context | Label | Company | Source File | KPI ID | Year | Answer | Data Type |
---|---|---|---|---|---|---|---|---|
What is the company name? | ... | 0 | NOVATEK | 04_NOVATEK_AR_2016_ENG_11.pdf | 0 | 2016 | PAO NOVATEK | TEXT |
- KPI Mapping File:
kpi_id | question | sectors | add_year | kpi_category |
---|---|---|---|---|
1 | In which year was the annual report... | OG, CM, CU | FALSE | TEXT |
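Before starting a training run, it can help to check that a CSV file has the expected header. The sketch below uses only the standard library and takes the column names from the tables above. Whether the tool requires exactly this spelling and case is an assumption, and `missing_columns` is a hypothetical helper, not part of the package.

```python
import csv
import io

# Column names copied from the tables above; the exact spelling and case
# expected by the tool are an assumption.
RELEVANCE_COLUMNS = ["Question", "Context", "Label"]
KPI_COLUMNS = RELEVANCE_COLUMNS + [
    "Company", "Source File", "KPI ID", "Year", "Answer", "Data Type",
]


def missing_columns(csv_file, required):
    """Return the required column names absent from the CSV header."""
    reader = csv.DictReader(csv_file)
    header = reader.fieldnames or []
    return [name for name in required if name not in header]


# Hypothetical in-memory sample standing in for project/training_data.csv.
sample = io.StringIO(
    "Question,Context,Label\n"
    "What is the company name?,The Company is exposed to a risk...,0\n"
)
print(missing_columns(sample, RELEVANCE_COLUMNS))  # []
sample.seek(0)
print(missing_columns(sample, KPI_COLUMNS))  # the six KPI-only columns
```

For a file on disk, pass an open file handle instead, for example `open("project/training_data.csv", newline="")`.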
Clone the repository:
$ git clone https://github.com/os-climate/osc-transformer-based-extractor/
We use pdm for package management and tox for testing.
Install pdm:
$ pip install pdm
Sync dependencies:
$ pdm sync
Add new packages (e.g., numpy):
$ pdm add numpy
Run tox for linting and testing:
$ pip install tox
$ tox -e lint
$ tox -e test
We welcome contributions! Please fork the repository and submit a pull request. Ensure you sign off each commit with the Developer Certificate of Origin (DCO). Read more: http://developercertificate.org/.
On June 26, 2024, the Linux Foundation announced the merger of FINOS with OS-Climate. Projects are now transitioning to the [FINOS governance framework](https://community.finos.org/docs/governance).