# tuneBERTopic

`tuneBERTopic` is a tool for optimizing the hyperparameters of the BERTopic model using various search strategies, including Bayesian optimization. It allows users to configure parameters, load data, and select evaluation metrics for tuning the topic model.

## Features
- Parameter Configuration: Load parameters from a YAML file.
- Data Loading: Load data from a file or use sample data from the 20 Newsgroups dataset.
- Search Strategies: Implement different search strategies, with Bayesian optimization as the default.
- Evaluation Metrics: Support multiple evaluation metrics, including coherence, BLEU, ROUGE, and silhouette scores.
- Logging: Utilize MLflow for tracking and logging experiments.
## Installation

- Clone the repository:

  ```shell
  git clone https://github.com/yourusername/tuneBERTopic.git
  cd tuneBERTopic
  ```

- Install dependencies:

  ```shell
  poetry install
  ```
## Usage

The main script `main.py` can be executed with various command-line arguments:

```shell
python main.py <parameter_file> [--data-path <data_path>] [--categories <categories>] [--max-num-samples <num_samples>] [--strategy <strategy>] [--metric <metric>] [--llm <llm>] [--log-level <log_level>]
```
- Basic example:

  ```shell
  python main.py parameters.yaml --log-level INFO
  ```

- Using custom data:

  ```shell
  python main.py parameters.yaml --data-path /path/to/data.txt --log-level INFO
  ```

- Specifying categories and maximum samples:

  ```shell
  python main.py parameters.yaml --categories "alt.atheism" "comp.graphics" --max-num-samples 500 --log-level INFO
  ```
## Parameter File

The parameter file must be in YAML format. An example `parameters.yaml` file:

```yaml
param_grid:
  umap__n_neighbors: [15, 50]
  umap__n_components: [5, 10]
  hdbscan__min_cluster_size: [5, 15]
  bertopic__nr_topics: [2, 5, 10, 50]
```
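To make the shape of this search space concrete, here is a hypothetical sketch that expands a `param_grid` like the one above into individual candidate configurations, as a plain grid search would. The tool itself defaults to Bayesian optimization via `hyperopt`, so this is only an illustration of the space being searched; the dict literal mirrors the YAML example.

```python
from itertools import product

# Mirrors the example parameters.yaml above (keys use the
# <component>__<parameter> naming convention).
param_grid = {
    "umap__n_neighbors": [15, 50],
    "umap__n_components": [5, 10],
    "hdbscan__min_cluster_size": [5, 15],
    "bertopic__nr_topics": [2, 5, 10, 50],
}

def expand_grid(grid):
    """Yield one config dict per combination of parameter values."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))

configs = list(expand_grid(param_grid))
print(len(configs))  # 2 * 2 * 2 * 4 = 32 candidate configurations
```

A Bayesian strategy explores the same space, but chooses which of these candidates to evaluate next based on the scores observed so far instead of enumerating all of them.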
## Modules

- Function `load_parameter_file`: Loads parameters from a YAML file.
- Function `load_data`: Loads data from a file or the sample dataset.
- Class `SearchStrategy`: Base class for search strategies.
- Class `BayesianOptimizationSearch`: Implements Bayesian optimization using `hyperopt`.
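The base-class/subclass split above can be sketched as follows. This is a hypothetical illustration of the pattern, not the actual implementation: the real `BayesianOptimizationSearch` delegates candidate selection to `hyperopt`, which is replaced here by uniform random sampling so the example stays self-contained.

```python
import random

class SearchStrategy:
    """Base class: subclasses decide how to propose the next candidate."""

    def __init__(self, param_grid, objective):
        self.param_grid = param_grid
        self.objective = objective  # maps a config dict to a score (higher is better)

    def suggest(self):
        raise NotImplementedError

    def run(self, n_trials):
        """Evaluate n_trials candidates and return the best (config, score)."""
        best_config, best_score = None, float("-inf")
        for _ in range(n_trials):
            config = self.suggest()
            score = self.objective(config)
            if score > best_score:
                best_config, best_score = config, score
        return best_config, best_score

class RandomSearch(SearchStrategy):
    """Stand-in for BayesianOptimizationSearch: samples values uniformly."""

    def suggest(self):
        return {k: random.choice(v) for k, v in self.param_grid.items()}
```

In the real tool, `objective` would fit a BERTopic model with the candidate configuration and return the selected evaluation metric; a Bayesian strategy would bias `suggest` toward regions that scored well previously.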
## Evaluation Metrics

`tuneBERTopic` supports various evaluation metrics to assess the quality of the topics generated by the BERTopic model:
- Coherence Score (c_v): Measures how consistently related the words in a topic are to each other, which helps in determining the interpretability and quality of the topics generated by BERTopic.
- Silhouette Score: Measures how similar an object is to its own cluster (topic) compared to other clusters. It is used to evaluate the quality of clustering.
- BLEU (Bilingual Evaluation Understudy) Score: Originally designed for evaluating machine-translated text. Here, BLEU scores are calculated by using an LLM backend to generate summaries from the topic keywords and comparing them against the input documents.
- ROUGE (Recall-Oriented Understudy for Gisting Evaluation) Score: A metric for evaluating automatic summarization and machine translation. ROUGE scores are obtained in the same way as BLEU scores, by comparing LLM-generated summaries of the topic keywords against the input documents.
The `evaluate_model` method in the `SearchStrategy` class calculates these metrics:
- Coherence and Silhouette: Evaluated directly on the topic model and the input documents.
- BLEU and ROUGE: Utilize an LLM backend to generate summaries from the topic keywords, which are then compared to the input documents to obtain the scores.
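As a rough intuition for the summary-comparison metrics, ROUGE-1 recall (the simplest member of the ROUGE family) can be sketched in plain Python. This is a deliberately simplified illustration; the tool would normally rely on a dedicated ROUGE implementation, which also handles clipping and multiple references.

```python
def rouge1_recall(reference: str, candidate: str) -> float:
    """Fraction of reference unigrams that also appear in the candidate."""
    ref_words = reference.lower().split()
    cand_words = set(candidate.lower().split())
    if not ref_words:
        return 0.0
    return sum(1 for w in ref_words if w in cand_words) / len(ref_words)

# 3 of the 6 reference words ("cat", "on", "mat") appear in the candidate.
print(rouge1_recall("the cat sat on the mat", "a cat on a mat"))  # 0.5
```

In the tuning loop, the LLM-generated summary plays the role of the candidate and the input documents play the role of the reference, so higher scores indicate topic keywords that summarize the corpus well.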
## MLflow Tracking

MLflow is used for tracking experiments and logging results. Ensure MLflow is properly configured before running the tuning process:

```shell
# Start the MLflow tracking server
mlflow server --host 127.0.0.1 --port 8080
```
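With the server running, MLflow clients can be pointed at it via the standard `MLFLOW_TRACKING_URI` environment variable before launching a tuning run (shown here with the basic example from Usage; whether the script also accepts the URI another way is not covered by this README):

```shell
# Point MLflow clients at the local tracking server started above.
export MLFLOW_TRACKING_URI=http://127.0.0.1:8080

# ...then run the tuning script as usual, e.g.:
#   python main.py parameters.yaml --log-level INFO
```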
## Contributing

Contributions are welcome! Please fork the repository and submit a pull request.