# tuneBERTopic

`tuneBERTopic` is a tool for optimizing the hyperparameters of the BERTopic model using various search strategies, including Bayesian optimization. It allows users to configure parameters, load data, and select evaluation metrics for tuning the topic model.

## Features
- Parameter Configuration: Load parameters from a YAML file.
- Data Loading: Load data from a file or use sample data from the 20 Newsgroups dataset.
- Search Strategies: Implement different search strategies, with Bayesian optimization as the default.
- Evaluation Metrics: Support multiple evaluation metrics, including coherence, BLEU, ROUGE, and silhouette scores.
- Logging: Utilize MLflow for tracking and logging experiments.
## Installation

- Clone the repository:

  ```shell
  git clone https://github.com/yourusername/tuneBERTopic.git
  cd tuneBERTopic
  ```

- Install dependencies:

  ```shell
  poetry install
  ```
## Usage

The main script `main.py` can be executed with various command-line arguments:

```shell
python main.py <parameter_file> [--data-path <data_path>] [--categories <categories>] [--max-num-samples <num_samples>] [--strategy <strategy>] [--metric <metric>] [--llm <llm>] [--log-level <log_level>]
```
- Basic example:

  ```shell
  python main.py parameters.yaml --log-level INFO
  ```

- Using custom data:

  ```shell
  python main.py parameters.yaml --data-path /path/to/data.txt --log-level INFO
  ```

- Specifying categories and maximum samples:

  ```shell
  python main.py parameters.yaml --categories "alt.atheism" "comp.graphics" --max-num-samples 500 --log-level INFO
  ```
## Parameter File

The parameter file must be in YAML format. An example `parameters.yaml` file:

```yaml
param_grid:
  umap__n_neighbors: [15, 50]
  umap__n_components: [5, 10]
  hdbscan__min_cluster_size: [5, 15]
  bertopic__nr_topics: [2, 5, 10, 50]
```
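To make the shape of this search space concrete, here is a hypothetical sketch that expands a `param_grid` like the one above into individual candidate configurations, as a plain grid search would. The tool itself defaults to Bayesian optimization via `hyperopt`, so this is only an illustration of the space being searched; the dict literal mirrors the YAML example.

```python
from itertools import product

# Mirrors the example parameters.yaml above (keys use the
# <component>__<parameter> naming convention).
param_grid = {
    "umap__n_neighbors": [15, 50],
    "umap__n_components": [5, 10],
    "hdbscan__min_cluster_size": [5, 15],
    "bertopic__nr_topics": [2, 5, 10, 50],
}

def expand_grid(grid):
    """Yield one config dict per combination of parameter values."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))

configs = list(expand_grid(param_grid))
print(len(configs))  # 2 * 2 * 2 * 4 = 32 candidate configurations
```

A Bayesian strategy explores the same space, but chooses which of these candidates to evaluate next based on the scores observed so far instead of enumerating all of them.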
## Modules

- Function `load_parameter_file`: Loads parameters from a YAML file.
- Function `load_data`: Loads data from a file or the sample dataset.
- Class `SearchStrategy`: Base class for search strategies.
- Class `BayesianOptimizationSearch`: Implements Bayesian optimization using `hyperopt`.
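The base-class/subclass split above can be sketched as follows. This is a hypothetical illustration of the pattern, not the actual implementation: the real `BayesianOptimizationSearch` delegates candidate selection to `hyperopt`, which is replaced here by uniform random sampling so the example stays self-contained.

```python
import random

class SearchStrategy:
    """Base class: subclasses decide how to propose the next candidate."""

    def __init__(self, param_grid, objective):
        self.param_grid = param_grid
        self.objective = objective  # maps a config dict to a score (higher is better)

    def suggest(self):
        raise NotImplementedError

    def run(self, n_trials):
        """Evaluate n_trials candidates and return the best (config, score)."""
        best_config, best_score = None, float("-inf")
        for _ in range(n_trials):
            config = self.suggest()
            score = self.objective(config)
            if score > best_score:
                best_config, best_score = config, score
        return best_config, best_score

class RandomSearch(SearchStrategy):
    """Stand-in for BayesianOptimizationSearch: samples values uniformly."""

    def suggest(self):
        return {k: random.choice(v) for k, v in self.param_grid.items()}
```

In the real tool, `objective` would fit a BERTopic model with the candidate configuration and return the selected evaluation metric; a Bayesian strategy would bias `suggest` toward regions that scored well previously.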
## Evaluation Metrics

`tuneBERTopic` supports various evaluation metrics to assess the quality of the topics generated by the BERTopic model:
- Coherence Score (c_v): Measures how consistently related the words in a topic are to each other, which helps in determining the interpretability and quality of the topics generated by BERTopic.
- Silhouette Score: Measures how similar an object is to its own cluster (topic) compared to other clusters. It is used to evaluate the quality of clustering.
- BLEU (Bilingual Evaluation Understudy) Score: Originally designed for evaluating machine-translated text. Here, BLEU scores are calculated by using an LLM backend to generate summaries from the topic keywords and comparing them against the input documents.
- ROUGE (Recall-Oriented Understudy for Gisting Evaluation) Score: A metric for evaluating automatic summarization and machine translation. ROUGE scores are obtained in the same way as BLEU scores, by comparing LLM-generated summaries of the topic keywords against the input documents.
The `evaluate_model` method in the `SearchStrategy` class calculates these metrics:
- Coherence and Silhouette: Evaluated directly on the topic model and the input documents.
- BLEU and ROUGE: Utilize an LLM backend to generate summaries from the topic keywords, which are then compared to the input documents to obtain the scores.
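As a rough intuition for the summary-comparison metrics, ROUGE-1 recall (the simplest member of the ROUGE family) can be sketched in plain Python. This is a deliberately simplified illustration; the tool would normally rely on a dedicated ROUGE implementation, which also handles clipping and multiple references.

```python
def rouge1_recall(reference: str, candidate: str) -> float:
    """Fraction of reference unigrams that also appear in the candidate."""
    ref_words = reference.lower().split()
    cand_words = set(candidate.lower().split())
    if not ref_words:
        return 0.0
    return sum(1 for w in ref_words if w in cand_words) / len(ref_words)

# 3 of the 6 reference words ("cat", "on", "mat") appear in the candidate.
print(rouge1_recall("the cat sat on the mat", "a cat on a mat"))  # 0.5
```

In the tuning loop, the LLM-generated summary plays the role of the candidate and the input documents play the role of the reference, so higher scores indicate topic keywords that summarize the corpus well.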
## MLflow Tracking

MLflow is used for tracking experiments and logging results. Ensure MLflow is properly configured before running the tuning process:

```shell
# Start the MLflow tracking server
mlflow server --host 127.0.0.1 --port 8080
```
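With the server running, MLflow clients can be pointed at it via the standard `MLFLOW_TRACKING_URI` environment variable before launching a tuning run (shown here with the basic example from Usage; whether the script also accepts the URI another way is not covered by this README):

```shell
# Point MLflow clients at the local tracking server started above.
export MLFLOW_TRACKING_URI=http://127.0.0.1:8080

# ...then run the tuning script as usual, e.g.:
#   python main.py parameters.yaml --log-level INFO
```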
## Contributing

Contributions are welcome! Please fork the repository and submit a pull request.