vocab-augmentor is a Python command-line tool designed to help language learners expand their vocabulary
effortlessly. By leveraging advanced language models such as facebook/mbart-large-50-many-to-many-mmt,
Helsinki-NLP/opus-mt, Gemini 1.0 Pro, and llama 3-8b, it identifies new words from any given text,
translates them, and updates your personal vocabulary table. Additionally, it supports adding
pronunciation sounds for new words using the MeloTTS and facebook/mms-tts-por text-to-speech libraries
for supported languages. The updated vocabulary table is then exported to a CSV file and a
feature-rich HTML file for easy access and further study.
HTML table displaying words extracted from source text, including translation data, audio links, and part of speech (POS) tags
- Multi-Language Support: Supports a variety of languages, making it versatile for learners of different languages.
- Advanced Translation Models: Choose from multiple state-of-the-art translation models for accurate translations.
  - facebook/mbart-large-50-many-to-many-mmt
  - Helsinki-NLP/opus-mt
  - Gemini 1.0 Pro
  - llama 3-8b
- Audio Pronunciation: Adds pronunciation sounds for new words using MeloTTS for supported languages, including English, French, Spanish, Chinese, Korean, and Japanese. For Portuguese text-to-speech, facebook/mms-tts-por is used. Sound files are saved as .wav.
- Example Sentences Generation: Generates two example sentences for each segmented word using one of the large language models (Gemini 1.0 Pro or llama 3-8b).
- Selective Module Installation: Modules are installed only when needed for specific tasks.
- CSV and HTML Export: Updates and exports the vocabulary table to a CSV file and a feature-rich HTML file.
  - HTML features include DataTables stylesheet, column reordering, resizable columns, search functionality, pagination, and sortable columns.
  - Both CSV and HTML files include clickable links to audio files (.wav) containing pronunciation sounds of extracted words.
  - Saves new words along with comprehensive data including their translation, pinyin (if applicable), part of speech (POS), and example sentences.
- Platform: tested on Linux
- Python: version 3.10.13+
langcodes: to get the full name of a language from its short code
Modules are installed automatically by the vocab script only when needed for specific
tasks. Below is a breakdown of the modules required based on different functionalities:
Functionality | Required Modules |
---|---|
Translation Models | facebook/mbart-large-50-many-to-many-mmt (transformers); Helsinki-NLP/opus-mt (transformers) |
Translation & Sentence Generation | gemini-pro (google.generativeai); meta-llama/Meta-Llama-3-8B-Instruct (huggingface_hub, torch, transformers) |
Text-to-Speech (TTS) | facebook/mms-tts-por (transformers, torch, scipy, numpy); MeloTTS (gradio_client) |
Language Detection | langdetect; langid; polyglot (pyicu, pycld2) |
Language-Specific | Chinese (source/target): pypinyin |
Data Management | Saving translations to CSV: pandas |
Text Segmentation | Chinese: jieba; other languages: spacy |
Note: When running on a T4 GPU, pycld3 can't be installed, hence polyglot can't be used.
Use an alternative language detection method in such cases (e.g. langdetect or langid).
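The on-demand installation described above can be pictured with a small helper like the one below. This is only a sketch under stated assumptions: the README does not show how the vocab script actually installs modules, and `ensure_module` and `REQUIRED_MODULES` are hypothetical names introduced for illustration.

```python
import importlib
import subprocess
import sys


def ensure_module(module_name, pip_name=None, install=True):
    """Import a module, installing it with pip first if it is missing.

    Hypothetical helper: the name and signature are illustrative only.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError:
        if not install:
            raise
    # Install into the same interpreter that is running this code.
    subprocess.check_call(
        [sys.executable, "-m", "pip", "install", pip_name or module_name]
    )
    # Make sure the import system notices the freshly installed package.
    importlib.invalidate_caches()
    return importlib.import_module(module_name)


# Assumed feature-to-module mapping, based on a few rows of the table above.
REQUIRED_MODULES = {
    "csv_export": ["pandas"],
    "chinese_segmentation": ["jieba"],
    "chinese_pinyin": ["pypinyin"],
}
```

With a helper of this kind, a feature would call `ensure_module` for each of its modules right before first use, so a user who never processes Chinese text never downloads jieba or pypinyin.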
Install the package vocab_augmentor with pip:
  pip install git+https://github.com/raul23/vocab-augmentor/#egg=vocab-augmentor
Test installation
- Test your installation by importing vocab_augmentor and printing its version:
  python -c "import vocab_augmentor; print(vocab_augmentor.__version__)"
- You can also test that you have access to the vocab script by showing the program's version:
  vocab --version
To utilize the advanced translation and text generation features of Gemini 1.0 Pro and llama 3-8b,
API keys are required. These keys must be saved as environment variables. Follow the steps below to
manage your API keys:
- Gemini 1.0 Pro API Key
  - Save your Gemini 1.0 Pro API key as an environment variable named GEMINI_API_KEY.
  - In a Unix-based system (Linux, macOS), add the following line to your .bashrc or .zshrc file:
    export GEMINI_API_KEY='your_gemini_api_key'
- Hugging Face API Key for Llama 3-8b
  - Save your Hugging Face API key as an environment variable named HF_API_KEY.
  - In a Unix-based system, add the following line to your .bashrc or .zshrc file:
    export HF_API_KEY='your_hugging_face_api_key'
- Ensure your API keys are kept confidential and not shared publicly.
- Using API keys allows the script to access powerful language models and generate accurate translations and text examples efficiently.
By following these steps, you can seamlessly integrate API keys into the vocab script and leverage
its full capabilities for advanced language learning tasks.
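To check from Python that the two environment variables are visible before launching a long run, a small lookup like the following may help. It is a minimal sketch: `get_api_key` is a hypothetical helper, not part of vocab_augmentor, and only the variable names GEMINI_API_KEY and HF_API_KEY come from this README.

```python
import os


def get_api_key(var_name):
    """Return the API key stored in the given environment variable.

    Raises RuntimeError with a hint if the variable is unset or empty.
    Hypothetical helper for illustration only.
    """
    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(
            f"{var_name} is not set; export it in your shell profile first."
        )
    return key
```

Calling `get_api_key("GEMINI_API_KEY")` or `get_api_key("HF_API_KEY")` in a fresh terminal is a quick way to confirm that the export line in .bashrc or .zshrc has taken effect.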
- GPU Recommendation: When using llama 3-8b, GPU usage is highly recommended for faster processing.
- Text Generation: Use either Gemini 1.0 Pro or llama 3-8b to generate example sentences. Gemini 1.0 Pro is faster because it is called through an API.
To display the vocab script's list of options and their descriptions:
$ vocab -h
usage: vocab [OPTIONS] {TEXT} {TGT_LANG}
Expand your vocabulary list by identifying and translating new words from provided text using various language models.
options:
-v, --version Show program's version number and
exit.
-h, --help Display detailed usage instructions
and exit the program.
-t, --text TEXT The source text that will be
processed to identify and translate
new words.
-l, --target_lang TGT_LANG Target language code into which the
source text will be translated (e.g.,
zh for Chinese, en for English, pt
for Portuguese).
-o, --text_origin ORIGIN Origin of the source text, e.g. movie
script, book, URL of website, etc.
-d, --lang_detector NAME Method to use for detecting the
language of the source text.
(default: polyglot)
-m, --transl_model NAME Translation model to use for
translating the text.
(default: helsinki)
-g, --gen_model NAME Language model to use for generating
example sentences in the source
language. (default:
gemini)
-c, --csv_filepath CSV_FILE Path to the vocabulary CSV file. If
the file does not exist, a new one
will be created.
-a, --audio_dirpath AUDIO_DIR Path to the main directory for
storing audio files. (default:
~/audio/)
-b, --audio_base_url URL Base URL to audio files of words.
(experimental)
--ap, --add_pos Flag to add or update part-of-speech
(POS) information for the words.
--as, --add_sentences Flag to add or update example
sentences in the vocabulary list.
--aut, --add_audio_text Flag to add or update audio
pronunciation for the source text.
--aaw, --add_audio_words Flag to add or update audio
pronunciation for the extracted words
from the text.
--ascb, --add_save_comments_button Flag to add 'Save Comments' button in
the HTML page of the table.
(experimental)
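For readers who want to see how a subset of the options above could be declared, here is a rough argparse sketch. It is not the vocab script's actual parser: the defaults are copied from the help text, while the `dest` names, the choice of options, and marking -t and -l as required are assumptions made for illustration.

```python
import argparse


def build_parser():
    """Build a parser mirroring a subset of the options listed above."""
    parser = argparse.ArgumentParser(
        prog="vocab",
        description="Identify and translate new words from provided text.",
    )
    # Assumption: the source text and target language are mandatory.
    parser.add_argument("-t", "--text", required=True)
    parser.add_argument("-l", "--target_lang", required=True)
    # Defaults taken from the help output.
    parser.add_argument("-d", "--lang_detector", default="polyglot")
    parser.add_argument("-m", "--transl_model", default="helsinki")
    parser.add_argument("-g", "--gen_model", default="gemini")
    parser.add_argument("-a", "--audio_dirpath", default="~/audio/")
    # Boolean flags; explicit dest names avoid attributes such as "as",
    # which is a Python keyword.
    parser.add_argument("--ap", "--add_pos", dest="add_pos",
                        action="store_true")
    parser.add_argument("--as", "--add_sentences", dest="add_sentences",
                        action="store_true")
    parser.add_argument("--aaw", "--add_audio_words", dest="add_audio_words",
                        action="store_true")
    return parser


# Parse an argument list instead of sys.argv so the sketch can be run anywhere.
args = build_parser().parse_args(
    ["-t", "El sol y un cielo", "-l", "en", "-m", "gemini", "--ap"]
)
```

Here `args.transl_model` holds the value passed with -m, and any option that was left out falls back to the default shown in the help text.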
- Run the script:
  vocab -t 'El sol y un cielo' -l en -m gemini --ap --as --aaw
  NOTES:
  -t 'El sol y un cielo': the source text to be translated from Spanish to English (-l en)
  -m gemini: the Gemini 1.0 Pro model is used for translation from Spanish to English
  --ap: Part-of-Speech (POS) tags will be added for each segmented word from the source text
  --as: Two example sentences will be generated for each segmented word from the source text
  --aaw: Audio pronunciations will be generated for each segmented word from the source text
- The script will create an audio/ directory (if it doesn't already exist) and save the audio files there. The CSV and HTML files will include clickable links to these audio files.
Terminal output showing new words extracted from the source text and the example sentences generated by Gemini 1.0 Pro
The CSV file might have the following structure:
Word | Pinyin | Translation | POS | Audio (Word) |
---|---|---|---|---|
新词 | xīn cí | New word | noun | xinci.wav |
例子 | lì zi | Example | noun | lizi.wav |
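The core bookkeeping step, adding only words that are not yet in the vocabulary table, can be sketched as follows. This is an illustration under assumptions: the vocab script relies on pandas for its CSV handling, whereas this sketch uses the standard csv module to stay dependency-free, and the comma-separated layout and the helper names `merge_new_words` and `to_csv_text` are hypothetical.

```python
import csv
import io

# Column names copied from the example table above.
FIELDS = ["Word", "Pinyin", "Translation", "POS", "Audio (Word)"]


def merge_new_words(existing_csv_text, candidates):
    """Return (all_rows, added_rows), adding only words not seen before."""
    rows = list(csv.DictReader(io.StringIO(existing_csv_text)))
    known = {row["Word"] for row in rows}
    added = []
    for entry in candidates:
        if entry["Word"] not in known:
            known.add(entry["Word"])
            added.append(entry)
    return rows + added, added


def to_csv_text(rows):
    """Serialize rows back to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Existing table with one word, and two candidates extracted from a new text.
existing = (
    "Word,Pinyin,Translation,POS,Audio (Word)\n"
    "新词,xīn cí,New word,noun,xinci.wav\n"
)
candidates = [
    {"Word": "新词", "Pinyin": "xīn cí", "Translation": "New word",
     "POS": "noun", "Audio (Word)": "xinci.wav"},
    {"Word": "例子", "Pinyin": "lì zi", "Translation": "Example",
     "POS": "noun", "Audio (Word)": "lizi.wav"},
]
all_rows, added = merge_new_words(existing, candidates)
```

Only the second candidate ends up in `added`, because the first word is already present in the existing table.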
- Language detection: When running the script on a T4 GPU and selecting polyglot for automatic language detection, pycld3 can't be installed, hence polyglot can't be used. Use an alternative language detection method in such cases (e.g. langid or langdetect).
- Chinese Text-to-Speech: For Chinese text, MeloTTS may have difficulties with single-character words and low volume on some words.
- Spanish Text-to-Speech: Quality is good except for very short words such as "y".
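The polyglot limitation above suggests trying detectors in order and falling back when one is unavailable. The sketch below shows that pattern; it is not how the vocab script is implemented, and `detect_language` and the two wrapper functions are hypothetical names. The wrappers call `langdetect.detect` and `langid.classify`, which are imported lazily so the block itself loads even when neither package is installed.

```python
def detect_with_langdetect(text):
    """Detect the language with langdetect (third-party, imported lazily)."""
    from langdetect import detect
    return detect(text)


def detect_with_langid(text):
    """Detect the language with langid (third-party, imported lazily)."""
    import langid
    # classify() returns a (language_code, score) pair.
    return langid.classify(text)[0]


def detect_language(text, detectors):
    """Try each detector in order and return the first language code found."""
    errors = []
    for detector in detectors:
        try:
            return detector(text)
        except Exception as exc:  # missing package or failed detection
            errors.append(f"{detector.__name__}: {exc}")
    raise RuntimeError("No language detector succeeded: " + "; ".join(errors))
```

A call such as `detect_language(text, [detect_with_langdetect, detect_with_langid])` would then use langid only when langdetect is missing or fails on the input.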
Contributions are welcome!
This project is licensed under the MIT License. See the LICENSE file for details.
- Special thanks to the developers of the language models and MeloTTS used in this project.
- Inspired by the need to make language learning more efficient and effective.