initial commit

jumon · Dec 19, 2022
commit 56c515b

Showing 4 changed files with 465 additions and 0 deletions.
diff --git a/.gitignore b/.gitignore
@@ -0,0 +1,166 @@
+# Byte-compiled / optimized / DLL files
+__pycache__/
+*.py[cod]
+*$py.class
+
+# C extensions
+*.so
+
+# Distribution / packaging
+.Python
+build/
+develop-eggs/
+dist/
+downloads/
+eggs/
+.eggs/
+lib/
+lib64/
+parts/
+sdist/
+var/
+wheels/
+share/python-wheels/
+*.egg-info/
+.installed.cfg
+*.egg
+MANIFEST
+
+# PyInstaller
+#  Usually these files are written by a python script from a template
+#  before PyInstaller builds the exe, so as to inject date/other infos into it.
+*.manifest
+*.spec
+
+# Installer logs
+pip-log.txt
+pip-delete-this-directory.txt
+
+# Unit test / coverage reports
+htmlcov/
+.tox/
+.nox/
+.coverage
+.coverage.*
+.cache
+nosetests.xml
+coverage.xml
+*.cover
+*.py,cover
+.hypothesis/
+.pytest_cache/
+cover/
+
+# Translations
+*.mo
+*.pot
+
+# Django stuff:
+*.log
+local_settings.py
+db.sqlite3
+db.sqlite3-journal
+
+# Flask stuff:
+instance/
+.webassets-cache
+
+# Scrapy stuff:
+.scrapy
+
+# Sphinx documentation
+docs/_build/
+
+# PyBuilder
+.pybuilder/
+target/
+
+# Jupyter Notebook
+.ipynb_checkpoints
+
+# IPython
+profile_default/
+ipython_config.py
+
+# pyenv
+#   For a library or package, you might want to ignore these files since the code is
+#   intended to run in multiple environments; otherwise, check them in:
+.python-version
+
+# pipenv
+#   According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
+#   However, in case of collaboration, if having platform-specific dependencies or dependencies
+#   having no cross-platform support, pipenv may install dependencies that don't work, or not
+#   install all needed dependencies.
+#Pipfile.lock
+
+# poetry
+#   Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
+#   This is especially recommended for binary packages to ensure reproducibility, and is more
+#   commonly ignored for libraries.
+#   https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
+#poetry.lock
+
+# pdm
+#   Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
+#pdm.lock
+#   pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
+#   in version control.
+#   https://pdm.fming.dev/#use-with-ide
+.pdm.toml
+
+# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
+__pypackages__/
+
+# Celery stuff
+celerybeat-schedule
+celerybeat.pid
+
+# SageMath parsed files
+*.sage.py
+
+# Environments
+.env
+.venv
+env/
+venv/
+ENV/
+env.bak/
+venv.bak/
+
+# Spyder project settings
+.spyderproject
+.spyproject
+
+# Rope project settings
+.ropeproject
+
+# mkdocs documentation
+/site
+
+# mypy
+.mypy_cache/
+.dmypy.json
+dmypy.json
+
+# Pyre type checker
+.pyre/
+
+# pytype static type analyzer
+.pytype/
+
+# Cython debug symbols
+cython_debug/
+
+# PyCharm
+#  JetBrains specific template is maintained in a separate JetBrains.gitignore that can
+#  be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
+#  and can be added to the global gitignore or merged into this file.  For a more nuclear
+#  option (not recommended) you can uncomment the following to ignore the entire idea folder.
+#.idea/
+
+
+# Below are files that are specific to this project
+data/
+output/
+.vscode/
diff --git a/LICENSE b/LICENSE
@@ -0,0 +1,21 @@
+MIT License
+
+Copyright (c) 2022 Jumon Nozaki
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
diff --git a/README.md b/README.md
@@ -0,0 +1,49 @@
+# whisper-punctuator
+Zero-shot punctuation insertion using the [Whisper](https://github.com/openai/whisper) speech recognition model:
+* No additional training required
+* Works on any language that Whisper supports
+* Can change the style of punctuation by using a prompt
+
+Have you ever wanted to fine-tune a Whisper model using public data, but the data doesn't have punctuation? If so, this script is for you! It inserts punctuation into unpunctuated text using the Whisper model in a zero-shot fashion, given the unpunctuated text and its corresponding audio file.
+
+## Quick Start
+To use the script, first install [Whisper](https://github.com/openai/whisper) and its dependencies. See the instructions [here](https://github.com/openai/whisper#setup) for more details.
+
+Run the following command to insert punctuation into text using a text and audio file pair as input:
+```
+python insert_punctuation.py --audio <path-to-audio-file> --text <text-to-be-punctuated>
+```
+Note that the audio needs to be shorter than 30 seconds; if it is longer, the first 30 seconds will be used.
+
+The default setting treats `,` `.` `?` as punctuation. To change the punctuation characters, use the `--punctuations` flag and specify a list of characters. For example, to treat `,` `.` `?` `!` as punctuation, run:
+```
+python insert_punctuation.py --audio <path-to-audio-file> --text <text-to-be-punctuated> --punctuations ",.?!"
+```
+To handle languages other than English, you can use the `--language` flag to specify the language. For example, to insert punctuation for a Japanese text and treat `。` `、` as punctuation, run:
+```
+python insert_punctuation.py --audio <path-to-audio-file> --text <text-to-be-punctuated> --language ja --punctuations "。、"
+```
+To change the style of punctuation, use the `--initial-prompt` flag to specify a prompt. This makes the model more likely to insert punctuation in the style of the prompt. For example, to insert punctuation after every word (though this is not recommended), run:
+```
+python insert_punctuation.py --audio <path-to-audio-file> --text <text-to-be-punctuated> --initial-prompt "hello, how, are, you, today?"
+```
+
+For all available options, run:
+```
+python insert_punctuation.py --help
+```
+
+## How does it work?
+Whisper is an automatic speech recognition (ASR) model trained on a massive amount of labeled audio data collected from the internet.
+The data used in its training contains punctuation, so the model learns to recognize punctuation as well.
+This allows Whisper to insert punctuation into a text given an audio and text pair.
+To insert punctuation, the audio is first input into the encoder of the model to generate the encoder hidden states.
+Then, the decoder of the model processes each token in the text one by one, in an autoregressive fashion, along with the encoder hidden states. After each token, the model predicts the output probability for each punctuation character. If the highest probability is above a specified threshold (set with the `--min-prob` flag), that punctuation character is inserted after the token.
+
+## Limitations
+- The results are dependent on the dataset used to train Whisper, which is not publicly available. Punctuation marks that are rare in the training data may not be recognized well.
+- Since the Whisper decoder generates tokens in a left-to-right fashion, it cannot see future tokens when predicting the punctuation after the current token. This can cause problems in some cases: for example, `まーはい` (uh yes, where まー means uh and はい means yes) in Japanese is often punctuated as `ま、ーはい` instead of `まー、はい`. This is because the model cannot see the upcoming `ー` (Japanese long vowel mark) when predicting the punctuation after `ま`, and `ま` by itself also means `uh` in Japanese. We circumvent this problem by preventing the model from inserting punctuation immediately before the characters specified by the `--punctuation-suppressing-chars` flag, such as `ー`. However, this is not a perfect solution and the model may still suffer from problems due to its left-to-right decoding nature.
+- If the model fails to insert punctuation when it should, it may enter a "no-punctuation" mode and not insert any further punctuation in the text. This is another issue caused by the left-to-right decoding nature. To mitigate this problem, you can use the `--initial-prompt` flag to induce the model to enter a "punctuation" mode.
+- The `--min-prob` and `--initial-prompt` flags may need to be tuned to get the best results, depending on the data.
+- The current implementation does not allow for punctuation marks that consist of multiple tokens according to the Whisper tokenizer.
+- If you want to fine-tune a Whisper model using publicly available data, you may want to ensure that the data is not only punctuated but also truecased. This script does not perform truecasing, but it may be possible to achieve this using similar (but probably more complicated) techniques.
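
Note on the README's "How does it work?" section: the thresholded, left-to-right insertion it describes can be sketched roughly as follows. This is a minimal illustration, not the code in `insert_punctuation.py`. The function `insert_punctuation`, the `toy_probs` table, and all probability values are hypothetical stand-ins for the Whisper decoder, which is not called here. The sketch also assumes that suppression applies to punctuation placed directly before a listed character, which is what the `まー` example in the Limitations section implies.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical type: maps the sequence decoded so far to a probability for each
# punctuation character. In the real script this role would be played by the
# Whisper decoder conditioned on the encoder hidden states.
PunctProbFn = Callable[[Sequence[str]], Dict[str, float]]


def insert_punctuation(
    tokens: Sequence[str],
    punct_probs: PunctProbFn,
    min_prob: float = 0.5,
    suppressing_chars: str = "",
) -> str:
    """Greedy left-to-right punctuation insertion with a probability threshold."""
    output: List[str] = []
    for i, token in enumerate(tokens):
        output.append(token)
        # The text is known in advance, so we may peek at the next token even
        # though the decoder itself cannot (assumed suppression semantics).
        next_token = tokens[i + 1] if i + 1 < len(tokens) else ""
        if next_token and next_token[0] in suppressing_chars:
            continue
        probs = punct_probs(output)
        if not probs:
            continue
        # Take the most likely punctuation mark and insert it only if its
        # probability reaches the threshold.
        best = max(probs, key=probs.get)
        if probs[best] >= min_prob:
            output.append(best)
    return "".join(output)


def toy_probs(prefix: Sequence[str]) -> Dict[str, float]:
    """Made-up probabilities keyed on the last token; purely illustrative."""
    table = {
        "ま": {"、": 0.7, "。": 0.1},
        "ー": {"、": 0.8, "。": 0.1},
        "はい": {"、": 0.2, "。": 0.6},
    }
    return table.get(prefix[-1], {})


if __name__ == "__main__":
    tokens = ["ま", "ー", "はい"]
    print(insert_punctuation(tokens, toy_probs))
    print(insert_punctuation(tokens, toy_probs, suppressing_chars="ー"))
```

With these made-up numbers, the unsuppressed run places `、` right after `ま`, while passing `suppressing_chars="ー"` skips that position. Raising `min_prob` above every toy probability leaves the text unpunctuated, which mirrors why `--min-prob` may need tuning per dataset.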