Update the pointer to the Hokkien benchmark dataset #5589

Merged 1 commit on Jan 9, 2025
2 changes: 1 addition & 1 deletion examples/hokkien/README.md
@@ -11,7 +11,7 @@ See our interactive [Demo page](https://huggingface.co/spaces/facebook/Hokkien_T
We create and release a Hokkien-English parallel speech dataset that is available for benchmarking Hokkien<>English speech to speech translation systems. The dataset was derived from TAT-Vol1-eval-lavalier (dev) and TAT-Vol1-test-lavalier (test) based on [Taiwanese Across Taiwan (TAT) corpus](https://sites.google.com/speech.ntut.edu.tw/fsw/home/tat-corpus), which contained audio recordings and transcripts in Taiwanese Hokkien.
We created the parallel dataset by first concatenating neighboring sentences to form longer utterances, translating the Hokkien text transcriptions into English via Hokkien-English bilinguals, and recording the English translations with human voices. Below are some summary statistics of the dataset.

- The dataset is available [HERE](https://sites.google.com/nycu.edu.tw/speechlabx/tat_s2st_benchmark).
+ The benchmark dataset is available at https://sites.google.com/nycu.edu.tw/sarc/tat_s2st_benchmark.

## Open Sourced English-Hokkien S2ST Models
