Neural Voice Cloning with a Few Samples #8

flrngel · 2018-03-04T04:23:02Z

https://arxiv.org/abs/1802.06006
Paper from Baidu Research

Abstract

The paper studies two approaches to voice cloning:

Speaker adaptation
- fine-tunes a multi-speaker generative model
Speaker encoding
- infers a speaker embedding, which is then used with a multi-speaker generative model

1. Introduction

Text carries linguistic information
Speaker representation captures speaker's characteristics (pitch, speech rate, accent)
This paper focuses on voice cloning
Compares speech naturalness, speaker similarity, cloning/inference time, model footprint

2. Voice Cloning

Paper Notations

f: multi-speaker generative model
g: speaker encoding function
t: text
s: speaker
a: audio
S: speaker set
A: audio set

2.1. Speaker adaptation

Speaker adaptation function
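
A minimal sketch of embedding-only speaker adaptation, not the paper's implementation: the weights of a trained model f stay frozen and only the new speaker's embedding is optimized on a few (t, a) pairs. The linear toy model, the dimensions, the squared-error loss, and the learning rate are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy stand-in for the multi-speaker generative model f:
# audio features = W_text @ text_features + W_spk @ speaker_embedding.
text_dim, emb_dim, audio_dim = 8, 4, 6
W_text = rng.normal(size=(audio_dim, text_dim))  # trained weights, kept frozen
W_spk = rng.normal(size=(audio_dim, emb_dim))    # trained weights, kept frozen

def f(t, e):
    """Toy generative model: text features t and speaker embedding e -> audio features."""
    return W_text @ t + W_spk @ e

# A few cloning samples (t, a) from an unseen speaker whose embedding is unknown.
true_e = rng.normal(size=emb_dim)
texts = rng.normal(size=(5, text_dim))
audios = np.stack([f(t, true_e) for t in texts])

def adaptation_loss(e):
    """Mean squared error between generated and ground-truth audio features."""
    preds = np.stack([f(t, e) for t in texts])
    return float(np.mean((preds - audios) ** 2))

# Embedding-only adaptation: gradient descent on e while the weights stay fixed.
e = np.zeros(emb_dim)
initial_loss = adaptation_loss(e)
lr = 0.05  # assumed step size
for _ in range(500):
    residual = np.stack([f(t, e) for t in texts]) - audios    # (5, audio_dim)
    grad = 2.0 * (residual @ W_spk).mean(axis=0) / audio_dim  # d loss / d e
    e -= lr * grad
final_loss = adaptation_loss(e)
```

Fine-tuning the whole model would follow the same loop but also update the model weights, at the cost of more parameters per cloned speaker.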

2.2. Speaker encoding

Speaker encoding function

The paper avoids mode collapse by training the speaker encoder separately

Loss function (L1)
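
A minimal sketch of an L1 loss between a predicted speaker embedding and a target embedding. It assumes the target is an embedding taken from the multi-speaker generative model, and that the loss is averaged (rather than summed) over dimensions; the function name is hypothetical.

```python
import numpy as np

def l1_embedding_loss(predicted, target):
    """Mean absolute error between predicted and target speaker embeddings."""
    predicted = np.asarray(predicted, dtype=float)
    target = np.asarray(target, dtype=float)
    return float(np.mean(np.abs(predicted - target)))

# Made-up example values, for illustration only.
predicted = [0.5, -1.0, 2.0]
target = [1.0, -1.0, 0.0]
loss = l1_embedding_loss(predicted, target)  # (0.5 + 0.0 + 2.0) / 3
```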

Architecture

Spectral processing
Temporal processing
Cloning sample attention
- uses multi-head self-attention from Transformer
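
A rough sketch of how multi-head self-attention could combine features from several cloning samples into one speaker embedding. The random projection matrices, the dimensions, the omission of an output projection, and the final averaging step are assumptions for illustration and do not reproduce the paper's exact layer.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    ex = np.exp(x)
    return ex / ex.sum(axis=axis, keepdims=True)

def multi_head_self_attention(x, w_q, w_k, w_v, num_heads):
    """Scaled dot-product self-attention over cloning samples.

    x: (num_samples, dim); w_q, w_k, w_v: (dim, dim).
    Returns (output, weights) with shapes (num_samples, dim) and
    (num_heads, num_samples, num_samples).
    """
    n, dim = x.shape
    head_dim = dim // num_heads
    # Project, then split the feature axis into heads: (num_heads, n, head_dim).
    q = (x @ w_q).reshape(n, num_heads, head_dim).transpose(1, 0, 2)
    k = (x @ w_k).reshape(n, num_heads, head_dim).transpose(1, 0, 2)
    v = (x @ w_v).reshape(n, num_heads, head_dim).transpose(1, 0, 2)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(head_dim)
    weights = softmax(scores, axis=-1)                 # each row sums to 1
    heads = weights @ v                                # (num_heads, n, head_dim)
    output = heads.transpose(1, 0, 2).reshape(n, dim)  # concatenate heads
    return output, weights

rng = np.random.default_rng(0)
num_samples, dim, num_heads = 5, 8, 2
sample_features = rng.normal(size=(num_samples, dim))  # one row per cloning sample
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) for _ in range(3))

attended, weights = multi_head_self_attention(sample_features, w_q, w_k, w_v, num_heads)
speaker_embedding = attended.mean(axis=0)  # assumption: simple average across samples
```

The attention weights let informative cloning samples contribute more to the aggregated embedding than a plain average of the inputs would.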

2.3. Discriminative models for evaluation

Because human evaluation is expensive, the paper proposes two discriminative models for evaluation

2.3.1. Speaker Classification

Adds an embedding layer before the softmax function in the overall architecture

2.3.2. Speaker Verification

Binary classification of whether the test audio and the enrolled audio are from the same speaker
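
A simplified sketch of the verification decision, assuming each utterance has already been mapped to a fixed-size embedding. Cosine similarity with a fixed threshold is a stand-in chosen for illustration, not the verification model used in the paper; the function names, the threshold value, and the example vectors are hypothetical.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def same_speaker(enrolled_embedding, test_embedding, threshold=0.7):
    """Binary decision: True if the two embeddings are judged to share a speaker."""
    return cosine_similarity(enrolled_embedding, test_embedding) >= threshold

# Made-up embeddings, for illustration only.
enrolled = [1.0, 0.0, 1.0]
close_test = [0.9, 0.1, 1.1]  # nearly the same direction as the enrolled embedding
far_test = [-1.0, 1.0, 0.0]   # points away from the enrolled embedding
```

In practice the threshold would be tuned on held-out pairs of same-speaker and different-speaker audio.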

3. Experiments

3.1. Datasets

LibriSpeech dataset for training the multi-speaker generative model and the speaker encoder model
Speakers sampled from VCTK for voice cloning

flrngel added Attention Mechanism Speech Synthesis Speech labels Mar 4, 2018
