Neural Voice Cloning with a Few Samples #8

Open
flrngel opened this issue Mar 4, 2018 · 0 comments

https://arxiv.org/abs/1802.06006
Paper from Baidu Research

Abstract

The paper studies two approaches to voice cloning:

  • Speaker adaptation
    • fine-tunes a multi-speaker generative model on a few samples from the new speaker
  • Speaker encoding
    • infers a new speaker embedding, which is then used with a multi-speaker generative model

1. Introduction

  • Text carries linguistic information
  • Speaker representation captures speaker's characteristics (pitch, speech rate, accent)
  • This paper focuses on voice cloning
  • Compares the two approaches on speech naturalness, speaker similarity, cloning/inference time, and model footprint

2. Voice Cloning

image

Paper Notations

  • f: multi-speaker generative model
  • g: speaker encoding function
  • t: text
  • s: speaker
  • a: audio
  • S: speaker set
  • A: audio set

2.1. Speaker adaptation

Speaker adaptation function

image
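
The objective in the image above is not reproduced in these notes. As a rough sketch in the paper's notation, speaker adaptation can be written as minimizing the generative loss on the new speaker's few samples, either over the speaker embedding alone or over the whole model. The symbols W (generative model parameters), e_{s_k} (embedding of the new speaker s_k), T_{s_k} (that speaker's text-audio pairs) and L (the generative loss) are not defined in the notes and are assumptions here; check the paper for the exact form.

```latex
% Embedding-only adaptation: W stays frozen, only the embedding is optimized
\min_{\mathbf{e}_{s_k}} \;
  \mathbb{E}_{(t_{k,j},\, a_{k,j}) \sim \mathcal{T}_{s_k}}
  \left\{ L\!\left( f(t_{k,j}, s_k;\, W, \mathbf{e}_{s_k}),\; a_{k,j} \right) \right\}

% Whole-model adaptation: W and the embedding are both fine-tuned
\min_{W,\, \mathbf{e}_{s_k}} \;
  \mathbb{E}_{(t_{k,j},\, a_{k,j}) \sim \mathcal{T}_{s_k}}
  \left\{ L\!\left( f(t_{k,j}, s_k;\, W, \mathbf{e}_{s_k}),\; a_{k,j} \right) \right\}
```

Embedding-only adaptation touches far fewer parameters per speaker, which is where the model-footprint comparison from the introduction comes in.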

2.2. Speaker encoding

Speaker encoding function

image
The paper avoids mode collapse by training the speaker encoder separately
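
For context, the joint objective that separate training sidesteps can be sketched as follows, with the speaker encoder g producing the embedding directly from the cloning audios. Θ (encoder parameters), W (generative model parameters), T_{s_i} (text-audio pairs of speaker s_i) and L (the generative loss) are assumed symbols not defined in these notes, so treat this as an approximate reading rather than the paper's exact equation.

```latex
% Joint training of generative model f and speaker encoder g (prone to mode collapse)
\min_{W,\, \Theta} \;
  \mathbb{E}_{s_i \sim \mathcal{S},\; (t_{i,j},\, a_{i,j}) \sim \mathcal{T}_{s_i}}
  \left\{ L\!\left( f\big(t_{i,j}, s_i;\, W, g(\mathcal{A}_{s_i};\, \Theta)\big),\; a_{i,j} \right) \right\}
```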

Loss function (L1)

image
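
A minimal sketch of an L1 loss between the embedding predicted by the speaker encoder g and the target embedding taken from the trained multi-speaker model. The function name, the example vectors, and the choice to average over dimensions (rather than sum) are assumptions for illustration, not taken from the paper.

```python
def l1_embedding_loss(predicted, target):
    """Mean absolute difference between two equal-length embedding vectors."""
    if len(predicted) != len(target):
        raise ValueError("embeddings must have the same dimension")
    return sum(abs(p - t) for p, t in zip(predicted, target)) / len(predicted)


# Hypothetical 4-dimensional embeddings, for illustration only.
predicted = [0.5, -1.0, 0.0, 2.0]  # stands in for g(A_s), the encoder output
target = [0.0, -1.0, 1.0, 1.0]     # stands in for the trained speaker embedding
loss = l1_embedding_loss(predicted, target)
print(loss)
```

Training the encoder against fixed target embeddings keeps it decoupled from the generative model, which matches the separate-training setup described above.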

Architecture

image

  • Spectral processing
  • Temporal processing
  • Cloning sample attention
    • uses multi-head self-attention, as in the Transformer
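
The cloning sample attention step can be sketched as attention-based pooling over per-sample embeddings. This is a deliberately simplified stand-in: a single head, no learned query/key/value projections, and a plain average of the attended vectors at the end. None of these simplifications come from the paper, and `attention_pool` is a hypothetical name.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def attention_pool(sample_embeddings):
    """Combine per-sample embeddings into a single speaker embedding.

    Each sample attends to all samples via scaled dot-product scores;
    the attended vectors are then averaged.
    """
    d = len(sample_embeddings[0])
    attended = []
    for query in sample_embeddings:
        weights = softmax(
            [dot(query, key) / math.sqrt(d) for key in sample_embeddings]
        )
        attended.append(
            [sum(w * v[i] for w, v in zip(weights, sample_embeddings))
             for i in range(d)]
        )
    n = len(attended)
    return [sum(vec[i] for vec in attended) / n for i in range(d)]


# Hypothetical embeddings of three cloning audio samples.
samples = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0]]
print(attention_pool(samples))
```

Because the weights are a softmax, the pooled embedding is always a convex combination of the per-sample embeddings.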

2.3. Discriminative models for evaluation

Because human evaluation is expensive, the paper proposes two discriminative models for automatic evaluation

2.3.1. Speaker Classification

  • Adds an embedding layer before the softmax function, on top of the speaker encoder architecture

2.3.2. Speaker Verification

  • binary classification of whether the test audio and the enrolled audio come from the same speaker
    image
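
A rough sketch of the verification decision: average the enrolled embeddings, score the test embedding against that centroid, and threshold the score. Cosine similarity and the 0.7 threshold are stand-ins chosen for illustration; the paper's verification model uses its own learned scoring, which is not reproduced in these notes.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


def same_speaker(enrolled_embeddings, test_embedding, threshold=0.7):
    """Return True if the test embedding matches the enrolled speaker.

    The threshold is a hypothetical value; in practice it would be tuned
    on held-out data.
    """
    d = len(test_embedding)
    n = len(enrolled_embeddings)
    centroid = [sum(e[i] for e in enrolled_embeddings) / n for i in range(d)]
    return cosine_similarity(centroid, test_embedding) >= threshold


# Hypothetical 2-dimensional embeddings for illustration.
enrolled = [[1.0, 0.0], [0.9, 0.1]]
print(same_speaker(enrolled, [1.0, 0.05]))
print(same_speaker(enrolled, [0.0, 1.0]))
```

For cloning evaluation, the test embedding would come from cloned audio and the enrolled embeddings from the target speaker's real recordings.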

3. Experiments

3.1. Datasets

  • LibriSpeech is used to train the multi-speaker generative model and the speaker encoder
  • Speakers sampled from VCTK are used for voice cloning