README

LabROSA Lyrics / Audio Repository
=================================

This is the documentation for LYRICAL: the Labrosa repositorY of Recorded Information Concerning Audio and Lyrics. Contained within are the lyrics to popular music tracks, annotated at the structure level (verse, chorus etc.) with accompanying acapella and polyphonic audio. 

The database is intended to be used for providing a common dataset for Automatic Lyric Transcription (ALT). Also contained within are therefore scripts for reproducing our baseline ALT performance using the Sphinx4 Automatic Speech Recognition system.

Installation
============

Lyrics
------
The easiest way of getting the lyrics is to say:

git clone https://github.com/mattmcvicar/lyric_database.git

Audio
-----
For queries relating to audio, please email mattjamesmcvicar@gmail.com

Sphinx4
-------
If you want to run Reproduce_baseline.py, you will need to set up a Sphinx4 project. The following list of conjures was found to be effective on Mac OSX 10.6.8, revision revision 12489. If the following doesn't work for you, it may be that the sphinx4 structure has changed, in which case email mattjamesmcvicar@gmail.com and I'll update this README.

1) Download latest stable Sphinx4 *source* release from:

http://cmusphinx.sourceforge.net/wiki/download/

2) unzip and cd into repo

jar xvf sphinx4-5prealpha-src.zip
cd sphinx4-5prealpha

3) build Sphinx4 and all demos

ant

4) We're going to adapt one of the demos for our own use. Open up

/src/apps/edu/cmu/sphinx/demo/transcriber/Transcriber.java

in an editor. First, note that Transcriber currently spits out quite a bit of info, so comment out everything in the while loop (lines 51-63) and replace it with the following:

System.out.format("%s\n", result.getHypothesis());

so that it just echoes the best hypothesis for each detected utterance to stdout

5) re-complile the demo:

ant

6) Run the demo, with 512mb starting heap memory, up to 1GB. Run the following:

java -Xms512m -Xmx1024m -jar ./bin/Transcriber.jar

It's had a pretty good go at transcribing some digits. You should see:

what zero is zero zero one
nine oh two one oh
cyril one eight zero three

7) The demo currently reads the audio file stored in 

src/apps/edu/cmu/sphinx/demo/aligner/10001-90210-01803.wav

and transcribes it using a fairly general-purpose model. Let's adapt it to take an audio file as an additional input. Set the line:

recognizer.startRecognition(new URL("file:src/apps/edu/cmu/sphinx/demo/aligner/10001-90210-01803.wav").openStream());

to be:

recognizer.startRecognition(new URL("file:" + args[0] ).openStream());

8) re-compile the demo for the last time:

ant

9) Now try running on your own audio:

java -Xms512m -Xmx1024m -jar ./bin/Transcriber.jar 

10) That's it! Note you can also change the acoustic model, language model and dictionary in Transcriber.java too. Just remember to re-compile after each change. To reproduce our baseline results, see the file 

resources/Reproduce_baseline.py

Contents
========

Lyrics
------
The database is split into two disjoint subsets: "Train" and "Test", and two genres: "Rap" and "Sing". The "Train" set is to be used for training or adapting acoustic or language models, the test set reserved for evaluation. We believe the acoustic and language models for rapped and sung audio to be quite different, so have kept them distinct. 

Each file is written in plain text, with each line being one of:
  
  - annotation line
  - lyric line
  - empty line

Annotation lines are enclosed within square brackets and contain structural information. lyric lines contain the lyrics to the song, made up of the lower and upper-case letters of the latin alphabet, together with the following punctuation set: 

,!?'-

(commas, exclamation marks, questions marks, single quotes and hyphens). Empty lines are used purely for aesthetic purposes. Example file snippet (first verse of "30_Seconds_to_Mars_-_The_Kill.lyrics"):

-----------------------------------
[VERSE]
What if I wanted to break
Laugh it all off in your face
What would you do? Oh, oh oh oh
What if I fell to the floor
Couldn't take all this anymore
What would you do do, do do, do do?
-----------------------------------

Resources
---------
Since the database contains a considerable amount of slang, we also include a file 

resources/Train_Test_Holdout.dict

which contains pronounciations from the CMU set of 39 unstressed phonemes for each word in the train, test, and holdout sets. These are not used in our baseline experiments (many will most likely not be in standard language models) except in evaluation, but will be essential in training new language models. 

Reproduce_baseline.py
---------------------
Also contained within the repository is a simple python script "Reproduce_baseline.py", used to reproduce the results reported in the accompanying paper (see below)

The script assumes you are familiar with Sphinx4. If you require assistance in installing and setting up Sphinx4, please see:

http://cmusphinx.sourceforge.net/wiki/tutorialsphinx4

Attribution
===========

If you make use of this database in your research, please consider citing the following paper:


Contact
=======

For all queries, please contact mattjamesmcvicar@gmail.com