MusicGen advice #3

Open
chlowden opened this issue Jun 28, 2024 · 6 comments

Comments

@chlowden

Hello
This is an issue, but maybe you have an idea that can help me. I am using your MusicGen interface with the metadata below:

{
  "_version": "0.0.1",
  "_hash_version": "0.0.3",
  "_type": "musicgen",
  "_audiocraft_version": "1.3.0",
  "models": {},
  "prompt": "((piano)) acoustic key F minor minimalist low energy 4/4, 150bpm 320kbps 48.0kHz Stereo",
  "hash": "05f88c6f4049307a5209b74c368f62fda1575c7ab45668e06b39f54806e0fbcd",
  "date": "2024-06-21_23-06-50",
  "text": "((piano)) acoustic key F minor minimalist low energy 4/4, 150bpm 320kbps 48.0kHz Stereo",
  "melody": "94240fe69b46edc19d55977a5b38598da85708c28bd932d51dbbd5f00e609076",
  "model": "facebook/musicgen-stereo-melody-large",
  "duration": 360,
  "topk": 250,
  "topp": 0,
  "temperature": 1,
  "cfg_coef": 3,
  "seed": "1538577670",
  "use_multi_band_diffusion": false
}

The melody reference is Philip Glass Metamorphosis 5
https://www.youtube.com/watch?v=Rebr_F53db8

This seemed to me to be reasonably possible for a general AI model. After hours of playing around with different settings, I still don't get anything near the ref. I get an audio "soup" at best. As there seems to be rather little written on MusicGen, I was wondering if you have any ideas about the limits of the MusicGen model and what it might have been trained on?
Any thoughts are most welcome.
Thank you.

@rsxdalv
Owner

rsxdalv commented Jun 28, 2024 via email

@rsxdalv
Owner

rsxdalv commented Jun 30, 2024

Note - the following comment is copyrighted and is not meant to be freely redistributed on other pages or platforms. To anyone who wishes to reproduce this - please reach out to me.

I have done some work on the prompt. It might not be the prompt you wanted it to be, but this seems to work better:

(default settings, musicgen stereo melody large, seed: 4240372830)

piano solo, Philip Glass: Solo Piano, Metamorphosis: Five, piano, acoustic key F minor minimalist low energy 4/4, 150bpm 320kbps 48.0kHz Stereo
audio.51.mp4

Note that the elements are separated more. Although this reduces the expressiveness of the model, it can be better to split the prompt with more commas when the AI doesn't seem to notice something.
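
As an illustration of the comma-splitting advice, here is a minimal sketch that assembles a prompt from separate elements. The helper name `build_prompt` is hypothetical and not part of the web UI or of audiocraft; the element list mirrors one of the prompts in this thread.

```python
def build_prompt(elements):
    """Join prompt elements with ', ' so each one stands on its own.

    Hypothetical helper for illustration; blank entries and stray
    commas around an element are dropped.
    """
    parts = []
    for element in elements:
        text = element.strip().strip(",").strip()
        if text:
            parts.append(text)
    return ", ".join(parts)


elements = [
    "piano solo",
    "Philip Glass: Solo Piano",
    "Metamorphosis: Five",
    "piano",
    "acoustic key E minor",
    "minimalist low energy 4/4",
    "120bpm",
    "320kbps",
    "48.0kHz Stereo",
    "Studio",
]
prompt = build_prompt(elements)
print(prompt)
```

Keeping the elements in a list makes it easy to add or remove a single term between runs while leaving the rest of the prompt unchanged.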

I also generated it without any melody.

I noticed that the audio is normalized and the quality isn't perfect, but that might always be a challenge, as with all generative AI models.

I felt like the audio is a bit too fast, so I tried a lower BPM with this reference:

image

piano solo, Philip Glass: Solo Piano, Metamorphosis: Five, piano, acoustic key E minor, minimalist low energy 4/4, 120bpm, 320kbps, 48.0kHz Stereo, Studio
audio.53.mp4

By the way, I see that you generate a 360s long piece. Even on an A100 I see about 3 seconds of generation time for 1 second of audio, so I recommend starting with short generations of 1-3 seconds. When one sounds worthwhile, reuse the same seed and increase the generation time bit by bit. That's also what I did for some of this library, to see what the model is able to follow and what it isn't.
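
The short-first workflow can be sketched roughly as follows. This is only an outline under stated assumptions: `generate` is a hypothetical stand-in for whatever backend call actually produces audio, `fake_generate` is a placeholder so the sketch runs without a model, and the 3x time ratio is the rough A100 figure mentioned above.

```python
from typing import Any, Callable, Iterator, Sequence, Tuple

# Rough figure quoted above for an A100; other hardware will differ.
SECONDS_PER_AUDIO_SECOND = 3.0


def estimate_generation_time(duration_s: float,
                             ratio: float = SECONDS_PER_AUDIO_SECOND) -> float:
    """Estimate wall-clock seconds needed to generate `duration_s` of audio."""
    return duration_s * ratio


def grow_generation(generate: Callable[..., Any], prompt: str, seed: int,
                    durations: Sequence[float]) -> Iterator[Tuple[float, Any]]:
    """Yield (duration, audio) for increasing durations with a fixed seed.

    `generate` is a hypothetical stand-in for the real backend call.
    Because this is a generator, the caller can listen to each result
    and stop iterating as soon as one sounds wrong.
    """
    for duration in sorted(durations):
        yield duration, generate(prompt=prompt, seed=seed, duration=duration)


def fake_generate(prompt: str, seed: int, duration: float) -> str:
    """Placeholder backend so the sketch runs without a model."""
    return f"<{duration}s of audio, seed={seed}>"


steps = list(grow_generation(fake_generate, "piano solo, minimalist",
                             seed=4240372830, durations=[3, 12, 24, 61]))
for duration, audio in steps:
    print(duration, audio, f"~{estimate_generation_time(duration):.0f}s to generate")

print(estimate_generation_time(360))  # 1080.0 seconds, i.e. 18 minutes
```

At that ratio a 360s piece costs about 18 minutes per attempt, which is why iterating on a few seconds of audio first is so much cheaper.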

I tried to add some more tweaks, but it seems to need some skillful prompt-engineering (as much as prompt-engineering is a meme in other circles, here it's exactly what we need).

Parameters:
text : piano solo, Philip Glass: Solo Piano, Metamorphosis: Five, piano, acoustic key E minor, minimalist low energy 4/4, 120bpm, 320kbps, 48.0kHz Stereo, Studio, sorrow, sad, melancholy 
melody : None
model : facebook/musicgen-stereo-melody-large
duration : 61
topk : 250
topp : 0
temperature : 1
cfg_coef : 3
seed : 4240372830
use_multi_band_diffusion : False
Generated in 150.288 seconds

With the longer durations musicgen begins to splice and give suboptimal transitions.

audio.54.mp4
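
One possible explanation for the splicing at longer durations is that long outputs are produced as overlapping windows rather than in one pass. The sketch below only counts such windows; the 30s window and 18s stride are assumptions (believed to match audiocraft's defaults, but not stated anywhere in this thread), and `sliding_windows` is a hypothetical name.

```python
def sliding_windows(duration, window=30.0, stride=18.0):
    """Return (start, end) second ranges of successive generation windows.

    Assumed scheme: each window is at most `window` seconds long and
    starts `stride` seconds after the previous one, so consecutive
    windows overlap and every window after the first ends in a splice.
    """
    windows = []
    start = 0.0
    covered = 0.0  # seconds of audio produced so far
    while covered < duration:
        end = min(start + window, duration)
        windows.append((start, end))
        covered = end
        start += stride
    return windows


for seconds in (24, 61, 360):
    spans = sliding_windows(seconds)
    print(seconds, "s ->", len(spans), "window(s),", len(spans) - 1, "splice(s)")
```

Under these assumptions a 24s clip fits in a single window, a 61s clip needs three, and a 360s piece needs twenty, which would give errors many chances to accumulate.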
Parameters:
text : piano solo, 
Philip Glass: Solo Piano (1989), 
Metamorphosis: Five, 
piano, 
acoustic key E minor, 
minimalist low energy 4/4, 
120bpm, 
120bpm piano, 
320kbps, 
48.0kHz Stereo, 
Studio, 
sorrow, 
minimalism genre,
Classical, Avant-Garde,
melody : None
model : facebook/musicgen-stereo-melody-large
duration : 24
topk : 250
topp : 0
temperature : 1
cfg_coef : 3
seed : 4240372830
use_multi_band_diffusion : False
Generated in 60.155 seconds
audio.56.mp4

The problem that tends to happen is it gets fairly monotone; however it does slowly move forward.

Increasing the temperature gives a worse result:

Parameters:
text : piano solo, 
Philip Glass: Solo Piano (1989), 
Metamorphosis: Five, 
piano, 
acoustic key E minor, 
minimalist low energy 4/4, 
120bpm, 
120bpm piano, 
320kbps, 
48.0kHz Stereo, 
Studio, 
sorrow, 
minimalism genre,
Classical, Avant-Garde,
melody : None
model : facebook/musicgen-stereo-melody-large
duration : 24
topk : 250
topp : 0
temperature : 1.15
cfg_coef : 3
seed : 4240372830
use_multi_band_diffusion : False
Generated in 60.682 seconds
audio.57.mp4

Even increasing CFG to 5 does not fix it:

Parameters:
text : piano solo, 
Philip Glass: Solo Piano (1989), 
Metamorphosis: Five, 
piano, 
acoustic key E minor, 
minimalist low energy 4/4, 
120bpm, 
120bpm piano, 
320kbps, 
48.0kHz Stereo, 
Studio, 
sorrow, 
minimalism genre,
Classical, Avant-Garde,
melody : None
model : facebook/musicgen-stereo-melody-large
duration : 12
topk : 250
topp : 0
temperature : 1.1
cfg_coef : 5
seed : 4240372830
use_multi_band_diffusion : False
Generated in 29.895 seconds
audio.58.mp4

Lowering temperature to 0.9 with CFG still at 5:

Parameters:
text : piano solo, 
Philip Glass: Solo Piano (1989), 
Metamorphosis: Five, 
piano, 
acoustic key E minor, 
minimalist low energy 4/4, 
120bpm, 
120bpm piano, 
320kbps, 
48.0kHz Stereo, 
Studio, 
sorrow, 
minimalism genre,
Classical, Avant-Garde,
melody : None
model : facebook/musicgen-stereo-melody-large
duration : 12
topk : 250
topp : 0
temperature : 0.9
cfg_coef : 5
seed : 4240372830
use_multi_band_diffusion : False
Generated in 29.636 seconds
audio.59.mp4

Dropping CFG to 2 improves the result:

Parameters:
text : piano solo, 
Philip Glass: Solo Piano (1989), 
Metamorphosis: Five, 
piano, 
acoustic key E minor, 
minimalist low energy 4/4, 
120bpm, 
120bpm piano, 
320kbps, 
48.0kHz Stereo, 
Studio, 
sorrow, 
minimalism genre,
Classical, Avant-Garde,
melody : None
model : facebook/musicgen-stereo-melody-large
duration : 12
topk : 250
topp : 0
temperature : 0.9
cfg_coef : 2
seed : 4240372830
use_multi_band_diffusion : False
Generated in 29.522 seconds
audio.61.mp4

However, overall it seems that the model wants to have more information in the prompt.

At 0.7 CFG the model was too creative and added silence gaps in the music.

Parameters:
text : piano solo, 
Philip Glass: Solo Piano (1989), 
Metamorphosis: Five, 
piano, 
acoustic key E minor, 
minimalist low energy 4/4, 
120bpm, 
120bpm piano, 
320kbps, 
48.0kHz Stereo, 
Studio, 
sorrow, 
minimalism genre,
Classical, Avant-Garde,
melody : None
model : facebook/musicgen-stereo-melody-large
duration : 26
topk : 250
topp : 0
temperature : 0.9
cfg_coef : 0.7
seed : 4240372830
use_multi_band_diffusion : False
Generated in 64.441 seconds
audio.62.mp4

Even CFG 1.4 is not enough:

Parameters:
text : piano solo, 
Philip Glass: Solo Piano (1989), 
Metamorphosis: Five, 
piano, 
acoustic key E minor, 
minimalist low energy 4/4, 
120bpm, 
120bpm piano, 
320kbps, 
48.0kHz Stereo, 
Studio, 
sorrow, 
minimalism genre,
Classical, Avant-Garde,
melody : None
model : facebook/musicgen-stereo-melody-large
duration : 26
topk : 250
topp : 0
temperature : 0.9
cfg_coef : 1.4
seed : 4240372830
use_multi_band_diffusion : False
Generated in 64.750 seconds
audio.63.mp4

CFG 1.7 gets better, but is monotone:

audio.64.mp4

CFG 1.7 with Temperature 1.1 again is not really useful:

audio.65.mp4
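
The manual tweaking above amounts to a small grid search over CFG and temperature with everything else held fixed. A minimal sketch of how such a grid could be enumerated is shown here; `GenSettings` and `sweep` are hypothetical names whose fields mirror the parameter logs in this comment, not an API of the web UI or audiocraft.

```python
from dataclasses import dataclass, replace
from itertools import product


@dataclass(frozen=True)
class GenSettings:
    """Hypothetical container mirroring the logged parameters."""
    duration: int = 12          # keep runs short while exploring
    topk: int = 250
    topp: float = 0.0
    temperature: float = 1.0
    cfg_coef: float = 3.0
    seed: int = 4240372830      # fixed so runs stay comparable


def sweep(base, cfg_values, temperatures):
    """Return one settings object per (cfg_coef, temperature) pair."""
    return [
        replace(base, cfg_coef=cfg, temperature=temp)
        for cfg, temp in product(cfg_values, temperatures)
    ]


grid = sweep(GenSettings(), cfg_values=[1.4, 1.7, 2.0], temperatures=[0.9, 1.1])
for settings in grid:
    print(settings)
```

Holding the seed and duration constant means any audible difference between two runs comes from the two swept parameters alone.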

Tweaking CFG, temperature and adding more to the prompt:

Parameters:
text : piano solo, 
Philip Glass: Solo Piano (1989), 
Metamorphosis: Five, 
piano, 
acoustic key E minor, 
minimalist low energy 4/4, 
120bpm and 108bpm piano, 
320kbps, 
48.0kHz Stereo, 
Studio, 
sorrow, 
minimalism genre,
Classical, Avant-Garde,
dynamic rhythm,
melody : None
model : facebook/musicgen-stereo-melody-large
duration : 18
topk : 250
topp : 0
temperature : 0.95
cfg_coef : 1.8
seed : 4240372830
use_multi_band_diffusion : False
Generated in 43.900 seconds
audio.66.mp4

Now, trying to expand that to 6 minutes we run into problems, and the audio slowly degenerates completely (warning, loud noises):

Parameters:
text : piano solo, 
Philip Glass: Solo Piano (1989), 
Metamorphosis: Five, 
piano, 
acoustic key E minor, 
minimalist low energy 4/4, 
120bpm and 108bpm piano, 
320kbps, 
48.0kHz Stereo, 
Studio, 
sorrow, 
minimalism genre,
Classical, Avant-Garde,
dynamic rhythm,
melody : None
model : facebook/musicgen-stereo-melody-large
duration : 360
topk : 250
topp : 0
temperature : 0.95
cfg_coef : 1.8
seed : 4240372830
use_multi_band_diffusion : False
Generated in 884.383 seconds
audio.67.mp4
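
For reference, the "Generated in" lines logged in this comment can be turned into a cost per second of audio. The sketch below uses only the (duration, seconds) pairs copied from the logs above; it assumes all runs were made on the same hardware, which the logs do not state.

```python
# (audio duration in s, logged generation time in s), copied from the runs above
runs = [
    (61, 150.288),
    (24, 60.155),
    (24, 60.682),
    (12, 29.895),
    (12, 29.636),
    (12, 29.522),
    (26, 64.441),
    (26, 64.750),
    (18, 43.900),
    (360, 884.383),
]

# Generation seconds spent per second of audio, for each run
ratios = [elapsed / duration for duration, elapsed in runs]
for (duration, elapsed), ratio in zip(runs, ratios):
    print(f"{duration:>4}s audio: {elapsed:8.3f}s -> {ratio:.2f}x")

# Overall ratio weighted by audio length
overall = sum(elapsed for _, elapsed in runs) / sum(duration for duration, _ in runs)
print(f"overall: {overall:.2f}x")
```

The ratio stays close to 2.5x across all of these runs, so generation time grows roughly linearly with duration even where the audio quality does not hold up.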

Note - this comment is copyrighted and is not meant to be freely redistributed on other pages or platforms. To anyone who wishes to reproduce this - please reach out to me.

@rsxdalv
Owner

rsxdalv commented Jun 30, 2024

Also notably it used 33.2GB of VRAM

@rsxdalv
Owner

rsxdalv commented Jun 30, 2024

I've confirmed that Stable Audio destroys MusicGen in this case: VRAM, speed, quality, control, ease of use. The only drawback is the license, although MusicGen does not have a very permissive license either.

@chlowden
Author

chlowden commented Jul 1, 2024

Thank you so much for all your time. What you tried is very interesting.
I have been struggling with the difference between the large_stereo & large_melody_stereo models. I thought that having a 30sec guide track would help, but it has not, maybe because the system was not able to identify a melody in the ref track.
Your examples are closer than mine to the ref, which is encouraging.
What intrigues me is that in your prompts you mention Glass, and that implies that MusicGen would have been trained on copyrighted material. FB goes to great lengths to disown its own library and expects others to train their versions of MusicGen on their own sources. I did not use the Glass text prompt as I figured that it should not have an impact. I struggled to find adjectives to describe the sound, so I fell back on technical musical data, which seems to have little impact too. It is virtually a philosophical discussion on how to describe music without defining it by a name or style.
I'm looking forward to trying Stable audio as it does sound very promising.

PS. I have been listening to the SD radio channel
https://stableaudio.com/live
(I'm starting to appreciate just how golden silence can be.)

@rsxdalv
Owner

rsxdalv commented Jul 1, 2024

I think melody is only like a relief; the prompt itself is very important. And yes, unfortunately we don't have a CLIP for music that works.
