MusicGen advice #3

chlowden · 2024-06-28T12:04:47Z

Hello
This is an issue, but maybe you have an idea that can help me. I am using your MusicGen interface with the metadata below:

{
  "_version": "0.0.1",
  "_hash_version": "0.0.3",
  "_type": "musicgen",
  "_audiocraft_version": "1.3.0",
  "models": {},
  "prompt": "((piano)) acoustic key F minor minimalist low energy 4/4, 150bpm 320kbps 48.0kHz Stereo",
  "hash": "05f88c6f4049307a5209b74c368f62fda1575c7ab45668e06b39f54806e0fbcd",
  "date": "2024-06-21_23-06-50",
  "text": "((piano)) acoustic key F minor minimalist low energy 4/4, 150bpm 320kbps 48.0kHz Stereo",
  "melody": "94240fe69b46edc19d55977a5b38598da85708c28bd932d51dbbd5f00e609076",
  "model": "facebook/musicgen-stereo-melody-large",
  "duration": 360,
  "topk": 250,
  "topp": 0,
  "temperature": 1,
  "cfg_coef": 3,
  "seed": "1538577670",
  "use_multi_band_diffusion": false
}

The melody reference is Philip Glass Metamorphosis 5
https://www.youtube.com/watch?v=Rebr_F53db8

This seemed to me to be reasonably possible for a general AI model. After hours of playing around with different settings, I still don't get anything near the ref. I get an audio "soup" at best. As there seems to be rather little written on MusicGen, I was wondering if you have any ideas about the limits of the MusicGen model and what it might have been trained on?
Any thoughts are most welcome.
Thank you.


rsxdalv · 2024-06-28T13:58:24Z

I'll take a look at your parameters. I don't think there's much public expertise on MusicGen; probably nobody even knows all the genres it can and cannot do. That being said, I might invest some money into generating more data about MusicGen.


rsxdalv · 2024-06-30T13:05:06Z

Note - the following comment is copyrighted and is not meant to be freely redistributed on other pages or platforms. To anyone who wishes to reproduce this - please reach out to me.

I have done some work on the prompt. Though this might not be the prompt you wanted, it seems to work better:

(default settings, musicgen stereo melody large, seed: 4240372830)

piano solo, Philip Glass: Solo Piano, Metamorphosis: Five, piano, acoustic key F minor minimalist low energy 4/4, 150bpm 320kbps 48.0kHz Stereo

audio.51.mp4

Note that the elements are separated more. Although this reduces the expressiveness of the model, it can be better to split the prompt with more commas when the AI doesn't seem to notice something.
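
As a minimal sketch of the comma-splitting idea above (the helper name `build_prompt` is hypothetical and not part of MusicGen or the web UI), the descriptors can be kept as a list and joined, so that each one can be added, removed, or reworded independently between runs:

```python
def build_prompt(descriptors):
    """Join descriptors with ', ' so each becomes its own comma-separated element."""
    # Trim whitespace and stray commas so the separator is always exactly ', '.
    cleaned = [d.strip().strip(",").strip() for d in descriptors]
    return ", ".join(d for d in cleaned if d)


descriptors = [
    "piano solo",
    "Philip Glass: Solo Piano",
    "Metamorphosis: Five",
    "piano",
    "acoustic key F minor minimalist low energy 4/4",
    "150bpm 320kbps 48.0kHz Stereo",
]
prompt = build_prompt(descriptors)
print(prompt)
```

Splitting an entry such as "150bpm 320kbps 48.0kHz Stereo" into separate list items is then a one-line change.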

I also generated it without any melody.

I noticed that the audio is normalized and the quality isn't perfect, but that might always be a challenge, as it is with all generative AI models.

I felt like the audio is a bit too fast, so I tried a lower BPM with this reference:

piano solo, Philip Glass: Solo Piano, Metamorphosis: Five, piano, acoustic key E minor, minimalist low energy 4/4, 120bpm, 320kbps, 48.0kHz Stereo, Studio

audio.53.mp4

By the way, I see that you generate a 360-second piece. Even on an A100 I see about 3 seconds of generation time per second of audio, so I recommend starting with short generations of 1-3 seconds. When one sounds worthwhile, reuse its seed and increase the duration bit by bit. That is also what I did for some of this library, just to see what the model is able to follow and what it isn't.
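
A rough sketch of that short-then-longer workflow, outside the web UI. The helper names `duration_schedule` and `generate_clips` are hypothetical. The audiocraft calls are the library's public ones as far as I know, but treating `torch.manual_seed` as equivalent to the web UI's seed field is an assumption, as is the expectation that a longer run with the same seed resembles the shorter one.

```python
def duration_schedule(start=3.0, target=24.0, factor=2.0):
    """Return increasing durations in seconds, ending exactly at `target`."""
    if start <= 0 or factor <= 1:
        raise ValueError("start must be > 0 and factor must be > 1")
    durations = []
    current = start
    while current < target:
        durations.append(current)
        current *= factor
    durations.append(target)
    return durations


def generate_clips(prompt, seed, durations):
    """Generate one clip per duration, re-seeding before every run."""
    # Imported lazily: audiocraft and torch are heavy and need a GPU in practice.
    import torch
    from audiocraft.models import MusicGen

    model = MusicGen.get_pretrained("facebook/musicgen-stereo-melody-large")
    clips = []
    for duration in durations:
        model.set_generation_params(
            duration=duration, top_k=250, top_p=0.0, temperature=1.0, cfg_coef=3.0
        )
        torch.manual_seed(seed)  # assumption: this mirrors the web UI's seed field
        clips.append(model.generate([prompt]))
    return clips


print(duration_schedule(3.0, 24.0))
```

Only `duration_schedule` runs here; `generate_clips` is left uncalled because it downloads the model and needs a large GPU.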

I tried to add some more tweaks, but it seems to need some skillful prompt-engineering (as much as prompt-engineering is a meme in other circles, here it's exactly what we need).

Parameters:
text : piano solo, Philip Glass: Solo Piano, Metamorphosis: Five, piano, acoustic key E minor, minimalist low energy 4/4, 120bpm, 320kbps, 48.0kHz Stereo, Studio, sorrow, sad, melancholy 
melody : None
model : facebook/musicgen-stereo-melody-large
duration : 61
topk : 250
topp : 0
temperature : 1
cfg_coef : 3
seed : 4240372830
use_multi_band_diffusion : False
Generated in 150.288 seconds

At longer durations MusicGen begins to splice segments together and gives suboptimal transitions.

audio.54.mp4
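
One plausible cause of the splicing, offered as an assumption rather than something confirmed in this thread: to my understanding, audiocraft generates at most 30 seconds in a single pass and extends longer requests with a sliding window that advances 18 seconds at a time by default (`extend_stride`), re-using the tail of the previous window as context. The sketch below (the function name `seam_times` is hypothetical) estimates where new windows would start contributing audio under those assumed defaults.

```python
def seam_times(duration, window=30.0, stride=18.0):
    """Estimate the times (in seconds) at which a new generation window takes over.

    Assumes a 30 s window and an 18 s stride; both are assumptions about
    audiocraft's defaults, not values reported in this thread.
    """
    seams = []
    offset = 0.0      # start time of the current window
    context = 0.0     # seconds of previous audio re-used as context
    while offset + context < duration:
        chunk = min(duration - offset, window)
        if context > 0:
            # New audio begins right after the re-used context.
            seams.append(offset + context)
        context = chunk - stride
        offset += stride
    return seams


print(seam_times(24))   # short clip: fits in one window
print(seam_times(61))   # the 61 s run above
print(len(seam_times(360)), "seams in a 360 s request")
```

If these assumptions hold, anything up to 30 seconds is generated in one pass, which would be consistent with the shorter clips in this comment not showing the same transitions.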

Parameters:
text : piano solo, 
Philip Glass: Solo Piano (1989), 
Metamorphosis: Five, 
piano, 
acoustic key E minor, 
minimalist low energy 4/4, 
120bpm, 
120bpm piano, 
320kbps, 
48.0kHz Stereo, 
Studio, 
sorrow, 
minimalism genre,
Classical, Avant-Garde,
melody : None
model : facebook/musicgen-stereo-melody-large
duration : 24
topk : 250
topp : 0
temperature : 1
cfg_coef : 3
seed : 4240372830
use_multi_band_diffusion : False
Generated in 60.155 seconds

audio.56.mp4

The problem that tends to happen is that it gets fairly monotone; however, it does slowly move forward.

Increasing the temperature gives a worse result:

Parameters:
text : piano solo, 
Philip Glass: Solo Piano (1989), 
Metamorphosis: Five, 
piano, 
acoustic key E minor, 
minimalist low energy 4/4, 
120bpm, 
120bpm piano, 
320kbps, 
48.0kHz Stereo, 
Studio, 
sorrow, 
minimalism genre,
Classical, Avant-Garde,
melody : None
model : facebook/musicgen-stereo-melody-large
duration : 24
topk : 250
topp : 0
temperature : 1.15
cfg_coef : 3
seed : 4240372830
use_multi_band_diffusion : False
Generated in 60.682 seconds

audio.57.mp4

Even increasing CFG to 5 does not fix it:

Parameters:
text : piano solo, 
Philip Glass: Solo Piano (1989), 
Metamorphosis: Five, 
piano, 
acoustic key E minor, 
minimalist low energy 4/4, 
120bpm, 
120bpm piano, 
320kbps, 
48.0kHz Stereo, 
Studio, 
sorrow, 
minimalism genre,
Classical, Avant-Garde,
melody : None
model : facebook/musicgen-stereo-melody-large
duration : 12
topk : 250
topp : 0
temperature : 1.1
cfg_coef : 5
seed : 4240372830
use_multi_band_diffusion : False
Generated in 29.895 seconds

audio.58.mp4

Lowering temperature to 0.9 with CFG still at 5:

Parameters:
text : piano solo, 
Philip Glass: Solo Piano (1989), 
Metamorphosis: Five, 
piano, 
acoustic key E minor, 
minimalist low energy 4/4, 
120bpm, 
120bpm piano, 
320kbps, 
48.0kHz Stereo, 
Studio, 
sorrow, 
minimalism genre,
Classical, Avant-Garde,
melody : None
model : facebook/musicgen-stereo-melody-large
duration : 12
topk : 250
topp : 0
temperature : 0.9
cfg_coef : 5
seed : 4240372830
use_multi_band_diffusion : False
Generated in 29.636 seconds

audio.59.mp4

Dropping CFG to 2 improves the result:

Parameters:
text : piano solo, 
Philip Glass: Solo Piano (1989), 
Metamorphosis: Five, 
piano, 
acoustic key E minor, 
minimalist low energy 4/4, 
120bpm, 
120bpm piano, 
320kbps, 
48.0kHz Stereo, 
Studio, 
sorrow, 
minimalism genre,
Classical, Avant-Garde,
melody : None
model : facebook/musicgen-stereo-melody-large
duration : 12
topk : 250
topp : 0
temperature : 0.9
cfg_coef : 2
seed : 4240372830
use_multi_band_diffusion : False
Generated in 29.522 seconds

audio.61.mp4

However, overall it seems that the model wants to have more information in the prompt.

At CFG 0.7 the model was too creative and added gaps of silence in the music:

Parameters:
text : piano solo, 
Philip Glass: Solo Piano (1989), 
Metamorphosis: Five, 
piano, 
acoustic key E minor, 
minimalist low energy 4/4, 
120bpm, 
120bpm piano, 
320kbps, 
48.0kHz Stereo, 
Studio, 
sorrow, 
minimalism genre,
Classical, Avant-Garde,
melody : None
model : facebook/musicgen-stereo-melody-large
duration : 26
topk : 250
topp : 0
temperature : 0.9
cfg_coef : 0.7
seed : 4240372830
use_multi_band_diffusion : False
Generated in 64.441 seconds

audio.62.mp4
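
One possible reading of the 0.7 result, assuming the web UI passes `cfg_coef` straight into standard classifier-free guidance: the final logits are the unconditional logits plus `cfg_coef` times the difference between the text-conditioned and unconditional logits. A coefficient below 1 therefore lands between "ignore the prompt" and "follow the prompt", which would fit the model drifting away from the text. The numbers below are made up for illustration, and `apply_cfg` is a hypothetical helper.

```python
def apply_cfg(cond_logits, uncond_logits, cfg_coef):
    """Blend conditional and unconditional logits with classifier-free guidance."""
    return [u + cfg_coef * (c - u) for c, u in zip(cond_logits, uncond_logits)]


cond = [2.0, 0.0, -1.0]    # made-up logits with the text prompt
uncond = [1.0, 1.0, 1.0]   # made-up logits without the prompt

for coef in (0.0, 0.7, 1.0, 3.0):
    print(coef, apply_cfg(cond, uncond, coef))
```

A coefficient of 0 reproduces the unconditional logits, 1 reproduces the conditional ones, and values above 1 push past the conditional logits in the direction of the prompt.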

Even CFG 1.4 is not enough:

Parameters:
text : piano solo, 
Philip Glass: Solo Piano (1989), 
Metamorphosis: Five, 
piano, 
acoustic key E minor, 
minimalist low energy 4/4, 
120bpm, 
120bpm piano, 
320kbps, 
48.0kHz Stereo, 
Studio, 
sorrow, 
minimalism genre,
Classical, Avant-Garde,
melody : None
model : facebook/musicgen-stereo-melody-large
duration : 26
topk : 250
topp : 0
temperature : 0.9
cfg_coef : 1.4
seed : 4240372830
use_multi_band_diffusion : False
Generated in 64.750 seconds

audio.63.mp4

CFG 1.7 gets better, but is monotone:

audio.64.mp4

CFG 1.7 with Temperature 1.1 again is not really useful:

audio.65.mp4
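
The manual tweaking above amounts to a small grid search over temperature and CFG with the seed held fixed. A minimal sketch of enumerating such a grid follows; the helper name `make_sweep` is hypothetical, the dictionary keys simply mirror the parameter names printed in this comment, and the 12-second duration is just one of the values used above.

```python
from itertools import product


def make_sweep(temperatures, cfg_coefs, seed, duration):
    """Return one settings dict per (temperature, cfg_coef) combination."""
    return [
        {
            "temperature": temperature,
            "cfg_coef": cfg_coef,
            "seed": seed,          # fixed, so only the two swept values change
            "duration": duration,  # keep it short while exploring
            "topk": 250,
            "topp": 0,
        }
        for temperature, cfg_coef in product(temperatures, cfg_coefs)
    ]


settings = make_sweep(
    temperatures=[0.9, 1.0, 1.1],
    cfg_coefs=[1.4, 1.7, 2, 3, 5],
    seed=4240372830,
    duration=12,
)
for s in settings:
    print(s["temperature"], s["cfg_coef"])
```

Each dict can then be entered by hand or passed to whatever generation call is in use.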

Tweaking CFG, temperature and adding more to the prompt:

Parameters:
text : piano solo, 
Philip Glass: Solo Piano (1989), 
Metamorphosis: Five, 
piano, 
acoustic key E minor, 
minimalist low energy 4/4, 
120bpm and 108bpm piano, 
320kbps, 
48.0kHz Stereo, 
Studio, 
sorrow, 
minimalism genre,
Classical, Avant-Garde,
dynamic rhythm,
melody : None
model : facebook/musicgen-stereo-melody-large
duration : 18
topk : 250
topp : 0
temperature : 0.95
cfg_coef : 1.8
seed : 4240372830
use_multi_band_diffusion : False
Generated in 43.900 seconds

audio.66.mp4

Now, trying to expand that to 6 minutes, we run into problems: the audio slowly degenerates completely (warning, loud noises):

Parameters:
text : piano solo, 
Philip Glass: Solo Piano (1989), 
Metamorphosis: Five, 
piano, 
acoustic key E minor, 
minimalist low energy 4/4, 
120bpm and 108bpm piano, 
320kbps, 
48.0kHz Stereo, 
Studio, 
sorrow, 
minimalism genre,
Classical, Avant-Garde,
dynamic rhythm,
melody : None
model : facebook/musicgen-stereo-melody-large
duration : 360
topk : 250
topp : 0
temperature : 0.95
cfg_coef : 1.8
seed : 4240372830
use_multi_band_diffusion : False
Generated in 884.383 seconds

audio.67.mp4
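
For planning longer runs, the "Generated in" figures logged in this comment can be turned into a seconds-of-compute per second-of-audio ratio. The pairs below are copied from the parameter blocks above (duration, generation time); the variable and function names are only illustrative.

```python
# (audio duration in s, reported generation time in s), copied from the runs above
runs = [
    (61, 150.288), (24, 60.155), (24, 60.682), (12, 29.895), (12, 29.636),
    (12, 29.522), (26, 64.441), (26, 64.750), (18, 43.900), (360, 884.383),
]


def seconds_per_audio_second(runs):
    """Return the per-run ratio of generation time to audio duration."""
    return [generated / duration for duration, generated in runs]


ratios = seconds_per_audio_second(runs)
overall = sum(g for _, g in runs) / sum(d for d, _ in runs)

for (duration, _), ratio in zip(runs, ratios):
    print(f"{duration:>4} s audio: {ratio:.2f} s of compute per second")
print(f"overall: {overall:.2f}")
```

On these logs the ratio stays roughly constant at about 2.5 across durations, so generation time scales close to linearly with clip length.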

Note - this comment is copyrighted and is not meant to be freely redistributed on other pages or platforms. To anyone who wishes to reproduce this - please reach out to me.

rsxdalv · 2024-06-30T13:07:44Z

Also, notably, it used 33.2 GB of VRAM.

rsxdalv · 2024-06-30T15:34:40Z

I've confirmed that Stable Audio destroys MusicGen in this case on VRAM, speed, quality, control, and ease of use. The only drawback is the license, although MusicGen does not have a very permissive license either.

chlowden · 2024-07-01T09:40:59Z

Thank you so much for all your time. What you tried is very interesting.
I have been struggling with the difference between the large_stereo & large_melody_stereo models. I thought that having a 30-second guide track would help, but it has not, maybe because the system was not able to identify a melody in the ref track.
Your examples are closer than mine to the ref, which is encouraging.
What intrigues me is that in your prompts you mention Glass, and that implies that MusicGen would have been trained on copyrighted material. FB goes to great lengths to disown its own library and expects others to train their versions of MusicGen on their own sources. I did not use the Glass text prompt as I figured that it should not have an impact. I struggled to find adjectives to describe the sound, so I fell back on technical musical data, which seems to have little impact too. It is virtually a philosophical discussion on how to describe music without defining it by a name or style.
I'm looking forward to trying Stable Audio, as it does sound very promising.

PS. I have been listening to the SD radio channel
https://stableaudio.com/live
(I'm starting to appreciate just how golden silence can be.)

rsxdalv · 2024-07-01T10:12:59Z

I think melody is only like a relief; the prompt itself is very important. And yes, unfortunately we don't have a CLIP for music that works.
