
Audio embedding differs significantly each time #166

Open
hungchiayu1 opened this issue Oct 12, 2024 · 1 comment

Comments

@hungchiayu1

Is such behaviour expected? When I call get_audio_embedding_from_data multiple times on the same audio and take the cosine similarity between the resulting embeddings, the similarity varies significantly.

import torch
import librosa
import laion_clap
import numpy as np

cos_sim = torch.nn.CosineSimilarity()

clap = laion_clap.CLAP_Module(enable_fusion=False)
clap.load_ckpt() # download the default pretrained checkpoint.

def int16_to_float32(x):
    return (x / 32767.0).astype(np.float32)


def float32_to_int16(x):
    x = np.clip(x, a_min=-1., a_max=1.)
    return (x * 32767.).astype(np.int16)

audio_data, _ = librosa.load('test0.wav', sr=48000) # sample rate should be 48000
audio_data = audio_data.reshape(1, -1) # Make it (1,T) or (N,T)
audio_data = torch.from_numpy(int16_to_float32(float32_to_int16(audio_data))).float() # quantize before send it in to the model
for _ in range(5):
    audio_embed1 = clap.get_audio_embedding_from_data(x=audio_data, use_tensor=True)
    audio_embed2 = clap.get_audio_embedding_from_data(x=audio_data, use_tensor=True)
    print(cos_sim(audio_embed1, audio_embed2))
### Outputs
tensor([0.9915], device='cuda:0', grad_fn=<SumBackward1>)
tensor([0.2983], device='cuda:0', grad_fn=<SumBackward1>)
tensor([0.5371], device='cuda:0', grad_fn=<SumBackward1>)
tensor([0.3576], device='cuda:0', grad_fn=<SumBackward1>)
tensor([0.9719], device='cuda:0', grad_fn=<SumBackward1>)

waldleitner commented Oct 22, 2024

@hungchiayu1 the audio waveform is randomly cropped to a maximum of 10 s (at a 48 kHz sample rate, i.e. 480,000 samples), so the same file can yield a different embedding on each call. The differences also grow with the length and musical variation of the track.

This might also explain why the differences increase when using enable_fusion=False (the default), as seen in #90

see:

# random crop to max_len (for compatibility)
overflow = len(audio_data) - max_len
idx = np.random.randint(0, overflow + 1)
audio_data = audio_data[idx: idx + max_len]
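
To see why repeated calls diverge, the crop above can be reproduced in isolation (a minimal sketch; the 25 s stand-in waveform and the `random_crop` wrapper are my own, not part of the library):

```python
import numpy as np

SR = 48000
max_len = 10 * SR  # 480,000 samples, the encoder's max input length

# stand-in for a 25 s track; each sample carries its index so crops are distinguishable
audio_data = np.arange(25 * SR, dtype=np.float32)

def random_crop(audio_data, max_len=max_len):
    """Same logic as the snippet above: pick a random 10 s window."""
    overflow = len(audio_data) - max_len
    idx = np.random.randint(0, overflow + 1)
    return audio_data[idx: idx + max_len]

a, b = random_crop(audio_data), random_crop(audio_data)
print(np.array_equal(a, b))  # almost certainly False: two different windows
```

Two calls on the same waveform encode two different 10 s windows, which is exactly what the fluctuating cosine similarities show.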

Workaround: chunk your audio file into 10 s segments and pass them as a batch to the model, then average the 10 s embeddings to obtain a single track embedding.
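
The chunk-and-average workaround might look like this (a sketch; `chunk_audio` and the zero-padding of the last segment are my additions, and the embedding call itself is shown only as a comment since it needs the loaded model):

```python
import numpy as np

SR = 48000
CHUNK = 10 * SR  # 480,000 samples, the encoder's max input length

def chunk_audio(audio_data, chunk_len=CHUNK):
    """Split a 1-D waveform into fixed-length chunks, zero-padding the last one."""
    n_chunks = int(np.ceil(len(audio_data) / chunk_len))
    padded = np.zeros(n_chunks * chunk_len, dtype=np.float32)
    padded[:len(audio_data)] = audio_data
    return padded.reshape(n_chunks, chunk_len)  # (N, T) batch

# demo with a synthetic 25 s waveform -> batch of three 10 s chunks
audio = np.random.randn(25 * SR).astype(np.float32)
batch = chunk_audio(audio)
print(batch.shape)  # (3, 480000)

# then, with the model loaded as in the original snippet:
# embeds = clap.get_audio_embedding_from_data(x=batch, use_tensor=False)
# track_embed = embeds.mean(axis=0)  # average the 10 s embeddings
```

Because every chunk is encoded deterministically (nothing is left to crop), the averaged track embedding is reproducible across calls.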

@lukewys @RetroCirce the max length is currently set to 480,000 samples (10 s at 48 kHz) for the audio encoder inputs. Does the audio encoder only support up to 10 seconds, or can the max length be changed?
