Is this behaviour expected? When I call get_audio_embedding_from_data multiple times on the same audio and take the cosine similarity between the resulting embeddings, the similarity varies significantly.
import librosa
import numpy as np
import torch
import laion_clap

cos_sim = torch.nn.CosineSimilarity()
clap = laion_clap.CLAP_Module(enable_fusion=False)
clap.load_ckpt()  # download the default pretrained checkpoint

def int16_to_float32(x):
    return (x / 32767.0).astype(np.float32)

def float32_to_int16(x):
    x = np.clip(x, a_min=-1., a_max=1.)
    return (x * 32767.).astype(np.int16)

audio_data, _ = librosa.load('test0.wav', sr=48000)  # sample rate should be 48000
audio_data = audio_data.reshape(1, -1)  # make it (1, T) or (N, T)
# quantize before sending it to the model
audio_data = torch.from_numpy(int16_to_float32(float32_to_int16(audio_data))).float()

# embed the same audio twice per iteration and compare the two embeddings
for _ in range(5):
    audio_embed1 = clap.get_audio_embedding_from_data(x=audio_data, use_tensor=True)
    audio_embed2 = clap.get_audio_embedding_from_data(x=audio_data, use_tensor=True)
    print(cos_sim(audio_embed1, audio_embed2))
### Outputs
tensor([0.9915], device='cuda:0', grad_fn=<SumBackward1>)
tensor([0.2983], device='cuda:0', grad_fn=<SumBackward1>)
tensor([0.5371], device='cuda:0', grad_fn=<SumBackward1>)
tensor([0.3576], device='cuda:0', grad_fn=<SumBackward1>)
tensor([0.9719], device='cuda:0', grad_fn=<SumBackward1>)
@hungchiayu1 The audio waveform is randomly cropped to a maximum of 10 s (at a sample rate of 48000), so the same file produces different audio embeddings on each call. The differences also grow with the length and musical variation of the track.
This might also explain why the differences increase when using enable_fusion=False (the default), as seen in #90.
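To illustrate why two calls on the same file can disagree, here is a minimal sketch of random cropping. It is not the library's actual code: `random_crop` and `MAX_LEN` are hypothetical names, and the assumption is simply that inputs longer than the maximum length are cut to a randomly placed window.

```python
import numpy as np

MAX_LEN = 480_000  # 10 s at 48000 Hz (hypothetical constant name)

def random_crop(waveform, max_len=MAX_LEN, rng=None):
    """Return a randomly placed window of at most max_len samples."""
    rng = np.random.default_rng() if rng is None else rng
    if len(waveform) <= max_len:
        return waveform  # short inputs are left untouched in this sketch
    # pick a random start so that the window fits inside the waveform
    start = int(rng.integers(0, len(waveform) - max_len + 1))
    return waveform[start:start + max_len]
```

With an unseeded generator, each call on a long track picks a different window, so the encoder sees different audio each time, and the embeddings differ accordingly.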
Workaround: split your audio file into 10 s chunks and pass them to the model as a batch. You can then average the 10 s embeddings to get a single track embedding.
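A minimal sketch of this workaround, under some assumptions: `chunk_audio` and `track_embedding` are hypothetical helper names, `embed_fn` stands in for whatever callable maps an (N, T) batch to (N, D) embeddings, and the final partial chunk is zero-padded (dropping it would be another option). It also assumes that chunks no longer than the maximum length are not cropped by the model.

```python
import numpy as np

SAMPLE_RATE = 48000
CHUNK_SAMPLES = 10 * SAMPLE_RATE  # 480,000 samples = 10 s

def chunk_audio(waveform, chunk_samples=CHUNK_SAMPLES):
    """Split a 1-D waveform into an (N, chunk_samples) batch.

    The last chunk is zero-padded to full length (an assumption of this sketch).
    """
    waveform = np.asarray(waveform, dtype=np.float32).reshape(-1)
    n_chunks = max(1, int(np.ceil(len(waveform) / chunk_samples)))
    padded = np.zeros(n_chunks * chunk_samples, dtype=np.float32)
    padded[:len(waveform)] = waveform
    return padded.reshape(n_chunks, chunk_samples)

def track_embedding(waveform, embed_fn, chunk_samples=CHUNK_SAMPLES):
    """Embed every chunk with embed_fn and average over the chunk axis."""
    chunks = chunk_audio(waveform, chunk_samples)
    embeddings = np.asarray(embed_fn(chunks))  # expected shape (N, D)
    return embeddings.mean(axis=0)             # shape (D,)
```

With the setup from the original post, `embed_fn` could presumably be something like `lambda x: clap.get_audio_embedding_from_data(x=x, use_tensor=False)`. The int16 quantization step from the snippet above is left out here and could be applied to the chunks before embedding.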
@lukewys @RetroCirce The max length is currently set to 480,000 samples (10 s) for the audio encoder inputs. Does the audio encoder only support up to 10 seconds, or can the max length be changed?