
Audio embedding differs significantly each time #166

Open
hungchiayu1 opened this issue Oct 12, 2024 · 1 comment

Comments

@hungchiayu1

Is such behaviour expected? When I call get_audio_embedding_from_data multiple times on the same audio and take the cosine similarity between the resulting embeddings, the similarity varies significantly.

import torch
import librosa
import laion_clap
import numpy as np

cos_sim = torch.nn.CosineSimilarity()

clap = laion_clap.CLAP_Module(enable_fusion=False)
clap.load_ckpt() # download the default pretrained checkpoint.

def int16_to_float32(x):
    return (x / 32767.0).astype(np.float32)


def float32_to_int16(x):
    x = np.clip(x, a_min=-1., a_max=1.)
    return (x * 32767.).astype(np.int16)

audio_data, _ = librosa.load('test0.wav', sr=48000) # sample rate should be 48000
audio_data = audio_data.reshape(1, -1) # Make it (1,T) or (N,T)
audio_data = torch.from_numpy(int16_to_float32(float32_to_int16(audio_data))).float() # quantize before send it in to the model
for _ in range(5):
    audio_embed1 = clap.get_audio_embedding_from_data(x=audio_data, use_tensor=True)
    audio_embed2 = clap.get_audio_embedding_from_data(x=audio_data, use_tensor=True)
    print(cos_sim(audio_embed1, audio_embed2))
### Outputs
tensor([0.9915], device='cuda:0', grad_fn=<SumBackward1>)
tensor([0.2983], device='cuda:0', grad_fn=<SumBackward1>)
tensor([0.5371], device='cuda:0', grad_fn=<SumBackward1>)
tensor([0.3576], device='cuda:0', grad_fn=<SumBackward1>)
tensor([0.9719], device='cuda:0', grad_fn=<SumBackward1>)

waldleitner commented Oct 22, 2024

@hungchiayu1 the audio waveform is randomly cropped to a maximum of 10 s (at a 48 kHz sample rate, i.e. 480,000 samples), so the same file can yield a different embedding on each call. The differences also grow with the length and musical variation of the track.

This might also explain why the differences increase when using enable_fusion=False (the default), as seen in #90

see:

# random crop to max_len (for compatibility)
overflow = len(audio_data) - max_len
idx = np.random.randint(0, overflow + 1)
audio_data = audio_data[idx: idx + max_len]
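
To see why repeated calls diverge, the crop above can be reproduced in isolation (a minimal sketch; the 25 s stand-in waveform and the `random_crop` wrapper are my own, not part of the library):

```python
import numpy as np

SR = 48000
max_len = 10 * SR  # 480,000 samples, the encoder's max input length

# stand-in for a 25 s track; each sample carries its index so crops are distinguishable
audio_data = np.arange(25 * SR, dtype=np.float32)

def random_crop(audio_data, max_len=max_len):
    """Same logic as the snippet above: pick a random 10 s window."""
    overflow = len(audio_data) - max_len
    idx = np.random.randint(0, overflow + 1)
    return audio_data[idx: idx + max_len]

a, b = random_crop(audio_data), random_crop(audio_data)
print(np.array_equal(a, b))  # almost certainly False: two different windows
```

Two calls on the same waveform encode two different 10 s windows, which is exactly what the fluctuating cosine similarities show.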

Workaround: chunk your audio file into 10 s segments and pass them as a batch to the model, then average the 10 s embeddings to obtain a single track embedding.
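
The chunk-and-average workaround might look like this (a sketch; `chunk_audio` and the zero-padding of the last segment are my additions, and the embedding call itself is shown only as a comment since it needs the loaded model):

```python
import numpy as np

SR = 48000
CHUNK = 10 * SR  # 480,000 samples, the encoder's max input length

def chunk_audio(audio_data, chunk_len=CHUNK):
    """Split a 1-D waveform into fixed-length chunks, zero-padding the last one."""
    n_chunks = int(np.ceil(len(audio_data) / chunk_len))
    padded = np.zeros(n_chunks * chunk_len, dtype=np.float32)
    padded[:len(audio_data)] = audio_data
    return padded.reshape(n_chunks, chunk_len)  # (N, T) batch

# demo with a synthetic 25 s waveform -> batch of three 10 s chunks
audio = np.random.randn(25 * SR).astype(np.float32)
batch = chunk_audio(audio)
print(batch.shape)  # (3, 480000)

# then, with the model loaded as in the original snippet:
# embeds = clap.get_audio_embedding_from_data(x=batch, use_tensor=False)
# track_embed = embeds.mean(axis=0)  # average the 10 s embeddings
```

Because every chunk is encoded deterministically (nothing is left to crop), the averaged track embedding is reproducible across calls.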

@lukewys @RetroCirce the max length is currently set to 480,000 samples (10 s at 48 kHz) for the audio encoder inputs. Does the audio encoder only support up to 10 seconds, or can the max length be changed?
