## Overview

I have attempted to reproduce the zero-shot classification results for ESC-50 reported in the paper "Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation."

In the paper, zero-shot classification accuracy (top-1) for the best model (K2C aug) is reported as 91.0%. I assume that this corresponds to the `630k-audioset-best.pt` checkpoint.

I am only able to obtain 60.2% top-1 accuracy on the ESC-50 dataset.
ESC-50 was downloaded from the Google Drive folder.
I've found it quite difficult to follow the given evaluation and data preprocessing code, so I wrote my own.
## Reproduce
I use the set of 50 unique captions in the test dataset, which are found in the `text` attribute of each example's JSON file, e.g. "The sound of the crow".
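For reference, each metadata file looks roughly like this (the field names match what the loader below reads; the exact `tag` value here is my guess):

```json
{
  "text": ["The sound of the crow"],
  "tag": ["crow"]
}
```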
Here's the loader for ESC-50:
```python
import glob
import json
import os.path as osp
from pathlib import Path

from torch.utils.data import Dataset


class ESC50Dataset(Dataset):
    def __init__(
        self,
        path_to_esc50="./data/ESC50",
        split="test",
        audio_len=480000,
    ):
        super().__init__()
        self.data_path = Path(path_to_esc50)
        self.audio_len = audio_len
        self.audio_files = sorted(glob.glob(str(self.data_path / split / "*.flac")))
        self.meta_files = sorted(glob.glob(str(self.data_path / split / "*.json")))
        assert len(self.audio_files) == len(self.meta_files), \
            "Number of audio files and meta files must match"
        assert [osp.splitext(osp.basename(x))[0] for x in self.audio_files] == \
            [osp.splitext(osp.basename(x))[0] for x in self.meta_files], \
            "Audio files and meta files must have the same names"
        self.tags = []
        self.texts = []
        for f in self.meta_files:
            with open(f, "r") as json_file:
                data = json.load(json_file)
            self.tags.append(data["tag"][0])
            self.texts.append(data["text"][0])

    def __getitem__(self, idx):
        # load_audio_torch and random_slice are my own helpers (sketched below)
        x, _ = load_audio_torch(self.audio_files[idx], target_sr=48000, mono=True)
        x = random_slice(x, self.audio_len)
        return x, self.texts[idx]

    def __len__(self):
        return len(self.audio_files)
```
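`load_audio_torch` and `random_slice` are small helpers of mine that aren't shown above; roughly, they behave like this minimal torchaudio sketch (the exact implementations may differ):

```python
import torch
import torchaudio


def load_audio_torch(path, target_sr=48000, mono=True):
    # load a file, downmix to mono, and resample to target_sr;
    # returns a (1, T) tensor plus the sample rate
    x, sr = torchaudio.load(path)
    if mono and x.shape[0] > 1:
        x = x.mean(dim=0, keepdim=True)
    if sr != target_sr:
        x = torchaudio.functional.resample(x, sr, target_sr)
    return x, target_sr


def random_slice(x, length):
    # zero-pad short clips; randomly crop longer ones to a fixed length
    if x.shape[-1] <= length:
        return torch.nn.functional.pad(x, (0, length - x.shape[-1]))
    start = torch.randint(0, x.shape[-1] - length + 1, (1,)).item()
    return x[..., start:start + length]
```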
And the zero-shot retrieval script:
```python
import os

import torch
import laion_clap

from data.loaders import ESC50Dataset

ckpt_path = "CLAP_checkpoints/laion_clap/"
model_params = {"ckpt": "630k-audioset-best.pt", "amodel": "HTSAT-tiny"}

model = laion_clap.CLAP_Module(enable_fusion=False, amodel=model_params["amodel"])
model.load_ckpt(os.path.join(ckpt_path, model_params["ckpt"]))

dataset = ESC50Dataset()

# get the unique texts, e.g. "The sound of the crow"
texts = list(set(dataset.texts))

# get the text embedding for each caption (each caption is duplicated so that
# get_text_embedding receives a batch of two, then only the first row is kept)
z_text = torch.cat(
    [torch.tensor(model.get_text_embedding([t, t])[0:1]) for t in texts]
)

z_audio = []
text_idxs = []
for item in dataset:
    x, text = item
    text_idxs.append(texts.index(text))  # get the index of this example's text
    # get its CLAP audio embedding
    z_audio.append(torch.tensor(model.get_audio_embedding_from_data(x.numpy())))
z_audio = torch.cat(z_audio)

# compute pairwise dot products
sim = model.model.logit_scale_a.cpu() * z_audio @ z_text.T

# top-1 accuracy
acc = float(torch.sum(torch.argmax(sim, dim=1) == torch.tensor(text_idxs)) / len(sim))
print(f"Accuracy: {acc}")
```
Output:

```
Accuracy: 0.6025000214576721
```
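For completeness, one preprocessing step my script skips: if I'm reading the upstream laion_clap README correctly, its quickstart round-trips the audio through int16 quantization before calling `get_audio_embedding_from_data`. A sketch of that step, adapted from the README (whether it accounts for the gap, I don't know):

```python
import numpy as np

# quantization helpers from the laion_clap README quickstart
def int16_to_float32(x):
    return (x / 32767.0).astype(np.float32)

def float32_to_int16(x):
    x = np.clip(x, a_min=-1.0, a_max=1.0)
    return (x * 32767.0).astype(np.int16)

# before embedding, with audio_data of shape (1, T) at 48 kHz:
# audio_data = int16_to_float32(float32_to_int16(audio_data))
```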
Hopefully I am missing something significant?