Example Dataset for Inference with DrCaps_Zeroshot_Audio_Captioning #169
Comments
Please refer to #170.
Hi, thanks for following our work. We have uploaded example inference data for AudioCaps and Clotho in examples/drcap_zeroshot_aac/data_examples/, so feel free to check it out. For each field, "target" is the ground-truth caption and "text" is the caption fed to the CLAP text encoder during training. "text" and "target" are the same in the latest version, but we previously conducted experiments on replacing certain words in the ground-truth captions to improve model robustness, which is why there are both "text" and "target" fields. "similar_captions" are captions similar to "target" (i.e., the ground-truth captions), used to perform RAG.
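To make that layout concrete, here is a minimal sketch of what a single training-style JSONL line could look like, assuming only the field names mentioned above; the audio path, the caption text, and the use of a list for "similar_captions" are placeholders and should be checked against the actual files in examples/drcap_zeroshot_aac/data_examples/:

```python
import json

# Hypothetical training-style entry: the path, captions, and retrieved
# captions below are placeholders, not the repository's shipped example data.
entry = {
    "audio_source": "audio/example_0001.wav",     # path to the audio file
    "target": "A dog barks while cars pass by.",  # ground-truth caption
    "text": "A dog barks while cars pass by.",    # caption fed to the CLAP text encoder (same as "target" in the latest version)
    "similar_captions": [                         # captions similar to "target", used for RAG
        "A dog is barking near a busy road.",
        "Traffic noise with a dog barking in the background.",
    ],
}

# One JSON object per line, as expected for a JSONL dataset file.
with open("train_example.jsonl", "w") as f:
    f.write(json.dumps(entry) + "\n")
```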
Thanks for your timely response. Is it possible to infer the caption for an audio file when "text" and "target" are unknown? If I have misunderstood, please correct me.
Yes, as long as you have "audio_source" and "similar_captions", it is possible to perform inference.
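Under that assumption, a minimal sketch of an inference-only entry for unlabeled audio might look like the following; whether "target" and "text" can simply be left as empty strings (rather than omitted or filled with dummy text) is an assumption to verify against the shipped example files:

```python
import json

# Hypothetical inference-only entry for unlabeled audio: only "audio_source"
# and "similar_captions" carry real content; "target" and "text" are left
# empty here as a placeholder convention.
entry = {
    "audio_source": "audio/unlabeled_0001.wav",
    "target": "",
    "text": "",
    "similar_captions": [
        "Rain falls steadily on a metal roof.",
        "Heavy rain with occasional thunder.",
    ],
}

with open("infer_example.jsonl", "w") as f:
    f.write(json.dumps(entry) + "\n")
```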
Can the developers provide an example JSONL file for running inference on unlabeled audio using DrCaps_Zeroshot_Audio_Captioning?
It appears that the dataset JSONL must follow a specific form, but the content for each field is not clear to me. What should populate "target", "text", and "similar_captions"? Thank you!