
Example Dataset for Inference with DrCaps_Zeroshot_Audio_Captioning #169

Closed
javanasse opened this issue Nov 8, 2024 · 4 comments

@javanasse

Can the developers provide an example JSONL file for running inference on unlabeled audio using DrCaps_Zeroshot_Audio_Captioning?

It appears that the dataset JSONL must have this form:

{"source": "/path/to/a_file.wav", "key": "", "target": "", "text": "", "similar_captions": ""}

but the content for each field is not clear to me. What should populate "target", "text", and "similar_captions"?

Thank you!

@ddlBoJack ddlBoJack assigned ddlBoJack and Andreas-Xi and unassigned ddlBoJack Nov 8, 2024
@ddlBoJack ddlBoJack mentioned this issue Nov 9, 2024
@ddlBoJack
Collaborator

Please refer to #170

@Andreas-Xi
Collaborator

Andreas-Xi commented Nov 9, 2024

Hi, thanks for following our work. We have uploaded example inference data for AudioCaps and Clotho in examples/drcap_zeroshot_aac/data_examples/; feel free to check it out. For each field: "target" is the ground-truth caption, and "text" is the caption fed to the CLAP text encoder during training. "text" and "target" are the same in the latest version, but we previously ran experiments that replaced certain words in the ground-truth captions to enhance model robustness, which is why both fields exist. "similar_captions" are captions similar to "target" (i.e., the ground-truth captions), used to perform RAG.
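Based on that description, a labeled (training/evaluation) record might look like the following. All caption values here are hypothetical, and the choice of "; " to join multiple similar captions is an assumption, not confirmed against the repo's loader:

```jsonl
{"source": "/path/to/a_file.wav", "key": "a_file", "target": "a dog barks while traffic passes", "text": "a dog barks while traffic passes", "similar_captions": "a dog barking near a busy road; vehicles pass as a dog barks"}
```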

@javanasse
Author

Thanks for your timely response. Is it possible to infer the caption for an audio file when "text" and "target" are unknown? If I have misunderstood, please correct me.

@Andreas-Xi
Collaborator

Yes, as long as you have the audio "source" and "similar_captions", it is possible to perform inference.
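Putting the thread together, an inference-only JSONL can leave "target" and "text" empty and supply only the audio path and retrieved captions. A minimal sketch for generating such a file, assuming the field names from the example above (the `make_inference_entry` helper and the "; " caption separator are my own, not part of the repo):

```python
import json

def make_inference_entry(wav_path, similar_captions, key=""):
    """Build one JSONL record for zero-shot captioning inference.

    "target" and "text" are left empty because no ground-truth caption
    exists at inference time; "similar_captions" carries the retrieved
    captions used for RAG. Field names follow the issue discussion and
    are not verified against the repo's dataset loader.
    """
    return {
        "source": wav_path,                    # path to the audio file
        "key": key,                            # optional identifier
        "target": "",                          # unknown ground truth
        "text": "",                            # unknown CLAP caption
        "similar_captions": similar_captions,  # retrieved captions for RAG
    }

# One JSON object per line, as in the example dataset files.
entries = [
    make_inference_entry(
        "/path/to/a_file.wav",
        "a dog barking near a busy road; vehicles pass as a dog barks",
        key="a_file",
    ),
]
with open("inference.jsonl", "w") as f:
    for entry in entries:
        f.write(json.dumps(entry) + "\n")
```

The retrieval step that produces "similar_captions" (e.g., CLAP-based nearest-neighbor search over a caption datastore) is separate and not shown here.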
