Manipulating large amounts of Wavs #1454

Open
domklement opened this issue Feb 13, 2025 · 1 comment

Comments

@domklement
Contributor

Hello,

I'm wondering if there's a more efficient way of handling large datasets in Lhotse, similar to HuggingFace datasets. Datasets consisting of potentially millions of audio files can put strain on a network file system and slow down data transfer. Moving such datasets around can also be quite complicated.

I know that Lhotse supports a compressed format, Shar, but as far as I know it's not suitable for random access during training.

Do you think including support for something like Apache Arrow would be a good idea? That way, one could convert the entire dataset (at least the audio files) into a few shard files and still access individual files randomly without a significant performance loss.
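To make the idea concrete, here is a minimal sketch of "a few shard files plus random access". It is not Apache Arrow and not a Lhotse API; it only illustrates the underlying concept of an offset index using the Python standard library. The names `write_shard` and `read_item` and the fake audio payloads are hypothetical.

```python
import io
from typing import BinaryIO, Dict, Tuple

# Maps an item key to its (byte offset, byte length) inside the shard.
Index = Dict[str, Tuple[int, int]]


def write_shard(blobs: Dict[str, bytes], fileobj: BinaryIO) -> Index:
    """Write all blobs back to back into one file and return the offset index."""
    index: Index = {}
    for key, data in blobs.items():
        offset = fileobj.tell()
        fileobj.write(data)
        index[key] = (offset, len(data))
    return index


def read_item(fileobj: BinaryIO, index: Index, key: str) -> bytes:
    """Random access: seek straight to one item without scanning the shard."""
    offset, length = index[key]
    fileobj.seek(offset)
    return fileobj.read(length)


# Usage with an in-memory "shard"; a real shard would be a file on disk.
shard = io.BytesIO()
index = write_shard(
    {"utt1": b"fake-audio-bytes-1", "utt2": b"fake-audio-bytes-2"},  # placeholder data
    shard,
)
second = read_item(shard, index, "utt2")
first = read_item(shard, index, "utt1")
```

Note that each random read still costs a seek, which is exactly the access pattern that tends to be slow on a network file system.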

Thanks for the response.

Best,
Dominik

@domklement changed the title from "Manipulating with large amounts of Wavs" to "Manipulating large amounts of Wavs" on Feb 13, 2025
@pzelasko
Collaborator

I don't believe that you can have both random access and IO efficiency on slow NFS setups. A few years ago I experimented with Apache Arrow for audio + manifests and couldn't get anything usable out of it, although that doesn't mean it cannot be done. Overall, I found the approach of sequential reading + shard shuffling + multiplexing datasets much more efficient and composable.

However, if you'd like to give it a shot, an easy way may be through HF dataset support in Lhotse #1433
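For readers unfamiliar with the "sequential reading + shard shuffling + multiplexing" approach mentioned above, the following is a rough, library-free sketch of the idea. It is not Lhotse's implementation; `read_shards`, `multiplex`, and the toy shard contents are hypothetical, and the weights are assumed to be positive.

```python
import random
from typing import Iterable, Iterator, List, Sequence


def read_shards(shards: Sequence[Sequence[str]], seed: int) -> Iterator[str]:
    """Visit shards in a shuffled order, but read each shard sequentially."""
    order = list(range(len(shards)))
    random.Random(seed).shuffle(order)
    for i in order:
        yield from shards[i]


def multiplex(
    sources: Sequence[Iterable[str]], weights: Sequence[float], seed: int
) -> Iterator[str]:
    """Randomly interleave several streams until all of them are exhausted."""
    rng = random.Random(seed)
    iters = [iter(s) for s in sources]
    live_weights: List[float] = list(weights)
    while iters:
        # Pick one of the remaining streams, proportionally to its weight.
        i = rng.choices(range(len(iters)), weights=live_weights)[0]
        try:
            yield next(iters[i])
        except StopIteration:
            # Drop the exhausted stream and keep sampling from the rest.
            del iters[i]
            del live_weights[i]


# Toy data: each inner list stands for one shard read front to back.
dataset_a = [["a0", "a1"], ["a2", "a3"], ["a4", "a5"]]
dataset_b = [["b0", "b1"], ["b2", "b3"]]

mixed = list(
    multiplex(
        [read_shards(dataset_a, seed=0), read_shards(dataset_b, seed=1)],
        weights=[0.7, 0.3],
        seed=2,
    )
)
```

Randomness comes from the shard order and from the interleaving of datasets, while every individual shard is still read as one sequential stream, which is the IO-friendly part.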
