Manipulating large amounts of Wavs #1454

domklement · 2025-02-13T11:06:39Z

Hello,

I'm wondering if there's a more efficient way of handling large datasets in Lhotse, similar to what HuggingFace datasets offers? Datasets consisting of potentially millions of audio files can put a strain on a network file system and slow down data transfer. Moving such datasets around can also be quite complicated.

I know that Lhotse supports a compressed format, Shar, but afaik it's not suitable for random access during training.

Do you think including support for something like Apache Arrow would be a good idea? That way, one could convert the entire dataset (at least the audio files) into a few shard files and still access individual files randomly without a significant performance loss.
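To make the idea concrete, here is a minimal sketch of random access into a single shard file. It is not Apache Arrow and not a Lhotse API: it uses a plain byte-offset index as a stand-in, and the names `write_shard`, `read_item`, and the `shard-000000.bin` filename are hypothetical, chosen only for illustration.

```python
import os
import tempfile
from typing import Dict, Tuple

# Maps an item key to (byte offset, byte length) inside the shard file.
Index = Dict[str, Tuple[int, int]]


def write_shard(path: str, items: Dict[str, bytes]) -> Index:
    """Concatenate all payloads into one file and record where each one lives."""
    index: Index = {}
    with open(path, "wb") as f:
        for key, payload in items.items():
            index[key] = (f.tell(), len(payload))
            f.write(payload)
    return index


def read_item(path: str, index: Index, key: str) -> bytes:
    """Fetch one payload with a single seek + read, without scanning the shard."""
    offset, length = index[key]
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)


# Toy payloads standing in for encoded audio files.
fake_audio = {"utt1": b"\x00\x01" * 4, "utt2": b"\x02" * 5, "utt3": b"\x03" * 3}
shard_path = os.path.join(tempfile.mkdtemp(), "shard-000000.bin")
shard_index = write_shard(shard_path, fake_audio)
recovered = read_item(shard_path, shard_index, "utt2")
```

This reduces the number of files on the file system, but every random read is still a separate seek, so it does not by itself address throughput on a slow network FS.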

Thanks for the response.

Best,
Dominik

pzelasko · 2025-02-19T17:54:11Z

I don't believe you can have both random access and IO efficiency on slow NFS setups. A few years ago I was playing around with Apache Arrow for audio + manifests and couldn't get anything usable out of it, although that doesn't mean it can't be done. Overall, I found the approach of sequential reading + shard shuffling + multiplexing datasets much more efficient and composable.

However, if you'd like to give it a shot, an easy way may be through HF dataset support in Lhotse #1433
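For readers unfamiliar with the sequential reading + shard shuffling + multiplexing approach mentioned above, the following is a rough, library-free sketch of the idea. It does not use Lhotse's own API; `read_shards` and `multiplex` are hypothetical helper names, and in-memory lists of strings stand in for shard files on disk.

```python
import random
from typing import Iterable, Iterator, List, Sequence


def read_shards(shards: Sequence[List[str]], seed: int) -> Iterator[str]:
    """Shuffle the order of shards, then read each shard front to back."""
    order = list(range(len(shards)))
    random.Random(seed).shuffle(order)
    for idx in order:
        # Inside a shard the access pattern stays sequential.
        yield from shards[idx]


def multiplex(
    streams: Sequence[Iterable[str]], weights: Sequence[float], seed: int
) -> Iterator[str]:
    """Interleave several streams, choosing the next source at random by weight."""
    rng = random.Random(seed)
    active = [(iter(s), w) for s, w in zip(streams, weights)]
    while active:
        pick = rng.choices(range(len(active)), weights=[w for _, w in active])[0]
        try:
            yield next(active[pick][0])
        except StopIteration:
            # Drop the exhausted stream and keep sampling from the rest.
            active.pop(pick)


# Two toy "datasets", each a list of shards holding a few item IDs.
dataset_a = [[f"a{s}-{i}" for i in range(3)] for s in range(2)]
dataset_b = [[f"b{s}-{i}" for i in range(2)] for s in range(2)]

mixed = list(
    multiplex(
        [read_shards(dataset_a, seed=0), read_shards(dataset_b, seed=1)],
        weights=[0.7, 0.3],
        seed=42,
    )
)
```

Randomness here comes from the shard order and from the interleaving of sources rather than from per-item random reads, which is what keeps the IO pattern sequential.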

domklement changed the title ~~Manipulating with large amounts of Wavs~~ Manipulating large amounts of Wavs Feb 13, 2025
