Read speed during training when streaming data is saved locally, compared to a plain PyTorch DataLoader #9

Open
yja1 opened this issue Jan 20, 2025 · 1 comment

Comments


yja1 commented Jan 20, 2025

When using streaming data that is saved on local disk, how does the read speed during training compare to just using a PyTorch DataLoader?


VSehwag commented Jan 20, 2025

Although StreamingDataset supports loading from a remote location, I recommend loading from local disk because it is fast. When loading from a remote (e.g., AWS S3 buckets), you would need a large cache to avoid thrashing. When training large-parameter models with decent SSDs and <=8 GPUs, data loading shouldn't be a bottleneck.
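One way to check this on your own hardware is to time the loader by itself, with no model in the loop. The following is a minimal sketch, not code from this repository: `measure_throughput` and its parameters are hypothetical names, and it assumes each batch supports `len()` (pass a different `batch_len` otherwise).

```python
import time
from itertools import islice


def measure_throughput(loader, max_batches=100, batch_len=len):
    """Iterate over up to `max_batches` batches and report loading speed.

    `loader` can be any iterable of batches, e.g. a PyTorch DataLoader.
    `batch_len` maps a batch to its number of samples (assumed: len()).
    """
    batches = 0
    samples = 0
    start = time.perf_counter()
    for batch in islice(loader, max_batches):
        batches += 1
        samples += batch_len(batch)
    elapsed = time.perf_counter() - start
    # Guard against a zero-length timing window on very fast iterables.
    rate = samples / elapsed if elapsed > 0 else float("inf")
    return {
        "batches": batches,
        "samples": samples,
        "seconds": elapsed,
        "samples_per_sec": rate,
    }


if __name__ == "__main__":
    # Stand-in loader: 10 batches of 4 samples each.
    fake_loader = [[0] * 4 for _ in range(10)]
    print(measure_throughput(fake_loader, max_batches=5))
```

Running it once on a loader backed by local data and once on a plain PyTorch DataLoader over the same data gives a direct comparison. If the measured samples per second is well above what the training step consumes, data loading is unlikely to be the bottleneck.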
