Training data setup #7

Open
gopinath1509 opened this issue Feb 18, 2025 · 1 comment

Comments

@gopinath1509

gopinath1509 commented Feb 18, 2025

The paper says the feature extraction model was fine-tuned on a subset of COYO-700M. What size was the subset, given that the entire dataset is in the terabytes? It would also be great if the code for extracting and setting up the subset could be released too!

@kliyer-ai
Collaborator

Hey, the subset we refer to includes all images from COYO-700M with a short side of at least 512 px. You can obtain it by downloading COYO-700M using the img2dataset utility (https://github.com/rom1504/img2dataset) and setting the min_image_size flag to 512. However, note that we train for only 400 steps, meaning we use only a small portion of this subset. Since we randomly shuffle the shards at the start of training, I cannot provide the exact samples the model was trained on.
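
For anyone trying to reproduce the subset, here is a rough sketch of the download described above using the Python API of img2dataset. Only `min_image_size=512` comes from this thread. Everything else is an assumption for illustration: the metadata path, the output folder, the `url` and `text` column names, the webdataset output format, and `resize_mode="no"`. The helper names `short_side_ok`, `build_download_kwargs`, and `run_download` are hypothetical and not part of this repository.

```python
def short_side_ok(width: int, height: int, min_size: int = 512) -> bool:
    """Subset criterion from the thread: the short side must be at least min_size px."""
    return min(width, height) >= min_size


def build_download_kwargs(url_list: str, output_folder: str, min_image_size: int = 512) -> dict:
    """Collect the arguments passed to img2dataset.download.

    Only min_image_size is taken from the thread; the rest are assumed defaults.
    """
    return {
        "url_list": url_list,              # assumed: folder of COYO-700M metadata parquet files
        "input_format": "parquet",         # assumed: metadata is stored as parquet
        "url_col": "url",                  # assumed column name for image URLs
        "caption_col": "text",             # assumed column name for captions
        "output_format": "webdataset",     # assumed: tar shards, convenient for shard shuffling
        "output_folder": output_folder,
        "min_image_size": min_image_size,  # from the thread: drop images with a short side < 512 px
        "resize_mode": "no",               # assumed: keep original resolution
    }


def run_download(url_list: str, output_folder: str) -> None:
    """Start the download; img2dataset is imported lazily so the helpers work without it."""
    from img2dataset import download

    download(**build_download_kwargs(url_list, output_folder))


if __name__ == "__main__":
    # Hypothetical paths; replace with your own locations.
    run_download("coyo-700m/metadata", "coyo-700m-min512")
```

Since training ran for only 400 steps over randomly shuffled shards, downloading a fraction of the metadata files is probably enough to approximate the setup, though the exact samples cannot be recovered.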
