Turn MTEB-Arena-logs into HF dataset? #25

Open
Muennighoff opened this issue Jul 30, 2024 · 4 comments

Comments

@Muennighoff
Contributor

The logs are getting pretty big and will soon be infeasible to keep in full in this repository (or will require git-lfs). Maybe we move them to an HF dataset? @orionw is probably the expert here, what do you think is the best approach?

(/env/lib/conda/gritkto) niklas@dojo-a3-ghpc-55:/data/niklas/arena$ du -sh MTEB-Arena-logs
4.9M    MTEB-Arena-logs
@orionw
Collaborator

orionw commented Jul 30, 2024

What do the logs contain again / why are we saving them?

I'm less familiar with the logging package, is there a way to change the file we log to mid-run? If so, I can add something to change log files daily and upload the previous one.
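
On the question of switching log files mid-run: Python's standard `logging.handlers.TimedRotatingFileHandler` can start a new file at midnight, and its `rotator` hook is called with the finished file. The following is a minimal sketch under assumptions, not the arena's actual setup: the function name `make_daily_logger`, the logger name `"arena"`, and the `on_rotated` callback are all hypothetical.

```python
import logging
import os
from logging.handlers import TimedRotatingFileHandler
from typing import Callable


def make_daily_logger(
    log_path: str,
    on_rotated: Callable[[str], None],
    name: str = "arena",  # hypothetical logger name
) -> logging.Logger:
    """Return a logger that starts a new file at midnight and passes the
    path of the finished file to `on_rotated` (e.g. an upload function)."""
    handler = TimedRotatingFileHandler(log_path, when="midnight", backupCount=0)

    def rotate_and_notify(source: str, dest: str) -> None:
        # Same as the default rotation (a rename), followed by the callback.
        if os.path.exists(source):
            os.rename(source, dest)
            on_rotated(dest)

    # `rotator` is the documented hook that replaces the default rename step.
    handler.rotator = rotate_and_notify
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    )

    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    return logger
```

The callback could, for example, wrap `huggingface_hub.HfApi().upload_file(...)` with `repo_type="dataset"` to push the previous day's file; the target repository would have to be chosen by the maintainers.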

@Muennighoff
Contributor Author

Good point, maybe we don't need to save the logs? They give us exact timestamps of what happened when, but I think we also have timestamps in the results, so we could remove the logs? Should I just add them to .gitignore and delete them from the repo?

The other one is the results, which are at 5.7M right now. It should be fine to store only the latest ones and remove the older ones, correct? If so, I will add them to .gitignore. It might be nice to keep them so we can go back in time, but we could probably also do that by adding a flag that filters results by date and time.

du -sh MTEB-Arena-logs/
17M     MTEB-Arena-logs/
du -sh results
5.7M    results
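
The date filter mentioned above could look roughly like the sketch below. It assumes each result is a JSON object with a Unix-seconds timestamp stored under a `tstamp` key and that results are stored one object per line; both are assumptions, and the helper names `filter_by_time` and `read_jsonl` are hypothetical.

```python
import json
from datetime import datetime, timezone
from typing import Iterable, Iterator, Optional


def filter_by_time(
    records: Iterable[dict],
    start: Optional[datetime] = None,
    end: Optional[datetime] = None,
    key: str = "tstamp",  # assumed field name, Unix seconds
) -> Iterator[dict]:
    """Yield records whose timestamp falls in the half-open range [start, end)."""
    lo = start.timestamp() if start else float("-inf")
    hi = end.timestamp() if end else float("inf")
    for rec in records:
        if lo <= float(rec[key]) < hi:
            yield rec


def read_jsonl(path: str) -> Iterator[dict]:
    """Read one JSON object per non-empty line (assumed storage format)."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)


if __name__ == "__main__":
    # Two made-up records, one on either side of the cutoff.
    demo = [{"tstamp": 1722297600.0, "id": 1}, {"tstamp": 1724112000.0, "id": 2}]
    cutoff = datetime(2024, 8, 1, tzinfo=timezone.utc)
    print([r["id"] for r in filter_by_time(demo, end=cutoff)])
```

Passing timezone-aware `datetime` values avoids the local-time interpretation that `datetime.timestamp()` applies to naive ones.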

@orionw
Collaborator

orionw commented Aug 20, 2024

Yes, there are timestamps in the results, so I think we could filter to get exact times from that. Is there other information that we want to save from the logs or just the results?

For debugging I assume you can keep them locally to see if there are errors (no need to push).

@Muennighoff
Contributor Author

Okay, I put them all in .gitignore: https://github.com/embeddings-benchmark/arena/blob/main/.gitignore
I will remove them soon unless someone objects.
