Credit/Origin? #13

Open
ddofer opened this issue Dec 19, 2017 · 10 comments

Comments

@ddofer

ddofer commented Dec 19, 2017

  1. Nice resource! I may add some datasets to it in the future (although the ones I use for benchmarking are considerably "rarer" than the ones here: time series + raw text + locations, entities, etc.).
  2. The datasets don't seem to credit their origin. For example, I assume "winered" is the wine dataset from UCI, but nothing in the data folder or the csv.gz file says so.
    Adding the origin, even at the "site" level (e.g. "UCI", "OpenML", "Kaggle datasets", "KDD"), would make it much easier to analyze the original datasets and their context, domain, and interpretation (e.g. "Looking for datasets on time series + predictive maintenance").
@ddofer
Author

ddofer commented Dec 19, 2017

This could be a separate readme file; no need to go overboard. E.g. "analcatdata" = "http://people.stern.nyu.edu/jsimonof/AnalCatData/"?

@rhiever
Contributor

rhiever commented Dec 20, 2017

Good idea. Not sure if we have the bandwidth to get around to doing that anytime soon, but we'll keep it filed here in case anyone wants to take this issue on.

@darwinbandoy

Thanks for this wonderful resource. I am also interested in tracing the origin and background of each dataset, as the readme file just contains "breast tumors". A line about the original source or the accompanying publication would be helpful. Thanks!

@csinva

csinva commented Jul 8, 2019

Also interested in this!

@codrin-kruijne

Compliments from my side for gathering these datasets too! I agree it would be helpful to have information about the dataset source. Ideally a link to where the original is published, so you can find the description of the dataset at its origin. Maybe add it as a column to summary_stats?

@trangdata
Collaborator

Thanks @codrin-kruijne for this input! We actually tried to streamline this effort of adding sources last year. We now have a metadata.yaml file for each dataset. Not all of them have a non-empty source field yet, and we're looking for contributors to add this information. See for example here.

Alternatively, you can get to the metadata by clicking on the octocat in the last column of the summary table on our main website (https://epistasislab.github.io/pmlb/). Hope that helps!

@codrin-kruijne

Thanks @trang1618! I added links to all the metadata.yaml files to our summary_stats table, in a metadata column for easy access, and I will encourage my colleagues to contribute when they find an incomplete one. I will explore a bit and then see how I might contribute.

@trangdata
Collaborator

Amazing @codrin-kruijne ! Thank you!!! 🙏🏽

@jpgard

jpgard commented Apr 27, 2023

+1 here -- this is a great resource, thank you for your work on it.

But also, the missing metadata is a major pain point. There really isn't any way for another contributor to even find where many of these datasets are from, let alone understand more about the dataset itself (what do the labels mean? what are the features? etc.).

We (as users of the package) have no idea where the individual datasets were drawn from, and there isn't any information even in the .yaml files or the published paper. This is really something the developers will need to lead the charge on, or at least provide more information so that others can help :) . Can you at minimum provide a link to each dataset's original source in the metadata table (https://github.com/EpistasisLab/pmlb/blob/master/pmlb/all_summary_stats.tsv), or any information at all about where it is from to narrow the search (Kaggle, UCI, etc.)?

@lacava
Collaborator

lacava commented Apr 27, 2023

Hi @jpgard, unfortunately the dev team for this project has turned over a few times since 2017 and we don't have perfect, verifiable source info for many datasets, so it takes time. Still, we've annotated many datasets with source info; out of 420 current datasets, we still need metadata on about 246 of them. We have a contribution guide for verifying sources: https://epistasislab.github.io/pmlb/contributing.html and some example PRs using Colab (e.g. #86). Every little bit helps!
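
For anyone who wants to see which datasets still need a source, here is a minimal sketch; it is not part of the project's tooling. It assumes each dataset sits in its own folder with a metadata.yaml that has a top-level `source:` key. The folder layout, the function names, and the line-based check (used instead of a real YAML parser to avoid extra dependencies) are assumptions for illustration only.

```python
import re
from pathlib import Path

# Matches a top-level `source:` key and captures the rest of that line.
# Assumption: the key is named `source` and is not indented.
SOURCE_LINE = re.compile(r"^source:[ \t]*(.*)$", re.MULTILINE)

# Values treated as "no source given".
EMPTY_VALUES = {"", '""', "''", "~", "null"}


def has_source(yaml_text: str) -> bool:
    """Return True if a top-level `source:` key has a non-empty value on its line.

    This is a rough text check, not a YAML parse: a block scalar marker
    such as `source: >` counts as present, and comments are not stripped.
    """
    match = SOURCE_LINE.search(yaml_text)
    if match is None:
        return False
    return match.group(1).strip() not in EMPTY_VALUES


def missing_sources(root: Path) -> list[str]:
    """List dataset folder names under `root` whose metadata.yaml lacks a source."""
    return sorted(
        path.parent.name
        for path in root.glob("*/metadata.yaml")
        if not has_source(path.read_text(encoding="utf-8"))
    )
```

Calling `missing_sources(Path("datasets"))` from a local checkout would return the sorted folder names still worth looking into; the `datasets` path is hypothetical here.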
