Duplicate datasets. #167
Comments
Thank you so much for your detailed investigation of the dataset collection @alexzwanenburg! Would you have the bandwidth to make a PR to address (even part of) the duplications?
Yes, I can create a PR to address this issue, though it may take a few weeks to address it fully. I have two questions:
All those suggestions look good to me.
Hi @alexzwanenburg, thanks again for your work spearheading this. Do you still plan to make a PR for these changes? 🙏
Yes, but I still need to update the four final datasets. I can create a PR for the work I have already done.
Ping on this, @alexzwanenburg. Hopefully we can pick up where you left off if you create a PR.
@alexzwanenburg I'm ready to help finish this PR. Is your fork up to date with the changes documented in this issue?
I made a PR. I haven't addressed the last four datasets. |
While trying to identify which data sets from the modeldata R package are already present in pmlb, I found that quite a few datasets are duplicates or simple subsets of other datasets.
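As a rough illustration of how duplicates and feature subsets of this kind could be flagged, here is a minimal sketch. It is not necessarily how the findings in this issue were obtained. The in-memory format (a dict mapping column names to lists of values), the function names, and the toy dataset names are all assumptions made for the example; real data could be loaded first, for instance with the `fetch_data` function of the pmlb Python package. The check compares columns as multisets of values, so it ignores row order, column order, and column names. Because it does not verify that rows line up across columns, a match is only a candidate that still needs manual inspection.

```python
from collections import Counter
from itertools import combinations


def column_fingerprint(values):
    """Order-insensitive fingerprint of one column: its multiset of values."""
    return frozenset(Counter(values).items())


def find_candidates(datasets):
    """Flag candidate duplicates and feature subsets.

    `datasets` maps a dataset name to a table, where a table is a dict of
    column name -> list of hashable values (a hypothetical format).
    Returns (duplicates, subsets); each subset pair is (smaller, larger).
    """
    # One multiset of column fingerprints per dataset.
    prints = {
        name: Counter(column_fingerprint(col) for col in table.values())
        for name, table in datasets.items()
    }
    duplicates, subsets = [], []
    for a, b in combinations(sorted(prints), 2):
        if prints[a] == prints[b]:
            duplicates.append((a, b))
        elif not (prints[a] - prints[b]):  # every column of a also occurs in b
            subsets.append((a, b))
        elif not (prints[b] - prints[a]):  # every column of b also occurs in a
            subsets.append((b, a))
    return duplicates, subsets


# Toy data with made-up names, not taken from any real dataset.
toy = {
    "toy_a": {"x": [1, 2, 3], "y": [0, 1, 0]},
    "toy_a_copy": {"f1": [3, 1, 2], "f2": [0, 0, 1]},  # same rows, shuffled and renamed
    "toy_a_extra": {"x": [1, 2, 3], "y": [0, 1, 0], "id": [10, 11, 12]},
}

duplicates, subsets = find_candidates(toy)
print(duplicates)  # [('toy_a', 'toy_a_copy')]
print(subsets)     # [('toy_a', 'toy_a_extra'), ('toy_a_copy', 'toy_a_extra')]
```

A check like this would not catch the cases where one copy has been re-encoded (for example one-hot-encoded features or a binarized target); those need the encoding to be undone before comparing.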
cmc and contraceptive datasets.
The symboling feature underwent a shift between both datasets. Note: the underlying dataset seems to be the same as the one used for auto. The difference between the datasets is the target, which is price for 195_auto_price and 207_autoPrice, and symboling for auto, as well as how missing values were removed. The original dataset may be found on the UCI ML repository.
Description of each new dataset references the other.
195_auto_price, 207_autoPrice and auto datasets.
glass and prnn_fglass datasets.
The cleve and heart_c datasets have a binarized target (vs. ordinal in the other two datasets); the cleveland_nominal dataset contains only a feature subset. The original can be found on the UCI ML repository.
cleve dataset.
heart_c, cleve, cleveland_nominal, cleveland, heart_statlog, heart_h and hungarian datasets.
colic and horse_colic datasets.
vote and house_votes_84 datasets.
breast_cancer_wisconsin and wdbc datasets.
australian, buggyCrx, credit_a and crx datasets.
The breast dataset has a Sample code number feature that is not present in breast_w. The original can be found on the UCI ML repository.
breast_w and breast datasets.
Parse data from the original into the expected format.
diabetes and pima datasets.
credit_g and german datasets.
solar_flare_2 also contains two additional features.
solar_flare_2 are in fact the other two targets.
solar_flare_2 and flare datasets.
In the car_evaluation dataset, several categorical (ordinal) features from car are one-hot-encoded. The original can be found on the UCI ML repository. This issue was also mentioned in "car and car_evaluation seem to be identical" (#84).
car and car_evaluation datasets.
chess and kr_vs_kp datasets.
294_satellite_image incorrectly specifies a regression problem. The original can be found on the UCI ML repository, and has multiple (6) classes as target.
satimage and 294_satellite_image datasets.
227_cpu_small and 562_cpu_small have fewer features.
197_cpu_act, 227_cpu_small, 562_cpu_small and 573_cpu_act datasets.
poker and 1595_poker datasets.
My proposal is to remove duplicates, using an original dataset where this can be found. This might also address the following issues: