Add benchmark/dataset for classical ML algorithms #188

ksangeek · 2019-03-09T11:09:01Z

I don't see any datasets in MLPerf that can be solved with classical machine learning algorithms (e.g. Linear or Logistic Regression, Decision Trees, Random Forest, etc.).
Some examples of datasets I can reference here are:

https://www.kaggle.com/c/criteo-display-ad-challenge/data for binary classification.
https://www.kaggle.com/c/house-prices-advanced-regression-techniques for regression.

These would be useful for use in real-world scenarios where interpretability of the prediction is of utmost importance. Generalized Linear Models have a good share in the real world for this very reason!
I did not find a reference stating that MLPerf is only for deep learning problems, so I think this kind of benchmark/dataset should be added to democratize this suite of benchmarks.
Thanks!

psyhtest · 2019-03-09T12:11:18Z

I totally agree that ML != DL, but do you have any data on how widely these models are used in production?

ksangeek · 2019-03-09T15:45:24Z

Well, I think they target different problem spaces (though they sometimes overlap). I can't say much with confidence about actual usage in production, but based on the Kaggle survey 2018 <https://www.kaggle.com/paultimothymooney/2018-kaggle-machine-learning-data-science-survey>, data science practitioners still give sizable importance to sklearn, random forest and xgboost. There are also promising new players like snapML <https://www.zurich.ibm.com/snapml/> and cuML <https://rapids.ai/> which continue to invest in the classic machine learning space.

TheKanter · 2019-03-09T16:14:23Z

Facebook is quite public about using gradient-boosted decision trees for sigma, their anomaly detector. I would strongly support more traditional forms of ML. David


petermattson added the Backlog label (an issue to be discussed in a future Working Group, but not the immediate next one) · May 22, 2020
