Data Preparation Tasks #27

Open
siebert-julien opened this issue Mar 7, 2023 · 2 comments

Comments

@siebert-julien

Dear all,

My name is Julien and I am a researcher working for the Fraunhofer Institute for Experimental Software Engineering (IESE) in Kaiserslautern, Germany. I am quite new to the topic of ontologies, so please excuse me if I ask naive questions.

I am interested in ontologies that represent data preparation aspects. The underlying context is how preparation tasks influence the quality of the prediction, and how to reason about that influence. One can think of missing values, outliers, collinear features, imbalanced features, etc. as data characteristics that can have an impact on the prediction.

I recently started with the state of the art (reading published papers), but I haven't yet looked much into the state of the practice (e.g., getting my hands dirty with some libraries).

My first impression is that existing ontologies seem to be more focused on the prediction part of data analysis pipelines. Is that correct, or am I missing something?

@joaquinvanschoren
Contributor

In MLSchema, an 'implementation' can be any complex preprocessing pipeline, but I think that you are right that most ontologies don't express exactly which preprocessing happens in the pipeline.

There are certainly ways to describe the preprocessing steps of a pipeline explicitly, e.g.:
https://docs.datadrivendiscovery.org/devel/write_pipeline.html
https://onnx.ai/sklearn-onnx/auto_tutorial/plot_abegin_convert_pipeline.html
https://huggingface.co/docs/optimum/onnxruntime/usage_guides/pipelines
https://www.tensorflow.org/tfx/tutorials/tfx/template

Every tool basically uses what works for them, usually based on a DAG. I'm not aware of significant standardization efforts in this area. I would be very interested if you found any :).
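To make the DAG idea concrete, here is a minimal sketch in plain Python. It is not taken from any of the tools linked above; the step names (`impute`, `clip`, `scale`), the clipping bounds, and the `run_pipeline` helper are all hypothetical and chosen only for illustration. It uses the standard-library `graphlib` module (Python 3.9+) to order the steps by their dependencies.

```python
from graphlib import TopologicalSorter


# Hypothetical preprocessing steps; names and defaults are illustrative only.
def impute_missing(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


def clip_outliers(values, low=0.0, high=10.0):
    """Clamp every value into the (assumed) range [low, high]."""
    return [min(max(v, low), high) for v in values]


def scale_min_max(values):
    """Rescale values linearly to the interval [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


STEPS = {"impute": impute_missing, "clip": clip_outliers, "scale": scale_min_max}

# The DAG: each step maps to the set of steps it depends on.
DAG = {"impute": set(), "clip": {"impute"}, "scale": {"clip"}}


def run_pipeline(values):
    """Run all steps in an order that respects the DAG's dependencies."""
    order = list(TopologicalSorter(DAG).static_order())
    for name in order:
        values = STEPS[name](values)
    return order, values


if __name__ == "__main__":
    order, result = run_pipeline([1.0, None, 3.0, 100.0])
    print(order)
    print(result)
```

Because the dependency structure lives in a plain dictionary rather than in the code that runs the steps, it is the kind of information an ontology could in principle describe, which is the gap discussed in this thread.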

@siebert-julien
Author

@joaquinvanschoren Thank you for your answer. I am now involved in an EU project proposal and have looked further at the state of the art, but I have not seen anything in the direction of standardization either. I'll keep looking ;)
