generated from ydataai/opensource-template
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Showing 6 changed files with 95 additions and 63 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,73 +1,29 @@ | ||
<p></p> | ||
<p align="center"><img width="500" src="https://assets.ydata.ai/sdk/logo_SDK_col_red_black.png" alt="YData Logo"></p> | ||
<p></p> | ||
|
||
[![pypi](https://img.shields.io/pypi/v/ydata-sdk)](https://pypi.org/project/ydata-sdk) | ||
![Pythonversion](https://img.shields.io/badge/python-3.8%20%7C%203.9%20%7C%203.10-blue) | ||
[![downloads](https://pepy.tech/badge/ydata-sdk/month)](https://pepy.tech/project/ydata-sdk) | ||
!!! note "Ready to achieve high-quality data to train your machine learning models?" | ||
|
||
!!! note "YData SDK for improved data quality everywhere!" | ||
|
||
*ydata-sdk* is here! Create a YData account so you can start using today! | ||
*Fabric Community Version* is free! Create a YData account so you can start using today! | ||
|
||
[Create account](https://ydata.ai/ydata-fabric-free-trial){ .md-button .md-button--ydata .md-button--stretch} | ||
|
||
## Overview | ||
|
||
The *YData SDK* is an ecosystem of methods that allows users to, through a python interface, adopt a *Data-Centric* approach towards the AI development. The solution includes a set of integrated components for data ingestion, standardized data quality evaluation and data improvement, such as *synthetic data generation*, allowing an iterative improvement of the datasets used in high-impact business applications. | ||
|
||
**Synthetic data** can be used as Machine Learning performance enhancer, to augment or mitigate the presence of bias in real data. Furthermore, it can be used as a Privacy Enhancing Technology, to enable data-sharing initiatives or even to fuel testing environments. | ||
|
||
Under the YData-SDK hood, you can find a set of algorithms and metrics based on statistics and deep learning based techniques, that will help you to accelerate your data preparation. | ||
|
||
## Current functionality | ||
|
||
YData SDK is currently composed by the following main modules: | ||
|
||
* **Datasources** | ||
- YData’s SDK includes several connectors for easy integration with existing data sources. It supports several storage types, like filesystems and RDBMS. Check the list of connectors. | ||
- SDK’s Datasources run on top of Dask, which allows it to deal with not only small workloads but also larger volumes of data. | ||
|
||
* **Synthesizers** | ||
- Simplified interface to train a generative model and learn in a data-driven manner the behavior, the patterns and original data distribution. Optimize your model for [privacy or utility](examples/synthesize_with_privacy_control.md) use-cases. | ||
- From a trained synthesizer, you can generate synthetic samples as needed and parametrise the number of records needed. | ||
- [Anonymization](examples/synthesize_with_anonymization.md) and [privacy](examples/synthesize_with_privacy_control.md) preserving capabilities to ensure that synthetic datasets do not contain Personally Identifiable Information (PII) and can safely be shared! | ||
- [Conditional sampling](examples/synthesize_with_conditional_sampling.md) can be used to restrict the domain and values of specific features in the sampled data. | ||
|
||
* **Synthetic data quality report** | ||
<span style="color:grey">*Coming soon*</span> | ||
- An extensive synthetic data quality report that measures 3 dimensions: privacy, utility and fidelity of the generated data. The report can be downloaded in PDF format for ease of sharing and compliance purposes or as a JSON to enable the integration in data flows. | ||
|
||
* **Profiling** | ||
<span style="color:grey">*Coming soon*</span> | ||
- A set of metrics and algorithms summarizes datasets quality in three main dimensions: warnings, univariate analysis and a multivariate perspective. | ||
|
||
## Supported data formats | ||
|
||
=== "Tabular" | ||
![Tabular data synthesizer](assets/500x330/single_table.png){ align=right } | ||
The **RegularSynthesizer** is perfect to synthesize high-dimensional data that is time-independent, with high-quality results. | ||
## Overview | ||
Fabric is a Data-Centric AI workbench that accelerates AI development by helping data scientists achieve production-quality data. It is an end-to-end data development solution that can be hosted on cloud environments (e.g., Azure, AWS, and GCP, among others) or on-prem. | ||
|
||
[Know more](#){ .md-button .md-button--ydata} | ||
With Fabric data scientists can explore **automated quality profiling** for a deeper understanding of their data assets, and leverage **smart synthetic data** to unlock data-sharing initiatives, improve data through augmentation or rebalancing, and mitigate bias in their datasets. | ||
|
||
=== "Time-Series" | ||
![Timeseries Synthesizer](assets/500x330/time_series.png){ align=left } | ||
The **TimeSeriesSynthesizer** is perfect to synthesize both regularly and not evenly spaced time-series, from smart-sensors to stock. | ||
<p align="center"> <iframe width="600" height="400" src="https://www.youtube.com/embed/ccF0RaxVLrk" title="Fabric - The data development platform for improved AI performance" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen></iframe></p> | ||
|
||
[Know more](#){ .md-button .md-button--ydata} | ||
Fabric includes a set of integrated components for data ingestion, data profiling, data quality evaluation, and synthetic data generation, including the following functionalities: | ||
|
||
=== "Transactional" | ||
![Transactional data synthesizer](assets/500x330/time_series.png){ align=right } | ||
The **TimeSeriesSynthesizer** supports transactional data, known to have highly irregular time intervals between records and directional relations between entities. | ||
- **Data Catalog:** Simplified, scalable, and seamless connection to a variety of object storages, data warehouses, and relational database management systems, with detailed visual data univariate and multivariate analysis combined with automatic detection of data quality issues; | ||
- **Labs:** On-demand JupyterLab, Visual Studio Code or H2O Flow development environments with configurable hardware (including GPUs), supercharged with the most popular data science libraries and [YData Python SDK](sdk/index.md) – a code interface to most of the functionalities, ideal for advanced use cases; | ||
- **Synthetic Data:** Simplified interface to train state-of-the-art Machine Learning models able to generate artificial data mimicking a specific Data Source, and assess the quality of the new data according to the 3 essential pillars of fidelity, utility, and privacy; | ||
- **Pipelines:** General-purpose job orchestrator with built-in scalability, modularity, reporting, and experiment-tracking capabilities, useful for iterative experimentation at scale. | ||
|
||
<span style="color:grey">*Coming soon*</span> | ||
|
||
[Know more](#){ .md-button .md-button--ydata} | ||
???+ tip | ||
While each module provides value by itself, **when used together they enable a compelling data-centric narrative arc** that goes from data exploration to data improvement while abstracting away shared core needs like infrastructure, data access, and workspace management: | ||
|
||
=== "Relational databases" | ||
![Relational databases synthesizer](assets/500x330/multi_table.png){ align=left } | ||
The **MultiTableSynthesizer** is perfect to learn how to replicate the data within a relational database schema. | ||
|
||
<span style="color:grey">*Coming soon*</span> | ||
![Fabric Data-Centric Flow](../assets/overview/fabric_data_centric_flow.png){: style="height:398px;width:1042px;align:center"} | ||
|
||
[Know more](#){ .md-button .md-button--ydata} |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,73 @@ | ||
<p></p> | ||
<p align="center"><img width="500" src="https://assets.ydata.ai/sdk/logo_SDK_col_red_black.png" alt="YData Logo"></p> | ||
<p></p> | ||
|
||
[![pypi](https://img.shields.io/pypi/v/ydata-sdk)](https://pypi.org/project/ydata-sdk) | ||
![Pythonversion](https://img.shields.io/badge/python-3.8%20%7C%203.9%20%7C%203.10-blue) | ||
[![downloads](https://pepy.tech/badge/ydata-sdk/month)](https://pepy.tech/project/ydata-sdk) | ||
|
||
!!! note "YData SDK for improved data quality everywhere!" | ||
|
||
*ydata-sdk* is here! Create a YData account so you can start using today! | ||
|
||
[Create account](https://ydata.ai/ydata-fabric-free-trial){ .md-button .md-button--ydata .md-button--stretch} | ||
|
||
## Overview | ||
|
||
The *YData SDK* is an ecosystem of methods that allows users to, through a python interface, adopt a *Data-Centric* approach towards the AI development. The solution includes a set of integrated components for data ingestion, standardized data quality evaluation and data improvement, such as *synthetic data generation*, allowing an iterative improvement of the datasets used in high-impact business applications. | ||
|
||
**Synthetic data** can be used as Machine Learning performance enhancer, to augment or mitigate the presence of bias in real data. Furthermore, it can be used as a Privacy Enhancing Technology, to enable data-sharing initiatives or even to fuel testing environments. | ||
|
||
Under the YData-SDK hood, you can find a set of algorithms and metrics based on statistics and deep learning based techniques, that will help you to accelerate your data preparation. | ||
|
||
## Current functionality | ||
|
||
YData SDK is currently composed by the following main modules: | ||
|
||
* **Datasources** | ||
- YData’s SDK includes several connectors for easy integration with existing data sources. It supports several storage types, like filesystems and RDBMS. Check the list of connectors. | ||
- SDK’s Datasources run on top of Dask, which allows it to deal with not only small workloads but also larger volumes of data. | ||
|
||
* **Synthesizers** | ||
- Simplified interface to train a generative model and learn in a data-driven manner the behavior, the patterns and original data distribution. Optimize your model for [privacy or utility](examples/synthesize_with_privacy_control.md) use-cases. | ||
- From a trained synthesizer, you can generate synthetic samples as needed and parametrise the number of records needed. | ||
- [Anonymization](examples/synthesize_with_anonymization.md) and [privacy](examples/synthesize_with_privacy_control.md) preserving capabilities to ensure that synthetic datasets do not contain Personally Identifiable Information (PII) and can safely be shared! | ||
- [Conditional sampling](examples/synthesize_with_conditional_sampling.md) can be used to restrict the domain and values of specific features in the sampled data. | ||
|
||
* **Synthetic data quality report** | ||
<span style="color:grey">*Coming soon*</span> | ||
- An extensive synthetic data quality report that measures 3 dimensions: privacy, utility and fidelity of the generated data. The report can be downloaded in PDF format for ease of sharing and compliance purposes or as a JSON to enable the integration in data flows. | ||
|
||
* **Profiling** | ||
<span style="color:grey">*Coming soon*</span> | ||
- A set of metrics and algorithms summarizes datasets quality in three main dimensions: warnings, univariate analysis and a multivariate perspective. | ||
|
||
## Supported data formats | ||
|
||
=== "Tabular" | ||
![Tabular data synthesizer](assets/500x330/single_table.png){ align=right } | ||
The **RegularSynthesizer** is perfect to synthesize high-dimensional data that is time-independent, with high-quality results. | ||
|
||
[Know more](#){ .md-button .md-button--ydata} | ||
|
||
=== "Time-Series" | ||
![Timeseries Synthesizer](assets/500x330/time_series.png){ align=left } | ||
The **TimeSeriesSynthesizer** is perfect to synthesize both regularly and not evenly spaced time-series, from smart-sensors to stock. | ||
|
||
[Know more](#){ .md-button .md-button--ydata} | ||
|
||
=== "Transactional" | ||
![Transactional data synthesizer](assets/500x330/time_series.png){ align=right } | ||
The **TimeSeriesSynthesizer** supports transactional data, known to have highly irregular time intervals between records and directional relations between entities. | ||
|
||
<span style="color:grey">*Coming soon*</span> | ||
|
||
[Know more](#){ .md-button .md-button--ydata} | ||
|
||
=== "Relational databases" | ||
![Relational databases synthesizer](assets/500x330/multi_table.png){ align=left } | ||
The **MultiTableSynthesizer** is perfect to learn how to replicate the data within a relational database schema. | ||
|
||
<span style="color:grey">*Coming soon*</span> | ||
|
||
[Know more](#){ .md-button .md-button--ydata} |
File renamed without changes.
File renamed without changes.