diff --git a/docs/assets/overview/fabric_data_centric_flow.png b/docs/assets/overview/fabric_data_centric_flow.png new file mode 100644 index 00000000..c5d4c9ef Binary files /dev/null and b/docs/assets/overview/fabric_data_centric_flow.png differ diff --git a/docs/index.md b/docs/index.md index 4363443d..1ae077bd 100644 --- a/docs/index.md +++ b/docs/index.md @@ -1,73 +1,29 @@ -

-

YData Logo

-

-[![pypi](https://img.shields.io/pypi/v/ydata-sdk)](https://pypi.org/project/ydata-sdk) -![Pythonversion](https://img.shields.io/badge/python-3.8%20%7C%203.9%20%7C%203.10-blue) -[![downloads](https://pepy.tech/badge/ydata-sdk/month)](https://pepy.tech/project/ydata-sdk) +!!! note "Ready to achieve high-quality data to train your machine learning models?" -!!! note "YData SDK for improved data quality everywhere!" - - *ydata-sdk* is here! Create a YData account so you can start using today! + *Fabric Community Version* is free! Create a YData account so you can start using it today! [Create account](https://ydata.ai/ydata-fabric-free-trial){ .md-button .md-button--ydata .md-button--stretch} -## Overview - -The *YData SDK* is an ecosystem of methods that allows users to, through a python interface, adopt a *Data-Centric* approach towards the AI development. The solution includes a set of integrated components for data ingestion, standardized data quality evaluation and data improvement, such as *synthetic data generation*, allowing an iterative improvement of the datasets used in high-impact business applications. - -**Synthetic data** can be used as Machine Learning performance enhancer, to augment or mitigate the presence of bias in real data. Furthermore, it can be used as a Privacy Enhancing Technology, to enable data-sharing initiatives or even to fuel testing environments. - -Under the YData-SDK hood, you can find a set of algorithms and metrics based on statistics and deep learning based techniques, that will help you to accelerate your data preparation. - -## Current functionality - -YData SDK is currently composed by the following main modules: -* **Datasources** - - YData’s SDK includes several connectors for easy integration with existing data sources. It supports several storage types, like filesystems and RDBMS. Check the list of connectors. 
- - SDK’s Datasources run on top of Dask, which allows it to deal with not only small workloads but also larger volumes of data. - -* **Synthesizers** - - Simplified interface to train a generative model and learn in a data-driven manner the behavior, the patterns and original data distribution. Optimize your model for [privacy or utility](examples/synthesize_with_privacy_control.md) use-cases. - - From a trained synthesizer, you can generate synthetic samples as needed and parametrise the number of records needed. - - [Anonymization](examples/synthesize_with_anonymization.md) and [privacy](examples/synthesize_with_privacy_control.md) preserving capabilities to ensure that synthetic datasets does not contain Personal Identifiable Information (PII) and can safely be shared! - - [Conditional sampling](examples/synthesize_with_conditional_sampling.md) can be used to restrict the domain and values of specific features in the sampled data. - -* **Synthetic data quality report** - *Coming soon* - - An extensive synthetic data quality report that measures 3 dimensions: privacy, utility and fidelity of the generated data. The report can be downloaded in PDF format for ease of sharing and compliance purposes or as a JSON to enable the integration in data flows. - -* **Profiling** - *Coming soon* - - A set of metrics and algorithms summarizes datasets quality in three main dimensions: warnings, univariate analysis and a multivariate perspective. - -## Supported data formats - -=== "Tabular" - ![Tabular data synthesizer](assets/500x330/single_table.png){ align=right } - The **RegularSynthesizer** is perfect to synthesize high-dimensional data, that is time-indepentent with high quality results. +## Overview +Fabric is a Data-Centric AI workbench that accelerates AI development by helping data scientists achieve production-quality data. It is an end-to-end data development solution that can be hosted on cloud environments (e.g., Azure, AWS, and GCP, among others) or on-prem. 
- [Know more](#){ .md-button .md-button--ydata} +With Fabric, data scientists can explore **automated quality profiling** for a deeper understanding of their data assets, and leverage **smart synthetic data** to unlock data-sharing initiatives, improve data through augmentation or rebalancing, and mitigate bias in their datasets. -=== "Time-Series" - ![Timeseries Synthesizer](assets/500x330/time_series.png){ align=left } - The **TimeSeriesSynthesizer** is perfect to synthesize both regularly and not evenly spaced time-series, from smart-sensors to stock. +

- [Know more](#){ .md-button .md-button--ydata} +Fabric includes a set of integrated components for data ingestion, data profiling, data quality evaluation, and synthetic data generation, including the following functionalities: -=== "Transactional" - ![Transactional data synthesizer](assets/500x330/time_series.png){ align=right } - The **TimeSeriesSynthesizer** supports transactional data, known to have highly irregular time intervals between records and directional relations between entities. +- **Data Catalog:** Simplified, scalable, and seamless connection to a variety of object storages, data warehouses, and relational database management systems, with detailed visual univariate and multivariate data analysis, combined with automatic detection of data quality issues; +- **Labs:** On-demand JupyterLab, Visual Studio Code or H2O Flow development environments with configurable hardware (including GPUs), supercharged with the most popular data science libraries and [YData Python SDK](sdk/index.md) – a code interface to most of the functionalities, ideal for advanced use cases; +- **Synthetic Data:** Simplified interface to train state-of-the-art Machine Learning models able to generate artificial data mimicking a specific Data Source, and assess the quality of the new data according to the 3 essential pillars of fidelity, utility, and privacy; +- **Pipelines:** General-purpose job orchestrator with built-in scalability, modularity, reporting, and experiment-tracking capabilities, useful for iterative experimentation at scale. 
- *Coming soon* - [Know more](#){ .md-button .md-button--ydata} +???+ tip + While each module provides value by itself, **when used together they enable a compelling data-centric narrative arc** that goes from data exploration to data improvement while abstracting away shared core needs like infrastructure, data access, and workspace management: -=== "Relational databases" - ![Relational databases synthesizer](assets/500x330/multi_table.png){ align=left } - The **MultiTableSynthesizer** is perfect to learn how to replicate the data within a relational database schema. - *Coming soon* + ![Fabric Data-Centric Flow](../assets/overview/fabric_data_centric_flow.png){: style="height:398px;width:1042px;align:center"} - [Know more](#){ .md-button .md-button--ydata} diff --git a/docs/sdk/index.md b/docs/sdk/index.md new file mode 100644 index 00000000..4363443d --- /dev/null +++ b/docs/sdk/index.md @@ -0,0 +1,73 @@ +

+

YData Logo

+

+ +[![pypi](https://img.shields.io/pypi/v/ydata-sdk)](https://pypi.org/project/ydata-sdk) +![Pythonversion](https://img.shields.io/badge/python-3.8%20%7C%203.9%20%7C%203.10-blue) +[![downloads](https://pepy.tech/badge/ydata-sdk/month)](https://pepy.tech/project/ydata-sdk) + +!!! note "YData SDK for improved data quality everywhere!" + + *ydata-sdk* is here! Create a YData account so you can start using today! + + [Create account](https://ydata.ai/ydata-fabric-free-trial){ .md-button .md-button--ydata .md-button--stretch} + +## Overview + +The *YData SDK* is an ecosystem of methods that allows users to, through a python interface, adopt a *Data-Centric* approach towards the AI development. The solution includes a set of integrated components for data ingestion, standardized data quality evaluation and data improvement, such as *synthetic data generation*, allowing an iterative improvement of the datasets used in high-impact business applications. + +**Synthetic data** can be used as Machine Learning performance enhancer, to augment or mitigate the presence of bias in real data. Furthermore, it can be used as a Privacy Enhancing Technology, to enable data-sharing initiatives or even to fuel testing environments. + +Under the YData-SDK hood, you can find a set of algorithms and metrics based on statistics and deep learning based techniques, that will help you to accelerate your data preparation. + +## Current functionality + +YData SDK is currently composed by the following main modules: + +* **Datasources** + - YData’s SDK includes several connectors for easy integration with existing data sources. It supports several storage types, like filesystems and RDBMS. Check the list of connectors. + - SDK’s Datasources run on top of Dask, which allows it to deal with not only small workloads but also larger volumes of data. 
+ +* **Synthesizers** + - Simplified interface to train a generative model and learn in a data-driven manner the behavior, the patterns and original data distribution. Optimize your model for [privacy or utility](examples/synthesize_with_privacy_control.md) use-cases. + - From a trained synthesizer, you can generate synthetic samples as needed and parametrise the number of records needed. + - [Anonymization](examples/synthesize_with_anonymization.md) and [privacy](examples/synthesize_with_privacy_control.md) preserving capabilities to ensure that synthetic datasets does not contain Personal Identifiable Information (PII) and can safely be shared! + - [Conditional sampling](examples/synthesize_with_conditional_sampling.md) can be used to restrict the domain and values of specific features in the sampled data. + +* **Synthetic data quality report** + *Coming soon* + - An extensive synthetic data quality report that measures 3 dimensions: privacy, utility and fidelity of the generated data. The report can be downloaded in PDF format for ease of sharing and compliance purposes or as a JSON to enable the integration in data flows. + +* **Profiling** + *Coming soon* + - A set of metrics and algorithms summarizes datasets quality in three main dimensions: warnings, univariate analysis and a multivariate perspective. + +## Supported data formats + +=== "Tabular" + ![Tabular data synthesizer](assets/500x330/single_table.png){ align=right } + The **RegularSynthesizer** is perfect to synthesize high-dimensional data, that is time-indepentent with high quality results. + + [Know more](#){ .md-button .md-button--ydata} + +=== "Time-Series" + ![Timeseries Synthesizer](assets/500x330/time_series.png){ align=left } + The **TimeSeriesSynthesizer** is perfect to synthesize both regularly and not evenly spaced time-series, from smart-sensors to stock. 
+ + [Know more](#){ .md-button .md-button--ydata} + +=== "Transactional" + ![Transactional data synthesizer](assets/500x330/time_series.png){ align=right } + The **TimeSeriesSynthesizer** supports transactional data, known to have highly irregular time intervals between records and directional relations between entities. + + *Coming soon* + + [Know more](#){ .md-button .md-button--ydata} + +=== "Relational databases" + ![Relational databases synthesizer](assets/500x330/multi_table.png){ align=left } + The **MultiTableSynthesizer** is perfect to learn how to replicate the data within a relational database schema. + + *Coming soon* + + [Know more](#){ .md-button .md-button--ydata} diff --git a/docs/getting-started/installation.md b/docs/sdk/installation.md similarity index 100% rename from docs/getting-started/installation.md rename to docs/sdk/installation.md diff --git a/docs/getting-started/quickstart.md b/docs/sdk/quickstart.md similarity index 100% rename from docs/getting-started/quickstart.md rename to docs/sdk/quickstart.md diff --git a/mkdocs.yml b/mkdocs.yml index c0322012..bdda5294 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -1,15 +1,18 @@ -site_name: "YData SDK" +site_name: "YData Fabric" repo_url: https://github.com/ydataai/ydata-sdk repo_name: ydataai/ydata-sdk edit_uri: "" dev_addr: 0.0.0.0:1235 site_dir: "static/docs" nav: - - Getting started: + - Getting Started: - 'index.md' - - Overview: 'index.md' - - Installation: 'getting-started/installation.md' - - Quickstart: 'getting-started/quickstart.md' + - What is Fabric?: 'index.md' + - SDK: + - 'sdk/index.md' + - Overview: 'sdk/index.md' + - Installation: 'sdk/installation.md' + - Quickstart: 'sdk/quickstart.md' - Examples: - Generate Tabular Data: "examples/synthesize_tabular_data.md" - Generate Time-Series Data: "examples/synthesize_timeseries_data.md"