docs: start fabric documentation

Adds the overview page for fabric.
ydataai · Oct 20, 2023 · 5c89480
1 parent a8608ca
commit 5c89480
Showing 6 changed files with 95 additions and 63 deletions.
diff --git a/docs/assets/overview/fabric_data_centric_flow.png b/docs/assets/overview/fabric_data_centric_flow.png
diff --git a/docs/index.md b/docs/index.md
@@ -1,73 +1,29 @@
-<p></p>
-<p align="center"><img width="500" src="https://assets.ydata.ai/sdk/logo_SDK_col_red_black.png" alt="YData Logo"></p>
-<p></p>
 
-[![pypi](https://img.shields.io/pypi/v/ydata-sdk)](https://pypi.org/project/ydata-sdk)
-![Pythonversion](https://img.shields.io/badge/python-3.8%20%7C%203.9%20%7C%203.10-blue)
-[![downloads](https://pepy.tech/badge/ydata-sdk/month)](https://pepy.tech/project/ydata-sdk)
+!!! note "Ready to achieve high-quality data to train your machine learning models?"
 
-!!! note "YData SDK for improved data quality everywhere!"
-
-    *ydata-sdk* is here! Create a YData account so you can start using today!
+    *Fabric Community Version* is free! Create a YData account so you can start using it today!
 
     [Create account](https://ydata.ai/ydata-fabric-free-trial){ .md-button .md-button--ydata .md-button--stretch}
 
-## Overview
-
-The *YData SDK* is an ecosystem of methods that allows users to, through a python interface, adopt a *Data-Centric* approach towards the AI development. The solution includes a set of integrated components for data ingestion, standardized data quality evaluation and data improvement, such as *synthetic data generation*, allowing an iterative improvement of the datasets used in high-impact business applications.
-
-**Synthetic data** can be used as Machine Learning performance enhancer, to augment or mitigate the presence of bias in real data. Furthermore, it can be used as a Privacy Enhancing Technology, to enable data-sharing initiatives or even to fuel testing environments.
-
-Under the YData-SDK hood, you can find a set of algorithms and metrics based on statistics and deep learning based techniques, that will help you to accelerate your data preparation.
-
-## Current functionality
-
-YData SDK is currently composed by the following main modules:
 
-* **Datasources**
-     - YData’s SDK includes several connectors for easy integration with existing data sources. It supports several storage types, like filesystems and RDBMS. Check the list of connectors.
-     - SDK’s Datasources run on top of Dask, which allows it to deal with not only small workloads but also larger volumes of data.
-
-* **Synthesizers**
-     - Simplified interface to train a generative model and learn in a data-driven manner the behavior, the patterns and original data distribution. Optimize your model for [privacy or utility](examples/synthesize_with_privacy_control.md) use-cases.
-     - From a trained synthesizer, you can generate synthetic samples as needed and parametrise the number of records needed.
-     - [Anonymization](examples/synthesize_with_anonymization.md) and [privacy](examples/synthesize_with_privacy_control.md) preserving capabilities to ensure that synthetic datasets does not contain Personal Identifiable Information (PII) and can safely be shared!
-     - [Conditional sampling](examples/synthesize_with_conditional_sampling.md) can be used to restrict the domain and values of specific features in the sampled data.
-
-* **Synthetic data quality report**
-    <span style="color:grey">*Coming soon*</span>
-     - An extensive synthetic data quality report that measures 3 dimensions: privacy, utility and fidelity of the generated data. The report can be downloaded in PDF format for ease of sharing and compliance purposes or as a JSON to enable the integration in data flows.
-
-* **Profiling**
-    <span style="color:grey">*Coming soon*</span>
-    - A set of metrics and algorithms summarizes datasets quality in three main dimensions: warnings, univariate analysis and a multivariate perspective.
-
-## Supported data formats
-
-=== "Tabular"
-    ![Tabular data synthesizer](assets/500x330/single_table.png){ align=right }
-    The **RegularSynthesizer** is perfect to synthesize high-dimensional data, that is time-indepentent with high quality results.
+## Overview
+Fabric is a Data-Centric AI workbench that accelerates AI development by helping data scientists achieve production-quality data. It is an end-to-end data development solution that can be hosted on cloud environments (e.g., Azure, AWS, and GCP, among others) or on-prem.
 
-    [Know more](#){ .md-button .md-button--ydata}
+With Fabric, data scientists can explore **automated quality profiling** for a deeper understanding of their data assets, and leverage **smart synthetic data** to unlock data-sharing initiatives, improve data through augmentation or rebalancing, and mitigate bias in their datasets.
 
-=== "Time-Series"
-    ![Timeseries Synthesizer](assets/500x330/time_series.png){ align=left }
-    The **TimeSeriesSynthesizer** is perfect to synthesize both regularly and not evenly spaced time-series, from smart-sensors to stock.
+<p align="center"> <iframe width="600" height="400" src="https://www.youtube.com/embed/ccF0RaxVLrk" title="Fabric -  The data development platform for improved AI performance" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen></iframe></p>
 
-    [Know more](#){ .md-button .md-button--ydata}
+Fabric includes a set of integrated components for data ingestion, data profiling, data quality evaluation, and synthetic data generation, including the following functionalities:
 
-=== "Transactional"
-    ![Transactional data synthesizer](assets/500x330/time_series.png){ align=right }
-    The **TimeSeriesSynthesizer** supports transactional data, known to have highly irregular time intervals between records and directional relations between entities.
+- **Data Catalog:** Simplified, scalable, and seamless connection to a variety of object storages, data warehouses, and relational database management systems, with detailed visual univariate and multivariate data analysis combined with automatic detection of data quality issues;
+- **Labs:** On-demand JupyterLab, Visual Studio Code or H2O Flow development environments with configurable hardware (including GPUs), supercharged with the most popular data science libraries and [YData Python SDK](sdk/index.md) – a code interface to most of the functionalities, ideal for advanced use cases;
+- **Synthetic Data:** Simplified interface to train state-of-the-art Machine Learning models able to generate artificial data mimicking a specific Data Source, and assess the quality of the new data according to the 3 essential pillars of fidelity, utility, and privacy;
+- **Pipelines:** General-purpose job orchestrator with built-in scalability, modularity, reporting, and experiment-tracking capabilities, useful for iterative experimentation at scale.
 
-    <span style="color:grey">*Coming soon*</span>
 
-    [Know more](#){ .md-button .md-button--ydata}
+???+ tip
+    While each module provides value by itself, **when used together they enable a compelling data-centric narrative arc** that goes from data exploration to data improvement while abstracting away shared core needs like infrastructure, data access, and workspace management:
 
-=== "Relational databases"
-    ![Relational databases synthesizer](assets/500x330/multi_table.png){ align=left }
-    The **MultiTableSynthesizer** is perfect to learn how to replicate the data within a relational database schema.
 
-    <span style="color:grey">*Coming soon*</span>
+    ![Fabric Data-Centric Flow](../assets/overview/fabric_data_centric_flow.png){: style="height:398px;width:1042px;align:center"}
 
-    [Know more](#){ .md-button .md-button--ydata}
diff --git a/docs/sdk/index.md b/docs/sdk/index.md
@@ -0,0 +1,73 @@
+<p></p>
+<p align="center"><img width="500" src="https://assets.ydata.ai/sdk/logo_SDK_col_red_black.png" alt="YData Logo"></p>
+<p></p>
+
+[![pypi](https://img.shields.io/pypi/v/ydata-sdk)](https://pypi.org/project/ydata-sdk)
+![Pythonversion](https://img.shields.io/badge/python-3.8%20%7C%203.9%20%7C%203.10-blue)
+[![downloads](https://pepy.tech/badge/ydata-sdk/month)](https://pepy.tech/project/ydata-sdk)
+
+!!! note "YData SDK for improved data quality everywhere!"
+
+    *ydata-sdk* is here! Create a YData account so you can start using it today!
+
+    [Create account](https://ydata.ai/ydata-fabric-free-trial){ .md-button .md-button--ydata .md-button--stretch}
+
+## Overview
+
+The *YData SDK* is an ecosystem of methods that allows users to adopt, through a Python interface, a *Data-Centric* approach to AI development. The solution includes a set of integrated components for data ingestion, standardized data quality evaluation, and data improvement, such as *synthetic data generation*, allowing iterative improvement of the datasets used in high-impact business applications.
+
+**Synthetic data** can be used as a Machine Learning performance enhancer, to augment real data or mitigate the presence of bias in it. Furthermore, it can be used as a Privacy Enhancing Technology, to enable data-sharing initiatives or even to fuel testing environments.
+
+Under the YData SDK hood, you can find a set of algorithms and metrics, based on statistics and deep learning techniques, that will help you accelerate your data preparation.
+
+## Current functionality
+
+YData SDK is currently composed of the following main modules:
+
+* **Datasources**
+     - YData’s SDK includes several connectors for easy integration with existing data sources. It supports several storage types, like filesystems and RDBMS. Check the list of connectors.
+     - SDK’s Datasources run on top of Dask, which allows them to handle not only small workloads but also larger volumes of data.
+
+* **Synthesizers**
+     - Simplified interface to train a generative model and learn, in a data-driven manner, the behavior, patterns, and distribution of the original data. Optimize your model for [privacy or utility](examples/synthesize_with_privacy_control.md) use cases.
+     - From a trained synthesizer, you can generate synthetic samples as needed and parametrise the number of records to generate.
+     - [Anonymization](examples/synthesize_with_anonymization.md) and [privacy](examples/synthesize_with_privacy_control.md) preserving capabilities to ensure that synthetic datasets do not contain Personally Identifiable Information (PII) and can safely be shared!
+     - [Conditional sampling](examples/synthesize_with_conditional_sampling.md) can be used to restrict the domain and values of specific features in the sampled data.
+
+* **Synthetic data quality report**
+    <span style="color:grey">*Coming soon*</span>
+     - An extensive synthetic data quality report that measures 3 dimensions: privacy, utility and fidelity of the generated data. The report can be downloaded in PDF format for ease of sharing and compliance purposes or as a JSON to enable the integration in data flows.
+
+* **Profiling**
+    <span style="color:grey">*Coming soon*</span>
+    - A set of metrics and algorithms that summarizes dataset quality in three main dimensions: warnings, univariate analysis, and a multivariate perspective.
+
+## Supported data formats
+
+=== "Tabular"
+    ![Tabular data synthesizer](assets/500x330/single_table.png){ align=right }
+    The **RegularSynthesizer** is perfect for synthesizing high-dimensional, time-independent data with high-quality results.
+
+    [Know more](#){ .md-button .md-button--ydata}
+
+=== "Time-Series"
+    ![Timeseries Synthesizer](assets/500x330/time_series.png){ align=left }
+    The **TimeSeriesSynthesizer** is perfect for synthesizing both regularly and irregularly spaced time-series, from smart-sensor readings to stock data.
+
+    [Know more](#){ .md-button .md-button--ydata}
+
+=== "Transactional"
+    ![Transactional data synthesizer](assets/500x330/time_series.png){ align=right }
+    The **TimeSeriesSynthesizer** supports transactional data, known to have highly irregular time intervals between records and directional relations between entities.
+
+    <span style="color:grey">*Coming soon*</span>
+
+    [Know more](#){ .md-button .md-button--ydata}
+
+=== "Relational databases"
+    ![Relational databases synthesizer](assets/500x330/multi_table.png){ align=left }
+    The **MultiTableSynthesizer** is perfect to learn how to replicate the data within a relational database schema.
+
+    <span style="color:grey">*Coming soon*</span>
+
+    [Know more](#){ .md-button .md-button--ydata}
diff --git a/docs/getting-started/installation.md → docs/sdk/installation.md b/docs/getting-started/installation.md → docs/sdk/installation.md
diff --git a/docs/getting-started/quickstart.md → docs/sdk/quickstart.md b/docs/getting-started/quickstart.md → docs/sdk/quickstart.md
diff --git a/mkdocs.yml b/mkdocs.yml
@@ -1,15 +1,18 @@
-site_name: "YData SDK"
+site_name: "YData Fabric"
 repo_url: https://github.com/ydataai/ydata-sdk
 repo_name: ydataai/ydata-sdk
 edit_uri: ""
 dev_addr: 0.0.0.0:1235
 site_dir: "static/docs"
 nav:
-  - Getting started:
+  - Getting Started:
     - 'index.md'
-    - Overview: 'index.md'
-    - Installation: 'getting-started/installation.md'
-    - Quickstart: 'getting-started/quickstart.md'
+    - What is Fabric?: 'index.md'
+  - SDK:
+    - 'sdk/index.md'
+    - Overview: 'sdk/index.md'
+    - Installation: 'sdk/installation.md'
+    - Quickstart: 'sdk/quickstart.md'
   - Examples:
       - Generate Tabular Data: "examples/synthesize_tabular_data.md"
       - Generate Time-Series Data: "examples/synthesize_timeseries_data.md"