docs: start fabric documentation
Adds the overview page for fabric.
miriamspsantos committed Oct 20, 2023
1 parent a8608ca commit 5c89480
Showing 6 changed files with 95 additions and 63 deletions.
Binary file added docs/assets/overview/fabric_data_centric_flow.png
72 changes: 14 additions & 58 deletions docs/index.md
@@ -1,73 +1,29 @@
<p></p>
<p align="center"><img width="500" src="https://assets.ydata.ai/sdk/logo_SDK_col_red_black.png" alt="YData Logo"></p>
<p></p>

[![pypi](https://img.shields.io/pypi/v/ydata-sdk)](https://pypi.org/project/ydata-sdk)
![Pythonversion](https://img.shields.io/badge/python-3.8%20%7C%203.9%20%7C%203.10-blue)
[![downloads](https://pepy.tech/badge/ydata-sdk/month)](https://pepy.tech/project/ydata-sdk)
!!! note "Ready to achieve high-quality data to train your machine learning models?"

!!! note "YData SDK for improved data quality everywhere!"

*ydata-sdk* is here! Create a YData account so you can start using it today!
*Fabric Community Version* is free! Create a YData account so you can start using it today!

[Create account](https://ydata.ai/ydata-fabric-free-trial){ .md-button .md-button--ydata .md-button--stretch}

## Overview

The *YData SDK* is an ecosystem of methods that lets users adopt a *Data-Centric* approach to AI development through a Python interface. The solution includes a set of integrated components for data ingestion, standardized data quality evaluation, and data improvement, such as *synthetic data generation*, allowing iterative improvement of the datasets used in high-impact business applications.

**Synthetic data** can be used as a Machine Learning performance enhancer, to augment real data or mitigate the presence of bias in it. Furthermore, it can be used as a Privacy Enhancing Technology, to enable data-sharing initiatives or even to fuel testing environments.

Under the YData-SDK hood, you can find a set of algorithms and metrics, based on statistical and deep learning techniques, that help you accelerate your data preparation.

## Current functionality

YData SDK is currently composed of the following main modules:

* **Datasources**
- YData’s SDK includes several connectors for easy integration with existing data sources. It supports several storage types, like filesystems and RDBMS. Check the list of connectors.
- SDK’s Datasources run on top of Dask, which allows it to deal with not only small workloads but also larger volumes of data.

* **Synthesizers**
- Simplified interface to train a generative model that learns, in a data-driven manner, the behavior, patterns, and distribution of the original data. Optimize your model for [privacy or utility](examples/synthesize_with_privacy_control.md) use cases.
- From a trained synthesizer, you can generate synthetic samples as needed and parametrise the number of records to generate.
- [Anonymization](examples/synthesize_with_anonymization.md) and [privacy](examples/synthesize_with_privacy_control.md) preserving capabilities to ensure that synthetic datasets do not contain Personally Identifiable Information (PII) and can safely be shared!
- [Conditional sampling](examples/synthesize_with_conditional_sampling.md) can be used to restrict the domain and values of specific features in the sampled data.

* **Synthetic data quality report**
<span style="color:grey">*Coming soon*</span>
- An extensive synthetic data quality report that measures 3 dimensions: privacy, utility and fidelity of the generated data. The report can be downloaded in PDF format for ease of sharing and compliance purposes or as a JSON to enable the integration in data flows.

* **Profiling**
<span style="color:grey">*Coming soon*</span>
- A set of metrics and algorithms that summarizes dataset quality in three main dimensions: warnings, univariate analysis, and a multivariate perspective.

## Supported data formats

=== "Tabular"
![Tabular data synthesizer](assets/500x330/single_table.png){ align=right }
The **RegularSynthesizer** is well suited to synthesizing high-dimensional, time-independent data with high-quality results.
## Overview
Fabric is a Data-Centric AI workbench that accelerates AI development by helping data scientists achieve production-quality data. It is an end-to-end data development solution that can be hosted on cloud environments (e.g., Azure, AWS, and GCP, among others) or on-prem.

[Know more](#){ .md-button .md-button--ydata}
With Fabric, data scientists can explore **automated quality profiling** for a deeper understanding of their data assets, and leverage **smart synthetic data** to unlock data-sharing initiatives, improve data through augmentation or rebalancing, and mitigate bias in their datasets.

=== "Time-Series"
![Timeseries Synthesizer](assets/500x330/time_series.png){ align=left }
The **TimeSeriesSynthesizer** is well suited to synthesizing both regularly and irregularly spaced time series, from smart sensors to stocks.
<p align="center"> <iframe width="600" height="400" src="https://www.youtube.com/embed/ccF0RaxVLrk" title="Fabric - The data development platform for improved AI performance" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen></iframe></p>

[Know more](#){ .md-button .md-button--ydata}
Fabric includes a set of integrated components for data ingestion, data profiling, data quality evaluation, and synthetic data generation, including the following functionalities:

=== "Transactional"
![Transactional data synthesizer](assets/500x330/time_series.png){ align=right }
The **TimeSeriesSynthesizer** supports transactional data, known to have highly irregular time intervals between records and directional relations between entities.
- **Data Catalog:** Simplified, scalable, and seamless connection to a variety of object storages, data warehouses, and relational database management systems, with detailed visual data univariate and multivariate analysis combined with automatic detection of data quality issues;
- **Labs:** On-demand JupyterLab, Visual Studio Code or H2O Flow development environments with configurable hardware (including GPUs), supercharged with the most popular data science libraries and [YData Python SDK](sdk/index.md) – a code interface to most of the functionalities, ideal for advanced use cases;
- **Synthetic Data:** Simplified interface to train state-of-the-art Machine Learning models able to generate artificial data mimicking a specific Data Source, and assess the quality of the new data according to the 3 essential pillars of fidelity, utility, and privacy;
- **Pipelines:** General-purpose job orchestrator with built-in scalability, modularity, reporting, and experiment-tracking capabilities, useful for iterative experimentation at scale.

<span style="color:grey">*Coming soon*</span>

[Know more](#){ .md-button .md-button--ydata}
???+ tip
While each module provides value by itself, **when used together they enable a compelling data-centric narrative arc** that goes from data exploration to data improvement while abstracting away shared core needs like infrastructure, data access, and workspace management:

=== "Relational databases"
![Relational databases synthesizer](assets/500x330/multi_table.png){ align=left }
The **MultiTableSynthesizer** is perfect to learn how to replicate the data within a relational database schema.

<span style="color:grey">*Coming soon*</span>
![Fabric Data-Centric Flow](../assets/overview/fabric_data_centric_flow.png){: style="height:398px;width:1042px;align:center"}

[Know more](#){ .md-button .md-button--ydata}
73 changes: 73 additions & 0 deletions docs/sdk/index.md
@@ -0,0 +1,73 @@
<p></p>
<p align="center"><img width="500" src="https://assets.ydata.ai/sdk/logo_SDK_col_red_black.png" alt="YData Logo"></p>
<p></p>

[![pypi](https://img.shields.io/pypi/v/ydata-sdk)](https://pypi.org/project/ydata-sdk)
![Pythonversion](https://img.shields.io/badge/python-3.8%20%7C%203.9%20%7C%203.10-blue)
[![downloads](https://pepy.tech/badge/ydata-sdk/month)](https://pepy.tech/project/ydata-sdk)

!!! note "YData SDK for improved data quality everywhere!"

*ydata-sdk* is here! Create a YData account so you can start using it today!

[Create account](https://ydata.ai/ydata-fabric-free-trial){ .md-button .md-button--ydata .md-button--stretch}

## Overview

The *YData SDK* is an ecosystem of methods that lets users adopt a *Data-Centric* approach to AI development through a Python interface. The solution includes a set of integrated components for data ingestion, standardized data quality evaluation, and data improvement, such as *synthetic data generation*, allowing iterative improvement of the datasets used in high-impact business applications.

**Synthetic data** can be used as a Machine Learning performance enhancer, to augment real data or mitigate the presence of bias in it. Furthermore, it can be used as a Privacy Enhancing Technology, to enable data-sharing initiatives or even to fuel testing environments.

Under the YData-SDK hood, you can find a set of algorithms and metrics, based on statistical and deep learning techniques, that help you accelerate your data preparation.

## Current functionality

YData SDK is currently composed of the following main modules:

* **Datasources**
- YData’s SDK includes several connectors for easy integration with existing data sources. It supports several storage types, like filesystems and RDBMS. Check the list of connectors.
- SDK’s Datasources run on top of Dask, which allows it to deal with not only small workloads but also larger volumes of data.

* **Synthesizers**
- Simplified interface to train a generative model that learns, in a data-driven manner, the behavior, patterns, and distribution of the original data. Optimize your model for [privacy or utility](examples/synthesize_with_privacy_control.md) use cases.
- From a trained synthesizer, you can generate synthetic samples as needed and parametrise the number of records to generate.
- [Anonymization](examples/synthesize_with_anonymization.md) and [privacy](examples/synthesize_with_privacy_control.md) preserving capabilities to ensure that synthetic datasets do not contain Personally Identifiable Information (PII) and can safely be shared!
- [Conditional sampling](examples/synthesize_with_conditional_sampling.md) can be used to restrict the domain and values of specific features in the sampled data.

* **Synthetic data quality report**
<span style="color:grey">*Coming soon*</span>
- An extensive synthetic data quality report that measures 3 dimensions: privacy, utility and fidelity of the generated data. The report can be downloaded in PDF format for ease of sharing and compliance purposes or as a JSON to enable the integration in data flows.

* **Profiling**
<span style="color:grey">*Coming soon*</span>
- A set of metrics and algorithms that summarizes dataset quality in three main dimensions: warnings, univariate analysis, and a multivariate perspective.
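
The train-then-sample workflow described under **Synthesizers** can be illustrated with a minimal, self-contained sketch. `ToySynthesizer`, its `fit` and `sample` methods, and the `condition` argument are hypothetical names introduced only for this illustration; they are not the ydata-sdk API, and the "model" simply resamples each column independently rather than training a generative network.

```python
import random
from typing import Dict, List, Optional

Record = Dict[str, object]


class ToySynthesizer:
    """Hypothetical stand-in, NOT the ydata-sdk API.

    It memorises the observed values of each column and resamples every
    column independently, so cross-column relationships are not preserved.
    """

    def __init__(self, seed: Optional[int] = None) -> None:
        self._rng = random.Random(seed)
        self._columns: Dict[str, List[object]] = {}

    def fit(self, records: List[Record]) -> "ToySynthesizer":
        """Learn the empirical distribution of each column."""
        if not records:
            raise ValueError("cannot fit on an empty dataset")
        self._columns = {
            name: [row[name] for row in records] for name in records[0]
        }
        return self

    def sample(
        self, n_samples: int, condition: Optional[Dict[str, object]] = None
    ) -> List[Record]:
        """Generate n_samples records, optionally pinning some features."""
        if not self._columns:
            raise RuntimeError("call fit() before sample()")
        condition = condition or {}
        unknown = set(condition) - set(self._columns)
        if unknown:
            raise KeyError(f"unknown columns in condition: {sorted(unknown)}")
        synthetic = []
        for _ in range(n_samples):
            row = {
                name: self._rng.choice(values)
                for name, values in self._columns.items()
            }
            # Columns are independent here, so fixing a value is a valid
            # (if crude) form of conditional sampling.
            row.update(condition)
            synthetic.append(row)
        return synthetic


real = [
    {"age": 31, "plan": "basic"},
    {"age": 45, "plan": "pro"},
    {"age": 52, "plan": "basic"},
]
synth = ToySynthesizer(seed=0).fit(real)
samples = synth.sample(n_samples=5, condition={"plan": "pro"})
print(samples)
```

The number of records is a parameter of `sample`, not of `fit`, which mirrors the idea that a synthesizer is trained once and then queried for as many records as needed.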

## Supported data formats

=== "Tabular"
![Tabular data synthesizer](assets/500x330/single_table.png){ align=right }
The **RegularSynthesizer** is well suited to synthesizing high-dimensional, time-independent data with high-quality results.

[Know more](#){ .md-button .md-button--ydata}

=== "Time-Series"
![Timeseries Synthesizer](assets/500x330/time_series.png){ align=left }
The **TimeSeriesSynthesizer** is well suited to synthesizing both regularly and irregularly spaced time series, from smart sensors to stocks.

[Know more](#){ .md-button .md-button--ydata}

=== "Transactional"
![Transactional data synthesizer](assets/500x330/time_series.png){ align=right }
The **TimeSeriesSynthesizer** supports transactional data, known to have highly irregular time intervals between records and directional relations between entities.

<span style="color:grey">*Coming soon*</span>

[Know more](#){ .md-button .md-button--ydata}

=== "Relational databases"
![Relational databases synthesizer](assets/500x330/multi_table.png){ align=left }
The **MultiTableSynthesizer** is perfect to learn how to replicate the data within a relational database schema.

<span style="color:grey">*Coming soon*</span>

[Know more](#){ .md-button .md-button--ydata}
File renamed without changes.
File renamed without changes.
13 changes: 8 additions & 5 deletions mkdocs.yml
@@ -1,15 +1,18 @@
-site_name: "YData SDK"
+site_name: "YData Fabric"
 repo_url: https://github.com/ydataai/ydata-sdk
 repo_name: ydataai/ydata-sdk
 edit_uri: ""
 dev_addr: 0.0.0.0:1235
 site_dir: "static/docs"
 nav:
-- Getting started:
+- Getting Started:
 - 'index.md'
-- Overview: 'index.md'
-- Installation: 'getting-started/installation.md'
-- Quickstart: 'getting-started/quickstart.md'
+- What is Fabric?: 'index.md'
+- SDK:
+- 'sdk/index.md'
+- Overview: 'sdk/index.md'
+- Installation: 'sdk/installation.md'
+- Quickstart: 'sdk/quickstart.md'
 - Examples:
 - Generate Tabular Data: "examples/synthesize_tabular_data.md"
 - Generate Time-Series Data: "examples/synthesize_timeseries_data.md"