docs: update multi-table related documentation (#83)
* docs: update multi-table examples

* docs: update SDK mainpage

* docs: update sdk installation step by step guide.

* docs: new quickstart tutorial.

* docs: update multitable example.

* fix(linting): code formatting

---------

Co-authored-by: Fabiana Clemente <[email protected]>
Co-authored-by: Azory YData Bot <[email protected]>
3 people authored Jan 17, 2024
1 parent e15e427 commit fadeed9
Showing 21 changed files with 114 additions and 57 deletions.
Binary file removed docs/assets/fabric_sdk_token.png
Binary file not shown.
Binary file added docs/assets/sdk/fabric_sdk_token.png
17 changes: 0 additions & 17 deletions docs/examples/synthesizer_multitable.md

This file was deleted.

46 changes: 46 additions & 0 deletions docs/get-started/create_multitable_dataset.md
@@ -0,0 +1,46 @@
# How to create your first Relational database in Fabric's Catalog

To create your first multi-table dataset in the **Data Catalog**, start by clicking **"Add Dataset"** in the Home section.
Alternatively, open the **Data Catalog** (in the left-side menu) and click **“Add Dataset”**.

![Create dataset with upload csv](../assets/quickstart/upload_csv/welcome_add_dataset.png){: style="width:75%"}

The modal below is then shown, and you need to select a connector. To create a multi-table dataset, choose an RDBMS connector such as Azure SQL, Snowflake or MySQL.
In this case, let's select MySQL.

![Data Catalog connectors](../assets/quickstart/upload_csv/data_catalog_connectors.png){: style="width:40%"}

Once you've selected the **“MySQL”** connector, a new screen appears where you can enter the connection details: the database username, host and password, as well as the database name.

![MySQL credentials](../assets/quickstart/create_relational_database/mysql_credentials.png){: style="width:60%"}

With the *Connector* created, you'll be able to add a dataset and specify its properties:

- **Name:** The name of your dataset.
- **Table:** You can create a dataset with all the tables from the schema, or select only the tables that you need in your project.
- **Query:** Create a single-table dataset by providing a query.

![Add dataset details](../assets/quickstart/create_relational_database/mysql_dataset_creation.png){: style="width:45%"}

Now both the Connector to the MySQL Berka database and the Berka dataset will be added to our Catalog.
As soon as the status is green, you can explore your Dataset. Click **Open** on the dataset, as shown in the image below.

![Explore dataset](../assets/quickstart/create_relational_database/open_dataset.png){: style="width:75%"}

Within the **Dataset** details, you can gain valuable insights, such as your database schema.

![Database schema overview ](../assets/quickstart/create_relational_database/database_schema_overview.png){: style="width:75%"}

For each table, you can explore both an overview of its structure (number of columns, number of rows, etc.) and a useful
summary of the quality and warnings regarding your dataset's behaviour.

![Dataset profiling](../assets/quickstart/create_relational_database/table_overview.png){: style="width:75%"}

**Congrats!** 🚀 You have now successfully created your first **Connector** and **Multi-table Dataset** in Fabric’s Data Catalog.
To get the IDs of both your database and your project, you can decompose the URL of the Database schema overview page. The structure is as follows:

```
https://fabric.ydata.ai/rdbms/{your-dataset-id}?ns={your-project-id}
```
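
If you prefer to extract the two IDs programmatically, the URL structure above can be decomposed with Python's standard library. This is only a minimal sketch: the helper name `split_fabric_url` is hypothetical and not part of the Fabric SDK, and it assumes the URL follows exactly the `/rdbms/{your-dataset-id}?ns={your-project-id}` pattern shown above.

```python
from typing import Tuple
from urllib.parse import parse_qs, urlparse


def split_fabric_url(url: str) -> Tuple[str, str]:
    """Return (dataset_id, project_id) from a Fabric RDBMS dataset URL.

    Hypothetical helper, not part of the Fabric SDK.
    """
    parsed = urlparse(url)
    # The dataset ID is the path segment right after "rdbms".
    segments = [segment for segment in parsed.path.split("/") if segment]
    if len(segments) != 2 or segments[0] != "rdbms":
        raise ValueError(f"Unexpected URL path: {parsed.path!r}")
    # The project ID is carried by the `ns` query parameter.
    ns_values = parse_qs(parsed.query).get("ns")
    if not ns_values:
        raise ValueError("Missing 'ns' query parameter")
    return segments[1], ns_values[0]
```

For example, calling `split_fabric_url` on the URL copied from your browser returns the dataset ID first and the project ID second; a URL that does not match the expected pattern raises a `ValueError` instead of returning a wrong ID.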

Get ready for your journey of improved quality data for AI.
3 changes: 2 additions & 1 deletion docs/get-started/index.md
@@ -4,7 +4,8 @@ The get started is here to help you if you are not yet familiar with YData Fabri
data quality, data preparation workflows and how you can start leveraging synthetic data.
<a href="fabric_community"><u>Mention to YData Fabric Community</u></a>

### 📚 <a href="upload_csv"><u>Create your first Data with the Data Catalog</u></a>
### 📚 <a href="upload_csv"><u>Create your first Dataset with the Data Catalog</u></a>
### 💾 <a href="create_multitable_dataset"><u>Create your Multi-Table Dataset with the Data Catalog</u></a>
### ⚙️ <a href="create_syntheticdata_generator"><u>Create your first Synthetic Data generator</u></a>
### 🧪 <a href="create_lab"><u>Create your first Lab</u></a>
### 🌀 <a href="create_pipeline"><u>Create your first data Pipeline</u></a>
File renamed without changes.
File renamed without changes.
25 changes: 25 additions & 0 deletions docs/sdk/examples/synthesizer_multitable.md
@@ -0,0 +1,25 @@
# Synthesize Relational databases

**Integrate Fabric's *MultiTableSynthesizer* in your data flows and generate synthetic relational databases or multi-table datasets**

The capability to generate synthetic data from relational databases is a powerful and innovative approach to
streamline access to data and improve the data democratization strategy within an organization.
Fabric's SDK provides an easy-to-use code interface to integrate the process of generating synthetic multi-table databases
into your existing data flows.

!!! tip "How to get your datasource?"
Learn how to create your multi-table data in Fabric <a href="/get-started/create_multitable_dataset"><u>here</u></a> before creating your first multi-table synthetic data generator!

**Get your datasource and connector ID**

*Datasource uid:* You can find your datasource ID through the Fabric UI. Open your relational dataset and click the "Explore in Labs" button.
Copy the uid shown in the code snippet.

*Connector uid:* You can find your connector ID through the Fabric UI. Open the connector tab from your Data Catalog. Under the connector "Actions",
select "Explore in Lab". Copy the uid shown in the code snippet.

Quickstart example:

```python
--8<-- "examples/synthesizers/multi_table_quickstart.py"
```
40 changes: 20 additions & 20 deletions docs/sdk/index.md
@@ -1,38 +1,40 @@
<p></p>
<p align="center"><img width="500" src="https://assets.ydata.ai/sdk/logo_SDK_col_red_black.png" alt="YData Logo"></p>
<p></p>

[![pypi](https://img.shields.io/pypi/v/ydata-sdk)](https://pypi.org/project/ydata-sdk)
![Pythonversion](https://img.shields.io/badge/python-3.8%20%7C%203.9%20%7C%203.10-blue)
[![downloads](https://pepy.tech/badge/ydata-sdk/month)](https://pepy.tech/project/ydata-sdk)

!!! note "YData SDK for improved data quality everywhere!"
!!! note "Fabric SDK for improved data quality everywhere!"

*ydata-sdk* is here! Create a YData account so you can start using today!

[Create account](https://ydata.ai/ydata-fabric-free-trial){ .md-button .md-button--ydata .md-button--stretch}

## Overview

The *YData SDK* is an ecosystem of methods that allows users to, through a python interface, adopt a *Data-Centric* approach towards the AI development. The solution includes a set of integrated components for data ingestion, standardized data quality evaluation and data improvement, such as *synthetic data generation*, allowing an iterative improvement of the datasets used in high-impact business applications.
The *Fabric SDK* is an ecosystem of methods that allows users to, through a python interface, adopt data development focused on improving the quality of the data.
The solution includes a set of integrated components for data ingestion, standardized data quality evaluation and data improvement, such as *synthetic data generation*, allowing an iterative improvement of the datasets used in high-impact business applications.

## Benefits

**Synthetic data** can be used as Machine Learning performance enhancer, to augment or mitigate the presence of bias in real data. Furthermore, it can be used as a Privacy Enhancing Technology, to enable data-sharing initiatives or even to fuel testing environments.
The Fabric SDK interface makes it possible to integrate data quality tooling with other platforms, offering several benefits in the realm of
data science development and data management:

Under the YData-SDK hood, you can find a set of algorithms and metrics based on statistics and deep learning based techniques, that will help you to accelerate your data preparation.
- **Interoperability:** seamless integration with other data platforms and systems like Databricks, Snowflake, etc. This ensures that all your software will work cohesively with all the elements from your data architecture.
- **Collaboration:** ease of integration with a multitude of tools and services, reducing the need to reinvent the wheel and fostering a collaborative environment for all developers (data scientists, data engineers, software developers, etc.)
- **Improved usage experience:** Fabric SDK enables a well-integrated software solution, which allows a seamless transition between different tools or platforms without facing compatibility issues.

## Current functionality

YData SDK is currently composed by the following main modules:
Fabric SDK is currently composed by the following main modules:

* **Datasources**
- YData’s SDK includes several connectors for easy integration with existing data sources. It supports several storage types, like filesystems and RDBMS. Check the list of connectors.
- SDK’s Datasources run on top of Dask, which allows it to deal with not only small workloads but also larger volumes of data.

* **Synthesizers**
- Simplified interface to train a generative model and learn in a data-driven manner the behavior, the patterns and original data distribution. Optimize your model for [privacy or utility](../examples/synthesize_with_privacy_control.md) use-cases.
- From a trained synthesizer, you can generate synthetic samples as needed and parametrise the number of records needed.
- [Anonymization](../examples/synthesize_with_anonymization.md) and [privacy](../examples/synthesize_with_privacy_control.md) preserving capabilities to ensure that synthetic datasets does not contain Personal Identifiable Information (PII) and can safely be shared!
- [Conditional sampling](../examples/synthesize_with_conditional_sampling.md) can be used to restrict the domain and values of specific features in the sampled data.
* **Synthetic data generators**
- Simplified interface to train a generative model and learn in a data-driven manner the behavior, the patterns and original data distribution. Optimize your model for [privacy or utility](examples/synthesize_with_privacy_control.md) use-cases.
- From a trained synthetic data generator, you can generate synthetic samples as needed and parametrise the number of records needed.
- [Anonymization](sdk/examples/synthesize_with_anonymization.md) and [privacy](sdk/examples/synthesize_with_privacy_control.md) preserving capabilities to ensure that synthetic datasets does not contain Personal Identifiable Information (PII) and can safely be shared!
- [Conditional sampling](sdk/examples/synthesize_with_conditional_sampling.md) can be used to restrict the domain and values of specific features in the sampled data.

* **Synthetic data quality report**
<span style="color:grey">*Coming soon*</span>
@@ -45,29 +47,27 @@ YData SDK is currently composed by the following main modules:
## Supported data formats

=== "Tabular"
![Tabular data synthesizer](../assets/500x330/single_table.png){ align=right }
![Tabular data Synthetic data generator](../assets/500x330/single_table.png){ align=right }
The **RegularSynthesizer** is perfect to synthesize high-dimensional, time-independent data with high-quality results.

[Know more](#){ .md-button .md-button--ydata}

=== "Time-Series"
![Timeseries Synthesizer](../assets/500x330/time_series.png){ align=left }
![Timeseries Synthetic data generator](../assets/500x330/time_series.png){ align=left }
The **TimeSeriesSynthesizer** is perfect to synthesize both regularly and not evenly spaced time-series, from smart-sensors to stock.

[Know more](#){ .md-button .md-button--ydata}

=== "Transactional"
![Transactional data synthesizer](../assets/500x330/time_series.png){ align=right }
![Transactional data Synthetic data generator](../assets/500x330/time_series.png){ align=right }
The **TimeSeriesSynthesizer** supports transactional data, known to have highly irregular time intervals between records and directional relations between entities.

<span style="color:grey">*Coming soon*</span>

[Know more](#){ .md-button .md-button--ydata}

=== "Relational databases"
![Relational databases synthesizer](../assets/500x330/multi_table.png){ align=left }
![Relational databases Synthetic data generator](../assets/500x330/multi_table.png){ align=left }
The **MultiTableSynthesizer** is perfect to learn how to replicate the data within a relational database schema.

<span style="color:grey">*Coming soon*</span>

[Know more](#){ .md-button .md-button--ydata}
2 changes: 1 addition & 1 deletion docs/sdk/installation.md
@@ -28,7 +28,7 @@ YData SDK offers a free-trial and an enterprise version. To access your free-tri

The token will be available [**here**](https://fabric.ydata.ai), after login:

![SDK Token](../assets/fabric_sdk_token.png){: style="height:450px;width:750px;align:center"}
![SDK Token](../assets/sdk/fabric_sdk_token.png){: style="height:450px;width:750px;align:center"}

With your account token copied, you can set a new environment variable `YDATA_TOKEN` at the beginning of your development session.
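
As an illustration of this step, the variable can be set and checked from Python before the SDK is used. This is a minimal sketch under stated assumptions: `ensure_ydata_token` is a hypothetical helper, not part of the SDK, and the only thing taken from the documentation is the `YDATA_TOKEN` variable name and the `<TOKEN>` placeholder.

```python
import os


def ensure_ydata_token(token=None):
    """Set YDATA_TOKEN for the current process (if given) and return it.

    Hypothetical helper, not part of the SDK. Raises RuntimeError when the
    variable is missing or still holds the '<TOKEN>' placeholder.
    """
    if token:
        os.environ["YDATA_TOKEN"] = token
    value = os.environ.get("YDATA_TOKEN", "").strip()
    if not value or value == "<TOKEN>":
        raise RuntimeError("YDATA_TOKEN is not set; copy your token from Fabric after login.")
    return value
```

Calling `ensure_ydata_token()` at the top of a script fails early with a clear message when the token is missing, rather than at the first SDK call.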

25 changes: 13 additions & 12 deletions examples/synthesizers/multi_table_quickstart.py
@@ -3,23 +3,24 @@
from ydata.sdk.datasources import DataSource
from ydata.sdk.synthesizers import MultiTableSynthesizer

# Do not forget to add your token as env variables
# Authenticate to Fabric to leverage the SDK - https://docs.sdk.ydata.ai/latest/sdk/installation/
# Make sure to add your token as env variable.
os.environ["YDATA_TOKEN"] = '<TOKEN>' # Remove if already defined

# In this example, we demonstrate how to train a synthesizer from an existing multi table RDBMS datasource.
# After training a Multi Table Synthesizer, we request a sample.
# In this case, we don't return the Dataset for the sample, it will be saved in the database
# that the connector refers to.

# In this example, we demonstrate how to train a synthesizer from an existing RDBMS Dataset.
# Make sure to follow the step-by-step guide to create a Dataset in Fabric's catalog: https://docs.sdk.ydata.ai/latest/get-started/create_multitable_dataset/
X = DataSource.get('<DATASOURCE_UID>')

# Initialize a multi table synthesizer with the connector to write to
# As long as the synthesizer does not call `fit`, it exists only locally
# write_connector can be an UID or a Connector instance
# Init a multi-table synthesizer. Provide a connector so that the data synthesis process writes the
# synthetic data into the destination database
# Provide a connector ID as the write_connector argument. See in this tutorial how to get a connector ID
synth = MultiTableSynthesizer(write_connector='<CONNECTOR_UID>')

# The synthesizer training is requested
# Start the training of your synthetic data generator
synth.fit(X)

# We request a synthetic dataset with a fracion of 1.5
synth.sample(frac=1.5)
# As soon as the training process is completed you are able to sample a synthetic database
# The expected input is a fraction of the original database size
# In this case, a synthetic database with the same size as the original is requested
# Your synthetic sample is written to the database provided in the write_connector
synth.sample(frac=1.)
13 changes: 7 additions & 6 deletions mkdocs.yml
@@ -10,6 +10,7 @@ nav:
- "get-started/index.md"
- Quickstart:
- How to create your first Dataset from a CSV file: "get-started/upload_csv.md"
- How to create your first Relational database in Fabric's Catalog: "get-started/create_multitable_dataset.md"
- How to create your first Synthetic Data generator: "get-started/create_syntheticdata_generator.md"
- How to create your first Lab: "get-started/create_lab.md"
- How to create your first Pipeline: "get-started/create_pipeline.md"
@@ -21,12 +22,12 @@ nav:
- Components:
- "sdk/modules/connectors.md"
- Examples:
- Generate Tabular Data: "examples/synthesize_tabular_data.md"
- Generate Time-Series Data: "examples/synthesize_timeseries_data.md"
- Generate MultiTable Data: "examples/synthesizer_multitable.md"
- Anonymization: "examples/synthesize_with_anonymization.md"
- Privacy Level: "examples/synthesize_with_privacy_control.md"
- Conditional Sampling: "examples/synthesize_with_conditional_sampling.md"
- Generate Tabular Data: "sdk/examples/synthesize_tabular_data.md"
- Generate Time-Series Data: "sdk/examples/synthesize_timeseries_data.md"
- Generate MultiTable Data: "sdk/examples/synthesizer_multitable.md"
- Anonymization: "sdk/examples/synthesize_with_anonymization.md"
- Privacy Level: "sdk/examples/synthesize_with_privacy_control.md"
- Conditional Sampling: "sdk/examples/synthesize_with_conditional_sampling.md"
- Reference:
- Changelog: 'sdk/reference/changelog.md'
- API: