docs: update multi-table related documentation (#83)

* docs: update multi-table examples
* docs: update SDK mainpage
* docs: update sdk installation step by step guide.
* docs: new quickstart tutorial.
* docs: update multitable example.
* fix(linting): code formatting

---------

Co-authored-by: Fabiana Clemente <[email protected]>
Co-authored-by: Azory YData Bot <[email protected]>
ydataai · Jan 17, 2024 · fadeed9
1 parent e15e427
commit fadeed9
Showing 21 changed files with 114 additions and 57 deletions.
diff --git a/docs/assets/fabric_sdk_token.png b/docs/assets/fabric_sdk_token.png
diff --git a/docs/assets/quickstart/create_relational_database/database_schema_overview.png b/docs/assets/quickstart/create_relational_database/database_schema_overview.png
diff --git a/docs/assets/quickstart/create_relational_database/mysql_credentials.png b/docs/assets/quickstart/create_relational_database/mysql_credentials.png
diff --git a/docs/assets/quickstart/create_relational_database/mysql_dataset_creation.png b/docs/assets/quickstart/create_relational_database/mysql_dataset_creation.png
diff --git a/docs/assets/quickstart/create_relational_database/open_dataset.png b/docs/assets/quickstart/create_relational_database/open_dataset.png
diff --git a/docs/assets/quickstart/create_relational_database/table_overview.png b/docs/assets/quickstart/create_relational_database/table_overview.png
diff --git a/docs/assets/sdk/fabric_sdk_token.png b/docs/assets/sdk/fabric_sdk_token.png
diff --git a/docs/examples/synthesizer_multitable.md b/docs/examples/synthesizer_multitable.md
diff --git a/docs/get-started/create_multitable_dataset.md b/docs/get-started/create_multitable_dataset.md
@@ -0,0 +1,46 @@
+# How to create your first Relational database in Fabric's Catalog
+
+To create your first multi-table dataset in the **Data Catalog**, start by clicking **"Add Dataset"** in the Home section.
+Alternatively, open the **Data Catalog** (in the left-side menu) and click **“Add Dataset”**.
+
+![Create dataset with upload csv](../assets/quickstart/upload_csv/welcome_add_dataset.png){: style="width:75%"}
+
+The modal below will then be shown, and you will need to select a connector. To create a multi-table dataset, choose an RDBMS connector such as Azure SQL, Snowflake or MySQL.
+In this case, let's select MySQL.
+
+![Data Catalog connectors](../assets/quickstart/upload_csv/data_catalog_connectors.png){: style="width:40%"}
+
+Once you've selected the **“MySQL”** connector, a new screen will appear where you can enter the connection details, such as the database username, host and password, as well as the database name.
+
+![MySQL credentials](../assets/quickstart/create_relational_database/mysql_credentials.png){: style="width:60%"}
+
+With the *Connector* created, you'll be able to add a dataset and specify its properties:
+
+- **Name:** The name of your dataset.
+- **Table:** You can create a dataset with all the tables from the schema or select only the tables that you need in your project.
+- **Query:** Create a single-table dataset by providing a query.
+
+![Add dataset details](../assets/quickstart/create_relational_database/mysql_dataset_creation.png){: style="width:45%"}
+
+Now both the Connector to the MySQL Berka database and the Berka dataset will be added to our Catalog.
+As soon as the status is green, you can navigate your Dataset. Click **Open** on the dataset, as shown in the image below.
+
+![Explore dataset](../assets/quickstart/create_relational_database/open_dataset.png){: style="width:75%"}
+
+Within the **Dataset** details, you can gain valuable insights like your database schema.
+
+![Database schema overview ](../assets/quickstart/create_relational_database/database_schema_overview.png){: style="width:75%"}
+
+For each and every table, you can explore both an overview of the structure (number of columns, number of rows, etc.) and a useful
+summary of the quality and warnings regarding your dataset's behaviour.
+
+![Dataset profiling](../assets/quickstart/create_relational_database/table_overview.png){: style="width:75%"}
+
+**Congrats!** 🚀 You have now successfully created your first **Connector** and **Multi-table Dataset** in Fabric’s Data Catalog.
+To get the IDs of both your database and your project, you can decompose the URL of the Database schema overview page. The structure is as follows:
+
+```
+    https://fabric.ydata.ai/rdbms/{your-dataset-id}?ns={your-project-id}
+```
+
+Get ready for your journey of improved quality data for AI.
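The URL structure given in this guide can also be decomposed programmatically. The following is a minimal sketch, assuming the URL follows exactly the `/rdbms/{your-dataset-id}?ns={your-project-id}` pattern shown in the guide. The function name `parse_fabric_dataset_url` and the example IDs are hypothetical and are not part of the Fabric SDK.

```python
from typing import Tuple
from urllib.parse import parse_qs, urlparse


def parse_fabric_dataset_url(url: str) -> Tuple[str, str]:
    """Return (dataset_id, project_id) from a Database schema overview URL.

    Hypothetical helper: assumes the path is /rdbms/<dataset-id> and that
    the project ID is carried in the `ns` query parameter.
    """
    parsed = urlparse(url)
    # Split the path into its non-empty segments, e.g. ["rdbms", "<dataset-id>"]
    segments = [segment for segment in parsed.path.split("/") if segment]
    if len(segments) != 2 or segments[0] != "rdbms":
        raise ValueError(f"Unexpected path in URL: {parsed.path!r}")

    # parse_qs maps each query key to a list of values
    ns_values = parse_qs(parsed.query).get("ns", [])
    if len(ns_values) != 1:
        raise ValueError("Expected exactly one 'ns' query parameter")

    return segments[1], ns_values[0]


# Example with placeholder IDs (not real identifiers)
dataset_id, project_id = parse_fabric_dataset_url(
    "https://fabric.ydata.ai/rdbms/my-dataset-id?ns=my-project-id"
)
print(dataset_id, project_id)
```

Raising `ValueError` on an unexpected shape keeps a mistyped or differently structured URL from silently producing a wrong ID.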
diff --git a/docs/get-started/index.md b/docs/get-started/index.md
@@ -4,7 +4,8 @@ The get started is here to help you if you are not yet familiar with YData Fabri
 data quality, data preparation workflows and how you can start leveraging synthetic data.
 <a href="fabric_community"><u>Mention to YData Fabric Community</u></a>
 
-### 📚 <a href="upload_csv"><u>Create your first Data with the Data Catalog</u></a>
+### 📚 <a href="upload_csv"><u>Create your first Dataset with the Data Catalog</u></a>
+### 💾 <a href="create_multitable_dataset"><u>Create your Multi-Table Dataset with the Data Catalog</u></a>
 ### ⚙️ <a href="create_syntheticdata_generator"><u>Create your first Synthetic Data generator</u></a>
 ### 🧪 <a href="create_lab"><u>Create your first Lab</u></a>
 ### 🌀 <a href="create_pipeline"><u>Create your first data Pipeline</u></a>
diff --git a/docs/examples/synthesize_tabular_data.md → docs/sdk/examples/synthesize_tabular_data.md b/docs/examples/synthesize_tabular_data.md → docs/sdk/examples/synthesize_tabular_data.md
diff --git a/docs/examples/synthesize_timeseries_data.md → ...dk/examples/synthesize_timeseries_data.md b/docs/examples/synthesize_timeseries_data.md → ...dk/examples/synthesize_timeseries_data.md
diff --git a/...examples/synthesize_with_anonymization.md → ...examples/synthesize_with_anonymization.md b/...examples/synthesize_with_anonymization.md → ...examples/synthesize_with_anonymization.md
diff --git a/...s/synthesize_with_conditional_sampling.md → ...s/synthesize_with_conditional_sampling.md b/...s/synthesize_with_conditional_sampling.md → ...s/synthesize_with_conditional_sampling.md
diff --git a/...amples/synthesize_with_privacy_control.md → ...amples/synthesize_with_privacy_control.md b/...amples/synthesize_with_privacy_control.md → ...amples/synthesize_with_privacy_control.md
diff --git a/docs/sdk/examples/synthesizer_multitable.md b/docs/sdk/examples/synthesizer_multitable.md
@@ -0,0 +1,25 @@
+# Synthesize Relational databases
+
+**Integrate Fabric's *MultiTableSynthesizer* in your data flows and generate synthetic relational databases or multi-table datasets**
+
+The capability to generate synthetic data from relational databases is a powerful and innovative approach to
+streamline access to data and improve the data democratization strategy within an organization.
+Fabric's SDK provides an easy-to-use code interface to integrate the generation of synthetic multi-table databases
+into your existing data flows.
+
+!!! tip "How to get your datasource?"
+    Learn how to create your multi-table data in Fabric <a href="/get-started/create_multitable_dataset"><u>here</u></a> before creating your first multi-table synthetic data generator!
+
+    **Get your datasource and connector ID**
+
+    *Datasource uid:* You can find your datasource ID through the Fabric UI. Open your relational dataset and click the "Explore in Labs" button.
+    Copy the uid shown in the code snippet.
+
+    *Connector uid:* You can find your connector ID through the Fabric UI. Open the connector tab from your Data Catalog. Under the connector "Actions",
+    select "Explore in Lab". Copy the uid shown in the code snippet.
+
+Quickstart example:
+
+```python
+--8<-- "examples/synthesizers/multi_table_quickstart.py"
+```
diff --git a/docs/sdk/index.md b/docs/sdk/index.md
@@ -1,38 +1,40 @@
-<p></p>
-<p align="center"><img width="500" src="https://assets.ydata.ai/sdk/logo_SDK_col_red_black.png" alt="YData Logo"></p>
-<p></p>
-
 [![pypi](https://img.shields.io/pypi/v/ydata-sdk)](https://pypi.org/project/ydata-sdk)
 ![Pythonversion](https://img.shields.io/badge/python-3.8%20%7C%203.9%20%7C%203.10-blue)
 [![downloads](https://pepy.tech/badge/ydata-sdk/month)](https://pepy.tech/project/ydata-sdk)
 
-!!! note "YData SDK for improved data quality everywhere!"
+!!! note "Fabric SDK for improved data quality everywhere!"
 
     *ydata-sdk* is here! Create a YData account so you can start using today!
 
     [Create account](https://ydata.ai/ydata-fabric-free-trial){ .md-button .md-button--ydata .md-button--stretch}
 
 ## Overview
 
-The *YData SDK* is an ecosystem of methods that allows users to, through a python interface, adopt a *Data-Centric* approach towards the AI development. The solution includes a set of integrated components for data ingestion, standardized data quality evaluation and data improvement, such as *synthetic data generation*, allowing an iterative improvement of the datasets used in high-impact business applications.
+The *Fabric SDK* is an ecosystem of methods that lets users, through a Python interface, adopt a data development approach focused on improving data quality.
+The solution includes a set of integrated components for data ingestion, standardized data quality evaluation and data improvement, such as *synthetic data generation*, allowing an iterative improvement of the datasets used in high-impact business applications.
+
+## Benefits
 
-**Synthetic data** can be used as Machine Learning performance enhancer, to augment or mitigate the presence of bias in real data. Furthermore, it can be used as a Privacy Enhancing Technology, to enable data-sharing initiatives or even to fuel testing environments.
+The Fabric SDK interface makes it possible to integrate data quality tooling with other platforms, offering several benefits in the realm of
+data science development and data management:
 
-Under the YData-SDK hood, you can find a set of algorithms and metrics based on statistics and deep learning based techniques, that will help you to accelerate your data preparation.
+- **Interoperability:** seamless integration with other data platforms and systems like Databricks, Snowflake, etc. This ensures that your software works cohesively with all the elements of your data architecture.
+- **Collaboration:** ease of integration with a multitude of tools and services, reducing the need to reinvent the wheel and fostering a collaborative environment for all developers (data scientists, data engineers, software developers, etc.).
+- **Improved usage experience:** Fabric SDK enables a well-integrated software solution, which allows a seamless transition between different tools or platforms without compatibility issues.
 
 ## Current functionality
 
-YData SDK is currently composed by the following main modules:
+Fabric SDK is currently composed of the following main modules:
 
 * **Datasources**
      - YData’s SDK includes several connectors for easy integration with existing data sources. It supports several storage types, like filesystems and RDBMS. Check the list of connectors.
      - SDK’s Datasources run on top of Dask, which allows it to deal with not only small workloads but also larger volumes of data.
 
-* **Synthesizers**
-     - Simplified interface to train a generative model and learn in a data-driven manner the behavior, the patterns and original data distribution. Optimize your model for [privacy or utility](../examples/synthesize_with_privacy_control.md) use-cases.
-     - From a trained synthesizer, you can generate synthetic samples as needed and parametrise the number of records needed.
-     - [Anonymization](../examples/synthesize_with_anonymization.md) and [privacy](../examples/synthesize_with_privacy_control.md) preserving capabilities to ensure that synthetic datasets does not contain Personal Identifiable Information (PII) and can safely be shared!
-     - [Conditional sampling](../examples/synthesize_with_conditional_sampling.md) can be used to restrict the domain and values of specific features in the sampled data.
+* **Synthetic data generators**
+     - Simplified interface to train a generative model and learn in a data-driven manner the behavior, the patterns and original data distribution. Optimize your model for [privacy or utility](examples/synthesize_with_privacy_control.md) use-cases.
+     - From a trained synthetic data generator, you can generate synthetic samples as needed and parametrise the number of records needed.
+     - [Anonymization](examples/synthesize_with_anonymization.md) and [privacy](examples/synthesize_with_privacy_control.md) preserving capabilities to ensure that synthetic datasets do not contain Personally Identifiable Information (PII) and can safely be shared!
+     - [Conditional sampling](examples/synthesize_with_conditional_sampling.md) can be used to restrict the domain and values of specific features in the sampled data.
 
 * **Synthetic data quality report**
     <span style="color:grey">*Coming soon*</span>
@@ -45,29 +47,27 @@ YData SDK is currently composed by the following main modules:
 ## Supported data formats
 
 === "Tabular"
-    ![Tabular data synthesizer](../assets/500x330/single_table.png){ align=right }
+    ![Tabular synthetic data generator](../assets/500x330/single_table.png){ align=right }
     The **RegularSynthesizer** is well suited to synthesizing high-dimensional, time-independent data with high-quality results.
 
     [Know more](#){ .md-button .md-button--ydata}
 
 === "Time-Series"
-    ![Timeseries Synthesizer](../assets/500x330/time_series.png){ align=left }
+    ![Time-series synthetic data generator](../assets/500x330/time_series.png){ align=left }
     The **TimeSeriesSynthesizer** is perfect to synthesize both regularly and not evenly spaced time-series, from smart-sensors to stock.
 
     [Know more](#){ .md-button .md-button--ydata}
 
 === "Transactional"
-    ![Transactional data synthesizer](../assets/500x330/time_series.png){ align=right }
+    ![Transactional synthetic data generator](../assets/500x330/time_series.png){ align=right }
     The **TimeSeriesSynthesizer** supports transactional data, known to have highly irregular time intervals between records and directional relations between entities.
 
     <span style="color:grey">*Coming soon*</span>
 
     [Know more](#){ .md-button .md-button--ydata}
 
 === "Relational databases"
-    ![Relational databases synthesizer](../assets/500x330/multi_table.png){ align=left }
+    ![Relational database synthetic data generator](../assets/500x330/multi_table.png){ align=left }
     The **MultiTableSynthesizer** is perfect to learn how to replicate the data within a relational database schema.
 
-    <span style="color:grey">*Coming soon*</span>
-
     [Know more](#){ .md-button .md-button--ydata}
diff --git a/docs/sdk/installation.md b/docs/sdk/installation.md
@@ -28,7 +28,7 @@ YData SDK offers a free-trial and an enterprise version. To access your free-tri
 
 The token will be available [**here**](https://fabric.ydata.ai), after login:
 
-![SDK Token](../assets/fabric_sdk_token.png){: style="height:450px;width:750px;align:center"}
+![SDK Token](../assets/sdk/fabric_sdk_token.png){: style="height:450px;width:750px;align:center"}
 
 With your account token copied, you can set a new environment variable `YDATA_TOKEN` at the beginning of your development session.
 

diff --git a/examples/synthesizers/multi_table_quickstart.py b/examples/synthesizers/multi_table_quickstart.py
@@ -3,23 +3,24 @@
 from ydata.sdk.datasources import DataSource
 from ydata.sdk.synthesizers import MultiTableSynthesizer
 
-# Do not forget to add your token as env variables
+# Authenticate to Fabric to leverage the SDK - https://docs.sdk.ydata.ai/latest/sdk/installation/
+# Make sure to add your token as an environment variable.
 os.environ["YDATA_TOKEN"] = '<TOKEN>'  # Remove if already defined
 
-# In this example, we demonstrate how to train a synthesizer from an existing multi table RDBMS datasource.
-# After training a Multi Table Synthesizer, we request a sample.
-# In this case, we don't return the Dataset for the sample, it will be saved in the database
-# that the connector refers to.
-
+# In this example, we demonstrate how to train a synthesizer from an existing RDBMS Dataset.
+# Make sure to follow the step-by-step guide to create a Dataset in Fabric's catalog: https://docs.sdk.ydata.ai/latest/get-started/create_multitable_dataset/
 X = DataSource.get('<DATASOURCE_UID>')
 
-# Initialize a multi table synthesizer with the connector to write to
-# As long as the synthesizer does not call `fit`, it exists only locally
-# write_connector can be an UID or a Connector instance
+# Init a multi-table synthesizer. Provide a connector so that the data synthesis process writes the
+# synthetic data into the destination database.
+# Provide a connector ID as the write_connector argument. See the accompanying tutorial for how to get a connector ID.
 synth = MultiTableSynthesizer(write_connector='<CONNECTOR_UID>')
 
-# The synthesizer training is requested
+# Start the training of your synthetic data generator
 synth.fit(X)
 
-# We request a synthetic dataset with a fracion of 1.5
-synth.sample(frac=1.5)
+# As soon as the training process is completed, you can sample a synthetic database.
+# The expected input is a fraction of the original database size.
+# Here we request a synthetic database with the same size as the original.
+# The synthetic sample is written to the database provided in the write_connector.
+synth.sample(frac=1.)
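The quickstart sets `YDATA_TOKEN` inline and notes that the line should be removed if the variable is already defined. As a hedged alternative, the sketch below reads the token from the environment and fails early when it is missing or still the `<TOKEN>` placeholder. The helper name `read_fabric_token` is hypothetical and is not part of the ydata-sdk API.

```python
import os
from typing import Mapping, Optional


def read_fabric_token(env: Optional[Mapping[str, str]] = None,
                      name: str = "YDATA_TOKEN") -> str:
    """Return the token stored under `name`, or raise if it is unusable.

    Hypothetical helper: `env` defaults to os.environ and can be replaced
    by any mapping, which makes the check easy to exercise in isolation.
    """
    source = os.environ if env is None else env
    token = source.get(name, "").strip()
    # Reject both a missing value and the unedited placeholder from the example
    if not token or token == "<TOKEN>":
        raise RuntimeError(f"{name} is not set to a valid token")
    return token


# Usage with an explicit mapping (placeholder value, not a real token)
token = read_fabric_token({"YDATA_TOKEN": "example-token"})
print(token)
```

Reading the value instead of assigning it avoids committing a real token to source control.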
diff --git a/...zers/multi_table_sample_write_override.py → ...thesizers/multi_table_sample_connector.py b/...zers/multi_table_sample_write_override.py → ...thesizers/multi_table_sample_connector.py
diff --git a/mkdocs.yml b/mkdocs.yml
@@ -10,6 +10,7 @@ nav:
       - "get-started/index.md"
       - Quickstart:
           - How to create your first Dataset from a CSV file: "get-started/upload_csv.md"
+          - How to create your first Relational database in Fabric's Catalog: "get-started/create_multitable_dataset.md"
           - How to create your first Synthetic Data generator: "get-started/create_syntheticdata_generator.md"
           - How to create your first Lab: "get-started/create_lab.md"
           - How to create your first Pipeline: "get-started/create_pipeline.md"
@@ -21,12 +22,12 @@ nav:
       - Components:
           - "sdk/modules/connectors.md"
       - Examples:
-        - Generate Tabular Data: "examples/synthesize_tabular_data.md"
-        - Generate Time-Series Data: "examples/synthesize_timeseries_data.md"
-        - Generate MultiTable Data: "examples/synthesizer_multitable.md"
-        - Anonymization: "examples/synthesize_with_anonymization.md"
-        - Privacy Level: "examples/synthesize_with_privacy_control.md"
-        - Conditional Sampling: "examples/synthesize_with_conditional_sampling.md"
+        - Generate Tabular Data: "sdk/examples/synthesize_tabular_data.md"
+        - Generate Time-Series Data: "sdk/examples/synthesize_timeseries_data.md"
+        - Generate MultiTable Data: "sdk/examples/synthesizer_multitable.md"
+        - Anonymization: "sdk/examples/synthesize_with_anonymization.md"
+        - Privacy Level: "sdk/examples/synthesize_with_privacy_control.md"
+        - Conditional Sampling: "sdk/examples/synthesize_with_conditional_sampling.md"
       - Reference:
           - Changelog: 'sdk/reference/changelog.md'
           - API: