docs: Datasources and Snowflake integration example (#96)
* docs: add new information to Datasources and Snowflake integration example.

* fix(linting): code formatting

* docs: fix issue related with missing directory.

* docs: remove optimize

---------

Co-authored-by: Fabiana Clemente <[email protected]>
Co-authored-by: Azory YData Bot <[email protected]>
3 people authored May 24, 2024
1 parent e05e7b6 commit efe0956
Showing 10 changed files with 212 additions and 64 deletions.
9 changes: 9 additions & 0 deletions docs/data_catalog/connectors/integration_databricks.md
@@ -0,0 +1,9 @@
### Read data from your Delta Lake

### Write data from Fabric into your Delta Lake

## Databricks Unity Catalog

### Read data from a Unity catalog defined Delta Sharing area

Add here a few more notes.
133 changes: 133 additions & 0 deletions docs/data_catalog/connectors/integration_snowflake.md
@@ -0,0 +1,133 @@
# ❄️ Integrate Fabric with Snowflake - from Analytics to Machine Learning

YData Fabric provides a seamless integration with Snowflake, allowing you to connect,
query, and manage your data in Snowflake with ease. This section will guide you through the benefits,
setup, and usage of the Snowflake connector within YData Fabric.

### Benefits of Integration
Integrating YData Fabric with Snowflake offers several key benefits:

- **Scalability:** Snowflake's architecture scales effortlessly with your data needs, while YData Fabric's tools ensure efficient data integration and management.
- **Performance:** Leveraging Snowflake's high performance for data querying and YData Fabric's optimization techniques enhances overall data processing speed.
- **Security:** Snowflake's robust security features, combined with YData Fabric's data governance capabilities, ensure your data remains secure and compliant.
- **Interoperability:** YData Fabric simplifies the process of connecting to Snowflake, allowing you to quickly set up and start using the data without extensive configuration. Benefit from the unique Fabric functionalities like data preparation with Python, synthetic data generation and data profiling.

## Setting Up the Snowflake Connector

:fontawesome-brands-youtube:{ .youtube } <a href="https://youtube.com/clip/UgkxVTrEn2jY8GL-wqSXX3PByuUH5Q81Usih?si=xdpQ4eTCo_SEcvxp"><u>How to create a connector to Snowflake in Fabric?</u></a>

To create a Snowflake connector in the YData Fabric UI, you need to meet the following prerequisites and follow these steps:

!!! note "Prerequisites"
Before setting up the connector, ensure you have the following:

- A Snowflake account with appropriate access permissions.
- YData Fabric installed and running in your environment.
- Credentials for Snowflake (username, password, account identifier, warehouse, database, schema).

### Step-by-step creation through the UI

To create a connector in YData Fabric, select the *"Connectors"* page from the left side menu, as illustrated in the image below.

![Select Connectors from Homepage](../../assets/data_catalog/connectors/go_to_connector.png){: style="width:75%"}

Now, click the *"Create Connector"* button, and the following menu with the available connectors will be shown.

![Select Snowflake connector](../../assets/data_catalog/connectors/select_snowflake_connector.png){: style="width:50%"}

After selecting the *"Snowflake"* connector type, the menu below will be shown. This is where you can configure the connection to your Snowflake instance. For that, you will need the following information:

![Config Snowflake connector](../../assets/data_catalog/connectors/snowflake_config.png){: style="width:45%; padding-right:10px", align=left}

- **Username:** Your Snowflake username.
- **Password:** Your Snowflake password.
- **Host/Account Identifier:** Your Snowflake account identifier (e.g., xy12345.us-east-1).
- **Port:** The Snowflake port number.
- **Database:** The Snowflake database to connect to.
- **Schema:** The schema within the database.
- **Warehouse:** The Snowflake warehouse to use.
- **Display Name:** A unique name for your connector.
<br/><br/><br/><br/><br/>
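
Before saving the connector, you may want to sanity-check these values outside Fabric. Below is a minimal sketch using the official `snowflake-connector-python` package (the package choice and the placeholder credentials are assumptions for illustration; they are not part of the Fabric setup itself):

```python
import snowflake.connector

# Placeholder values - use the same credentials you enter in the form above
conn = snowflake.connector.connect(
    user="MY_USER",
    password="my-password",
    account="xy12345.us-east-1",  # account identifier
    warehouse="MY_WAREHOUSE",
    database="MY_DATABASE",
    schema="MY_SCHEMA",
)

# A trivial query confirms that the credentials and warehouse are valid
cur = conn.cursor()
cur.execute("SELECT CURRENT_VERSION()")
print(cur.fetchone())
conn.close()
```

If this query succeeds, the same values should work in the connector configuration form.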

Test your connection and that's it! 🚀

You are now ready to create different **Datasources** using this connector: read the data from a query, evaluate the quality of the data from a table, or even
read a full database and generate a synthetic replica of your data!
Read more about ^^[Fabric Datasources here](../datasources/index.md)^^.

### Use it inside the Labs

👨‍💻 ^^[Full code example and recipe can be found here](https://github.com/ydataai/academy/blob/master/1%20-%20Data%20Catalog/1.%20Connectors/Snowflake.ipynb)^^.

In case you prefer a Python interface, connectors are also available through the Fabric SDK inside the Labs.
For a seamless integration between the UI and the Labs environment, Fabric offers an SDK that allows you to re-use connectors, datasources and even synthesizers.

Start by creating your code environment through the Labs. In case you need to get started with the Labs, ^^[check this step-by-step guide](../../get-started/create_lab.md)^^.

```python
# Importing YData's packages
from ydata.labs import Connectors
# Getting a previously created Connector
connector = Connectors.get(uid='insert-connector-id',
                           namespace='insert-namespace-id')
print(connector)
```

#### Navigate your database
Use the connector to explore the structure of your database: list the available schemas and retrieve the metadata of a given schema, including its tables, columns, and the relations between tables (primary and foreign keys).

```python title="List available schemas and get the metadata of a given schema"
# returns a list of schemas
schemas = connector.list_schemas()

# get the metadata of a database schema, including columns and relations between tables (PK and FK)
schema = connector.get_database_schema('PATIENTS')
```
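
A typical exploration flow chains these calls with the read methods from the next section, for example listing what is available before sampling a table. The sketch below reuses only the methods and the `cardio_test` example shown on this page:

```python
# Explore the available schemas first
schemas = connector.list_schemas()
print(schemas)

# Then pull a small sample before committing to a full table read
table_sample = connector.get_table_sample(table='cardio_test', sample_size=10)
print(table_sample)
```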

#### Read from a Snowflake instance
Using the Snowflake connector, it is possible to:

- Get the data from a Snowflake table
- Get a sample from a Snowflake table
- Get the data from a query to a Snowflake instance
- Get the full data from a selected database (see the sketch after the query example below)

```python title="Read full and a sample from a table"
# returns the whole data from a given table
table = connector.get_table('cardio_test')
print(table)

# Get a sample with n rows from a given table
table_sample = connector.get_table_sample(table='cardio_test', sample_size=50)
print(table_sample)
```

```python title="Get the data from a query"
# returns the result of the given query
query_output = connector.query('SELECT * FROM patients.cardio_test;')
print(query_output)
```
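
The fourth option, reading the full data from a selected database, follows the same pattern. The method name below is an assumption for illustration (it is not shown on this page), so verify the exact call against your Fabric SDK reference:

```python
# Hypothetical method name - verify against your Fabric SDK version
database = connector.read_database()
print(database)
```

The resulting object can then be passed to the **write_database** method shown further below.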

#### Write to a Snowflake instance
If you need to write your data into a Snowflake instance you can also leverage your Snowflake connector for the following actions:

- Write the data into a table
- Write a new database schema

The **if_exists** parameter allows you to decide what happens when a table with the same name already exists in the schema: **append** to it, **replace** it, or **fail**.

```python title='Writing a dataset to a table in a Snowflake schema'
# 'tables' is assumed to hold previously read data, e.g. from a full database read
connector.write_table(data=tables['cardio_test'],
                      name='cardio',
                      if_exists='fail')
```

The **table_names** parameter allows you to define a new name for each table in the database. If not provided, the table names from your dataset are used.
```python title='Writing a full database to a Snowflake schema'
connector.write_database(data=database,
                         schema_name='new_cardio',
                         table_names={'cardio_test': 'cardio'})
```

I hope you enjoyed this quick tutorial on seamlessly integrating Snowflake with your data preparation workflows. ❄️🚀
28 changes: 15 additions & 13 deletions docs/data_catalog/connectors/supported_connections.md
@@ -6,19 +6,21 @@ Fabric can read and write data from a variety of data sources.

Here is the list of the available connectors in Fabric.

| Connector Name           | Type           | Supported file types | Notes                                                                                                       |
|:-------------------------|:--------------:|---------------------:|:------------------------------------------------------------------------------------------------------------|
| AWS S3                   | Object Storage | `Parquet` `CSV`      |                                                                                                             |
| Azure Blob Storage       | Object Storage | `Parquet` `CSV`      |                                                                                                             |
| Azure Data Lake          | Object Storage | `Parquet` `CSV`      |                                                                                                             |
| Google Cloud Storage     | Object Storage | `Parquet` `CSV`      |                                                                                                             |
| Upload file              | File           | `Parquet` `CSV`      | Maximum file size is 700MB. <br/>Bigger files should be uploaded and read from <br/>remote object storages  |
| Google BigQuery          | Big Table      | `Not applicable`     |                                                                                                             |
| MySQL                    | RDBMS          | `Not applicable`     | Supports reading whole schemas or specifying a query                                                        |
| Azure SQL Server         | RDBMS          | `Not applicable`     | Supports reading whole schemas or specifying a query                                                        |
| PostgreSQL               | RDBMS          | `Not applicable`     | Supports reading whole schemas or specifying a query                                                        |
| Snowflake                | RDBMS          | `Not applicable`     | Supports reading whole schemas or specifying a query                                                        |
| Oracle DB                | RDBMS          | `Not applicable`     | Supports reading whole schemas or specifying a query                                                        |
| Databricks Unity Catalog | Catalog        | `Not applicable`     | Supports reading a table                                                                                    |
| Databricks Delta Lake    | Lakehouse      | `Not applicable`     | Supports reading a table                                                                                    |

## Haven't found your storage?

3 changes: 3 additions & 0 deletions docs/data_catalog/connectors/use_in_labs.md
@@ -0,0 +1,3 @@
# Use connectors in Lab

## Create a lab environment
48 changes: 20 additions & 28 deletions docs/data_catalog/datasources/index.md
@@ -1,36 +1,28 @@
# Overview

YData Fabric Datasources are entities that represent specific data sets such as tables,
file sets, or other structured formats within the YData Fabric platform.
They offer a centralized framework for managing, cataloging, and profiling data,
enhancing data management and quality.

## Benefits

- **Summarized metadata information:** Fabric Datasources provide comprehensive metadata management, offering detailed
information about each datasource, including schema details, descriptions, tags, and data lineage.
This metadata helps users understand the structure and context of their data.

- **Data Quality Management:** Users can find data quality warnings, validation results, cleansing suggestions, and quality scores.
These features help in identifying and addressing data quality issues automatically, ensuring reliable data
for analysis and decision-making.

- **Data Profiling:** Data profiling tools analyze the content and structure of datasources, providing statistical summaries,
detecting patterns, assessing completeness, and evaluating data uniqueness. These insights help in understanding
and improving data quality.

- **PII Identification and Management:** Fabric detects and manages Personally Identifiable Information (PII) within datasources.
It includes automatic PII detection, masking tools, and compliance reporting to protect sensitive data and
ensure regulatory compliance.


- **Centralized Repository:** Fabric Datasources serve as a centralized repository for data quality discovery and management.
They provide a single point of access for all data assets, simplifying discovery, monitoring, and governance,
and improving overall data management efficiency.
12 changes: 7 additions & 5 deletions docs/data_catalog/datasources/pii.md
@@ -6,20 +6,22 @@ To overcome the concerns around data privacy and enable secure data sharing, Fab
Fabric offers a **standardized classification of PII** that automatically highlights and tags potential PII. The automatic detection of PII can be enabled **during the loading process** of your datasets and can be leveraged to generate **privacy-preserving synthetic data**.

<figure markdown>
![PII Detection](../../assets/data_catalog/pii_detection.png){: style="height:550px;width:600px"}
</figure>

After the detection, the PII information will be available through **Metadata > PII Types**, where each column that may represent potential PII is *associated with one or several tags that identify the type of information it might be leaking*.

<figure markdown>
![PII Overview](../../assets/data_catalog/pii_overview.png){: style="height:430px;width:1000px"}
</figure>

You can **review the automatic PII classification and add additional PII tags** of your own by editing the metadata and selecting additional tags from a **pre-defined list of values** containing the most common types of potential PII information: email, phone, VAT, zip code, among others.

<figure markdown>
![PII Editing](../../assets/data_catalog/pii_editing.png){: style="height:600px;width:1000px"}
</figure>

???+ question "Need a solution to enable data sharing and comply with **GDPR** and **CCPA** regulations?"
Using ^^[synthetic data](https://ydata.ai/products/synthetic_data)^^ has proven to foster a culture of data-sharing
within organizations, overcoming the limitations of traditional privacy methods and maximizing data value.
Try ^^[Fabric Community Version](https://ydata.ai/ydata-fabric-free-trial)^^ to enable secure data sharing.
