docs: Datasources and Snowflake integration example (#96)
* docs: add new information to Datasources and Snowflake integration example.

* fix(linting): code formatting

* docs: fix issue related with missing directory.

* docs: remove optimize

---------

Co-authored-by: Fabiana Clemente <[email protected]>
Co-authored-by: Azory YData Bot <[email protected]>
3 people authored May 24, 2024
1 parent e05e7b6 commit efe0956
Showing 10 changed files with 212 additions and 64 deletions.
9 changes: 9 additions & 0 deletions docs/data_catalog/connectors/integration_databricks.md
@@ -0,0 +1,9 @@
### Read data from your Delta Lake

### Write data from Fabric into your Delta Lake

## Databricks Unity Catalog

### Read data from a Unity catalog defined Delta Sharing area

Add here a few more notes.
133 changes: 133 additions & 0 deletions docs/data_catalog/connectors/integration_snowflake.md
@@ -0,0 +1,133 @@
# ❄️ Integrate Fabric with Snowflake - from Analytics to Machine Learning

YData Fabric provides a seamless integration with Snowflake, allowing you to connect,
query, and manage your data in Snowflake with ease. This section will guide you through the benefits,
setup, and usage of the Snowflake connector within YData Fabric.

### Benefits of Integration
Integrating YData Fabric with Snowflake offers several key benefits:

- **Scalability:** Snowflake's architecture scales effortlessly with your data needs, while YData Fabric's tools ensure efficient data integration and management.
- **Performance:** Leveraging Snowflake's high performance for data querying and YData Fabric's optimization techniques enhances overall data processing speed.
- **Security:** Snowflake's robust security features, combined with YData Fabric's data governance capabilities, ensure your data remains secure and compliant.
- **Interoperability:** YData Fabric simplifies the process of connecting to Snowflake, allowing you to quickly set up and start using the data without extensive configuration. Benefit from the unique Fabric functionalities like data preparation with Python, synthetic data generation and data profiling.

## Setting Up the Snowflake Connector

:fontawesome-brands-youtube:{ .youtube } <a href="https://youtube.com/clip/UgkxVTrEn2jY8GL-wqSXX3PByuUH5Q81Usih?si=xdpQ4eTCo_SEcvxp"><u>How to create a connector to Snowflake in Fabric?</u></a>

To create a Snowflake connector in the YData Fabric UI, you need to meet the following prerequisites and follow these steps:

!!! note "Prerequisites"
Before setting up the connector, ensure you have the following:

- A Snowflake account with appropriate access permissions.
- YData Fabric installed and running in your environment.
- Credentials for Snowflake (username, password, account identifier, warehouse, database, schema).

### Step-by-step creation through the UI

To create a connector in YData Fabric, select the *"Connectors"* page from the left side menu, as illustrated in the image below.

![Select Connectors from Homepage](../../assets/data_catalog/connectors/go_to_connector.png){: style="width:75%"}

Now, click the *"Create Connector"* button, and the following menu with the available connectors will be shown.

![Select Snowflake connector](../../assets/data_catalog/connectors/select_snowflake_connector.png){: style="width:50%"}

After selecting the *"Snowflake"* connector type, the menu below will be shown. This is where you can configure the connection to your Snowflake instance. For that, you will need the following information:

![Config Snowflake connector](../../assets/data_catalog/connectors/snowflake_config.png){: style="width:45%; padding-right:10px", align=left}

- **Username:** Your Snowflake username.
- **Password:** Your Snowflake password.
- **Host/Account Identifier:** Your Snowflake account identifier (e.g., xy12345.us-east-1).
- **Port:** The Snowflake port number.
- **Database:** The Snowflake database to connect to.
- **Schema:** The schema within the database.
- **Warehouse:** The Snowflake warehouse to use.
- **Display Name:** A unique name for your connector.
<br/><br/><br/><br/><br/>
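
Before saving the connector, you may want to sanity-check these values outside Fabric. Below is a minimal sketch using the official `snowflake-connector-python` package (the package choice and the placeholder credentials are assumptions for illustration; they are not part of the Fabric setup itself):

```python
import snowflake.connector

# Placeholder values - use the same credentials you enter in the form above
conn = snowflake.connector.connect(
    user="MY_USER",
    password="my-password",
    account="xy12345.us-east-1",  # account identifier
    warehouse="MY_WAREHOUSE",
    database="MY_DATABASE",
    schema="MY_SCHEMA",
)

# A trivial query confirms that the credentials and warehouse are valid
cur = conn.cursor()
cur.execute("SELECT CURRENT_VERSION()")
print(cur.fetchone())
conn.close()
```

If this query succeeds, the same values should work in the connector configuration form.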

Test your connection and that's it! 🚀

You are now ready to create different **Datasources** using this connector: read the data from a query, evaluate the quality of the data from a table, or even
read a full database and generate a synthetic replica of your data!
Read more about ^^[Fabric Datasources here](../datasources/index.md)^^.

### Use it inside the Labs

👨‍💻 ^^[Full code example and recipe can be found here](https://github.com/ydataai/academy/blob/master/1%20-%20Data%20Catalog/1.%20Connectors/Snowflake.ipynb)^^.

In case you prefer a Python interface, connectors are also available through the Fabric SDK inside the Labs.
For a seamless integration between the UI and the Labs environment, Fabric offers an SDK that allows you to re-use connectors, datasources and even synthesizers.

Start by creating your code environment through the Labs. In case you need to get started with the Labs, ^^[check this step-by-step guide](../../get-started/create_lab.md)^^.

```python
# Importing YData's packages
from ydata.labs import Connectors
# Getting a previously created Connector
connector = Connectors.get(uid='insert-connector-id',
                           namespace='insert-namespace-id')
print(connector)
```

#### Navigate your database
Use the connector to explore the structure of your database: list the available schemas and retrieve the metadata of a given schema, including its tables, columns, and the relations between tables (primary and foreign keys).

```python title="List available schemas and get the metadata of a given schema"
# returns a list of schemas
schemas = connector.list_schemas()

# get the metadata of a database schema, including columns and relations between tables (PK and FK)
schema = connector.get_database_schema('PATIENTS')
```
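
A typical exploration flow chains these calls with the read methods from the next section, for example listing what is available before sampling a table. The sketch below reuses only the methods and the `cardio_test` example shown on this page:

```python
# Explore the available schemas first
schemas = connector.list_schemas()
print(schemas)

# Then pull a small sample before committing to a full table read
table_sample = connector.get_table_sample(table='cardio_test', sample_size=10)
print(table_sample)
```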

#### Read from a Snowflake instance
Using the Snowflake connector, it is possible to:

- Get the data from a Snowflake table
- Get a sample from a Snowflake table
- Get the data from a query to a Snowflake instance
- Get the full data from a selected database (see the sketch after the query example below)

```python title="Read full and a sample from a table"
# returns the whole data from a given table
table = connector.get_table('cardio_test')
print(table)

# Get a sample with n rows from a given table
table_sample = connector.get_table_sample(table='cardio_test', sample_size=50)
print(table_sample)
```

```python title="Get the data from a query"
# returns the result of the given query
query_output = connector.query('SELECT * FROM patients.cardio_test;')
print(query_output)
```
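
The fourth option, reading the full data from a selected database, follows the same pattern. The method name below is an assumption for illustration (it is not shown on this page), so verify the exact call against your Fabric SDK reference:

```python
# Hypothetical method name - verify against your Fabric SDK version
database = connector.read_database()
print(database)
```

The resulting object can then be passed to the **write_database** method shown further below.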

#### Write to a Snowflake instance
If you need to write your data into a Snowflake instance you can also leverage your Snowflake connector for the following actions:

- Write the data into a table
- Write a new database schema

The **if_exists** parameter allows you to decide what happens when a table with the same name already exists in the schema: **append** to it, **replace** it, or **fail**.

```python title='Writing a dataset to a table in a Snowflake schema'
# 'tables' is assumed to hold previously read data, e.g. from a full database read
connector.write_table(data=tables['cardio_test'],
                      name='cardio',
                      if_exists='fail')
```

The **table_names** parameter allows you to define a new name for each table in the database. If not provided, the table names from your dataset are used.
```python title='Writing a full database to a Snowflake schema'
connector.write_database(data=database,
                         schema_name='new_cardio',
                         table_names={'cardio_test': 'cardio'})
```

I hope you enjoyed this quick tutorial on seamlessly integrating Snowflake with your data preparation workflows. ❄️🚀
28 changes: 15 additions & 13 deletions docs/data_catalog/connectors/supported_connections.md
@@ -6,19 +6,21 @@ Fabric can read and write data from a variety of data sources.

Here is the list of the available connectors in Fabric.

| Connector Name           | Type           | Supported file types | Notes                                                                                                       |
|:-------------------------|:--------------:|---------------------:|:------------------------------------------------------------------------------------------------------------|
| AWS S3                   | Object Storage | `Parquet` `CSV`      |                                                                                                             |
| Azure Blob Storage       | Object Storage | `Parquet` `CSV`      |                                                                                                             |
| Azure Data Lake          | Object Storage | `Parquet` `CSV`      |                                                                                                             |
| Google Cloud Storage     | Object Storage | `Parquet` `CSV`      |                                                                                                             |
| Upload file              | File           | `Parquet` `CSV`      | Maximum file size is 700MB. <br/>Bigger files should be uploaded and read from <br/>remote object storages  |
| Google BigQuery          | Big Table      | `Not applicable`     |                                                                                                             |
| MySQL                    | RDBMS          | `Not applicable`     | Supports reading whole schemas or specifying a query                                                        |
| Azure SQL Server         | RDBMS          | `Not applicable`     | Supports reading whole schemas or specifying a query                                                        |
| PostgreSQL               | RDBMS          | `Not applicable`     | Supports reading whole schemas or specifying a query                                                        |
| Snowflake                | RDBMS          | `Not applicable`     | Supports reading whole schemas or specifying a query                                                        |
| Oracle DB                | RDBMS          | `Not applicable`     | Supports reading whole schemas or specifying a query                                                        |
| Databricks Unity Catalog | Catalog        | `Not applicable`     | Supports reading a table                                                                                    |
| Databricks Delta Lake    | Lakehouse      | `Not applicable`     | Supports reading a table                                                                                    |

## Haven't found your storage?

3 changes: 3 additions & 0 deletions docs/data_catalog/connectors/use_in_labs.md
@@ -0,0 +1,3 @@
# Use connectors in Lab

## Create a lab environment
48 changes: 20 additions & 28 deletions docs/data_catalog/datasources/index.md
@@ -1,36 +1,28 @@
# Overview

YData Fabric Datasources are entities that represent specific data sets such as tables,
file sets, or other structured formats within the YData Fabric platform.
They offer a centralized framework for managing, cataloging, and profiling data,
enhancing data management and quality.

## Benefits

- **Summarized metadata information:** Fabric Datasources provide comprehensive metadata management, offering detailed
information about each datasource, including schema details, descriptions, tags, and data lineage.
This metadata helps users understand the structure and context of their data.

- **Data Quality Management:** Users can find data quality warnings, validation results, cleansing suggestions, and quality scores.
These features help in identifying and addressing data quality issues automatically, ensuring reliable data
for analysis and decision-making.

- **Data Profiling:** Data profiling tools analyze the content and structure of datasources, providing statistical summaries,
detecting patterns, assessing completeness, and evaluating data uniqueness. These insights help in understanding
and improving data quality.

- **PII Identification and Management:** Fabric detects and manages Personally Identifiable Information (PII) within datasources.
It includes automatic PII detection, masking tools, and compliance reporting to protect sensitive data and
ensure regulatory compliance.


- **Centralized Repository:** Fabric Datasources serve as a centralized repository for data quality discovery and management.
They provide a single point of access for all data assets, simplifying discovery, monitoring, and governance,
and improving overall data management efficiency.
12 changes: 7 additions & 5 deletions docs/data_catalog/datasources/pii.md
@@ -6,20 +6,22 @@ To overcome the concerns around data privacy and enable secure data sharing, Fab
Fabric offers a **standardized classification of PII** that automatically highlights and tags potential PII. The automatic detection of PII can be enabled **during the loading process** of your datasets and can be leveraged to generate **privacy-preserving synthetic data**.

<figure markdown>
![PII Detection](../../assets/data_catalog/pii_detection.png){: style="height:550px;width:600px"}
</figure>

After the detection, the PII information will be available through **Metadata > PII Types**, where each column that may represent potential PII is *associated with one or several tags that identify the type of information it might be leaking*.

<figure markdown>
![PII Overview](../../assets/data_catalog/pii_overview.png){: style="height:430px;width:1000px"}
</figure>

You can **review the automatic PII classification and add additional PII tags** of your own by editing the metadata and selecting additional tags from a **pre-defined list of values** containing the most common types of potential PII information: email, phone, VAT, zip code, among others.

<figure markdown>
![PII Editing](../../assets/data_catalog/pii_editing.png){: style="height:600px;width:1000px"}
</figure>

???+ question "Need a solution to enable data sharing and comply with **GDPR** and **CCPA** regulations?"
Using ^^[synthetic data](https://ydata.ai/products/synthetic_data)^^ has proven to foster a culture of data-sharing
within organizations, overcoming the limitations of traditional privacy methods and maximizing data value.
Try ^^[Fabric Community Version](https://ydata.ai/ydata-fabric-free-trial)^^ to enable secure data sharing.
