docs: add databricks integration documentation #100

Merged
merged 4 commits into from
Jun 12, 2024
Changes from 2 commits
Binary file added docs/assets/integrations/Databricks diagram.png
9 changes: 0 additions & 9 deletions docs/data_catalog/connectors/integration_databricks.md

This file was deleted.

163 changes: 163 additions & 0 deletions docs/integrations/databricks/integration_connectors_catalog.md
@@ -0,0 +1,163 @@
# Connectors & Catalog

^^[YData Fabric](https://ydata.ai/products/fabric)^^ provides a seamless integration with Databricks, allowing you to connect,
query, and manage your data in Databricks Unity Catalog and Delta Lake with ease. This section will guide you through the benefits,
setup, and usage of the Databricks connectors available in Fabric.

!!! note "Prerequisites"
    Before using the Databricks connectors in YData Fabric, ensure the following prerequisites are met:

    - Access to a Databricks workspace
    - A valid YData Fabric account and API key
    - Credentials for Databricks (token, Databricks host, warehouse, database, schema, etc.).

## Delta Lake

Databricks Delta Lake is an open-source storage layer that brings reliability to data lakes. Built on top of Apache Spark,
Delta Lake provides ACID (Atomicity, Consistency, Isolation, Durability) transaction guarantees,
scalable metadata handling, and unifies streaming and batch data processing.

This tutorial covers how you can leverage ^^[YData Fabric connectors](../../data_catalog/connectors/supported_connections.md)^^
to integrate with Databricks Delta Lake.

### Setting Up the Delta Lake Connector

To create a Delta Lake connector in the YData Fabric UI you need to meet the ^^[following prerequisites](overview.md)^^.

#### Step-by-step creation through the UI
To create a connector in YData Fabric, select the *"Connectors"* page from the left side menu, as illustrated in the image below.

![Select Connectors from Homepage](../../assets/data_catalog/connectors/go_to_connector.png){: style="width:75%"}

Now, click the *"Create Connector"* button and the following menu with the available connectors will be shown.

![Select Databricks Delta Lake connector](../../assets/integrations/select_delta_lake_connector.webp){: style="width:50%"}

Depending on the cloud vendor where your Databricks instance is deployed, select the Delta Lake connector for AWS or Azure.
After selecting the connector type *"Databricks Delta Lake"*, the menu below will be shown.
This is where you can configure the connection to your Delta Lake. For that you will need the following information:

![Config Delta Lake connector](../../assets/integrations/Delta_lake_aws_inputs.webp){: style="width:45%; padding-right:10px", align=left}

- **Databricks Host:** The URL of your Databricks cluster
- **Access token:** Your Databricks' user token
- **Catalog:** The name of a Catalog that you want to connect to
- **Schema:** The name of the schema that you want to connect to

Depending on the cloud selected, you will be asked for the credentials to your staging storage (**AWS S3** or **Azure Blob Storage**).
In this example we are using AWS, and for that reason the inputs below refer to *AWS S3*.

- **Key ID:** The AWS access key ID used to access the S3 staging storage.
- **Key Secret:** The AWS secret access key used to access the S3 staging storage.

And finally, the name for your connector:

- **Display name:** A unique name for your connector.
<br/><br/>
Test your connection and that's it! 🚀

You are now ready to create different **Datasources** using this connector - read the data from a table,
evaluate the quality of the data or even read a full database and generate a synthetic replica of your data!
Read more about ^^[Fabric Datasources here](../datasources/index.md)^^.

### Use it inside the Labs

👨‍💻 ^^[Full code example and recipe can be found here](https://github.com/ydataai/academy/blob/master/1%20-%20Data%20Catalog/1.%20Connectors/Databricks%20_%20Delta%20Lake.ipynb)^^.

If you prefer a Python interface, connectors are also available through the Fabric SDK inside the Labs.
For a seamless integration between the UI and the Labs environment, Fabric offers an SDK that allows you to re-use connectors,
datasources and even synthesizers.

Start by creating your code environment through the Labs.
In case you need to get started with the Labs, ^^[check this step-by-step guide](../../get-started/create_lab.md)^^.

```python
# Importing YData's packages
from ydata.labs import Connectors
# Getting a previously created Connector
connector = Connectors.get(uid='insert-connector-id',
                           namespace='insert-namespace-id')
print(connector)
```

#### Read from your Delta Lake
Using the Delta Lake connector it is possible to (a code sketch follows the list below):

- Get the data from a Delta Lake table
- Get a sample from a Delta Lake table
- Get the data from a query to a Delta Lake instance
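
The snippet below is a minimal, illustrative sketch of these operations inside the Labs. The method names and parameters (`read_table`, `sample_size`, `query`) are assumptions modelled on the Unity Catalog examples further down this page; check the linked notebook recipe for the exact Delta Lake connector API.

```python title="Reading data with the Delta Lake connector (illustrative sketch)"
# NOTE: the method names and parameters below are assumptions for illustration;
# refer to the linked notebook recipe for the exact connector API.

# Read a full table from the configured catalog and schema
table = connector.read_table(table_name='insert-table-name')

# Read only a sample of the table (hypothetical sample_size parameter)
sample = connector.read_table(table_name='insert-table-name', sample_size=100)

# Run a query against the Delta Lake instance
result = connector.query("SELECT * FROM insert-table-name LIMIT 10")

print(table)
```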

## Unity Catalog
Databricks Unity Catalog is a unified governance solution for all data and AI assets within the Databricks Lakehouse Platform.

Databricks Unity Catalog leverages the concept of [Delta Sharing](https://www.databricks.com/product/delta-sharing),
meaning this is a great way not only to ensure alignment between Catalogs but also to limit the access to data.
This means that by leveraging the Unity Catalog connector, users can only access the set of data assets that were authorized
for a given Share.

### Step-by-step creation through the UI

:fontawesome-brands-youtube:{ .youtube } <a href="https://www.youtube.com/watch?v=_12AfMB8hiQ&t=2s"><u>How to create a connector to Databricks Unity Catalog in Fabric?</u></a>

The process of creating a new *Databricks Unity Catalog* connector in YData Fabric is similar to what we have
covered before.

After selecting the connector *"Databricks Unity Catalog"*, you will be requested to upload your Delta Sharing token as
depicted in the image below.

![Upload Delta Sharing token](../../assets/integrations/databricks_unity_catalog.webp){: style="width:50%"}

Test your connection and that's it! 🚀

### Use it inside the Labs

👨‍💻 ^^[Full code example and recipe can be found here](https://github.com/ydataai/academy/blob/master/1%20-%20Data%20Catalog/1.%20Connectors/Databricks%20_%20Unity%20Catalog.ipynb)^^.

If you prefer a Python interface, connectors are also available through the Fabric SDK inside the Labs.
Start by creating your code environment through the Labs. In case you need to get started with the Labs, ^^[check this step-by-step guide](../../get-started/create_lab.md)^^.

#### Navigate your Delta Share
With your connector created you are now able to explore the schemas and tables available in a Delta share.

```python title="List available shares"
# List the available shares for the provided authentication
connector.list_shares()
```

```python title="List available schemas"
# List the available schemas for a given share
connector.list_schemas(share_name='teste')
```

```python title="List available tables"
# List the available tables for a given schema in a share
connector.list_tables(schema_name='berka',
                      share_name='teste')

# List all the tables regardless of share and schema
connector.list_all_tables()
```

#### Read from your Delta Share
Using the Unity Catalog connector it is possible to:

- Get the data from a table in a Delta Share
- Get a sample from a table in a Delta Share

```python title="Read the data from a table"
# This method reads all the data records in the table
table = connector.read_table(table_name='insert-table-name',
                             schema_name='insert-schema-name',
                             share_name='insert-share-name')
print(table)
```

```python title="Read a data sample from a table"
# This method reads a sample of the data records in the table
table = connector.read_table(table_name='insert-table-name',
                             schema_name='insert-schema-name',
                             share_name='insert-share-name',
                             sample_size=100)
print(table)
```

I hope you enjoyed this quick tutorial on seamlessly integrating Databricks with your data preparation workflows. 🚀
202 changes: 202 additions & 0 deletions docs/integrations/databricks/integration_with_sdk.md
@@ -0,0 +1,202 @@
# YData SDK in Databricks Notebooks

The [YData Fabric SDK](https://pypi.org/project/ydata-sdk/) provides a powerful set of tools for integrating and enhancing data within Databricks notebooks.
This guide covers the installation, basic usage, and advanced features of the Fabric SDK, helping users maximize
the potential of their data for AI and machine learning applications.

👨‍💻 ^^[Full code example and recipe can be found here](https://raw.githubusercontent.com/ydataai/academy/master/5%20-%20Integrations/databricks/YData%20Fabric%20SDK%20in%20Databricks%20notebooks)^^.

!!! note "Prerequisites"
    Before using the YData Fabric SDK in Databricks notebooks, ensure the following prerequisites are met:

    - Access to a Databricks workspace
    - A valid YData Fabric account and API key
    - Basic knowledge of Python and Databricks notebooks
    - A secure connection between your Databricks cluster and Fabric

**Best Practices**

- *Data Security:* Ensure API keys and sensitive data are securely managed.
- *Efficient Coding:* Use vectorized operations for data manipulation where possible.
- *Resource Management:* Monitor and manage the resources used by your clusters (Databricks and Fabric) to optimize performance.

### Installation

To install the YData SDK in a Databricks notebook, use the following command:
```python
%pip install ydata-sdk
dbutils.library.restartPython()
```
Ensure the installation is successful before proceeding to the next steps.

## Basic Usage - data integration
This section provides step-by-step instructions on connecting to YData Fabric and performing essential
data operations using the YData SDK within Databricks notebooks. This includes establishing a secure connection
to YData Fabric and accessing datasets.

### Connecting to YData Fabric
First, establish a connection to YData Fabric using your API key:

```python
import os

# Add your Fabric token as part of your environment variables for authentication
os.environ["YDATA_TOKEN"] = '<TOKEN>'
```
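
In a Databricks notebook you will typically want to avoid hard-coding the token. A minimal sketch using Databricks secrets, assuming a secret scope named `ydata` with a key `fabric-token` (both names are examples you would create yourself):

```python
import os

# Read the Fabric token from a Databricks secret scope instead of hard-coding it
# (the scope and key names are examples; create your own via the Databricks CLI or UI)
os.environ["YDATA_TOKEN"] = dbutils.secrets.get(scope="ydata", key="fabric-token")
```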

### Data access & manipulation
Once connected, you can access and manipulate data within YData Fabric. For example, to list available datasets:

```python
from ydata.sdk.datasources import DataSource

# Return the list of available DataSources
DataSource.list()
```

To retrieve a specific datasource by its identifier:

```python
# Get the data from an existing datasource
dataset = DataSource.get('<DATASOURCE-ID>')
```

## Advanced Usage - Synthetic data generation

This section explores one of the most powerful features of the Fabric SDK for enhancing and refining data
within Databricks notebooks. This includes generating synthetic data to augment
datasets or to generate privacy-preserving data.
By leveraging these advanced capabilities, users can significantly enhance the robustness and performance of their AI
and machine learning models, unlocking the full potential of their data.

### Privacy-preserving
Leveraging synthetic data allows you to create privacy-preserving datasets that maintain real-world value,
enabling users to work with sensitive information securely while retaining the utility of the real data.

Check the SDK documentation for more information regarding [privacy-controls and anonymization](../../sdk/examples/synthesize_with_privacy_control.md).
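
As a short sketch of what that example covers, a privacy level can be passed when fitting the synthesizer. The `PrivacyLevel` usage below follows the linked SDK example; verify the enum values against your `ydata-sdk` version:

```python title="Fit a synthesizer with a privacy level (sketch)"
# Assumes the PrivacyLevel API shown in the linked privacy-control example
from ydata.sdk.synthesizers import PrivacyLevel, RegularSynthesizer

synth = RegularSynthesizer(name='<NAME-YOUR-MODEL>')
synth.fit(X=dataset, privacy_level=PrivacyLevel.HIGH_PRIVACY)
```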

#### From a datasource in YData Fabric
Users can generate synthetic data from datasources existing in Fabric:

```python title="Train a synthetic data generator"
# From an existing Fabric datasource
from ydata.sdk.synthesizers import RegularSynthesizer

synth = RegularSynthesizer(name='<NAME-YOUR-MODEL>')
synth.fit(X=dataset)
```

```python title="Sample from a Synthetic data generator"
# From an existing Fabric datasource
from ydata.sdk.synthesizers import RegularSynthesizer

synth = RegularSynthesizer(name='<NAME-YOUR-MODEL>')
synth.fit(X=dataset)
```
After your synthetic data generator has been trained successfully, you can generate as many synthetic datasets as needed.

```python title='Sampling from the model that we have just trained'
from ydata.sdk.synthesizers import RegularSynthesizer
sample = synth.sample(100)
sample.head()
```

It is also possible to generate data from other synthetic data generation models previously trained:

```python title='Generating synthetic data from a previously trained model'
from ydata.sdk.synthesizers import RegularSynthesizer

existing_synth = RegularSynthesizer('<INSERT-SYNTHETIC-DATA-GENERATOR-ID>').get()
sample = existing_synth.sample(100)
```

#### From a datasource in Databricks
Another important integration is to train a synthetic data generator from a dataset that you are currently exploring
in your notebook environment.
In order to do so, we recommend that you create your dataset using the
[YData Fabric integration connector to your Delta Lake](integration_connectors_catalog.md) and follow the flow for the creation
of a synthetic data generation model from existing Fabric datasources.

For a small dataset you can also follow [this tutorial](../../sdk/examples/synthesize_tabular_data.md).

### Data augmentation
Another key focus is on generating synthetic data to augment existing datasets.
This technique, particularly through conditional synthetic data generation, allows users to create targeted,
realistic datasets. By addressing data imbalances and enriching the training data, conditional synthetic data generation
significantly enhances the robustness and performance of machine learning (ML) models,
leading to more accurate and reliable outcomes.

```python title='Read data from a delta table'
# Read data from the catalog
df = spark.sql("SELECT * FROM ydata.default.credit_scoring_labeled")

# Display the dataframe
display(df)
```

After reading the data we need to convert it to a pandas DataFrame in order to create our synthetic data generation model.
For the augmentation use case we will be leveraging conditional synthetic data generation.

```python title='Training a conditional synthetic data generator'
from ydata.sdk.synthesizers import RegularSynthesizer

# Convert Spark dataframe to pandas dataframe
pandas_df = df.toPandas()
pandas_df = pandas_df.drop('ID', axis=1)

# Train a synthetic data generator using ydata-sdk
synth = RegularSynthesizer(name='Synth credit scoring | Conditional')
synth.fit(pandas_df, condition_on='Label')

# Display the trained synthesizer
display(synth)
```

Now that we have a trained conditional synthetic data generator, we are able to generate samples while controlling the
population behaviour based on the column that the generation process was conditioned on.

```python title="Generating a synthetic sample conditioned to column 'Label'"
# Generate synthetic samples conditioned on the column 'Label'
synthetic_sample = synth.sample(
    n_samples=len(pandas_df),
    condition_on={
        "Label": {
            "categories": [{
                "category": 1,
                "percentage": 0.7
            }]
        }
    }
)
```

After generating the synthetic data we can combine it with our dataset.

```python title='Convert the dataframe to Spark dataframe'
# Enable Arrow-based columnar data transfers
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

# Create a Spark dataframe from the synthetic dataframe
synthetic_df = spark.createDataFrame(synthetic_sample)

display(synthetic_df)
```

```python title="Combining the datasets"
# Concatenate the original dataframe with the synthetic dataframe,
# removing the column ID so that both schemas match (ID was also dropped before training)
df = df.drop('ID')
concatenated_df = df.union(synthetic_df)

# Display the concatenated dataframe
display(concatenated_df)
```

Afterwards you can use your augmented dataset to train a ^^[Machine Learning model using MLFlow](https://docs.databricks.com/en/mlflow/tracking-ex-scikit.html)^^.
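
As a rough sketch of that next step, assuming scikit-learn and MLflow are available on the cluster, that the features are numeric, and using the `Label` column from the credit scoring example as the target:

```python title='Train a model on the augmented data with MLflow tracking (sketch)'
import mlflow
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Automatically log parameters, metrics and the trained model
mlflow.autolog()

# Convert the augmented Spark dataframe to pandas for scikit-learn
augmented_pdf = concatenated_df.toPandas()
X = augmented_pdf.drop('Label', axis=1)
y = augmented_pdf['Label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

with mlflow.start_run():
    model = RandomForestClassifier().fit(X_train, y_train)
    print("Test accuracy:", model.score(X_test, y_test))
```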








