docs: synthetic data, labs and pipelines general documentation #72

Merged
merged 8 commits into from
Jun 12, 2024
Binary file added docs/assets/labs/cloning_jupyterlab.webp
Binary file not shown.
Binary file added docs/assets/labs/cloning_repo_vscode.webp
Binary file not shown.
Binary file added docs/assets/labs/git_integration_vscode.webp
Binary file not shown.
Binary file added docs/assets/labs/jupyterlab-git.gif
Binary file not shown.
Binary file added docs/assets/labs/jupyterlab_git_extension.webp
Binary file not shown.
Binary file added docs/assets/labs/welcome_labs_creation.webp
Binary file not shown.
22 changes: 22 additions & 0 deletions docs/labs/index.md
@@ -0,0 +1,22 @@
# Fabric coding environment

^^[**YData Fabric Labs**](https://ydata.ai/products/fabric)^^ are on-demand, cloud-based data development environments with automatically provisioned hardware (multiple infrastructure configurations,
including GPUs, are possible) and **full platform integration** via a Python interface (allowing access to Data Sources, Synthesizers,
and the Workspace’s shared files).

With Labs, you can create environments with support for familiar IDEs such as [**Visual Studio Code**](https://code.visualstudio.com/), [**Jupyter Lab**](https://jupyterlab.readthedocs.io/en/stable/)
and [**H2O Flow**](https://docs.h2o.ai/h2o/latest-stable/h2o-docs/flow.html), covering both Python and R.

For Python specifically, pre-configured bundles including TensorFlow, PyTorch and/or the most popular data science libraries
are also available, jumpstarting data development. Additional libraries can be easily installed with a simple *!pip install*.
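
For example, in a notebook cell inside a Lab (the package names below are purely illustrative; any library available on PyPI works):

```python
# Run inside a notebook cell; the leading "!" executes the command in the Lab's shell
!pip install catboost            # illustrative package name; replace with what you need
!pip install "polars==1.5.0"     # versions can be pinned as usual
```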

<p align="center"><img src="assets/labs/welcome_labs_creation.webp" alt="Welcome Labs" width="900"></p>

## Get started with your first lab

🧪 Follow this [step-by-step guided tutorial to create your first Lab](../get-started/create_lab.md).

## Tutorials & recipes

Leverage YData's extensive collection of ^^[tutorials and recipes that you can find in YData Academy](https://github.com/ydataai/academy)^^. Quickstart or accelerate your data development
with recipes and tutorial use cases.
101 changes: 101 additions & 0 deletions docs/labs/overview.md
@@ -0,0 +1,101 @@
# Overview

Labs exist for Data practitioners to tackle more complex use cases through a familiar environment supercharged with infrastructure,
integration with other Fabric modules and access to advanced synthesis and profiling technology via a familiar Python interface.

It is the preferred environment for Data practitioners to express their domain expertise with all the required tools,
technology and computational power at their fingertips. It is thus the natural continuation of the data understanding work that
started in Data Sources.

## Supported IDEs and images

### IDEs
YData Fabric supports integration with various Integrated Development Environments (IDEs) to enhance productivity and streamline workflows.
The supported IDEs include:

- **Visual Studio Code (VS Code):** A highly versatile and widely-used code editor that offers robust support for numerous programming languages
and frameworks. Its integration with Git and extensions like GitLens makes it ideal for version control and collaborative development.
- **Jupyter Lab:** An interactive development environment that allows for notebook-based data science and machine learning workflows.
It supports seamless Git integration through extensions and offers a user-friendly interface for managing code, data, and visualizations.
- **H2O Flow:** A web-based interface specifically designed for machine learning and data analysis with the H2O platform.
It provides a flow-based, interactive environment for building and deploying machine learning models.

### Labs images
In the Labs environment, users have access to the following default images, tailored to different computational needs:

#### Python
All the images below support Python as the programming language. The current Python version is x.

- **YData CPU:** Optimized for general-purpose computing and data analysis tasks that do not require GPU acceleration. This image includes access
to YData Fabric unique capabilities for data processing (profiling, constraints engine, synthetic data generation, etc).
- **YData GPU:** Designed for tasks that benefit from GPU acceleration, providing enhanced performance for large-scale data processing and machine learning
operations. Also includes access to YData Fabric unique capabilities for data processing.
- **YData GPU TensorFlow:** Specifically configured for TensorFlow-based machine learning and deep learning applications, leveraging GPU capabilities
to accelerate training and inference processes.
- **YData GPU Torch:** Specifically configured for Torch-based machine learning and deep learning applications, leveraging GPU capabilities
to accelerate training and inference processes.

These images ensure that users have the necessary resources and configurations to efficiently conduct their data science
and machine learning projects within the Labs environment.

#### R
An ^^[image for R](https://www.r-project.org/about.html#:~:text=Introduction%20to%20R,by%20John%20Chambers%20and%20colleagues.)^^, that allows you
to leverage the latest version of the language as well as the most user libraries.

## Existing Labs

Existing Labs appear in the *Labs* pane of the web application. Besides information about each Lab’s settings and status, three buttons are available:

- **Open:** Opens the Lab’s IDE in a new browser tab.
- **Pause:** Pauses the Lab. When resumed, all data will be available.
- **Delete:** Deletes the Lab. Data not saved in the workspace’s shared folder (see below) will be deleted.

![The details list of a Lab, with the status and its main actions.](https://s3-us-west-2.amazonaws.com/secure.notion-static.com/f6b25172-047e-47bd-8ab2-c9a0a45731ae/Untitled.png)

The details list of a Lab, with the status and its main actions.

The Status column indicates the Labs’ status. A Lab can have 4 statuses:

- 🟢 Lab is running
- 🟡 Lab is being created (hardware is being provisioned) or is either pausing or starting
- 🔴 Lab was shutdown due to an error. A common error is the Lab going out-of-memory. Additional details are offered in the web application.
- ⚫ Lab is paused

## Git integration
Integrating Git with Jupyter Notebooks and Visual Studio Code (VS Code) streamlines version control and collaborative workflows
for data developers. This integration allows you to track changes, manage project versions, and collaborate effectively within familiar interfaces.

### Jupyter Lab

Inside of Labs that use Jupyter Lab as IDE, you will find the ^^[*jupyterlab-git*](https://github.com/jupyterlab/jupyterlab-git)^^
extension installed in the environment.

To create a new repository or clone an existing one, perform the following steps:

| Select Jupyter Lab Git extension | Cloning a repository to your local env |
|----------------------------------------------------------------|------------------------------------------------------|
| ![Jupyter Lab git](../assets/labs/jupyterlab_git_extension.webp) | ![Cloning](../assets/labs/cloning_jupyterlab.webp) |

For more complex actions like forking and merging branches, see the gif below:
![Jupyterlab-git extension in action](../assets/labs/jupyterlab-git.gif){: style="width:80%"}
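
If you prefer the command line over the extension's UI, the same operations can also be run from a notebook cell or the Lab's terminal. A minimal sketch (the repository URL and branch name are just examples):

```python
# In a notebook cell the leading "!" runs the command in the shell; drop it in a terminal
!git clone https://github.com/ydataai/academy.git       # clone an existing repository
!git -C academy checkout -b my-feature-branch           # create and switch to a new branch
!git -C academy status                                   # inspect the working tree
```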

### Visual Studio Code (VS Code)

To clone a Git repository, click *"Clone Git Repository..."* and paste the repository URL into the text box at the top center of the screen,
as depicted in the images below.

| Clone Git repository | Cloning a repository to your local env |
|--------------------------------------------------------------------------------|--------------------------------------------------------------|
| ![Vs code clone repo](../assets/labs/git_integration_vscode.webp) | ![Cloning vs code](../assets/labs/cloning_repo_vscode.webp) |

## Building Pipelines
Building data pipelines and breaking them down into modular components can be challenging.
For instance, a typical machine learning or deep learning pipeline starts with a series of preprocessing steps,
followed by experimentation and optimization, and finally deployment.
Each of these stages presents unique challenges within the development lifecycle.

Fabric Jupyter Labs simplifies this process by incorporating Elyra as the Pipeline Visual Editor.
The visual editor enables users to build data pipelines from notebooks, Python scripts, and R scripts, making it easier to convert multiple notebooks
or script files into batch jobs or workflows.

Currently, these pipelines can be executed either locally in JupyterLab or on Kubeflow Pipelines, offering flexibility and scalability
for various project needs. ^^[Read more about pipelines.](../pipelines/index.md)^^
67 changes: 67 additions & 0 deletions docs/pipelines/concepts.md
@@ -0,0 +1,67 @@
# Concepts

An example pipeline (as seen in the Pipelines module of the dashboard), where each single-responsibility block corresponds to a step in a typical machine learning workflow

Each Pipeline is a set of connected blocks. A block is a self-contained set of code, packaged as a container, that performs one step in the Pipeline. Usually, each Pipeline block corresponds to a single-responsibility task in a workflow. In a machine learning workflow, each step would correspond to one block, e.g., data ingestion, data cleaning, pre-processing, ML model training, ML model evaluation.

Each block is parametrized by the following (a minimal code sketch after this list shows how these map onto a pipeline step):

- **code:** the code it executes (for instance, a Jupyter Notebook, a Python file or an R script)
- **runtime:** which specifies the container environment it runs in, allowing modularization and inter-step independence of software requirements (for instance, specific Python versions for different blocks)
- **hardware requirements:** depending on the workload, a block may have different needs regarding CPU/GPU/RAM. These requirements are automatically matched with the hardware availability of the cluster the Platform’s running in. This, combined with the modularity of each block, allows cost and efficiency optimizations by up/downscaling hardware according to the workload.
- **file dependencies:** local files that need to be copied to the container environment
- **environment variables:** useful, for instance, to apply specific settings or inject authentication credentials
- **output files:** files generated during the block’s workload, which will be made available to all subsequent Pipeline steps
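
For illustration, here is how these parameters could map onto a step defined directly with the Kubeflow Pipelines (v1) SDK, on which Fabric's Pipelines are based. This is a minimal sketch rather than Fabric-specific code; the image, command, resource values and variable names are placeholders:

```python
import kfp
from kfp import dsl
from kubernetes.client import V1EnvVar


@dsl.pipeline(name="block-parametrization-example",
              description="Single-step sketch of a block's parameters")
def example_pipeline():
    # code + runtime: the command executed and the container image it runs in
    step = dsl.ContainerOp(
        name="preprocess",
        image="python:3.9",
        command=["python", "-c", "open('/tmp/clean.csv', 'w').write('a,b\\n1,2\\n')"],
        # output files: exposed so that downstream steps can consume them
        file_outputs={"clean_data": "/tmp/clean.csv"},
    )
    # hardware requirements: matched against the cluster's available resources
    step.container.set_cpu_limit("2")
    step.container.set_memory_limit("4Gi")
    # environment variables: settings or credentials injected at runtime
    step.container.add_env_variable(V1EnvVar(name="EXAMPLE_SETTING", value="some-value"))


# compile to a package that can be uploaded and run
kfp.compiler.Compiler().compile(example_pipeline, "block_example.yaml")
```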

The hierarchy of a Pipeline, in an ascending manner, is as follows:

- **Run:** A single execution of a Pipeline. Usually, Pipelines are run due to changes in the code,
in the data sources or in their parameters (as Pipelines can have runtime parameters)
- **Experiment:** Groups of runs of the same Pipeline (may have different parameters, code or settings, which are
then easily comparable). All runs must have an Experiment. An Experiment can contain Runs from different Pipelines.
- **Pipeline Version:** Pipeline definitions can be versioned (for instance, early iterations on the flow of operations;
different versions for staging and production environments)
- **Pipeline**

📖 ^^[Get started with the concepts and a step-by-step tutorial](../get-started/create_pipeline.md)^^

## Runs & Recurring Runs
A *run* is a single execution of a pipeline. Runs comprise an immutable log of all experiments that you attempt,
and are designed to be self-contained to allow for reproducibility. You can track the progress of a run by looking
at its details page on the pipeline's UI, where you can see the runtime graph, output artifacts, and logs for each step
in the run.

A *recurring run*, or job in the backend APIs, is a repeatable run of a pipeline.
The configuration for a recurring run includes a copy of a pipeline with all parameter values specified
and a run trigger. You can start a recurring run inside any experiment, and it will periodically start a new copy
of the run configuration. You can enable or disable the recurring run from the pipeline's UI. You can also specify
the maximum number of concurrent runs to limit the number of runs launched in parallel.
This can be helpful if the pipeline is expected to run for a long period and is triggered to run frequently.
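
When working through the underlying Kubeflow Pipelines (v1) SDK rather than the UI, a recurring run could be configured roughly as in the sketch below; the experiment name, cron expression, package path and parameters are placeholders, and connection details depend on your environment:

```python
import kfp

client = kfp.Client()  # inside a Lab, host and credentials may already be configured

experiment = client.create_experiment(name="weekly-synthesis")

client.create_recurring_run(
    experiment_id=experiment.id,
    job_name="weekly-synthetic-data-refresh",
    cron_expression="0 0 3 * * 1",        # Kubeflow's 6-field cron: Mondays at 03:00
    max_concurrency=1,                    # limit the number of runs launched in parallel
    pipeline_package_path="pipeline.yaml",
    params={"n_rows": 1000},              # must match the pipeline's runtime parameters
)
```
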
## Experiment
An experiment is a workspace where you can try different configurations of your pipelines. You can use experiments to organize
your runs into logical groups. Experiments can contain arbitrary runs, including recurring runs.
## Pipeline & Pipeline Version
A pipeline is a description of a workflow, which can include machine learning (ML) tasks, data preparation or even the
generation of synthetic data. The pipeline outlines all the components involved in the workflow and illustrates how these
components interrelate in the form of a graph. The pipeline configuration defines the inputs (parameters) required to run
the pipeline and specifies the inputs and outputs of each component.

When you run a pipeline, the system launches one or more Kubernetes Pods corresponding to the steps (components)
in your workflow. The Pods start Docker containers, and the containers, in turn, start your programs.

Pipelines can be easily versioned for reproducibility of results.
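
Through the underlying Kubeflow Pipelines (v1) SDK, versioning could look roughly like the sketch below (names and paths are placeholders; versions can equally be managed from the Pipelines UI):

```python
import kfp

client = kfp.Client()

# the first upload creates the pipeline...
pipeline = client.upload_pipeline("pipeline.yaml", pipeline_name="data-prep")

# ...subsequent uploads register new versions under the same pipeline
client.upload_pipeline_version(
    "pipeline.yaml",
    pipeline_version_name="v2-staging",
    pipeline_id=pipeline.id,
)
```
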
## Artifacts
For each block/step in a Run, **Artifacts** can be generated.
Artifacts are raw output data which is automatically rendered in the Pipeline’s UI in a rich manner - as formatted tables, text, charts, bar graphs/scatter plots/line graphs,
ROC curves, confusion matrices or inline HTML.

Artifacts are useful to attach, to each step/block of a data improvement workflow, relevant visualizations, summary tables, data profiling reports or text analyses.
They are logged by creating a JSON file with a simple, pre-specified format (according to the output artifact type).
Additional types of artifacts are supported (like binary files - models, datasets), yet will not benefit from rich visualizations in the UI.
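
As a sketch of that convention (the Kubeflow Pipelines v1 format that Fabric's Pipelines build on; the exact set of supported artifact types may vary), a step can write its UI metadata and metrics files like this:

```python
import json

# Written from inside a pipeline step: the UI picks these files up automatically.
ui_metadata = {
    "outputs": [
        {"type": "markdown", "storage": "inline",
         "source": "# Data quality report\nAll checks passed."},
        {"type": "table", "storage": "inline", "format": "csv",
         "header": ["column", "missing_values"],
         "source": "age,0\nincome,12\n"},
    ]
}
with open("/mlpipeline-ui-metadata.json", "w") as f:
    json.dump(ui_metadata, f)

# Metrics follow a similar convention and become comparable across runs.
metrics = {"metrics": [{"name": "quality-score", "numberValue": 0.93, "format": "PERCENTAGE"}]}
with open("/mlpipeline-metrics.json", "w") as f:
    json.dump(metrics, f)
```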

!!! tip "Compare side-by-side"
💡 **Artifacts** and **Metrics** can be compared side-by-side across runs, which makes them a powerful tool when doing iterative experimentation over
data quality improvement pipelines.

## Pipelines examples in YData Academy
👉 ^^[Use cases on YData’s Academy](https://github.com/ydataai/academy/tree/master/4%20-%20Use%20Cases)^^ contain examples of full use cases as well as of using the Pipelines interface to log metrics and artifacts.
42 changes: 42 additions & 0 deletions docs/pipelines/index.md
@@ -0,0 +1,42 @@
# Pipelines

The Pipelines module of [YData Fabric](https://ydata.ai/products/fabric) is a general-purpose job orchestrator with built-in scalability and modularity
plus reporting and experiment tracking capabilities.
With **automatic hardware provisioning**, **on-demand** or **scheduled execution**, **run fingerprinting**
and a **UI interface for review and configuration**, Pipelines equip the Fabric with
**operational capabilities for interfacing with up/downstream systems**
(for instance to automate data ingestion, synthesis and transfer workflows) and with the ability to
**experiment at scale** (crucial during the iterative development process required to discover the data
improvement pipeline yielding the highest quality datasets).

YData Fabric's Pipelines are based on ^^[Kubeflow Pipelines](https://www.kubeflow.org/docs/components/pipelines/)^^
and can be created via an interactive interface in Labs with Jupyter Lab as the IDE **(recommended)** or
via [Kubeflow Pipeline’s Python SDK](https://www.kubeflow.org/docs/components/pipelines/sdk/sdk-overview/).
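
As a minimal, hedged sketch of the SDK route (not Fabric-specific code; the step logic and names are placeholders), a two-step pipeline can be defined and compiled like this:

```python
import kfp
from kfp import dsl
from kfp.components import create_component_from_func


def generate(n_rows: int) -> str:
    """Toy step standing in for data ingestion or synthesis."""
    return ",".join(str(i) for i in range(n_rows))


def summarize(data: str) -> int:
    """Toy step standing in for profiling or validation."""
    return len(data.split(","))


generate_op = create_component_from_func(generate, base_image="python:3.9")
summarize_op = create_component_from_func(summarize, base_image="python:3.9")


@dsl.pipeline(name="minimal-example", description="Two-step illustrative pipeline")
def minimal_pipeline(n_rows: int = 10):
    generated = generate_op(n_rows)
    summarize_op(generated.output)


# Compile to a package that can be uploaded through the Pipelines UI or the SDK client
kfp.compiler.Compiler().compile(minimal_pipeline, "minimal_pipeline.yaml")
```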

With its full integration with Fabric's scalable architecture and the ability to leverage Fabric’s Python interface,
Pipelines are the recommended tool to **scale up notebook work to experiment at scale** or
**move from experimentation to production**.

## Benefits
Using Pipelines for data preparation offers several benefits, particularly in the context of data engineering,
machine learning, and data science workflows. Here are some key advantages:

- **Modularity:** they allow you to break down data preparation into discrete, reusable steps.
Each step can be independently developed, tested, and maintained, enhancing code modularity and readability.
- **Automation:** they automate the data preparation process, reducing the need for manual intervention
and ensuring that data is consistently processed. This leads to more efficient workflows and saves time.
- **Scalability:** Fabric's distributed infrastructure combined with Kubernetes-based pipelines makes it possible to handle
large volumes of data efficiently, making pipelines suitable for big data environments.
- **Reproducibility:** By defining a series of steps that transform raw data into a ready-to-use format,
pipelines ensure that the same transformations are applied every time. This reproducibility is crucial for
maintaining data integrity and for validating results.
- **Maintainability & versioning:** pipelines support versioning of the data preparation steps. This versioning is crucial
for tracking changes, auditing processes, and rolling back to previous versions if needed.
- **Flexibility:** above all, pipelines can be customized to fit the specific requirements of different projects.
They can be adapted to include various preprocessing techniques, feature engineering steps,
and data validation processes.

## Related Materials
- 📖 ^^[How to create your first Pipeline](../get-started/create_pipeline.md)^^
- :fontawesome-brands-youtube:{ .youtube } <a href="https://www.youtube.com/watch?v=feNoXv34waM"><u>How to build a pipeline with YData Fabric</u></a>