diff --git a/docs/assets/labs/cloning_jupyterlab.webp b/docs/assets/labs/cloning_jupyterlab.webp
new file mode 100644
index 00000000..7eb240d4
Binary files /dev/null and b/docs/assets/labs/cloning_jupyterlab.webp differ
diff --git a/docs/assets/labs/cloning_repo_vscode.webp b/docs/assets/labs/cloning_repo_vscode.webp
new file mode 100644
index 00000000..fcf63dc1
Binary files /dev/null and b/docs/assets/labs/cloning_repo_vscode.webp differ
diff --git a/docs/assets/labs/git_integration_vscode.webp b/docs/assets/labs/git_integration_vscode.webp
new file mode 100644
index 00000000..7e3435e2
Binary files /dev/null and b/docs/assets/labs/git_integration_vscode.webp differ
diff --git a/docs/assets/labs/jupyterlab-git.gif b/docs/assets/labs/jupyterlab-git.gif
new file mode 100644
index 00000000..d5e398ad
Binary files /dev/null and b/docs/assets/labs/jupyterlab-git.gif differ
diff --git a/docs/assets/labs/jupyterlab_git_extension.webp b/docs/assets/labs/jupyterlab_git_extension.webp
new file mode 100644
index 00000000..547b8e49
Binary files /dev/null and b/docs/assets/labs/jupyterlab_git_extension.webp differ
diff --git a/docs/assets/labs/welcome_labs_creation.webp b/docs/assets/labs/welcome_labs_creation.webp
new file mode 100644
index 00000000..7c3e41f8
Binary files /dev/null and b/docs/assets/labs/welcome_labs_creation.webp differ
diff --git a/docs/labs/index.md b/docs/labs/index.md
new file mode 100644
index 00000000..9487c5a9
--- /dev/null
+++ b/docs/labs/index.md
@@ -0,0 +1,22 @@
+# Fabric coding environment
+
+^^[**YData Fabric Labs**](https://ydata.ai/products/fabric)^^ are on-demand, cloud-based data development environments with automatically provisioned hardware (multiple infrastructure configurations,
+including GPUs, are possible) and **full platform integration** via a Python interface (allowing access to Data Sources, Synthesizers,
+and the Workspace’s shared files).
+
+With Labs, you can create environments with support for familiar IDEs such as [**Visual Studio Code**](https://code.visualstudio.com/), [**Jupyter Lab**](https://jupyterlab.readthedocs.io/en/stable/)
+and [**H2O Flow**](https://docs.h2o.ai/h2o/latest-stable/h2o-docs/flow.html), with both Python and R supported.
+
+For Python specifically, pre-configured bundles including TensorFlow, PyTorch and/or the most popular data science libraries
+are also available, jumpstarting data development. Additional libraries can be easily installed with a simple *!pip install*.
+
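+For example, from a notebook cell inside a running Lab (the package name below is purely illustrative; install whatever your project needs):
+
+```python
+# Run inside a Lab notebook cell; the leading "!" executes a shell command
+# in the Lab's environment. "xgboost" is only an example package.
+!pip install xgboost
+```
+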

+![Welcome Labs](../assets/labs/welcome_labs_creation.webp)

+
+## Get started with your first lab
+
+🧪 Follow this [step-by-step guided tutorial to create your first Lab](../get-started/create_lab.md).
+
+## Tutorials & recipes
+
+Leverage YData’s extensive collection of ^^[tutorials and recipes that you can find in YData Academy](https://github.com/ydataai/academy)^^. Quickstart or accelerate your data development
+with recipes and tutorial use cases.
diff --git a/docs/labs/overview.md b/docs/labs/overview.md
new file mode 100644
index 00000000..829b4643
--- /dev/null
+++ b/docs/labs/overview.md
@@ -0,0 +1,101 @@
+# Overview
+
+Labs exist for Data practitioners to tackle more complex use cases through a familiar environment supercharged with infrastructure,
+integration with other Fabric modules and access to advanced synthesis and profiling technology via a familiar Python interface.
+
+It is the preferred environment for Data practitioners to express their domain expertise with all the required tools,
+technology and computational power at their fingertips. It is thus the natural continuation of the data understanding work
+started in Data Sources.
+
+## Supported IDEs and images
+
+### IDEs
+YData Fabric supports integration with various Integrated Development Environments (IDEs) to enhance productivity and streamline workflows.
+The supported IDEs include:
+
+- **Visual Studio Code (VS Code):** A highly versatile and widely-used code editor that offers robust support for numerous programming languages
+and frameworks. Its integration with Git and extensions like GitLens makes it ideal for version control and collaborative development.
+- **Jupyter Lab:** An interactive development environment that allows for notebook-based data science and machine learning workflows.
+It supports seamless Git integration through extensions and offers a user-friendly interface for managing code, data, and visualizations.
+- **H2O Flow:** A web-based interface specifically designed for machine learning and data analysis with the H2O platform.
+It provides a flow-based, interactive environment for building and deploying machine learning models.
+
+### Labs images
+In the Labs environment, users have access to the following default images, tailored to different computational needs:
+
+#### Python
+All the images below support Python as the programming language. Current Python version is x
+
+- **YData CPU:** Optimized for general-purpose computing and data analysis tasks that do not require GPU acceleration. This image includes access
+to YData Fabric’s unique capabilities for data processing (profiling, constraints engine, synthetic data generation, etc.).
+- **YData GPU:** Designed for tasks that benefit from GPU acceleration, providing enhanced performance for large-scale data processing and machine learning
+operations. Also includes access to YData Fabric’s unique capabilities for data processing.
+- **YData GPU TensorFlow:** Specifically configured for TensorFlow-based machine learning and deep learning applications, leveraging GPU capabilities
+to accelerate training and inference processes.
+- **YData GPU Torch:** Specifically configured for Torch-based machine learning and deep learning applications, leveraging GPU capabilities
+to accelerate training and inference processes.
+
+These images ensure that users have the necessary resources and configurations to efficiently
+conduct their data science and machine learning projects within the Labs environment.
+
+#### R
+An ^^[image for R](https://www.r-project.org/about.html#:~:text=Introduction%20to%20R,by%20John%20Chambers%20and%20colleagues.)^^ is also available, allowing you
+to leverage the latest version of the language as well as its most used libraries.
+
+## Existing Labs
+
+Existing Labs appear in the *Labs* pane of the web application. Besides information about their settings and status, three buttons exist:
+
+- **Open:** Open the Lab’s IDE in a new browser tab
+- **Pause:** Pause the Lab. When resumed, all data will be available.
+- **Delete:** The Lab will be deleted. Data not saved in the workspace’s shared folder (see below) will be deleted.
+
+![The details list of a Lab, with the status and its main actions.](https://s3-us-west-2.amazonaws.com/secure.notion-static.com/f6b25172-047e-47bd-8ab2-c9a0a45731ae/Untitled.png)
+
+The details list of a Lab, with the status and its main actions.
+
+The Status column indicates the Labs’ status. A Lab can have 4 statuses:
+
+- 🟢 Lab is running
+- 🟡 Lab is being created (hardware is being provisioned) or is either pausing or starting
+- 🔴 Lab was shut down due to an error. A common error is the Lab running out of memory. Additional details are offered in the web application.
+- ⚫ Lab is paused
+
+## Git integration
+Integrating Git with Jupyter Notebooks and Visual Studio Code (VS Code) streamlines version control and collaborative workflows
+for data developers. This integration allows you to track changes, manage project versions, and collaborate effectively within familiar interfaces.
+
+### Jupyter Lab
+
+Inside Labs that use Jupyter Lab as the IDE, you will find the ^^[*jupyterlab-git*](https://github.com/jupyterlab/jupyterlab-git)^^
+extension installed in the environment.
+
+To create or clone a repository, perform the following steps:
+
+| Select Jupyter Lab Git extension | Cloning a repository to your local env |
+|----------------------------------------------------------------|------------------------------------------------------|
+| ![Jupyter Lab git](../assets/labs/jupyterlab_git_extension.webp) | ![Cloning](../assets/labs/cloning_jupyterlab.webp) |
+
+For more complex actions like forking and merging branches, see the gif below:
+![Jupyterlab-git extension in action](../assets/labs/jupyterlab-git.gif){: style="width:80%"}
+
+### Visual Studio Code (VS Code)
+
+To clone or create a Git repository, click *"Clone Git Repository..."* and paste the repository URL into the text box at the top center of the screen,
+as depicted in the image below.
+
+| Clone Git repository | Cloning a repository to your local env |
+|--------------------------------------------------------------------------------|--------------------------------------------------------------|
+| ![Vs code clone repo](../assets/labs/git_integration_vscode.webp) | ![Cloning vs code](../assets/labs/cloning_repo_vscode.webp) |
+
+## Building Pipelines
+Building data pipelines and breaking them down into modular components can be challenging.
+For instance, a typical machine learning or deep learning pipeline starts with a series of preprocessing steps,
+followed by experimentation and optimization, and finally deployment.
+Each of these stages presents unique challenges within the development lifecycle.
+
+Fabric Jupyter Labs simplifies this process by incorporating Elyra as the Pipeline Visual Editor.
+The visual editor enables users to build data pipelines from notebooks, Python scripts, and R scripts, making it easier to convert multiple notebooks
+or script files into batch jobs or workflows.
+
+Currently, these pipelines can be executed either locally in JupyterLab or on Kubeflow Pipelines, offering flexibility and scalability
+for various project needs. ^^[Read more about pipelines.](../pipelines/index.md)^^
diff --git a/docs/pipelines/concepts.md b/docs/pipelines/concepts.md
new file mode 100644
index 00000000..21a256d2
--- /dev/null
+++ b/docs/pipelines/concepts.md
@@ -0,0 +1,67 @@
+# Concepts
+
+An example pipeline (as seen in the Pipelines module of the dashboard), where each single-responsibility block corresponds to a step in a typical machine learning workflow
+
+Each Pipeline is a set of connected blocks. A block is a self-contained set of code, packaged as a container, that performs one step in the Pipeline. Usually, each Pipeline block corresponds to a single-responsibility task in a workflow. In a machine learning workflow, each step would correspond to one block, i.e., data ingestion, data cleaning, pre-processing, ML model training, ML model evaluation.
+
+Each block is parametrized by:
+
+- **code:** the code it executes (for instance, a Jupyter Notebook, a Python file, an R script)
+- **runtime:** specifies the container environment it runs in, allowing modularization and inter-step independence of software requirements (for instance, specific Python versions for different blocks)
+- **hardware requirements:** depending on the workload, a block may have different needs regarding CPU/GPU/RAM. These requirements are automatically matched with the hardware availability of the cluster the Platform’s running in. This, combined with the modularity of each block, allows cost and efficiency optimizations by up/downscaling hardware according to the workload.
+- **file dependencies:** local files that need to be copied to the container environment
+- **environment variables:** useful, for instance, to apply specific settings or inject authentication credentials
+- **output files:** files generated during the block’s workload, which will be made available to all subsequent Pipeline steps
+
+The hierarchy of a Pipeline, in an ascending manner, is as follows:
+
+- **Run:** A single execution of a Pipeline. Usually, Pipelines are run due to changes to the code,
+to the data sources or to its parameters (as Pipelines can have runtime parameters)
+- **Experiment:** Groups of runs of the same Pipeline (may have different parameters, code or settings, which are
+then easily comparable). All runs must have an Experiment. An Experiment can contain Runs from different Pipelines.
+- **Pipeline Version:** Pipeline definitions can be versioned (for instance, early iterations on the flow of operations;
+different versions for staging and production environments)
+- **Pipeline**
+
+📖 ^^[Get started with the concepts and a step-by-step tutorial](../get-started/create_pipeline.md)^^
+
+## Runs & Recurring Runs
+A *run* is a single execution of a pipeline. Runs comprise an immutable log of all experiments that you attempt,
+and are designed to be self-contained to allow for reproducibility. You can track the progress of a run by looking
+at its details page on the pipeline's UI, where you can see the runtime graph, output artifacts, and logs for each step
+in the run.
+
+A *recurring run*, or job in the backend APIs, is a repeatable run of a pipeline.
+The configuration for a recurring run includes a copy of a pipeline with all parameter values specified
+and a run trigger. You can start a recurring run inside any experiment, and it will periodically start a new copy
+of the run configuration. You can enable or disable the recurring run from the pipeline's UI. You can also specify
+the maximum number of concurrent runs to limit the number of runs launched in parallel.
+This can be helpful if the pipeline is expected to run for a long period and is triggered to run frequently.
+
+## Experiment
+An experiment is a workspace where you can try different configurations of your pipelines. You can use experiments to organize
+your runs into logical groups. Experiments can contain arbitrary runs, including recurring runs.
+
+## Pipeline & Pipeline Version
+A pipeline is a description of a workflow, which can include machine learning (ML) tasks, data preparation or even the
+generation of synthetic data. The pipeline outlines all the components involved in the workflow and illustrates how these
+components interrelate in the form of a graph. The pipeline configuration defines the inputs (parameters) required to run
+the pipeline and specifies the inputs and outputs of each component.
+
+When you run a pipeline, the system launches one or more Kubernetes Pods corresponding to the steps (components)
+in your workflow. The Pods start Docker containers, and the containers, in turn, start your programs.
+
+Pipelines can be easily versioned for reproducibility of results.
+
+## Artifacts
+For each block/step in a Run, **Artifacts** can be generated.
+Artifacts are raw output data which is automatically rendered in the Pipeline’s UI in a rich manner - as formatted tables, text, charts, bar graphs/scatter plots/line graphs,
+ROC curves, confusion matrices or inline HTML.
+
+Artifacts are useful for attaching relevant visualizations, summary tables, data profiling reports or text analyses to each step/block of a data improvement workflow.
+They are logged by creating a JSON file with a simple, pre-specified format (according to the output artifact type).
+Additional types of artifacts are supported (like binary files - models, datasets), but they will not benefit from rich visualizations in the UI.
+
+!!! tip "Compare side-by-side"
+    💡 **Artifacts** and **Metrics** can be compared side-by-side across runs, which makes them a powerful tool when doing iterative experimentation over
+    data quality improvement pipelines.
+
+## Pipelines examples in YData Academy
+👉 ^^[Use cases on YData’s Academy](https://github.com/ydataai/academy/tree/master/4%20-%20Use%20Cases)^^ contain examples of full use cases, as well as of the Pipelines interface for logging metrics and artifacts.
diff --git a/docs/pipelines/index.md b/docs/pipelines/index.md
new file mode 100644
index 00000000..e6be9101
--- /dev/null
+++ b/docs/pipelines/index.md
@@ -0,0 +1,42 @@
+# Pipelines
+
+The Pipelines module of [YData Fabric](https://ydata.ai/products/fabric) is a general-purpose job orchestrator with built-in scalability and modularity
+plus reporting and experiment tracking capabilities.
+With **automatic hardware provisioning**, **on-demand** or **scheduled execution**, **run fingerprinting**
+and a **UI for review and configuration**, Pipelines equip the Fabric with
+**operational capabilities for interfacing with up/downstream systems**
+(for instance to automate data ingestion, synthesis and transfer workflows) and with the ability to
+**experiment at scale** (crucial during the iterative development process required to discover the data
+improvement pipeline yielding the highest quality datasets).
+
+YData Fabric's Pipelines are based on ^^[Kubeflow Pipelines](https://www.kubeflow.org/docs/components/pipelines/)^^
+and can be created via an interactive interface in Labs with Jupyter Lab as the IDE **(recommended)** or
+via [Kubeflow Pipelines’ Python SDK](https://www.kubeflow.org/docs/components/pipelines/sdk/sdk-overview/).
+
+With its full integration with Fabric's scalable architecture and the ability to leverage Fabric’s Python interface,
+Pipelines are the recommended tool to **scale up notebook work to experiment at scale** or
+**move from experimentation to production**.
+
+## Benefits
+Using Pipelines for data preparation offers several benefits, particularly in the context of data engineering,
+machine learning, and data science workflows. Here are some key advantages:
+
+- **Modularity:** they allow you to break down data preparation into discrete, reusable steps.
+Each step can be independently developed, tested, and maintained, enhancing code modularity and readability.
+- **Automation:** they automate the data preparation process, reducing the need for manual intervention
+and ensuring that data is consistently processed. This leads to more efficient workflows and saves time.
+- **Scalability:** Fabric's distributed infrastructure combined with Kubernetes-based pipelines makes it possible to handle
+large volumes of data efficiently, making them suitable for big data environments.
+- **Reproducibility:** By defining a series of steps that transform raw data into a ready-to-use format,
+pipelines ensure that the same transformations are applied every time. This reproducibility is crucial for
+maintaining data integrity and for validating results.
+- **Versioning & maintainability:** Pipelines support versioning of the data preparation steps. This versioning is crucial
+for tracking changes, auditing processes, and rolling back to previous versions if needed.
+- **Flexibility:** above all, they can be customized to fit the specific requirements of different projects.
+They can be adapted to include various preprocessing techniques, feature engineering steps,
+and data validation processes.
+
+## Related Materials
+- 📖 ^^[How to create your first Pipeline](../get-started/create_pipeline.md)^^
+- :fontawesome-brands-youtube:{ .youtube } How to build a pipeline with YData Fabric
diff --git a/docs/pipelines/runs.md b/docs/pipelines/runs.md
new file mode 100644
index 00000000..5cc2d8fc
--- /dev/null
+++ b/docs/pipelines/runs.md
@@ -0,0 +1,167 @@
+# Creating & managing runs
+
+## Viewing Run details
+
+To view a specific Run, go to the **Experiments** list and click on the desired Run. Alternatively, you can access **Runs** and select the desired Run directly.
+
+![Accessing Runs through its Experiment](Pipelines%2055c1a84b8a374deab72652d8f3fc375c/Untitled%202.png)
+
+Accessing Runs through its Experiment
+
+![Viewing the full list of Runs, for all Pipelines and Experiments.
+Runs can be filtered and sorted based on different fields (including Metrics).](Pipelines%2055c1a84b8a374deab72652d8f3fc375c/Untitled%203.png)
+
+Viewing the full list of Runs, for all Pipelines and Experiments. Runs can be filtered and sorted based on different fields (including Metrics).
+
+Once a Run is selected, its graph can be viewed (in real-time, if the Run is executing). The graph shows the execution status of each block. Clicking on each block will reveal the block’s details, including artifacts, various configuration details and logs (useful for troubleshooting).
+
+![The details page of a step, showing a profiling report (as HTML) as an Artifact](Pipelines%2055c1a84b8a374deab72652d8f3fc375c/Untitled%204.png)
+
+The details page of a step, showing a profiling report (as HTML) as an Artifact
+
+The **Run Output** tab includes outputs such as metrics or binary artifacts.
+
+## Creating Runs
+
+Besides triggering execution via the pipeline editor in Jupyter Lab or the Python SDK, the Pipelines management UI can also be used.
+
+### One-off
+
+To create a one-off run of a Pipeline, choose a Pipeline in the *Pipelines* section (including the specific Pipeline version, in case there are multiple definitions) and click *+ Create Run*.
+
+![Creating a Run of a specific Pipeline](Pipelines%2055c1a84b8a374deab72652d8f3fc375c/Untitled%205.png)
+
+Creating a Run of a specific Pipeline
+
+To finish creating the Run, additional information is needed:
+
+- a **Description** (optional)
+- the **Experiment** (mandatory and can be chosen from the list of existing ones)
+- the **Run Type** (which should be one-off)
+- any eventual runtime **parameters** of the Pipeline.
+
+![Untitled](Pipelines%2055c1a84b8a374deab72652d8f3fc375c/Untitled%206.png)
+
+Clicking *Start* will trigger execution. Each Run will have a unique, automatically created ID.
+
+### Recurring
+
+To create a Recurring Run, the procedure shown above should be followed, but instead a *Recurring* **Run Type** should be chosen.
+
+The main configuration parameters of a Recurring Run are the **frequency**, **start date** and **end date**, as well as the **maximum number of concurrent Runs** of the Pipeline. The maximum number of concurrent Runs is a particularly relevant parameter for Pipelines whose execution time may stretch into the following scheduled Run’s start time - it should be tweaked to avoid overwhelming the available infrastructure. Recurrence can also be configured via cron-like definitions.
+
+![Configuring a Recurrent Run](Pipelines%2055c1a84b8a374deab72652d8f3fc375c/Untitled%207.png)
+
+Configuring a Recurrent Run
+
+The recurring run will keep on executing until its end date or until it is manually disabled. Configured Recurrent Runs are listed in the *Recurring Runs* section.
+
+# Creating a Pipeline
+
+The recommended way to create a Pipeline is to use the interactive Pipeline editor available in Labs with Jupyter Lab set as the IDE (a programmatic alternative using the Kubeflow Pipelines SDK is sketched below). It allows:
+
+- adding blocks by dragging and dropping notebooks/Python scripts/R scripts (can be a mixture)
+- connecting blocks in linear and non-linear ways to define the execution sequence
+- configuring the parameters of each block in-line.
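+
+For the programmatic route, a pipeline can be defined in Python and compiled into a file that is then uploaded through the Pipelines management UI. The sketch below is a minimal, illustrative example assuming the Kubeflow Pipelines SDK v1 (`kfp<2`) is installed in the Lab; the step names, images and commands are placeholders, not Fabric defaults:
+
+```python
+# example_pipeline.py - define a two-step pipeline and compile it to YAML.
+# Assumes Kubeflow Pipelines SDK v1; images and commands are illustrative.
+import kfp
+from kfp import dsl
+
+
+@dsl.pipeline(name="example-pipeline", description="Two illustrative steps")
+def example_pipeline():
+    prepare = dsl.ContainerOp(
+        name="prepare-data",
+        image="python:3.10",
+        command=["python", "-c", "print('preparing data')"],
+    )
+    train = dsl.ContainerOp(
+        name="train-model",
+        image="python:3.10",
+        command=["python", "-c", "print('training model')"],
+    )
+    train.after(prepare)  # enforce execution order: prepare -> train
+
+
+if __name__ == "__main__":
+    kfp.compiler.Compiler().compile(example_pipeline, "example_pipeline.yaml")
+```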
+
+[Building a simple synthetic data generation pipeline in the interactive editor by dragging and dropping Jupyter Notebooks (Python/R files could also be dragged), leveraging input files for credentials, environment variables for workflow settings, software runtime specification and per-block hardware needs.](Pipelines%2055c1a84b8a374deab72652d8f3fc375c/Screen_Recording_2022-07-27_at_18.29.08.mov)
+
+Building a simple synthetic data generation pipeline in the interactive editor by dragging and dropping Jupyter Notebooks (Python/R files could also be dragged), leveraging input files for credentials, environment variables for workflow settings, software runtime specification and per-block hardware needs.
+
+The built Pipeline can be run directly from the editor. It will then be automatically available in the dashboard’s web UI, where it can be viewed and managed.
+
+# Managing Pipelines
+
+The Pipelines management interface is accessible in the platform’s dashboard, via the sidebar item *Pipelines*.
+
+![The Pipelines management module](Pipelines%2055c1a84b8a374deab72652d8f3fc375c/Untitled%201.png)
+
+The Pipelines management module
+
+It has 6 main sub-modules:
+
+- **Pipelines:** a list of existing Pipelines, which can be further drilled down into the versions of each Pipeline, as Pipeline definitions can be versioned.
+- **Experiments:** a list of all available Experiments (groups of Runs), regardless of their origin Pipeline.
+- **Runs:** a list of all available Runs, regardless of their origin Pipeline/Experiment.
+- **Recurring Runs:** an interface to view and configure the Runs triggered on a schedule.
+- **Artifacts:** a list of Artifacts generated by all Runs of all Pipelines
+- **Executions:** a list of all executed blocks/steps across all Runs of all Pipelines
+
+## Creating a new Experiment
+
+An Experiment is used to group together the runs of a single Pipeline or of different Pipelines. It is particularly useful for organization and Artifacts/Metrics comparison purposes.
+
+To create a new Experiment, access the *Experiments* section and click *+ Create Experiment*. An Experiment requires a name and an optional description.
+
+## Comparing Runs
+
+**Comparing runs is particularly useful in iterative data improvement scenarios**, as Artifacts, Metrics and Parameters can be directly compared side-by-side. Runs using different pre-processing techniques, settings or algorithms can be put against each other in a visual and intuitive interface.
+
+To compare multiple Runs, select the Runs of interest (either from the *Experiments* or *Runs* pane) and select *Compare runs:*
+
+![Selecting Runs to compare from the Experiments list](Pipelines%2055c1a84b8a374deab72652d8f3fc375c/Untitled%208.png)
+
+Selecting Runs to compare from the Experiments list
+
+![In the case of this particular data quality improvement Pipeline, the Metrics of each Run are shown side by side.](Pipelines%2055c1a84b8a374deab72652d8f3fc375c/Untitled%209.png)
+
+In the case of this particular data quality improvement Pipeline, the Metrics of each Run are shown side by side.
+
+Up to 10 runs can be selected for side-by-side comparison. If any step of the Run has logged Artifacts, the equivalent Artifacts are shown in a comparative interface.
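+
+The Metrics used in these comparisons are logged from inside a pipeline step as a small JSON file. The snippet below is a minimal sketch assuming the Kubeflow Pipelines v1 metrics convention (Fabric Pipelines are Kubeflow-based); the metric names and values are illustrative, and the YData Academy examples show the exact format expected by Fabric:
+
+```python
+# Inside a pipeline step: write metrics to a JSON file so they appear in
+# the Runs list and in side-by-side comparisons. Names/values are examples.
+import json
+
+metrics = {
+    "metrics": [
+        {"name": "accuracy-score", "numberValue": 0.92, "format": "PERCENTAGE"},
+        {"name": "rows-synthesized", "numberValue": 10000, "format": "RAW"},
+    ]
+}
+
+with open("/mlpipeline-metrics.json", "w") as f:
+    json.dump(metrics, f)
+```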
+
+![Comparing the confusion matrices of three Runs of a Pipeline, which were logged as Artifacts during one of the Pipeline’s steps.](Pipelines%2055c1a84b8a374deab72652d8f3fc375c/Screenshot_from_2020-10-24_01-12-27.png)
+
+Comparing the confusion matrices of three Runs of a Pipeline, which were logged as Artifacts during one of the Pipeline’s steps.
+
+## Cloning Runs
+
+For full reproducibility purposes, it is possible to select a previous run and clone it. Cloned runs will use exactly the same runtime input parameters and settings. However, **any time-dependent inputs (like the state of a remote data source at a particular point in time) will not be recreated**.
+
+To clone a Run, click the *Clone run* button available on a Run’s details page or in the list of Runs/Experiments (when a single Run is selected). It will be possible to review the settings prior to triggering the execution.
+
+## Archiving Runs
+
+Archiving a Run will move it to the Archived section of the *Runs* and *Experiments* lists. This section can be used to save older executions, to highlight the best runs or to record anomalous executions that require further investigation.
+
+Archive a Run by clicking the *Archive* button from the Run’s details page (or from the list of Runs/Experiments when a Run is selected).
+
+![The Archived section, which is in all ways similar to the list of Active Runs. The *Restore* button (highlighted) moves Runs between the two sections.](Pipelines%2055c1a84b8a374deab72652d8f3fc375c/Untitled%2010.png)
+
+The Archived section, which is in all ways similar to the list of Active Runs. The *Restore* button (highlighted) moves Runs between the two sections.
+
+When a Run is archived, it can be restored through the *Restore* button.
+
+---
diff --git a/docs/sdk/index.md b/docs/sdk/index.md
index 3f70a0a8..eca3e42c 100644
--- a/docs/sdk/index.md
+++ b/docs/sdk/index.md
@@ -2,11 +2,9 @@
 ![Pythonversion](https://img.shields.io/badge/python-3.8%20%7C%203.9%20%7C%203.10-blue)
 [![downloads](https://pepy.tech/badge/ydata-sdk/month)](https://pepy.tech/project/ydata-sdk)
 
-!!! note "Fabric SDK for improved data quality everywhere!"
+!!! note "YData Fabric SDK for improved data quality everywhere!"
 
-    *ydata-sdk* is here! Create a YData account so you can start using today!
-
-    [Create account](https://ydata.ai/ydata-fabric-free-trial){ .md-button .md-button--ydata .md-button--stretch}
+    To start using it, create a Fabric community account at ^^[ydata.ai/register](https://ydata.ai/ydata-fabric-free-trial)^^
 
 ## Overview
diff --git a/docs/synthetic_data/index.md b/docs/synthetic_data/index.md
new file mode 100644
index 00000000..22878331
--- /dev/null
+++ b/docs/synthetic_data/index.md
@@ -0,0 +1,56 @@
+# Synthetic Data generation
+
+[YData Fabric's *Synthetic Data Generation*](https://ydata.ai/products/synthetic_data) capabilities leverage the latest generative models to create
+high-quality artificial data that replicates real-world data properties. Whether it is a table, a database or a text corpus,
+this powerful capability ensures privacy, enhances data availability, and boosts model performance across various industries.
+In this section, discover how YData Fabric's synthetic data solutions can transform your data-driven initiatives.
+
+## What is Synthetic Data?
+Synthetic data is artificially generated data that mimics the statistical properties and structure of real-world data
+without directly copying it. It is created using algorithms and models designed to replicate the characteristics
+of actual data sets. This process ensures that synthetic data retains the essential patterns and relationships present
+in the original data, making it a valuable asset for various applications, particularly in situations where using real data
+might pose privacy, security, or availability concerns. It can be used for:
+
+- Guaranteeing privacy and compliance when sharing datasets (for quality assurance, product development and other analytics teams)
+- Removing bias by upsampling rare events
+- Balancing datasets
+- Augmenting existing datasets to improve the performance of machine learning models or for use in stress testing
+- Smartly filling in missing values based on context
+- Simulating new scenarios and hypotheses
+
+## The benefits of Synthetic Data
+Leveraging synthetic data offers numerous benefits:
+
+- **Privacy and Security:** Synthetic data eliminates the risk of exposing sensitive information, making it an
+ideal solution for industries handling sensitive data, such as healthcare, finance, and telecommunications.
+- **Data Augmentation:** It enables organizations to augment existing data sets, enhancing model training by providing
+diverse and representative samples, thereby improving model accuracy and robustness.
+- **Cost Efficiency:** Generating synthetic data can be more cost-effective than collecting and labeling large volumes
+of real data, particularly for rare events or scenarios that are difficult to capture.
+- **Testing and Development:** Synthetic data provides a safe environment for testing and developing algorithms,
+ensuring that models are robust before deployment in real-world scenarios.
+
+## Synthetic Data in Fabric
+[YData Fabric](https://ydata.ai/products/fabric) offers robust support for creating high-quality synthetic data using
+generative models and/or through bootstrapping.
+The platform is designed to address the diverse needs of data scientists, engineers, and analysts by providing
+a comprehensive set of tools and features.
+
+### Data Types Supported
+YData Fabric supports the generation of various data types, including:
+
+- **Tabular Data:** Generate synthetic versions of structured data typically found in spreadsheets and databases,
+with support for categorical, numerical, and mixed data types.
+- **Time Series Data:** Create synthetic time series data that preserves the temporal dependencies and trends,
+useful for applications like financial forecasting and sensor data analysis.
+- **Multi-Table or Database Synthesis:** Synthesize complex databases with multiple interrelated tables,
+maintaining the relational integrity and dependencies, which is crucial for comprehensive data analysis and testing
+applications.
+- **Text Data:** Produce synthetic text data for natural language processing (NLP) tasks,
+ensuring the generated text maintains the linguistic properties and context of the original data.
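+
+As a quick illustration of what this looks like in code, the sketch below follows the *ydata-sdk* quickstart pattern for fitting a synthesizer on a tabular dataset and sampling new records. The token, demo dataset and sample size are placeholders, and the exact API may vary with the SDK version:
+
+```python
+# Minimal tabular synthesis sketch using the ydata-sdk quickstart pattern.
+# The token, dataset and n_samples below are placeholders.
+import os
+
+from ydata.sdk.dataset import get_dataset
+from ydata.sdk.synthesizers import RegularSynthesizer
+
+os.environ["YDATA_TOKEN"] = "<your Fabric API token>"
+
+data = get_dataset("census")           # demo dataset; a pandas DataFrame can typically be used as well
+synth = RegularSynthesizer()
+synth.fit(data)                        # learn the statistical properties of the real data
+sample = synth.sample(n_samples=1000)  # generate 1000 new synthetic records
+print(sample)
+```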
+
+## Related Materials
+- 📖 ^^[The 5 Benefits of Synthetic data generation for modern AI](https://ydata.ai/resources/top-5-benefits-of-synthetic-data-in-modern-ai)^^
+- 📖 ^^[The role of Synthetic data in Healthcare](https://ydata.ai/resources/the-role-of-synthetic-data-in-healthcare-from-innovation-to-improved-diagnosis)^^
+- 📖 ^^[The role of Synthetic data to overcome Bias](https://ydata.ai/resources/using-synthetic-data-to-overcome-bias-in-machine-learning)^^
diff --git a/docs/synthetic_data/relational_database/index.md b/docs/synthetic_data/relational_database/index.md
new file mode 100644
index 00000000..8682fc9c
--- /dev/null
+++ b/docs/synthetic_data/relational_database/index.md
@@ -0,0 +1,23 @@
+# Multi-Table Synthetic data generation
+
+**Multi-Table or Database synthetic data generation** is a powerful method to create high-quality artificial datasets
+that mirror the statistical properties and relational structures of original multi-table databases.
+A multi-table database consists of multiple interrelated tables, often with various data types (dates, categorical, numerical, etc.)
+and complex relationships between records.
+Key use cases include privacy-preserving access to full production databases and the creation of realistic test environments.
+Synthetic data allows organizations to share and analyze full production databases without exposing sensitive information,
+ensuring compliance with data privacy regulations. Additionally, it is invaluable for creating realistic test environments,
+enabling developers and testers to simulate real-world scenarios, identify potential issues, and validate database applications
+without risking data breaches.
+By leveraging synthetic multi-table data, organizations can simulate complex relational data environments, enhance the robustness
+of database applications, and ensure data privacy, making it a valuable tool for industries that rely on intricate data structures
+and interdependencies.
+
+## Tutorials & Recipes
+To **get started** with Synthetic Data Generation, ^^[you can follow our quickstart guide](../../get-started/create_database_sd_generator.md)^^.
+
+For more tutorials and recipes, ^^[follow the link to YData's Academy](https://github.com/ydataai/academy/tree/master/2-%20Synthetic%20Data/MultiTable).^^
+
+## Related Materials
+- :fontawesome-brands-youtube:{ .youtube } How to generate Synthetic Data from a Database
+- :fontawesome-brands-youtube:{ .youtube } How to generate Multi-Table step-by-step
+- :fontawesome-brands-youtube:{ .youtube } How to generate Multi-Table synthetic data in Google Colab
diff --git a/docs/synthetic_data/relational_database/use_in_labs.md b/docs/synthetic_data/relational_database/use_in_labs.md
new file mode 100644
index 00000000..e69de29b
diff --git a/docs/synthetic_data/single_table/index.md b/docs/synthetic_data/single_table/index.md
new file mode 100644
index 00000000..9f3ad943
--- /dev/null
+++ b/docs/synthetic_data/single_table/index.md
@@ -0,0 +1,19 @@
+# Tabular synthetic data generation
+
+Tabular synthetic data generation is a powerful method to create high-quality artificial datasets
+that mirror the statistical properties of original tabular data. A tabular dataset is usually composed of
+several columns with structured data and mixed data types (dates, categorical, numerical, etc.) with no time dependence
+between records.
+This ability to generate synthetic data from this type of dataset is essential for a wide range of
+applications, from data augmentation to privacy preservation, and is particularly useful in scenarios where
+obtaining or using real data is challenging.
+
+## Tutorials & Recipes
+To **get started** with Synthetic Data Generation, ^^[you can follow our quickstart guide](../../get-started/create_syntheticdata_generator.md)^^.
+
+For more tutorials and recipes, ^^[follow the link to YData's Academy](https://github.com/ydataai/academy/tree/master/2-%20Synthetic%20Data/Tabular).^^
+
+## Related Materials
+
+- 📖 ^^[Generating Synthetic data from a Tabular dataset with a large number of columns](https://ydata.ai/resources/how-to-synthesize-a-dataset-with-a-large-number-of-columns)^^
+- 📖 ^^[Synthetic data to improve Credit Scoring models](https://ydata.ai/resources/a-data-centric-ai-approach-to-credit-scoring)^^
+- :fontawesome-brands-youtube:{ .youtube } Generate Synthetic data with Python code
+- :fontawesome-brands-youtube:{ .youtube } Synthetic data generation with API
diff --git a/docs/synthetic_data/synthetic_data_quality/compare_profiling.md b/docs/synthetic_data/synthetic_data_quality/compare_profiling.md
new file mode 100644
index 00000000..e69de29b
diff --git a/docs/synthetic_data/synthetic_data_quality/report_pdf.md b/docs/synthetic_data/synthetic_data_quality/report_pdf.md
new file mode 100644
index 00000000..e69de29b
diff --git a/docs/synthetic_data/text/index.md b/docs/synthetic_data/text/index.md
new file mode 100644
index 00000000..87f469e8
--- /dev/null
+++ b/docs/synthetic_data/text/index.md
@@ -0,0 +1,13 @@
+# Text Synthetic Data generation
+
+**Synthetic data generation for text** creates high-quality artificial text datasets that mimic the properties and patterns of original text data,
+playing a crucial role in Generative AI applications. This technique enhances the performance of large language models (LLMs) by providing
+extensive training datasets, which improve model accuracy and robustness. It addresses data scarcity by generating text for specialized domains or
+languages where data is limited. Additionally, synthetic text generation ensures privacy preservation, allowing organizations to create useful datasets
+without compromising sensitive information, thereby complying with data privacy regulations while enabling comprehensive data analysis and model training.
+
+!!! note "Feature in Preview"
+    This feature is in preview and not available for all users. ^^[Contact us if you are interested in giving it a try](https://ydata.ai/contact-us)!^^
+
+## Related Materials
+- :fontawesome-brands-youtube:{ .youtube } How to generate Synthetic Text Data?
diff --git a/docs/synthetic_data/timeseries/index.md b/docs/synthetic_data/timeseries/index.md
new file mode 100644
index 00000000..14b87545
--- /dev/null
+++ b/docs/synthetic_data/timeseries/index.md
@@ -0,0 +1,21 @@
+# Time-series synthetic data generation
+
+**Time-series synthetic data generation** is a powerful method to create high-quality artificial datasets that mirror the
+statistical properties of original time-series data. A time-series dataset is composed of sequential data points
+recorded at specific time intervals, capturing trends, patterns, and temporal dependencies.
+This ability to generate synthetic data from time-series datasets is essential for a wide range of applications,
+from data augmentation to privacy preservation, and is particularly useful in scenarios where obtaining or using
+real data is challenging. By leveraging synthetic time-series data, organizations can simulate various conditions and
+events, enhance model robustness, and ensure data privacy, making it a valuable tool for industries reliant on temporal
+data analysis.
+This type of data is prevalent in various fields, including finance, healthcare, energy, and IoT (Internet of Things).
+
+## Tutorials & Recipes
+To **get started** with Synthetic Data Generation, ^^[you can follow our quickstart guide](../../get-started/create_syntheticdata_generator.md)^^.
+
+For more tutorials and recipes, ^^[follow the link to YData's Academy](https://github.com/ydataai/academy/tree/master/2-%20Synthetic%20Data/Time-series).^^
+
+## Related Materials
+- 📖 ^^[Understanding the structure of a time-series dataset](https://ydata.ai/resources/understanding-the-structure-of-time-series-datasets)^^
+- 📖 ^^[Time-series synthetic data generation](https://ydata.ai/resources/simple-synthetic-time-series-data)^^
+- 📖 ^^[Synthetic multivariate time-series data](https://ydata.ai/resources/synthetic-multivariate-time-series-data)^^
+- :fontawesome-brands-youtube:{ .youtube } How to generate time-series synthetic data?
diff --git a/mkdocs.yml b/mkdocs.yml
index 06cbaafb..63d0a77e 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -27,6 +27,18 @@ nav:
 - 'data_catalog/datasources/index.md'
 - Warnings: 'data_catalog/datasources/warnings.md'
 - PII identification: 'data_catalog/datasources/pii.md'
+- Synthetic data:
+  - 'synthetic_data/index.md'
+  - Tabular: 'synthetic_data/single_table/index.md'
+  - Time-series: 'synthetic_data/timeseries/index.md'
+  - Multi-Table: 'synthetic_data/relational_database/index.md'
+  - Text: 'synthetic_data/text/index.md'
+- Labs:
+  - 'labs/index.md'
+  - Overview: 'labs/overview.md'
+- Pipelines:
+  - 'pipelines/index.md'
+  - Concepts: 'pipelines/concepts.md'
 - Integrations:
 - Snowflake:
 - 'integrations/snowflake/integration_snowflake.md'