diff --git a/docs/assets/overview/data_centric_approach.png b/docs/assets/overview/data_centric_approach.png new file mode 100644 index 00000000..31f0c19d Binary files /dev/null and b/docs/assets/overview/data_centric_approach.png differ diff --git a/docs/assets/overview/fabric_welcome.png b/docs/assets/overview/fabric_welcome.png new file mode 100644 index 00000000..34e0e55c Binary files /dev/null and b/docs/assets/overview/fabric_welcome.png differ diff --git a/docs/assets/overview/registration.png b/docs/assets/overview/registration.png new file mode 100644 index 00000000..353bc18d Binary files /dev/null and b/docs/assets/overview/registration.png differ diff --git a/docs/examples/synthesize_with_anonymization.md b/docs/examples/synthesize_with_anonymization.md index 63676e74..f783b880 100644 --- a/docs/examples/synthesize_with_anonymization.md +++ b/docs/examples/synthesize_with_anonymization.md @@ -6,13 +6,13 @@ YData Synthesizers offers a way to anonymize sensitive information such that the No! The anonymization is performed before the model training such that it never sees the original values. -The anonymization is performed by specifying which columns need to be anonymized and how to performed the anonymization. +The anonymization is performed by specifying which columns need to be anonymized and how to perform the anonymization. The anonymization rules are defined as a dictionary with the following format: `{column_name: anonymization_rule}` While here are some predefined anonymization rules such as `name`, `email`, `company`, it is also possible to create a rule using a regular expression. -The anonymization rules have to be passed to a synthesizer in its `fit` method using the parameter [`anonymize`](../reference/api/synthesizers/timeseries/#ydata.sdk.synthesizers.timeseries.TimeSeriesSynthesizer.fit). +The anonymization rules have to be passed to a synthesizer in its `fit` method using the parameter `anonymize`. !!! 
question "What is the difference between anonymization and privacy?" diff --git a/docs/examples/synthesize_with_privacy_control.md b/docs/examples/synthesize_with_privacy_control.md index 003cf61d..dcc9ebfc 100644 --- a/docs/examples/synthesize_with_privacy_control.md +++ b/docs/examples/synthesize_with_privacy_control.md @@ -6,8 +6,8 @@ YData Synthesizers offers 3 different levels of privacy: 2. **high fidelity** (default): the model is optimized for high fidelity, 3. **balanced**: tradeoff between privacy and fidelity. -The default privacy level is high fidelity. The privacy level can be changed by the user at the moment a synthesizer level is trained by using the parameter [`privacy_level`](../reference/api/synthesizers/timeseries/#ydata.sdk.synthesizers.timeseries.TimeSeriesSynthesizer.fit). -The parameter expect a [`PrivacyLevel`](../reference/api/synthesizers/base/#privacylevel) value. +The default privacy level is high fidelity. The privacy level can be changed by the user when a synthesizer is trained, by using the parameter `privacy_level`. +The parameter expects a `PrivacyLevel` value. !!! question "What is the difference between anonymization and privacy?" 
diff --git a/docs/get-started/fabric_community.md b/docs/get-started/fabric_community.md new file mode 100644 index 00000000..f99ce378 --- /dev/null +++ b/docs/get-started/fabric_community.md @@ -0,0 +1,22 @@ +# Get started with Fabric Community + +Fabric Community is a SaaS version that allows you to explore all the functionalities of Fabric first-hand: ***free, forever, for everyone.*** You’ll be able to validate your data quality with automated profiling, unlock data sharing and improve your ML models with synthetic data, and increase your productivity with seamless integration: + +- Build 1 personal project; +- Create your first Data Catalog and benefit from automated data profiling; +- Train up to 2 synthetic data models and generate synthetic data for datasets with up to 50 columns and 100K rows; +- Optimize synthetic data quality for your use cases with an evaluation PDF report; +- Create 1 development environment (Labs) and integrate it with your familiar ML packages and workflows. + +## Register +To register for Fabric Community: + +- Access the Fabric Community Try Now page and create your YData account by submitting the form +- Check your email for your login credentials +- Log in to fabric.ydata.ai and enjoy! + +![Registration Process](../assets/overview/registration.png) + +Once you log in, you'll land on the Home page and can get started with your data preparation! + +![Welcome Screen](../assets/overview/fabric_welcome.png) diff --git a/docs/index.md b/docs/index.md index 4363443d..3d91373e 100644 --- a/docs/index.md +++ b/docs/index.md @@ -1,73 +1,85 @@ -

-

YData Logo

-

+# Welcome -[![pypi](https://img.shields.io/pypi/v/ydata-sdk)](https://pypi.org/project/ydata-sdk) -![Pythonversion](https://img.shields.io/badge/python-3.8%20%7C%203.9%20%7C%203.10-blue) -[![downloads](https://pepy.tech/badge/ydata-sdk/month)](https://pepy.tech/project/ydata-sdk) +YData Fabric is a **Data-Centric AI** development platform that accelerates AI development by helping data practitioners achieve production-quality data. -!!! note "YData SDK for improved data quality everywhere!" - *ydata-sdk* is here! Create a YData account so you can start using today! +Much as code quality is essential to successful software development, Fabric +accounts for the data quality requirements of data-driven applications. It introduces standards, processes, and +acceleration to empower data science, analytics, and data engineering teams. - [Create account](https://ydata.ai/ydata-fabric-free-trial){ .md-button .md-button--ydata .md-button--stretch} +

Data-Centric AI Approach

-## Overview -The *YData SDK* is an ecosystem of methods that allows users to, through a python interface, adopt a *Data-Centric* approach towards the AI development. The solution includes a set of integrated components for data ingestion, standardized data quality evaluation and data improvement, such as *synthetic data generation*, allowing an iterative improvement of the datasets used in high-impact business applications. +### Try Fabric +- Get started with Fabric Community -**Synthetic data** can be used as Machine Learning performance enhancer, to augment or mitigate the presence of bias in real data. Furthermore, it can be used as a Privacy Enhancing Technology, to enable data-sharing initiatives or even to fuel testing environments. -Under the YData-SDK hood, you can find a set of algorithms and metrics based on statistics and deep learning based techniques, that will help you to accelerate your data preparation. +## Why adopt YData Fabric? -## Current functionality +With Fabric, you can standardize the understanding of your data, quickly identify data quality issues, streamline and +version your data preparation workflows and finally leverage synthetic data for privacy-compliance or as a tool to boost ML +performance. Fabric is a development environment that supports a faster and easier process of preparing data for AI development. +Data practitioners are using Fabric to: -YData SDK is currently composed by the following main modules: +- Establish a centralized and collaborative repository for data projects. +- Create and share comprehensive documentation of data, encompassing data schema, structure, and personally identifiable information (PII). +- Prevent data quality issues with standardized data quality profiling, providing visual understanding and warnings on potential issues. +- Accelerate data preparation with customizable recipes. +- Improve machine learning performance with optimal data preparation through solutions such as synthetic data. 
+- Shorten access to data with privacy-compliant synthetic data generation. +- Build and streamline data preparation workflows effortlessly through a user-friendly drag-and-drop interface. +- Efficiently manage business rules, conduct comparisons, and implement version control for data workflows using pipelines. -* **Datasources** - - YData’s SDK includes several connectors for easy integration with existing data sources. It supports several storage types, like filesystems and RDBMS. Check the list of connectors. - - SDK’s Datasources run on top of Dask, which allows it to deal with not only small workloads but also larger volumes of data. +## 📝 Key features -* **Synthesizers** - - Simplified interface to train a generative model and learn in a data-driven manner the behavior, the patterns and original data distribution. Optimize your model for [privacy or utility](examples/synthesize_with_privacy_control.md) use-cases. - - From a trained synthesizer, you can generate synthetic samples as needed and parametrise the number of records needed. - - [Anonymization](examples/synthesize_with_anonymization.md) and [privacy](examples/synthesize_with_privacy_control.md) preserving capabilities to ensure that synthetic datasets does not contain Personal Identifiable Information (PII) and can safely be shared! - - [Conditional sampling](examples/synthesize_with_conditional_sampling.md) can be used to restrict the domain and values of specific features in the sampled data. +### Data Catalog +Fabric Data Catalog provides a centralized perspective on datasets on a per-project basis, optimizing data management +through seamless integration with the organization's existing data architectures via scalable connectors (e.g., MySQL, Google Cloud Storage, AWS S3). 
+It standardizes data quality profiling, streamlining the processes of efficient data cleaning and preparation, +while also automating the identification of Personally Identifiable Information (PII) to facilitate compliance with privacy regulations. -* **Synthetic data quality report** - *Coming soon* - - An extensive synthetic data quality report that measures 3 dimensions: privacy, utility and fidelity of the generated data. The report can be downloaded in PDF format for ease of sharing and compliance purposes or as a JSON to enable the integration in data flows. +Explore how a Data Catalog helps through a centralized repository of your datasets, schema validation, and automated data profiling. -* **Profiling** - *Coming soon* - - A set of metrics and algorithms summarizes datasets quality in three main dimensions: warnings, univariate analysis and a multivariate perspective. -## Supported data formats +### Labs +Fabric's Labs environments provide collaborative, scalable, and secure workspaces layered on a flexible infrastructure, enabling users to +seamlessly switch between CPUs and GPUs based on their computational needs. Labs are familiar environments that empower data developers with +powerful IDEs (Jupyter Notebooks, Visual Studio Code or H2O Flow) and a seamless experience with the tools they already love combined with YData's +cutting-edge SDK for data preparation. -=== "Tabular" - ![Tabular data synthesizer](assets/500x330/single_table.png){ align=right } - The **RegularSynthesizer** is perfect to synthesize high-dimensional data, that is time-indepentent with high quality results. +Learn how to use the Labs to generate synthetic data in a familiar Python interface. 
- [Know more](#){ .md-button .md-button--ydata} +### Synthetic data +Synthetic data, enabled by YData Fabric, provides data developers with user-friendly interfaces (UI and code) for +generating artificial datasets, offering a versatile solution across formats like tabular, time-series and multi-table datasets. +The generated synthetic data holds the same value as the original and aligns closely with specific business rules, contributing +to machine learning model enhancement, mitigation of privacy concerns and more robust data development. +Fabric offers synthetic data that is easy to adapt and configure, and allows customization of privacy-utility trade-offs. -=== "Time-Series" - ![Timeseries Synthesizer](assets/500x330/time_series.png){ align=left } - The **TimeSeriesSynthesizer** is perfect to synthesize both regularly and not evenly spaced time-series, from smart-sensors to stock. +Learn how to create high-quality synthetic data within a user-friendly UI using Fabric’s data synthesis flow. - [Know more](#){ .md-button .md-button--ydata} +### Pipelines +Fabric Pipelines streamlines data preparation workflows by automating, orchestrating, and optimizing data pipelines, +providing benefits such as flexibility, scalability, monitoring, and reproducibility for efficient and reliable data processing. +The intuitive drag-and-drop interface, leveraging Jupyter notebooks or Python scripts, expedites the pipeline setup process, +providing data developers with a quick and user-friendly experience. -=== "Transactional" - ![Transactional data synthesizer](assets/500x330/time_series.png){ align=right } - The **TimeSeriesSynthesizer** supports transactional data, known to have highly irregular time intervals between records and directional relations between entities. +Explore how you can leverage Fabric Pipelines to build versionable and reproducible data preparation workflows for ML development. 
- *Coming soon* +### Tutorials +To understand how to best apply Fabric to your use cases, start by exploring the following tutorials: - [Know more](#){ .md-button .md-button--ydata} +- Handling Imbalanced Data for Improved Fraud Detection
Learn how to implement high-performing fraud detection models by incorporating synthetic data to balance your datasets. -=== "Relational databases" - ![Relational databases synthesizer](assets/500x330/multi_table.png){ align=left } - The **MultiTableSynthesizer** is perfect to learn how to replicate the data within a relational database schema. +- Prediction with Quality Inspection
Learn how to develop data preparation workflows with automated data quality checks and Pipelines. - *Coming soon* +- Generating Synthetic Data for Financial Transactions
Learn how to use synthetic data generation to replicate your existing relational databases while ensuring referential integrity. - [Know more](#){ .md-button .md-button--ydata} + +You can find additional examples and use cases at YData Academy GitHub Repository. + +## 🙋 Support +Facing an issue? We’re committed to providing all the support you need to ensure a smooth experience using Fabric: + +- Create a support ticket: our team will help you move forward! +- Contact a Fabric specialist: for personalized guidance or full access to the platform diff --git a/docs/sdk/index.md b/docs/sdk/index.md new file mode 100644 index 00000000..e8448b46 --- /dev/null +++ b/docs/sdk/index.md @@ -0,0 +1,73 @@ +

+

YData Logo

+

+ +[![pypi](https://img.shields.io/pypi/v/ydata-sdk)](https://pypi.org/project/ydata-sdk) +![Pythonversion](https://img.shields.io/badge/python-3.8%20%7C%203.9%20%7C%203.10-blue) +[![downloads](https://pepy.tech/badge/ydata-sdk/month)](https://pepy.tech/project/ydata-sdk) + +!!! note "YData SDK for improved data quality everywhere!" + + *ydata-sdk* is here! Create a YData account so you can start using it today! + + [Create account](https://ydata.ai/ydata-fabric-free-trial){ .md-button .md-button--ydata .md-button--stretch} + +## Overview + +The *YData SDK* is an ecosystem of methods that allows users to adopt a *Data-Centric* approach to AI development through a Python interface. The solution includes a set of integrated components for data ingestion, standardized data quality evaluation and data improvement, such as *synthetic data generation*, allowing an iterative improvement of the datasets used in high-impact business applications. + +**Synthetic data** can be used as a Machine Learning performance enhancer, to augment real data or mitigate the presence of bias in it. Furthermore, it can be used as a Privacy Enhancing Technology, to enable data-sharing initiatives or even to fuel testing environments. + +Under the YData-SDK hood, you can find a set of algorithms and metrics based on statistical and deep learning techniques that will help you accelerate your data preparation. + +## Current functionality + +YData SDK is currently composed of the following main modules: + +* **Datasources** + - YData’s SDK includes several connectors for easy integration with existing data sources. It supports several storage types, like filesystems and RDBMS. Check the list of connectors. + - SDK’s Datasources run on top of Dask, which allows them to deal with not only small workloads but also larger volumes of data. 
+ +* **Synthesizers** + - Simplified interface to train a generative model and learn in a data-driven manner the behavior, the patterns and original data distribution. Optimize your model for [privacy or utility](../examples/synthesize_with_privacy_control.md) use-cases. + - From a trained synthesizer, you can generate synthetic samples as needed and parametrise the number of records needed. + - [Anonymization](../examples/synthesize_with_anonymization.md) and [privacy](../examples/synthesize_with_privacy_control.md) preserving capabilities to ensure that synthetic datasets do not contain Personally Identifiable Information (PII) and can safely be shared! + - [Conditional sampling](../examples/synthesize_with_conditional_sampling.md) can be used to restrict the domain and values of specific features in the sampled data. + +* **Synthetic data quality report** + *Coming soon* + - An extensive synthetic data quality report that measures 3 dimensions: privacy, utility and fidelity of the generated data. The report can be downloaded in PDF format for ease of sharing and compliance purposes or as a JSON to enable the integration in data flows. + +* **Profiling** + *Coming soon* + - A set of metrics and algorithms that summarizes dataset quality in three main dimensions: warnings, univariate analysis and a multivariate perspective. + +## Supported data formats + +=== "Tabular" + ![Tabular data synthesizer](../assets/500x330/single_table.png){ align=right } + The **RegularSynthesizer** is perfect for synthesizing high-dimensional, time-independent data with high-quality results. + + [Know more](#){ .md-button .md-button--ydata} + +=== "Time-Series" + ![Timeseries Synthesizer](../assets/500x330/time_series.png){ align=left } + The **TimeSeriesSynthesizer** is perfect for synthesizing both regularly and irregularly spaced time-series, from smart sensors to stocks. 
+ + [Know more](#){ .md-button .md-button--ydata} + +=== "Transactional" + ![Transactional data synthesizer](../assets/500x330/time_series.png){ align=right } + The **TimeSeriesSynthesizer** supports transactional data, known to have highly irregular time intervals between records and directional relations between entities. + + *Coming soon* + + [Know more](#){ .md-button .md-button--ydata} + +=== "Relational databases" + ![Relational databases synthesizer](../assets/500x330/multi_table.png){ align=left } + The **MultiTableSynthesizer** is perfect to learn how to replicate the data within a relational database schema. + + *Coming soon* + + [Know more](#){ .md-button .md-button--ydata} diff --git a/docs/getting-started/installation.md b/docs/sdk/installation.md similarity index 100% rename from docs/getting-started/installation.md rename to docs/sdk/installation.md diff --git a/docs/modules/connectors.md b/docs/sdk/modules/connectors.md similarity index 100% rename from docs/modules/connectors.md rename to docs/sdk/modules/connectors.md diff --git a/docs/modules/synthetic_data.md b/docs/sdk/modules/synthetic_data.md similarity index 100% rename from docs/modules/synthetic_data.md rename to docs/sdk/modules/synthetic_data.md diff --git a/docs/getting-started/quickstart.md b/docs/sdk/quickstart.md similarity index 100% rename from docs/getting-started/quickstart.md rename to docs/sdk/quickstart.md diff --git a/docs/reference/api/common/client.md b/docs/sdk/reference/api/common/client.md similarity index 100% rename from docs/reference/api/common/client.md rename to docs/sdk/reference/api/common/client.md diff --git a/docs/reference/api/common/types.md b/docs/sdk/reference/api/common/types.md similarity index 100% rename from docs/reference/api/common/types.md rename to docs/sdk/reference/api/common/types.md diff --git a/docs/reference/api/connectors/connector.md b/docs/sdk/reference/api/connectors/connector.md similarity index 100% rename from 
docs/reference/api/connectors/connector.md rename to docs/sdk/reference/api/connectors/connector.md diff --git a/docs/reference/api/datasources/datasource.md b/docs/sdk/reference/api/datasources/datasource.md similarity index 100% rename from docs/reference/api/datasources/datasource.md rename to docs/sdk/reference/api/datasources/datasource.md diff --git a/docs/reference/api/datasources/metadata.md b/docs/sdk/reference/api/datasources/metadata.md similarity index 100% rename from docs/reference/api/datasources/metadata.md rename to docs/sdk/reference/api/datasources/metadata.md diff --git a/docs/reference/api/index.md b/docs/sdk/reference/api/index.md similarity index 100% rename from docs/reference/api/index.md rename to docs/sdk/reference/api/index.md diff --git a/docs/reference/api/synthesizers/base.md b/docs/sdk/reference/api/synthesizers/base.md similarity index 100% rename from docs/reference/api/synthesizers/base.md rename to docs/sdk/reference/api/synthesizers/base.md diff --git a/docs/reference/api/synthesizers/regular.md b/docs/sdk/reference/api/synthesizers/regular.md similarity index 100% rename from docs/reference/api/synthesizers/regular.md rename to docs/sdk/reference/api/synthesizers/regular.md diff --git a/docs/reference/api/synthesizers/timeseries.md b/docs/sdk/reference/api/synthesizers/timeseries.md similarity index 100% rename from docs/reference/api/synthesizers/timeseries.md rename to docs/sdk/reference/api/synthesizers/timeseries.md diff --git a/docs/reference/changelog.md b/docs/sdk/reference/changelog.md similarity index 100% rename from docs/reference/changelog.md rename to docs/sdk/reference/changelog.md diff --git a/mkdocs.yml b/mkdocs.yml index c0322012..e2c83ac2 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -1,45 +1,47 @@ -site_name: "YData SDK" +site_name: "YData Fabric" repo_url: https://github.com/ydataai/ydata-sdk repo_name: ydataai/ydata-sdk edit_uri: "" dev_addr: 0.0.0.0:1235 site_dir: "static/docs" nav: - - Getting started: - - 
'index.md' - - Overview: 'index.md' - - Installation: 'getting-started/installation.md' - - Quickstart: 'getting-started/quickstart.md' - - Examples: - - Generate Tabular Data: "examples/synthesize_tabular_data.md" - - Generate Time-Series Data: "examples/synthesize_timeseries_data.md" - - Anonymization: "examples/synthesize_with_anonymization.md" - - Privacy Level: "examples/synthesize_with_privacy_control.md" - - Conditional Sampling: "examples/synthesize_with_conditional_sampling.md" - - Components: - - Connectors: - - "modules/connectors.md" - - Reference: - - Changelog: 'reference/changelog.md' - - API: - - Client: 'reference/api/common/client.md' - - Connectors: - - 'Connector': 'reference/api/connectors/connector.md' - - DataSources: - - 'DataSource': 'reference/api/datasources/datasource.md' - - 'Metadata': 'reference/api/datasources/metadata.md' - - Synthesizers: - - Synthesizer: 'reference/api/synthesizers/base.md' - - Regular: 'reference/api/synthesizers/regular.md' - - TimeSeries: 'reference/api/synthesizers/timeseries.md' - - Types: 'reference/api/common/types.md' + - Welcome: 'index.md' + - Get started with Fabric: "get-started/fabric_community.md" + - SDK: + - Overview: "sdk/index.md" + - Installation: 'sdk/installation.md' + - Quickstart: 'sdk/quickstart.md' + - Components: + - "sdk/modules/connectors.md" + - Examples: + - Generate Tabular Data: "examples/synthesize_tabular_data.md" + - Generate Time-Series Data: "examples/synthesize_timeseries_data.md" + - Anonymization: "examples/synthesize_with_anonymization.md" + - Privacy Level: "examples/synthesize_with_privacy_control.md" + - Conditional Sampling: "examples/synthesize_with_conditional_sampling.md" + - Reference: + - Changelog: 'sdk/reference/changelog.md' + - API: + - Client: 'sdk/reference/api/common/client.md' + - Connectors: + - 'Connector': 'sdk/reference/api/connectors/connector.md' + - DataSources: + - 'DataSource': 'sdk/reference/api/datasources/datasource.md' + - 'Metadata': 
'sdk/reference/api/datasources/metadata.md' + - Synthesizers: + - Synthesizer: 'sdk/reference/api/synthesizers/base.md' + - Regular: 'sdk/reference/api/synthesizers/regular.md' + - TimeSeries: 'sdk/reference/api/synthesizers/timeseries.md' + - Types: 'sdk/reference/api/common/types.md' + theme: name: material language: en font: 'Roboto' palette: - - media: "(prefers-color-scheme: light)" - scheme: ydata + - scheme: ydata + media: "(prefers-color-scheme: light)" + primary: custom logo: 'https://assets.ydata.ai/logo_notext_nbg.png' features: - content.code.annotate @@ -135,6 +137,6 @@ plugins: - http://pandas.pydata.org/pandas-docs/stable/objects.inv setup_commands: - import sys - - sys.path.append('../src') + - sys.path.append('.src') merge_init_into_class: yes show_submodules: no diff --git a/templates/config.md b/templates/config.md new file mode 100644 index 00000000..e69de29b
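
The `nav` restructuring in `mkdocs.yml` above moves most pages under `sdk/` while the example pages stay under `examples/`. One possible way to sanity-check the new paths is sketched below. This is only an illustration under stated assumptions: the helper names `collect_nav_paths` and `missing_pages` are hypothetical and not part of MkDocs or ydata-sdk, and `NAV_EXCERPT` is a hand-copied subset of the new `nav` rather than the parsed file.

```python
from pathlib import Path


def collect_nav_paths(nav):
    """Recursively collect every page path from an mkdocs-style nav structure.

    Hypothetical helper: `nav` is a nested mix of lists, dicts and strings,
    mirroring how the YAML `nav` section would look once loaded.
    """
    if isinstance(nav, str):
        return [nav]
    paths = []
    if isinstance(nav, list):
        for item in nav:
            paths.extend(collect_nav_paths(item))
    elif isinstance(nav, dict):
        for value in nav.values():
            paths.extend(collect_nav_paths(value))
    return paths


def missing_pages(nav, docs_dir):
    """Return the nav paths that do not exist as files under docs_dir."""
    root = Path(docs_dir)
    return [p for p in collect_nav_paths(nav) if not (root / p).is_file()]


# Hand-copied excerpt of the new nav from this diff (assumption: not exhaustive).
NAV_EXCERPT = [
    {"Welcome": "index.md"},
    {"Get started with Fabric": "get-started/fabric_community.md"},
    {"SDK": [
        {"Overview": "sdk/index.md"},
        {"Installation": "sdk/installation.md"},
        {"Quickstart": "sdk/quickstart.md"},
        {"Components": ["sdk/modules/connectors.md"]},
    ]},
]

if __name__ == "__main__":
    # Run from the repository root; prints any nav entry without a matching file.
    for path in missing_pages(NAV_EXCERPT, "docs"):
        print(f"missing: docs/{path}")
```

Running a check like this after the renames would flag any `nav` entry that still points at a pre-move location such as `getting-started/installation.md`.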