From 676e0ffd0378a782723200f6f1d8af194145f33c Mon Sep 17 00:00:00 2001
From: Dong Lin
Date: Sat, 5 Aug 2023 10:24:32 +0800
Subject: [PATCH] [hotfix] Improve README

---
 README.md                                     | 25 ++++----
 docs/content/README.md                        | 11 ++--
 docs/content/basic-concepts.md                | 54 +----------------
 docs/content/deep-dive/optimizations.md       |  2 +-
 docs/content/engines/README.md                |  1 +
 docs/content/engines/flink.md                 |  8 ++-
 docs/content/engines/local.md                 | 16 +++++
 docs/content/engines/spark.md                 |  8 ++-
 docs/content/expression-language/README.md    | 10 ----
 docs/content/feathub-sdk/README.md            | 17 ++++++
 .../aggregation_functions.md                  |  0
 .../dtypes.md                                 |  0
 docs/content/feathub-sdk/feature-view.md      | 59 +++++++++++++++++++
 .../functions.md                              |  0
 docs/content/metric-stores/README.md          |  1 +
 docs/content/metric-stores/metrics.md         | 50 ++++++++++++++++
 docs/content/metric-stores/overview.md        | 50 ----------------
 17 files changed, 177 insertions(+), 135 deletions(-)
 create mode 100644 docs/content/engines/local.md
 delete mode 100644 docs/content/expression-language/README.md
 create mode 100644 docs/content/feathub-sdk/README.md
 rename docs/content/{expression-language => feathub-sdk}/aggregation_functions.md (100%)
 rename docs/content/{expression-language => feathub-sdk}/dtypes.md (100%)
 create mode 100644 docs/content/feathub-sdk/feature-view.md
 rename docs/content/{expression-language => feathub-sdk}/functions.md (100%)
 create mode 100644 docs/content/metric-stores/metrics.md

diff --git a/README.md b/README.md
index 64ffd717..9754a255 100644
--- a/README.md
+++ b/README.md
@@ -37,10 +37,10 @@ engines, connectors, expression language, and more.
 
 Similar to other feature stores, FeatHub provides the following core benefits:
 
-- **Simplified feature development**: The Python SDK provided by FeatHub makes
-it easy to develop features without worrying about point-in-time correctness.
-This helps to avoid training-serving skew, which can negatively impact the
-accuracy of machine learning models.
+- **Simplified feature development**: The Pythonic [FeatHub
+SDK](docs/content/feathub-sdk) makes it easy to develop features without worrying
+about point-in-time correctness. This helps to avoid training-serving skew,
+which can negatively impact the accuracy of machine learning models.
 - **Faster feature deployment**: FeatHub automatically compiles user-specified
 declarative feature definitions into performant distributed ETL jobs using
 state-of-the-art computation engines, such as Flink or Spark. This speeds up
@@ -78,10 +78,12 @@ Flink](docs/content/engines/flink.md) for real-time features with low latency,
 throughput, and FeatureService for computing features online when the request is
 received.
 
-- **Extensible framework**: FeatHub's Python SDK is declarative and decoupled
-from the APIs of the underlying computation engines, providing flexibility and
-avoiding lock-in. This allows for the support of additional computation engines
-in the future.
+- **Extensible framework**: FeatHub's Python SDK is decoupled from the APIs of
+the underlying computation engines, providing flexibility and avoiding lock-in.
+This allows for the support of additional computation engines in the future.
+For example, FeatHub supports [Local
+Processor](#quickstart-using-local-processor) that is implemented using Pandas
+library, in addition to its support for Apache Flink and Apache Spark.
 
 Usability is a crucial factor that sets feature store projects apart. Our SDK is
 designed to be **Pythonic**, **declarative**, intuitive, and highly expressive to
@@ -120,7 +122,7 @@ The workflow of defining, computing, and serving features using FeatHub is illus
 
 
 
-See [Basic Concepts](docs/content/concepts/basic-concepts.md) for more details about the key components in FeatHub.
## Supported Compute Engines @@ -137,7 +139,8 @@ FeatHub supports the following compute engines to execute feature ETL pipeline: ## FeatHub SDK Highlights The following examples demonstrate how to define a variety of features -concisely using FeatHub SDK. +concisely using FeatHub SDK. See [FeatHub +SDK](docs/content/feathub-sdk) for more details. See [NYC Taxi Demo](docs/examples/nyc_taxi.ipynb) to learn more about how to define, generate and serve features using FeatHub SDK. @@ -193,8 +196,6 @@ f_trip_time_duration = Feature( ) ``` -See [FeatHub Expression Language](docs/content/concepts/expression-language.md) for more details. - - Define a feature via Python UDF: ```python diff --git a/docs/content/README.md b/docs/content/README.md index bc80062e..e88a5ae4 100644 --- a/docs/content/README.md +++ b/docs/content/README.md @@ -9,14 +9,16 @@ applications. - [Flink Session Mode Quickstart](quickstarts/flink-session-mode.md) - [Spark Client Mode Quickstart](quickstarts/spark-client-mode.md) - [Basic Concepts](basic-concepts.md) -- [FeatHub Expression Language](expression-language) - - [Data Types and Reserved Keywords](expression-language/dtypes.md) - - [Built-in Operators and Functions](expression-language/functions.md) - - [Built-in Aggregations Functions](expression-language/aggregation_functions.md) +- [FeatHub SDK](feathub-sdk) + - [FeatureView and Transformation](feathub-sdk/feature-view.md) + - [Data Types and Reserved Keywords](feathub-sdk/dtypes.md) + - [Built-in Operators and Functions](feathub-sdk/functions.md) + - [Built-in Aggregations Functions](feathub-sdk/aggregation_functions.md) - [Common Configurations](configurations.md) - [Compute Engines](engines) - [Apache Flink](engines/flink.md) - [Apache Spark](engines/spark.md) + - [Local Processor](engines/local.md) - [Feature Registries](registries) - [MySQL](registries/mysql.md) - [Connectors](connectors) @@ -31,6 +33,7 @@ applications. 
   - [Built-in Optimizations](deep-dive/optimizations.md)
 - [Metric Stores](metric-stores)
   - [Overview](metric-stores/overview.md)
+  - [Built-in Metrics](metric-stores/metrics.md)
   - [Prometheus](metric-stores/prometheus.md)
 - [How To](how-to)
   - [Deploy FeatHub Job on Alibaba Cloud](how-to/deploy-on-alibaba-cloud.md)
diff --git a/docs/content/basic-concepts.md b/docs/content/basic-concepts.md
index 48780c7d..3a6292a0 100644
--- a/docs/content/basic-concepts.md
+++ b/docs/content/basic-concepts.md
@@ -32,61 +32,11 @@ which we can construct FeatureTable in FeatHub.
 
 ### FeatureView
 
-A `FeatureView` provides metadata to derive a table of feature values from
-other tables. FeatHub currently supports the following types of FeatureViews.
-
-- [DerivedFeatureView](https://github.com/alibaba/feathub/blob/master/python/feathub/feature_views/derived_feature_view.py)
-  derives features by applying the given transformations on an existing table.
-  It supports per-row transformation, over window transformation and table join.
-  It does not support sliding window transformation.
-- [SlidingFeatureView](https://github.com/alibaba/feathub/blob/master/python/feathub/feature_views/sql_feature_view.py)
-  derives features by applying the given transformations on an existing table.
-  It supports per-row transformation and sliding window transformation. It does
-  not support join or over window transformation.
-- [OnDemandFeatureView](https://github.com/alibaba/feathub/blob/master/python/feathub/feature_views/on_demand_feature_view.py)
-  derives features by joining online request with features from tables in online
-  feature stores. It supports per-row transformation and join with tables in
-  online stores. It does not support over window transformation or sliding window
-  transformation.
-- [SqlFeatureView](https://github.com/alibaba/feathub/blob/master/python/feathub/feature_views/sql_feature_view.py)
-  derives features by evaluating a given SQL statement. Currently, its
-  semantics depends on the processor used during deployment. We plan to make it
-  processor-agnostic in the future to ensure consistent semantics regardless of
-  processor choice.
-
-`FeatureView` provides APIs to specify and access `Feature`s. Each `Feature` is
-defined by the following metadata:
-- `name`: a string that uniquely identifies this feature in the parent table.
-- `dtype`: the data type of this feature's values.
-- `transform`: A declarative definition of how to derive this feature's values.
-- `keys`: an optional list of strings, corresponding to the names of fields in
-  the parent table necessary to interpret this feature's values. If it is
-  specified, it is used as the join key when FeatHub joins this feature onto
-  another table.
+See [here](feathub-sdk/feature-view.md) for more details.
 
 ## Transformation - Declarative Definition of Feature Computation
 
-A `Transformation` defines how to derive a new feature from existing features.
-FeatHub currently supports the following types of Transformations.
-
-- [ExpressionTransform](https://github.com/alibaba/feathub/blob/master/python/feathub/feature_views/transforms/expression_transform.py)
-  derives feature values by applying FeatHub expression on one row of the parent
-  table at a time. FeatHub expression language is a declarative language with
-  build-in functions. See [FeatHub expression](feathub_expression.md) for more
-  information.
-- [OverWindowTransform](https://github.com/alibaba/feathub/blob/master/python/feathub/feature_views/transforms/expression_transform.py)
-  derives feature values by applying FeatHub expression and aggregation function
-  on multiple rows of a table at a time.
-- [SlidingWindowTransform](https://github.com/alibaba/feathub/blob/master/python/feathub/feature_views/transforms/sliding_window_transform.py)
-  derives feature values by applying FeatHub expression and aggregation function
-  on multiple rows in a sliding window.
-- [JoinTransform](https://github.com/alibaba/feathub/blob/master/python/feathub/feature_views/transforms/expression_transform.py)
-  derives feature values by joining parent table with a feature from another
-  table.
-- [PythonUdfTransform](https://github.com/alibaba/feathub/blob/master/python/feathub/feature_views/transforms/python_udf_transform.py)
-  derives feature values by applying a Python UDF on one row of the parent table
-  at a time.
-
+See [here](feathub-sdk/feature-view.md) for more details.
 
 ## Processor - Pluggable Compute Engine for Feature ETL
 
diff --git a/docs/content/deep-dive/optimizations.md b/docs/content/deep-dive/optimizations.md
index 5c44f63b..11946592 100644
--- a/docs/content/deep-dive/optimizations.md
+++ b/docs/content/deep-dive/optimizations.md
@@ -52,5 +52,5 @@ significantly reduce I/O costs and network bandwidth usage.
 
 This optimization is enabled by default. See
 [SlidingFeatureView](../configurations.md#slidingfeatureview) configuration for
-more detail.
+more details.
 
diff --git a/docs/content/engines/README.md b/docs/content/engines/README.md
index db808ece..01fa7c76 100644
--- a/docs/content/engines/README.md
+++ b/docs/content/engines/README.md
@@ -2,4 +2,5 @@
 
 - [Apache Flink](flink.md)
 - [Apache Spark](spark.md)
+- [Local Processor](local.md)
 
diff --git a/docs/content/engines/flink.md b/docs/content/engines/flink.md
index b03e34aa..fbc752bb 100644
--- a/docs/content/engines/flink.md
+++ b/docs/content/engines/flink.md
@@ -1,8 +1,10 @@
 # Apache Flink
 
-The FlinkProcessor does feature ETL using Flink as the compute engine. In the
-following sections we describe the deployment modes supported by FlinkProcessor
-and the configuration keys accepted by each mode.
+The
+[FlinkProcessor](https://github.com/alibaba/feathub/blob/master/python/feathub/processors/flink/flink_processor.py)
+does feature ETL using Flink as the compute engine. In the following sections we
+describe the deployment modes supported by FlinkProcessor and the configuration
+keys accepted by each mode.
 
 ## Supported Versions
 
diff --git a/docs/content/engines/local.md b/docs/content/engines/local.md
new file mode 100644
index 00000000..7ff76c90
--- /dev/null
+++ b/docs/content/engines/local.md
@@ -0,0 +1,16 @@
+# Local Processor
+
+[Local
+Processor](https://github.com/alibaba/feathub/blob/master/python/feathub/processors/local/local_processor.py)
+utilizes CPUs on the local machine to compute features and uses Pandas
+DataFrame to store tabular data in memory. It is useful for doing experiments
+on a local machine without having to deploy and connect to a distributed
+Flink/Spark cluster.
+
+This processor is implemented using the Pandas library and computes features in
+the given Python process. If the feathub-nightly[spark] is installed, the Local
+processor can utilize Spark's local mode for accessing storages (e.g. HDFS) that
+it otherwise would not support.
+
+See [here](../../../README.md#quickstart) for an example of using Local Processor.
+
diff --git a/docs/content/engines/spark.md b/docs/content/engines/spark.md
index c9307c7b..6cb82060 100644
--- a/docs/content/engines/spark.md
+++ b/docs/content/engines/spark.md
@@ -1,8 +1,10 @@
 # Apache Spark
 
-The SparkProcessor does feature ETL using Spark as the compute engine. In the
-following sections we describe the deployment modes supported by SparkProcessor
-and the configuration keys accepted by each mode.
+The
+[SparkProcessor](https://github.com/alibaba/feathub/blob/master/python/feathub/processors/spark/spark_processor.py)
+does feature ETL using Spark as the compute engine. In the following sections we
+describe the deployment modes supported by SparkProcessor and the configuration
+keys accepted by each mode.
 
 ## Supported Versions
 
diff --git a/docs/content/expression-language/README.md b/docs/content/expression-language/README.md
deleted file mode 100644
index 9d9a0149..00000000
--- a/docs/content/expression-language/README.md
+++ /dev/null
@@ -1,10 +0,0 @@
-# FeatHub Expression Language
-
-FeatHub expression language is a declarative language with built-in functions.
-It can be used to describe how to derive a new feature value from existing
-features' values.
-
-- [Data Types and Reserved Keywords](dtypes.md)
-- [Built-in Operators and Functions](functions.md)
-- [Built-in Aggregations Functions](aggregation_functions.md)
-
diff --git a/docs/content/feathub-sdk/README.md b/docs/content/feathub-sdk/README.md
new file mode 100644
index 00000000..bbaa0c42
--- /dev/null
+++ b/docs/content/feathub-sdk/README.md
@@ -0,0 +1,17 @@
+# FeatHub SDK
+
+FeatHub empowers users to define features using a Pythonic and declarative SDK
+with built-in functions, allowing recursive feature creation based on existing
+definitions.
+
+Compared to SQL-based SDKs, FeatHub's Pythonic SDK seamlessly integrates with
+Python-focused machine learning libraries like scikit-learn and PyTorch.
+Python's expressiveness, including for-loops and if/else statements, leads to
+more concise and readable feature definition code, especially for a multitude
+of similar patterned features.
+
+- [FeatureView and Transformation](feature-view.md)
+- [Data Types and Reserved Keywords](dtypes.md)
+- [Built-in Operators and Functions](functions.md)
+- [Built-in Aggregations Functions](aggregation_functions.md)
+
diff --git a/docs/content/expression-language/aggregation_functions.md b/docs/content/feathub-sdk/aggregation_functions.md
similarity index 100%
rename from docs/content/expression-language/aggregation_functions.md
rename to docs/content/feathub-sdk/aggregation_functions.md
diff --git a/docs/content/expression-language/dtypes.md b/docs/content/feathub-sdk/dtypes.md
similarity index 100%
rename from docs/content/expression-language/dtypes.md
rename to docs/content/feathub-sdk/dtypes.md
diff --git a/docs/content/feathub-sdk/feature-view.md b/docs/content/feathub-sdk/feature-view.md
new file mode 100644
index 00000000..0da8ba82
--- /dev/null
+++ b/docs/content/feathub-sdk/feature-view.md
@@ -0,0 +1,59 @@
+# FeatureView - A Table of Features
+
+A `FeatureView` provides metadata to derive a table of feature values from
+other tables. FeatHub currently supports the following types of FeatureViews.
+
+- [DerivedFeatureView](https://github.com/alibaba/feathub/blob/master/python/feathub/feature_views/derived_feature_view.py)
+  derives features by applying the given transformations on an existing table.
+  It supports per-row transformation, over window transformation and table join.
+  It does not support sliding window transformation.
+- [SlidingFeatureView](https://github.com/alibaba/feathub/blob/master/python/feathub/feature_views/sliding_feature_view.py)
+  derives features by applying the given transformations on an existing table.
+  It supports per-row transformation and sliding window transformation. It does
+  not support join or over window transformation.
+- [OnDemandFeatureView](https://github.com/alibaba/feathub/blob/master/python/feathub/feature_views/on_demand_feature_view.py)
+  derives features by joining online request with features from tables in online
+  feature stores. It supports per-row transformation and join with tables in
+  online stores. It does not support over window transformation or sliding window
+  transformation.
+- [SqlFeatureView](https://github.com/alibaba/feathub/blob/master/python/feathub/feature_views/sql_feature_view.py)
+  derives features by evaluating a given SQL statement. Currently, its
+  semantics depends on the processor used during deployment. We plan to make it
+  processor-agnostic in the future to ensure consistent semantics regardless of
+  processor choice.
+
+`FeatureView` provides APIs to specify and access `Feature`s. Each `Feature` is
+defined by the following metadata:
+- `name`: a string that uniquely identifies this feature in the parent table.
+- `dtype`: the data type of this feature's values.
+- `transform`: A declarative definition of how to derive this feature's values.
+- `keys`: an optional list of strings, corresponding to the names of fields in
+  the parent table necessary to interpret this feature's values. If it is
+  specified, it is used as the join key when FeatHub joins this feature onto
+  another table.
+
+# Transformation - Declarative Definition of Feature Computation
+
+A `Transformation` defines how to derive a new feature from existing features.
+FeatHub currently supports the following types of Transformations.
+
+- [ExpressionTransform](https://github.com/alibaba/feathub/blob/master/python/feathub/feature_views/transforms/expression_transform.py)
+  derives feature values by applying FeatHub expression on one row of the
+  parent table at a time. The FeatHub expression language is a declarative
+  language, sharing a syntax and grammar reminiscent of the SQL SELECT clause.
+  See [here](./) for a comprehensive list of built-in data types, functions
+  and operators.
+- [OverWindowTransform](https://github.com/alibaba/feathub/blob/master/python/feathub/feature_views/transforms/over_window_transform.py)
+  derives feature values by applying FeatHub expression and aggregation function
+  on multiple rows of a table at a time.
+- [SlidingWindowTransform](https://github.com/alibaba/feathub/blob/master/python/feathub/feature_views/transforms/sliding_window_transform.py)
+  derives feature values by applying FeatHub expression and aggregation function
+  on multiple rows in a sliding window.
+- [JoinTransform](https://github.com/alibaba/feathub/blob/master/python/feathub/feature_views/transforms/join_transform.py)
+  derives feature values by joining parent table with a feature from another
+  table.
+- [PythonUdfTransform](https://github.com/alibaba/feathub/blob/master/python/feathub/feature_views/transforms/python_udf_transform.py)
+  derives feature values by applying a Python UDF on one row of the parent table
+  at a time.
+
+
diff --git a/docs/content/expression-language/functions.md b/docs/content/feathub-sdk/functions.md
similarity index 100%
rename from docs/content/expression-language/functions.md
rename to docs/content/feathub-sdk/functions.md
diff --git a/docs/content/metric-stores/README.md b/docs/content/metric-stores/README.md
index 63d06158..5a939d41 100644
--- a/docs/content/metric-stores/README.md
+++ b/docs/content/metric-stores/README.md
@@ -1,4 +1,5 @@
 # Metric Stores
 
 - [Overview](overview.md)
+- [Built-in Metrics](metrics.md)
 - [Prometheus](prometheus.md)
diff --git a/docs/content/metric-stores/metrics.md b/docs/content/metric-stores/metrics.md
new file mode 100644
index 00000000..fa3555ac
--- /dev/null
+++ b/docs/content/metric-stores/metrics.md
@@ -0,0 +1,50 @@
+# Built-in Metrics
+
+Below are Feathub's built-in metrics's metric types, their parameters and their
+exposed tags.
+
+## Count
+
+Count is a metric that shows the number of features. It has the following
+parameters:
+
+- filter_expr: Optional with None as the default value. If it is not None, it
+  represents a partial FeatHub expression which evaluates to a boolean value.
+  The partial Feathub expression should be a binary operator whose left child is
+  absent and would be filled in with the host feature name. For example, "IS
+  NULL" will be enriched into "{feature_name} IS NULL". Only features that
+  evaluate this expression into True will be considered when computing the
+  metric.
+- window_size: Optional with 0 as the default value. The time range to compute
+  the metric. It should be zero or a positive time span. If it is zero, the
+  metric will be computed from all feature values that have been processed since
+  the Feathub job is created.
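+
+It exposes the following metric-specific tags:
+
+- metric_type: "count"
+- filter_expr: The value of the filter_expr parameter.
+- window_size_sec: The value of the window_size parameter in seconds.
+
+## Ratio
+
+Ratio is a metric that shows the proportion of the number features that meets
+filter_expr to the number of all features. It has the following parameters:
+
+- filter_expr: A partial FeatHub expression which evaluates to a boolean value.
+  The partial Feathub expression should be a binary operator whose left child is
+  absent and would be filled in with the host feature name. For example, "IS
+  NULL" will be enriched into "{feature_name} IS NULL". Only features that
+  evaluate this expression into True will be considered when computing the
+  metric.
+- window_size: Optional with 0 as the default value. The time range to compute
+  the metric. It should be zero or a positive time span. If it is zero, the
+  metric will be computed from all feature values that have been processed since
+  the Feathub job is created.
+
+It exposes the following metric-specific tags:
+
+- metric_type: "ratio"
+- filter_expr: The value of the filter_expr parameter.
+- window_size_sec: The value of the window_size parameter in seconds.
+
diff --git a/docs/content/metric-stores/overview.md b/docs/content/metric-stores/overview.md
index 9a029dfb..f2d8092f 100644
--- a/docs/content/metric-stores/overview.md
+++ b/docs/content/metric-stores/overview.md
@@ -92,53 +92,3 @@ document of each metric store for details.
 - feature_name: The name of the host feature.
 - other metric-specific tags.
 
-## Built-in metrics
-
-Below are Feathub's built-in metrics's metric types, their parameters and their
-exposed tags.
-
-### Count
-
-Count is a metric that shows the number of features. It has the following
-parameters:
-
-- filter_expr: Optional with None as the default value. If it is not None, it
-  represents a partial FeatHub expression which evaluates to a boolean value.
-  The partial Feathub expression should be a binary operator whose left child is
-  absent and would be filled in with the host feature name. For example, "IS
-  NULL" will be enriched into "{feature_name} IS NULL". Only features that
-  evaluate this expression into True will be considered when computing the
-  metric.
-- window_size: Optional with 0 as the default value. The time range to compute
-  the metric. It should be zero or a positive time span. If it is zero, the
-  metric will be computed from all feature values that have been processed since
-  the Feathub job is created.
-
-It exposes the following metric-specific tags:
-
-- metric_type: "count"
-- filter_expr: The value of the filter_expr parameter.
-- window_size_sec: The value of the window_size parameter in seconds.
-
-### Ratio
-
-Ratio is a metric that shows the proportion of the number features that meets
-filter_expr to the number of all features. It has the following parameters:
-
-- filter_expr: A partial FeatHub expression which evaluates to a boolean value.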
-  The partial Feathub expression should be a binary operator whose left child is
-  absent and would be filled in with the host feature name. For example, "IS
-  NULL" will be enriched into "{feature_name} IS NULL". Only features that
-  evaluate this expression into True will be considered when computing the
-  metric.
-- window_size: Optional with 0 as the default value. The time range to compute
-  the metric. It should be zero or a positive time span. If it is zero, the
-  metric will be computed from all feature values that have been processed since
-  the Feathub job is created.
-
-It exposes the following metric-specific tags:
-
-- metric_type: "ratio"
-- filter_expr: The value of the filter_expr parameter.
-- window_size_sec: The value of the window_size parameter in seconds.
-