[hotfix] Improve README
lindong28 committed Aug 5, 2023
1 parent 8d0b35f commit 676e0ff
Showing 17 changed files with 177 additions and 135 deletions.
25 changes: 13 additions & 12 deletions README.md
Original file line number Diff line number Diff line change
@@ -37,10 +37,10 @@ engines, connectors, expression language, and more.

Similar to other feature stores, FeatHub provides the following core benefits:

- **Simplified feature development**: The Python SDK provided by FeatHub makes
it easy to develop features without worrying about point-in-time correctness.
This helps to avoid training-serving skew, which can negatively impact the
accuracy of machine learning models.
- **Simplified feature development**: The Pythonic [FeatHub
SDK](docs/content/feathub-sdk) makes it easy to develop features without worrying
about point-in-time correctness. This helps to avoid training-serving skew,
which can negatively impact the accuracy of machine learning models.
- **Faster feature deployment**: FeatHub automatically compiles user-specified
declarative feature definitions into performant distributed ETL jobs using
state-of-the-art computation engines, such as Flink or Spark. This speeds up
@@ -78,10 +78,12 @@ Flink](docs/content/engines/flink.md) for real-time features with low latency,
throughput, and FeatureService for computing features online when the request
is received.

- **Extensible framework**: FeatHub's Python SDK is declarative and decoupled
from the APIs of the underlying computation engines, providing flexibility and
avoiding lock-in. This allows for the support of additional computation engines
in the future.
- **Extensible framework**: FeatHub's Python SDK is decoupled from the APIs of
the underlying computation engines, providing flexibility and avoiding lock-in.
This allows for the support of additional computation engines in the future.
For example, FeatHub supports a [Local
Processor](#quickstart-using-local-processor) that is implemented using the Pandas
library, in addition to its support for Apache Flink and Apache Spark.

Usability is a crucial factor that sets feature store projects apart. Our SDK is
designed to be **Pythonic**, **declarative**, intuitive, and highly expressive to
@@ -120,7 +122,7 @@ The workflow of defining, computing, and serving features using FeatHub is illus

<img src="docs/static/img/architecture_2.png" width="70%" height="auto">

See [Basic Concepts](docs/content/concepts/basic-concepts.md) for more details about the key components in FeatHub.
See [Basic Concepts](docs/content/basic-concepts.md) for more details about the key components in FeatHub.

## Supported Compute Engines

@@ -137,7 +139,8 @@ FeatHub supports the following compute engines to execute feature ETL pipeline:
## FeatHub SDK Highlights

The following examples demonstrate how to define a variety of features
concisely using FeatHub SDK.
concisely using FeatHub SDK. See [FeatHub
SDK](docs/content/feathub-sdk) for more details.

See [NYC Taxi Demo](docs/examples/nyc_taxi.ipynb) to learn more about how to
define, generate and serve features using FeatHub SDK.
@@ -193,8 +196,6 @@ f_trip_time_duration = Feature(
)
```

See [FeatHub Expression Language](docs/content/concepts/expression-language.md) for more details.

- Define a feature via Python UDF:

```python
11 changes: 7 additions & 4 deletions docs/content/README.md
@@ -9,14 +9,16 @@ applications.
- [Flink Session Mode Quickstart](quickstarts/flink-session-mode.md)
- [Spark Client Mode Quickstart](quickstarts/spark-client-mode.md)
- [Basic Concepts](basic-concepts.md)
- [FeatHub Expression Language](expression-language)
- [Data Types and Reserved Keywords](expression-language/dtypes.md)
- [Built-in Operators and Functions](expression-language/functions.md)
- [Built-in Aggregations Functions](expression-language/aggregation_functions.md)
- [FeatHub SDK](feathub-sdk)
- [FeatureView and Transformation](feathub-sdk/feature-view.md)
- [Data Types and Reserved Keywords](feathub-sdk/dtypes.md)
- [Built-in Operators and Functions](feathub-sdk/functions.md)
- [Built-in Aggregations Functions](feathub-sdk/aggregation_functions.md)
- [Common Configurations](configurations.md)
- [Compute Engines](engines)
- [Apache Flink](engines/flink.md)
- [Apache Spark](engines/spark.md)
- [Local Processor](engines/local.md)
- [Feature Registries](registries)
- [MySQL](registries/mysql.md)
- [Connectors](connectors)
@@ -31,6 +33,7 @@ applications.
- [Built-in Optimizations](deep-dive/optimizations.md)
- [Metric Stores](metric-stores)
- [Overview](metric-stores/overview.md)
- [Built-in Metrics](metric-stores/metrics.md)
- [Prometheus](metric-stores/prometheus.md)
- [How To](how-to)
- [Deploy FeatHub Job on Alibaba Cloud](how-to/deploy-on-alibaba-cloud.md)
54 changes: 2 additions & 52 deletions docs/content/basic-concepts.md
@@ -32,61 +32,11 @@ which we can construct FeatureTable in FeatHub.

### FeatureView

A `FeatureView` provides metadata to derive a table of feature values from
other tables. FeatHub currently supports the following types of FeatureViews.

- [DerivedFeatureView](https://github.com/alibaba/feathub/blob/master/python/feathub/feature_views/derived_feature_view.py)
derives features by applying the given transformations on an existing table.
It supports per-row transformation, over window transformation and table join.
It does not support sliding window transformation.
- [SlidingFeatureView](https://github.com/alibaba/feathub/blob/master/python/feathub/feature_views/sql_feature_view.py)
derives features by applying the given transformations on an existing table.
It supports per-row transformation and sliding window transformation. It does
not support join or over window transformation.
- [OnDemandFeatureView](https://github.com/alibaba/feathub/blob/master/python/feathub/feature_views/on_demand_feature_view.py)
derives features by joining online request with features from tables in online
feature stores. It supports per-row transformation and join with tables in
online stores. It does not support over window transformation or sliding window
transformation.
- [SqlFeatureView](https://github.com/alibaba/feathub/blob/master/python/feathub/feature_views/sql_feature_view.py)
derives features by evaluating a given SQL statement. Currently, its
semantics depends on the processor used during deployment. We plan to make it
processor-agnostic in the future to ensure consistent semantics regardless of
processor choice.

`FeatureView` provides APIs to specify and access `Feature`s. Each `Feature` is
defined by the following metadata:
- `name`: a string that uniquely identifies this feature in the parent table.
- `dtype`: the data type of this feature's values.
- `transform`: A declarative definition of how to derive this feature's values.
- `keys`: an optional list of strings, corresponding to the names of fields in
the parent table necessary to interpret this feature's values. If it is
specified, it is used as the join key when FeatHub joins this feature onto
another table.
See [here](feathub-sdk/feature-view.md) for more details.

## Transformation - Declarative Definition of Feature Computation

A `Transformation` defines how to derive a new feature from existing features.
FeatHub currently supports the following types of Transformations.

- [ExpressionTransform](https://github.com/alibaba/feathub/blob/master/python/feathub/feature_views/transforms/expression_transform.py)
derives feature values by applying FeatHub expression on one row of the parent
table at a time. FeatHub expression language is a declarative language with
build-in functions. See [FeatHub expression](feathub_expression.md) for more
information.
- [OverWindowTransform](https://github.com/alibaba/feathub/blob/master/python/feathub/feature_views/transforms/expression_transform.py)
derives feature values by applying FeatHub expression and aggregation function
on multiple rows of a table at a time.
- [SlidingWindowTransform](https://github.com/alibaba/feathub/blob/master/python/feathub/feature_views/transforms/sliding_window_transform.py)
derives feature values by applying FeatHub expression and aggregation function
on multiple rows in a sliding window.
- [JoinTransform](https://github.com/alibaba/feathub/blob/master/python/feathub/feature_views/transforms/expression_transform.py)
derives feature values by joining parent table with a feature from another
table.
- [PythonUdfTransform](https://github.com/alibaba/feathub/blob/master/python/feathub/feature_views/transforms/python_udf_transform.py)
derives feature values by applying a Python UDF on one row of the parent table
at a time.

See [here](feathub-sdk/feature-view.md) for more details.

## Processor - Pluggable Compute Engine for Feature ETL

2 changes: 1 addition & 1 deletion docs/content/deep-dive/optimizations.md
@@ -52,5 +52,5 @@ significantly reduce I/O costs and network bandwidth usage.

This optimization is enabled by default. See
[SlidingFeatureView](../configurations.md#slidingfeatureview) configuration for
more detail.
more details.

1 change: 1 addition & 0 deletions docs/content/engines/README.md
@@ -2,4 +2,5 @@

- [Apache Flink](flink.md)
- [Apache Spark](spark.md)
- [Local Processor](local.md)

8 changes: 5 additions & 3 deletions docs/content/engines/flink.md
@@ -1,8 +1,10 @@
# Apache Flink

The FlinkProcessor does feature ETL using Flink as the compute engine. In the
following sections we describe the deployment modes supported by FlinkProcessor
and the configuration keys accepted by each mode.
The
[FlinkProcessor](https://github.com/alibaba/feathub/blob/master/python/feathub/processors/flink/flink_processor.py)
does feature ETL using Flink as the compute engine. In the following sections we
describe the deployment modes supported by FlinkProcessor and the configuration
keys accepted by each mode.

## Supported Versions

16 changes: 16 additions & 0 deletions docs/content/engines/local.md
@@ -0,0 +1,16 @@
# Local Processor

[Local
Processor](https://github.com/alibaba/feathub/blob/master/python/feathub/processors/local/local_processor.py)
utilizes CPUs on the local machine to compute features and uses Pandas
DataFrame to store tabular data in memory. It is useful for doing experiments
on a local machine without having to deploy and connect to a distributed
Flink/Spark cluster.

This processor is implemented using the Pandas library and computes features in
the given Python process. If `feathub-nightly[spark]` is installed, the Local
Processor can use Spark's local mode to access storage systems (e.g. HDFS) that
it otherwise would not support.

See [here](../../../README.md#quickstart) for an example of using Local Processor.

8 changes: 5 additions & 3 deletions docs/content/engines/spark.md
@@ -1,8 +1,10 @@
# Apache Spark

The SparkProcessor does feature ETL using Spark as the compute engine. In the
following sections we describe the deployment modes supported by SparkProcessor
and the configuration keys accepted by each mode.
The
[SparkProcessor](https://github.com/alibaba/feathub/blob/master/python/feathub/processors/spark/spark_processor.py)
does feature ETL using Spark as the compute engine. In the following sections we
describe the deployment modes supported by SparkProcessor and the configuration
keys accepted by each mode.

## Supported Versions

10 changes: 0 additions & 10 deletions docs/content/expression-language/README.md

This file was deleted.

17 changes: 17 additions & 0 deletions docs/content/feathub-sdk/README.md
@@ -0,0 +1,17 @@
# FeatHub SDK

FeatHub empowers users to define features using a Pythonic and declarative SDK
with built-in functions, allowing recursive feature creation based on existing
definitions.

Compared to SQL-based SDKs, FeatHub's Pythonic SDK integrates directly with
Python-focused machine learning libraries like scikit-learn and PyTorch.
Python's expressiveness, including for-loops and if/else statements, leads to
more concise and readable feature definition code, especially when defining
many features that follow a similar pattern.

- [FeatureView and Transformation](feature-view.md)
- [Data Types and Reserved Keywords](dtypes.md)
- [Built-in Operators and Functions](functions.md)
- [Built-in Aggregations Functions](aggregation_functions.md)
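
As a rough illustration of the point about for-loops, the sketch below generates several similarly patterned feature definitions in an ordinary Python loop. `FeatureSpec`, `build_window_features`, and the transform strings are hypothetical stand-ins written for this example; they are not the FeatHub API, and the actual `Feature` class is shown in the project README.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class FeatureSpec:
    """Hypothetical stand-in for a feature definition (not the FeatHub API)."""

    name: str
    dtype: str
    transform: str  # made-up pseudo-expression, kept as a plain string


def build_window_features(column: str, windows_min: List[int]) -> List[FeatureSpec]:
    """Create one sum-style feature per window size using an ordinary loop."""
    specs = []
    for minutes in windows_min:
        specs.append(
            FeatureSpec(
                name=f"total_{column}_last_{minutes}_minutes",
                dtype="float32",
                transform=f"SUM({column}) over the last {minutes} minutes",
            )
        )
    return specs


features = build_window_features("fare_amount", [5, 30, 60])
for spec in features:
    print(spec.name, "<-", spec.transform)
```

Adding another window size is a one-element change to the list, whereas a SQL-based definition would typically repeat the whole expression for each window.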

File renamed without changes.
59 changes: 59 additions & 0 deletions docs/content/feathub-sdk/feature-view.md
@@ -0,0 +1,59 @@
# FeatureView - A Table of Features

A `FeatureView` provides metadata to derive a table of feature values from
other tables. FeatHub currently supports the following types of FeatureViews.

- [DerivedFeatureView](https://github.com/alibaba/feathub/blob/master/python/feathub/feature_views/derived_feature_view.py)
derives features by applying the given transformations on an existing table.
It supports per-row transformation, over window transformation and table join.
It does not support sliding window transformation.
- [SlidingFeatureView](https://github.com/alibaba/feathub/blob/master/python/feathub/feature_views/sliding_feature_view.py)
derives features by applying the given transformations on an existing table.
It supports per-row transformation and sliding window transformation. It does
not support join or over window transformation.
- [OnDemandFeatureView](https://github.com/alibaba/feathub/blob/master/python/feathub/feature_views/on_demand_feature_view.py)
derives features by joining online request with features from tables in online
feature stores. It supports per-row transformation and join with tables in
online stores. It does not support over window transformation or sliding window
transformation.
- [SqlFeatureView](https://github.com/alibaba/feathub/blob/master/python/feathub/feature_views/sql_feature_view.py)
derives features by evaluating a given SQL statement. Currently, its
semantics depends on the processor used during deployment. We plan to make it
processor-agnostic in the future to ensure consistent semantics regardless of
processor choice.

`FeatureView` provides APIs to specify and access `Feature`s. Each `Feature` is
defined by the following metadata:
- `name`: a string that uniquely identifies this feature in the parent table.
- `dtype`: the data type of this feature's values.
- `transform`: A declarative definition of how to derive this feature's values.
- `keys`: an optional list of strings, corresponding to the names of fields in
the parent table necessary to interpret this feature's values. If it is
specified, it is used as the join key when FeatHub joins this feature onto
another table.

# Transformation - Declarative Definition of Feature Computation

A `Transformation` defines how to derive a new feature from existing features.
FeatHub currently supports the following types of Transformations.

- [ExpressionTransform](https://github.com/alibaba/feathub/blob/master/python/feathub/feature_views/transforms/expression_transform.py)
derives feature values by applying FeatHub expression on one row of the
parent table at a time. The FeatHub expression language is a declarative
language, sharing a syntax and grammar reminiscent of the SQL SELECT clause.
See [here](./) for a comprehensive list of built-in data types, functions
and operators.
- [OverWindowTransform](https://github.com/alibaba/feathub/blob/master/python/feathub/feature_views/transforms/over_window_transform.py)
derives feature values by applying FeatHub expression and aggregation function
on multiple rows of a table at a time.
- [SlidingWindowTransform](https://github.com/alibaba/feathub/blob/master/python/feathub/feature_views/transforms/sliding_window_transform.py)
derives feature values by applying FeatHub expression and aggregation function
on multiple rows in a sliding window.
- [JoinTransform](https://github.com/alibaba/feathub/blob/master/python/feathub/feature_views/transforms/join_transform.py)
derives feature values by joining parent table with a feature from another
table.
- [PythonUdfTransform](https://github.com/alibaba/feathub/blob/master/python/feathub/feature_views/transforms/python_udf_transform.py)
derives feature values by applying a Python UDF on one row of the parent table
at a time.
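
To make the idea of an over window aggregation more concrete, here is a minimal pure-Python sketch that, for each row, sums a value over the rows sharing the same key within a preceding time window. It does not use FeatHub; the function name `over_window_sum`, the row layout, and the choice of a half-open window `(row_time - window, row_time]` are assumptions made for illustration, and the real `OverWindowTransform` may define its window boundaries differently.

```python
from datetime import datetime, timedelta
from typing import Dict, List


def over_window_sum(
    rows: List[Dict],
    key: str,
    value: str,
    time_field: str,
    window: timedelta,
) -> List[float]:
    """For each row, sum `value` over rows with the same `key` whose
    timestamp lies in (row_time - window, row_time]."""
    results = []
    for row in rows:
        lower = row[time_field] - window
        total = sum(
            other[value]
            for other in rows
            if other[key] == row[key]
            and lower < other[time_field] <= row[time_field]
        )
        results.append(total)
    return results


t0 = datetime(2023, 8, 5, 12, 0, 0)
rows = [
    {"user": "a", "fare": 10.0, "ts": t0},
    {"user": "a", "fare": 5.0, "ts": t0 + timedelta(minutes=1)},
    {"user": "b", "fare": 7.0, "ts": t0 + timedelta(minutes=2)},
    {"user": "a", "fare": 2.0, "ts": t0 + timedelta(minutes=10)},
]
sums = over_window_sum(rows, "user", "fare", "ts", timedelta(minutes=5))
print(sums)  # one aggregated value per input row
```

The output has one value per input row, which is the property that distinguishes this style of aggregation from a per-row expression. The quadratic scan is only for readability; a compute engine would evaluate this incrementally.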


File renamed without changes.
1 change: 1 addition & 0 deletions docs/content/metric-stores/README.md
@@ -1,4 +1,5 @@
# Metric Stores

- [Overview](overview.md)
- [Built-in Metrics](metrics.md)
- [Prometheus](prometheus.md)
50 changes: 50 additions & 0 deletions docs/content/metric-stores/metrics.md
@@ -0,0 +1,50 @@
# Built-in Metrics

Below are the metric types of FeatHub's built-in metrics, along with their
parameters and exposed tags.

## Count

Count is a metric that shows the number of features. It has the following
parameters:

- filter_expr: Optional with None as the default value. If it is not None, it
represents a partial FeatHub expression which evaluates to a boolean value.
The partial FeatHub expression should be a binary operator whose left operand
is absent and is filled in with the host feature name. For example, "IS
NULL" is expanded into "{feature_name} IS NULL". Only feature values for
which this expression evaluates to True are considered when computing the
metric.
- window_size: Optional with 0 as the default value. The time range over which
the metric is computed. It should be zero or a positive time span. If it is
zero, the metric is computed from all feature values that have been processed
since the FeatHub job was created.

It exposes the following metric-specific tags:

- metric_type: "count"
- filter_expr: The value of the filter_expr parameter.
- window_size_sec: The value of the window_size parameter in seconds.

## Ratio

Ratio is a metric that shows the proportion of the number of features that meet
filter_expr to the number of all features. It has the following parameters:

- filter_expr: A partial FeatHub expression which evaluates to a boolean value.
The partial FeatHub expression should be a binary operator whose left operand
is absent and is filled in with the host feature name. For example, "IS
NULL" is expanded into "{feature_name} IS NULL". Only feature values for
which this expression evaluates to True are considered when computing the
metric.
- window_size: Optional with 0 as the default value. The time range over which
the metric is computed. It should be zero or a positive time span. If it is
zero, the metric is computed from all feature values that have been processed
since the FeatHub job was created.

It exposes the following metric-specific tags:

- metric_type: "ratio"
- filter_expr: The value of the filter_expr parameter.
- window_size_sec: The value of the window_size parameter in seconds.
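
The following sketch illustrates, in plain Python, how a partial filter expression is completed with the feature name and what the Count and Ratio metrics compute over a batch of values. All function names here are hypothetical and not part of FeatHub; a Python predicate stands in for the evaluated FeatHub expression, `window_size` is ignored (as if it were zero), and returning 0.0 for an empty input is an assumption of this example.

```python
from typing import Any, Callable, Optional, Sequence


def enrich_filter_expr(feature_name: str, filter_expr: str) -> str:
    """Fill in the absent left operand with the host feature name."""
    return f"{feature_name} {filter_expr}"


def count_metric(
    values: Sequence[Any],
    predicate: Optional[Callable[[Any], bool]] = None,
) -> int:
    """Count all values, or only those that satisfy the predicate."""
    if predicate is None:
        return len(values)
    return sum(1 for v in values if predicate(v))


def ratio_metric(values: Sequence[Any], predicate: Callable[[Any], bool]) -> float:
    """Proportion of values that satisfy the predicate."""
    if not values:
        return 0.0  # assumption: an empty input is reported as 0.0
    return count_metric(values, predicate) / len(values)


def is_null(v: Any) -> bool:
    """Python stand-in for the "IS NULL" filter expression."""
    return v is None


values = [1.5, None, 3.0, None]

print(enrich_filter_expr("fare_amount", "IS NULL"))  # fare_amount IS NULL
print(count_metric(values))                          # 4
print(count_metric(values, is_null))                 # 2
print(ratio_metric(values, is_null))                 # 0.5
```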
