From 00ff329f8f4b66f38289bc4bc05f064e0df081ad Mon Sep 17 00:00:00 2001 From: Andreas Motl Date: Thu, 19 Sep 2024 13:07:47 +0200 Subject: [PATCH] Zyp: Improve documentation --- CHANGES.md | 1 + README.md | 2 +- doc/cdc/index.md | 8 + doc/conf.py | 4 +- doc/index.md | 8 - doc/prior-art.md | 16 -- doc/zyp/backlog.md | 76 +++--- doc/zyp/examples.md | 550 ++++++++++++++++++++++++++++++++++++++++++++ doc/zyp/index.md | 261 ++++++++------------- doc/zyp/research.md | 35 ++- pyproject.toml | 6 +- 11 files changed, 726 insertions(+), 241 deletions(-) delete mode 100644 doc/prior-art.md create mode 100644 doc/zyp/examples.md diff --git a/CHANGES.md b/CHANGES.md index 1e6ff98..8103e74 100644 --- a/CHANGES.md +++ b/CHANGES.md @@ -9,6 +9,7 @@ - MongoDB: Use improved decoding machinery also for `MongoDBCDCTranslator` - Dependencies: Make MongoDB subsystem not strictly depend on Zyp - Zyp: Translate a few special treatments to jq-based `MokshaTransformation` again +- Zyp: Improve documentation ## 2024/09/10 v0.0.15 - Added Zyp Treatments, a slightly tailored transformation subsystem diff --git a/README.md b/README.md index f57aa52..1ae0c8b 100644 --- a/README.md +++ b/README.md @@ -2,7 +2,7 @@ [![Tests](https://github.com/crate/commons-codec/actions/workflows/tests.yml/badge.svg)](https://github.com/crate/commons-codec/actions/workflows/tests.yml) [![Coverage](https://codecov.io/gh/crate/commons-codec/branch/main/graph/badge.svg)](https://app.codecov.io/gh/crate/commons-codec) -[![Build status (documentation)](https://readthedocs.org/projects/commons-codec/badge/)](https://cratedb.com/docs/commons-codec/) +[![Build status (documentation)](https://readthedocs.org/projects/commons-codec/badge/)](https://commons-codec.readthedocs.io/) [![PyPI Version](https://img.shields.io/pypi/v/commons-codec.svg)](https://pypi.org/project/commons-codec/) [![Python Version](https://img.shields.io/pypi/pyversions/commons-codec.svg)](https://pypi.org/project/commons-codec/) [![PyPI 
Downloads](https://pepy.tech/badge/commons-codec/month)](https://pepy.tech/project/commons-codec/) diff --git a/doc/cdc/index.md b/doc/cdc/index.md index 6208178..d5dc0f2 100644 --- a/doc/cdc/index.md +++ b/doc/cdc/index.md @@ -25,6 +25,14 @@ and need further curation and improvements. ::: +## Prior Art + +- [core-cdc] by Alejandro Cora González +- [Carabas Research] + + +[Carabas Research]: https://lorrystream.readthedocs.io/carabas/research.html +[core-cdc]: https://pypi.org/project/core-cdc/ [DynamoDB CDC Relay for CrateDB]: https://cratedb-toolkit.readthedocs.io/io/dynamodb/cdc.html [MongoDB CDC Relay for CrateDB]: https://cratedb-toolkit.readthedocs.io/io/mongodb/cdc.html [Replicating CDC Events from DynamoDB to CrateDB]: https://cratedb.com/blog/replicating-cdc-events-from-dynamodb-to-cratedb diff --git a/doc/conf.py b/doc/conf.py index ad7af42..942e7a1 100644 --- a/doc/conf.py +++ b/doc/conf.py @@ -85,7 +85,9 @@ intersphinx_mapping = { # "influxio": ("https://influxio.readthedocs.io/", None), } -linkcheck_ignore = [] +linkcheck_ignore = [ + r"https://stackoverflow.com/questions/70518350", +] # Disable caching remote inventories completely. 
# http://www.sphinx-doc.org/en/stable/ext/intersphinx.html#confval-intersphinx_cache_limit diff --git a/doc/index.md b/doc/index.md index cd3eee1..00871d2 100644 --- a/doc/index.md +++ b/doc/index.md @@ -37,14 +37,6 @@ decode zyp/index ``` -```{toctree} -:maxdepth: 3 -:caption: Topics -:hidden: - -prior-art -``` - ```{toctree} :maxdepth: 1 :caption: Workbench diff --git a/doc/prior-art.md b/doc/prior-art.md deleted file mode 100644 index fe48b31..0000000 --- a/doc/prior-art.md +++ /dev/null @@ -1,16 +0,0 @@ -# Prior Art - -## CDC -- [core-cdc] by Alejandro Cora González -- [Carabas Research](https://lorrystream.readthedocs.io/carabas/research.html) - -## Zyp -- https://github.com/cloud-custodian/cel-python -- https://github.com/MacHu-GWU/polars_transform-project -- https://github.com/MacHu-GWU/jsonpolars-project -- https://github.com/MacHu-GWU/polars_writer-project -- https://github.com/raw-labs/snapi -- https://github.com/danthedeckie/simpleeval - - -[core-cdc]: https://pypi.org/project/core-cdc/ diff --git a/doc/zyp/backlog.md b/doc/zyp/backlog.md index d3c7d0b..88afe0d 100644 --- a/doc/zyp/backlog.md +++ b/doc/zyp/backlog.md @@ -1,53 +1,51 @@ # Zyp Backlog ## Iteration +1 -- Refactor module namespace to `zyp` -- Documentation -- CLI interface -- Apply to MongoDB Table Loader in CrateDB Toolkit -- Document `jq` functions +- [x] Refactor module namespace to `zyp` +- [x] Documentation +- [ ] CLI interface +- [x] Apply to MongoDB Table Loader in CrateDB Toolkit +- [ ] Document `jq` functions - `builtin.jq`: https://github.com/jqlang/jq/blob/master/src/builtin.jq - `function.jq` +- [ ] Renaming needs JSON Pointer support. Alternatively, can `jq` do it? +- [ ] Documentation: Add Python example to "Synopsis" section on /index.html ## Iteration +2 -Demonstrate! 
-- math expressions -- omit key (recursively) -- combine keys -- filter on keys and/or values -- Pathological cases like "Not defined" in typed fields like `TIMESTAMP` -- Use simpleeval, like Meltano, and provide the same built-in functions -- https://sdk.meltano.com/en/v0.39.1/stream_maps.html#other-built-in-functions-and-names -- https://github.com/MeltanoLabs/meltano-map-transform/pull/255 -- https://github.com/MeltanoLabs/meltano-map-transform/issues/252 -- Use JSONPath, see https://sdk.meltano.com/en/v0.39.1/code_samples.html#use-a-jsonpath-expression-to-extract-the-next-page-url-from-a-hateoas-response -- Is `jqpy` better than `jq`? - https://baterflyrity.github.io/jqpy/ +Demonstrate more use cases, like... +- [ ] math expressions +- [ ] omit key (recursively) +- [ ] combine keys +- [ ] filter on keys and/or values +- [ ] Pathological cases like "Not defined" in typed fields like `TIMESTAMP` +- [ ] Use simpleeval, like Meltano, and provide the same built-in functions + - https://sdk.meltano.com/en/v0.39.1/stream_maps.html#other-built-in-functions-and-names + - https://github.com/MeltanoLabs/meltano-map-transform/pull/255 + - https://github.com/MeltanoLabs/meltano-map-transform/issues/252 +- [ ] Use JSONPath, see https://sdk.meltano.com/en/v0.39.1/code_samples.html#use-a-jsonpath-expression-to-extract-the-next-page-url-from-a-hateoas-response ## Iteration +3 -- Moksha transformations on Buckets -- Investigate using JSON Schema -- Fluent API interface -- https://github.com/Halvani/alphabetic -- Mappers do not support external API lookups. 
+- [ ] Moksha transformations on Buckets +- [ ] Fluent API interface + ```python + from zyp.model.fluent import FluentTransformation + + transformation = FluentTransformation() + .jmes("records[?starts_with(location, 'B')]") + .rename_fields({"_id": "id"}) + .convert_values({"/id": "int", "/value": "float"}, type="pointer-python") + .jq(".[] |= (.value /= 100)") + ``` +- [ ] Investigate using JSON Schema +- [ ] https://github.com/Halvani/alphabetic +- [ ] Mappers do not support external API lookups. To add external API lookups, you can either (a) land all your data and then perform joins using a transformation tool like dbt, or (b) create a custom mapper plugin with inline lookup logic. => Example from Luftdatenpumpe, using a reverse geocoder - [ ] Define schema - https://sdk.meltano.com/en/latest/typing.html -- https://docs.meltano.com/guide/v2-migration/#migrate-to-an-adapter-specific-dbt-transformer -- https://github.com/meltano/sdk/blob/v0.39.1/singer_sdk/mapper.py - -## Fluent API Interface - -```python - -from zyp.model.fluent import FluentTransformation - -transformation = FluentTransformation() -.jmes("records[?starts_with(location, 'B')]") -.rename_fields({"_id": "id"}) -.convert_values({"/id": "int", "/value": "float"}, type="pointer-python") -.jq(".[] |= (.value /= 100)") -``` + - https://sdk.meltano.com/en/latest/typing.html + - https://docs.meltano.com/guide/v2-migration/#migrate-to-an-adapter-specific-dbt-transformer + - https://github.com/meltano/sdk/blob/v0.39.1/singer_sdk/mapper.py +- [ ] Is `jqpy` better than `jq`? + - https://baterflyrity.github.io/jqpy/ diff --git a/doc/zyp/examples.md b/doc/zyp/examples.md new file mode 100644 index 0000000..8723a86 --- /dev/null +++ b/doc/zyp/examples.md @@ -0,0 +1,550 @@ +# Zyp Example Gallery + +This page gives you a hands-on introduction to Zyp Transformations by way +of a few example snippets, recipes, and use cases, in order to get you accustomed +to Zyp's capabilities.
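Before diving into the individual examples, it may help to see the core idea in plain Python. The following sketch mimics what a record-level rule set does — rename fields, then convert values addressed by JSON Pointer — using only the standard library. It is a conceptual illustration, not the Zyp API itself; the real API is shown throughout the gallery below.

```python
import importlib


def apply_rules(record, renames, converters):
    """Conceptual sketch of a record-level rule set:
    rename fields, then convert values addressed by JSON Pointer.
    Not the Zyp API -- just an illustration of the mechanics."""
    for old, new in renames:
        record[new] = record.pop(old)
    for pointer, transformer in converters:
        # Resolve a dotted callable like "builtins.int".
        module, _, name = transformer.rpartition(".")
        func = getattr(importlib.import_module(module), name)
        # Walk the JSON Pointer down to the parent of the target leaf.
        *parents, leaf = pointer.lstrip("/").split("/")
        target = record
        for segment in parents:
            target = target[segment]
        target[leaf] = func(target[leaf])
    return record


record = apply_rules(
    {"_id": "123", "reading": "42.42"},
    renames=[("_id", "id")],
    converters=[("/id", "builtins.int"), ("/reading", "builtins.float")],
)
print(record["id"], record["reading"])  # 123 42.42
```

The pointer/transformer pairs deliberately mirror the shape of the rules used in the examples below, so the YAML snippets in this gallery should feel familiar after reading this sketch.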
+ +If you discover the need for another kind of transformation, or need assistance +crafting transformation rules, please reach out to us on the [issue tracker]. + + +## Bucket Transformation +A `BucketTransformation` works on **individual data records**, i.e. on a per-record level, +where each record can be a document with child elements, possibly nested. +You can slice into a document by using JSON Pointer, and apply functions from +arbitrary Python modules as transformer functions. + +Let's define a basic transformation including three rules. +- Rename the `_id` field to `id`, +- cast its value to `int`, and +- cast the value of the `reading` field to `float`. + +Let's also illustrate that as a basic example of input/output data. +:::::::{grid} +:gutter: 0 +:margin: 0 +:padding: 0 + +::::::{grid-item-card} Input Data +A slightly messy collection of records. +```json +[ + {"_id": "123", "reading": "42.42"}, + {"_id": "456", "reading": -84.01} +] +``` +:::::: +::::::{grid-item-card} Output Data +An edited variant suitable for storing. +```json +[ + {"id": 123, "reading": 42.42}, + {"id": 456, "reading": -84.01} +] +``` +:::::: +::::::{grid-item-card} Transformation Rules +The Python program can be executed 1:1 in a Python REPL. +The YAML file needs to be loaded and applied programmatically; there is +no tutorial for that yet, so please refer to the source code and the software tests. + +:::::{dropdown} Syntax: Python API and YAML format + +::::{tab-set-code} +```{code-block} python +# Record-level "Zyp Bucket Transformation" definition. +# Includes column renaming and applying Python converter functions. + +from pprint import pprint +from zyp.model.bucket import BucketTransformation, FieldRenamer, ValueConverter + +# Consider a slightly messy collection of records. +data_in = [ + {"_id": "123", "name": "device-foo", "reading": "42.42"}, + {"_id": "456", "name": "device-bar", "reading": -84.01}, +] + +# Define a record-level "bucket transformation".
+transformation = BucketTransformation( + names=FieldRenamer().add(old="_id", new="id"), + values=ValueConverter() + .add(pointer="/id", transformer="builtins.int") + .add(pointer="/reading", transformer="builtins.float"), +) + +# Transform data and dump to stdout. +data_out = list(map(transformation.apply, data_in)) +pprint(data_out) +``` +```{code-block} yaml +# Record-level "Zyp Bucket Transformation" definition. +# Includes column renaming and applying Python converter functions. +--- + +meta: + type: zyp-bucket + version: 1 +names: + rules: + - new: id + old: _id +values: + rules: + - pointer: /id + transformer: builtins.int + - pointer: /reading + transformer: builtins.float +``` +:::: +::::: + +:::{tip} +Please toggle the "Syntax" dropdown above to inspect how the Python API +is used to define transformation rules, and the corresponding YAML representation. +::: + +:::::: +::::::: + + +## Collection Transformation +A more advanced transformation example for editing a **collection of data records**. + +:::::::{grid} +:gutter: 0 +:margin: 0 +:padding: 0 + +::::::{grid-item-card} Input Data +Consider a messy collection of input data. +- The actual collection is nested within the top-level `records` element. +- `_id` fields are conveyed in string format. +- `value` fields include both integer and string values. +- `value` fields are fixed-point values, using a scaling factor of `100`. +- The collection includes invalid `null` records. + Those records usually trip processing when, for example, filtering on object elements.
+```json +{ + "message-source": "system-3000", + "message-type": "eai-warehouse", + "records": [ + {"_id": "12", "meta": {"name": "foo", "location": "B"}, "data": {"value": "4242"}}, + null, + {"_id": "34", "meta": {"name": "bar", "location": "BY"}, "data": {"value": -8401}}, + {"_id": "56", "meta": {"name": "baz", "location": "NI"}, "data": {"value": 2323}}, + {"_id": "78", "meta": {"name": "qux", "location": "NRW"}, "data": {"value": -580}}, + null, + null + ] +} +``` +:::::: + +::::::{grid-item-card} Output Data +After applying the transformation rules outlined here, the expected +outcome is a collection of valid records, optionally filtered, and values adjusted +according to relevant type hints and other conversions. +The edited variant is considered suitable for storing in consolidation and +analytics databases. +```json +[ + {"id": 12, "meta": {"name": "foo", "location": "B"}, "data": {"value": 42.42}}, + {"id": 34, "meta": {"name": "bar", "location": "BY"}, "data": {"value": -84.01}} +] +``` +:::::: + +::::::{grid-item-card} Transformation Rules +Let's come up with relevant pre-processing rules to cleanse and mangle the shape of the +input collection. In order to make this example more exciting, let's include two special +needs: +- Filter the input collection by the value of a nested element. +- Rename top-level fields starting with underscore `_`. + +Other than those special rules, the fundamental ones to re-shape the data are: +- Unwrap the `records` attribute from the container dictionary into the actual collection. +- Filter the collection, both by omitting invalid/empty records, and by applying query + constraints. +- On each record, rename the top-level `_id` field to `id`. +- On each record, adjust the data types of the `id` and `value` fields. +- Postprocess the collection, applying a custom scaling factor to the `value` field.
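The last rule — undoing the fixed-point encoding — boils down to a single division mapped across the collection, which is what the jq expression `.[] |= (.data.value /= 100)` used in this example expresses. A plain-Python sketch of just that post-processing step (for illustration only; the actual Zyp rule definitions follow below):

```python
def unscale(records, factor=100):
    # Undo fixed-point encoding: divide each record's data.value by `factor`.
    # Plain-Python equivalent of the jq rule `.[] |= (.data.value /= 100)`.
    for record in records:
        record["data"]["value"] /= factor
    return records


records = unscale([
    {"id": 12, "data": {"value": 4242.0}},
    {"id": 34, "data": {"value": -8401.0}},
])
print(records[0]["data"]["value"])  # 42.42
```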
+ +Zyp lets you concisely write those rules down using the Python language, and will +also let you serialize the transformation description into a text-based format. + +The Python program can be executed 1:1 in a Python REPL. +The YAML file needs to be loaded and applied programmatically; there is +no tutorial for that yet, so please refer to the source code and the software tests. + +:::::{dropdown} Syntax: Python API and YAML format + +::::{tab-set-code} +```{code-block} python +# Collection-level "Zyp Collection Transformation" definition. +# Includes rules for different kinds of transformations and processors. +# Uses all of JMES, jq, and JSON Pointer technologies for demonstration purposes. + +from pprint import pprint +from zyp.model.bucket import BucketTransformation, FieldRenamer, ValueConverter +from zyp.model.collection import CollectionTransformation +from zyp.model.moksha import MokshaTransformation + +# Consider a slightly messy collection of records. +data_in = { + "message-source": "system-3000", + "message-type": "eai-warehouse", + "records": [ + {"_id": "12", "meta": {"name": "foo", "location": "B"}, "data": {"value": "4242"}}, + None, + {"_id": "34", "meta": {"name": "bar", "location": "BY"}, "data": {"value": -8401}}, + {"_id": "56", "meta": {"name": "baz", "location": "NI"}, "data": {"value": 2323}}, + {"_id": "78", "meta": {"name": "qux", "location": "NRW"}, "data": {"value": -580}}, + None, + None, + ], +} + +# Define a collection-level "collection transformation", including Bucket- and Moksha- +# transformations. +transformation = CollectionTransformation( + pre=MokshaTransformation().jmes("records[?not_null(meta.location) && !starts_with(meta.location, 'N')]"), + bucket=BucketTransformation( + names=FieldRenamer().add(old="_id", new="id"), + values=ValueConverter() + .add(pointer="/id", transformer="builtins.int") + .add(pointer="/data/value", transformer="builtins.float"), + ), + post=MokshaTransformation().jq(".[] |= (.data.value /= 100)"), +) + +# Transform data and dump
to stdout. +data_out = transformation.apply(data_in) +pprint(data_out) +``` +```{code-block} yaml +# Collection-level "Zyp Collection Transformation" definition. +# Includes rules for different kinds of transformations and processors. +# Uses all of JMES, jq, and JSON Pointer technologies for demonstration purposes. +--- + +meta: + version: 1 + type: zyp-collection +pre: + rules: + - expression: records[?not_null(meta.location) && !starts_with(meta.location, 'N')] + type: jmes +bucket: + names: + rules: + - new: id + old: _id + values: + rules: + - pointer: /id + transformer: builtins.int + - pointer: /data/value + transformer: builtins.float +post: + rules: + - expression: .[] |= (.data.value /= 100) + type: jq +``` +:::: + +In order to serialize the `zyp-collection` transformation description using the Python API, +for example into YAML format, use this code. +```python +print(transformation.to_yaml()) +``` +::::: + +:::{tip} +Please toggle the "Syntax" dropdown above to inspect how the Python API +is used to define transformation rules, and the corresponding YAML representation. +::: + +:::::: + +::::::: + + +## Select elements with Moksha/jq +A compact transformation example that uses `jq` to **select elements** of input documents, +sometimes also called "pick fields" or "include columns". + +Let's illustrate that as a basic example of input/output data again. +:::::::{grid} 1 +:gutter: 0 +:margin: 0 +:padding: 0 + +::::::{grid-item-card} Input Data +A collection of records. +```json +[ + { + "meta": {"id": "Hotzenplotz", "timestamp": 123456789}, + "data": {"abc": 123, "def": 456} + } +] +``` +:::::: + +::::::{grid-item-card} Output Data +An edited variant suitable for storing. +```json +[ + { + "meta": {"id": "Hotzenplotz"}, + "data": {"abc": 123} + } +] +``` +:::::: + +::::::{grid-item-card} Transformation Rules +The Python program can be executed 1:1 in a Python REPL.
+The YAML file needs to be loaded and applied programmatically; there is +no tutorial for that yet, so please refer to the source code and the software tests. + +:::::{dropdown} Syntax: Python API and YAML format + +::::{tab-set-code} +```{code-block} python +# Collection-level "Zyp Collection Transformation" definition. +# Includes a Moksha/jq transformation rule for including elements. + +from pprint import pprint +from zyp.model.collection import CollectionTransformation +from zyp.model.moksha import MokshaTransformation + +data_in = [ + { + "meta": {"id": "Hotzenplotz", "timestamp": 123456789}, + "data": {"abc": 123, "def": 456}, + } +] + +transformation = CollectionTransformation( + pre=MokshaTransformation() + .jq(".[] |= pick(.meta.id, .data.abc)") +) + +# Transform data and dump to stdout. +data_out = transformation.apply(data_in) +pprint(data_out) +``` +```{code-block} yaml +# Collection-level "Zyp Collection Transformation" definition. +# Includes a Moksha/jq transformation rule for including elements. +--- + +meta: + type: zyp-collection + version: 1 +pre: + rules: + - expression: .[] |= pick(.meta.id, .data.abc) + type: jq +``` +:::: +::::: + +:::{tip} +Please toggle the "Syntax" dropdown above to inspect how the Python API +is used to define transformation rules, and the corresponding YAML representation. +::: +:::::: + +::::::: + + +## Drop elements with Moksha/jq +A compact transformation example that uses `jq` to **drop elements** of input documents, +sometimes also called "ignore fields" or "exclude columns". + +Let's illustrate that as a basic example of input/output data again. +:::::::{grid} 1 +:gutter: 0 +:margin: 0 +:padding: 0 + +::::::{grid-item-card} Input Data +A collection of records. +```json +[ + { + "meta": {"id": "Hotzenplotz", "timestamp": 123456789}, + "data": {"abc": 123, "def": 456} + } +] +``` +:::::: + +::::::{grid-item-card} Output Data +An edited variant suitable for storing.
+```json +[ + { + "meta": {"id": "Hotzenplotz"}, + "data": {"abc": 123} + } +] +``` +:::::: + +::::::{grid-item-card} Transformation Rules +The Python program can be executed 1:1 in a Python REPL. +The YAML file needs to be loaded and applied programmatically; there is +no tutorial for that yet, so please refer to the source code and the software tests. + +:::::{dropdown} Syntax: Python API and YAML format + +::::{tab-set-code} +```{code-block} python +# Collection-level "Zyp Collection Transformation" definition. +# Includes a Moksha/jq transformation rule for excluding elements. + +from pprint import pprint +from zyp.model.collection import CollectionTransformation +from zyp.model.moksha import MokshaTransformation + +data_in = [ + { + "meta": {"id": "Hotzenplotz", "timestamp": 123456789}, + "data": {"abc": 123, "def": 456}, + } +] + +transformation = CollectionTransformation( + pre=MokshaTransformation() + .jq(".[] |= del(.meta.timestamp, .data.def)") +) + +# Transform data and dump to stdout. +data_out = transformation.apply(data_in) +pprint(data_out) +``` +```{code-block} yaml +# Collection-level "Zyp Collection Transformation" definition. +# Includes a Moksha/jq transformation rule for excluding elements. +--- + +meta: + type: zyp-collection + version: 1 +pre: + rules: + - expression: .[] |= del(.meta.timestamp, .data.def) + type: jq +``` +:::: +::::: + +:::{tip} +Please toggle the "Syntax" dropdown above to inspect how the Python API +is used to define transformation rules, and the corresponding YAML representation. +::: +:::::: + +::::::: + + +## Unwrap and flatten with Moksha/jq +A compact transformation example that uses `jq` to: + +- Unwrap the actual collection which is nested within the top-level `records` element. +- Flatten the element `nested-list` which contains nested lists. + +Let's illustrate that as a basic example of input/output data once again. +:::::::{grid} 1 +:gutter: 0 +:margin: 0 +:padding: 0 + +::::::{grid-item-card} Input Data +A slightly messy collection of records.
+```json +{ + "message-source": "community", + "message-type": "mixed-pickles", + "records": [ + {"nested-list": [{"foo": 1}, [{"foo": 2}, {"foo": 3}]]} + ] +} +``` +:::::: + +::::::{grid-item-card} Output Data +An edited variant suitable for storing. +```json +[ + {"nested-list": [{"foo": 1}, {"foo": 2}, {"foo": 3}]} +] +``` +:::::: + +::::::{grid-item-card} Transformation Rules +The Python program can be executed 1:1 in a Python REPL. +The YAML file needs to be loaded and applied programmatically; there is +no tutorial for that yet, so please refer to the source code and the software tests. + +:::::{dropdown} Syntax: Python API and YAML format + +::::{tab-set-code} +```{code-block} python +# Collection-level "Zyp Collection Transformation" definition. +# Includes two Moksha/jq transformation rules for unwrapping and flattening. + +from pprint import pprint +from zyp.model.collection import CollectionTransformation +from zyp.model.moksha import MokshaTransformation + +data_in = { + "message-source": "community", + "message-type": "mixed-pickles", + "records": [ + {"nested-list": [{"foo": 1}, [{"foo": 2}, {"foo": 3}]]}, + ], +} + +transformation = CollectionTransformation( + pre=MokshaTransformation() + .jq(".records") + .jq('.[] |= (."nested-list" |= flatten)'), +) + +# Transform data and dump to stdout. +data_out = transformation.apply(data_in) +pprint(data_out) +``` +```{code-block} yaml +# Collection-level "Zyp Collection Transformation" definition. +# Includes two Moksha/jq transformation rules for unwrapping and flattening. +--- + +meta: + type: zyp-collection + version: 1 +pre: + rules: + - expression: .records + type: jq + - expression: .[] |= (."nested-list" |= flatten) + type: jq +``` +:::: +::::: + +:::{tip} +Please toggle the "Syntax" dropdown above to inspect how the Python API +is used to define transformation rules, and the corresponding YAML representation.
+::: +:::::: + +::::::: + + + +[issue tracker]: https://github.com/crate/commons-codec/issues diff --git a/doc/zyp/index.md b/doc/zyp/index.md index f3f85a8..b9e1182 100644 --- a/doc/zyp/index.md +++ b/doc/zyp/index.md @@ -1,133 +1,87 @@ # Zyp Transformations ## About -A data model and implementation for a compact transformation engine written -in [Python], based on [JSON Pointer] (RFC 6901), [JMESPath], and [transon], -implemented using [attrs] and [cattrs]. +A data model and implementation for a compact transformation engine +based on [JSON Pointer] (RFC 6901), [JMESPath], [jq], [transon], and [DWIM]. + +The reference implementation is written in [Python], using [attrs] and [cattrs]. +The design, conventions, and definitions also encourage implementations +in other programming languages. ## Ideas :Conciseness: Define a multistep data refinement process with as little code as possible. -:Low Footprint: - Doesn't need any infrastructure or pipeline framework. It's just a little library. +:Flexibility: + Zyp is a data transformation library that can be used within frameworks and + ad hoc pipelines equally well. To be invoked, it doesn't need any infrastructure + services and is pipeline framework agnostic. :Interoperability: - Marshal transformation recipe definition to/from text-only representations (JSON, - YAML), in order to encourage implementations in other languages. + Transformation recipe definitions are represented by a concise data model, which + can be marshalled to/from text-only representations like JSON or YAML, in order to + a) encourage implementations in other programming languages, and + b) be transferred, processed and stored by third party systems. :Performance: - Well, it is written in Python. Fragments can be re-written in Rust, when applicable. + Depending on how many transformation rules are written in pure Python vs. more + efficient processors like jqlang or other compiled transformation languages, it + may be slower or faster. 
When applicable, hot spots of the library + may gradually be rewritten in Rust if performance becomes an issue. :Immediate: Other ETL frameworks and concepts often need to first land your data in the target system before applying subsequent transformations. Zyp works directly within the data pipeline, before data is inserted into the target system. -## Example I -A basic transformation example for individual data records. - -```python -from zyp.model.bucket import BucketTransformation, FieldRenamer, ValueConverter - -# Consider a slightly messy collection of records. -data_in = [ - {"_id": "123", "name": "device-foo", "reading": "42.42"}, - {"_id": "456", "name": "device-bar", "reading": -84.01}, -] - -# Define a transformation that renames the `_id` field to `id`, -# casts its value to `int`, and casts the `reading` field to `float`. -transformation = BucketTransformation( - names=FieldRenamer().add(old="_id", new="id"), - values=ValueConverter() - .add(pointer="/id", transformer="builtins.int") - .add(pointer="/reading", transformer="builtins.float"), -) - -for record in data_in: - print(transformation.apply(record)) -``` -The result is a transformed data collection. -```json -[ - {"id": 123, "name": "device-foo", "reading": 42.42}, - {"id": 456, "name": "device-bar", "reading": -84.01} -] -``` +## Design +:Data Model: + The data model of Zyp is hierarchical: A Zyp project includes definitions for + multiple Zyp collections, each of which includes definitions for possibly multiple sets + of transformation rules of different kinds, for example multiple items of + type `BucketTransformation` or `MokshaTransformation`. -## Example II -A more advanced transformation example for a collection of data records. - -Consider a messy collection of input data. -- The actual collection is nested within the top-level `records` item. -- `_id` fields are conveyed in string format. -- `value` fields include both integer and string values.
-- `value` fields are fixed-point values, using a scaling factor of `100`. -- The collection includes invalid `null` records. - Those records usually trip processing when, for example, filtering on object items. -```python -data_in = { - "message-source": "system-3000", - "message-type": "eai-warehouse", - "records": [ - {"_id": "12", "meta": {"name": "foo", "location": "B"}, "data": {"value": "4242"}}, - None, - {"_id": "34", "meta": {"name": "bar", "location": "BY"}, "data": {"value": -8401}}, - {"_id": "56", "meta": {"name": "baz", "location": "NI"}, "data": {"value": 2323}}, - {"_id": "78", "meta": {"name": "qux", "location": "NRW"}, "data": {"value": -580}}, - None, - None, - ], -} -``` +:Components and Rules: + Those transformation components offer different kinds of features, mostly by + building upon well-known data transformation standards and processors like + JSON Pointer, `jq`, and friends. The components are configured using rules. -Consider after applying a corresponding transformation, the expected outcome is a -collection of valid records, optionally filtered, and values adjusted according -to relevant type hints and other conversions. -```python -data_out = [ - {"id": 12, "meta": {"name": "foo", "location": "B"}, "data": {"value": 42.42}}, - {"id": 34, "meta": {"name": "bar", "location": "BY"}, "data": {"value": -84.01}}, -] -``` +:Phases and Process: + The transformation process is conducted in multiple phases, which are + defined by labels like `pre`, `bucket`, `post`, and `treatment`, applied in that order. + Each phase can include multiple rules of different kinds. -Let's come up with relevant pre-processing rules to cleanse and mangle the shape of the -input collection. In order to make this example more exciting, let's include two special -needs: -- Filter input collection by value of nested element. -- Rename top-level fields starting with underscore `_`.
- -Other than those special rules, the fundamental ones to re-shape the data are: -- Unwrap `records` attribute from container dictionary into actual collection. -- Filter collection, both by omitting invalid/empty records, and by applying query - constrains. -- On each record, rename the top-level `_id` field to `id`. -- On each record, adjust the data types of the `id` and `value` fields. -- Postprocess collection, applying a custom scaling factor to the `value` field. - -Zyp let's you concisely write those rules down, using the Python language. - -```python -from zyp.model.bucket import BucketTransformation, FieldRenamer, ValueConverter -from zyp.model.collection import CollectionTransformation -from zyp.model.moksha import MokshaTransformation - -transformation = CollectionTransformation( - pre=MokshaTransformation().jmes("records[?not_null(meta.location) && !starts_with(meta.location, 'N')]"), - bucket=BucketTransformation( - names=FieldRenamer().add(old="_id", new="id"), - values=ValueConverter() - .add(pointer="/id", transformer="builtins.int") - .add(pointer="/data/value", transformer="builtins.float"), - ), - post=MokshaTransformation().jq(".[] |= (.data.value /= 100)"), -) - -assert transformation.apply(data_in) == data_out -``` -Alternatively, serialize the `zyp-collection` transformation description, -for example into YAML format. -```python -print(transformation.to_yaml()) + +## Synopsis +::::{tab-set} + +:::{tab-item} zyp-project +```{code-block} yaml +:caption: A definition for a Zyp project in YAML format. 
+meta: + type: zyp-project + version: 1 +collections: +- address: + container: testdrive-db + name: foobar-collection + schema: + rules: + - pointer: /some_date + type: DATETIME + - pointer: /another_date + type: DATETIME + bucket: + values: + rules: + - pointer: /some_date + transformer: to_unixtime + - pointer: /another_date + transformer: to_unixtime ``` + +::: + +:::{tab-item} zyp-collection +```{code-block} yaml +:caption: A definition for a Zyp collection in YAML format. + meta: version: 1 type: zyp-collection @@ -151,86 +105,51 @@ post: - expression: .[] |= (.data.value /= 100) type: jq ``` +::: -## Example III -A compact transformation example that uses `jq` to: +:::: -- Unwrap the actual collection which is nested within the top-level `records` item. -- Flatten the item `nested-list` which contains nested lists. -```python -from zyp.model.collection import CollectionTransformation -from zyp.model.moksha import MokshaTransformation - -data_in = { - "message-source": "system-3000", - "message-type": "eai-warehouse", - "records": [ - {"nested-list": [{"foo": 1}, [{"foo": 2}, {"foo": 3}]]}, - ], -} - -data_out = [ - {"nested-list": [{"foo": 1}, {"foo": 2}, {"foo": 3}]}, -] - -transformation = CollectionTransformation( - pre=MokshaTransformation() - .jq(".records") - .jq('.[] |= (."nested-list" |= flatten)'), -) - -assert transformation.apply(data_in) == data_out -``` +## Example Gallery +In order to learn how to use Zyp, please explore the hands-on example gallery. +```{toctree} +:maxdepth: 2 -The same transformation represented in YAML format looks like this. -```yaml -meta: - type: zyp-collection - version: 1 -pre: - rules: - - expression: .records - type: jq - - expression: .[] |= (.data."nested-list" |= flatten) - type: jq +Examples ``` - +You are also welcome to explore and inspect the software test cases to get further +inspiration that might not be reflected in the documentation yet.
+- [tests/zyp]
+- [tests/transform/mongodb]
+- [tests/transform/test_zyp_generic.py]
 
 ## Prior Art
-- [Singer Transformer]
-- [PipelineWise Transformations]
-- [singer-transform]
-- [Meltano Inline Data Mapping]
-- [Meltano Inline Stream Maps]
-- [AWS DMS source filter rules]
-- [AWS DMS table selection and transformation rules]
-- ... and many more. Thanks for the inspirations.
+See [research and development notes](project:#zyp-research),
+specifically [an introduction and overview about Singer].
 
 ## Etymology
-With kudos to [Kris Zyp] for conceiving [JSON Pointer].
+With kudos to [Kris Zyp] for conceiving [JSON Pointer] the other day.
 
-## More
 ```{toctree}
 :maxdepth: 1
+:hidden:
 
-research
-backlog
+Research <research>
+Backlog <backlog>
 ```
 
+[An introduction and overview about Singer]: https://github.com/daq-tools/lorrystream/blob/main/doc/singer/intro.md
 [attrs]: https://www.attrs.org/
-[AWS DMS source filter rules]: https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Tasks.CustomizingTasks.Filters.html
-[AWS DMS table selection and transformation rules]: https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Tasks.CustomizingTasks.TableMapping.SelectionTransformation.html
 [cattrs]: https://catt.rs/
+[DWIM]: https://en.wikipedia.org/wiki/DWIM
 [Kris Zyp]: https://github.com/kriszyp
+[jq]: https://jqlang.github.io/jq/
 [JMESPath]: https://jmespath.org/
 [JSON Pointer]: https://datatracker.ietf.org/doc/html/rfc6901
-[Meltano Inline Data Mapping]: https://docs.meltano.com/guide/mappers/
-[Meltano Inline Stream Maps]: https://sdk.meltano.com/en/latest/stream_maps.html
-[PipelineWise Transformations]: https://transferwise.github.io/pipelinewise/user_guide/transformations.html
 [Python]: https://en.wikipedia.org/wiki/Python_(programming_language)
-[Singer Transformer]: https://github.com/singer-io/singer-python/blob/master/singer/transform.py
-[singer-transform]: https://github.com/dkarzon/singer-transform
+[tests/zyp]: https://github.com/crate/commons-codec/tree/main/tests/zyp
+[tests/transform/mongodb]: https://github.com/crate/commons-codec/tree/main/tests/transform/mongodb
+[tests/transform/test_zyp_generic.py]: https://github.com/crate/commons-codec/blob/main/tests/transform/test_zyp_generic.py
 [transon]: https://transon-org.github.io/
diff --git a/doc/zyp/research.md b/doc/zyp/research.md
index 5780533..6714d8f 100644
--- a/doc/zyp/research.md
+++ b/doc/zyp/research.md
@@ -1,12 +1,23 @@
+(zyp-prior-art)=
 (zyp-research)=
-# Zyp Research
+# Zyp Research and Prior Art
 
 ## Toolbox
 - jq, jsonpointer, jmespath, funcy, morph, boltons, toolz
 - json-spec, jdata, jolt, json-document-transforms, transon
 
+## Prior Art I
+- [An introduction and overview about Singer]
+- [Singer Transformer]
+- [PipelineWise Transformations]
+- [singer-transform]
+- [Meltano Inline Data Mapping]
+- [Meltano Inline Stream Maps]
+- [AirbyteCatalog]
+- [AWS DMS source filter rules]
+- [AWS DMS table selection and transformation rules]
 
-## Prior Art
+## Prior Art II
 - https://pypi.org/project/json-spec/
 - https://pypi.org/project/transon/
 - https://pypi.org/project/jdata/
@@ -23,3 +34,23 @@
 - https://github.com/pacifica/python-jsonpath2
 - https://github.com/reagento/adaptix
 - https://blog.panoply.io/best-data-transformation-tools
+
+## Prior Art III
+- https://github.com/cloud-custodian/cel-python
+- https://github.com/MacHu-GWU/learn_polars-project
+- https://github.com/MacHu-GWU/jsonpolars-project
+- https://github.com/MacHu-GWU/polars_writer-project
+- https://github.com/MacHu-GWU/aws_sdk_polars-project
+- https://github.com/raw-labs/snapi
+- https://github.com/danthedeckie/simpleeval
+
+
+[AirbyteCatalog]: https://docs.airbyte.com/understanding-airbyte/beginners-guide-to-catalog
+[An introduction and overview about Singer]: https://github.com/daq-tools/lorrystream/blob/main/doc/singer/intro.md
+[AWS DMS source filter rules]: https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Tasks.CustomizingTasks.Filters.html
+[AWS DMS table selection and transformation rules]: https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Tasks.CustomizingTasks.TableMapping.SelectionTransformation.html
+[Meltano Inline Data Mapping]: https://docs.meltano.com/guide/mappers/
+[Meltano Inline Stream Maps]: https://sdk.meltano.com/en/latest/stream_maps.html
+[PipelineWise Transformations]: https://transferwise.github.io/pipelinewise/user_guide/transformations.html
+[Singer Transformer]: https://github.com/singer-io/singer-python/blob/master/singer/transform.py
+[singer-transform]: https://github.com/dkarzon/singer-transform
diff --git a/pyproject.toml b/pyproject.toml
index 81b2054..4e186e3 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -304,13 +304,13 @@ check = [
   "test",
 ]
 
-doc-autobuild = [
+docs-autobuild = [
   { cmd = "sphinx-autobuild --open-browser --watch src doc doc/_build" },
 ]
-doc-html = [
+docs-html = [
   { cmd = "sphinx-build -W --keep-going doc doc/_build" },
 ]
-doc-linkcheck = [
+docs-linkcheck = [
   { cmd = "sphinx-build -W --keep-going -b linkcheck doc doc/_build" },
 ]
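Editor's note: the `zyp-collection` definition shown in the Synopsis ends with the
`post` rule `.[] |= (.data.value /= 100)`, which updates every element of the
collection in place, dividing the nested `value` by 100. For readers unfamiliar
with jq's update-assignment operator `|=`, a plain-Python equivalent might look as
follows (the sample records are hypothetical; this does not use the Zyp API):

```python
# Hypothetical records, shaped like the collection in the Synopsis.
records = [
    {"id": 1, "data": {"value": 4284}},
    {"id": 2, "data": {"value": 9980}},
]

# Equivalent of the jq post rule `.[] |= (.data.value /= 100)`:
# visit each element and divide its nested value by 100, in place.
for record in records:
    record["data"]["value"] /= 100

assert records == [
    {"id": 1, "data": {"value": 42.84}},
    {"id": 2, "data": {"value": 99.8}},
]
```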