expand columns having json into multiple columns
- Plugin type: filter
- json_column_name: a column name having json to be expanded (string, required)
- root: root property to start fetching each entries, specify in JsonPath style (string, default:
"$."
) - expanded_columns: columns expanded into multiple columns (array of hash, required)
- name: name of the column. you can define JsonPath style.
- type: type of the column (see below)
- format: format of the timestamp if type is timestamp
- timezone: Time zone of each timestamp columns if values don’t include time zone description (
UTC
by default)
- keep_expanding_json_column: Not remove the expanding json column from input schema if it's true (false by default)
- default_timezone: Time zone of timestamp columns if values don’t include time zone description (
UTC
by default) - stop_on_invalid_record: Stop bulk load transaction if an invalid record is included (false by default)
- cache_provider: Cache provider name for JsonPath.
"LRU"
and"NOOP"
are built-in. You can specify user defined class. (string, default:"LRU"
)"NOOP"
becomes default in the future.
type of the column
name | description |
---|---|
boolean | true or false |
long | 64-bit signed integers |
timestamp | Date and time with nano-seconds precision |
double | 64-bit floating point numbers |
string | Strings |
filters:
- type: expand_json
json_column_name: json_payload
root: "$."
expanded_columns:
- {name: "phone_numbers", type: string}
- {name: "app_id", type: long}
- {name: "point", type: double}
- {name: "created_at", type: timestamp, format: "%Y-%m-%d", timezone: "UTC"}
- {name: "profile.anniversary.et", type: string}
- {name: "profile.anniversary.voluptatem", type: string}
- {name: "profile.like_words[1]", type: string}
- {name: "profile.like_words[2]", type: string}
- {name: "profile.like_words[0]", type: string}
- If the value evaluated by JsonPath is Array or Hash, the value is written as JSON.
- https://github.com/jayway/JsonPath
- use to evaluate JsonPath
- Apache License Version 2.0
$ ./gradlew gem
$ embulk run -Ibuild/gemContents/lib ./example/config.yml
$ ./gradlew gem # -t to watch change of files and rebuild continuously
In some cases, cache_provider: NOOP
improves the performance of this plugin by 3 times (embulk#41).
So we do a benchmark about cache_provider
. In our case, cache_provider: noop
improves the performance by 1.5 times.
use expand_json filter |
cache_provider | Time took | records/s |
---|---|---|---|
false |
none | 7.62s | 1,325,459/s |
true |
"LRU" |
2m9s | 78,025/s |
true |
"NOOP" |
1m25s | 118,476/s |
You can reproduce the bench by the below way.
./gradlew gem
./bench/run.sh
Modify version
in build.gradle
at a detached commit, and then tag the commit with an annotation.
git checkout --detach master
(Edit: Remove "-SNAPSHOT" in "version" in build.gradle.)
git add build.gradle
git commit -m "Release vX.Y.Z"
git tag -a vX.Y.Z
(Edit: Write a tag annotation in the changelog format.)
See Keep a Changelog for the changelog format. We adopt a part of it for Git's tag annotation like below.
## [X.Y.Z] - YYYY-MM-DD
### Added
- Added a feature.
### Changed
- Changed something.
### Fixed
- Fixed a bug.
Push the annotated tag, then. It triggers a release operation on GitHub Actions after approval.
git push -u origin vX.Y.Z
- @Civitaspo
- @muga
- @sakama