Destinations: extract common documentation about airbyte fields #53177

Merged
merged 1 commit into from
Feb 10, 2025
32 changes: 17 additions & 15 deletions docs/integrations/destinations/bigquery.md
@@ -121,27 +121,29 @@ deduplicated.

### Raw Table schema

| Airbyte field | Description | Column type |
| ---------------------- | ------------------------------------------------------------------ | ----------- |
| \_airbyte_raw_id | A UUID assigned to each processed event | STRING |
| \_airbyte_extracted_at | A timestamp for when the event was pulled from the data source | TIMESTAMP |
| \_airbyte_loaded_at | Timestamp to indicate when the record was loaded into Typed tables | TIMESTAMP |
| \_airbyte_data | A JSON blob with the event data. | STRING |
The raw table contains these fields:
- `_airbyte_raw_id`
- `_airbyte_generation_id`
- `_airbyte_extracted_at`
- `_airbyte_loaded_at`
- `_airbyte_meta`
- `_airbyte_data`

`_airbyte_data` is a JSON blob with the event data. See [here](/understanding-airbyte/airbyte-metadata-fields)
for more information about the other fields.

**Note:** Although the contents of `_airbyte_data` are fairly stable, the schema of the raw table
may change in future versions.

### Final Table schema

- `_airbyte_raw_id`: A UUID assigned by Airbyte to each event that is processed. The column type in
  BigQuery is `String`.
- `_airbyte_extracted_at`: A timestamp representing when the event was pulled from the data source.
  The column type in BigQuery is `Timestamp`.
- `_airbyte_meta`: A JSON blob representing typing errors. You can query these results to audit
  malformed or unexpected data. The column type in BigQuery is `JSON`.

... and a column of the proper data type for each of the top-level properties from your source's
schema. Arrays and Objects will remain as JSON columns in BigQuery. Learn more about Typing and
Deduping [here](/using-airbyte/core-concepts/typing-deduping).
The final table contains these fields, in addition to the columns declared in your stream schema:
- `_airbyte_raw_id`
- `_airbyte_generation_id`
- `_airbyte_extracted_at`
- `_airbyte_meta`

Again, see [here](/understanding-airbyte/airbyte-metadata-fields) for more information about these fields.

The output tables in BigQuery are partitioned by the Time-unit column `_airbyte_extracted_at` at a
daily granularity and clustered by `_airbyte_extracted_at` and the table Primary Keys. Partitions
32 changes: 23 additions & 9 deletions docs/integrations/destinations/snowflake.md
@@ -191,21 +191,35 @@ Airbyte outputs each stream into its own raw table in `airbyte_internal` schema
overridden by user) and a final table with Typed columns. Contents in raw table are _NOT_
deduplicated.

**Note:** By default, Airbyte creates permanent tables. If you prefer transient tables, create a
dedicated transient database for Airbyte. For more information, refer
to [Working with Temporary and Transient Tables](https://docs.snowflake.com/en/user-guide/tables-temp-transient.html).

### Raw Table schema

| Airbyte field | Description | Column type |
| ---------------------- | ------------------------------------------------------------------ | ------------------------ |
| \_airbyte_raw_id | A UUID assigned to each processed event | VARCHAR |
| \_airbyte_extracted_at | A timestamp for when the event was pulled from the data source | TIMESTAMP WITH TIME ZONE |
| \_airbyte_loaded_at | Timestamp to indicate when the record was loaded into Typed tables | TIMESTAMP WITH TIME ZONE |
| \_airbyte_data | A JSON blob with the event data. | VARIANT |
The raw table contains these fields:
- `_airbyte_raw_id`
- `_airbyte_generation_id`
- `_airbyte_extracted_at`
- `_airbyte_loaded_at`
- `_airbyte_meta`
- `_airbyte_data`

`_airbyte_data` is a JSON blob with the event data. See [here](/understanding-airbyte/airbyte-metadata-fields)
for more information about the other fields.

**Note:** Although the contents of `_airbyte_data` are fairly stable, the schema of the raw table
may change in future versions.

**Note:** By default, Airbyte creates permanent tables. If you prefer transient tables, create a
dedicated transient database for Airbyte. For more information, refer
to [Working with Temporary and Transient Tables](https://docs.snowflake.com/en/user-guide/tables-temp-transient.html).

### Final Table schema

The final table contains these fields, in addition to the columns declared in your stream schema:
- `_airbyte_raw_id`
- `_airbyte_generation_id`
- `_airbyte_extracted_at`
- `_airbyte_meta`

Again, see [here](/understanding-airbyte/airbyte-metadata-fields) for more information about these fields.

## Data type map

60 changes: 60 additions & 0 deletions docs/understanding-airbyte/airbyte-metadata-fields.md
@@ -0,0 +1,60 @@
# Airbyte Metadata fields

In addition to the fields declared in a stream's schema, Airbyte destinations
add several metadata columns to your data. These fields are intended to help you
understand your data and debug errors.

| Airbyte field | Description | Column type |
| ------------------------ | ------------------------------------------------------------------------------------------------- | ------------------------- |
| `_airbyte_raw_id` | A random UUID assigned to each incoming record | String |
| `_airbyte_generation_id` | Incremented each time a [refresh](https://docs.airbyte.com/operator-guides/refreshes) is executed | String |
| `_airbyte_extracted_at` | A timestamp for when the event was pulled from the data source | Timestamp with timezone |
| `_airbyte_loaded_at` | Timestamp to indicate when the record was loaded into the destination | Timestamp with timezone |
| `_airbyte_meta` | Additional information about the record; see [below](#the-_airbyte_meta-field) | Object |

Note that not all destinations populate the `_airbyte_loaded_at` field; it is
typically only useful for destinations that execute [typing and deduping](https://docs.airbyte.com/using-airbyte/core-concepts/typing-deduping).
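
As a rough illustration of how these columns sit alongside your own, the sketch below splits one destination row into its Airbyte metadata and its stream columns. The field names come from the table above; the `split_row` helper and the sample row are hypothetical and are not part of any Airbyte library.

```python
# Metadata column names, copied from the table above.
AIRBYTE_METADATA_FIELDS = {
    "_airbyte_raw_id",
    "_airbyte_generation_id",
    "_airbyte_extracted_at",
    "_airbyte_loaded_at",
    "_airbyte_meta",
}


def split_row(row: dict) -> tuple:
    """Return (metadata, stream_columns) for one destination row.

    Hypothetical helper: it only partitions the keys by name.
    """
    metadata = {k: v for k, v in row.items() if k in AIRBYTE_METADATA_FIELDS}
    stream_columns = {k: v for k, v in row.items() if k not in AIRBYTE_METADATA_FIELDS}
    return metadata, stream_columns


if __name__ == "__main__":
    # Made-up row for illustration only.
    sample_row = {
        "_airbyte_raw_id": "example-uuid",
        "_airbyte_meta": {"sync_id": 1234, "changes": []},
        "id": 1,
        "name": "alice",
    }
    metadata, stream_columns = split_row(sample_row)
    print(sorted(metadata))
    print(sorted(stream_columns))
```

Because not every destination populates `_airbyte_loaded_at`, the helper does not require any particular metadata column to be present.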

## The `_airbyte_meta` field

This field contains additional information about the record. It is written as a JSON object.
All records have a `sync_id` field on this object. This ID has no inherent meaning, but is guaranteed
to increase monotonically across syncs.

There is also a `changes` field, which is used to record any modifications that Airbyte performed on
the record. For example, if a record contained a value which did not match the stream's schema,
the destination connector could write `null` to the destination and add an entry to the `changes`
list.
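
The behaviour described above could be sketched roughly as follows. This is not the code of any real destination connector: `null_invalid_fields` and `expected_types` are hypothetical, the type check is a plain `isinstance` test, and the `change` and `reason` strings are simply copied from the sample `_airbyte_meta` value in this document.

```python
def null_invalid_fields(record: dict, expected_types: dict, sync_id: int) -> tuple:
    """Null out values that do not match the expected type and log each change.

    Hypothetical sketch: returns (cleaned_record, airbyte_meta).
    """
    cleaned = dict(record)  # copy so the caller's record is left untouched
    changes = []
    for field, expected_type in expected_types.items():
        value = cleaned.get(field)
        if value is not None and not isinstance(value, expected_type):
            cleaned[field] = None
            changes.append(
                {
                    "field": field,
                    "change": "NULLED",
                    # Assumption: reason string reused from the sample value.
                    "reason": "DESTINATION_SERIALIZATION_ERROR",
                }
            )
    return cleaned, {"sync_id": sync_id, "changes": changes}


if __name__ == "__main__":
    cleaned, meta = null_invalid_fields(
        {"foo": "not a number", "bar": 3},
        {"foo": int, "bar": int},
        sync_id=1234,
    )
    print(cleaned)
    print(meta)
```

Valid fields pass through unchanged, so a record with no problems ends up with an empty `changes` list.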

Each entry in the `changes` list is itself an object; the schema for these objects is defined in the
[Airbyte protocol](https://github.com/airbytehq/airbyte-protocol/blob/master/protocol-models/src/main/resources/airbyte_protocol/airbyte_protocol.yaml#L88),
as the `AirbyteRecordMessageMetaChange` struct.

For example, if you saw this value in `_airbyte_meta`:
```json
{
  "sync_id": 1234,
  "changes": [
    {
      "field": "foo",
      "change": "NULLED",
      "reason": "DESTINATION_SERIALIZATION_ERROR"
    }
  ]
}
```
You would know:
* This record was written during sync 1234
* The `foo` column was nulled out, because it was not a valid value for the destination
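
The same reading can be done programmatically. The sketch below assumes the `_airbyte_meta` value is available as a JSON string (some destinations may hand it back as an already-parsed object instead); `describe_meta` is a hypothetical helper, not an Airbyte API.

```python
import json


def describe_meta(meta_json: str) -> list:
    """Summarize an `_airbyte_meta` JSON string as human-readable lines."""
    meta = json.loads(meta_json)
    lines = [f"written during sync {meta['sync_id']}"]
    # `changes` is treated as optional here and defaults to an empty list.
    for change in meta.get("changes", []):
        lines.append(f"{change['field']}: {change['change']} ({change['reason']})")
    return lines


if __name__ == "__main__":
    sample = """
    {
      "sync_id": 1234,
      "changes": [
        {"field": "foo", "change": "NULLED", "reason": "DESTINATION_SERIALIZATION_ERROR"}
      ]
    }
    """
    for line in describe_meta(sample):
        print(line)
```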

## Pre-Destinations V2

Destinations which predate [Destinations V2](https://docs.airbyte.com/release_notes/upgrading_to_destinations_v2/)
have a different set of metadata fields: some fields are not supported pre-DV2,
and other fields are present under a different name.

| Airbyte field | Destinations V2 equivalent |
| --------------------- | -------------------------- |
| `_airbyte_ab_id` | `_airbyte_raw_id` |
| `_airbyte_emitted_at` | `_airbyte_extracted_at` |
| `_airbyte_loaded_at` | `_airbyte_loaded_at` |
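
If you need to handle rows from both generations of destinations, the mapping above can be expressed as a simple lookup. This is a minimal sketch: `LEGACY_TO_DV2` mirrors the table, while `to_dv2_names` is a hypothetical helper that only renames keys and does not convert values or add fields that pre-DV2 destinations lack.

```python
# Pre-Destinations V2 field name -> Destinations V2 equivalent (from the table above).
LEGACY_TO_DV2 = {
    "_airbyte_ab_id": "_airbyte_raw_id",
    "_airbyte_emitted_at": "_airbyte_extracted_at",
    "_airbyte_loaded_at": "_airbyte_loaded_at",
}


def to_dv2_names(row: dict) -> dict:
    """Rename legacy metadata keys to their DV2 names; leave other keys as-is."""
    return {LEGACY_TO_DV2.get(key, key): value for key, value in row.items()}


if __name__ == "__main__":
    # Made-up legacy row for illustration only.
    legacy_row = {"_airbyte_ab_id": "example-uuid", "_airbyte_emitted_at": "2025-02-10T00:00:00Z", "id": 1}
    print(to_dv2_names(legacy_row))
```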