diff --git a/docs/integrations/destinations/bigquery.md b/docs/integrations/destinations/bigquery.md
index df333e51da9f9..65ec0901269d3 100644
--- a/docs/integrations/destinations/bigquery.md
+++ b/docs/integrations/destinations/bigquery.md
@@ -121,27 +121,29 @@ deduplicated.
 
 ### Raw Table schema
 
-| Airbyte field          | Description                                                        | Column type |
-| ---------------------- | ------------------------------------------------------------------ | ----------- |
-| \_airbyte_raw_id       | A UUID assigned to each processed event                            | STRING      |
-| \_airbyte_extracted_at | A timestamp for when the event was pulled from the data source     | TIMESTAMP   |
-| \_airbyte_loaded_at    | Timestamp to indicate when the record was loaded into Typed tables | TIMESTAMP   |
-| \_airbyte_data         | A JSON blob with the event data.                                   | STRING      |
+The raw table contains these fields:
+- `_airbyte_raw_id`
+- `_airbyte_generation_id`
+- `_airbyte_extracted_at`
+- `_airbyte_loaded_at`
+- `_airbyte_meta`
+- `_airbyte_data`
+
+`_airbyte_data` is a JSON blob with the event data. See [here](/understanding-airbyte/airbyte-metadata-fields)
+for more information about the other fields.
 
 **Note:** Although the contents of the `_airbyte_data` are fairly stable, schema of the raw table
 could be subject to change in future versions.
 
 ### Final Table schema
 
-- `airbyte_raw_id`: A UUID assigned by Airbyte to each event that is processed. The column type in
-  BigQuery is `String`.
-- `airbyte_extracted_at`: A timestamp representing when the event was pulled from the data source.
-  The column type in BigQuery is `Timestamp`.
-- `_airbyte_meta`: A JSON blob representing typing errors. You can query these results to audit
-  misformatted or unexpected data. The column type in BigQuery is `JSON`. ... and a column of the
-  proper data type for each of the top-level properties from your source's schema. Arrays and
-  Objects will remain as JSON columns in BigQuery. Learn more about Typing and Deduping
-  [here](/using-airbyte/core-concepts/typing-deduping)
+The final table contains these fields, in addition to the columns declared in your stream schema:
+- `_airbyte_raw_id`
+- `_airbyte_generation_id`
+- `_airbyte_extracted_at`
+- `_airbyte_meta`
+
+Again, see [here](/understanding-airbyte/airbyte-metadata-fields) for more information about these fields.
 
 The output tables in BigQuery are partitioned by the Time-unit column `airbyte_extracted_at` at a
 daily granularity and clustered by `airbyte_extracted_at` and the table Primary Keys. Partitions
diff --git a/docs/integrations/destinations/snowflake.md b/docs/integrations/destinations/snowflake.md
index 57e98410c26dd..a32bb9e4765a4 100644
--- a/docs/integrations/destinations/snowflake.md
+++ b/docs/integrations/destinations/snowflake.md
@@ -191,21 +191,35 @@ Airbyte outputs each stream into its own raw table in `airbyte_internal` schema
 overriden by user) and a final table with Typed columns. Contents in raw table are _NOT_
 deduplicated.
 
+**Note:** By default, Airbyte creates permanent tables. If you prefer transient tables, create a
+dedicated transient database for Airbyte. For more information, refer
+to [Working with Temporary and Transient Tables](https://docs.snowflake.com/en/user-guide/tables-temp-transient.html)
+
 ### Raw Table schema
 
-| Airbyte field          | Description                                                        | Column type              |
-| ---------------------- | ------------------------------------------------------------------ | ------------------------ |
-| \_airbyte_raw_id       | A UUID assigned to each processed event                            | VARCHAR                  |
-| \_airbyte_extracted_at | A timestamp for when the event was pulled from the data source     | TIMESTAMP WITH TIME ZONE |
-| \_airbyte_loaded_at    | Timestamp to indicate when the record was loaded into Typed tables | TIMESTAMP WITH TIME ZONE |
-| \_airbyte_data         | A JSON blob with the event data.                                   | VARIANT                  |
+The raw table contains these fields:
+- `_airbyte_raw_id`
+- `_airbyte_generation_id`
+- `_airbyte_extracted_at`
+- `_airbyte_loaded_at`
+- `_airbyte_meta`
+- `_airbyte_data`
+
+`_airbyte_data` is a JSON blob with the event data. See [here](/understanding-airbyte/airbyte-metadata-fields)
+for more information about the other fields.
 
 **Note:** Although the contents of the `_airbyte_data` are fairly stable, schema of the raw table
 could be subject to change in future versions.
 
-**Note:** By default, Airbyte creates permanent tables. If you prefer transient tables, create a
-dedicated transient database for Airbyte. For more information, refer
-to[ Working with Temporary and Transient Tables](https://docs.snowflake.com/en/user-guide/tables-temp-transient.html)
+### Final Table schema
+
+The final table contains these fields, in addition to the columns declared in your stream schema:
+- `_airbyte_raw_id`
+- `_airbyte_generation_id`
+- `_airbyte_extracted_at`
+- `_airbyte_meta`
+
+Again, see [here](/understanding-airbyte/airbyte-metadata-fields) for more information about these fields.
 
 ## Data type map
 
diff --git a/docs/understanding-airbyte/airbyte-metadata-fields.md b/docs/understanding-airbyte/airbyte-metadata-fields.md
new file mode 100644
index 0000000000000..5993edf123d82
--- /dev/null
+++ b/docs/understanding-airbyte/airbyte-metadata-fields.md
@@ -0,0 +1,60 @@
+# Airbyte Metadata fields
+
+In addition to the fields declared in a stream's schema, Airbyte destinations
+append additional columns to your data. These fields are intended to aid in
+understanding your data, as well as debugging various errors.
+
+| Airbyte field            | Description                                                                                       | Column type               |
+| ------------------------ | ------------------------------------------------------------------------------------------------- | ------------------------- |
+| `_airbyte_raw_id`        | A random UUID assigned to each incoming record                                                    | String                    |
+| `_airbyte_generation_id` | Incremented each time a [refresh](https://docs.airbyte.com/operator-guides/refreshes) is executed | String                    |
+| `_airbyte_extracted_at`  | A timestamp for when the event was pulled from the data source                                    | Timestamp with timezone   |
+| `_airbyte_loaded_at`     | Timestamp to indicate when the record was loaded into the destination                             | Timestamp with timezone   |
+| `_airbyte_meta`          | Additional information about the record; see [below](#the-_airbyte_meta-field)                    | Object                    |
+
+Note that not all destinations populate the `_airbyte_loaded_at` field; it is
+typically only useful for destinations that execute [typing and deduping](https://docs.airbyte.com/using-airbyte/core-concepts/typing-deduping).
+
+## The `_airbyte_meta` field
+
+This field contains additional information about the record. It is written as a JSON object.
+All records have a `sync_id` field on this object. This ID has no inherent meaning, but is guaranteed
+to increase monotonically across syncs.
+
+There is also a `changes` field, which is used to record any modifications that Airbyte performed on
+the record. For example, if a record contained a value which did not match the stream's schema,
+the destination connector could write `null` to the destination and add an entry to the `changes`
+list.
+
+Each entry in the `changes` list is itself an object; the schema for these objects is defined in the
+[Airbyte protocol](https://github.com/airbytehq/airbyte-protocol/blob/master/protocol-models/src/main/resources/airbyte_protocol/airbyte_protocol.yaml#L88),
+as the `AirbyteRecordMessageMetaChange` struct.
+
+For example, if you saw this value in `_airbyte_meta`:
+```json
+{
+  "sync_id": 1234,
+  "changes": [
+    {
+      "field": "foo",
+      "change": "NULLED",
+      "reason": "DESTINATION_SERIALIZATION_ERROR"
+    }
+  ]
+}
+```
+You would know:
+* This record was written during sync 1234
+* The `foo` column was nulled out, because it was not a valid value for the destination
+
+## Pre-Destinations V2
+
+Destinations which predate [Destinations V2](https://docs.airbyte.com/release_notes/upgrading_to_destinations_v2/)
+have a different set of metadata fields: some fields are not supported pre-DV2,
+and other fields are present under a different name.
+
+| Airbyte field         | Destinations V2 equivalent |
+| --------------------- | -------------------------- |
+| `_airbyte_ab_id`      | `_airbyte_raw_id`          |
+| `_airbyte_emitted_at` | `_airbyte_extracted_at`    |
+| `_airbyte_loaded_at`  | `_airbyte_loaded_at`       |
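
To illustrate how a downstream consumer might read the `_airbyte_meta` value documented in this patch, here is a minimal Python sketch. It assumes the column has already been fetched as JSON text. The helper name `summarize_meta` and its output format are hypothetical and not part of Airbyte. Only the keys `sync_id`, `changes`, `field`, `change`, and `reason` are taken from the example in the new doc page.

```python
import json


def summarize_meta(raw_meta: str) -> list[str]:
    """Return readable notes for one `_airbyte_meta` value.

    Hypothetical helper: assumes `raw_meta` is the JSON text of the column.
    """
    meta = json.loads(raw_meta)
    notes = [f"written during sync {meta['sync_id']}"]
    # Assumption: `changes` may be missing, null, or empty when nothing was modified.
    for change in meta.get("changes") or []:
        notes.append(f"{change['field']}: {change['change']} ({change['reason']})")
    return notes


# The sample value from the documentation page added in this patch.
example = """
{
  "sync_id": 1234,
  "changes": [
    {"field": "foo", "change": "NULLED", "reason": "DESTINATION_SERIALIZATION_ERROR"}
  ]
}
"""

for note in summarize_meta(example):
    print(note)
```

Treating a missing or null `changes` list as empty is a defensive choice for this sketch, not documented behavior.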