Update mapping.md (#873)
* Update mapping.md

* Formatting

* Update note

* Formatting

* formatting

* Remove parquet uuid note
clairemcginty authored Dec 12, 2023
1 parent a4924e3 commit c7619c9
Showing 1 changed file with 39 additions and 28 deletions.
67 changes: 39 additions & 28 deletions docs/mapping.md
@@ -1,33 +1,40 @@
Type Mapping
============

| Scala | Avro | BigQuery | Bigtable<sup>7</sup> | Datastore | Parquet | Protobuf | TensorFlow |
| ------------------------- | ------------------------ | ---------------------- | ------------------------------- | --------------------- | ------------------------------- | ----------------------- | ------------------- |
| `Unit` | `NULL` | - | - | `Null` | - | - | - |
| `Boolean` | `BOOLEAN` | `BOOL` | `Byte` | `Boolean` | `BOOLEAN` | `Boolean` | `INT64`<sup>3</sup> |
| `Char` | `INT`<sup>3</sup> | `INT64`<sup>3</sup> | `Char` | `Integer`<sup>3</sup> | `INT32`<sup>3</sup> | `Int`<sup>3</sup> | `INT64`<sup>3</sup> |
| `Byte` | `INT`<sup>3</sup> | `INT64`<sup>3</sup> | `Byte` | `Integer`<sup>3</sup> | `INT32`<sup>9</sup> | `Int`<sup>3</sup> | `INT64`<sup>3</sup> |
| `Short` | `INT`<sup>3</sup> | `INT64`<sup>3</sup> | `Short` | `Integer`<sup>3</sup> | `INT32`<sup>9</sup> | `Int`<sup>3</sup> | `INT64`<sup>3</sup> |
| `Int` | `INT` | `INT64`<sup>3</sup> | `Int` | `Integer`<sup>3</sup> | `INT32`<sup>9</sup> | `Int` | `INT64`<sup>3</sup> |
| `Long` | `LONG` | `INT64` | `Long` | `Integer` | `INT64`<sup>9</sup> | `Long` | `INT64` |
| `Float` | `FLOAT` | `FLOAT64`<sup>3</sup> | `Float` | `Double`<sup>3</sup> | `FLOAT` | `Float` | `FLOAT` |
| `Double` | `DOUBLE` | `FLOAT64` | `Double` | `Double` | `DOUBLE` | `Double` | `FLOAT`<sup>3</sup> |
| `String` | `STRING` | `STRING` | `String` | `String` | `BINARY` + `STRING`<sup>9</sup> | `String` | `BYTES`<sup>3</sup> |
| `Array[Byte]` | `BYTES` | `BYTES` | `ByteString` | `Blob` | `BINARY` | `ByteString` | `BYTES` |
| `ByteString` | - | - | `ByteString` | `Blob` | - | `ByteString` | `BYTES` |
| Enum<sup>1</sup> | `ENUM` | `STRING`<sup>3</sup> | `String` | `String`<sup>3</sup> | `BINARY` + `ENUM`<sup>9</sup> | Enum | `BYTES`<sup>3</sup> |
| `BigInt` | - | - | `BigInt` | - | - | - | - |
| `BigDecimal` | `BYTES`<sup>4</sup> | `NUMERIC`<sup>6</sup> | `Int` scale + unscaled `BigInt` | - | Int/Binary/Fixed<sup>9</sup> | - | - |
| `Option[T]` | `UNION`<sup>5</sup> | `NULLABLE` | Empty as `None` | Absent as `None` | `OPTIONAL` | `optional`<sup>10</sup> | Size <= 1 |
| `Iterable[T]`<sup>2</sup> | `ARRAY` | `REPEATED` | - | `Array` | `REPEATED` | `repeated` | Size >= 0 |
| Nested | `RECORD` | `STRUCT` | Flat<sup>8</sup> | `Entity` | Group | `Message` | Flat<sup>8</sup> |
| `Map[String, T]` | `MAP` | - | - | - | - | - | - |
| `Instant` | `INT`<sup>4</sup> | `TIMESTAMP` | - | `Integer` | `INT64`<sup>9</sup> | - | - |
| `LocalDateTime` | `LONG`<sup>4</sup> | `DATETIME` | - | - | `INT64`<sup>9</sup> | - | - |
| `OffsetTime` | - | - | - | - | `INT32`/`INT64`<sup>9</sup> | - | - |
| `LocalTime` | `INT`/`LONG`<sup>4</sup> | `TIME` | - | - | `INT32`/`INT64`<sup>9</sup> | - | - |
| `LocalDate` | `INT`<sup>4</sup> | `DATE` | - | - | `INT32`<sup>9</sup> | - | - |
| `UUID` | `STRING`<sup>4</sup> | - | 16 bytes | - | Fixed<sup>9</sup> | - | - |
| Scala | Avro | BigQuery | Bigtable<sup>7</sup> | Datastore | Parquet | Protobuf | TensorFlow |
|-----------------------------------|------------------------------|------------------------|---------------------------------|-----------------------|-----------------------------------|-------------------------|---------------------|
| `Unit` | `NULL` | - | - | `Null` | - | - | - |
| `Boolean` | `BOOLEAN` | `BOOL` | `Byte` | `Boolean` | `BOOLEAN` | `Boolean` | `INT64`<sup>3</sup> |
| `Char` | `INT`<sup>3</sup> | `INT64`<sup>3</sup> | `Char` | `Integer`<sup>3</sup> | `INT32`<sup>3</sup> | `Int`<sup>3</sup> | `INT64`<sup>3</sup> |
| `Byte` | `INT`<sup>3</sup> | `INT64`<sup>3</sup> | `Byte` | `Integer`<sup>3</sup> | `INT32`<sup>9</sup> | `Int`<sup>3</sup> | `INT64`<sup>3</sup> |
| `Short` | `INT`<sup>3</sup> | `INT64`<sup>3</sup> | `Short` | `Integer`<sup>3</sup> | `INT32`<sup>9</sup> | `Int`<sup>3</sup> | `INT64`<sup>3</sup> |
| `Int` | `INT` | `INT64`<sup>3</sup> | `Int` | `Integer`<sup>3</sup> | `INT32`<sup>9</sup> | `Int` | `INT64`<sup>3</sup> |
| `Long` | `LONG` | `INT64` | `Long` | `Integer` | `INT64`<sup>9</sup> | `Long` | `INT64` |
| `Float` | `FLOAT` | `FLOAT64`<sup>3</sup> | `Float` | `Double`<sup>3</sup> | `FLOAT` | `Float` | `FLOAT` |
| `Double` | `DOUBLE` | `FLOAT64` | `Double` | `Double` | `DOUBLE` | `Double` | `FLOAT`<sup>3</sup> |
| `CharSequence` | `STRING` | - | - | - | - | - | - |
| `String` | `STRING` | `STRING` | `String` | `String` | `BINARY` | `String` | `BYTES`<sup>3</sup> |
| `Array[Byte]` | `BYTES` | `BYTES` | `ByteString` | `Blob` | `BINARY` | `ByteString` | `BYTES` |
| `ByteString` | - | - | `ByteString` | `Blob` | - | `ByteString` | `BYTES` |
| `ByteBuffer` | `BYTES` | - | - | - | - | - | - |
| Enum<sup>1</sup> | `ENUM` | `STRING`<sup>3</sup> | `String` | `String`<sup>3</sup> | `BINARY`/`ENUM`<sup>9</sup> | Enum | `BYTES`<sup>3</sup> |
| `BigInt` | - | - | `BigInt` | - | - | - | - |
| `BigDecimal` | `BYTES`<sup>4</sup> | `NUMERIC`<sup>6</sup> | `Int` scale + unscaled `BigInt` | - | `LOGICAL[DECIMAL]`<sup>9,14</sup> | - | - |
| `Option[T]` | `UNION[NULL, T]`<sup>5</sup> | `NULLABLE` | Empty as `None` | Absent as `None` | `OPTIONAL` | `optional`<sup>10</sup> | Size <= 1 |
| `Iterable[T]`<sup>2</sup> | `ARRAY` | `REPEATED` | - | `Array` | `REPEATED`<sup>13</sup> | `repeated` | Size >= 0 |
| Nested | `RECORD` | `STRUCT` | Flat<sup>8</sup> | `Entity` | Group | `Message` | Flat<sup>8</sup> |
| `Map[CharSequence, T]` | `MAP[STRING, T]` | - | - | - | - | - | - |
| `Map[String, T]` | `MAP[STRING, T]` | - | - | - | - | - | - |
| `java.time.Instant` | `LONG`<sup>11</sup> | `TIMESTAMP` | - | `Timestamp` | `LOGICAL[TIMESTAMP]`<sup>9</sup> | - | - |
| `java.time.LocalDateTime` | `LONG`<sup>11</sup> | `DATETIME` | - | - | `LOGICAL[TIMESTAMP]`<sup>9</sup> | - | - |
| `java.time.OffsetTime` | - | - | - | - | `LOGICAL[TIME]`<sup>9</sup> | - | - |
| `java.time.LocalTime` | `LONG`<sup>11</sup> | `TIME` | - | - | `LOGICAL[TIME]`<sup>9</sup> | - | - |
| `java.time.LocalDate` | `INT`<sup>11</sup> | `DATE` | - | - | `LOGICAL[DATE]`<sup>9</sup> | - | - |
| `org.joda.time.LocalDate` | `INT`<sup>11</sup> | - | - | - | - | - | - |
| `org.joda.time.DateTime` | `INT`<sup>11</sup> | - | - | - | - | - | - |
| `org.joda.time.LocalTime` | `INT`<sup>11</sup> | - | - | - | - | - | - |
| `java.util.UUID` | `STRING`<sup>4</sup> | - | ByteString (16 bytes) | - | `FIXED[16]` | - | - |
| `(Long, Long, Long)`<sup>12</sup> | `FIXED[12]` | - | - | - | - | - | - |

1. Those wrapped in `UnsafeEnum` are encoded as strings, see [enums.md](https://github.com/spotify/magnolify/blob/master/docs/enums.md) for more
2. Any subtype of `Iterable[T]`
@@ -37,5 +44,9 @@ Type Mapping
6. Fixed precision of 38 and scale of 9 ([doc](https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types#numeric-type))
7. All Scala types are encoded as big endian `ByteString` for Bigtable
8. Nested fields are encoded flat with field names joined with `.`, e.g. `level1.level2.level3`
9. Parquet logical types ([doc](https://github.com/apache/parquet-format/blob/master/LogicalTypes.md))
9. See the [Parquet logical type specification](https://github.com/apache/parquet-format/blob/master/LogicalTypes.md) for details on logical type schemas. Time types are available at multiple precisions; import `magnolify.parquet.logical.micros._`, `magnolify.parquet.logical.millis._`, or `magnolify.parquet.logical.nanos._` accordingly.
10. See [protobuf.md](https://github.com/spotify/magnolify/blob/master/docs/protobuf.md) for more
11. Logical types are available at microsecond or millisecond precision; import `magnolify.avro.logical.micros._` or `magnolify.avro.logical.millis._` accordingly. BigQuery-compatible conversions are available in `magnolify.avro.logical.bigquery._`.
12. Special tuple used to represent Duration in the [Avro spec](https://avro.apache.org/docs/1.11.0/spec.html#Duration). This instance is not implicit in Magnolify by default; bring `AvroType.afDuration` into implicit scope to enable it.
13. If `magnolify.parquet.ParquetArray.AvroCompat._` is imported, array fields use the nested, Avro-compatible schema format: `required group $FIELDNAME (LIST) { repeated $FIELDTYPE array ($FIELDSCHEMA); }`.
14. Parquet's Decimal logical type supports multiple representations, and none of them is in implicit scope by default. Import one of: `magnolify.parquet.ParquetField.{decimal32, decimal64, decimalFixed, decimalBinary}`.
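
Note 7 and the `UUID` row describe big-endian byte encodings for Bigtable, with a UUID occupying 16 bytes. The sketch below illustrates that layout using only the Java standard library. It is not Magnolify's encoder: the object and method names (`BigEndianSketch`, `longToBytes`, `uuidToBytes`, `bytesToUuid`) are hypothetical, and the ordering of the two UUID halves (most significant first) is an assumption.

```scala
import java.nio.ByteBuffer
import java.util.UUID

// Hypothetical helper object; not part of Magnolify.
object BigEndianSketch {
  // ByteBuffer uses big-endian byte order by default, so no order() call is needed.
  def longToBytes(value: Long): Array[Byte] =
    ByteBuffer.allocate(java.lang.Long.BYTES).putLong(value).array()

  // A UUID is two 64-bit halves; writing them back to back yields 16 bytes.
  // Assumption: the most significant half is written first.
  def uuidToBytes(uuid: UUID): Array[Byte] =
    ByteBuffer
      .allocate(16)
      .putLong(uuid.getMostSignificantBits())
      .putLong(uuid.getLeastSignificantBits())
      .array()

  // Inverse of uuidToBytes: read the two halves back in the same order.
  def bytesToUuid(bytes: Array[Byte]): UUID = {
    require(bytes.length == 16, "a UUID needs exactly 16 bytes")
    val buffer = ByteBuffer.wrap(bytes)
    val mostSignificant = buffer.getLong()
    val leastSignificant = buffer.getLong()
    new UUID(mostSignificant, leastSignificant)
  }

  def main(args: Array[String]): Unit = {
    val uuid = UUID.fromString("123e4567-e89b-12d3-a456-426614174000")
    val bytes = uuidToBytes(uuid)
    println(bytes.length)                  // 16
    println(bytesToUuid(bytes) == uuid)    // true
    println(longToBytes(1L).mkString(",")) // 0,0,0,0,0,0,0,1
  }
}
```

Big-endian order places the most significant byte first, which is why `1L` ends with the single non-zero byte.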
