Skip to content

cognitect/transit-format

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

95 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Transit

Transit is a format and set of libraries for conveying values between applications written in different programming languages. This spec describes Transit in order to facilitate its implementation in a wide range of languages.

Version: 0.8

The Transit Mailing List is the best resource for discussing this specification.

Rationale

Transit provides a set of basic elements and a set of extension elements for representing typed values. The extension mechanism is open, allowing programs using Transit to add new elements specific to their needs. Users of data formats without such facilities must rely on either schemas, convention, or context to convey elements not included in the base set, making application code much more complex. With Transit, schemas, convention, and context-sensitive logic are not required.

Transit is designed to be implemented as an encoding on top of formats for which high performance processors already exist, specifically JSON and MessagePack. Transit uses these formats' native representations for built-in elements, e.g., strings and arrays, wherever possible. Extension elements which have no native representation in these formats, e.g., dates, are represented using a tag-based encoding scheme. Extension always bottoms out on built-in types, there are no opaque binary blobs. Thus Transit format can always be decoded, and can be subject to editing, transformation and search operations, even by applications which do not 'know about' particular extension tags. In this sense, Transit is self-describing.

Transit also supports compression via caching of repeated elements, e.g., map keys, that can significantly reduce payload size and decoding time, as well as memory in the resulting application representation.

The design of Transit is focused on program-to-program communication, as opposed to human readability. While it does support an explicit verbose mode for representing Transit elements in JSON (called JSON-Verbose), Transit is not targeted for situations where human readability is paramount.

Transit processes elements in terms of semantic types, but it is not a type system, and has no schemas. Nor is it a system for representing object graphs - there are no reference types nor identity, nor should a consumer have an expectation that two equivalent elements in some body of Transit will yield distinct object identities when read, unless a reader implementation goes out of its way to make such a promise. Thus the resulting values should be considered immutable, and a reader implementation should yield values that ensure this, to the extent possible.

Transit defines the encoding of elements. There is no enclosing element required at the top level. Thus, Transit is suitable for streaming and interactive applications. A use of transit might be a stream or file containing a series of elements, but it could be as small as the conveyance of a single element in e.g. an HTTP query param.

The base set of built-in and extension elements in Transit is meant to cover the basic set of data structures common to most programming languages. While Transit specifies how those elements are encoded, it does not dictate the application memory/object representation on either the producer or consumer side. A well behaved implementation library should endeavor to map the elements to common programming language types with similar semantics.

Implementations

There are currently verified implementations for the following languages:

Each library's major.minor version number corresponds to the version of this specification it implements.

NOTE: Transit is intended primarily as a wire protocol for transferring data between applications. If storing Transit data durably, readers and writers are expected to use the same version of Transit and you are responsible for migrating/transforming/re-storing that data when and if the transit format changes.

For additional languages, see the list of community-provided implementations.

Specification

How it works

Transit is defined in terms of an extensible set of elements used to represent values. The elements correspond to semantic types common across programming languages, e.g., strings, arrays, URIs, etc. When an object is written with Transit, a language-specific Transit library maps the object's type to one of the supported semantic types. Then it encodes the value into MessagePack or JSON using the rules defined for that semantic type. Whenever possible, data is written directly to MessagePack or JSON using those protocols' built-in types. For instance, a string or an array from any language is always just represented as a string or an array in MessagePack or JSON. When a value cannot be represented directly as a built-in type in MessagePack or JSON, it must be encoded. Encoding captures the semantic type and value of the data in a form that can be represented as a built-in type in MessagePack or JSON, either a string, a two element array or a JSON object or MessagePack map (referred to as object/map in the rest of this specification).

When Transit data is read, any encoded values are decoded and programming-language appropriate representations are produced.

Transit defines the rules for encoding and decoding semantically typed values. It does not define how encoded data is stored, transmitted, or otherwise used.

Recursive tag-based encoding

When necessary, Transit encodes values as a tag indicating their semantic type and the value in a form that can be represented directly in MessagePack or JSON, or which can itself be further encoded. Each of the semantic types that Transit supports has a unique tag. Scalar values have single-character tags and composite values have multi-character tags. When a value cannot be directly represented in MessagePack or JSON, it is encoded one of three ways:

  • as a string "~" + tag-char + value-str
  • as an array ["~#tag", value]
  • as a JSON object {"~#tag" : value}

The diagram below provides an overview of the Transit encoding and decoding processes (the Write Flow and Read Flow diagrams below show the logic in detail).

Transit Overview

Both the writing and reading processes differentiate between ground types and extension types. In general, instances of ground types are represented directly in MessagePack or JSON (although there are some exceptions). Instances of extended types are never represented directly in MessagePack or JSON, they are always encoded. Whether they are encoded in string, array or object/map form depends on whether the data is a scalar or a composite as well as whether it is being written to MessagePack or JSON. Write handlers are used to recursively map instance of programming language types to Transit ground or extension types, which themselves map to ground types, while writing. Read handlers are used to recursively map Transit values to instances of programming language types.

Ground and extension types

The two tables below lists all of the built-in semantic types and their corresponding tags. The first table lists scalar types, the second table lists composite types. The first column indicates whether the type is a ground type or an extension type. For each extended type, the rep tag, rep, and string rep columns show the corresponding encoded form. The MessagePack, JSON and JSON-Verbose columns show how a tag and encoded form are combined in the target format and write mode.

Scalar Types

Semantic Type Tag Rep Tag Rep String rep (if not already) MessagePack JSON JSON-Verbose (no caching)
ground null _ nil "_" nil null when not key, else "~_" null when not key, else "~_"
ground string s "string" String String String
ground boolean ? boolean "t" or "f" Boolean Boolean when not key, else "~?t" or "~?f" Boolean when not key, else "~?t" or "~?f"
ground integer, signed 64 bit i integer "123" smallest int that holds value < 2^53 and not key, JSON number; else "~i1234..." < 2^53 and not key, JSON number; else "~i1234..."
ground floating pt decimal d floating pt number "123.456" smallest float that matches precision JSON number when not key, else "~d123.456" JSON number when not key, else "~d123.456"
ground bytes b base64 encoded bytes (RFC 4648) "base64 encoded bytes" "~bbase64" "~bbase64" "~bbase64"
extension keyword : s "key" "~:key" "~:key" "~:key"
extension symbol $ s "sym" "~$sym" "~$sym" "~$sym"
extension arbitrary precision decimal f s "123.456" "~f123.456" "~f123.456" "~f123.456"
extension arbitrary precision integer n s "123" "~n1234" "~n1234" "~n1234"
extension point in time m i int msecs since 1970 "123456789" ["~#m", int] "~m123456789" N/A
extension point in time t s timestamp (RFC 3339, no offset) NA NA "~t1985-04-12T23:20:50.52Z"
extension uuid u s or array [hi64, lo64] (RFC 4122) UUID string (RFC 4122) ["~#u", [hi64, lo64]] "~u531a379e-31bb-4ce1-8690-158dceb64be6" "~u531a379e-31bb-4ce1-8690-158dceb64be6"
extension uri r s uri string (RFC 3986) "~rhttp://..." "~rhttp://..." "~rhttp://..."
extension char c s "c" "~cc" "~cc" "~cc"
extension quoted value ' value NA ["~#'", value] ["~#'", value] {"~#'" : value }
extension special numbers z s "NaN", "INF", "-INF" "~zrep" "~zrep" "~zrep"
extension Scalar extension type X specify or s "arep" or arep "arep" "~Xarep" or ["~#X", arep] "~Xarep" or ["~#X", arep] "~Xarep" or {"~#X": arep}

Composite Types

Semantic Type Tag Rep Tag Rep String rep (if not already) MessagePack JSON JSON-Verbose (no caching)
ground array array iterable Array Array Array
ground map map iterable <map entry> Map Array: ["^ ", k1, v1, ...] Object
extension set set array [vals...] ["~#set", [vals ...]] ["~#set", [vals ...]] {"~#set" : [vals ...]}
extension list list array [vals...] ["~#list", [vals ...]] ["~#list", [vals ...]] {"~#list" : [vals ...]}
extension map w/ composite keys cmap array [k1, v1, ...] ["~#cmap", [k1, v1, ...]] ["~#cmap", [k1, v1, ...]] {"~#cmap" : [k1, v1, ...]}
extension link link map map with string keys: "href", "rel", "name", "render", "prompt"; name, render, prompt are optional; value of href is a URI, value of all other keys is a string, value of render key must be "image" or "link", as per Collection+JSON ["~#link" , {"href": "~rhttp://...", "rel": "a-rel", "name": "a-name", "render": "link or image", "prompt": "a-prompt"}] ["~#link" , ["^ ", "href", "~rhttp://...", "rel", "a-rel", "name", "a-name", "render", "link or image", "prompt", "a-prompt"]] {"~#link" : {"href": "~rhttp://...", "rel": "a-rel", "name": "a-name", "render": "link or image", "prompt": "a-prompt"}}
extension Composite extension type tag specify rep ["~#tag", rep] ["~#tag", rep] {"~#tag" : rep}

Note that there are two write modes for JSON. In normal JSON mode, caching is enabled (explained below) and maps are represented as arrays with a special marker element. There is also JSON-Verbose mode, which is less efficient, but easier for a person to read. In JSON-Verbose mode, caching is disabled and maps are represented as JSON objects. This is useful for configuration files, debugging, or any other situation where readability is more important than performance. A JSON reader is expected to transparently handle data written in either mode and to remain unaware of which mode was used to write the data.

Special Characters

Transit relies on a small number of character sequences to encode specific information. They are summarized in the table below.

Chars Usage Notes
~ string tag followed by single char, upper-case reserved for app extensions, then string value
~# tag followed by tag name, one or more chars
^ cache followed by one or two chars (see Caching below)
"^ " map-as-array marker when it is first item in array, indicates array represents a map
` reserved save backquote for expansion, escaped for now

Because the ~, ^, and ` characters have special meaning, any data string that begins with one of those characters is escaped by prepending a ~.

Caching

Transit implements a caching stream to compress repetitive data. Specifically, all ~#tag, keyword and symbol values are cached when they are more than 3 characters long (including the tag). Strings more than 3 characters long are also cached when they are used as keys in maps whose keys are all "stringable". Once a value is cached, subsequent appearances of the same value are replaced with cache codes.

Cache codes

Cache codes are generated using an increasing integer index. The number is converted to a one or two digit string expressed ASCII 48-91, inclusive, for numerals with a "^" prefix, i.e., "^c" or "^cc". Since there are 44 numerals and up to 2 digits, the possible range of cache codes is from 0 to 44^2. The code below shows how to convert back and forth between integer indexes and the corresponding cache codes.

private static final int CACHE_CODE_DIGITS = 44;
private static final int BASE_CHAR_INDEX = 48;
private static final String SUB_STR = "^";

private String indexToCode(int index) {
    int hi = index / CACHE_CODE_DIGITS;
    int lo = index % CACHE_CODE_DIGITS;
    if (hi == 0) {
        return SUB_STR + (char)(lo + BASE_CHAR_INDEX);
    } else {
        return SUB_STR + (char)(hi + BASE_CHAR_INDEX) + (char)(lo + BASE_CHAR_INDEX);
    }
}

private int codeToIndex(String s) {
    int sz = s.length();
    if (sz == 2) {
        return ((int)s.charAt(1) - WriteCache.BASE_CHAR_INDEX);
    } else {
        return (((int)s.charAt(1) - WriteCache.BASE_CHAR_INDEX) * WriteCache.CACHE_CODE_DIGITS) +
                ((int)s.charAt(2) - WriteCache.BASE_CHAR_INDEX);
    }
}

Write caching

On the writing side, the cache is implemented as two data structures: an incrementing counter and a map of original values to cache code.

String code = indexToCode(index++);
String representation Replacement
"abcd" "^0"
"~:ab" "^1"
"~$cd" "^2"
... ...

The first time a cacheable value is written, Transit adds an entry to to the cache map, increments the counter and writes the original value. The next time the cacheable value is encountered, the cache code is written instead. When the counter reaches its maximum it wraps to 0, the map is discarded, and the process starts again.

Read caching

On the reading side, the cache is also implemented as two data structures: an incrementing counter and an array.

String code = indexToCode(index++);
Index Replacement value
0 "abcd"
1 :ab
2 cd
... ...

The first time a cacheable value is read, Transit adds an entry to the cache array, increments the counter and processes the original value. If the counter wraps, the process starts again from 0. When a cache code is read, the corresponding original value is retrieved from the indicated index in the read cache.

Because the writer and the reader encounter cacheable values in the same order, cache code generation stays in sync.

Extensibility

Applications can extend Transit as necessary. There are two steps to extending Transit: defining a new semantic type and adding write and read handlers to map from / to programming language types.

To define a new semantic type, specify its meaning, tag and representation. You can also define a string representation and a verbose representation, but they are not required. For instance, you could define a new semantic type representing a point in the Cartesian coordinate system, with the tag "point" and represented as an array of two integers x and y:

Semantic Type Tag Rep Rep Tag String rep (if not already) MessagePack JSON JSON-Verbose (no caching)
extension point point array [int, int] ["~#point", [int, int] ] ["~#point", [int, int] ] {"~#point" : [int, int] }

Once the semantic type is defined, you can create write and read handlers.

A write handler maps values of a programming language type to values of a Transit semantic type. A write handler is a logical interface with the following operations (details differ by programming language):

Operation Args Notes
tag object return tag for object
rep object return encodable representation of object
stringRep object return string representation of object
getVerboseHandler <none> return an alternate handler to use in verbose mode to produce a more readable representation

A write handler for an extension type must implement the rep operation. Transit calls rep to get an encodable representation of a value. The encodable representation may be any type for which a handler exists OR a type that can be mapped directly to a ground semantic type. The tag rep column in the semantic type table above lists the programming language types that map directly to ground types without requiring handlers (for instance, an iterable maps directly to an array). Transit implementations provide an tagged-value function to allow you to specify a particular tag and rep to use to represent an extension type, if that is more efficient than representing an extension type using a type for which a handler exists. For example, you can represent an extension type as an array by either having rep return an array (for which there is a handler) or by having it return an tagged-value with the tag "array" and an iterable value as the rep.

Maps are used to associate programming language types with write handlers.

A read handler maps values of a Transit semantic type to values of a programming language type. A read handler is a logical function that constructs a new value from a representation (details differ by programming language).

Maps are used to associate Transit tags with read handlers.

Here is an example of write and read handlers for the point semantic type that map from/to a Point record type in Clojure:

(defrecord Point [x y])

;; write handler
{Point
   (reify Handler
     (tag [_ _] "point")
     (rep [_ p] [(.x p) (.y p)])
     (stringRep [_ kw] nil)
     (verboseHandler [_] nil))}

;; read handler
{"point"
  (fn [rep] (let [[x y] rep] (Point. x y)))}

Recursive Extensions

You can define extension types in terms of other extension types. Transit manages the details of recursively encoding and decoding representations.

For example, imagine a circle semantic type with the tag "circle" represented as an array of its origin (a point) and radius (an integer):

Semantic Type Tag Rep Rep Tag String rep (if not already) MessagePack JSON JSON-Verbose (no caching)
extension circle circle array [point, int] ["~#circle", [point, int]] ["~#circle", [point, int] ] {"~#circle" : [point, int] }

Here is an example of write and read handlers for the circle semantic type that map from/to a Circle record type in Clojure:

(defrecord Circle [origin radius])

;; write handler
{Circle
   (reify Handler
     (tag [_ _] "circle")
     (rep [_ c] [(.origin c) (.radius c)])
     (stringRep [_ kw] nil)
     (verboseHandler [_] nil))}

;; read handler
{"circle"
  (fn [rep] (let [[origin radius] rep] (Circle. origin radius)))}

The Transit encoding of a circle at 10, 20 with a radius of 5 looks like this in JSON:

["~#circle", [ ["~#point", [10, 20] ], 5] ]

Quoting

There is a special representation for quoted values. Quote tag is ' (~#') and yields its value intact. It is used to wrap scalar values at top-level, where some JSON parsers will not accept scalars.

["~#'", "a quoted string"]

Transit handles quoting top-level scalars on write and unquoting them on read as necessary.

TaggedValues

It is possible that Transit encoded data will contain a semantic type that a processing application does not have a read handler for. In that case, the encoded value cannot be decoded and is returned as an instance of a special TaggedValue type with two properties, a tag and a value (details vary by programming language). TaggedValues can be inspected by application code if required. If a TaggedValue instance is written with Transit, the tag and value are used for the encoded form. ensuring that TaggedValues roundtrip correctly.

Write Flow

The diagram below describes the Transit encoding process.

Transit Write Flow

Read Flow

The diagram below describes the Transit decoding process.

Transit Read Flow

MIME Types

The MIME type for Transit format data depends on the encoding scheme:

Encoding MIME type
JSON / JSON-Verbose application/transit+json
MessagePack application/transit+msgpack

License

Copyright © 2014 Cognitect Inc

Creative Commons License
Transit Format Specification by Cognitect is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.