
[WIP] Streaming

jason allows streaming JSON decoding, with the following goals:

  • Reduce the memory footprint when decoding huge JSON documents
  • Parallelize processing across several workers

How it works

jason makes no assumptions about how the stream should be configured, so it is up to you to call file:open/2 (or an equivalent) with the required options and hand the io_pid() to jason. A unique reference is returned at the first call with this PID. This unique reference, hereafter called ref(), is a global identifier of a gen_statem process. This permits several workers, possibly on different nodes, to receive decoded chunks of the JSON document by using this ref().
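
As a minimal sketch, assuming a streaming entry point named jason:stream/1 (the function name, its return shape and the file options below are not documented on this page and are placeholders only):

```erlang
%% Sketch only: jason:stream/1 is an assumed name for the streaming entry
%% point; "huge.json", the open options and the {ok, Ref} return shape are
%% placeholders, not a confirmed API.
{ok, Fd}  = file:open("huge.json", [read, binary]),
{ok, Ref} = jason:stream(Fd).
%% Ref is the ref() that this and any other worker use for subsequent calls.
```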

Each subsequent call with this ref() reads a chunk of the document (1024 bytes by default, but this can be changed), tokenizes the data, then tries to parse the tokens, together with any tokens left over from previous reads, until a valid term is found. If a valid Erlang term is found, the tuple {sofar, Term} is returned to the caller; otherwise simply the atom 'sofar', until the end of the document. When the end of the document is reached, the atom 'end' is returned to the caller.
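
A caller-side loop built on these return values could look like the sketch below, where jason:stream_next/1 is an assumed name for the per-chunk call, not a confirmed API:

```erlang
%% Sketch: jason:stream_next/1 is an assumed name for the call that reads,
%% tokenizes and parses the next chunk for a given ref().
consume(Ref, Acc) ->
    case jason:stream_next(Ref) of
        {sofar, Term} -> consume(Ref, [Term | Acc]); %% a complete term was decoded
        sofar         -> consume(Ref, Acc);          %% not enough tokens yet, read on
        'end'         -> lists:reverse(Acc)          %% end of document reached
    end.
```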

Since the ref() identifies a gen_statem process that keeps track of the read offset, several workers can make parallel calls safely.
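
For instance, several workers could each run the consume/2 loop sketched above against the same ref() (again purely illustrative):

```erlang
%% Sketch: spawn N workers that all pull chunks from the same ref().
start_workers(Ref, N) ->
    [spawn(fun() -> consume(Ref, []) end) || _ <- lists:seq(1, N)].
```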

Importance of the JSON document structure

If a huge JSON document starts with an array opening token [, no valid term can be parsed until the balanced array closing token ]. This would mean reading the whole file before any term could be returned, making streaming useless. To avoid this, jason stores this context and tries to parse the JSON elements inside this first-level array, until the end-of-array token is reached.

To inform the caller of this decision, the atom 'array_start' is returned first, and 'array_stop' once the array elements have been completely parsed. If several workers are used, each one will be aware of this context, as jason tracks the first call made by any worker.
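
Extending the earlier consume/2 sketch to handle these two extra atoms (same assumed jason:stream_next/1 call):

```erlang
%% Sketch: also handle the first-level array context notifications.
consume(Ref, Acc) ->
    case jason:stream_next(Ref) of
        array_start   -> consume(Ref, Acc);          %% first-level array opened
        {sofar, Term} -> consume(Ref, [Term | Acc]); %% one array element decoded
        sofar         -> consume(Ref, Acc);          %% keep reading chunks
        array_stop    -> consume(Ref, Acc);          %% all array elements parsed
        'end'         -> lists:reverse(Acc)          %% end of document
    end.
```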
