[WIP] Streaming
`jason` allows streaming JSON decoding, with the following goals:
- Reduce the memory footprint when decoding huge JSON documents
- Parallelize processing across several workers
`jason` makes no assumptions about how the stream should be configured: it is up to you to call `file:open/2` (or something equivalent) with the required options and hand the resulting `io_pid()` to `jason`.
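A minimal sketch of the setup, assuming a hypothetical `jason:stream/1` entry point (the PR does not fix the function name or its return shape):

```erlang
%% Open the stream ourselves, with whatever options the use case needs;
%% jason makes no assumptions about the stream configuration.
{ok, IoPid} = file:open("huge.json", [read, binary]),
%% First call with the io device pid. jason:stream/1 is a hypothetical
%% name for the streaming entry point; the returned value is the global
%% ref() identifying the gen_statem.
Ref = jason:stream(IoPid).
```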
A unique reference is returned by the first call with this PID. This unique reference, hereafter called `ref()`, is a global identifier of a `gen_statem` process. This permits several workers, possibly on different nodes, to receive decoded chunks of the JSON document by using this `ref()`.
Each subsequent call with this `ref()` reads a chunk of the document (1024 bytes by default, configurable), tokenizes the data, then tries to parse the tokens, together with any tokens left over from previous reads, until a valid term is found. If a valid Erlang term is found, the tuple `{sofar, Term}` is returned to the caller; otherwise simply the atom `'sofar'`, until the end of the document. When the end of the document is reached, the atom `'end'` is returned to the caller.
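A worker loop following this return protocol could look like the sketch below (again assuming `jason:stream/1` with the `ref()` as the per-chunk call):

```erlang
%% Pull chunks until the end of the document, accumulating decoded terms.
%% jason:stream/1 taking the ref() is an assumption of this sketch.
consume(Ref, Acc) ->
    case jason:stream(Ref) of
        {sofar, Term} -> consume(Ref, [Term | Acc]); % a valid term was parsed
        sofar         -> consume(Ref, Acc);          % no term yet, read further
        'end'         -> lists:reverse(Acc)          % end of document reached
    end.
```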
As the `gen_statem` keeps track of the read offset, several workers can safely make parallel calls.
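For example, several local workers could share the same `ref()` (workers on other nodes would work the same way via the global reference), reusing the `consume/2` loop above:

```erlang
%% Four workers pulling from the same global ref(); the gen_statem owns
%% the read offset, so no two workers can read the same chunk.
Parent = self(),
[spawn(fun() -> Parent ! {chunk, consume(Ref, [])} end)
 || _ <- lists:seq(1, 4)].
```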
If a huge JSON document starts with an opening array bracket `[`, no valid term can be parsed until the balanced closing `]`. This would mean reading the whole file before any term could be returned, defeating the purpose of streaming. To avoid this issue, `jason` stores this context and tries to parse the JSON elements inside this first-level array, until the closing array token.
To warn the caller of this decision, the atom `'array_start'` is returned first, and `'array_stop'` once the array elements have been fully parsed. If several workers are used, each one will be aware of this context, as `jason` tracks the first call made by any worker.
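Extending the `consume/2` sketch above to handle these two context atoms:

```erlang
%% Same loop as above, now aware of the first-level array context.
consume(Ref, Acc) ->
    case jason:stream(Ref) of
        array_start   -> consume(Ref, Acc);          % entering the top-level array
        {sofar, Term} -> consume(Ref, [Term | Acc]); % one array element decoded
        sofar         -> consume(Ref, Acc);          % no element yet, read further
        array_stop    -> consume(Ref, Acc);          % top-level array fully parsed
        'end'         -> lists:reverse(Acc)          % end of document
    end.
```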