Inconsistent behavior in json functions #24563
@tdcmeehan This is a major correctness hole in json_extract, right? json_extract:
compared to json_parse:
Ideally, since everything will eventually move to Prestissimo, which will behave 'canonically', I see no reason not to fix Presto Java. Could this be achieved by making a canonicalize() call before returning the result?
Hi Sergey, maybe I misunderstand, but are you saying that if everything is moving to Prestissimo, then we shouldn't have to fix Presto Java, right? (It reads the other way.)
This should take a back seat to correctness in my view, but I agree with having a config property.
While working on implementing JSON functions in Velox for Presto-CPP, I came across some inconsistencies that I wanted to highlight and get the community's opinion on whether it makes sense for us to fix them, either both in Presto-Java and Presto-CPP or just in Presto-CPP.
1. `json_extract` Produces Non-Canonicalized JSON as Output

We expect the JSON type to be canonicalized whenever it exists in a query. Since Presto does not store JSON as a type that can be written to files, this canonicalization is usually done when a VARCHAR column is converted to JSON using one of the JSON functions (typically `json_parse`). This is important because it can affect the equality of two JSON objects and provide inconsistent results if sometimes the JSONs are canonicalized and sometimes they are not.

For example, suppose the input string is:

`json_parse` will canonicalize this to:

NOTE: canonicalization puts the keys in ascending order and removes spaces.
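
As a rough illustration of the canonicalization rule in the note above (keys in ascending order, no insignificant spaces), here is a minimal Python sketch. It is not Presto's or Velox's implementation; the helper name `canonicalize` and the sample inputs are made up for illustration.

```python
import json


def canonicalize(text):
    """Parse JSON text and re-serialize it with sorted keys and no spaces."""
    return json.dumps(json.loads(text), sort_keys=True, separators=(",", ":"))


# Hypothetical inputs: same content, different key order and spacing.
left = '{"key_2": 2, "key_1": 1}'
right = '{"key_1":1,   "key_2":2}'

print(canonicalize(left))
print(canonicalize(left) == canonicalize(right))
```

The two raw strings differ, but their canonical forms are identical, which is the property an equality comparison on the JSON type relies on.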
Now, if there are two inputs:

These should be considered equal JSONs when parsed and compared; otherwise, this can result in correctness issues. If we use `json_extract`, which returns a JSON object type, it would extract and return a non-canonicalized JSON:

NOTE: Spaces are removed, but keys remain untouched.

The same issue occurs in `json_array_get`:

Example of a possible correctness issue:
A GROUP BY on the JSON column returns wrong results.

Recommendation: The expectation is that once a VARCHAR is converted to JSON, it should be in a canonicalized form. Therefore, we should always canonicalize the resultant JSON in all functions that take VARCHAR as prospective JSON input and output a JSON.
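
The grouping problem can be sketched in a few lines of Python. This is only a model of the effect, assuming grouping compares the serialized JSON text; the rows and the `canonicalize` helper are hypothetical and not taken from Presto.

```python
import json
from collections import Counter


def canonicalize(text):
    """Re-serialize JSON text with sorted keys and no spaces."""
    return json.dumps(json.loads(text), sort_keys=True, separators=(",", ":"))


# Hypothetical column values: three rows holding the same logical object,
# one of them with its keys in a different order.
rows = ['{"a":1,"b":2}', '{"b":2,"a":1}', '{"a":1,"b":2}']

# Grouping on the raw text splits equal objects into separate groups.
raw_groups = Counter(rows)

# Grouping on the canonical form puts all three rows in one group.
canonical_groups = Counter(canonicalize(row) for row in rows)

print(len(raw_groups), len(canonical_groups))
```

With the raw text there are two groups; after canonicalization there is one group of three rows.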
2. `json_extract`/`_scalar`, `json_array_get` Accept Invalid JSON to Produce Valid Result

If the JSON is invalid, it will still produce valid output if and only if the target key/index specified in the path comes before the part where the JSON becomes invalid.

NOTE: `"z"a1"` is the invalid part. If you `json_parse` the above string, it will throw an error.

The same issue occurs with `json_array_get`:

Why is this a problem?
- Since `json_extract` does not canonicalize the input, this can return a valid result in one input and invalid in another, even if the two inputs are considered equal when canonicalized.
- `json_extract` seems to have inconsistency depending on the path. For example, Jayway treats `$.` as optional in the beginning, therefore `$.key_2` should be the same as `key_2`. However, as we'll see, it returns different values when the JSON is invalid.

NOTE: Path with `$.key_2` returns a value, but `key_2` returns NULL.

Recommendation: If we decide to move ahead with canonicalization of all functions that produce JSON from VARCHAR (as discussed in # 1), then we can catch invalid JSONs in that step and always return null.
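
To make the failure mode in this section concrete, here is a minimal Python sketch contrasting an early-exit extractor with a validate-first extractor. It only handles top-level object keys (no JSON paths) and is not how Presto, Jayway, or Velox implement extraction; the function names and the sample input are hypothetical.

```python
import json

_decoder = json.JSONDecoder()


def _skip_ws(text, i):
    """Advance past JSON whitespace starting at index i."""
    while i < len(text) and text[i] in " \t\r\n":
        i += 1
    return i


def lazy_extract(text, key):
    """Scan key/value pairs left to right and stop at the first match.

    Anything after the matching value is never looked at, so trailing
    garbage goes unnoticed.
    """
    i = _skip_ws(text, 0)
    if i >= len(text) or text[i] != "{":
        return None
    i = _skip_ws(text, i + 1)
    while i < len(text) and text[i] != "}":
        try:
            name, i = _decoder.raw_decode(text, i)
            i = _skip_ws(text, i)
            if i >= len(text) or text[i] != ":":
                return None
            i = _skip_ws(text, i + 1)
            value, i = _decoder.raw_decode(text, i)
        except json.JSONDecodeError:
            return None
        if name == key:
            return value
        i = _skip_ws(text, i)
        if i < len(text) and text[i] == ",":
            i = _skip_ws(text, i + 1)
    return None


def strict_extract(text, key):
    """Validate the whole document first; return None if it is invalid."""
    try:
        doc = json.loads(text)
    except json.JSONDecodeError:
        return None
    return doc.get(key) if isinstance(doc, dict) else None


# Hypothetical input: the value of "b" is malformed.
bad = '{"a": 1, "b": "z"a1"}'

print(lazy_extract(bad, "a"))    # a value, even though the document is invalid
print(strict_extract(bad, "a"))  # None, because validation fails first
```

The validate-first variant corresponds to the recommendation above: once the input is parsed (and canonicalized) up front, invalid documents consistently yield null regardless of where the target key sits.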
3. `json_array_get` Returns Invalid JSON When Output is a String Scalar

Since we are already on this topic and talking about `json_array_get`, I thought this would be a good time to discuss this known problem in `json_array_get` (this is already highlighted in Presto docs, see [Presto Documentation](https://prestodb.io/docs/current/functions/json.html#json_array_get-json_array-index-json)).

As we decide to add support for this in Presto-CPP (Velox), I was wondering if there are any objections to ensuring we support the correct behavior, that is, a string scalar is correctly enclosed in quotes:
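
A small Python sketch of the proposed behavior, under the assumption that the extracted element is re-serialized as JSON rather than returned as bare text. The function name `array_get_json` and the sample array are illustrative only; this is not the Presto or Velox code, and it only handles non-negative indexes.

```python
import json


def array_get_json(text, index):
    """Return the element at `index` as JSON text, or None if unavailable."""
    try:
        array = json.loads(text)
    except json.JSONDecodeError:
        return None
    # Only non-negative, in-range indexes are handled in this sketch.
    if not isinstance(array, list) or not (0 <= index < len(array)):
        return None
    # Re-serializing keeps string scalars quoted, so the result is valid JSON.
    return json.dumps(array[index], sort_keys=True, separators=(",", ":"))


sample = '["a", [3, 9], "c"]'
print(array_get_json(sample, 0))
print(array_get_json(sample, 1))
```

Here the string element comes back as `"a"` with its quotes, so the result can be parsed as JSON again.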
cc: @amitkdutta @spershin @kgpai @kevinwilfong @Yuhta @kaikalur @feilong-liu @rschlussel @tdcmeehan @aditi-pandit