Inconsistent behavior in json functions #24563

bikramSingh91 · 2025-02-14T02:19:21Z

While working on implementing JSON functions in Velox for Presto-CPP, I came across some inconsistencies that I wanted to highlight and get the community's opinion on whether it makes sense for us to fix them, either both in Presto-Java and Presto-CPP or just in Presto-CPP.

1. `json_extract` Produces Non-Canonicalized JSON as Output

We expect the JSON type to be canonicalized whenever it exists in a query. Since Presto does not store JSON as a type that can be written to files, this canonicalization is usually done when a VARCHAR column is converted to JSON using one of the JSON functions (typically json_parse). This is important because it can affect the equality of two JSON objects and provide inconsistent results if sometimes the JSONs are canonicalized and sometimes they are not.

For example, suppose the input string is:

{ "key_2": 2, "key_3": 3, "key_1": 1 }

json_parse will canonicalize this to:

presto:default> select JSON_PARSE('{ "key_2": 2, "key_3": 3, "key_1": 1 }');
_col0
---------------------------------
{"key_1":1,"key_2":2,"key_3":3}

NOTE: canonicalization puts the keys in ascending order and removes spaces.

Now, if there are two inputs:

{ "key_2": 2, "key_3": 3, "key_1": 1 }
{ "key_1": 1, "key_2": 2, "key_3": 3 }

These should be considered equal JSONs when parsed and compared; otherwise, this can result in correctness issues. If we use json_extract, which returns a JSON object type, it would extract and return a non-canonicalized JSON:

presto:default> select JSON_EXTRACT('{ "key_2": 2, "key_3": 3, "key_1": 1 }', '$');
_col0
---------------------------------
{"key_2":2,"key_3":3,"key_1":1}

NOTE: Spaces are removed, but keys remain untouched.

The same issue occurs in json_array_get:

presto:default> select json_array_get('[{ "key_2": 2, "key_1": 1 }]', 0);
_col0
-----------------------
{"key_2":2,"key_1":1}

Example of a possible correctness issue:
A group by on the json column returning wrong results

SELECT
    ("json_extract"((output_fields), '$.key')) c0
FROM (
    SELECT
        *
    FROM (
        VALUES
            ('{"key": {"inner_1":"1", "inner_2":"2"}'),
            ('{"key": {"inner_2":"2", "inner_1":"1"}}')
    ) AS t (output_fields)
)
GROUP BY
    1;

              c0               
-------------------------------
 {"inner_1":"1","inner_2":"2"} 
 {"inner_2":"2","inner_1":"1"}

Recommendation: The expectation is that once a VARCHAR is converted to JSON, it should be in a canonicalized form. Therefore, we should always canonicalize the resultant JSON in all functions that take VARCHAR as prospective JSON input and output a JSON.

2. `json_extract` / `_scalar`, `json_array_get` Accepts Invalid JSON to Produce Valid Result

If the JSON is invalid, it will still produce valid output if and only if the target key/index specified in the path comes before the part where the JSON becomes invalid.

presto:default> select JSON_EXTRACT('{ "key_2": 2, "key_1": "z"a1" }', '$.key_2');
_col0
-------
2
presto:default> select JSON_EXTRACT('{ "key_2": 2, "key_1": "z"a1" }', '$.key_1');
_col0
-------
"z"

NOTE: "z"a1" is the invalid part. If you json_parse the above string, it will throw an error.

The same issue occurs with json_array_get:

presto:default> select json_array_get('[{ "key_2": 2, "key_1": 1 }, {"z"aa"}]', 0);
_col0
-----------------------
{"key_2":2,"key_1":1}

Why is this a problem?

This makes the behavior unpredictable in the face of invalid JSON.
Since json_extract does not canonicalize the input, this can return a valid result in one input and invalid in another, even if the two inputs are considered equal when canonicalized.
Moreover, json_extract seems to have inconsistency depending on the path. For example, Jayway treats $. as optional in the beginning, therefore $.key_2 should be the same as key_2. However, as we'll see, it returns different values when the JSON is invalid.

presto:default> select JSON_EXTRACT('{ "key_2": 2, "key_1": "z"a1" }', '$.key_2');
_col0
-------
2
presto:default> select JSON_EXTRACT('{ "key_2": 2, "key_1": "z"a1" }', 'key_2');
_col0
-------
NULL

NOTE: Path with $.key_2 returns a value, but key_2 returns NULL.

Recommendation: If we decide to move ahead with canonicalization of all functions that produce JSON from VARCHAR (as discussed in # 1), then we can catch invalid JSONs in that step and always return null.

3. `json_array_get` Returns Invalid JSON When Output is a String Scalar

Since we are already on this topic and talking about json_array_get, I thought this would be a good time to discuss this known problem in json_array_get (this is already highlighted in Presto docs, see [Presto Documentation](https://prestodb.io/docs/current/functions/json.html#json_array_get-json_array-index-json)).

SELECT json_array_get('["a", [3, 9], "c"]', 0); -- JSON 'a' < == (invalid JSON)

As we decide to add support for this in Presto-CPP (Velox), I was wondering if there are any objections to ensuring we support the correct behavior, that is, a string scalar is correctly enclosed in quotes:

SELECT json_array_get('["a", [3, 9], "c"]', 0); -- JSON '"a"' < == (valid JSON)

cc: @amitkdutta @spershin @kgpai @kevinwilfong @Yuhta @kaikalur @feilong-liu @rschlussel @tdcmeehan @aditi-pandit

The text was updated successfully, but these errors were encountered:

kgpai · 2025-02-14T17:54:27Z

@tdcmeehan This is a major correctness hole with json_extract right ?

json_extract:

 SELECT
    y,
    COUNT(1)
FROM (
    SELECT
        JSON_EXTRACT(x, '$') AS y
    FROM (
        VALUES
            ('{ "key_2": 2, "key_3": 3, "key_1": 1 }'),
            ('{"key_3": 3, "key_2": 2, "key_1": 1 }')
    ) AS t(x)
)
GROUP BY
    1

------- Returns

{"key_2":2,"key_3":3,"key_1":1} | 1
{"key_3":3,"key_2":2,"key_1":1} | 1

compared to json_parse:

SELECT
    y,
    COUNT(1)
FROM (
    SELECT
        JSON_PARSE(x) AS y
    FROM (
        VALUES
            ('{ "key_2": 2, "key_3": 3, "key_1": 1 }'),
            ('{"key_3": 3, "key_2": 2, "key_1": 1 }')
    ) AS t(x)
)
GROUP BY
    1

---- Returns


{"key_1":1,"key_2":2,"key_3":3} | 2

spershin · 2025-02-14T21:06:10Z

Ideally, as everything eventually moves to Prestissimo and it will behave 'canonically', then I see no reason of not fixing Presto Java.
Depends how hard it is to achieve.

Can it be achieved by making a canonicalize() call before returning the result?
This would introduce a performance regression, which we could overcome by wrapping it into a config property, which we would only turn on for verification purposes and it can also be turned on by whoever wants to opt in into this behavior.

kgpai · 2025-02-14T21:23:21Z

as everything eventually moves to Prestissimo and it will behave 'canonically', then I see no reason of not fixing Presto Java.

Hi Sergey, maybe I misunderstand, but you are saying if everything is moving to prestissimo, then we shouldnt have to fix Presto Java right ? (It reads the other way).

This would introduce a performance regression,

This should take back seat to correctness in my view, but agree with having a config property.

bikramSingh91 added the bug label Feb 14, 2025

github-project-automation bot added this to Bugs and support requests Feb 14, 2025

github-project-automation bot moved this to 🆕 Unprioritized in Bugs and support requests Feb 14, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Inconsistent behavior in json functions #24563

Inconsistent behavior in json functions #24563

bikramSingh91 commented Feb 14, 2025 •

edited

Loading

kgpai commented Feb 14, 2025

spershin commented Feb 14, 2025

kgpai commented Feb 14, 2025

Inconsistent behavior in json functions #24563

Inconsistent behavior in json functions #24563

Comments

bikramSingh91 commented Feb 14, 2025 • edited Loading

1. json_extract Produces Non-Canonicalized JSON as Output

2. json_extract / _scalar, json_array_get Accepts Invalid JSON to Produce Valid Result

3. json_array_get Returns Invalid JSON When Output is a String Scalar

kgpai commented Feb 14, 2025

spershin commented Feb 14, 2025

kgpai commented Feb 14, 2025

bikramSingh91 commented Feb 14, 2025 •

edited

Loading

1. `json_extract` Produces Non-Canonicalized JSON as Output

2. `json_extract` / `_scalar`, `json_array_get` Accepts Invalid JSON to Produce Valid Result

3. `json_array_get` Returns Invalid JSON When Output is a String Scalar