Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Inconsistent behavior in json functions #24563

Open
bikramSingh91 opened this issue Feb 14, 2025 · 3 comments
Open

Inconsistent behavior in json functions #24563

bikramSingh91 opened this issue Feb 14, 2025 · 3 comments
Labels

Comments

@bikramSingh91
Copy link
Contributor

bikramSingh91 commented Feb 14, 2025

While working on implementing JSON functions in Velox for Presto-CPP, I came across some inconsistencies that I wanted to highlight and get the community's opinion on whether it makes sense for us to fix them, either both in Presto-Java and Presto-CPP or just in Presto-CPP.

1. json_extract Produces Non-Canonicalized JSON as Output

We expect the JSON type to be canonicalized whenever it exists in a query. Since Presto does not store JSON as a type that can be written to files, this canonicalization is usually done when a VARCHAR column is converted to JSON using one of the JSON functions (typically json_parse). This is important because it can affect the equality of two JSON objects and provide inconsistent results if sometimes the JSONs are canonicalized and sometimes they are not.

For example, suppose the input string is:

{ "key_2": 2, "key_3": 3, "key_1": 1 }

json_parse will canonicalize this to:

presto:default> select JSON_PARSE('{ "key_2": 2, "key_3": 3, "key_1": 1 }');
_col0
---------------------------------
{"key_1":1,"key_2":2,"key_3":3}

NOTE: canonicalization puts the keys in ascending order and removes spaces.

Now, if there are two inputs:

{ "key_2": 2, "key_3": 3, "key_1": 1 }
{ "key_1": 1, "key_2": 2, "key_3": 3 }

These should be considered equal JSONs when parsed and compared; otherwise, this can result in correctness issues. If we use json_extract, which returns a JSON object type, it would extract and return a non-canonicalized JSON:

presto:default> select JSON_EXTRACT('{ "key_2": 2, "key_3": 3, "key_1": 1 }', '$');
_col0
---------------------------------
{"key_2":2,"key_3":3,"key_1":1}

NOTE: Spaces are removed, but keys remain untouched.

The same issue occurs in json_array_get:

presto:default> select json_array_get('[{ "key_2": 2, "key_1": 1 }]', 0);
_col0
-----------------------
{"key_2":2,"key_1":1}

Example of a possible correctness issue:
A group by on the json column returning wrong results

SELECT
    ("json_extract"((output_fields), '$.key')) c0
FROM (
    SELECT
        *
    FROM (
        VALUES
            ('{"key": {"inner_1":"1", "inner_2":"2"}'),
            ('{"key": {"inner_2":"2", "inner_1":"1"}}')
    ) AS t (output_fields)
)
GROUP BY
    1;

              c0               
-------------------------------
 {"inner_1":"1","inner_2":"2"} 
 {"inner_2":"2","inner_1":"1"} 

Recommendation: The expectation is that once a VARCHAR is converted to JSON, it should be in a canonicalized form. Therefore, we should always canonicalize the resultant JSON in all functions that take VARCHAR as prospective JSON input and output a JSON.

2. json_extract / _scalar, json_array_get Accepts Invalid JSON to Produce Valid Result

If the JSON is invalid, it will still produce valid output if and only if the target key/index specified in the path comes before the part where the JSON becomes invalid.

presto:default> select JSON_EXTRACT('{ "key_2": 2, "key_1": "z"a1" }', '$.key_2');
_col0
-------
2
presto:default> select JSON_EXTRACT('{ "key_2": 2, "key_1": "z"a1" }', '$.key_1');
_col0
-------
"z"

NOTE: "z"a1" is the invalid part. If you json_parse the above string, it will throw an error.

The same issue occurs with json_array_get:

presto:default> select json_array_get('[{ "key_2": 2, "key_1": 1 }, {"z"aa"}]', 0);
_col0
-----------------------
{"key_2":2,"key_1":1}

Why is this a problem?

  • This makes the behavior unpredictable in the face of invalid JSON.
  • Since json_extract does not canonicalize the input, this can return a valid result in one input and invalid in another, even if the two inputs are considered equal when canonicalized.
  • Moreover, json_extract seems to have inconsistency depending on the path. For example, Jayway treats $. as optional in the beginning, therefore $.key_2 should be the same as key_2. However, as we'll see, it returns different values when the JSON is invalid.
presto:default> select JSON_EXTRACT('{ "key_2": 2, "key_1": "z"a1" }', '$.key_2');
_col0
-------
2
presto:default> select JSON_EXTRACT('{ "key_2": 2, "key_1": "z"a1" }', 'key_2');
_col0
-------
NULL

NOTE: Path with $.key_2 returns a value, but key_2 returns NULL.

Recommendation: If we decide to move ahead with canonicalization of all functions that produce JSON from VARCHAR (as discussed in # 1), then we can catch invalid JSONs in that step and always return null.

3. json_array_get Returns Invalid JSON When Output is a String Scalar

Since we are already on this topic and talking about json_array_get, I thought this would be a good time to discuss this known problem in json_array_get (this is already highlighted in Presto docs, see [Presto Documentation](https://prestodb.io/docs/current/functions/json.html#json_array_get-json_array-index-json)).

SELECT json_array_get('["a", [3, 9], "c"]', 0); -- JSON 'a' < == (invalid JSON)

As we decide to add support for this in Presto-CPP (Velox), I was wondering if there are any objections to ensuring we support the correct behavior, that is, a string scalar is correctly enclosed in quotes:

SELECT json_array_get('["a", [3, 9], "c"]', 0); -- JSON '"a"' < == (valid JSON)

cc: @amitkdutta @spershin @kgpai @kevinwilfong @Yuhta @kaikalur @feilong-liu @rschlussel @tdcmeehan @aditi-pandit

@kgpai
Copy link
Contributor

kgpai commented Feb 14, 2025

@tdcmeehan This is a major correctness hole with json_extract right ?

json_extract:

 SELECT
    y,
    COUNT(1)
FROM (
    SELECT
        JSON_EXTRACT(x, '$') AS y
    FROM (
        VALUES
            ('{ "key_2": 2, "key_3": 3, "key_1": 1 }'),
            ('{"key_3": 3, "key_2": 2, "key_1": 1 }')
    ) AS t(x)
)
GROUP BY
    1

------- Returns

{"key_2":2,"key_3":3,"key_1":1} | 1
{"key_3":3,"key_2":2,"key_1":1} | 1

compared to json_parse:

SELECT
    y,
    COUNT(1)
FROM (
    SELECT
        JSON_PARSE(x) AS y
    FROM (
        VALUES
            ('{ "key_2": 2, "key_3": 3, "key_1": 1 }'),
            ('{"key_3": 3, "key_2": 2, "key_1": 1 }')
    ) AS t(x)
)
GROUP BY
    1

---- Returns


{"key_1":1,"key_2":2,"key_3":3} | 2

@spershin
Copy link
Contributor

Ideally, as everything eventually moves to Prestissimo and it will behave 'canonically', then I see no reason of not fixing Presto Java.
Depends how hard it is to achieve.

Can it be achieved by making a canonicalize() call before returning the result?
This would introduce a performance regression, which we could overcome by wrapping it into a config property, which we would only turn on for verification purposes and it can also be turned on by whoever wants to opt in into this behavior.

@kgpai
Copy link
Contributor

kgpai commented Feb 14, 2025

as everything eventually moves to Prestissimo and it will behave 'canonically', then I see no reason of not fixing Presto Java.

Hi Sergey, maybe I misunderstand, but you are saying if everything is moving to prestissimo, then we shouldnt have to fix Presto Java right ? (It reads the other way).

This would introduce a performance regression,

This should take back seat to correctness in my view, but agree with having a config property.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
Status: 🆕 Unprioritized
Development

No branches or pull requests

3 participants