Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Backslash escape character is always stripped #224

Draft
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

flyingsilverfin
Copy link
Member

@flyingsilverfin flyingsilverfin commented Aug 3, 2022

DO NOT MERGE (see comment)

What is the goal of this PR?

We address various String escaping difficulties by which the user could not get a raw string containing a quote, or a backslash, into TypeDB without also serialising an escape character into the database. This is contrary to the existing database conventions.

We now expect that a user can insert any dataset into TypeDB by applying the following operation to their raw input strings:

escaped_data = raw_data.replace("\\", "\\\\").replace("\"", "\\\"")
insert_query = "insert $x isa person, has data \"" + escaped_data + "\""

This process will replace each \ character with a \\ sequence, and then replace each " with an \" sequence. This will ensure that by the time the query is parsed, the user's raw data is held in memory on the server side.

Example
If raw_data above contains hello " world \ , then the escaped_data will contain hello \" world \\ . After the query is submitted to the server and parsed, the data value held for attribute data will be hello " world \ and this is the value that will be persisted into the database.

What are the changes implemented in this PR?

  • TypeQL establishes \ as the official escape character
  • Align behaviour with database convention: console query strings that include attribute values with escapes, remove those escape characters during parsing.
  • When parsing a String attribute or Regex string, any backslash character is replaced its following character
  • When printing out parsed queries, we escape strings and regexes according to the reverse convention: include a backslash before each existing backslash and double quote character.
  • Previously, we broke user expectations set by other databases: the users escaped " and \, but we always kept the escape character after parsing - as a result the user data plus backslashes ended up stored on disk.

Existing database conventions:

Let's split query languages into two domains: console languages, and programmatic languages.

Programmatic languages include parametrised queries, for example in SQL a query can exclude attribute values, and pass them to the server separately to the parametrised query. This means escaping is never needed (and protects against SQL injection attacks) because a String attribute is passed directly to the server as a string. TypeDB does not yet support parametrised queries. As a result, we must always construct query strings that are parsed by the parser in full. Neo4j and MongoDB have their own parametrised input mechanisms that behave similarly.

Console languages are languages whose queries must be inputted entirely as strings. All TypeQL queries are provided as strings, so let's define them as console languages. Console languages all have an escape mechanism to input quoting into string attribute values. SQL and Neo4j have console languages available that use different conventions:

SQL uses the quoting character itself as the escape character:

select 'it''s escaped'

This will be parsed as it's escaped and utilised in the query exection. The escaping ' is removed.

Neo4J uses backslashes:

CREATE (n:Node {name:"hello \" world \\ "}) 
RETURN n.name

n.name
hello " world \

In this case, the parsed value removes the escape character \ in both usages.

We can see that existing console languages use the convention of stripping the escape characters from the input, which TypeQL now matches.

Copy link
Member

@haikalpribadi haikalpribadi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't forget to add BDD tests.

@flyingsilverfin
Copy link
Member Author

flyingsilverfin commented Aug 3, 2022

DO NOT MERGE
We have decided to park this change since it practically breaks backward compatibility of TypeDB: user's that did string escape previously have databases containing quoted \" value \" inside. However, in the new convention we would store quoted " value " inside.

To avoid forcing users to re-load, we will save merging this until we have to break backward compatibility for other reasons.

Changes required when merging:

  • The export-import should on import detect if importing from an older TypeDB version. If so, on import replace the \" with " to bring behaviour in line with the new parsing.
  • Add BDD scenario to test escaping works as expected end to end:
  Scenario: inserting strings including escapes are stored without escaping
    When typeql insert
      """
      insert
        $x isa name; $x "Normal";
        $y isa name; $y "With quote \" character";
        $z isa name; $z "With backslash \\ character";
      """
    Then transaction commits

    When session opens transaction of type: read
    When get answers of typeql match
      """
      match $x isa name;
      """
    Then uniquely identify answer concepts
      | x                                     |
      | value:name:Normal                     |
      | value:name:With quote " character     |
      | value:name:With backslash \ character |

    When get answers of typeql match
      """
      match $x "Normal" isa name;
      """
    Then answer size is: 1

    When get answers of typeql match
      """
      match $x "With quote \" character" isa name;
      """
    Then answer size is: 1

    When get answers of typeql match
      """
      match $x "With backslash \\ character" isa name;
      """
    Then answer size is: 1

@brettforbes
Copy link

I am writing a hard coded workaround to this issue atm, as shown here https://forum.vaticle.com/t/escape-hell-handling-string-values/166/8

@flyingsilverfin
Copy link
Member Author

We have further considerations for when this issue comes back:
#239 fixed this for now, but have to settle on a behaviour of how to actually insert a newline character (not the two characters "\n").

One proposal is that the parser should parse "\n" pair of raw text characters as one newline character, which is then serialisable to the server. In client-java, we have to use an unformatted-toString() to make sure that we don't introduce new tabs into strings with newlines in them with this solution (alternatively, we can introduce templated queries and serialise attribute values separately). On printing of an attribute value, we would print a newline character as a pair of "\n" characters, making this operation fully reversible. If the user wishes to write a pair of "\n" characters explicitly, they would write "\n". One interesting edge case here is that writing an actual newline (eg. "enter" key) in console leaves another way of getting a newline character into the database - perhaps this one should be banned?

@brettforbes
Copy link

hmm, so your suggestion would work, but currently it works anyway, just using a substitution approach as shown here https://forum.vaticle.com/t/escape-hell-handling-string-values/166/12

It just took me some time to work it out, and its the kind of thing that would be useful in documentation , for the less talented people like myself (i.e. why i posted the fix in the discussion forum)

so certainly this should be way down the list of priorities, and I would need to change that code when you make the change

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants