Replies: 5 comments 1 reply
-
Thank you for the writing. Yes, the design makes sense overall. And yes, we should avoid writing our own parser. I have something in my imagination:

# For example, given the query
SELECT * FROM user WHERE id = 1 AND name LIKE '%john%'

# We can represent it as:
TABLE SOURCE user A
FILTER A WHERE id = 1
FILTER A WHERE name LIKE '%john%'
OUTPUT * FROM A

# If the AND becomes OR, we'd have:
TABLE SOURCE user A
FILTER A WHERE id = 1
TABLE SOURCE user B
FILTER B WHERE name LIKE '%john%'
UNION A, B AS C
OUTPUT * FROM C
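To make the idea concrete, here is a hypothetical Rust sketch of how such a linear instruction set could be modeled. All names here (`Instr`, `lower_or_query`, the string-typed registers and predicates) are my own illustration, not an existing design:

```rust
// Hypothetical register-based IR. Each instruction reads/writes named
// "registers" that hold intermediate row sets.
#[derive(Debug, Clone, PartialEq)]
enum Instr {
    // TABLE SOURCE user A  -> load table "user" into register "A"
    TableSource { table: String, reg: String },
    // FILTER A WHERE <predicate>  -> keep only matching rows in "A"
    Filter { reg: String, predicate: String },
    // UNION A, B AS C  -> merge two registers into a third
    Union { left: String, right: String, out: String },
    // OUTPUT * FROM C  -> produce the final result set
    Output { reg: String },
}

// Encodes: SELECT * FROM user WHERE id = 1 OR name LIKE '%john%'
fn lower_or_query() -> Vec<Instr> {
    vec![
        Instr::TableSource { table: "user".into(), reg: "A".into() },
        Instr::Filter { reg: "A".into(), predicate: "id = 1".into() },
        Instr::TableSource { table: "user".into(), reg: "B".into() },
        Instr::Filter { reg: "B".into(), predicate: "name LIKE '%john%'".into() },
        Instr::Union { left: "A".into(), right: "B".into(), out: "C".into() },
        Instr::Output { reg: "C".into() },
    ]
}
```

An executor would then be a simple loop over the `Vec<Instr>`, which is what makes each instruction easy to unit test in isolation.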
-
I honestly think that the work you did on https://github.com/Samyak2/gopy is highly relevant
-
I have been thinking about the executor; here's one way I think it can be done. Apologies for the late response, I had been busy wrapping up my college work.

Executor

A register-based executor which runs on a linear Intermediate Code/IR, similar to your example.

Examples

Questions

Data structures

A question
-
I guess the registers required for generating the intermediate code would be minimal, and we can use a HashMap for that purpose. It can be dropped (when out of scope) after the generation of the intermediate code. I think having a linear representation of the original query keeps the execution engine simple, and it's a good starting point. Of course, we can optimize the execution by parallelizing operations like multiple OR filters.
-
Yes, as you said, a program has to be executed linearly in the end, although there also exists an alternative where we walk the AST and execute along the way. The question is whether we want to have an intermediate representation. I am more inclined toward having a bytecode-like representation, which is clear, concise, and, more importantly, lets us effectively unit test every instruction. FYI, SQLite also has a VM + bytecode https://www.sqlite.org/opcode.html but in my opinion that is too low-level and performance-oriented. For example, I don't like the concept of a cursor, or that the machine does not prefer random access (granted, disk read/write was linear and seek was expensive at the time SQLite was designed). But then, it is also an art to design an elegant instruction set.
-
Below is an initial version of the design document for a new SQL interpreter to be used for testing SeaORM.
Design
I’m using the temporary name “shark-lite” to refer to the in-memory database which is being described here.
Execution Flow
All statements executed on shark-lite are performed on an instance of it. All data added to an instance is gone forever when it's dropped.
The input to the execute function is an arbitrary SQL statement provided as a string. The input SQL is first tokenized into a sequence of individual tokens and then parsed according to the grammar rules into an Abstract Syntax Tree (AST). The AST is then executed directly - there is no need for another intermediate representation, as the AST is high-level enough here. The diagram below illustrates this flow:
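The tokenize -> parse -> execute pipeline can be sketched in Rust as follows. This is a deliberately tiny illustration under my own assumptions - the token and AST types, and the single `SELECT * FROM <table>` form it handles, are hypothetical and much simpler than a real grammar:

```rust
// Hypothetical three-stage pipeline: tokenize -> parse -> execute.
#[derive(Debug, PartialEq)]
enum Token { Select, Star, From, Ident(String) }

#[derive(Debug)]
enum Ast { Select { table: String } }

// Stage 1: split the raw SQL string into tokens.
fn tokenize(sql: &str) -> Vec<Token> {
    sql.split_whitespace()
        .map(|w| match w.to_uppercase().as_str() {
            "SELECT" => Token::Select,
            "*" => Token::Star,
            "FROM" => Token::From,
            _ => Token::Ident(w.to_string()),
        })
        .collect()
}

// Stage 2: parse the token sequence into an AST.
// Only recognizes "SELECT * FROM <table>" for illustration.
fn parse(tokens: &[Token]) -> Option<Ast> {
    match tokens {
        [Token::Select, Token::Star, Token::From, Token::Ident(t)] => {
            Some(Ast::Select { table: t.clone() })
        }
        _ => None,
    }
}

// Stage 3: walk the AST directly, with no further lowering.
// A real executor would scan the instance's tables here.
fn execute(ast: &Ast) -> String {
    match ast {
        Ast::Select { table } => format!("scan all rows of `{table}`"),
    }
}
```

The point of the sketch is the shape of the flow, not the implementation: each stage consumes the previous stage's output, and execution walks the AST with no extra IR in between.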
Data Structures
An instance of SharkLite can contain many databases, each of which can contain many schemas (default is “main” if not specified to be compatible with SQLite), each of which contains many tables. A table has metadata about its columns and the actual data as rows. This represents a complete picture of data stored in a database - not including stored procedures, triggers, users, etc.
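The containment hierarchy described above might look like the following in Rust. This is a minimal sketch under stated assumptions - the field layout, the string-typed cells, and the `"default"` database name are placeholders of mine, not decided parts of the design:

```rust
// Hypothetical containment hierarchy:
// instance -> databases -> schemas -> tables -> (column metadata + rows).
use std::collections::HashMap;

struct Column { name: String, sql_type: String }

struct Table {
    columns: Vec<Column>,      // metadata about the table's columns
    rows: Vec<Vec<String>>,    // placeholder cell type; a Value enum in practice
}

struct Schema { tables: HashMap<String, Table> }
struct Database { schemas: HashMap<String, Schema> }
struct SharkLite { databases: HashMap<String, Database> }

impl SharkLite {
    fn new() -> Self {
        // Start with a single database holding the default "main" schema,
        // mirroring SQLite's default schema name as described above.
        let mut schemas = HashMap::new();
        schemas.insert("main".to_string(), Schema { tables: HashMap::new() });
        let mut databases = HashMap::new();
        databases.insert("default".to_string(), Database { schemas });
        SharkLite { databases }
    }
}
```

Since the instance owns everything, dropping the `SharkLite` value drops all of its data, which matches the lifetime semantics described under Execution Flow.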
Along with the execute function, which can run arbitrary SQL and mutate the DB, each of these structures will provide methods to create and modify its contents. For example, a database has methods to create, update and delete tables.
Supported subset of SQL
Using SQLite as a reference, I have selected a subset of the SQL grammar to implement in shark-lite. The implementation of each is divided into phases.
Implementation
The SQL parser will be written using pest, a parser generator library in Rust. Defining our SQL grammar using PEG will make it easy to extend over time and avoid accepting unsupported SQL (which an existing general-purpose SQL parser might accept).
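For a sense of what this looks like, here is an illustrative PEG fragment in pest's grammar syntax. The rule names and coverage are my own guess at a starting point, not the actual shark-lite grammar:

```pest
// Illustrative fragment only, not the real grammar.
WHITESPACE  = _{ " " | "\t" | "\r" | "\n" }
ident       = @{ ASCII_ALPHA ~ (ASCII_ALPHANUMERIC | "_")* }
result_cols = { "*" | (ident ~ ("," ~ ident)*) }
select_stmt = { ^"SELECT" ~ result_cols ~ ^"FROM" ~ ident }
```

Because only the rules we write exist, any SQL outside the supported subset fails at parse time rather than reaching the executor, which is exactly the property we want from not reusing a full SQL parser.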
Each value of a row is represented as an enum. Each variant of the enum describes a type in SQL - NULL, INTEGER, etc. The Value enum will have methods for casting, extracting types, operators, etc. The Row itself will be a variable-sized Vec of Values.
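A minimal sketch of such a Value enum, assuming variants modeled on SQLite's storage classes; the variant set and the `as_integer` method are illustrative choices of mine:

```rust
// Hypothetical cell value type; variants mirror SQLite's storage classes.
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Null,
    Integer(i64),
    Real(f64),
    Text(String),
    Blob(Vec<u8>),
}

impl Value {
    // Example of a casting method: try to view the value as an INTEGER.
    fn as_integer(&self) -> Option<i64> {
        match self {
            Value::Integer(i) => Some(*i),
            Value::Real(f) => Some(*f as i64),          // truncating cast
            Value::Text(s) => s.trim().parse().ok(),    // "42" -> 42
            _ => None,                                  // NULL and BLOB don't cast
        }
    }
}

// A row is just a variable-length vector of values.
type Row = Vec<Value>;
```

Operators and type extraction would follow the same pattern: a method on `Value` that matches on the variant and returns an `Option` or `Result`, so SQL's NULL-propagating semantics fall out naturally.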
Good to haves
Shark-lite will also come with helpers for testing such as:
Improvements and Feedback
Reviewer/Mentors
@tyt2y3 @billy1624
Original idea from GSoC list: https://github.com/SeaQL/summer-of-code/tree/main/2022#5-a-sql-interpreter-primarily-intended-for-mock-testing