Replies: 5 comments 1 reply
-
Thank you for the writing. Yes, the design makes sense overall. And yes, we should avoid writing our own parser. I have something in my imagination:

# For example, given the query
SELECT * FROM user WHERE id = 1 AND name LIKE '%john%'

# We can represent it as:
TABLE SOURCE user A
FILTER A WHERE id = 1
FILTER A WHERE name LIKE '%john%'
OUTPUT * FROM A

# If the AND becomes OR, we'd have:
TABLE SOURCE user A
FILTER A WHERE id = 1
TABLE SOURCE user B
FILTER B WHERE name LIKE '%john%'
UNION A, B AS C
OUTPUT * FROM C
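To make the idea concrete, here is a hypothetical Rust sketch of how such a linear instruction set could be modeled. All names here (`Instr`, `lower_or_query`, the string-typed registers and predicates) are my own illustration, not an existing design:

```rust
// Hypothetical register-based IR. Each instruction reads/writes named
// "registers" that hold intermediate row sets.
#[derive(Debug, Clone, PartialEq)]
enum Instr {
    // TABLE SOURCE user A  -> load table "user" into register "A"
    TableSource { table: String, reg: String },
    // FILTER A WHERE <predicate>  -> keep only matching rows in "A"
    Filter { reg: String, predicate: String },
    // UNION A, B AS C  -> merge two registers into a third
    Union { left: String, right: String, out: String },
    // OUTPUT * FROM C  -> produce the final result set
    Output { reg: String },
}

// Encodes: SELECT * FROM user WHERE id = 1 OR name LIKE '%john%'
fn lower_or_query() -> Vec<Instr> {
    vec![
        Instr::TableSource { table: "user".into(), reg: "A".into() },
        Instr::Filter { reg: "A".into(), predicate: "id = 1".into() },
        Instr::TableSource { table: "user".into(), reg: "B".into() },
        Instr::Filter { reg: "B".into(), predicate: "name LIKE '%john%'".into() },
        Instr::Union { left: "A".into(), right: "B".into(), out: "C".into() },
        Instr::Output { reg: "C".into() },
    ]
}
```

An executor would then be a simple loop over the `Vec<Instr>`, which is what makes each instruction easy to unit test in isolation.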
-
I honestly think that the work you did on https://github.com/Samyak2/gopy is highly relevant
-
I have been thinking about the executor; here's one way I think it can be done. Apologies for the late response, I had been busy wrapping up my college work.

Executor

A register-based executor which runs on a linear Intermediate Code/IR, similar to your example.

Examples

Questions

Data structures

A question
-
I guess the registers required for generating the intermediate code would be minimal, and we can use a HashMap for that purpose. It can be dropped (when out of scope) after the generation of the intermediate code. I think having a linear representation of the original query keeps the execution engine simple, and it's a good starting point. Of course, we can optimize the execution by parallelizing operations like multiple OR filters.
-
Yes, as you said, a program has to be executed linearly in the end, although there also exists an alternative where we walk the AST and execute along the way. The question is whether we want to have an intermediate representation. I am more inclined toward having a bytecode-like representation, which is clear, concise, and, more importantly, lets us effectively unit test every instruction. FYI, SQLite also has a VM + bytecode https://www.sqlite.org/opcode.html but in my opinion that is too low-level and performance-oriented. For example, I don't like the concept of a cursor, or that the machine does not prefer random access (granted, disk read/write was linear and seek was expensive at the time SQLite was designed). But then, it is also an art to design an elegant instruction set.
-
Below is an initial version of the design document for a new SQL interpreter to be used for testing SeaORM.
Design
I’m using the temporary name “shark-lite” to refer to the in-memory database which is being described here.
Execution Flow
All statements executed on shark-lite are performed on an instance of it. All data added to an instance is gone forever when it's dropped.
The input to the execute function is an arbitrary SQL statement provided as a string. The input SQL is first tokenized into a sequence of individual tokens and then parsed according to the grammar rules into an Abstract Syntax Tree (AST). The AST is then executed directly - there is no need for another intermediate representation, as the AST is high-level enough here. The diagram below illustrates this flow:
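The tokenize -> parse -> execute pipeline can be sketched in Rust as follows. This is a deliberately tiny illustration under my own assumptions - the token and AST types, and the single `SELECT * FROM <table>` form it handles, are hypothetical and much simpler than a real grammar:

```rust
// Hypothetical three-stage pipeline: tokenize -> parse -> execute.
#[derive(Debug, PartialEq)]
enum Token { Select, Star, From, Ident(String) }

#[derive(Debug)]
enum Ast { Select { table: String } }

// Stage 1: split the raw SQL string into tokens.
fn tokenize(sql: &str) -> Vec<Token> {
    sql.split_whitespace()
        .map(|w| match w.to_uppercase().as_str() {
            "SELECT" => Token::Select,
            "*" => Token::Star,
            "FROM" => Token::From,
            _ => Token::Ident(w.to_string()),
        })
        .collect()
}

// Stage 2: parse the token sequence into an AST.
// Only recognizes "SELECT * FROM <table>" for illustration.
fn parse(tokens: &[Token]) -> Option<Ast> {
    match tokens {
        [Token::Select, Token::Star, Token::From, Token::Ident(t)] => {
            Some(Ast::Select { table: t.clone() })
        }
        _ => None,
    }
}

// Stage 3: walk the AST directly, with no further lowering.
// A real executor would scan the instance's tables here.
fn execute(ast: &Ast) -> String {
    match ast {
        Ast::Select { table } => format!("scan all rows of `{table}`"),
    }
}
```

The point of the sketch is the shape of the flow, not the implementation: each stage consumes the previous stage's output, and execution walks the AST with no extra IR in between.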
Data Structures
An instance of SharkLite can contain many databases, each of which can contain many schemas (default is “main” if not specified to be compatible with SQLite), each of which contains many tables. A table has metadata about its columns and the actual data as rows. This represents a complete picture of data stored in a database - not including stored procedures, triggers, users, etc.
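The containment hierarchy described above might look like the following in Rust. This is a minimal sketch under stated assumptions - the field layout, the string-typed cells, and the `"default"` database name are placeholders of mine, not decided parts of the design:

```rust
// Hypothetical containment hierarchy:
// instance -> databases -> schemas -> tables -> (column metadata + rows).
use std::collections::HashMap;

struct Column { name: String, sql_type: String }

struct Table {
    columns: Vec<Column>,      // metadata about the table's columns
    rows: Vec<Vec<String>>,    // placeholder cell type; a Value enum in practice
}

struct Schema { tables: HashMap<String, Table> }
struct Database { schemas: HashMap<String, Schema> }
struct SharkLite { databases: HashMap<String, Database> }

impl SharkLite {
    fn new() -> Self {
        // Start with a single database holding the default "main" schema,
        // mirroring SQLite's default schema name as described above.
        let mut schemas = HashMap::new();
        schemas.insert("main".to_string(), Schema { tables: HashMap::new() });
        let mut databases = HashMap::new();
        databases.insert("default".to_string(), Database { schemas });
        SharkLite { databases }
    }
}
```

Since the instance owns everything, dropping the `SharkLite` value drops all of its data, which matches the lifetime semantics described under Execution Flow.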
Along with the execute function, which can run arbitrary SQL and mutate the DB, each of these structures will provide methods to create and modify its contents. For example, a database has methods to create, update and delete tables.
Supported subset of SQL
Using SQLite as a reference, I have selected a subset of the SQL grammar to implement in shark-lite. The implementation of each is divided into phases.
Implementation
The SQL parser will be written using pest, a parser generator library in Rust. Defining our SQL grammar using PEG will make it easy to extend over time and avoid accepting unsupported SQL (which an existing general-purpose SQL parser might accept).
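For a sense of what this looks like, here is an illustrative PEG fragment in pest's grammar syntax. The rule names and coverage are my own guess at a starting point, not the actual shark-lite grammar:

```pest
// Illustrative fragment only, not the real grammar.
WHITESPACE  = _{ " " | "\t" | "\r" | "\n" }
ident       = @{ ASCII_ALPHA ~ (ASCII_ALPHANUMERIC | "_")* }
result_cols = { "*" | (ident ~ ("," ~ ident)*) }
select_stmt = { ^"SELECT" ~ result_cols ~ ^"FROM" ~ ident }
```

Because only the rules we write exist, any SQL outside the supported subset fails at parse time rather than reaching the executor, which is exactly the property we want from not reusing a full SQL parser.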
Each value of a row is represented as an enum. Each variant of the enum describes a type in SQL - NULL, INTEGER, etc. The Value enum will have methods for casting, extracting types, operators, etc. The Row itself will be a variable-sized Vec of Values.
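A minimal sketch of such a Value enum, assuming variants modeled on SQLite's storage classes; the variant set and the `as_integer` method are illustrative choices of mine:

```rust
// Hypothetical cell value type; variants mirror SQLite's storage classes.
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Null,
    Integer(i64),
    Real(f64),
    Text(String),
    Blob(Vec<u8>),
}

impl Value {
    // Example of a casting method: try to view the value as an INTEGER.
    fn as_integer(&self) -> Option<i64> {
        match self {
            Value::Integer(i) => Some(*i),
            Value::Real(f) => Some(*f as i64),          // truncating cast
            Value::Text(s) => s.trim().parse().ok(),    // "42" -> 42
            _ => None,                                  // NULL and BLOB don't cast
        }
    }
}

// A row is just a variable-length vector of values.
type Row = Vec<Value>;
```

Operators and type extraction would follow the same pattern: a method on `Value` that matches on the variant and returns an `Option` or `Result`, so SQL's NULL-propagating semantics fall out naturally.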
Good to haves
Shark-lite will also come with helpers for testing such as:
Improvements and Feedback
Reviewer/Mentors
@tyt2y3 @billy1624
Original idea from GSoC list: https://github.com/SeaQL/summer-of-code/tree/main/2022#5-a-sql-interpreter-primarily-intended-for-mock-testing