
List needed changes to implement UDTFs on FROMs #746

Closed
ajnavarro opened this issue Mar 21, 2019 · 7 comments
Labels
research Something that requires research

Comments

ajnavarro commented Mar 21, 2019

Check all the changes that we will need to be able to execute the following query:

SELECT file_path FROM commits as c, FILES(c.repository_id, c.commit_hash)

Internally, the FILES function will behave like a table with mandatory filters (repository_id and commit_hash).

@ajnavarro ajnavarro added the research Something that requires research label Mar 21, 2019
@ajnavarro ajnavarro changed the title List needed changes to implement FROM functions List needed changes to implement UDTFs on FROMs Mar 21, 2019
@kuba-- kuba-- self-assigned this Mar 22, 2019
kuba-- commented Mar 22, 2019

What's the purpose?
Or, in other words, what's the difference compared to having:

SELECT file_path
FROM commits  c
INNER JOIN commit_files cf
ON c.repository_id = cf.repository_id AND c.commit_hash = cf.commit_hash

@kuba-- kuba-- removed their assignment Mar 22, 2019
erizocosmico commented Mar 26, 2019

Function calls in the FROM clause are not valid SQL syntax. Or, at least, not valid MySQL syntax. So we would need to change it in Vitess (which they're not going to accept, because it's not valid MySQL syntax) or manually parse the FROM clauses ourselves if we want this.

Also, what about indexes? What would be the syntax for creating indexes with this?

If FILES is going to behave like a table, why not just use a table?

Is there any other construct in MySQL that behaves like this? If there isn't, why not use the concepts that are already there: the tables? Just like we have ref_commits and so on right now.

What's the advantage of this over tables? We must take into account the disadvantages that this brings to the table (invented concept, invalid syntax, creating index awkwardness, ...).

smola commented Mar 26, 2019

I'll try to clarify the context, since there seem to be no appropriate minutes or extensive docs about this.

The context is the discussion about functions returning multiple rows per input row, also known as table-generating functions (or UDTF, when we're talking about the UDF counterpart). This is something that came up during the discussion about diff (#553), diff-tree (#691) and blame implementation.

We discussed two broad approaches:

  • Table-based
  • Function-based

Tables

Implementing each of these (diff, diff-tree, blame) as a table. This has some major advantages:

  • Relatively simple syntax (or, if not simple, familiar to gitbase users).
  • Easy to implement, both on gitbase and gitbase-spark-connector.
  • Easy to apply same optimizations as with other tables.

While it has some drawbacks:

  • Easy to get wrong by using it standalone (a bit like the files table). This can be worked around by adding validation rules that prevent users from executing some common crazy queries.
  • Hard to extend with parameters. For the use cases at hand, think of rename detection for diff-tree and blame, which is quite common to enable but very expensive to compute, or certain thresholds for the diff algorithm. This kind of parameter could be controlled through session variables, though.

Functions

This is actually not a single approach but a family of them. What they have in common is some function construct that maps one input row to multiple output rows. There are two main approaches: array-based and table-based.

For array based:

  • EXPLODE: this is what Apache Spark implements (see the docs). It is like an aggregation function, but in reverse: it takes an array as input and returns a row per array element.
  • UNNEST: this, in its simple form, is standard SQL (since SQL:1999, I think). UNNEST is a reserved keyword that can be used in FROM to turn a collection value (i.e. an array) into a table. This is supported by PostgreSQL, and in BigQuery with the standard SQL mode enabled. BigQuery supports some additional flexibility here (see the docs).
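To make the array-based semantics concrete, here is a minimal, self-contained Go sketch (the Row type and explode function are hypothetical illustrations, not Spark or go-mysql-server API) of what EXPLODE/UNNEST does: one input row carrying an array column becomes one output row per array element, with the scalar columns repeated.

```go
package main

import "fmt"

// Row is a simplified row: one scalar column plus one array column to explode.
type Row struct {
	CommitHash string
	FilePaths  []string // the array column
}

// explode turns each input row into one output row per array element,
// pairing the scalar column with each element of the array.
func explode(rows []Row) [][2]string {
	var out [][2]string
	for _, r := range rows {
		for _, p := range r.FilePaths {
			out = append(out, [2]string{r.CommitHash, p})
		}
	}
	return out
}

func main() {
	rows := []Row{{CommitHash: "abc123", FilePaths: []string{"a.go", "b.go"}}}
	for _, r := range explode(rows) {
		fmt.Println(r[0], r[1])
	}
}
```

This also makes the memory concern visible: the full FilePaths array for a row exists in memory before it is exploded.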

Then there is also the possibility of using table-generating functions, which return tables directly. This is supported in HiveQL (but not Spark SQL; see the docs), and to some degree in IBM DB2 and Oracle. There's a similar concept in the latest SQL standard (SQL:2016) called polymorphic table functions, currently supported by Oracle.

The array-based methods are still pretty familiar to users of advanced SQL in PostgreSQL, Spark or BigQuery. Depending on how they are implemented, though, they have the shortcoming that they might require accumulating a full array in memory before exploding the results. In any case, they provide some more flexibility for passing parameters.

On the other hand, they can be considerably more complex to implement. I understand it would imply changes to the Vitess and Spark parsers, as well as a few planning rules, and probably also changes to the go-mysql-server interfaces to have common ground between all tables (including this kind of derivative table) so that we can still apply optimizations.

In any case, the latest status of these discussions is that we need further research on what it would take to implement a function-based approach on both gitbase and gitbase-spark-connector.

@erizocosmico erizocosmico self-assigned this Mar 26, 2019
@erizocosmico

For UDTFs to be implemented in gitbase we would need the following things in both go-mysql-server and gitbase.

The biggest workload is in go-mysql-server, along with adapting the squash rule to work with this.

go-mysql-server

Parser

Vitess' parser does not handle functions in the FROM part of a query. Support for this is not likely to be merged upstream in Vitess, because it is not even a thing in MySQL, so we would have to fork Vitess' parser or something similar.

Joins

Assuming we want something like we have on the example query, if we have a cross join like:

FROM commits c, FILES(c.repository_id, c.commit_hash)

Right now, it's transformed into:

CrossJoin
 | - commits
 | - FILES(c.repository_id, c.commit_hash)

Where both branches are absolutely isolated and don't know anything about each other. This is how the semantics of cross join work.

The only place where both branches have access to each other's values is in an ON clause, and even then they only have access to one row at a time.

If we want FILES in this example to access the row in commits we need to change the semantics of cross join (and inner join) to allow this. But more issues arise from this:
Do we allow access from the left side? The right side? What if both sides are UDTFs?

In any case, we would need to generate a set of rows in the UDTF for each row in the other branch.

UDTFs could implement this interface instead of sql.Table:

// GeneratedTable is a table whose rows are generated for a given set of values.
type GeneratedTable interface {
        // RowsForValues returns the rows generated for the given values.
        RowsForValues(ctx *sql.Context, values ...interface{}) (sql.RowIter, error)
        // ... other methods such as Schema, ...
}

Then, the joins would need to special-case whether one of the branches is a generated table and, for each row on the other side, gather the values and get the iterator for those values from the generated table.

It would probably be better to use a rule that replaces CrossJoin and InnerJoin with a node that handles this, instead of adding the special case to the join nodes themselves.
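A rough, self-contained sketch of what such a replacement node could do (toy types throughout; Row, filesTable and generatedTableJoin are hypothetical stand-ins, not the actual go-mysql-server interfaces): for each row of the regular branch, extract the argument columns and expand the rows the generated table produces for those values.

```go
package main

import "fmt"

// Row is a simplified stand-in for sql.Row.
type Row []interface{}

// GeneratedTable is a table whose rows depend on values from the other join branch.
type GeneratedTable interface {
	RowsForValues(values ...interface{}) ([]Row, error)
}

// filesTable is a toy UDTF mapping (repositoryID, commitHash) to file rows.
type filesTable struct{ files map[string][]string }

func (t filesTable) RowsForValues(values ...interface{}) ([]Row, error) {
	key := fmt.Sprintf("%v/%v", values[0], values[1])
	var rows []Row
	for _, f := range t.files[key] {
		rows = append(rows, Row{f})
	}
	return rows, nil
}

// generatedTableJoin replaces a CrossJoin whose right branch is a GeneratedTable:
// for every left row, evaluate the argument columns and append the generated rows.
func generatedTableJoin(left []Row, argCols []int, right GeneratedTable) ([]Row, error) {
	var out []Row
	for _, l := range left {
		args := make([]interface{}, len(argCols))
		for i, c := range argCols {
			args[i] = l[c]
		}
		rrows, err := right.RowsForValues(args...)
		if err != nil {
			return nil, err
		}
		for _, r := range rrows {
			out = append(out, append(append(Row{}, l...), r...))
		}
	}
	return out, nil
}

func main() {
	commits := []Row{{"repo1", "c1"}}
	udtf := filesTable{files: map[string][]string{"repo1/c1": {"a.go", "b.go"}}}
	rows, _ := generatedTableJoin(commits, []int{0, 1}, udtf)
	for _, r := range rows {
		fmt.Println(r)
	}
}
```

Note that this is the nested-loop shape: the generated branch is re-evaluated once per outer row, which is exactly why squashing would matter for performance.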

Caveats

It is intuitive for the user to write FROM commits c, FILES(c.repository_id, c.commit_hash), BLOBS(c.repository_id, c.commit_hash), but this performs a full Cartesian product of all the tables, which may be crazy expensive.

Analyzer

  • Resolver for UDTFs.
  • Rule to translate joins with UDTFs to joins that can actually handle the UDTF.
  • Rule to translate constant UDTFs (SELECT * FROM FILES('hardcoded-repo', 'hardcoded-hash')) to a table (just a wrapper node, shouldn't be hard).
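The constant-UDTF case above can be sketched in a few lines; this is a toy, self-contained Go illustration with hypothetical types (Expr, constantArgs), not the real go-mysql-server analyzer API: if every argument is a literal, the UDTF can be materialized once and wrapped as a static table node.

```go
package main

import "fmt"

// Expr is a toy expression: either a literal value or a column reference.
type Expr struct {
	Literal  interface{}
	IsColumn bool
}

// constantArgs reports whether every argument is a literal and, if so,
// returns the literal values the UDTF can be evaluated with up front.
func constantArgs(args []Expr) ([]interface{}, bool) {
	vals := make([]interface{}, 0, len(args))
	for _, a := range args {
		if a.IsColumn {
			return nil, false
		}
		vals = append(vals, a.Literal)
	}
	return vals, true
}

func main() {
	args := []Expr{{Literal: "hardcoded-repo"}, {Literal: "hardcoded-hash"}}
	if vals, ok := constantArgs(args); ok {
		// Here the rule would call RowsForValues(vals...) once and wrap
		// the result in a plain static table node.
		fmt.Println("materialize once with:", vals)
	}
}
```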

Core

A way to register UDTFs and ensure UDTF names cannot clash with real tables or other UDFs.
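A minimal sketch of such a registry (Registry and its methods are hypothetical names, not existing go-mysql-server API): registration fails if the name is already taken by a table or by another function.

```go
package main

import "fmt"

// Registry tracks table and UDTF names, rejecting clashes in either direction.
type Registry struct {
	tables map[string]bool
	udtfs  map[string]bool
}

// NewRegistry creates a registry pre-populated with the real table names.
func NewRegistry(tableNames ...string) *Registry {
	r := &Registry{tables: map[string]bool{}, udtfs: map[string]bool{}}
	for _, n := range tableNames {
		r.tables[n] = true
	}
	return r
}

// RegisterUDTF fails if the name clashes with a table or an existing UDTF.
func (r *Registry) RegisterUDTF(name string) error {
	if r.tables[name] {
		return fmt.Errorf("UDTF %q clashes with an existing table", name)
	}
	if r.udtfs[name] {
		return fmt.Errorf("UDTF %q is already registered", name)
	}
	r.udtfs[name] = true
	return nil
}

func main() {
	r := NewRegistry("commits", "files")
	fmt.Println(r.RegisterUDTF("blame")) // succeeds
	fmt.Println(r.RegisterUDTF("files")) // clash with the files table
}
```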

gitbase

Squash

Squash works for inner joins, but this pattern would be used with cross joins (because FROM commits c INNER JOIN FILES(c.repository_id, c.commit_hash) ON 1=1 is weird). The current squash rule does not make sense for cross joins, so this would also require processing cross joins and validating them the same way as inner joins, taking into account that there is no ON clause to check.

It would also require new rules to chain these new UDTFs with what's already there, based on the parameters they take.

UDTFs

Implementation of the UDTFs themselves and registration on the server.

Personal opinion on using UDTFs

Disclaimer: I know this was not asked for in the task, but there is no other place right now to put this or any meetings in the near future that I know of about the topic, so I'm adding this here.

While I see why it might be desired to use this instead of just plain tables, there are a lot of objective disadvantages about this approach:

  • Semantics of joins need to be subverted in go-mysql-server just to fit something we want to do in gitbase.
  • Even further deviation from MySQL in go-mysql-server (which already happened with expressions on indexes, but that was a tiny change that didn't really deviate that much).
  • We need to fork and diverge from vitess' parser. Even if it was just this one alone, it's a huge, huge disadvantage. We don't want to be maintaining a SQL parser on our own and take care of merging features and fixes from upstream vitess to our vitess parser.
  • Requires a lot of work for something we can already do with tables, join tables and validation rules that ensure parameters are given if we don't want a files-table effect. Also, most performance issues would be gone once those tables were added to the squash joins rule and only the needed results were generated.
  • Table-generating functions are not supported in Spark SQL, just in HiveQL, which is one more piece to take care of, one more piece to configure and one more place where things can go wrong. And, if our Spark rules don't work with HiveQL, a titanic effort of migrating to it. The alternative is EXPLODE, which will potentially load huge arrays in memory with all that entails, or UNNEST, which will probably need some more work on rules. Plus, commits NATURAL JOIN commit_files NATURAL JOIN files is perhaps easier than commits c, UNNEST(files(c.repository_id, c.commit_hash)).
  • Functions are not self-explanatory, as opposed to tables, which can be explored with SHOW TABLES and DESCRIBE files.
  • You have to remember which arguments the function takes, whereas NATURAL JOIN already does it for you, so it's harder on the end user.
  • Might be unfamiliar to most users.
  • If this does not work and we need to move to tables (which has already happened once), reverting is not trivial because of the huge number of things that would need to be undone. With tables, it would just require removing the new squash rules and tables in gitbase, without modifying anything in go-mysql-server.

And the only advantages I see are:

  • Easier to enforce constraints and configure (except there are no named arguments in SQL, so you will end up with diff(a, b, true, true, 0, false) and have to go figure out what those mean).
  • Only the required data is generated (which is only an advantage if we're not taking squash into account).

kuba-- commented Mar 28, 2019

Personally, I still don't see any benefit to functions here, and I don't want to repeat what Miguel has already written. Functions are much harder to optimize.
For small, in-memory things they can be handy; otherwise the syntax becomes unreadable.
Look at how easy it is to play with the UAST-like functions.

smola commented Mar 28, 2019

Thanks for the analysis @erizocosmico and also for your feedback @kuba--! We'll be settling this probably next week.

smola commented Jun 20, 2019

@ajnavarro I guess this issue can be closed already.
