Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Database Access RFC #161

Draft
wants to merge 3 commits into
base: main
Choose a base branch
from
Draft

Database Access RFC #161

wants to merge 3 commits into from

Conversation

dfellis
Copy link
Member

@dfellis dfellis commented Aug 3, 2023

No description provided.

@dfellis dfellis self-assigned this Aug 3, 2023
@dfellis dfellis requested review from depombo and aguillenv August 3, 2023 02:45
Comment on lines +76 to +83
```md
# schema {name} for {dbname}
column 1, column 2, column 3
example 1, example 2, example 3
example 4, example 5, example 6
## index [{name} columns] {column name 1, column name 2}
## join {column name} = {schema name}.{column name}
```
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this an usage example? I think I do not fully follow the ## index and ## join syntax

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So it's sort of an unholy mixture of SQL and Markdown. Declare an index, optionally name it, if you name it include the columns keyword and then put the comma separated column names after that.

Similarly for declaring a join column. It's not as flexible as real SQL and this is a negative of this approach because any feature in any database we support may be requested by a user to be added and further bulk up the complexity of this syntax.

rfcs/002 - Database Access.md Outdated Show resolved Hide resolved
Just alternatives right now as we dig into the possibilities here. The solution ideally should:

1. Declare what database will be used by the application, whether a local sqlite file, a remote Postgres/MySQL/etc server, or perhaps something more esoteric like MongoDB, Redis, etc.
2. Declare the connection is read-only vs read-write, so it *never* tries to mutate a read-only database connection.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe this is a complexity we do not need to address now? we might rely on the db user and roles management and the credentials given to connect with marsha

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, I agree that it would be a later feature that may not be implemented for a while, but I don't think relying on credential permissions is the right approach here: how often were you given a separate set of read-only credentials for a database in a company when you were allowed write access to said database?

Basically there can be times when you want to reduce the "blast radius" but had to do it by being careful instead of relying on the DBA to be responsive :) This is basically telling Marsha to be careful. ;)


#### Embedded syntax

Databases would still be declared top-level, and the `# schema` type would also exist, but queries would be in a new `## query` subsection for a function, instead. This would have to be after the description section but before the function examples, so the query would be defined *after* it's "use" in the description, which might be a bit weird, but if defined before the description we would need a `## description` sub-heading to make it clear again.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, not sure if adding complexity to the function definitions makes sense

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is a path to consider, though, since it reduces the number of top-level things to learn.


But for something like the `# references` block to make any sense, it also needs to *modify* the behavior of the `# func` block to update it with the documents the function description references, if any. This implies that it may be a better idea for the extensions to decide to either replace or extend the existing behavior of `# something` blocks they have referenced. The current behavior that parses the `# func` blocks into more verbose markdown would then have that extended markdown intercepted by the references extension and modified with the references used by the function.

Similarly, the current type-insertion behavior for `# func` blocks that specify an explicit type could be turned into an extension interception behavior in that way, simplifying and segregating those different concerns, likely with near-zero latency impact (the function could not be parsed by the LLM until the dependent type is turned into a Python `class`, but that is true today. It would just insert an extra function call in the Markdown parse and re-generation, afaict.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, not sure if I follow this, but the classes and the functions are generated in the same call, there's no previous call to generate classes and then another one for functions, is all in one.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I was reading through it more closely this evening and noticed that, however that doesn't mean it can't be broken into separate paths to do just a text transform, then merged together and then fed to the LLM as separate event handlers so we can add new # blocks independently.


LLM abilities probably haven't plateaued yet, and I see value in both the Fully-separated syntax and the References syntax -- which is desired may depend on personal preference more than anything, but needing to choose one over the other may limit the future of Marsha in undesirable ways. At least while we're still figuring out what to leave to the LLM, what to guide the LLM with, and what to explicitly require the user to provide, it could be useful to have users specify which manipulations of the Markdown syntax they even want to have with special interpreter directives defined at the beginning of the file, kinda like [Raku](https://www.raku.org/) or [Racket](https://racket-lang.org/).

We could define a `# using {extension 1} [, {extension 2} ...]` header that *must* be the first if present, specifying what parsing extensions are to be used on the text after it. We could even potentially put *all* functionality behind different extensions to give us the flexibility to make breaking changes to things like functions without breaking any code still using the older function behavior, but that could be a wonky barrier to entry for people who are not programmers.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, when I started reading it my thought was that las sentence, but thinking about it, while we get a "final" solution users that are not programmers either are not going to use these "advance modes" so often or they have not discovered marsha yet?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While we're actively developing on it during this alpha state I think that's fine, but I do think a default "base" extension that's automatically used and extensions long-term are just development artifacts and/or for esoteric purposes in the future. I think I mention something like that in the following paragraph.

```md
# func get_users(): list of user objects

This function gets all user records from [the database][1] and returns them, or fails if unable to connect to the database.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems a bit tricky what substitution would need to be done here depending on each case for the final prompt for the LLM to know what to do.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, so I was thinking it would lightly transform the [the database][1] into [the database](#the-database) where #the-database is whatever the title is of the reference after it is turned into a markdown document itself following this convention

The exact transform would depend on the extension and how it deals with the URL (https should be queried and maybe it drops part of the <body> into the document if it's an HTML document, while a file path that ends in .sqlite reads that file with the sqlite3 standard library and provides a summary of the schema).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants