
Add request.spec to db-root-spec #1794

Merged 3 commits on May 30, 2021

Conversation

@steve-chavez (Member) commented Apr 7, 2021

Aims to close #1731. Partially addresses #1698.

When db-root-spec is used and the root endpoint (/) is requested, this will add a request.spec GUC for that transaction.

The request.spec GUC is the PostgREST schema cache in plain json (not OpenAPI); it contains the relationships between tables/views we infer, among other things.

Example usage:

create or replace function root_override() returns json as $$
  select current_setting('request.spec', true)::json;
$$ language sql;

-- on the config file, add:
-- db-root-spec = "root_override"

Doing a GET / will return the following big json (this one uses our test suite fixtures); the dbRelations key contains the relationships, and relType is the cardinality.

Requesting root_override is done as a regular RPC: the function can have parameters, return any type, filters can be applied, etc. The function can internally omit certain fields or add new ones.

Caveat

db-root-spec disables the default OpenAPI output.

TODO

  • Cache the request.spec json. It would be wasteful to convert the dbStructure to json on each request.
  • Our current OpenAPI output considers the JWT role privileges for including accessible tables/procs. Not sure if this should be done by the root-spec as well or if we should leave it to the SQL function. For a later enhancement if needed.
  • ~~Define a better name for the request.spec GUC.~~ Keep the name (as decided below).
  • Add tests

Implementation notes

  • On Split off openapi into postgrest-contrib #1698 (comment), we discussed adding a special parameter to the root spec function for passing the schema cache. That option needed special handling in code (injecting the parameter without allowing the user to edit it through the URL, appending the json to the payload, etc.); passing a GUC was much simpler to implement.

@steve-chavez changed the title from "Db root spec" to "Add request.spec to db-root-spec" on Apr 7, 2021

@LorenzHenk

So this will replace the OpenAPI endpoint. As far as I understand, the OpenAPI information can be rebuilt via #1686.

My concern with this approach is that the OpenAPI endpoint should be publicly available (acts as API documentation), but the schema cache should only be used via our internal services (they contain unfiltered details, which should not be available to everybody - e.g. foreign key table information from other schemas).

How would you suggest making the OpenAPI information publicly available while not publicly exposing the schema cache?

You mentioned that the function can also take parameters, so one idea is to add an optional parameter retrieve_schema_cache boolean DEFAULT FALSE to control whether the schema cache or the OpenAPI information should be returned. If retrieve_schema_cache is true, additional checks on the jwt could be performed.

example:

create or replace function root_override(retrieve_schema_cache boolean DEFAULT FALSE) returns json as $$
  select case 
    when retrieve_schema_cache AND current_setting('request.jwt.claim.is_special_connection', true)::boolean then
      current_setting('request.spec', true)::json
    else
      -- OpenAPI
      get_openapi()
    end;
$$ language sql;

@steve-chavez (Member, Author)

> but the schema cache should only be used via our internal services (they contain unfiltered details, which should not be available to everybody - e.g. foreign key table information from other schemas)

@LorenzHenk Great observation. That is true: in the big json example, the relationships include the private schema tables.

Do you think you'll need private schemas (unexposed through REST) for your internal services? I think I should just filter the json schema cache to not include these. I don't see how they would be useful for OpenAPI or another spec.

@wolfgangwalther (Member)

> When db-root-spec is used and the root endpoint (/) is requested, this will add a request.spec GUC for that transaction.
>
> The request.spec GUC is the PostgREST schema cache in plain json (not OpenAPI); it contains the relationships between tables/views we infer, among other things.
>
> [...]
>
> Requesting root_override is done as a regular RPC: the function can have parameters, return any type, filters can be applied, etc. The function can internally omit certain fields or add new ones.
>
> [...]
>
> Implementation notes
>
> * On [#1698 (comment)](https://github.com/PostgREST/postgrest/issues/1698#issuecomment-748536949), we discussed adding a special parameter to the root spec function for passing the schema cache. That option needed special handling in code (injecting the parameter without allowing the user to edit it through the URL, appending the json to the payload, etc.); passing a GUC was much simpler to implement.

Hm. I see very much value in getting the interface for the db-root-spec RPC future-safe now. Once we implement it, we'd want to keep it the same, so that the OpenAPI RPC we develop for postgrest-contrib will hopefully work across more versions.

If we can find a way to pass this via argument to the function instead of GUC, that would go a long way. I don't like the GUC interface at all.

How about using an unnamed JSON argument for this? This would avoid most of the issues you described for implementation.

Let's keep it simple and:

  • when calling an RPC, see whether it has an unnamed JSONB argument
  • if yes, pass the spec info in as an extra param to the query. No need to add it to the payload - just like any other prepared parameter. Basically make the call to the proc: SELECT [...] root_override($2, a=> pgrst_args.a, ...)

This would allow fetching the spec in a regular RPC, too, for now. But we don't need to document this (it's not supported) and can remove it in the future, when we improve the handling here.
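For illustration, a root-spec function under this convention might look like the following (a sketch only; the unnamed-argument handling is the proposal above, not an implemented interface):

create or replace function root_override(jsonb) returns jsonb as $$
  -- PostgREST would detect the unnamed JSONB argument and bind the
  -- spec info to it as an extra prepared-statement parameter ($1 here)
  select $1;
$$ language sql;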

> My concern with this approach is that the OpenAPI endpoint should be publicly available (acts as API documentation), but the schema cache should only be used via our internal services (they contain unfiltered details, which should not be available to everybody - e.g. foreign key table information from other schemas).
>
> How would you suggest making the OpenAPI information publicly available while not publicly exposing the schema cache?
>
> You mentioned that the function can also take parameters, so one idea is to add an optional parameter retrieve_schema_cache boolean DEFAULT FALSE to control whether the schema cache or the OpenAPI information should be returned. If retrieve_schema_cache is true, additional checks on the jwt could be performed.
>
> example:
>
> create or replace function root_override(retrieve_schema_cache boolean DEFAULT FALSE) returns json as $$
>   select case
>     when retrieve_schema_cache AND current_setting('request.jwt.claim.is_special_connection', true)::boolean then
>       current_setting('request.spec', true)::json
>     else
>       -- OpenAPI
>       get_openapi()
>     end;
> $$ language sql;

I don't think you'd want to just dump spec to the user in production in any case. The spec json is provided as additional information to the RPC to allow building the OpenAPI output. So the function really needs to build the output first. And of course, while doing that it should avoid outputting any private data.

> @LorenzHenk Great observation. That is true: in the big json example, the relationships include the private schema tables.
>
> Do you think you'll need private schemas (unexposed through REST) for your internal services? I think I should just filter the json schema cache to not include these. I don't see how they would be useful for OpenAPI or another spec.

Right now we have a huge schema cache. The goal would be to remove all information from the schema cache that is only needed for the OpenAPI output. The root-spec RPC can then fetch this information directly from the database. The spec info provided by PostgREST should only contain the minimum amount of information - exactly those things that only PostgREST can give. That includes the relationships, of course - but only those that would be exposed/available.

So, I'm sure we'll need to limit the amount of data in spec. But we can only do that after step 2 (removing OpenAPI from core).

@LorenzHenk

> Do you think you'll need private schemas (unexposed through REST) for your internal services? I think I should just filter the json schema cache to not include these. I don't see how they would be useful for OpenAPI or another spec.

We're especially interested in the relationships between views based on relationships of their source tables.
For example (based on the big json)

test.orders_view.consumer -> test.consumers_hidden_view.id
    {
      "relTable": {
        "tableSchema": "test",
        "tableDescription": null,
        "tableInsertable": true,
        "tableName": "orders_view"
      },
      "relFColumns": [
        {
          "colName": "id",
          "colType": "integer",
          "colMaxLen": null,
          "colEnum": [],
          "colDescription": null,
          "colTable": {
            "tableSchema": "test",
            "tableDescription": null,
            "tableInsertable": true,
            "tableName": "consumers_hidden_view"
          },
          "colNullable": true,
          "colFK": null,
          "colDefault": null
        }
      ],
      "relType": "M2O",
      "relColumns": [
        {
          "colName": "consumer",
          "colType": "integer",
          "colMaxLen": null,
          "colEnum": [],
          "colDescription": null,
          "colTable": {
            "tableSchema": "test",
            "tableDescription": null,
            "tableInsertable": true,
            "tableName": "orders_view"
          },
          "colNullable": true,
          "colFK": null,
          "colDefault": null
        }
      ],
      "relFTable": {
        "tableSchema": "test",
        "tableDescription": null,
        "tableInsertable": true,
        "tableName": "consumers_hidden_view"
      },
      "relLink": {
        "constName": "public_orders_consumer_fkey",
        "tag": "Constraint"
      }
    },

We don't need the private information.
I think there can be use-cases where this information is relevant, though (not for us, but for other users of PostgREST).

> I don't think you'd want to just dump spec to the user in production in any case. The spec json is provided as additional information to the RPC to allow building the OpenAPI output. So the function really needs to build the output first. And of course, while doing that it should avoid outputting any private data.

In our case we're using the information internally for other services, so it doesn't have to be in an OpenAPI-compliant format.

@wolfgangwalther (Member)

> We're especially interested in the relationships between views based on relationships of their source tables.
> For example (based on the big json)
> test.orders_view.consumer -> test.consumers_hidden_view.id
>
> We don't need the private information.
> I think there can be use-cases where this information is relevant, though (not for us, but for other users of PostgREST).

I'm not sure about the last point. I think that in addition to removing all the info from the schema cache that we don't need for regular operation (after ripping OpenAPI out of core), we might further limit the json that is passed to the root-spec function to only the results of our schema analysis. I guess this is basically what you're asking for: we don't need the source-table information (which might be from a private schema), we just need the relationships between public tables/views that result from those private FKs, etc. - right?

@steve-chavez (Member, Author)

> Let's keep it simple and:
> when calling an RPC, see whether it has an unnamed JSONB argument
> if yes, pass the spec info in as an extra param to the query.
> But we don't need to document this (it's not supported)

Ah, but TBH that is the opposite of simple; it has various special cases.

> I see very much value in getting the interface for the db-root-spec RPC done future-safe now.

I think the GUC would be more future-proof, since we're not subject to a user-defined function breaking a convention. It's also one less instruction for the user. Our GUCs are a well-known mechanism for passing variables now; the special argument breaks that consistency.

> I don't like the GUC interface at all.

Why? The function can be made more clear with a variable.

create or replace function root_override() returns json as $$
declare
  schema_cache json := current_setting('request.spec', true)::json;
begin
  return schema_cache;
end; $$ language plpgsql;

Likely the OpenAPI output needs other GUCs, like the JWT ones, so I don't see what's bad about that.

@steve-chavez (Member, Author)

> We don't need the source-table information (which might be from a private schema), we just need the relationships between public tables/views that result from those private FKs, etc. - right?

Yeah, I think we all agree that only the exposed schemas should go into the json schema cache.

> So, I'm sure we'll need to limit the amount of data in spec. But we can only do that after step 2 (removing OpenAPI from core).

Also agree about limiting the amount of data. Since removing OpenAPI is not an option for now (there's no migration path yet), I can try filtering some info from the json schema cache instead.

@wolfgangwalther (Member)

> So, I'm sure we'll need to limit the amount of data in spec. But we can only do that after step 2 (removing OpenAPI from core).
>
> Also agree about limiting the amount of data. Since removing OpenAPI is not an option for now (there's no migration path yet), I can try filtering some info from the json schema cache instead.

I don't think you actually need to do it. But it would be good to take a good, hard look at the structure right now. If it's just some stuff that is removed, but the structure of what stays will be the same (i.e. the information is at the same json-path, etc.), then we can design a non-breaking SQL implementation of the OpenAPI output compatible with both pre- and post-"rip it out of core" versions. Optimizing this right now only makes sense if it potentially simplifies the interface a lot, e.g. by removing levels of nestedness that become useless later on, or by better names for keys, etc.

@wolfgangwalther (Member) commented Apr 8, 2021

> Ah, but TBH that is the opposite of simple; it has various special cases.

:D

> I see very much value in getting the interface for the db-root-spec RPC done future-safe now.
>
> I think the GUC would be more future-proof, since we're not subject to a user-defined function breaking a convention. It's also one less instruction for the user. Our GUCs are a well-known mechanism for passing variables now; the special argument breaks that consistency.
>
> I don't like the GUC interface at all.
>
> Why? The function can be made more clear with a variable.
>
> [...]
>
> Likely the OpenAPI output needs other GUCs, like the JWT ones, so I don't see what's bad about that.

Imho, we're kind of abusing GUCs a bit already. I'm just opposed to introducing more and more of them, and would like to get to a point where we don't need to inject them into every transaction anymore.

Here's a couple of arguments against GUCs:

  • GUCs are not designed for this. Tom Lane: "it might break sometime in the future. But I don't think we'd break it without providing some alternative solution."
  • GUCs have limitations regarding naming as we realized in #1729 (comment)
  • GUCs have no typing
  • GUCs have a performance-penalty as heavily discussed in #1600
  • GUCs have scalability issues: related to performance. Every new GUC that could possibly be used has a performance-penalty - independent of whether it is actually used or not. I mentioned that somewhere, and was able to measure it as part of #1600.
  • GUCs always have global scope and could possibly leak sensitive information by user error (easier than other methods, I feel)
  • GUCs have strange quirks (NULL vs '' later in the same session, can't find the reference right now)
  • GUCs are harder to unit-test for functions compared to arguments, e.g. with pgTAP (not by much, but they are)
  • GUCs are mutable - and can potentially be mis-used: you could use set_config('one.of.our.gucs', ...) and cause unpredictable havoc.

In #1710 (comment) I proposed a backwards-compatible way out of GUC-hell. I admit, it will probably still take quite some time until we get there. But in the meantime, it would be great if we were not creating new interfaces that depend on yet another usage of GUCs.

(Note: The whole db-config stuff is different - I really think GUCs are made exactly for that use-case)


> Likely the OpenAPI output needs other GUCs, like the JWT ones, so I don't see what's bad about that.

Ah, hm. Instead of using an unnamed JSON argument to the root spec function, I wonder whether we actually need to support named arguments passed in from the client at all?

It would be much better if we could implement the same interface down the road as mentioned in #1710 (comment) for any kind of hook.

Could be ~~less hard~~ even simpler (:D) to implement like this for now:

  • Assume a named argument spec JSONB (let's think about the name again before we finalize this!) for any root-spec proc
  • Pass only { spec: ... } as the request body in the case of a root request - and throw away all query string arguments
  • Optionally (for backwards-compatibility) parse the root-spec proc in the schema cache and check for that argument. Don't pass in the argument if it does not have it.

This would allow us to build the full "context-injection-through-arguments" proposal down the road, without changing the interface.
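A minimal sketch of a root-spec proc following this convention, assuming the proposed (not yet finalized) spec argument name:

create or replace function root_override(spec jsonb) returns jsonb as $$
  -- "spec" would be filled in by PostgREST from the { spec: ... } body;
  -- here we expose only the inferred relationships, as an example
  select spec -> 'dbRelations';
$$ language sql;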

I don't think we need named arguments for the root endpoint. Do we support those right now? That would be a breaking change then...

Down the road, we'd probably want to implement something like #1421. I already have an idea how to possibly do that with our regular query syntax, although I haven't specced it out in full, yet. We don't need function arguments for that, though.

@steve-chavez (Member, Author)

> GUCs are not designed for this. Tom Lane: "it might break sometime in the future. But I don't think we'd break it without providing some alternative solution."

I think that when the time comes, we'll migrate all of our GUCs to schema variables. But for now I don't see it as a big issue, considering that it will not just break, as mentioned above.

> GUCs have limitations regarding naming as we realized in #1729 (comment)

Yeah, but I don't see how this is relevant here. I'm not using a weird name.

> GUCs have no typing

postgrest-contrib might help here; we could do a wrapper similar to this one that includes a type. Say a pgrst.schema_cache().
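A minimal sketch of such a typed wrapper, using the hypothetical pgrst.schema_cache() name:

create schema if not exists pgrst;

-- hides the untyped GUC behind a function with a proper return type
create or replace function pgrst.schema_cache() returns json as $$
  select current_setting('request.spec', true)::json;
$$ language sql stable;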

> GUCs have a performance-penalty as heavily discussed in #1600
> GUCs have scalability issues: related to performance. Every new GUC that could possibly be used has a performance-penalty - independent of whether it is actually used or not. I mentioned that somewhere, and was able to measure it as part of #1600.

"independent of whether it is actually used or not" — :O really? I'm aware of GUC impact, but only when they're used. If you can remind me where this is mentioned, I would consider it a big reason against the GUC interface (if the impact is non-negligible).

> GUCs always have global scope and could possibly leak sensitive information by user error (easier than other methods, I feel)

All our GUCs are transaction-scoped; I don't see how they would leak.

> GUCs have strange quirks (NULL vs '' later in the same session, can't find the reference right now)

The postgrest-contrib wrapper would also solve this. This is also a pending issue on docs: PostgREST/postgrest-docs#362. Once that's solved, the quirk would be well known.

> GUCs are harder to unit-test for functions compared to arguments, e.g. with pgTAP (not by much, but they are)

Hm, I don't see why testing a function output (considering the wrapper) would be difficult. If there's something difficult about our GUCs, it's that they're transaction-scoped (noted on PostgREST/postgrest-docs#384). But this is a whole different issue.

> GUCs are mutable - and can potentially be mis-used: you could use set_config('one.of.our.gucs', ...) and cause unpredictable havoc.

We set (and override) all the GUCs (request.jwt, request.headers, etc.) once the transaction starts. Of course, if users want to override one of our GUCs they can do so, but I don't see why or how that would be a problem for us.

> I wonder whether we actually need to support named arguments passed in from the client at all?

Yes, and explicitly for #1421. The tableName could be an argument to the function, to filter the output. Another argument could be a schemaName, for the case where db-schema has many schemas.

At this stage I'd consider the root-spec as experimental. I don't want to limit the function in any way. So if users want to create RAML, API Blueprint, or any other spec (perhaps one that needs a query param), they have full flexibility on how they want to do it.

> Could be ~~less hard~~ even simpler (:D) to implement like this for now:

Btw, don't get me wrong. If a harder implementation results in a simpler interface for the user, I'll definitely do it. But I don't see that being the case here. The request.spec GUC is the simplest interface and also the simplest implementation.

> and throw away all query string arguments

Ah, too limiting. As mentioned above, args would have their use.

@wolfgangwalther (Member)

> GUCs have limitations regarding naming as we realized in #1729 (comment)
>
> Yeah, but I don't see how this is relevant here. I'm not using a weird name.

My rant was targeted at GUCs in general - not only in this case ;).

> GUCs have a performance-penalty as heavily discussed in #1600
> GUCs have scalability issues: related to performance. Every new GUC that could possibly be used has a performance-penalty - independent of whether it is actually used or not. I mentioned that somewhere, and was able to measure it as part of #1600.
>
> "independent of whether it is actually used or not" — :O really? I'm aware of GUC impact, but only when they're used. If you can remind me where this is mentioned, I would consider it a big reason against the GUC interface (if the impact is non-negligible).

With "used" I mean used by the user in some function or so. This is about the number of GUCs set on every request.

I just ran a similar test to what I did here: #1600 (comment)

Code here: https://pastebin.com/TvTzkq3Z

I'm comparing set_config on 2, 9 and 18 GUCs:

| n  | plan   | exec   | total  |
|----|--------|--------|--------|
| 2  | 0.0092 | 0.0148 | 0.0240 |
| 9  | 0.0217 | 0.0336 | 0.0554 |
| 18 | 0.0472 | 0.0695 | 0.1167 |

The relationship is quite clear, imho.
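For reference, the comparison can be reproduced roughly like this (a sketch; the exact benchmark code is in the pastebin above):

-- EXPLAIN ANALYZE reports the planning and execution time of
-- setting n transaction-local GUCs; vary 18 to 2 or 9 to compare
explain analyze
select set_config('request.header.h' || i, 'some-value', true)
from generate_series(1, 18) i;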

We should be able to easily try this with both of our loadtest setups, too: just provide more or fewer headers in the request, as those are all set as GUCs.

> GUCs always have global scope and could possibly leak sensitive information by user error (easier than other methods, I feel)
>
> All our GUCs are transaction-scoped; I don't see how they would leak.

What I mean is not across transactions, but within. Let's assume a user somehow dynamically requests GUCs (e.g. because of dynamic headers or so). It's not far-fetched that they could have some kind of injection attack made possible, where other GUCs than intended are requested. There are some setups where the JWT secret is saved in a GUC...

Yes, this'd be a user error. But if we can reduce the likelihood of that...

> GUCs are mutable - and can potentially be mis-used: you could use set_config('one.of.our.gucs', ...) and cause unpredictable havoc.
>
> We set (and override) all the GUCs (request.jwt, request.headers, etc.) once the transaction starts. Of course, if users want to override one of our GUCs they can do so, but I don't see why or how that would be a problem for us.

Again, that was just about "reducing the likelihood of user error".

@wolfgangwalther (Member)

> I wonder whether we actually need to support named arguments passed in from the client at all?
>
> Yes, and explicitly for #1421. The tableName could be an argument to the function, to filter the output. Another argument could be a schemaName, for the case where db-schema has many schemas.

The problem I have with this approach is that we'd have to replicate the filtering inside the root-spec function and couldn't use the query string syntax we already have in regular PostgREST. It would be much nicer to re-use that. As I said, I already have some ideas, but would need to write them down first.

> At this stage I'd consider the root-spec as experimental. I don't want to limit the function in any way. So if users want to create RAML, API Blueprint, or any other spec (perhaps one that needs a query param), they have full flexibility on how they want to do it.

That's a valid point. I think my idea of trying to keep the interface consistent from now on might be a bit far-stretched. If we consider the root-spec interface experimental, then I don't really object to using a GUC for now. Until we have implemented a SQL OpenAPI replacement, we can't really tell whether the interface works or not anyway...

@wolfgangwalther (Member) commented Apr 12, 2021

> I wonder whether we actually need to support named arguments passed in from the client at all?
>
> Yes, and explicitly for #1421. The tableName could be an argument to the function, to filter the output. Another argument could be a schemaName, for the case where db-schema has many schemas.
>
> [...]
>
> and throw away all query string arguments
>
> Ah, too limiting. As mentioned above, args would have their use.

One more note on this: even if we didn't support named args right now, we could still support them later on. We could just extend the "pytest-like-interface" to the function by having an argument args JSON that would hold all the query string arguments. This would allow us both: passing in a variable amount of information via function arguments, but supporting query string args at the same time.
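A sketch of that shape, with the hypothetical spec and args argument names:

create or replace function root_override(spec jsonb, args jsonb default '{}') returns jsonb as $$
  -- "args" would bundle all query string arguments into one value,
  -- so the function signature stays fixed while the inputs vary
  select case when args ->> 'relationships_only' = 'true'
              then spec -> 'dbRelations'
              else spec
         end;
$$ language sql;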

@wolfgangwalther (Member)

> We should be able to easily try this with both of our loadtest setups, too: just provide more or fewer headers in the request, as those are all set as GUCs.

I did, and it shows. Used the tests in #1812:

  • one with no additional header
  • one with 2 headers
  • one with many (21 I think)

This does not map exactly to the number of GUCs set, because vegeta sets a few more headers and PostgREST some more GUCs. But it still gives us a pretty good idea of how the number of GUCs impacts performance.

I ran it 2 times:

| test   | min [ms] | mean [ms] | max [ms] | hypothetical req/s |
|--------|----------|-----------|----------|--------------------|
| no 1   | 0.45     | 0.94      | 32.03    | 2242               |
| two 1  | 0.46     | 0.99      | 38.96    | 2180               |
| many 1 | 0.62     | 1.22      | 35.26    | 1617               |
| no 2   | 0.46     | 1.12      | 35.96    | 2180               |
| two 2  | 0.47     | 0.82      | 1.46     | 2145               |
| many 2 | 0.62     | 1.39      | 35.63    | 1616               |

Some notes:

  • "hypothetical req/s" is based on the minimum latency. Hypothetical in the sense: "If we didn't have any other processes running on that machine, this is what we could get."
  • "max latency" just doesn't make sense. It varies wildly and doesn't show anything. It's just a function of how lucky I got in terms of CPU scheduling, etc.
  • "mean latency" includes those huge numbers, too - this shows nicely that it's not useful either.
  • this was just some basic analysis in LibreOffice Calc and I didn't find a good way to calculate the median conditionally - the median would probably be a lot better than mean or max.
  • I still think minimum latency is the best indicator, though.

So... the added latency could also result from the fact that we need to parse those additional headers in Haskell. But I don't think so - it just aligns too nicely with the pure SQL timings...

The overhead is actually pretty severe - at least for such a basic query. So every GUC we generally add for all requests will make things slower. The general idea of "only provide request data when it's asked for" == "throw GUCs out of the window" seems not toooo bad.. WDYT?

Now, this does not really apply to request.spec, I think, because it's only set for root-spec-requests. But in general...

@steve-chavez (Member, Author)

> The overhead is actually pretty severe - at least for such a basic query. So every GUC we generally add for all requests will make things slower.

Wow, yeah, the perf loss is severe: from 2242 req/s to 1617 req/s. I didn't think it'd be that much. Will try to reproduce as well.

> The general idea of "only provide request data when it's asked for" == "throw GUCs out of the window" seems not toooo bad.. WDYT?

Yes. But I'm thinking more about changing the GUC interface: it's not just the headers that would lead to having more GUCs, but also claims, cookies and app settings.

We can change the interface to have a single json and reduce the number of GUCs in every case. It'd be like:

-- Instead of
SELECT current_setting('request.header.origin', true);
-- Have
SELECT (current_setting('request.headers', true)::json)->'origin';

@wolfgangwalther (Member)

> Yes. But I'm thinking more about changing the GUC interface: it's not just the headers that would lead to having more GUCs, but also claims, cookies and app settings.

Yes, that's why my proposal was to get rid of the whole GUC interface completely.

> We can change the interface to have a single json and reduce the number of GUCs in every case. It'd be like:
>
> -- Instead of
> SELECT current_setting('request.header.origin', true);
> -- Have
> SELECT (current_setting('request.headers', true)::json)->'origin';

That would be a little bit better, but then you still have the same problem when extending the GUC interface. Thinking about adding a new GUC (not related to headers), you always gotta keep a performance trade-off in mind. Really, you should then go all the way and do something like

SELECT (current_setting('pgrst.context', true)::json)->'request'->'headers'->>'origin';

One GUC to rule them all.

I don't like it.

We don't need to do anything about the GUC interface right now (and especially not for v8). If we treat db-root-spec as experimental, I'm fine with doing it via GUC for now. But I suggest not changing the GUC interface completely yet, before we have thoroughly discussed my proposal - and maybe I've had the time to try to implement it. Yes, it's not the easy way out, but I think it really has upside.

@steve-chavez (Member, Author)

> SELECT (current_setting('pgrst.context', true)::json)->'request'->'headers'->>'origin';
>
> One GUC to rule them all.

Ah, yes, that would be too extreme, too unwieldy for the user. The request.headers::json GUC is less convenient, but not by that much.

> We don't need to do anything about the GUC interface right now (and especially not for v8).
> If we treat db-root-spec as experimental, I'm fine with doing it via GUC for now.

Cool. Let's think about the GUC interface later.

@steve-chavez (Member, Author)

I've filtered O2M/M2O/M2M relationships which contained internal tables on one of their ends, and M2M relationships that contained a private junction table (related to #1736). Also, for columns that showed an internal FK, I've set that FK to null.

This is the new json output. If you search (Ctrl+F) the word "private" in the previous json output, you'll notice that there are matches. The new output has these matches filtered.

There's still one problem though. A function can return an internal type, and this shows in the new json output here.

That one corresponds to this test function:

create function test.ret_point_3d() returns private.point_3d as $$
select row(7, -3, 4)::private.point_3d;
$$ language sql;

What should we do about that? I'm thinking we should "hide" it and convert it to a generic pg_catalog.record.

@wolfgangwalther (Member)

> I've filtered O2M/M2O/M2M relationships which contained internal tables on one of their ends, and M2M relationships that contained a private junction table (related to #1736).

I understand we have those relationships with private tables in our schema cache, because:

  • we still allow embedding via private junction table. If I remember correctly, it's considered a bug - but it's not fixed, yet.
  • we allow requests to different schemas via Accept-Profile

I don't think you need to make any additional effort to filter private junctions out - they will be gone when the bug is fixed, too? But I assume they are filtered automatically when filtering the other schemas' tables? No idea how you implemented it, I didn't look at the code yet.

Regarding the different schemas, I wonder whether we should not structure our schema cache in such a way that we basically have a full schema cache for each schema in db-schemas, which can be accessed via Accept-Profile. I think right now we hold one big schema cache, right?

Are there any other valid reasons for us having more relationships in the schema cache that I missed right now? And if there are no reasons - but we still have excessive relationships left - should we not filter those out before filling the schema cache?

My conclusion is: If we do this right, we should not need to manually filter out anything. Right?

> create function test.ret_point_3d() returns private.point_3d as $$
> select row(7, -3, 4)::private.point_3d;
> $$ language sql;
>
> What should we do about that? I'm thinking we should "hide" it and convert it to a generic pg_catalog.record.

No, I don't think so. Imho, the current need for filtering results from us having too much information in the schema cache / not having the cache structured in the best way. But the return type of a function is not private information at all. Any decent root endpoint function should give information about the types of table columns, too - and those types might be in private schemas as well.

Basically: Once a schema is exposed, everything in the schema should be "public" information. And that includes all references to other objects in the database. Not the referenced objects, but the references themselves, yes.

> Also, for columns that showed an internal FK, I've set that FK to null.

Concluding from the above, we don't really need to do that. But there is little to no value in that information - so it surely does not hurt either.

@steve-chavez (Member, Author)

> I understand we have those relationships with private tables in our schema cache, because:
>
> • we still allow embedding via private junction table. If I remember correctly, it's considered a bug - but it's not fixed, yet.
> • we allow requests to different schemas via Accept-Profile

Yes, but most importantly for inferring relationships between views - whose source tables can be private.

> Regarding the different schemas, I wonder whether we should not structure our schema cache in such a way that we basically have a full schema cache for each schema in db-schemas, which can be accessed via Accept-Profile. I think right now we hold one big schema cache, right?

For the OpenAPI spec, the way it works now is that we output a different schema cache depending on Accept-Profile. For the root-spec, the big json contains all the schemas in db-schemas. I thought this would be the best way because we let the user control the schema filtering inside the function.

> Basically: Once a schema is exposed, everything in the schema should be "public" information. And that includes all references to other objects in the database. Not the referenced objects, but the references themselves, yes.

Hm, yes. I think that would be the most transparent way for us, same for the FK. I guess, in a way, it's not good design to return a private type from the function then.

> I don't think you need to make any additional effort to filter private junctions out - they will be gone when the bug is fixed, too?

Not necessarily: that bug could be fixed at the DbRequestBuilder level, with a conditional similar to this one:

-- Both relationship ends need to be on the exposed schema
schema == tableSchema relTable && schema == tableSchema relForeignTable &&

> My conclusion is: If we do this right, we should not need to manually filter out anything. Right?

Yeah, I think so too. I was trying to keep the change scoped to the request.spec feature, but I'll see if this can be done the right way, at the DbStructure level.

@wolfgangwalther (Member)

> Yes, but most importantly for inferring relationships between views - whose source tables can be private.

Ah, yes. Ideally, we should not return the base columns from the view query and then match them up with FKs that we keep in our schema cache - we should just return the final view relationships (based on base columns + FKs) from that query directly.

But that's definitely out of scope here.

> For the OpenAPI spec, the way it works now is that we output a different schema cache depending on Accept-Profile. For the root-spec, the big json contains all the schemas in db-schemas.

This one I don't understand. I understand that we give a different OpenAPI output for different profiles. But do we really have a separate schema cache for each already? I thought we only had one big schema cache?

> I thought this would be the best way because we let the user control the schema filtering inside the function.

I think the user should not be concerned with Accept-Profile handling. When the root endpoint is called with one profile, the proc should only receive data that is relevant to that schema. Everything else just makes stuff extremely complicated.

> Hm, yes. I think that would be the most transparent way for us, same for the FK. I guess, in a way, it's not good design to return a private type from the function then.

Not necessarily bad design, I think. I tend to split my api schema into 2 schemas: public and extra. I consider both schemas to be "public" and would expect PostgREST to parse them and also return certain data about them on the root endpoint. The big difference for me is: only objects in public should create endpoints. It's really annoying to suddenly see your computed columns exposed as RPCs when you put them in the public schema. But still - those computed columns are very much public, they should be used by the user! Down the road, I hope we can display them in the OpenAPI output, too!

> I don't think you need to make any additional effort to filter private junctions out - they will be gone when the bug is fixed, too?
>
> Not necessarily: that bug could be fixed at the DbRequestBuilder level, with a conditional similar to this one:

I see what you mean. However... once that's proposed, I would immediately argue to fix this at the root, I guess ;)

> My conclusion is: If we do this right, we should not need to manually filter out anything. Right?
>
> Yeah, I think so too. I was trying to keep the change scoped to the request.spec feature, but I'll see if this can be done the right way, at the DbStructure level.

If it's easily possible - cool. Otherwise, no need to change anything. My point was more meant along the lines of: "If we're going to fix this at other places down the road, you don't really need to get this 100% right, right now." Filtering the huge amount of data right now would be more of a convenience for the user, not a hard cut of "this is exactly the amount of information we want to expose", imho.

@steve-chavez (Member, Author)

> I understand that we give a different OpenAPI output for different profiles. But do we really have a separate schema cache for each already? I thought we only had one big schema cache?

My bad. Yes, only one schema cache (DbStructure); I meant a different OpenAPI output.

> When the root endpoint is called with one profile, the proc should only receive data that is relevant to that schema. Everything else just makes stuff extremely complicated.

Disagree here. If the user wants to get the full schema cache and we add our profile logic, then several requests would be needed, one per schema - no other choice. Getting the full schema cache would be useful for schema-based versioning (getting all resource versions from v1, v2, etc.), for example.

Also as mentioned before, this is experimental and I don't want to limit the possibilities. postgrest-contrib would decide if this is indeed too complicated and needs a change in core.

> Down the road, I hope we can display them in the OpenAPI output, too!

Yeah, that would be nice. I think the openapi-in-sql will allow us to focus on these issues better.

> If it's easily possible - cool. Otherwise, no need to change anything.

Cool. I'll see if it's easy. Otherwise, the schema cache filtering now is a single function; it'd be easy to drop or change.

@wolfgangwalther (Member)

> Disagree here. If the user wants to get the full schema cache and we add our profile logic, then several requests would be needed, one per schema - no other choice.

I can see your point for diagnostic questions, especially during development, ...

> Getting the full schema cache would be useful for schema-based versioning (getting all resource versions from v1, v2, etc.), for example.

... while this would benefit from our profile handling, because the user would provide the api version through it and surely does not want to receive a schema description for multiple versions at the same time, ...

> Also as mentioned before, this is experimental and I don't want to limit the possibilities. postgrest-contrib would decide if this is indeed too complicated and needs a change in core.

... but I fully agree here. No objections for now :)

@steve-chavez (Member, Author)

After rebasing on top of #1825, there's no need to do most of the filtering I mentioned above. One detail is still leaking, though: the inferred FK from a view column:

CREATE VIEW "articleStars" AS
SELECT article_stars.article_id AS "articleId",
       article_stars.user_id AS "userId",
       article_stars.created_at AS "createdAt"
FROM private.article_stars;

[
    {
      "colName": "articleId",
      "colType": "integer",
      "colMaxLen": null,
      "colEnum": [],
      "colDescription": null,
      "colTable": {
        "tableSchema": "test",
        "tableDescription": null,
        "tableInsertable": true,
        "tableName": "articleStars"
      },
      "colNullable": true,
      "colFK": {
        "fkCol": {
          "colName": "id",
          "colType": "integer",
          "colMaxLen": null,
          "colEnum": [],
          "colDescription": null,
          "colTable": {
            "tableSchema": "private",
            "tableDescription": null,
            "tableInsertable": true,
            "tableName": "articles"
          },
          "colNullable": false,
          "colFK": null,
          "colDefault": null
        }
      },
      "colDefault": null
    },
    {
      "colName": "userId",
      "colType": "integer",
      "colMaxLen": null,
      "colEnum": [],
      "colDescription": null,
      "colTable": {
        "tableSchema": "test",
        "tableDescription": null,
        "tableInsertable": true,
        "tableName": "articleStars"
      },
      "colNullable": true,
      "colFK": {
        "fkCol": {
          "colName": "id",
          "colType": "integer",
          "colMaxLen": null,
          "colEnum": [],
          "colDescription": null,
          "colTable": {
            "tableSchema": "test",
            "tableDescription": null,
            "tableInsertable": true,
            "tableName": "users"
          },
          "colNullable": false,
          "colFK": null,
          "colDefault": null
        }
      },
      "colDefault": null
    },
    {
      "colName": "createdAt",
      "colType": "timestamp without time zone",
      "colMaxLen": null,
      "colEnum": [],
      "colDescription": null,
      "colTable": {
        "tableSchema": "test",
        "tableDescription": null,
        "tableInsertable": true,
        "tableName": "articleStars"
      },
      "colNullable": true,
      "colFK": null,
      "colDefault": null
    }
]

There you can see how the colFK key brings out the table in private and its tableDescription, among other things.

This colFK field is only used by OpenAPI to show which columns have an FK. I'd argue that a view doesn't have an FK, so I'm thinking of removing FKs (making the colFK a Nothing) from views at the DbStructure level; only tables would have them.

@wolfgangwalther WDYT?

@wolfgangwalther (Member)

> This colFK field is only used by OpenAPI to show which columns have an FK. I'd argue that a view doesn't have an FK, so I'm thinking of removing FKs (making the colFK a Nothing) from views at the DbStructure level; only tables would have them.

You know, the whole concept of colFK doesn't make sense. A single column can have multiple FKs defined - which one is shown here right now? I assume it's random. So I suggest just removing it completely from tables and views. FKs are not really interesting anyway - the interesting bit is the relationships we infer from those FKs.
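To illustrate the ambiguity with a hypothetical schema (not from the test fixtures): a single column can carry several FK constraints, so a scalar colFK has to pick one of them arbitrarily.

create table countries (code char(2) primary key);
create table regions   (code char(2) primary key);
create table offices (
  -- "code" participates in two foreign keys at once
  code char(2),
  foreign key (code) references countries (code),
  foreign key (code) references regions (code)
);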


Taking a step back for a second:
The only really interesting info to put into request.spec are dbRelationships and maybe dbPrimaryKeys (because of PKs for views, which we derive through base columns).

Everything else (list of tables, columns, procs and version) can easily be queried directly from the database. And once we bring OpenAPI to SQL, I would assume we do exactly that - query this kind of data from the catalogs.

Why not just throw those out altogether for request.spec right away?

@steve-chavez (Member, Author)

> You know, the whole concept of colFK doesn't make sense. A single column can have multiple FKs defined - which one is shown here right now? I assume it's random. So I suggest just removing it completely from tables and views. FKs are not really interesting anyway - the interesting bit is the relationships we infer from those FKs.

Yeah, but removing colFK now would break the current OpenAPI, which is not what we want unless we have the new openapi-in-sql ready. I do think we have users of the <fk/> column comments in OpenAPI - which are produced from colFK.

> Everything else (list of tables, columns, procs and version) can easily be queried directly from the database.
> Why not just throw those out altogether for request.spec right away?

Seems too extreme. For example, getting a view's capabilities based on its INSTEAD OF triggers is not that easy in SQL; this might be data of interest. Columns might get more interesting if we differentiate virtual columns (ref), and procs will be interesting too, when we allow calling specified aggregate functions through the select param.
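For context, locating a view's INSTEAD OF triggers in the catalogs is simple enough; the harder part is deriving the actual insert/update/delete capabilities from them. A sketch:

-- list the INSTEAD OF triggers defined on a view
select tgname
from pg_trigger
where tgrelid = 'test."articleStars"'::regclass
  and tgtype & 64 <> 0;  -- bit 64 marks INSTEAD OF triggers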

The version is definitely the least interesting bit. I think I just stuffed it into DbStructure for convenience; it could be removed.

In general, I think it'd be better to make request.spec a mirror of the schema cache (DbStructure).

@steve-chavez (Member, Author)

Having functions return private types on request.spec still looks off to me; just the fact that it shows a private.<tbl> in the return name looks like a leak. Solving this would be the last step needed for merging this one.

> Not necessarily bad design, I think. I tend to split my api schema into 2 schemas: public and extra.
>
> I consider both schemas to be "public" and would expect PostgREST to parse them and also return certain data about them on the root endpoint. The big difference for me is: only objects in public should create endpoints.

(From the above comment)

@wolfgangwalther For that use case (which is indeed valid), would it not suffice to look in db-schemas plus db-extra-search-path (I assume you have extra in there) and filter any return type that doesn't belong to those schemas?

pg_catalog should be added to that list as well. Any type in another schema would be converted to a generic pg_catalog.record type.

If we do the above, I guess we're kind of forcing users to separate their api types into a non-private schema (like extra), which might be a good thing.

WDYT?

@wolfgangwalther (Member)

> Having functions return private types on request.spec still looks off to me; just the fact that it shows a private.<tbl> in the return name looks like a leak. [...] Any type in another schema would be converted to a generic pg_catalog.record type.
>
> If we do the above, I guess we're kind of forcing users to separate their api types into a non-private schema (like extra), which might be a good thing.
>
> WDYT?

I still don't like it at all. It's one thing to filter out "primary" database objects, but a totally different thing to modify those objects to show something else. Especially at the request.spec API layer - you don't really know what people would use the spec data given by PostgREST for. They might need the correct type information.

What if I decided to have some bigger extensions installed in a separate schema, because I re-use them across different parts of the database / maybe different APIs, e.g. postgis in a postgis schema? I don't add that to the extra search path - but I sure want to have the information that my functions return postgis types.

If a user wanted to avoid the private.type output, they could easily add a no-op domain on top of the private type in the public schema. Although the same could be said about "If a user wanted to avoid the pg_catalog.record output, ..." ;).
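A minimal sketch of that no-op domain, reusing the thread's private.point_3d example (domains over composite types require PostgreSQL 11+):

-- alias the private composite type under the exposed schema
create domain public.point_3d as private.point_3d;

-- the function can then return the aliased type instead of the private one
create or replace function test.ret_point_3d() returns public.point_3d as $$
  select row(7, -3, 4)::private.point_3d::public.point_3d;
$$ language sql;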

@steve-chavez (Member, Author)

> What if I decided to have some bigger extensions installed in a separate schema, because I re-use them across different parts of the database / maybe different APIs, e.g. postgis in a postgis schema? I don't add that to the extra search path - but I sure want to have the information that my functions return postgis types.

Hm, but if you do that, then you'd still need to add the postgis schema to db-extra-search-path; otherwise RPC would give errors because the postgis operators/functions are not on the PostgREST search path.

> I still don't like it at all. It's one thing to filter out "primary" database objects, but a totally different thing to modify those objects to show something else.
> If a user wanted to avoid the private.type output, they could easily add a no-op domain on top of the private type in the public schema. Although the same could be said about "If a user wanted to avoid the pg_catalog.record output, ..." ;).

Yeah, I don't like it either. Also, using the domain as an alias would work.

I guess putting a private type on the function return would be the same as putting a private table in the public schema. So it's up to the user to avoid this.

I'll continue the PR without the return type filtering 👍.

@steve-chavez (Member, Author)

Thinking a bit more about db-root-spec: it's similar to pre-request in that they're both "hooks" - they modify PostgREST's regular behavior. Perhaps that could be useful for classifying these configs in the docs.

Also, the name request.spec doesn't look apt: the spec doesn't depend on the request (unlike the other GUC variables). Perhaps we should rename it to root.spec?

@steve-chavez (Member, Author)

> Also, the name request.spec doesn't look apt: the spec doesn't depend on the request (unlike the other GUC variables). Perhaps we should rename it to root.spec?

I'll just keep it as request.spec - not depending on the request could be a temporary thing.

Later on, it could depend on the Accept-Profile schema. Or it could depend on the JWT role privileges, similar to how OpenAPI works now. These could be later enhancements, enabled by config options.

@steve-chavez force-pushed the db-root-spec branch 2 times, most recently from 263b957 to 40a0393 on May 22, 2021
@steve-chavez (Member, Author)

So I've finished the TODOs above: request.spec is now cached, and I've also added a test.

I've also detected a performance problem when testing request.spec and the current OpenAPI - both tests using the big schema (the production one that we share privately through email).

On the main branch, the big schema makes the OpenAPI output take about a minute (on my machine) while postgrest consumes all CPU cores - this for each request to root. Testing it in a browser makes the request succeed despite taking a while, but curl fails with curl: (18) transfer closed with outstanding read data remaining.

(I've noted the above is faster on v7.0.0, but that must be because our schema cache is bigger now - #1625)

request.spec is worse in this regard: I haven't managed to see the root request complete on the big schema (it also consumes all CPU cores). It takes more than a minute, and then my desktop starts lagging.

So this makes the current request.spec only fit for small-to-medium schemas, same as the current OpenAPI. The bottleneck is converting the schema cache to json in Haskell.

I think the way forward would be to somehow store the schema cache in postgrest-contrib and do the conversion to json in PostgreSQL.

Still, I think this PR is mergeable; it's a start for us to work on postgrest-contrib.

@wolfgangwalther WDYT?

@wolfgangwalther (Member)

> [...] request.spec is now cached [...]
> [...]
> [...] The bottleneck is converting the schema cache to json in Haskell.

Are you sure? I assume the conversion to json happens in the thread that reads the schema cache - so that should not really block the request on the root endpoint, at least once the schema cache has been loaded?

At the same time, I have no problem dumping the big schema with --dump-schema, which does json conversion, too. I used some older nightly version that doesn't filter the whole schema, etc... and the json output of --dump-schema for the big schema is 279 MB big... Still took postgrest only 30 seconds to dump that.

@steve-chavez (Member, Author)

> Are you sure? I assume the conversion to json happens in the thread that reads the schema cache - so that should not really block the request on the root endpoint, at least once the schema cache has been loaded?

Yes. Lazy evaluation is manifesting here: the json schema cache is not evaluated until it's actually requested. When I print the json schema cache, as in steve-chavez@d84c9c3, the same issue arises - printing it takes too long on the big schema.

> At the same time, I have no problem dumping the big schema with --dump-schema, which does json conversion, too.

I've just tried --dump-schema on the master branch with the big schema: it still takes more than a minute and lags my desktop (old Toshiba machine 😅). So it's definitely the json conversion.

> 279 MB big... Still took postgrest only 30 seconds to dump that.

It may not be a big deal on more powerful machines, since the json will be cached; but still, using a constant 279 MB of memory for storing the json schema cache is not ideal.

@wolfgangwalther Perhaps you could try this branch on the big schema and see how long it takes for the root request to complete?

-- create this function on the big schema
create or replace function root_override() returns json as $$
  select current_setting('request.spec', true)::json;
$$ language sql;

-- add it on the config
db-root-spec="test.root_override"

-- do the root request
curl localhost:3000/

@steve-chavez (Member, Author)

Since the above problem also happens with --dump-schema, I think there's no special perf issue to be solved here.

PR should be good to merge.

Commits

  • Now that PgVersion is not part of DbStructure, Config is a more apt module for it.
  • Also rename getDbStructure to queryDbStructure. AppState also had a getDbStructure function for a record field.
  • The request.spec GUC contains the schema cache structure in json. It's only available when the root endpoint (/) is requested and when db-root-spec is not empty. Also correct db-root-spec to accept a schema.