Add request.spec to db-root-spec #1794
Conversation
So this will replace the OpenAPI endpoint. As far as I understand, the OpenAPI information can be rebuilt via #1686. My concern with this approach is that the OpenAPI endpoint should be publicly available (it acts as API documentation), but the schema cache should only be used by our internal services (it contains unfiltered details that should not be available to everybody, e.g. foreign key table information from other schemas). How would you suggest making the OpenAPI information publicly available while not publicly exposing the schema cache? You mentioned that the function can also take parameters, so one idea is to add an optional parameter, for example:

```sql
create or replace function root_override(retrieve_schema_cache boolean DEFAULT FALSE) returns json as $$
select case
  when retrieve_schema_cache AND current_setting('request.jwt.claim.is_special_connection', true)::boolean then
    current_setting('request.spec', true)::json
  else
    -- OpenAPI
    get_openapi()
end;
$$ language sql;
```
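For illustration, a hedged sketch of how callers might then hit this (the is_special_connection claim is the hypothetical one from the example above, and root_override is assumed to be the function configured via db-root-spec):

```sql
-- internal service: the JWT carries is_special_connection = true,
-- so this branch returns the raw request.spec json
select root_override(retrieve_schema_cache => true);

-- public client: no special claim, so the call falls through to get_openapi()
select root_override();
```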
@LorenzHenk Great observation. That is true, in the big json example, the relationships include the Do you think you'll need
Hm. I see a lot of value in getting the interface for the If we can find a way to pass this via argument to the function instead of a GUC, that would go a long way. I don't like the GUC interface at all. How about using an unnamed JSON argument for this? This would avoid most of the issues you described for the implementation. Let's keep it simple and:
This would allow fetching the spec via a regular RPC, too, for now. But we don't need to document this (it's not supported) and can remove it in the future, when we improve the handling here.
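A minimal sketch of what that unnamed-argument variant could look like (assuming PostgREST would pass the schema cache as the single unnamed json argument; this is a proposal, not current behavior):

```sql
-- an unnamed parameter is referenced as $1 inside a SQL function body
create or replace function root_override(json) returns json as $$
  select $1;  -- echo back the schema cache handed in by PostgREST
$$ language sql;
```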
I don't think you'd want to just dump
Right now we have a huge schema cache. The goal would be to remove all information from the schema cache that is only needed for the OpenAPI output. The root-spec RPC can then fetch this information directly from the database. The spec info provided by PostgREST should only contain the minimum amount of information: exactly those things that only PostgREST can give. That includes the relationships of course, but only those that would be exposed/available. So, I'm sure we'll need to limit the amount of data in
We're especially interested in the relationships between views based on relationships of their source tables.
I'm not sure about the last point. I think in addition to removing all the info from the schema cache that we don't need for regular operation (after ripping OpenAPI out of core), we might further limit the json that is passed to the root-spec function to only have the results of our schema analysis. I guess this is basically what you're asking for: we don't need the source-table information (which might be from a private schema), we just need the relationships between public tables/views that are the result of those private FKs, etc. Right?
Ah, but TBH that is the opposite of simple, it has various special cases.
I think the GUC would be more future-proof, since we're not subject to a user-defined function breaking a convention. It's also one less instruction for the user. Our GUCs are a well-known mechanism for passing variables now; the special argument breaks that consistency.
Why? The function can be made clearer with a variable:

```sql
create or replace function root_override() returns json as $$
declare
  schema_cache json := current_setting('request.spec', true)::json;
begin
  return schema_cache;
end;
$$ language plpgsql;
```

Likely the OpenAPI needs other GUCs like the JWTs, so I don't see what's bad about that.
Yeah, I think we all agree that only the exposed schemas should go into the json schema cache.
Also agree about limiting the amount of data. To avoid removing OpenAPI (we can't, because there's no path forward for now), I can try filtering some info from the json schema cache.
I don't think you actually need to do it. But it would be good to take a good, hard look at the structure right now. If it's just some stuff that is removed, but the structure of what stays will be the same (i.e. the information is at the same json-path, etc.), then we can design a non-breaking SQL implementation of the OpenAPI output compatible with both pre- and post-"rip it out of core" versions. Optimizing this right now only makes sense if it potentially simplifies the interface a lot, e.g. by removing levels of "nestedness" that become useless later on, or by giving keys better names, etc.
:D
Imho, we're kind of abusing GUCs a bit already. I'm just opposed to introducing more and more of them, but would like to get to a point, where we don't need to inject them into every transaction anymore. Here's a couple of arguments against GUCs:
In #1710 (comment) I proposed a backwards-compatible way out of GUC-hell. I admit it will probably still take quite some time until we get there. But in the meantime, it would be great if we were not creating new interfaces that depend on yet another usage of GUCs. (Note: The whole
Ah, hm. Instead of using an unnamed JSON argument to the root spec function I wonder whether we actually need to support named arguments passed in from the client at all? It would be much better if we could implement the same interface down the road as mentioned in #1710 (comment) for any kind of hook. Could be
This would allow us to build the full "context-injection-through-arguments" proposal down the road, without changing the interface. I don't think we need named arguments for the root endpoint. Do we support those right now? That would be a breaking change then... Down the road, we'd probably want to implement something like #1421. I already have an idea how to possibly do that with our regular query syntax, although I haven't specced it out in full yet. We don't need function arguments for that, though.
I think that when the time comes, we'll migrate all of our GUCs to schema variables. But for now I don't see it as a big issue considering that it will not just break as mentioned above.
Yeah, but I don't see how this is relevant here. I'm not using a weird name.
postgrest-contrib might help here, we could do a wrapper similar to this one, that includes a type. Say a
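For illustration, a minimal sketch of what such a wrapper could look like (the function name request_spec and the nullif guard are assumptions, not an existing postgrest-contrib API):

```sql
-- hides the GUC name and its empty-string/transaction-scope quirks
-- behind a typed function
create or replace function request_spec() returns json as $$
  select nullif(current_setting('request.spec', true), '')::json;
$$ language sql stable;
```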
"independent of whether it is actually used or not": :O really? I'm aware of GUC impact, but only when they're used. If you can remind me where this is mentioned, I would consider it a big reason against the GUC interface (if the impact is non-negligible).
All our GUCs are transaction-scoped; I don't see how they would leak.
The postgrest-contrib wrapper would also solve this. This is also a pending issue on docs: PostgREST/postgrest-docs#362. Once that's solved, the quirk would be well known.
Hm, I don't see why testing a function output (considering the wrapper) would be difficult. If there's something difficult about our GUCs, it's that they're transaction-scoped (noted on PostgREST/postgrest-docs#384). But this is a whole different issue.
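As a rough sketch, testing through the hypothetical request_spec() wrapper from above could look like this (assuming pgTAP is installed; the json payload is made up):

```sql
begin;
select plan(1);
-- set the GUC exactly as PostgREST would: transaction-scoped
select set_config('request.spec', '{"dbRelations": []}', true);
select is(request_spec()::text, '{"dbRelations": []}', 'wrapper reads request.spec');
select * from finish();
rollback;
```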
We set (and override) all the GUCs (
Yes, and explicitly for #1421. The At this stage I'd consider the
Btw, don't get me wrong. If a harder implementation results in a simpler interface for the user, I'll definitely do it. But I don't see that this is the case here. The
Ah, too limiting. As mentioned above, args would have their use.
My rant was targeted at GUCs in general - not only in this case ;).
With "used" I mean by the user in some function or so. This is about the number of GUCs set on every request. I just ran a similar test to what I did here: #1600 (comment) Code here: https://pastebin.com/TvTzkq3Z I'm comparing
The relationship is quite clear, imho. We should be able to easily try this with both of our loadtest setups, too: Just provide more or less headers in the request, as those are all set as GUCs.
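For reference, a rough sketch of that kind of pure-SQL comparison (run with psql's \timing on; the GUC names and the query are placeholders):

```sql
-- simulate PostgREST setting N transaction-scoped GUCs per request,
-- then running the request's query; vary N and compare the timings
begin;
select set_config('request.header.x-dummy-' || i, 'some-value', true)
from generate_series(1, 40) as i;   -- N = 40 here
select count(*) from pg_class;      -- stand-in for the actual request query
commit;
```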
What I mean is not across transactions, but within. Let's assume a user somehow dynamically requests GUCs (e.g. because of dynamic headers or so). It's not far-fetched that they have some kind of injection attack made possible where they are requesting other GUCs than they should. There are some setups where the JWT secret is saved in a GUC... Yes, this would be a user error. But if we can reduce the likelihood of that...
Again, that was just about "reducing the likelihood of user error".
The problem I have with this approach, is that we'd have to replicate the filtering inside the root-spec function and can't use query string syntax we already have in regular PostgREST. It would be much nicer to re-use that. As I said, I already have some ideas, but would need to write them down first.
That's a valid point. I think my idea of trying to keep the interface consistent from now on might be a bit far-fetched. If we consider the
One more note on this: even if we didn't support named args right now, we could still support them later on. We could just extend the "pytest-like interface" to the function by having an argument
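For illustration, a hedged sketch of such a fixture-like signature (the argument name spec and the injection behavior are assumptions about a possible future interface, not current PostgREST behavior):

```sql
-- PostgREST would inspect the declared argument names and inject only
-- the context the function asks for, instead of setting GUCs up front
create or replace function root_override(spec json default null)
returns json as $$
  select spec;  -- just echo the injected schema cache
$$ language sql;
```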
I did and it shows. Used the tests in #1812:
This does not map exactly to the number of GUCs set, because vegeta sets a few more headers and PostgREST some more GUCs. But it still gives us a pretty good idea how the number of GUCs impacts performance. I ran it 2 times:
Some notes:
So... the added latency could also result from the fact that we need to parse those additional headers in Haskell. But I don't think so; it just aligns too nicely with the pure SQL timings... The overhead is actually pretty severe, at least for such a basic query. So every GUC we generally add for all requests will make things slower. The general idea of "only provide request data when it's asked for" == "throw GUCs out of the window" seems not toooo bad.. WDYT? Now, this does not really apply to
Wow, yeah, the perf loss is severe. From
Yes. But I'm thinking more about changing the GUC interface: it's not just the headers that would lead to having more GUCs, but also claims, cookies and app settings. We can change the interface to have a single json and reduce the number of GUCs in every case. It'd be like:

```sql
-- Instead of
SELECT current_setting('request.header.origin', true);
-- Have
SELECT (current_setting('request.headers', true)::json)->'origin';
```
Yes, that's why my proposal was to get rid of the whole GUC interface completely.
That would be a little bit better, but then you still have the same problem when extending the GUC interface. Thinking about adding a new GUC (not related to headers), you always gotta keep a performance trade-off in mind. Really you should then go all the way and do something like SELECT (current_setting('pgrst.context', true)::json)->'request'->'headers'->>'origin'; One GUC to rule them all. I don't like it. We don't need to do anything about the GUC interface right now (and especially not for v8). If we treat
Ah, yes that would be too extreme, too unwieldy for the user. The
Cool. Let's think later about the GUC interface.
I've filtered O2M/M2O/M2M relationships which contained internal tables on one of their ends, and M2M relationships that contained a private junction table (related to #1736). Also, for columns that showed an internal FK, I've set that FK to null. This is the new json output. If you search (Ctrl+F) the word There's still one problem though. A function can return an internal type, and this shows on the new json output here. That one matches this test function: postgrest/test/fixtures/schema.sql, lines 1048 to 1050 in 915d568
What should we do about that? I'm thinking we should "hide" it and convert it to a generic
I understand we have those relationships with private tables in our schema cache, because:
I don't think you need to make any additional effort to filter private junctions out; they will be gone when the bug is fixed, too? But I assume they are filtered automatically when filtering the other schemas' tables? No idea how you implemented it, I didn't look at the code yet. Regarding the different schemas, I wonder whether we should not structure our schema cache in such a way that we basically have a full schema cache for each schema in Are there any other valid reasons for us having more relationships in the schema cache that I missed right now? And if there are no reasons, but we still do have excessive relationships left... should we not filter those out before filling the schema cache? My conclusion is: if we do this right, we should not need to manually filter out anything. Right?
No, I don't think so. Imho, the current need for filtering results from us having too much information in the schema cache / not having the cache structured in the best way. But the return type of a function is not private information at all. Any decent root endpoint function should give information about the types of table columns, too, and those types might be in private schemas as well. Basically: once a schema is exposed, everything in the schema should be "public" information. And that includes all references to other objects in the database. Not the referenced objects, but the references themselves, yes.
Concluding from the above, we don't really need to do that. But there is little to no value in that information, so it surely does not hurt either.
Yes, but most importantly for inferring relationships between views - whose source tables can be private.
For the OpenAPI spec, the way it works now is that we output a different schema cache depending on
Hm, yes. I think that would be the most transparent way for us, same for the FK. I guess in a way it's not good design to return a private type from the function then.
Not necessarily, that bug could be fixed at the DbRequestBuilder level, with a conditional similar to this one: postgrest/src/PostgREST/Request/DbRequestBuilder.hs Lines 177 to 178 in 915d568
Yeah, I think so too. I was trying to keep the change scoped to the
Ah, yes. Ideally we should not return the base columns from the view query and then match them up with FKs that we keep in our schema cache; we should just return the final view relationships (based on base columns + FKs) from that query directly. But that's definitely out of scope here.
This one, I don't understand. I understand that we give a different OpenAPI output for different profiles. But do we really have a separate schema cache for each already? I thought we only had one big schema cache?
I think the user should not be concerned with
Not necessarily bad design, I think. I tend to split my api schema into 2 schemas:
I see what you mean. However... once that's proposed, I would immediately argue to fix this at the root, I guess ;)
If it's easily possible - cool. Otherwise, no need to change anything. My point was more meant along the lines of: "If we're going to fix this at other places down the road, you don't really need to get this 100% right, right now". Filtering the huge amount of data right now would be more of a convenience for the user, not a hard cut of "this is exactly the amount of information we want to expose", imho.
My bad. Yes, only one schema cache (DbStructure); I meant a different OpenAPI output.
Disagree here. If the user wants to get the full schema cache and we add our profile logic, then several requests would be needed, one per schema, no other choice. Getting the full schema cache would be useful for schema-based versioning (getting all resource versions from v1, v2, etc.), for example. Also, as mentioned before, this is experimental and I don't want to limit the possibilities. postgrest-contrib would decide if this is indeed too complicated and needs a change in core.
Yeah, that would be nice. I think the openapi-in-sql will allow us to focus on these issues better.
Cool. I'll see if it's easy. Otherwise, the schema cache filtering is now a single function; it'd be easy to drop or change.
I can see your point for diagnostic questions, especially during development, ...
... while this would benefit from our profile handling, because the user would provide the api version through it and surely does not want to receive a schema description for multiple versions at the same time, ....
... but I fully agree here. No objections for now :)
After rebasing on top of #1825, there's no need to do most of the filtering I mentioned above. One detail is still leaking though, the inferred FK from a view column: postgrest/test/fixtures/schema.sql, lines 775 to 779 in f6b3a5c

```json
[
{
"colName": "articleId",
"colType": "integer",
"colMaxLen": null,
"colEnum": [],
"colDescription": null,
"colTable": {
"tableSchema": "test",
"tableDescription": null,
"tableInsertable": true,
"tableName": "articleStars"
},
"colNullable": true,
"colFK": {
"fkCol": {
"colName": "id",
"colType": "integer",
"colMaxLen": null,
"colEnum": [],
"colDescription": null,
"colTable": {
"tableSchema": "private",
"tableDescription": null,
"tableInsertable": true,
"tableName": "articles"
},
"colNullable": false,
"colFK": null,
"colDefault": null
}
},
"colDefault": null
},
{
"colName": "userId",
"colType": "integer",
"colMaxLen": null,
"colEnum": [],
"colDescription": null,
"colTable": {
"tableSchema": "test",
"tableDescription": null,
"tableInsertable": true,
"tableName": "articleStars"
},
"colNullable": true,
"colFK": {
"fkCol": {
"colName": "id",
"colType": "integer",
"colMaxLen": null,
"colEnum": [],
"colDescription": null,
"colTable": {
"tableSchema": "test",
"tableDescription": null,
"tableInsertable": true,
"tableName": "users"
},
"colNullable": false,
"colFK": null,
"colDefault": null
}
},
"colDefault": null
},
{
"colName": "createdAt",
"colType": "timestamp without time zone",
"colMaxLen": null,
"colEnum": [],
"colDescription": null,
"colTable": {
"tableSchema": "test",
"tableDescription": null,
"tableInsertable": true,
"tableName": "articleStars"
},
"colNullable": true,
"colFK": null,
"colDefault": null
}
]
```

There you can see how the This @wolfgangwalther WDYT?
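For what it's worth, a sketch of how a root-spec function could blank out such leaked FKs itself, assuming (for illustration only) that request.spec were a json array of column objects shaped like the output above, with test as the only exposed schema:

```sql
create or replace function root_override() returns json as $$
  select json_agg(
    case
      -- null out FKs whose referenced table lives outside the exposed schema
      when coalesce(col #>> '{colFK,fkCol,colTable,tableSchema}', 'test') <> 'test'
        then jsonb_set(col::jsonb, '{colFK}', 'null')::json
      else col
    end
  )
  from json_array_elements(current_setting('request.spec', true)::json) as col;
$$ language sql;
```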
You know the whole concept of Taking a step back for a second: everything else (list of tables, columns, procs and version) can easily be queried directly from the database. And once we bring OpenAPI to SQL, I would assume we do exactly that: query this kind of data from the catalogs. Why not just throw those out altogether for
Yeah, but removing
Seems too extreme. For example, getting the views' capabilities based on The version is definitely the less interesting bit. I think I just stuffed that in In general I think it'd be better to make
Having functions return private types on
(From the above comment) @wolfgangwalther For that use case (which is indeed valid), would it not suffice to look in
If we do the above, I guess we're kind of forcing users to separate their api types into a non-private schema (like WDYT?
I still don't like it at all. It's one thing to filter out "primary" database objects, but a totally different thing to modify those objects to show something else. Especially at the request.spec API layer: you don't really know what people would use the spec data given by PostgREST for. They might need the correct type information. What if I decided to have some bigger extensions installed in a separate schema, because I re-use them across different parts of the database / maybe different APIs, e.g. postgis in a If a user wanted to avoid the
Hm, but if you do that then you'd still need to add the
Yeah, I don't like it either. Also, using a domain as an alias would work. I guess putting a private type on the function return would be the same as putting a private table on the public schema. So it's up to the user to avoid this. I'll continue the PR without the return type filtering 👍.
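For reference, a sketch of that domain aliasing (assuming PostgreSQL 11+, which supports domains over composite types; the type and function names are made up, with test as the exposed schema and private as the hidden one, matching the fixture names used in this thread):

```sql
-- a composite type living in the private schema
create type private.info as (code int, description text);

-- a domain alias in the exposed schema; this is what the spec would report
create domain test.info as private.info;

-- the function returns the alias instead of the private type
create or replace function test.get_info() returns test.info as $$
  select row(1, 'ok')::private.info::test.info;
$$ language sql;
```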
Thinking a bit more about the Also, the name
I'll just keep it to Later on, it could be dependent on the
So I've finished the TODOs above. I've also detected a performance problem when testing the On the main branch, the big schema makes the OpenAPI output take like a minute (on my machine) while PostgREST consumes all the CPU cores, and this happens for each request to root. Testing it on a browser makes the request succeed despite taking a while, but curl fails with (I've noted the above is faster on v7.0.0, but that must be because our schema cache is bigger now: #1625) The So this makes the current I think the way forward would be to somehow store the schema cache in Still, I think this PR is mergeable, it's a start for us to work on @wolfgangwalther WDYT?
Are you sure? I assume the conversion to json is happening in the thread that reads the schema cache, so that should not really block the request on the root endpoint, at least once the schema cache has been loaded? At the same time, I have no problem dumping the big schema with
Yes. Lazy evaluation is manifesting here: the json schema cache is not evaluated until it's actually requested. When I print the json schema cache, as in steve-chavez@d84c9c3, the same issue arises: printing it takes too long on the big schema.
I've just tried
It may not be a big deal on more powerful machines, since the json will be cached; but still, using a constant 279 MB of memory for storing the json schema cache is not ideal. @wolfgangwalther Perhaps you could try this branch on the big schema and see how long it takes for the root request to complete?
Since the above problem also happens to me for PR should be good to merge.
Now that PgVersion is not part of DbStructure, Config is a more apt module for it. Also rename getDbStructure to queryDbStructure. AppState also had a getDbStructure function for a record field.
The request.spec GUC contains the schema cache structure in json. It's only available when the root endpoint (/) is requested and when db-root-spec is not empty. Also correct db-root-spec to accept a schema.
Aims to close #1731. Partially addresses #1698.
When db-root-spec is used and the root endpoint (/) is requested, this will add a request.spec GUC for that transaction.

The request.spec GUC is the PostgREST schema cache in plain json (no OpenAPI); it contains the relationships between tables/views we infer, among other things.

Example usage:

Doing a GET / will obtain the following big json (this one uses our test suite fixtures); the dbRelations key contains the relationships, and relType is the cardinality.

Requesting root_override is done as a regular RPC: the function can contain parameters, return any type, filters can be applied, etc. The function can internally omit certain fields or add new ones.

Caveat

db-root-spec disables the default OpenAPI output.

TODO

- request.spec json. It would be wasteful to convert the dbStructure to json on each request.
- request.spec GUC. Keep the name (as decided below).

Implementation notes