core: add jwks rpc and http api #18035
Conversation
nomad/keyring_endpoint.go
Outdated
```go
if keyMeta.Inactive() {
	// Skip inactive keys
	continue
}
```
I went back to the RFC on this and it's not totally clear to me how this will work with root key rotation and identities being pulled from the client asynchronously. Suppose the following scenario:
- Allocation is placed with identity signed with root key A
- Root key is rotated to key B
- Allocation makes a request to third party with identity signed by A
- Third-party hasn't seen this key before, and makes the JWKS request, but the public key A is missing
I think we still need to show the public key for any key that can possibly have an unexpired JWT? If we cap the maximum JWT expiry to the root key rotation window (30d), that would suggest we need to display the active key + at least one previous key.
(I might suggest there's no harm in showing all the inactive keys, but because the root key is used to encrypt Variables too and we can't GC them without a full rekey, this could be a fairly large set over time.)
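To make that publishing rule concrete, here's a minimal sketch of the check (the names `rotatedAt` and `maxJWTLifetime` are illustrative, not existing Nomad fields):
```go
// Sketch only: publish the active key, plus any rotated key young enough
// that a JWT it signed could still be unexpired.
func shouldPublish(active bool, rotatedAt time.Time, maxJWTLifetime time.Duration, now time.Time) bool {
	if active {
		return true
	}
	return now.Sub(rotatedAt) < maxJWTLifetime
}
```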
> I went back to the RFC on this and it's not totally clear to me how this will work

Yeah when writing the RFC I assumed `inactive` meant invalid, and that is not the case: `inactive` keys are valid.
When fixing this up I was going to switch this logic to only skip `deprecated` keys, but it appears that state is deprecated: https://github.com/hashicorp/nomad/blob/v1.6.1/nomad/structs/keyring.go#L61-L64
JWKS must include the key for any credential that should be considered valid.
JWKS should not include keys for any credentials that should already be expired.
I can think of 2 paths forward, but both will require a new `pending` key status:
`pending` keys
Where "creation" is cluster creation (or at least when they first upgrade).
1. At `creation + rotation_threshold / 2` the leader should create a new key with the `pending` status. Pending keys must never be used for signing!
2. At `creation + rotation_threshold` the current `active` key becomes `inactive`, and the current `pending` key becomes `active`.
3. For `rotation_threshold / 2` amount of time there is no `pending` key (we want to limit how early we create it a bit because the longer a root key exists, the more likely it is to get leaked).
4. Goto 1.
This will be our pre-publishing mechanism like Vault implemented in hashicorp/vault#12414 (although the description makes it sound caching related, the code is pre-creating the next key).
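As a rough sketch of that loop (the function and return values here are illustrative, not existing Nomad code):
```go
// leaderRotationAction sketches the leader's periodic rotation check under
// the scheme above: pre-publish a pending key halfway through the window,
// then promote it when the window closes. Pending keys never sign anything.
func leaderRotationAction(activeKeyAge time.Duration, havePending bool, rotationThreshold time.Duration) string {
	switch {
	case activeKeyAge >= rotationThreshold:
		// Step 2: the active key becomes inactive, the pending key becomes active.
		return "promote-pending"
	case activeKeyAge >= rotationThreshold/2 && !havePending:
		// Step 1: create the next key early so JWKS consumers can fetch it
		// before it is ever used for signing.
		return "create-pending"
	default:
		// Step 3: for the first rotation_threshold/2 there is no pending key.
		return "none"
	}
}
```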
A. Redefine Key GC
This approach adjusts our existing `root_key_gc_threshold` to account for JWTs' new expiration behavior:
- Maintain the existing `root_key_rotation_threshold = "30 days"`.
- Change `root_key_gc_threshold` to `30 days` from `1 hour`.
- If `gc_threshold < rotation_threshold`, then force `gc_threshold = rotation_threshold` and spew a big warning in server logs.
- Validate that `identity.ttl <= gc_threshold`.
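For illustration, the forcing/validation rules above might look roughly like this (hypothetical function name and standard-library logging; not Nomad's actual config code):
```go
// normalizeKeyringThresholds sketches Option A's validation: force the GC
// threshold up to the rotation threshold, and reject identity TTLs that
// could outlive the published keys.
func normalizeKeyringThresholds(rotationThreshold, gcThreshold, identityTTL time.Duration) (time.Duration, error) {
	if gcThreshold < rotationThreshold {
		log.Printf("[WARN] root_key_gc_threshold (%s) is below root_key_rotation_threshold (%s); forcing them equal so unexpired workload identity JWTs always have a published key",
			gcThreshold, rotationThreshold)
		gcThreshold = rotationThreshold
	}
	if identityTTL > gcThreshold {
		return 0, fmt.Errorf("identity ttl (%s) must be <= root_key_gc_threshold (%s)", identityTTL, gcThreshold)
	}
	return gcThreshold, nil
}
```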
This means the JWKS endpoint can expose all non-deprecated keys (and if someday deprecated keys go away, so can that `if` statement here).
Users who manually configured their `rotation_threshold > root_key_gc_threshold` will get the `gc_threshold` forcibly bumped on upgrade and a warning in their logs. This sucks as it's changing behavior that a user specifically selected, but I think it's our best choice because:
- JWT expiration fundamentally impacts how the keyring is used, so I think it's appropriate to adapt the keyring's operation to properly handle that.
- Functionally there should be no user-visible changes: Variables will still work as before, JWTs will work as before, and `nomad operator root keyring rotate` will still work. `inactive` keys will just exist for much longer.
B. New Key Status
Instead of changing the GC threshold we could add an `invalid` or `expired` status for keys in addition to the new `pending` status both approaches require. (Simply un-deprecating `deprecated` might also be sufficient. I'm unclear on the plan/story there.)
- When a key is rotated its status still becomes `inactive`. These keys are still valid and therefore included in this endpoint.
- After `rotation_threshold` passes the key becomes `invalid` (or whatever name we choose). These keys are invalid and therefore excluded from this endpoint.
- After `gc_threshold` elapses we GC invalid keys, not inactive keys.
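To make the Option B filtering concrete, a minimal sketch (status names as plain strings, purely illustrative):
```go
// servedByJWKS sketches which key statuses the endpoint would expose under
// Option B: pending, active, and inactive keys can all back valid JWTs (or
// soon will), while invalid keys are past the rotation window.
func servedByJWKS(status string) bool {
	switch status {
	case "pending", "active", "inactive":
		return true
	default: // "invalid"/"expired", "deprecated", or anything unknown
		return false
	}
}
```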
The pro here is that we don't change anyone's settings out from underneath them...
...the con is that we do that by just injecting a completely new intermediate state they have no control over. 😅
Functionally the approaches seem the same to me. I think the new status (B) might be more user friendly as it makes invalid keys observable instead of gc'ing them away for good. However, new states offer opportunities for new bugs! If we forget that we now store untrusted invalid keys and accidentally use one somewhere, that's a potentially serious problem.
For this reason I think I have a slight preference for (A).
Some additional background to consider when choosing between these two options:
- Because it's such an expensive operation we don't currently, by default, ever re-key Variables. The periodic rotation just rotates the key for new Variables we encrypt, and even `nomad operator root keyring rotate` requires a `-full` flag to kick off a re-key. So keys aren't going to get GC'd unless an org's Variables are ephemeral or get frequently updated (the updates will get encrypted with the active key).
- The original intent of `deprecated` was that after all Variables for a key were re-keyed and Workload Identities expired, the key could be safely deleted manually. I can't find in the RFC or the original PRs why we didn't simply delete the key at that point, but it might have been paranoia/debuggability.
- We intentionally punted on trying to figure out a better solution to the full re-key because, as you noted at the time, a 30d rotation means we end up with 120 keys after 10 years, and that's just not enough to care about.
The `pending` key state sounds great. The `rotation_threshold / 2` does mean we're publishing 15 days in advance by default, which seems like a lot, but that's a nitpick based only on vibes and I'm fine with it if Vault was.
Option A. Redefine Key GC changes the threshold but otherwise leaves the existing GC behavior intact. That includes not re-keying by default and leaving the `inactive` keys around for potentially quite a long while.
Option B. New Key Status is arguably rolling back to what we were originally trying to do with the `deprecated` state, except with a working and reasonable expiration policy on Allocations because the JWTs expire now instead of living the lifetime of the Allocation. But the key can't really be considered `deprecated` until we've done a re-key, and that makes its value more dubious.
A third option would be to have an `expired` state (same as Option B) and when the GC finds keys in the expired state it kicks off a very slow re-key of all the Variables using that key, and then the `expired` key can be deleted. But we could also do the exact same thing with the `inactive` state by checking the rotation threshold first to ensure no JWTs are using the key.
So all that being said, and one option not being "obviously" better than another, I'd advocate for option A with a follow-up down the road to do a slow re-key of Variables using `inactive` keys that are older than the GC threshold.
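That follow-up could hang off a GC-time check along these lines (names illustrative; the slow re-key itself is the hard part and isn't shown):
```go
// eligibleForSlowRekey sketches the follow-up idea: only re-key (and later
// delete) an inactive key once it is past both the rotation threshold, so no
// unexpired JWT can reference it, and the GC threshold.
func eligibleForSlowRekey(status string, keyAge, rotationThreshold, gcThreshold time.Duration) bool {
	noLiveJWTs := keyAge > rotationThreshold
	pastGC := keyAge > gcThreshold
	return status == "inactive" && noLiveJWTs && pastGC
}
```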
> The pending key state sounds great. The `rotation_threshold / 2` does mean we're publishing 15 days in advance by default, which seems like a lot, but that's a nitpick based only on vibes and I'm fine with it if Vault was.
Good point. I think we'll probably want to see if we can keep rotations on the order of single-digit days.
To be clear the `/ 2` was just a nice round number to go for. My goal was to keep it as simple, proportional, and deterministic as possible without just forcing every little decision on users via config parameters. We could really easily put users in the position of:
> Ok so this alloc's client disconnected 3 days ago, has a max disconnect of 6 days, root keys are rotated every 14 days, and my identities have ttl's of 4 days so.........
That scenario is like 4 time-based config parameters already and doesn't cover JWKS or the behavior of 3rd party systems... so I kind of hope to just keep renewal logic proportional.
All that to say I'll make a note to revisit this when the server rotation PR goes up.
> I'd advocate for option A with a follow-up down the road to do a slow re-key of Variables using inactive keys that are older than the GC threshold.
Thanks for all of the context. It is interesting how distinct the root Variable symmetric key is from the root Workload Identity asymmetric key... we could also "just" decouple those since Variables are stateful and Workload Identities are not (soon). As usual "just" is papering over the huge amount of work and complexity that would entail, so I doubt that's the right option.
That being said leaving keys-used-for-Variables-but-not-for-valid-WIs around indefinitely seems fine as long as we prune them from JWKS reasonably.
So yeah, I think we have some wiggle room here both in initial (upcoming!) implementation as well as how to evolve it over time.
```go
switch alg := pubKey.Algorithm; alg {
case structs.PubKeyAlgEdDSA:
	// Convert public key bytes to an ed25519 public key
	jwk.Key = ed25519.PublicKey(pubKey.PublicKey)
default:
	s.logger.Warn("unknown public key algorithm. server is likely newer than client", "alg", alg)
}
```
I think this error message is suggesting that the request came to a Nomad client and was forwarded to a server that was newer, but the text isn't super clear and it's not necessarily the case (ex. the request was sent to a Nomad server that hadn't been upgraded and then that was forwarded to the leader which was).
What if the RPC returned both the `[]byte` and the string representation needed by downstream consumers, so that we didn't have to worry about this conversion at all? I know this isn't how we'd normally handle this kind of problem, but it seems like it'd cut out a lot of upgrade-related worries in the future.
Yeah I struggled with what responsibilities to put in the RPC vs HTTP handlers. If I understand your proposal it would mean the `KeyringPublicKey` struct the RPC handler returns would get changed to something like:
```go
type KeyringPublicKey struct {
	KeyID string `json:"kid"`
	// Always "OKP" (octet key pair) currently
	KeyType string `json:"kty"`
	PublicKey []byte `json:"x"`
	// Always "EdDSA" currently
	Algorithm string `json:"alg"`
	// Always "Ed25519" currently, must match alg
	Curve string `json:"crv"`
	// Always "sig" currently
	Use string `json:"use"`
	// Purely for Nomad's cache control use. Not part of JWK spec.
	CreateTime int64 `json:"-"`
}
```
Then we would just JSON marshal `struct { Keys []KeyringPublicKey }` in the HTTP handler and not even touch `go-jose.JSONWebKey`.
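For example, the DIY handler could be as small as this (a sketch, not the PR's actual handler; error handling elided):
```go
// writeJWKS sketches the DIY approach: marshal the RPC structs directly,
// relying on their JWK-shaped json tags, without touching go-jose at all.
func writeJWKS(w http.ResponseWriter, keys []KeyringPublicKey) {
	w.Header().Set("Content-Type", "application/jwk-set+json")
	resp := struct {
		Keys []KeyringPublicKey `json:"keys"`
	}{Keys: keys}
	_ = json.NewEncoder(w).Encode(resp) // error handling elided for brevity
}
```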
Reason 1 Against: Too DIY
I avoided this initially because it felt too DIY. The nice thing about `go-jose.JSONWebKey` is that it knows how to convert key material to the appropriate algorithm (alg), curve (crv), and type (kty) fields. In the future if we add other key types, as long as we updated the mapping of `Algorithm -> Go Type`, go-jose would ensure it was marshalled properly.
That being said go-jose is "saving" us 2 fields currently (`kty` and `crv`)... and `crv` we're effectively setting ourselves by doing the type conversion (that, as you know, is easy to forget!).
So my "too DIY to skip go-jose" argument seems weak at best...
Reason 2 Against: Validating JWTs in the Agent
But there's another reason to do the conversion here: if we ever want to validate JWTs in the Agent, the Agent needs to do the conversion we're doing here. There's no way to avoid converting `Key []byte -> ed25519.PublicKey` at some point in order to use it.
The auth middleware used for the Task API and the WhoAmI RPC would both benefit from being able to validate JWTs using these PublicKeys. In my other branch I have a `Keyring.ListPublic -> PublicKeyCache` conversion implemented so the JWKS endpoint and auth middleware could share the underlying public keys in their native Go types.
I think this is reason enough to keep `go-jose.JSONWebKey` around even though the type conversions are gross and error prone? I think we can keep the problematic conversion to 1 or 2 places in code.
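For the curious, here's roughly what that agent-side verification could look like, assuming go-jose's `jwt` package and a hypothetical cache of public keys keyed by `kid` (a sketch; expiry/audience checks omitted):
```go
// verifyWorkloadJWT sketches how auth middleware could validate a workload
// identity JWT against cached public keys in their native Go types.
func verifyWorkloadJWT(token string, keysByID map[string]ed25519.PublicKey) (map[string]any, error) {
	parsed, err := jwt.ParseSigned(token)
	if err != nil {
		return nil, err
	}
	if len(parsed.Headers) == 0 {
		return nil, errors.New("token has no JOSE header")
	}
	pub, ok := keysByID[parsed.Headers[0].KeyID]
	if !ok {
		return nil, fmt.Errorf("unknown key id %q", parsed.Headers[0].KeyID)
	}
	// Claims verifies the signature against pub before unmarshalling.
	claims := map[string]any{}
	if err := parsed.Claims(pub, &claims); err != nil {
		return nil, err
	}
	return claims, nil
}
```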
Alternative: JSONWebKeySet internally
Plan: just have the RPC return `go-jose.JSONWebKeySet` and let the HTTP handler serialize it.
Pro: All of the key handling is in the `encrypter` and `keyring_endpoint` files.
Con:
I don't think it's safe to send go-jose's `JSONWebKey` struct over our msgpack RPC. Maybe it is, and I could test it, but even if it is today I'm worried a change in either go-jose, msgpack, or even Go's crypto library could break things.
My main concern is `JSONWebKey.Key any` type confusion: go-jose relies on the key's concrete type to properly emit the right JWKS JSON (through a custom marshal implementation and internal struct). If the RPC/msgpack layer were to lose the concrete type information, go-jose may guess that it's a symmetric encryption key and cause a really confusing error (that I caused the other day and you helped me debug!).
I'm also a little worried that if we started using X509 certificates in the future the `x509.Certificate` might not roundtrip cleanly through msgpack, so we'd silently change things like timestamps?
I don't know... these may be unreasonable concerns, but "explicit is better than implicit" seems doubly true to me when it comes to cryptographic materials, and `Key interface{}` is the definition of implicit.
> If I understand your proposal it would mean the `KeyringPublicKey` struct the RPC handler returns would get changed to something like:
> ...
Yeah that's pretty much what I was thinking. To be clear, I'm less concerned about where in the code the conversion is happening than avoiding weird cross-version problems in doing that conversion with info we got from the server. But...
> The auth middleware used for the Task API and the WhoAmI RPC would both benefit from being able to validate JWTs using these PublicKeys.
I think I'm probably sold on the basis of this alone. We know we're likely to want to do this in the near-ish future.
What do you think about my comment about the text of the error message though?
```go
// JWKS Handler
s.mux.HandleFunc("/.well-known/jwks.json", s.wrap(s.JWKSRequest))
```
Typically the HTTP server will be protected with mTLS, which means it won't be reachable by third parties. So this probably needs to get served on a separate HTTP listener. (On its own port, I suppose?) This PR is already a nice size so we don't have to solve that in this PR, but I just wanted to make sure we didn't forget about that problem.
Yeah in the RFC (internal only, sorry folks) I call this out as a problem.
The Task API (+Workload Identity) gets around the mTLS problem for things running in Nomad... which our friends Consul and Vault usually are not.
...so then you're left with the option of running an mTLS-terminating proxy just for this endpoint, which is a hassle to say the least. How the proxy and the JWT's `issuer` field interacted could be problematic as well, although I don't think for Consul and Vault because their JWT auth methods allow you to specify the JWKS URL explicitly and only seem to use the `issuer` claim for mapping.
That being said this is only a problem for folks running with `tls.verify_https_client = true` explicitly, as it defaults to `false` even with mTLS enabled. I think this is still an ok default and recommendation as the HTTP endpoints will still be protected by (asymmetric) TLS, and the Consul and Vault JWT auth methods support setting a custom CA cert for that.
.......even after all of that we should probably add a "Only TLS, Not mTLS" HTTP listener for this, the eventual OIDC discovery endpoint, and maybe other "low risk" endpoints like metrics.
While I feel like the situation has fairly reasonable options, I think it's clearly too complicated. Somebody hardening their cluster by setting `tls.verify_https_client = true` one day and breaking their Consul integration due to the JWT auth method lacking a client certificate would just be hellish to debug. Definitely a (╯°□°)╯︵ ┻━┻ situation I want to avoid.
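For what it's worth, a dedicated listener like that could be as simple as the sketch below (addresses, cert paths, and the handler wiring are assumptions, not a proposal for the actual config surface):
```go
// serveJWKSOnly sketches a "TLS but not mTLS" listener that exposes only the
// JWKS path, so third parties don't need a client certificate even when
// tls.verify_https_client is enabled on the main API listener.
func serveJWKSOnly(addr, certFile, keyFile string, jwksHandler http.Handler) error {
	mux := http.NewServeMux()
	mux.Handle("/.well-known/jwks.json", jwksHandler)
	srv := &http.Server{
		Addr:    addr,
		Handler: mux,
		TLSConfig: &tls.Config{
			ClientAuth: tls.NoClientCert, // plain TLS; never request a client cert
			MinVersion: tls.VersionTLS12,
		},
	}
	return srv.ListenAndServeTLS(certFile, keyFile)
}
```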
I keep forgetting about `tls.verify_https_client = false` being the default. Even Vault's equivalent is set to false.
We definitely don't need to solve this in this PR for sure 😀
Co-authored-by: Tim Gross <[email protected]>
```go
// When we fallback to the min wait it means we miss the expiration, but this
// is necessary to prevent stampedes after outages and partitions.
must.GreaterEq(t, exp.Sub(now()), renew)
```
It would be nice to have this comment in the docstring for `ExpiryToRenewTime` rather than the test, because the "why" wasn't clear looking at the code.
```go
// If the expiration is in the past or less than the min wait, then the min
// wait time will be used with jitter.
```
But doesn't this behavior floor the result at 15min? If so, when would the client ever renew if the time to renew is always 15min in the future? Is the assumption that clients will call this function (or something that calls it, like the JWKS endpoint) once and then set a timer on that value?
> Is the assumption that clients will call this function (or something that calls it, like the JWKS endpoint) once and then set a timer on that value?
Yes, but oof you're right, this function has a sort of Fence Post Problem: if you call it before each fetch to determine whether it's time to fetch, you'll never renew!
However if you call it after you fetch it will give you a reasonable time to fetch in. JWKS is the latter, so I think it's safe:
- Consul fetches JWKS v1 response, Cache-Control=15 minutes
- Consul uses JWKS v1 for 15 minutes
- Consul fetches JWKS v2 response, Cache-Control=...
If Consul was using something like ExpiryToRenewTime prior to each fetch I think we'd have the problem you mention.
I'll see if there's a safer way to write this function when I post the next PR that references it. It's possible we don't want to actually share this code at all and can just make it a private function for JWKS to use.
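For reference, a hedged sketch of the calculation being discussed (not Nomad's actual `ExpiryToRenewTime`; the halving and the 10% jitter are illustrative choices):
```go
// renewWait sketches the pattern: aim to renew well before the expiration,
// but never sooner than minWait, so consumers can't stampede the servers
// right after an outage or partition. Jitter spreads renewals out further.
func renewWait(exp time.Time, now func() time.Time, minWait time.Duration) time.Duration {
	wait := exp.Sub(now()) / 2 // renew halfway to expiration
	if wait < minWait {
		// Flooring at minWait means we may miss the expiration, but that's
		// preferable to hammering the servers after outages and partitions.
		wait = minWait
	}
	if wait/10 > 0 {
		wait += time.Duration(rand.Int63n(int64(wait / 10))) // up to +10% jitter
	}
	return wait
}
```
The safe caller pattern is the one described above: fetch first, then sleep for the returned duration (e.g. with `time.AfterFunc`) before fetching again.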
(also just realized I named it "...to renew time" when it returns a duration, which is a bit silly)
LGTM! 👍 I've left a couple remaining questions but I don't think there's anything blocking here at this point.
Yanked out of #17434
This PR adds a JWKS endpoint to expose the public keys used to sign workload identity JWTs. The ultimate purpose of this endpoint is to enable integrations with 3rd party JWT auth methods such as Consul's and Vault's.
Not documenting this yet, but have a note on the project board. It's not really possible to use safely until we add audiences, expiration, and rotation to workload identities, so I'd like to leave it "hidden" or at least "unsupported" until we can offer a secure experience.