Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

core: add jwks rpc and http api #18035

Merged
merged 6 commits into from
Jul 27, 2023
Merged

core: add jwks rpc and http api #18035

merged 6 commits into from
Jul 27, 2023

Conversation

schmichael
Copy link
Member

@schmichael schmichael commented Jul 22, 2023

Yanked out of #17434

JWKS endpoint to expose the public keys used to sign workload identity JWTs. The ultimate purpose of this endpoint is to enable integrations with 3rd party JWT auth methods such as Consul's and Vault's.

Not documenting this yet, but have a note on the project board. It's not really possible to use safely to use until we add audiences, expiration, and rotation to workload identities, so I'd like to leave it "hidden" or at least "unsupported" until we can offer a secure experience.

nomad/keyring_endpoint.go Show resolved Hide resolved
Comment on lines 407 to 410
if keyMeta.Inactive() {
// Skip inactive keys
continue
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went back to the RFC on this and it's not totally clear to me how this will work with root key rotation and identities being pulled from the client asynchronously. Suppose the following scenario:

  • Allocation is placed with identity signed with root key A
  • Root key is rotated to key B
  • Allocation makes a request to third party with identity signed by A
  • Third-party hasn't seen this key before, and makes the JKWS request but the public key A is missing

I think we still need to show the public key for any key that can possibly have an unexpired JWT? If we cap the maximum JWT expiry to the root key rotation window (30d), that would suggest we need to display the active key + at least one previous key.

(I might suggest there's no harm in showing all the inactive keys but because the root key is used to encrypt Variables too and we can't GC them without a full rekey, his could be a fairly large set over time.)

Copy link
Member Author

@schmichael schmichael Jul 26, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went back to the RFC on this and it's not totally clear to me how this will work

Yeah when writing the RFC I assumed inactive meant invalid and that is not the case: inactive keys are valid.

When fixing this up I was going to switch this logic to only skip deprecated keys, but it appears that state is deprecated: https://github.com/hashicorp/nomad/blob/v1.6.1/nomad/structs/keyring.go#L61-L64

JWKS must include the key for any credential that should be considered valid.

JWKS should not include keys for any credentials that should already be expired.

I can think of 2 paths forward, but both will require a new pending key status:

pending keys

Where "creation" is cluster creation (or at least when they first upgrade).

  1. At creation + rotation_threshold / 2 the leader should create a new key with the pending status. Pending keys must never be used for signing!
  2. At creation + rotation_threshold the current active key becomes inactive, and the current pending key becomes active.
  3. For rotation_threshold / 2 amount of time there is no pending key (we want to limit how early we create it a bit because the longer a root key exists, the more likely it is to get leaked).
  4. Goto 1

This will be our pre-publishing mechanism like Vault implemented in hashicorp/vault#12414 (although the description makes it sound caching related, the code is pre-creating the next key).

A. Redefine Key GC

This approach adjusts our existing root_key_gc_threshold to account for JWT's new expiration behavior:

  1. Maintain the existing root_key_rotation_threshold = "30 days"
  2. Change root_key_gc_threshold to 30 days from 1 hour.
  3. If gc_threshold < rotation_threshold, then force gc_threshold = rotation_threshold and spew a big warning in server logs.
  4. Validate that identity.ttl <= gc_threshold.

This means the JWKS endpoint can expose all non-deprecated keys (and if someday deprecated keys go away, so can that if statement here).

Users who manually configured their rotation_threshold > root_key_gc_threshold will get the gc_threshold forcibly bumped on upgrade and a warning in their logs. This sucks as it's changing behavior that a user specifically selected, but I think it's our best choice because:

  1. JWT expiration fundamentally impacts how the keyring is used, so I think it's appropriate to adapt the keyring's operation to properly handle that.
  2. Functionally there should be no user visible changes: variables will still work as before. JWTs will work as before. nomad operator root keyring rotate will still work. inactive keys will just exist for much longer.

B. New Key Status

Instead of changing the GC threshold we could add an invalid or expired status for keys in addition to the new pending status both approaches require. (Simply un-deprecating deprecated might also be sufficient. I'm unclear on the plan/story there.)

  1. When a key is rotated its status still becomes inactive. These keys are still valid and therefore included in this endpoint.
  2. After rotation_threshold passes the key becomes invalid (or whatever name we choose). These keys are invalid and therefore excluded from this endpoint.
  3. After gc_threshold elapses we GC invalid keys, not inactive keys.

The pro here is that we don't change anyone's settings out from underneath them...

...the con is that we do that by just injecting a completely new intermediate state they have no control over. 😅

Functionally the approaches seem to the same to me. I think the new status (B) might be more user friendly as it makes invalid keys observable instead of gc'ing them away for good. However, new states offer opportunities for new bugs! If we forget that we now store untrusted invalid keys and accidently use one somewhere, that's a potentially serious problem.

For this reason I think I have a slight preference for (A).

Copy link
Member

@tgross tgross Jul 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some additional background to consider when chosing between these two options:

  • Because it's such an expensive operation we don't currently, by default, ever re-key Variables. The periodic rotation just rotates the key for new Variables we encrypt, and even nomad operator root keyring rotate requires a -full flag to kick off a re-key. So keys aren't going to get GC'd unless an org's Variables are ephemeral or get frequently updated (the updates will get encrypted with the active key).
  • The original intent of deprecated was that after all Variables for a key were re-keyed and Workload Identities expired, the key could be safely deleted manually. I can't find in the RFC or the original PRs why we didn't simply delete the key at that point but it might have been paranoia/debuggability.
  • We intentionally punted on trying to figure out a better solution to the full re-key because as you noted at the time a 30d rotation means we end up with 120 keys after 10 years, and that's just not enough to care about.

The pending key state sounds great. The rotation_threshold / 2 does mean we're publishing 15 days in advance by default, which seems like a lot, but that's a nitpick based only on vibes and I'm fine with it if Vault was.

Option A. Redefine Key GC changes the threshold but otherwise leaves the existing GC behavior intact. That includes not re-keying by default and leaving the inactive keys around for potentially quite a long while.

Option B. New Key Status is arguably rolling back to what we were originally trying to do with the deprecated state, except with a working and reasonable expiration policy on Allocations because the JWTs expire now instead of living the lifetime of the Allocation. But the key can't really be considered deprecated until we've done a re-key, and that makes its value more dubious.

A third option would be to have an expired state (same as Option B) and when the GC finds keys in the expired state it kicks off a very slow re-key of all the Variables using that key, and then the expired key can be deleted. But we could also do the exact same thing with the inactive state by checking the rotation threshold first to ensure no JWTs are using the key.

So all that being said and one option not being "obviously" better than another, I'd advocate for option A with a follow-up down the road to do a slow re-key of Variables using inactive keys that are older than the GC threshold.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The pending key state sounds great. The rotation_threshold / 2 does mean we're publishing 15 days in advance by default, which seems like a lot, but that's a nitpick based only on vibes and I'm fine with it if Vault was.

Good point. I think we'll probably want probably want to see if we can keep rotations on the order of single-digit-days.

To be clear the / 2 was just a nice round number to go for. My goal was to keep it as simple, proportional, and deterministic as possible without just forcing every little decision on users via config parameters. We could really easily put users in the position of:

Ok so this alloc's client disconnected 3 days ago, has a max disconnect of 6 days, root keys are rotated every 14 days, and my identities have ttl's of 4 days so.........

That scenario is like 4 time-based config parameters already and doesn't cover JWKS or the behavior of 3rd party systems... so I kind of hope to just keep renewal logic proportional.

All that to say I'll make a note to revisit this when the server rotation PR goes up.

I'd advocate for option A with a follow-up down the road to do a slow re-key of Variables using inactive keys that are older than the GC threshold.

Thanks for all of the context. It is interesting how distinct the root Variable symmetric key is from the root Workload Identity asymmetric key... we could also "just" decouple those since Variables are stateful and Workload Identities are not (soon). As usual "just" is papering over the huge amount of work and complexity that would entail, so I doubt that's the right option.

That being said leaving keys-used-for-Variables-but-not-for-valid-WIs around indefinitely seems fine as long as we prune them from JWKS reasonably.

So yeah, I think we have some wiggle room here both in initial (upcoming!) implementation as well as how to evolve it over time.

nomad/structs/keyring.go Outdated Show resolved Hide resolved
Comment on lines +55 to +61
switch alg := pubKey.Algorithm; alg {
case structs.PubKeyAlgEdDSA:
// Convert public key bytes to an ed25519 public key
jwk.Key = ed25519.PublicKey(pubKey.PublicKey)
default:
s.logger.Warn("unknown public key algorithm. server is likely newer than client", "alg", alg)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this error message is suggesting that the request came to a Nomad client and was forwarded to a server that was newer, but the text isn't super clear and it's not necessarily the case (ex. the request was sent to a Nomad server that hadn't been upgraded and then that was forwarded to the leader which was).

What if the RPC returned both the []byte and the string representation needed by downstream consumers, so that we didn't have to worry about this conversion at all? I know this isn't how we'd normally handle this kind of problem, but it seems like it'd cut out a lot of upgrade-related worries in the future.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I struggled with what responsibilities to put in the RPC vs HTTP handlers. If I understand your proposal it would mean the KeyringPublicKey struct the RPC handler returns would get changed to something like:

type KeyringPublicKey struct {
	KeyID     string `json:"kid"`

	// Always "OKP" (octet key pair) currently
	KeyType     string `json:"kty"`

	PublicKey []byte `json:"x"`

	// Always "EdDSA" currently
	Algorithm string `json:"crv"`

	// Always "Ed25519" currently, must match alg
	Curve string `json:"alg"`

	// Always "sig" currently
	Use string `json:"use"`

	// Purely for Nomad's cache control use. Not part of JWK spec.
	CreateTime int64 `json:"-"`
}

Then we would just JSON marshal struct { Keys []KeyringPublicKey } in the HTTP handler and not even touch go-jose.JSONWebKey.

Reason 1 Against: Too DIY

I avoided this initially because it felt too DIY. The nice thing about go-jose.JSONWebKey is that it knows how to convert key material to the appropriate algorithm (agl), curve (crv), and type (kty) fields. In the future if we add other key types, as long as we updated the mapping of Algorithm -> Go Type, go-jose would ensure it was marshalled properly.

That being said go-jose is "saving" us 2 fields currently (kty and crv)... and crv we're effectively setting ourselves by doing the type conversion (that as you know is easy to forget!).

So my "too DIY to skip go-jose" argument seems weak at best...

Reason 2 Against: Validating JWTs in the Agent

But there's another reason to do the conversion here: if we ever want to validate JWTs in the Agent, the Agent needs to do the conversion we're doing here. There's no way to avoid converting Key []byte -> ed25519.PublicKey at some point in order to use it.

The auth middleware used for the Task API and the WhoAmI RPC would both benefit from being able to validate JWTs using these PublicKeys. In my other branch I have a Keyring.ListPublic -> PublicKeyCache conversion implemented so the JWKS endpoint and auth middleware could share the underlying public keys in their native Go types.

I think this is reason enough to keep go-jose.JSONWebKey around even though the type conversions are gross and error prone? I think we can keep the problematic conversion to 1 or 2 places in code.

Alternative: JSONWebKeySet internally

Plan: just have the RPC return go-jose.JSONWebKeySet and just let the HTTP handler serialize it.

Pro: All of the key handling is the encrypter and keyring_endpoint files.

Con:

I don't think it's safe to send go-jose's JSONWebKey struct over our msgpack RPC. Maybe it is, and I could test it, but even if it is today I'm worried a change in either go-jose, msgpack, or even Go's crypto library could break things.

My main concern is JSONWebKey.Key any type confusion: go-jose relies on the key's concrete type to properly emit the right JWKS JSON (through a custom marshal implementation and internal struct). If the RPC/msgpack layer were to lose the concrete type information, go-jose may guess that it's a symmetric encryption key and cause a really confusing error (that I caused the other day and you helped me debug!).

I'm also a little worried that if we started using X509 certificates in the future that the x509.Certificate might not roundtrip cleanly through msgpack, so we'd silently change things like timestamps?

I don't know... these may be unreasonable concerns, but explicit is better than implicit seems doubly true to me when it comes to cryptographic materials and Key interface{} is the definition of implicit.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand your proposal it would mean the KeyringPublicKey struct the RPC handler returns would get changed to something like:
...

Yeah that's pretty much what I was thinking. To be clear, I'm less concerned about where in the code the conversion is happening than avoiding weird cross-version problems in doing that conversion with info we got from the server. But...

The auth middleware used for the Task API and the WhoAmI RPC would both benefit from being able to validate JWTs using these PublicKeys.

I think I'm probably sold on the basis of this alone. We know we're likely to want to do this in the near-ish future.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think about my comment about the text of the error message though?

Comment on lines +504 to +505
// JWKS Handler
s.mux.HandleFunc("/.well-known/jwks.json", s.wrap(s.JWKSRequest))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typically the HTTP server will be protected with mTLS, which won't be reachable by third parties. So this probably needs to get served on a separate HTTP listener. (On its own port, I suppose?) This PR is already a nice size so we don't have to solve that in this PR, but just wanted to make sure we didn't forget about that problem.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah in the RFC (internal only, sorry folks) I call this out as a problem.

The Task API (+Workload Identity) gets around the mTLS problem for things running in Nomad... which our friends Consul and Vault usually are not.

...so then you're left with the option of running an mTLS terminating proxy just for this endpoint which is a hassle to say the least. How the proxy and the JWT's issuer field interacted could be problematic as well, although I don't think for Consul and Vault because their JWT auth methods allow you to specify the JWKSURL explicitly and only seem to use the issuer claim for mapping.

That being said this is only a problem for folks running with tls.verify_https_client = true explicitly as it defaults to false even with mTLS enabled. I think this is still an ok default and recommendation as the HTTP endpoints will still be protected by (asymmetric) TLS, and the Consul and Vault JWT auth methods support setting a custom CA cert for that.

.......even after all of that we should probably add a "Only TLS, Not mTLS" HTTP listener for this, the eventual OIDC discovery endpoint, and maybe other "low risk" endpoints like metrics.

While I feel like the situation has fairly reasonable options, I think it's clearly too complicated. Somebody hardening their cluster by setting tls.verify_https_client = true one day breaking their Consul integration due to the JWT auth method lacking a client certificate would just be hellish to debug. Definitely a : (╯°□°)╯︵ ┻━┻ situation I want to avoid.

Copy link
Member

@tgross tgross Jul 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I keep forgetting about tls.verify_https_client = false being the default. Even Vault's equivalent is set to false.

We definitely don't need to solve this in this PR for sure 😀

Comment on lines +58 to +60
// When we fallback to the min wait it means we miss the expiration, but this
// is necessary to prevent stampedes after outages and partitions.
must.GreaterEq(t, exp.Sub(now()), renew)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice to have this comment in the docstring for ExpiryToRenewTime rather than the test, because the "why" wasn't clear looking at the code.

Comment on lines +15 to +16
// If the expiration is in the past or less than the min wait, then the min
// wait time will be used with jitter.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But doesn't this behavior floor the result at 15min? If so, when would the client ever renew if the time to renew is always 15min in the future? Is the assumption that clients will call this function (or something that calls it, like the JWKS endpoint) once and then set a timer on that value?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the assumption that clients will call this function (or something that calls it, like the JWKS endpoint) once and then set a timer on that value?

Yes, but oof you're right this function has a sort of Fence Post Problem: if you call it before each fetch to determine whether it's time to fetch: you'll never renew!

However if you call it after you fetch it will give you a reasonable time to fetch in. JWKS is the latter, so I think it's safe:

  1. Consul fetches JWKS v1 response, Cache-Control=15 minutes
  2. Consul uses JWKS v1 for 15 minutes
  3. Consul fetches JWKS v2 response, Cache-Control=...

If Consul was using something like ExpiryToRenewTime prior to each fetch I think we'd have the problem you mention.

I'll see if there's a safer way to write this function when I post the next PR that references it. It's possible we don't want to actually share this code at all and can just make it a private function for JWKS to use.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(also just realized I named it "...to renew time" when it returns a duration which is a it silly)

Copy link
Member

@tgross tgross left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! 👍 I've left a couple remaining questions but I don't think there's anything blocking here at this point.

@schmichael schmichael merged commit d14362e into main Jul 27, 2023
21 checks passed
@schmichael schmichael deleted the f-jwks branch July 27, 2023 18:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants