Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Making Cert-to-CRL bindings fully static #7094

Open
Tracked by #7312
aarongable opened this issue Sep 22, 2023 · 14 comments
Open
Tracked by #7312

Making Cert-to-CRL bindings fully static #7094

aarongable opened this issue Sep 22, 2023 · 14 comments
Assignees

Comments

@aarongable
Copy link
Contributor

Today, our certs are assorted into CRL shards at the time those shards are generated. The decision of which shard a cert will end up in is based on: the number of shards, the configured "chunk width", and the cert's notAfter timestamp. This is good, and prevents us from moving certs between shards unnecessarily.

However, it is not good enough for us to be able to include CRL URLs directly in our certs. In order to do that, the mapping needs to be fully static.

#7007 was a necessary prerequisite for this work.

Our design is as follows:


Phase 1: Populating the revokedCertificates Table

We create a new database table with the following schema:

CREATE TABLE `revokedCertificates` (
`id` bigint(20) NOT NULL AUTO_INCREMENT,
`issuerID` bigint(20) NOT NULL,
`shardIdx` bigint(20) NOT NULL,
`notAfter` datetime NOT NULL,
`serial` varchar(255) NOT NULL,
`revokedDate` datetime NOT NULL,
`revokedReason` int(11) NOT NULL,
PRIMARY KEY (`id`),
KEY `notAfter_idx` (`notAfter`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4
PARTITION BY RANGE(id)
(PARTITION p_start VALUES LESS THAN (MAXVALUE));

This table has all of the information necessary to look up revoked certificates quickly by shard, and all of the information necessary to populate a Revoked Certificate entry in a CRL. All columns are NOT NULL because rows will only be inserted when a certificate is actually revoked, and all of this information is available.

We augment the SA’s RevokeCertificate method to take a new parameter, the CRL shard to which the certificate belongs. If it receives a request which contains a shard index, it writes a row to the new revokedCertificates table in addition to its current update to the certificateStatus table.

We augment the RA with the same CRL bucketing configuration parameters as the crl-updater has, specifically NumShards and ShardWidth, which control which shard a certificate falls into. We also add to the RA a version of the shard computation code which can quickly determine the correct shard number given the config values above and a certificate’s notAfter date. When the RA processes a revocation request, it will compute the shard index for that certificate, and include it in the request to the SA, triggering the codepath described above.


Phase 2: Using the revokedCertificates Table

When Phase 1 has been deployed for more than 90 days, and we are sure that all unexpired revoked certificates have rows in the revokedCertificates table, we can begin reading from it instead of reading from the certificateStatus table.

We will augment the SA’s GetRevokedCerts method to take a new parameter, the CRL shard which is being queried. If it receives a request which contains a shard index, it will query the revokedCertificates table instead of the certificateStatus table, and will include the shard index as an additional WHERE clause in the query.

We then greatly simplify crl-updater’s query mechanism. Rather than using its configuration parameters to compute complex sets of ExpiresBefore and ExpiresAfter dates that contribute to each shard, it will instead send a single query per shard, including that shard ID in the request to the SA. The request will still include ExpiresAfter and ExpiresBefore timestamps, because the crl-updater is the only entity which knows how far back our look-back needs to go and what timestamp is being used as the cutoff for this CRL.

@aarongable
Copy link
Contributor Author

Implementation plan:

  1. Create the table, fully implement the updated versions of RevokeCertificate, UpdateRevokedCertificate, and GetRevokedCertificates. None of the new code will be active until other components begin supplying shard indices in their requests.
  2. Wait one deployment cycle, for the table to be created and the gRPC methods to be deployed.
  3. Add code to the RA (and admin-revoker?) to compute and supply a ShardIdx when revoking a certificate. This will activate the RevokeCertificate and UpdateRevokedCertificate code paths.
  4. Wait 100 days to ensure that all unexpired revoked certificates are represented in the new table.
  5. Add code to the crl-updater to request revoked certs by shard, rather than by chunk boundaries. This will activate the GetRevokedCertificates code path.
  6. Clean up old code paths.

This bug tracks all stages of this implementation plan.

@aarongable aarongable self-assigned this Sep 22, 2023
aarongable added a commit that referenced this issue Oct 2, 2023
Add a new "revokedCertificates" table to the database schema. This table
is similar to the existing "certificateStatus" table in many ways, but
the idea is that it will only have rows added to it when certificates
are revoked, not when they're issued. Thus, it will grow many orders of
magnitude slower than the certificateStatus table does. Eventually, it
will replace that table entirely.

The one column that revokedCertificates adds is the new "ShardIdx"
column, which is the CRL shard in which the revoked certificate will
appear. This way we can assign certificates to CRL shards at the time
they are revoked, and guarantee that they will never move to a different
shard even if we change the number of shards we produce. This will
eventually allow us to put CRL URLs directly into our certificates,
replacing OCSP URLs.

Add new logic to the SA's RevokeCertificate and UpdateRevokedCertificate
methods to handle this new table. If these methods receive a request
which specifies a CRL shard (our CRL shards are 1-indexed, so shard 0
does not exist), then they will ensure that the new revocation status is
written into both the certificateStatus and revokedCertificates tables.
This logic will not function until the RA is updated to take advantage
of it, so it is not a risk for it to appear in Boulder before the new
table has been created.

Also add new logic to the SA's GetRevokedCertificates method. Similar to
the above, this reads from the new table if the ShardIdx field is
supplied in the request message. This code will not operate until the
crl-updater is updated to include this field. We will not perform this
update for a minimum of 100 days after this code is deployed, to ensure
that all unexpired revoked certificates are present in the
revokedCertificates table.

Part of #7094
@aarongable
Copy link
Contributor Author

IN-9706 has been file to track creating the new table in prod. I've begun work on the RA code which cannot land until after the table actually exists.

aarongable added a commit that referenced this issue Nov 1, 2023
Add a new clock argument to the test-only ThrowAwayCert function, and
use that clock to generate reasonable notBefore and notAfter timestamps
in the resulting throwaway test cert. This is necessary to easily test
functions which rely on the expiration timestamp of the certificate,
such as upcoming work about computing CRL shards.

Part of #7094
aarongable added a commit that referenced this issue Nov 2, 2023
Make crl.GetChunkAtTIme an exported function, so that it can be used by
the RA in a future PR without making that PR bigger and harder to
review. Also move it from being a method to an independent function
which takes two new arguments to compensate for the loss of its
receiver.

Also move some tests from batch_test to updater_test, where they should
have been in the first place.

Part of #7094
@aarongable
Copy link
Contributor Author

We've always known that this project would involve touching the CA's CRL issuance code. However, I incorrectly believed that that work would be able to be put off until it was time to actually include CRL URLs in our certs. Comments on #7133 have made it clear that no, that work should actually happen now. I'm tracking that work in #7159, which blocks this bug.

aarongable added a commit that referenced this issue Feb 13, 2024
Move the CRL issuance logic -- building an x509.RevocationList template,
populating it with correctly-built extensions, linting it, and actually
signing it -- out of the //ca package and into the //issuance package.
This means that the CA's CRL code no longer needs to be able to reach
inside the issuance package to access its issuers and certificates (and
those fields will be able to be made private after the same is done for
OCSP issuance).

Additionally, improve the configuration of CRL issuance, create
additional checks on CRL's ThisUpdate and NextUpdate fields, and make it
possible for a CRL to contain two IssuingDistributionPoint URIs so that
we can migrate to shorter addresses.

IN-10045 tracks the corresponding production changes.

Fixes #7159
Part of #7296
Part of #7294
Part of #7094
Part of #7100
maksimsavrilov pushed a commit to plesk/boulder that referenced this issue Feb 14, 2024
Move the CRL issuance logic -- building an x509.RevocationList template,
populating it with correctly-built extensions, linting it, and actually
signing it -- out of the //ca package and into the //issuance package.
This means that the CA's CRL code no longer needs to be able to reach
inside the issuance package to access its issuers and certificates (and
those fields will be able to be made private after the same is done for
OCSP issuance).

Additionally, improve the configuration of CRL issuance, create
additional checks on CRL's ThisUpdate and NextUpdate fields, and make it
possible for a CRL to contain two IssuingDistributionPoint URIs so that
we can migrate to shorter addresses.

IN-10045 tracks the corresponding production changes.

Fixes letsencrypt#7159
Part of letsencrypt#7296
Part of letsencrypt#7294
Part of letsencrypt#7094
Part of letsencrypt#7100
@aarongable
Copy link
Contributor Author

This is not a hard computer science problem, but it is a hard engineering problem. Deciding the separation of responsibilities between the RA and the SA, encapsulating logic, preventing config from being duplicated between too many different services, etc.

Current plan:

  • RA gets configured with number of CRL shards
  • At issuance time, RA determines shard randomly, and informs CA of which shard the cert should be in
  • CA uses this to bake the CRL URL into the cert
  • At revocation time, RA parses the CRL URL out of the cert and writes the shard number into the revokedCerts table for the crl-updater to use

@jsha
Copy link
Contributor

jsha commented Jan 7, 2025

Current plan:

  • RA gets configured with number of CRL shards
  • At issuance time, RA determines shard randomly, and informs CA of which shard the cert should be in
  • CA uses this to bake the CRL URL into the cert
  • At revocation time, RA parses the CRL URL out of the cert and writes the shard number into the revokedCerts table for the crl-updater to use

How does this interact with certificates issued under the current scheme that uses temporal sharding?

Proposal: at CRL generation time, check for certificates by both issuance time and shardIdx.

Specifically, modify GetRevokedCerts (

boulder/sa/saro.go

Lines 1062 to 1068 in 7137605

func (ssa *SQLStorageAuthorityRO) GetRevokedCerts(req *sapb.GetRevokedCertsRequest, stream grpc.ServerStreamingServer[corepb.CRLEntry]) error {
if req.ShardIdx != 0 {
return ssa.getRevokedCertsFromRevokedCertificatesTable(req, stream)
} else {
return ssa.getRevokedCertsFromCertificateStatusTable(req, stream)
}
}
) so that instead of doing something completely different when shardIdx is present, it does something additional (checks the revokedCertificates table) and combines the results.

Once that's deployed, modify crl-updater to unconditionally pass ShardIdx in its call to GetRevokedCerts.

Then add the code to recognize embedded CRL URLs at revocation time and write to revokedCertificates.

Then add the code to generate embedded CRL URLs at issuance time.

@aarongable
Copy link
Contributor Author

I believe that plan would naively result in some certificates being included in multiple CRL shards -- they'd appear in one shard due to temporal sharding, and appear in a different shard due to the RevokedCerts table (if they were revoked after the RA starts supplying a shard number at revocation time).

The original plan was to have the RA supply a shard number, then wait for all certs that were revoked without a shard number to expire (so we have 100% coverage), then flip the flag. This would cause certs to move between shards, but they'd never be duplicated. And it would take 3 months to bake.

I think that your idea works, and works much faster, at the cost of temporarily inflating our CRL sizes. I'm not personally aware of any compliance issue with a cert appearing in multiple shards of a sharded CRL, as long as the data in each entry is identical.

@jsha
Copy link
Contributor

jsha commented Jan 7, 2025

Yes, it would definitely result in most revoked certificate showing up in two shards - its temporal shard and its explicitly assigned shard. I also think this is okay.

If we want to avoid that, another option: when assigning explicit shards (at issuance time), assign them from a range of numbers that don't overlap with the temporal sharding range at all. This would require temporarily doubling the number of shards we publish in CCADB. This has the advantage that it might be a little more obvious to debug. "This shard number is > 2000, so we expect it to be filled with certificates that had an explicit shard assigned at issuance time."

@aarongable
Copy link
Contributor Author

I think simply inflating our current shards is the simpler and better solution. I'm doing some digging now, but so far nothing has popped up to indicate this would be a compliance issue.

@jsha
Copy link
Contributor

jsha commented Jan 10, 2025

#7918 is out for review.

Of course, having made that change and updated the tests, I'm now wondering: perhaps it would be better to expose a new SA RPC: GetRevokedCertsByShard, which only returns certificates by their explicit shard. Then the crl-updater could be responsible for calling both endpoints and merging the results. This would have some advantages:

Any possible memory bloat would be taken by the crl-updater (which already has some memory bloat), not the SA.

Also, it would let crl-updater get clever about when the "last" temporal shard is - that is, the last span of time that covers certificates without a CRLDP.

@aarongable
Copy link
Contributor Author

I do like the idea of having crl-updater take the memory bloat, rather than the SA.

It could be done even without the new method: GetRevokedCerts's behavior could be conditioned on the presence of the ShardIdx field, doing only temporal if it is missing and only explicit if it is provided.

@jsha
Copy link
Contributor

jsha commented Jan 13, 2025

It could be done even without the new method

I'm choosing to do it with the new method, because the set of fields required for GetRevokedCertsByShard is significantly different, and it's risky to have a set of fields that are sometimes required and sometimes not, conditioned on a different field.

@jsha
Copy link
Contributor

jsha commented Jan 14, 2025

The current plan has these parts:

jsha added a commit that referenced this issue Jan 22, 2025
In RA.RevokedCertificate, if the certificate being revoked has a
crlDistributionPoints extension, parse the URL and pass the appropriate
shard to the SA.

This required some changes to the `admin` tool. When a malformed
certificate is revoked, we don't have a parsed copy of the certificate
to extract a CRL URL from. So, specifically when a malformed certificate
is being revoked, allow specifying a CRL shard. Because different
certificates will have different shards, require one-at-a-time
revocation for malformed certificates.

To support that refactoring, move the serial-cleaning functionality
earlier in the `admin` tool's flow.

Also, split out one of the cases handled by the `revokeCertificate`
helper in the RA. For admin malformed revocations, we need to accept a
human-specified ShardIdx, so call the SA directly in that case (and skip
stat increment since admin revocations aren't useful for metrics). This
allows `revokeCertificate` to be a more helpful helper, by extracting
serial, issuer ID, and CRL shard automatically from an
`*x509.Certificate`.

Note: we don't yet issue certificates with the crlDistributionPoints
extension, so this code will not be active until we start doing so.

Part of #7094.
jsha added a commit that referenced this issue Jan 22, 2025
The SA had some logic (not yet in use) to return revoked certificates
either by temporal sharding (if `req.ShardIdx` is zero) or by explicit
sharding (if `req.ShardIdx` is nonzero).

This PR splits the function into two. The existing `GetRevokedCerts`
always does temporal sharding. The new `GetRevokedCertsByShard` always
does explicit sharding. Eventually only `GetRevokedCertsByShard` will be
necessary. This change was discussed in
#7094 (comment)
and is a precursor to having the crl-updater call both methods, so we
can merge the results when generating CRLs.
@jsha
Copy link
Contributor

jsha commented Jan 24, 2025

For #7974 I needed the CA to know at issuance time how many CRL shards it has.

Right now, the config for number of CRL shards is in the crl-updater, and it's global - all CAs have the same number of shards.

In the CA, it made sense to configure the number of CRL shards as part of issuance.IssuerConfig (there's no global config in the CA for all issuers).

But that does mean the CA can have different numbers of shards configured for different issuers. Perhaps even more importantly, the crl-updater can disagree with the CA about how many shards there are.

One possible solution: each time the crl-updater calls UpdateCRL, it sends metadata containing the current number of shards. If the issuer's config disagrees, the CA errors. This would cause the CRL updater to error out CRL generation; we could detect that with metrics monitoring (for some new metric), or with end-to-end monitoring that CRLs are being updated promptly. Or we could cause the crl-updater to crash on this specific kind of error.

@jsha
Copy link
Contributor

jsha commented Jan 24, 2025

One problem with the current design: during the transition period between temporally sharded certificates and explicitly sharded certificates, the explicitly sharded certificates will typically show up on two CRLs: their explicit shard and their temporal shard (which is typically different).

Proposal: Use the CA's serialPrefix to separate temporally sharded and explicitly sharded certificate.

  • crl-updater gets a config field, temporallyShardedPrefixes, containing a list of single-byte prefixes that have been used to issue certificates with no explicit shard (that is, no CRLDP extension).
  • When crl-updater is building a CRL, it queries by both explicit shard and temporal shard. When processing the results for the temporal shard query, it discards any serial numbers whose single-byte prefix isn't in temporallyShardedPrefixes.
  • When enabling explicit sharding at the CA (that is, setting the Issuer.CRLShards config to nonzero for all issuers), also change the CA's serialPrefix value to some previously-unused number.

This has some nice advantages. It allows us to avoid duplicating entries, even across different shards. Conceptually it allows a cleaner separation of certificates: they are either temporally or explicitly sharded, based on how they were issued. Operationally it makes it easy to see whether a given serial is temporally sharded or explicitly sharded.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants