Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Initial version of Trillian Kubernetes application #302

Closed
wants to merge 40 commits into from

Conversation

RJPercival
Copy link

I've adapted the Kubernetes manifests from https://github.com/google/trillian/tree/master/examples/deployment/kubernetes to meet the GCP Marketplace requirements. Please take a look and let me know what you think. The readme in particular I think needs more work, and make app/verify seems to have a problem deleting the resources at the end. However, I've tested the deployed app and it appears to work correctly according to our integration tests.

@RJPercival
Copy link
Author

@khajduczenia: Do you have any initial thoughts?

k8s/trillian/Makefile Outdated Show resolved Hide resolved
k8s/trillian/Makefile Outdated Show resolved Hide resolved
@@ -0,0 +1,15 @@
apiVersion: v1
kind: List
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the benefit of using the "kind: List" in the resources definitions? I am afraid that without listing the List type explicitly in the Application resource definition, the resource will not be properly removed when the application is removed from a K8s cluster.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It makes it possible to use YAML anchors to deduplicate labels. They aren't actually used in this file, but they're used in most of the others, so I've used a List here for consistency. Anchors only work within a document, so using the "---" separator instead of List breaks the anchors.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't add List to the app resource definition because:

  1. the set_ownership.py script crashes (because it doesn't have metadata, see Fix crash in set_ownership.py when resource has no metadata marketplace-k8s-app-tools#293)
  2. the deployer fails with Error from server (Forbidden): lists is forbidden: User "system:serviceaccount:apptest-2e3bcar0:apptest-2e3bcar0-deployer-sa" cannot list lists at the cluster scope

The problem seems to be the assumption that the List exists as a resource in Kubernetes. Instead, I think the Kubernetes server discards the List once it has processed all of the contained resources, so it doesn't need to be annotated with ownership info or deleted at the end.

I think I'll just duplicate the labels since it appears the marketplace tooling isn't able to handle List properly.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lists removed.

k8s/trillian/manifest/etcd.yaml Outdated Show resolved Hide resolved
app.kubernetes.io/name: "$NAME"
app.kubernetes.io/component: logserver
annotations:
cloud.google.com/load-balancer-type: "Internal"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This type of Service will not be portable across other K8s implementations than GKE. Are there any other requirements limiting the portability of the application to GKE?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't believe so, no. I think this is still compatible with other K8s implementations, you just won't get this feature (which is a nice-to-have, rather than required to use Trillian).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you are right - if this is not required for Trillian, we are fine with having this annotation here.

minimum: 1
IMAGE_ETCD_OPERATOR:
type: string
default: quay.io/coreos/etcd-operator:v0.9.1
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's discuss the option to republish the image on GCR.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this be republished by the click-to-deploy team, or would you prefer the Trillian team to republish this?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually the easiest option would be to republish the image just as we do it in the Makefile, as for ubuntu image in Elasticsearch app.

k8s/trillian/manifest/etcd.yaml Outdated Show resolved Hide resolved
k8s/trillian/schema.yaml Show resolved Hide resolved
title: Etcd cluster size
default: 3
minimum: 3
ETCD_VERSION:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we consider having a regex for this field?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added one to restrict ETCD_VERSION to 3.x.x, since anything other than v3 is likely to be incompatible.

@@ -0,0 +1,176 @@
# Overview
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please also add sections about 1) backup & restore, 2) upgrades and 3) uninstallation?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Working on adding these sections now.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

.build/trillian/db_server: .build/var/REGISTRY \
.build/var/TAG \
| .build/trillian
docker pull gcr.io/trillian-opensource-ci/db_server:$(TAG)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please take a look at the versioning of images at gcr.io/trillian-opensource-ci? It looks like we have there tags matching the commits, not semantic versions.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have both - there's a tag for every commit to master, as well as a tag for every released version.

@wgrzelak
Copy link
Contributor

@RJPercival could you addressed the comments?

@RJPercival
Copy link
Author

@wgrzelak: Comments addressed.

Rob Percival added 19 commits February 25, 2019 10:11
No need to pull an image if it is already cached. This is only helpful if
the tag being used is "latest", but we shouldn't recommend that.
Should ensure that the cluster grows to accommodate Trillian regardless
of the number of replicas chosen (within reason).
Eliminates the need to communicate a GCP service account's private key
to the logserver and logsigner containers. Metrics can still be obtained
by setting up a Prometheus server and configuring it to monitor the
/metrics endpoints.
A machine with more CPUs than default is required to meet the requirements
of some pods. A minimum of 3 nodes is required.
Previously, there was a mix of braces and not-braces. This makes it
consistent.
This can be overridden using the TAG parameter.
The recommended cluster size is 3-7, but the FAQ gives failure tolerance
information up to 9, so allow that large a cluster (see
https://coreos.com/etcd/docs/3.2.13/faq.html#deployment).
A pool of 3 n1-standard-2 nodes are required to host Trillian with the
default number of replicas.
The etcd-operator only needs to manage an Etcd cluster for the namespace
used by Trillian.
Limit the Etcd version to 3.x.x, as anything other than v3 is likely to
be incompatible.
Marketplace tools do not work properly with List, so removing it.
Rob Percival added 20 commits February 25, 2019 11:37
This flag has been removed from the binary.
Makes it more resilient to etcd pods restarting.
This contains things like the MySQL root password, which is still useful
after uninstallation because the MySQL database is preserved on a
persistent volume that isn't deleted by uninstallation. This makes it
possible to re-install Trillian without losing access to the database.
The deployer does not wait for etcd-operator to install the CRD for
EtcdCluster before trying to use it, and so fails. Unfortunately, by
removing EtcdCluster from componentKinds, ownership information will not
be setup on the EtcdCluster and so it won't be uninstalled along with
Trillian.

GoogleCloudPlatform/marketplace-k8s-app-tools#303
tracks a feature request for supporting Kubernetes operators that install
a CRD.
…odes

This minimises the number that are lost when a node goes down, improving
cluster resilience.
It was missing a "v" prefix.
The documentation never mentions how to create an expanded manifest file,
so it shouldn't rely on it. Just direct users to use `kubectl delete`
instead.
Otherwise, a single node being restarted (e.g. during a Kubernetes upgrade)
can kill the cluster via loss of quorum.
The etcd-operator creates some services on which it isn't possible to set
labels, so an extra command is required to delete them.
@RJPercival
Copy link
Author

It looks like there's still a problem with using etcd-operator, in that the first installation on a fresh cluster will fail because the etcd-operator CRD hasn't been installed at the time the deployer is running. A re-installation works, because the CRD is installed at that point. Presumably my only options are to either customise the deployer or wait for GoogleCloudPlatform/marketplace-k8s-app-tools#303 to be addressed?

@RJPercival
Copy link
Author

RJPercival commented Feb 27, 2019

It looks like something has gone wrong with some GitHub triggers?

@huyhg
Copy link
Contributor

huyhg commented Feb 28, 2019

/gcbrun

@RJPercival
Copy link
Author

I've fixed the missing db_server image tag that caused the Cloud Build to fail.

@wgrzelak wgrzelak added the kind/k8s Related to the Kubernetes application label Apr 11, 2019
@RJPercival
Copy link
Author

I'm leaving Google tomorrow so I think this is going to be abandoned.

@RJPercival RJPercival closed this Jun 26, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla: yes kind/k8s Related to the Kubernetes application
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants